Bug 85976 - [RFE] Add a "remove duplicate records" command
Summary: [RFE] Add a "remove duplicate records" command
Status: ASSIGNED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc (show other bugs)
Version:
(earliest affected)
Inherited From OOo
Hardware: All All
: high enhancement
Assignee: Sahil Gautam
URL:
Whiteboard:
Keywords: difficultyInteresting, easyHack, skillCpp, topicDebug
: 109519 124758 144744 159980 (view as bug list)
Depends on:
Blocks: Calc-Enhancements Data-Filter
  Show dependency treegraph
 
Reported: 2014-11-06 17:31 UTC by Mr. Bugz
Modified: 2024-03-01 09:16 UTC (History)
31 users (show)

See Also:
Crash report or crash signature:


Attachments
Remove Duplicates button in Data Ribbon in Excel 2016 (171.06 KB, image/png)
2018-10-22 13:03 UTC, Pedro
Details
Only Office also acquired the Remove Duplicates functionality (36.12 KB, image/png)
2022-01-10 15:58 UTC, Pedro
Details
Remove Duplicates in Google Sheets (1.15 MB, image/png)
2022-01-10 16:08 UTC, Pedro
Details
Why remove the whole row? (102.34 KB, image/png)
2023-01-14 19:52 UTC, gmolleda
Details
WPS Worksheets Manage Duplicates menu (50.50 KB, image/png)
2023-02-23 15:54 UTC, Pedro
Details
Dialog (33.66 KB, image/png)
2023-02-23 15:56 UTC, Pedro
Details
Highlight values (25.63 KB, image/png)
2023-02-23 15:56 UTC, Pedro
Details
Fetch unique values (30.61 KB, image/png)
2023-02-23 16:00 UTC, Pedro
Details
License changed to MPL2.0 (50.65 KB, image/png)
2023-03-08 10:06 UTC, Pedro
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Mr. Bugz 2014-11-06 17:31:54 UTC
Excel has a feature where you can remove duplicates with a single click. This would be very helpful to have in LibreOffice as well.
Comment 1 Joel Madero 2014-11-06 18:00:34 UTC Comment hidden (obsolete)
Comment 2 libreoffice 2014-11-24 17:54:39 UTC
Yes, that is what I was looking for. But I still think it should be an optional button/dialog available on the main toolbar without having to go through all that.

Like this:
https://thinkandbegin.files.wordpress.com/2012/05/remove-dup-2.png
Comment 3 m_a_riosv 2015-07-29 22:02:51 UTC
*** Bug 92990 has been marked as a duplicate of this bug. ***
Comment 4 Erelyn Alves 2015-07-31 15:21:38 UTC
Below also another program that easily performs the procedure.

WPS Spreadsheets
http://pt-br.tinypic.com/r/2h663gj/8
Comment 5 Mike Kaganski 2018-09-07 08:20:10 UTC
Additional detail: in Excel, the function *removes* duplicates in-place, while filter in LO only allows hiding in-place, or removing by copying to a different location. So actually, LO lacks the functionality.
Comment 6 Roman Kuznetsov 2018-09-07 08:22:39 UTC
(In reply to Joel Madero from comment #1)
> Like this:
> http://milospjanic.blogspot.com/2011/10/how-to-remove-duplicates-in-
> libreoffice.html ?

IMHO, in Excel it made more clear for users. Different dialogue with very simple options. And in result in Excel we have only rows without any duplicates and without copying of result to another range...
Comment 7 Cor Nouws 2018-10-18 14:42:37 UTC
*** Bug 109519 has been marked as a duplicate of this bug. ***
Comment 8 Pedro 2018-10-22 13:01:28 UTC
I would also like to request that "Remove Duplicates" is added to Calc.
In Excel, removing duplicates from a column is a one click action. In Calc I have to go through multiple steps to obtain this. It's a waste of time if it's encessary to do it in multiple spreadsheets, it's a waste of time to make a macro and other spreadsheets offer this by default.

There is an extension that adds this function to Calc so I propose that this is added to the default installation, that a .uno command is created for this button to be integrated in the Notebookbars as well.

https://extensions.libreoffice.org/extensions/remove-duplicates
Comment 9 Pedro 2018-10-22 13:03:37 UTC
Created attachment 145904 [details]
Remove Duplicates button in Data Ribbon in Excel 2016
Comment 10 Pedro 2018-10-22 13:08:28 UTC Comment hidden (obsolete)
Comment 11 Pedro 2018-10-30 18:06:45 UTC Comment hidden (no-value)
Comment 12 Thomas Lendo 2018-11-06 08:45:42 UTC
*** Bug 73712 has been marked as a duplicate of this bug. ***
Comment 13 Pedro 2018-12-18 12:25:57 UTC Comment hidden (obsolete)
Comment 14 V Stuart Foote 2019-04-15 19:37:25 UTC
*** Bug 124758 has been marked as a duplicate of this bug. ***
Comment 15 Roman Kuznetsov 2019-11-09 22:42:18 UTC
(In reply to Pedro from comment #13)
> Well, Muhammet Kara after checking the extension confirmed that it is
> written in Basic, meaning that it is a macro.
> 
> If someone would translate it to C++, it would be awesome.

Now there is faster extension https://extensions.libreoffice.org/extensions/remove-duplicates-fast for it
Comment 16 Pedro 2019-11-11 10:03:13 UTC
Couldn't this extension be added by default to LibreOffice?
Comment 17 Pedro 2019-11-11 10:04:29 UTC Comment hidden (no-value)
Comment 18 Mark Leo 2020-05-17 18:37:52 UTC Comment hidden (spam)
Comment 19 Heiko Tietze 2020-06-05 11:34:20 UTC
Since we have a working solution with the standard filter dialog it should be easy to add a new UNO command and run the filter procedure with a predefined setting. 

Code pointer

sc/source/ui/view/tabvwshc.cxx #308
sc/source/ui/dbgui/sfiltdlg.cxx

Would understand this as a medium to interesting difficulty.
Comment 20 Pedro 2020-06-06 09:58:02 UTC
Just to mention: this is not just filtering. It also deletes the cells with duplicate values. But I guess that is one minor thing to add.
Comment 21 Andrea Winslet 2020-06-22 06:05:32 UTC Comment hidden (spam)
Comment 22 stutisharma900 2020-09-10 02:57:40 UTC Comment hidden (spam)
Comment 23 Pedro 2020-11-23 10:24:01 UTC Comment hidden (obsolete)
Comment 25 Buovjaga 2020-11-23 11:26:08 UTC Comment hidden (obsolete)
Comment 26 Pedro 2020-11-23 13:09:53 UTC Comment hidden (obsolete)
Comment 27 Evgeniy 2021-02-28 18:28:14 UTC Comment hidden (me-too)
Comment 28 breetlee9211 2021-03-19 07:39:15 UTC Comment hidden (spam)
Comment 29 Richard Swayar 2021-07-05 09:12:34 UTC Comment hidden (spam)
Comment 30 Pedro 2022-01-10 15:58:41 UTC
Created attachment 177433 [details]
Only Office also acquired the Remove Duplicates functionality

Only Office also got the Remove Duplicates functionality by default.
Comment 31 Pedro 2022-01-10 16:08:54 UTC
Created attachment 177434 [details]
Remove Duplicates in Google Sheets

This feature is also present in Google Sheets.
Comment 32 Heiko Tietze 2022-01-11 08:44:57 UTC
*** Bug 144744 has been marked as a duplicate of this bug. ***
Comment 33 Mike Kaganski 2022-01-11 09:00:06 UTC
Note that any feature request having a working extension with compatible license is already an easy hack. Just use its source code as the template that provides the required logic - it uses UNO commands, which may be easily converted to a C++ code, and assign a new UNO command to that new function.

(Given that Remove Duplicates Fast is based on Remove Duplicates (https://github.com/ACTom/lo-extension-removeduplicates) which is GPLv3, which doesn't allow to use its code in LO directly because we need an MPL-compatible license, interested parties may ask the author for changed license in a github issue.)
Comment 34 Pedro 2022-01-11 12:09:10 UTC
What MPL specific license do you recommend?
Comment 35 Pedro 2022-01-11 12:13:59 UTC
Asked the dev in a new issue.
Comment 36 Pedro 2022-01-11 12:28:55 UTC
The developer already changed the license to MPL 2.0. Hopefully this allows someone to pick this up. :D
Comment 37 gmolleda 2023-01-14 19:37:54 UTC
Important, also the values of the duplicate selected cells should be deleted and not the entire rows. Moving the remaining data up to leave the blank spaces below.
Comment 38 gmolleda 2023-01-14 19:52:54 UTC
Created attachment 184664 [details]
Why remove the whole row?

The filter hides the entire 8th row, including the letter f (view image attached).
The correct behavior would be to remove values 1 and 4 from row 8. Do not filter by hiding row 8 but move values 2 and 1 from row 9 up one cell and leave B9 and C9 empty.
I know this was a standard filter, but the button that they put specifically to remove duplicates should not be a filter, but actually remove the duplicate values within the selection and not the entire row.
Comment 39 Pedro 2023-02-23 15:54:17 UTC
Created attachment 185548 [details]
WPS Worksheets Manage Duplicates menu

WPS Worksheets is vastly superior in managing duplicates to any other office suite.
It not only allows for removal, but also highlighting and also fetching unique values or highlight them.
Comment 40 Pedro 2023-02-23 15:56:01 UTC
Created attachment 185549 [details]
Dialog

It allows selecting duplicates in selected range, in two ranges within the sheet, in different sheets in the worksheet and in different worksheets!
Comment 41 Pedro 2023-02-23 15:56:43 UTC
Created attachment 185550 [details]
Highlight values
Comment 42 Pedro 2023-02-23 16:00:15 UTC
Created attachment 185551 [details]
Fetch unique values

This is a feature already available in MSO since at least 2007 and it's a feature that's been actively worked on in other Office suites as well, to a point where it is very well designed.

All in all, this is a feature that is sorely lacking in Calc for several years now and with the evolution of this feature visible in OnlyOffice and WPS Worksheets (available in Linux as well), and with the RemoveDuplicates extension having compatible license with LibO it's hard to understand why this has been overlooked for so long now.
Comment 43 Eike Rathke 2023-02-28 16:18:46 UTC
(In reply to Pedro from comment #42)
> with the RemoveDuplicates
> extension having compatible license with LibO it's hard to understand why
> this has been overlooked for so long now.
0. both extensions
https://extensions.libreoffice.org/en/extensions/show/remove-duplicates
https://extensions.libreoffice.org/en/extensions/show/remove-duplicates-fast
are licensed GPL (whatever version) and thus are *not* compatible with LibreOffice licensing.

1. even if they were, that tells nothing about the source code whether it would fit into LO core code (or even be in C++ that it could).

2. if those extensions fulfil the requirements, then why not use them.
Comment 44 Heiko Tietze 2023-03-01 08:24:21 UTC
If we realize bug 149933 searching for duplicates could be one option too.
Comment 45 Pedro 2023-03-08 10:05:28 UTC
(In reply to Eike Rathke from comment #43)
> (In reply to Pedro from comment #42)
> > with the RemoveDuplicates
> > extension having compatible license with LibO it's hard to understand why
> > this has been overlooked for so long now.
> 0. both extensions
> https://extensions.libreoffice.org/en/extensions/show/remove-duplicates
> https://extensions.libreoffice.org/en/extensions/show/remove-duplicates-fast
> are licensed GPL (whatever version) and thus are *not* compatible with
> LibreOffice licensing.
> 
> 1. even if they were, that tells nothing about the source code whether it
> would fit into LO core code (or even be in C++ that it could).
> 
> 2. if those extensions fulfil the requirements, then why not use them.

Eike Rathke, the developer changed the license on his github repo to MPL 2.0.
Comment 46 Pedro 2023-03-08 10:06:05 UTC
Created attachment 185837 [details]
License changed to MPL2.0
Comment 47 Pedro 2023-03-08 10:06:51 UTC
The Fast extension was an improvement done by Mike Kaganski and Kompilainnen I believe. They did not change the license on their extension yet.
Comment 48 Eyal Rozenberg 2023-03-10 19:39:26 UTC
Am definitely missing this in Calc right now.
Comment 49 Mike Kaganski 2023-10-15 13:27:44 UTC
(In reply to Pedro from comment #47)
> The Fast extension was an improvement done by Mike Kaganski and Kompilainnen
> I believe. They did not change the license on their extension yet.

Since our extension was based on the previous one, our license was necessarily the same. Since the old extension's license is now MPL 2.0, I am glad to re-license my work under MPL 2.0.

Roman's turn.
Comment 50 Eyal Rozenberg 2023-10-15 17:16:16 UTC
I'm assuming this bug is about adding the command. If we also want to simplify/alter the filtering dialog - that should be a separate bug.

(If I'm wrong - please change the title, clarify the bug's scope in a comment, and refer to the comment in the title)
Comment 51 Mike Kaganski 2023-10-16 15:32:27 UTC
(In reply to Eyal Rozenberg from comment #50)

The original request was to implement a feature *like Excel's "remove duplicates"*. Filters are orthogonal to that, they never remove any duplicates, only hide or do a partial copy.
Comment 52 JosephGill 2023-11-29 13:17:47 UTC Comment hidden (spam)
Comment 53 Heiko Tietze 2024-01-15 16:49:59 UTC
Should the function run based on values or formulas? In other words is =1+1 the same as =2?
Comment 54 gmolleda 2024-01-15 17:03:24 UTC
(In reply to Heiko Tietze from comment #53)
> Should the function run based on values or formulas? In other words is =1+1
> the same as =2?

Values: =2 or =1+1 and ="b" or =char(98) are the same. Only first cell must remain.
Comment 55 Rafael Lima 2024-01-15 17:44:09 UTC
(In reply to Heiko Tietze from comment #53)
> Should the function run based on values or formulas? In other words is =1+1
> the same as =2?

FYI Excel does consider =2 and =1+1 as duplicates. It seems Excel only considers the cell value that is being shown, regardless of the formula.

TBH I find it a bit intrusive. But I believe many users will want this feature to behave similarly to what Excel does.
Comment 56 Pedro 2024-01-20 09:34:24 UTC
The objective of Remove Duplicates is to remove duplicates of values, not formulas or calculations. There's a reason this is in the Data tab of excel and not in Formulas.

Initially, keeping the scope focused on having a Remove Duplicates that simply removes duplicates of values is the most important.
If Sahil Gautam is motivated to keep working on this afterwards then maybe this can be expanded upon in the future much like WPS Office did in their Worksheets module (their Excel equivalent). WPS Office has the implementation with more functionalities of Remove Duplicates.
Comment 57 Heiko Tietze 2024-01-22 08:58:41 UTC
(In reply to Pedro from comment #56)
> The objective of Remove Duplicates is to remove duplicates of values, not
> formulas or calculations.

Do you argue that =1+2 is not the same as =2+1? Or =1+3 != =2+2. Or ="b" != =char(98).

And I'm against a dialog here to fine-tune the operation. Makes the workflow heavy.
Comment 58 gmolleda 2024-01-22 10:00:47 UTC
(In reply to Heiko Tietze from comment #57)
> (In reply to Pedro from comment #56)
> > The objective of Remove Duplicates is to remove duplicates of values, not
> > formulas or calculations.
> 
> Do you argue that =1+2 is not the same as =2+1? Or =1+3 != =2+2. Or ="b" !=
> =char(98).
> 
> And I'm against a dialog here to fine-tune the operation. Makes the workflow
> heavy.

Being able to put a dialog where you can mark if you want to look at formulas instead of values: I think that in case of looking at formulas and not values, the formulas that are repeated changing only the relative references, should be considered the same. I think that =A3*2 in row 5 should be the same as =A4*2 in row 6. Before checking if they are equal, the relative references part should be removed from the formulas (without $ before) for checking.
Comment 59 Mike Kaganski 2024-01-22 10:05:51 UTC
(In reply to Heiko Tietze from comment #57)
> And I'm against a dialog here to fine-tune the operation. Makes the workflow
> heavy.

Oh :-D LOL. You simply can't have this without a dialog. At all. The "duplicate" concept is *SO COMPLEX*, that you simply can't make all agree on your definition of it. See text import dialog for a similar complexity. Or sort.

People might want to remove duplicates based on some subset of columns (but remove all the cells in the area). They might want formulas to make the difference. They might want to treat equality of numbers with epsilon, or use "text as shown". They might want to work by rows or by columns. They might want to shift up or right.
Comment 60 Heiko Tietze 2024-01-22 10:16:53 UTC
(In reply to Mike Kaganski from comment #59)
> The "duplicate" concept is *SO COMPLEX*...
This is exactly what I mean. You cannot implement a swiss-army knife for every scenario. If the one inbuilt function is not sufficient in some _rare_ use cases, those need to be accomplished by alternative methods. But the primary workflow should be supported as easy as possible- ie. one click to remove duplicates, as the function label says.
Comment 61 Mike Kaganski 2024-01-22 10:39:21 UTC
(In reply to Heiko Tietze from comment #60)
> This is exactly what I mean. You cannot implement a swiss-army knife for
> every scenario.

Yes you can. And you need to. Just because we are the office suite, and not a tool for one single task. See how we *do* try to implement it in case of filtering, or text import, or file format support.

See how other office suite implements it. Excel provides a dialog. Google Sheets provides a dialog. People here expect a dialog. It is simply unavoidable.
Comment 62 Pedro 2024-01-22 10:47:06 UTC
> This is exactly what I mean. You cannot implement a swiss-army knife for every scenario. If the one inbuilt function is not sufficient in some _rare_ use cases, those need to be accomplished by alternative methods. But the primary workflow should be supported as easy as possible- ie. one click to remove duplicates, as the function label says.

Heiko did you even try to use this function in Excel or any other office suite before commenting?
A dialog is REQUIRED if not even for the case that you need to have a "this data has headers" check mark. The button in Excel opens two sequential dialogs. Even the Calc extension requires a dialog. 
If you don't feel this is essential because it doesn't fit your needs please don't try to cripple something that is crucial and sorely missing from Calc. 

The Excel function simplifies things by referring to Remove Duplicate Values. This does not include formulas since the formula is not a value but provides you with one.
Take into consideration that in Excel if you expand duplicate removal to multiple columns, you also need to select the columns that count for duplicates. If you select multiple columns, it will only remove rows when both selected columns have duplicates.
Comment 63 Heiko Tietze 2024-01-22 11:06:16 UTC Comment hidden (off-topic)
Comment 64 Sahil Gautam 2024-01-22 12:33:44 UTC
So We concluded on "dialog needed".
Comment 65 Rafael Lima 2024-01-22 12:36:41 UTC
(In reply to Pedro from comment #62)
> A dialog is REQUIRED if not even for the case that you need to have a "this
> data has headers" check mark. The button in Excel opens two sequential
> dialogs.

Indeed Excel opens a dialog before removing duplicates, but the dialog does not offer many options... f.i. there's nothing about how to handle cells with formulas.

I understand the need for a dialog here... but there should also be a way to simply click a button and remove duplicates without a dialog disrupting the workflow, similarly to what we have with "Sort" (which has a dialog) and "Sort ascending" (no dialog needed).
Comment 66 Mike Kaganski 2024-01-22 12:48:09 UTC
(In reply to Rafael Lima from comment #65)
> but there should also be a way to
> simply click a button and remove duplicates without a dialog disrupting the
> workflow, similarly to what we have with "Sort" (which has a dialog) and
> "Sort ascending" (no dialog needed).

No. This *might* turn out to be useful - when you have implemented the dialog, and then start seeing requests to "just make my Click the toolbar - then press Enter difficult sequence easier, because pressing Enter is do much tiresome" requests. It may be justified by the user demand - but not made pro-actively: multiplying UNO commands without the sizable demand is exactly the bloat that should be avoided, not the dialogs that user ask for.
Comment 67 Rafael Lima 2024-01-22 13:10:57 UTC
(In reply to Mike Kaganski from comment #66)
> multiplying UNO commands without the sizable demand is
> exactly the bloat that should be avoided, not the dialogs that user ask for.

AFAIK it would be possible to have both (dialog / non-dialog) functionalities with a single UNO command, depending on the arguments passed to it.

Another real use-case of removing duplicates without the dialog would be for writing macros, where all parameters of the UNO command would be provided by the macro and no dialog would be necessary.

In summary... I'm in favor of having a dialog here, but it would also be cool to have the ability to run this UNO command without showing the dialog as well.
Comment 68 Mike Kaganski 2024-01-22 13:56:11 UTC
(In reply to Rafael Lima from comment #67)
> AFAIK it would be possible to have both (dialog / non-dialog)
> functionalities with a single UNO command, depending on the arguments passed
> to it.

This is exactly why I asked for an optional argument for the UNO command, when reviewed the proposed patch.
Comment 69 Mike Kaganski 2024-03-01 09:16:20 UTC
*** Bug 159980 has been marked as a duplicate of this bug. ***