Bug 85976 - [RFE] Add a "remove duplicate records" command
Summary: [RFE] Add a "remove duplicate records" command
Status: ASSIGNED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version (earliest affected): Inherited From OOo
Hardware: All All
Importance: high enhancement
Assignee: Sahil Gautam
URL:
Whiteboard: target:25.2.0
Keywords: difficultyInteresting, easyHack, skillCpp, topicDebug
Duplicates: 109519 124758 144744 159980
Depends on:
Blocks: Calc-Enhancements Data-Filter
Reported: 2014-11-06 17:31 UTC by Mr. Bugz
Modified: 2024-11-06 19:51 UTC
CC List: 33 users

See Also:
Crash report or crash signature:


Attachments
Remove Duplicates button in Data Ribbon in Excel 2016 (171.06 KB, image/png)
2018-10-22 13:03 UTC, Pedro
Only Office also acquired the Remove Duplicates functionality (36.12 KB, image/png)
2022-01-10 15:58 UTC, Pedro
Remove Duplicates in Google Sheets (1.15 MB, image/png)
2022-01-10 16:08 UTC, Pedro
Why remove the whole row? (102.34 KB, image/png)
2023-01-14 19:52 UTC, gmolleda
WPS Worksheets Manage Duplicates menu (50.50 KB, image/png)
2023-02-23 15:54 UTC, Pedro
Dialog (33.66 KB, image/png)
2023-02-23 15:56 UTC, Pedro
Highlight values (25.63 KB, image/png)
2023-02-23 15:56 UTC, Pedro
Fetch unique values (30.61 KB, image/png)
2023-02-23 16:00 UTC, Pedro
License changed to MPL2.0 (50.65 KB, image/png)
2023-03-08 10:06 UTC, Pedro
Database ranges examples (18.95 KB, application/vnd.oasis.opendocument.spreadsheet)
2024-09-29 17:36 UTC, Regina Henschel
duplicate records dialog evolution (317.12 KB, image/png)
2024-09-29 20:06 UTC, Sahil Gautam

Description Mr. Bugz 2014-11-06 17:31:54 UTC
Excel has a feature where you can remove duplicates with a single click. This would be very helpful to have in LibreOffice as well.
Comment 1 Joel Madero 2014-11-06 18:00:34 UTC Comment hidden (obsolete)
Comment 2 libreoffice 2014-11-24 17:54:39 UTC
Yes, that is what I was looking for. But I still think it should be an optional button/dialog available on the main toolbar without having to go through all that.

Like this:
https://thinkandbegin.files.wordpress.com/2012/05/remove-dup-2.png
Comment 3 m_a_riosv 2015-07-29 22:02:51 UTC
*** Bug 92990 has been marked as a duplicate of this bug. ***
Comment 4 Erelyn Alves 2015-07-31 15:21:38 UTC
Below also another program that easily performs the procedure.

WPS Spreadsheets
http://pt-br.tinypic.com/r/2h663gj/8
Comment 5 Mike Kaganski 2018-09-07 08:20:10 UTC
Additional detail: in Excel, the function *removes* duplicates in-place, while filter in LO only allows hiding in-place, or removing by copying to a different location. So actually, LO lacks the functionality.
Comment 6 Roman Kuznetsov 2018-09-07 08:22:39 UTC
(In reply to Joel Madero from comment #1)
> Like this:
> http://milospjanic.blogspot.com/2011/10/how-to-remove-duplicates-in-
> libreoffice.html ?

IMHO, in Excel this is made clearer for users: a separate dialog with very simple options. And as a result, in Excel we are left with only the rows without any duplicates, without copying the result to another range...
Comment 7 Cor Nouws 2018-10-18 14:42:37 UTC
*** Bug 109519 has been marked as a duplicate of this bug. ***
Comment 8 Pedro 2018-10-22 13:01:28 UTC
I would also like to request that "Remove Duplicates" is added to Calc.
In Excel, removing duplicates from a column is a one-click action. In Calc I have to go through multiple steps to obtain this. It's a waste of time if it's necessary to do it in multiple spreadsheets, it's a waste of time to make a macro, and other spreadsheets offer this by default.

There is an extension that adds this function to Calc, so I propose that it is added to the default installation and that a .uno command is created so this button can be integrated in the Notebookbars as well.

https://extensions.libreoffice.org/extensions/remove-duplicates
Comment 9 Pedro 2018-10-22 13:03:37 UTC
Created attachment 145904 [details]
Remove Duplicates button in Data Ribbon in Excel 2016
Comment 10 Pedro 2018-10-22 13:08:28 UTC Comment hidden (obsolete)
Comment 11 Pedro 2018-10-30 18:06:45 UTC Comment hidden (no-value)
Comment 12 Thomas Lendo 2018-11-06 08:45:42 UTC
*** Bug 73712 has been marked as a duplicate of this bug. ***
Comment 13 Pedro 2018-12-18 12:25:57 UTC Comment hidden (obsolete)
Comment 14 V Stuart Foote 2019-04-15 19:37:25 UTC
*** Bug 124758 has been marked as a duplicate of this bug. ***
Comment 15 Roman Kuznetsov 2019-11-09 22:42:18 UTC
(In reply to Pedro from comment #13)
> Well, Muhammet Kara after checking the extension confirmed that it is
> written in Basic, meaning that it is a macro.
> 
> If someone would translate it to C++, it would be awesome.

Now there is a faster extension for it: https://extensions.libreoffice.org/extensions/remove-duplicates-fast
Comment 16 Pedro 2019-11-11 10:03:13 UTC
Couldn't this extension be added by default to LibreOffice?
Comment 17 Pedro 2019-11-11 10:04:29 UTC Comment hidden (no-value)
Comment 18 Mark Leo 2020-05-17 18:37:52 UTC Comment hidden (spam)
Comment 19 Heiko Tietze 2020-06-05 11:34:20 UTC
Since we have a working solution with the standard filter dialog it should be easy to add a new UNO command and run the filter procedure with a predefined setting. 

Code pointer

sc/source/ui/view/tabvwshc.cxx #308
sc/source/ui/dbgui/sfiltdlg.cxx

Would understand this as a medium to interesting difficulty.
Comment 20 Pedro 2020-06-06 09:58:02 UTC
Just to mention: this is not just filtering. It also deletes the cells with duplicate values. But I guess that is one minor thing to add.
Comment 21 Andrea Winslet 2020-06-22 06:05:32 UTC Comment hidden (spam)
Comment 22 stutisharma900 2020-09-10 02:57:40 UTC Comment hidden (spam)
Comment 23 Pedro 2020-11-23 10:24:01 UTC Comment hidden (obsolete)
Comment 25 Buovjaga 2020-11-23 11:26:08 UTC Comment hidden (obsolete)
Comment 26 Pedro 2020-11-23 13:09:53 UTC Comment hidden (obsolete)
Comment 27 Evgeniy 2021-02-28 18:28:14 UTC Comment hidden (me-too)
Comment 28 breetlee9211 2021-03-19 07:39:15 UTC Comment hidden (spam)
Comment 29 Richard Swayar 2021-07-05 09:12:34 UTC Comment hidden (spam)
Comment 30 Pedro 2022-01-10 15:58:41 UTC
Created attachment 177433 [details]
Only Office also acquired the Remove Duplicates functionality

Only Office also got the Remove Duplicates functionality by default.
Comment 31 Pedro 2022-01-10 16:08:54 UTC
Created attachment 177434 [details]
Remove Duplicates in Google Sheets

This feature is also present in Google Sheets.
Comment 32 Heiko Tietze 2022-01-11 08:44:57 UTC
*** Bug 144744 has been marked as a duplicate of this bug. ***
Comment 33 Mike Kaganski 2022-01-11 09:00:06 UTC
Note that any feature request having a working extension with a compatible license is already an easy hack. Just use its source code as the template that provides the required logic (it uses UNO commands, which may be easily converted to C++ code) and assign a new UNO command to that new function.

(Given that Remove Duplicates Fast is based on Remove Duplicates (https://github.com/ACTom/lo-extension-removeduplicates), which is GPLv3 and so doesn't allow using its code in LO directly because we need an MPL-compatible license, interested parties may ask the author for a license change in a github issue.)
Comment 34 Pedro 2022-01-11 12:09:10 UTC
What MPL specific license do you recommend?
Comment 35 Pedro 2022-01-11 12:13:59 UTC
Asked the dev in a new issue.
Comment 36 Pedro 2022-01-11 12:28:55 UTC
The developer already changed the license to MPL 2.0. Hopefully this allows someone to pick this up. :D
Comment 37 gmolleda 2023-01-14 19:37:54 UTC
Important: only the values of the duplicate selected cells should be deleted, not the entire rows, moving the remaining data up and leaving the blank spaces below.
Comment 38 gmolleda 2023-01-14 19:52:54 UTC
Created attachment 184664 [details]
Why remove the whole row?

The filter hides the entire 8th row, including the letter f (view image attached).
The correct behavior would be to remove values 1 and 4 from row 8. Do not filter by hiding row 8 but move values 2 and 1 from row 9 up one cell and leave B9 and C9 empty.
I know this was a standard filter, but the button that they put specifically to remove duplicates should not be a filter, but actually remove the duplicate values within the selection and not the entire row.
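
A minimal sketch of the behaviour requested above, assuming the sheet is a plain list of rows and only the selected columns take part. The function name `remove_duplicates_shift_up` and the sample data are made up for illustration; this is neither the filter code nor the eventual Calc implementation.

```python
def remove_duplicates_shift_up(rows, selected_cols):
    """Remove duplicate value groups inside the selected columns only.

    Cells outside the selection stay where they are; the surviving
    values move up and the freed cells at the bottom become empty (None).
    """
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[c] for c in selected_cols)
        if key not in seen:  # keep only the first occurrence
            seen.add(key)
            kept.append(key)

    result = [list(row) for row in rows]
    empty = (None,) * len(selected_cols)
    for r, row in enumerate(result):
        values = kept[r] if r < len(kept) else empty
        for c, value in zip(selected_cols, values):
            row[c] = value
    return result


# Hypothetical data: column 0 is outside the selection, columns 1-2 are selected.
sheet = [["e", 1, 4], ["f", 1, 4], ["g", 2, 1]]
cleaned = remove_duplicates_shift_up(sheet, [1, 2])
```

Here the letter in the first column is never hidden or removed; only the duplicated pair disappears and the pair below it moves up.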
Comment 39 Pedro 2023-02-23 15:54:17 UTC
Created attachment 185548 [details]
WPS Worksheets Manage Duplicates menu

WPS Worksheets is vastly superior in managing duplicates to any other office suite.
It not only allows for removal, but also for highlighting duplicates and for fetching or highlighting unique values.
Comment 40 Pedro 2023-02-23 15:56:01 UTC
Created attachment 185549 [details]
Dialog

It allows selecting duplicates in selected range, in two ranges within the sheet, in different sheets in the worksheet and in different worksheets!
Comment 41 Pedro 2023-02-23 15:56:43 UTC
Created attachment 185550 [details]
Highlight values
Comment 42 Pedro 2023-02-23 16:00:15 UTC
Created attachment 185551 [details]
Fetch unique values

This is a feature already available in MSO since at least 2007 and it's a feature that's been actively worked on in other Office suites as well, to a point where it is very well designed.

All in all, this is a feature that is sorely lacking in Calc for several years now and with the evolution of this feature visible in OnlyOffice and WPS Worksheets (available in Linux as well), and with the RemoveDuplicates extension having compatible license with LibO it's hard to understand why this has been overlooked for so long now.
Comment 43 Eike Rathke 2023-02-28 16:18:46 UTC
(In reply to Pedro from comment #42)
> with the RemoveDuplicates
> extension having compatible license with LibO it's hard to understand why
> this has been overlooked for so long now.
0. both extensions
https://extensions.libreoffice.org/en/extensions/show/remove-duplicates
https://extensions.libreoffice.org/en/extensions/show/remove-duplicates-fast
are licensed GPL (whatever version) and thus are *not* compatible with LibreOffice licensing.

1. even if they were, that tells nothing about the source code whether it would fit into LO core code (or even be in C++ that it could).

2. if those extensions fulfil the requirements, then why not use them.
Comment 44 Heiko Tietze 2023-03-01 08:24:21 UTC
If we realize bug 149933 searching for duplicates could be one option too.
Comment 45 Pedro 2023-03-08 10:05:28 UTC
(In reply to Eike Rathke from comment #43)
> (In reply to Pedro from comment #42)
> > with the RemoveDuplicates
> > extension having compatible license with LibO it's hard to understand why
> > this has been overlooked for so long now.
> 0. both extensions
> https://extensions.libreoffice.org/en/extensions/show/remove-duplicates
> https://extensions.libreoffice.org/en/extensions/show/remove-duplicates-fast
> are licensed GPL (whatever version) and thus are *not* compatible with
> LibreOffice licensing.
> 
> 1. even if they were, that tells nothing about the source code whether it
> would fit into LO core code (or even be in C++ that it could).
> 
> 2. if those extensions fulfil the requirements, then why not use them.

Eike Rathke, the developer changed the license on his github repo to MPL 2.0.
Comment 46 Pedro 2023-03-08 10:06:05 UTC
Created attachment 185837 [details]
License changed to MPL2.0
Comment 47 Pedro 2023-03-08 10:06:51 UTC
The Fast extension was an improvement done by Mike Kaganski and Kompilainnen I believe. They did not change the license on their extension yet.
Comment 48 Eyal Rozenberg 2023-03-10 19:39:26 UTC
Am definitely missing this in Calc right now.
Comment 49 Mike Kaganski 2023-10-15 13:27:44 UTC
(In reply to Pedro from comment #47)
> The Fast extension was an improvement done by Mike Kaganski and Kompilainnen
> I believe. They did not change the license on their extension yet.

Since our extension was based on the previous one, our license was necessarily the same. Since the old extension's license is now MPL 2.0, I am glad to re-license my work under MPL 2.0.

Roman's turn.
Comment 50 Eyal Rozenberg 2023-10-15 17:16:16 UTC
I'm assuming this bug is about adding the command. If we also want to simplify/alter the filtering dialog - that should be a separate bug.

(If I'm wrong - please change the title, clarify the bug's scope in a comment, and refer to the comment in the title)
Comment 51 Mike Kaganski 2023-10-16 15:32:27 UTC
(In reply to Eyal Rozenberg from comment #50)

The original request was to implement a feature *like Excel's "remove duplicates"*. Filters are orthogonal to that, they never remove any duplicates, only hide or do a partial copy.
Comment 52 JosephGill 2023-11-29 13:17:47 UTC Comment hidden (spam)
Comment 53 Heiko Tietze 2024-01-15 16:49:59 UTC
Should the function run based on values or formulas? In other words is =1+1 the same as =2?
Comment 54 gmolleda 2024-01-15 17:03:24 UTC
(In reply to Heiko Tietze from comment #53)
> Should the function run based on values or formulas? In other words is =1+1
> the same as =2?

Values: =2 or =1+1 and ="b" or =char(98) are the same. Only first cell must remain.
Comment 55 Rafael Lima 2024-01-15 17:44:09 UTC
(In reply to Heiko Tietze from comment #53)
> Should the function run based on values or formulas? In other words is =1+1
> the same as =2?

FYI Excel does consider =2 and =1+1 as duplicates. It seems Excel only considers the cell value that is being shown, regardless of the formula.

TBH I find it a bit intrusive. But I believe many users will want this feature to behave similarly to what Excel does.
Comment 56 Pedro 2024-01-20 09:34:24 UTC
The objective of Remove Duplicates is to remove duplicates of values, not formulas or calculations. There's a reason this is in the Data tab of excel and not in Formulas.

Initially, keeping the scope focused on having a Remove Duplicates that simply removes duplicates of values is the most important.
If Sahil Gautam is motivated to keep working on this afterwards then maybe this can be expanded upon in the future much like WPS Office did in their Worksheets module (their Excel equivalent). WPS Office has the implementation with more functionalities of Remove Duplicates.
Comment 57 Heiko Tietze 2024-01-22 08:58:41 UTC
(In reply to Pedro from comment #56)
> The objective of Remove Duplicates is to remove duplicates of values, not
> formulas or calculations.

Do you argue that =1+2 is not the same as =2+1? Or =1+3 != =2+2. Or ="b" != =char(98).

And I'm against a dialog here to fine-tune the operation. Makes the workflow heavy.
Comment 58 gmolleda 2024-01-22 10:00:47 UTC
(In reply to Heiko Tietze from comment #57)
> (In reply to Pedro from comment #56)
> > The objective of Remove Duplicates is to remove duplicates of values, not
> > formulas or calculations.
> 
> Do you argue that =1+2 is not the same as =2+1? Or =1+3 != =2+2. Or ="b" !=
> =char(98).
> 
> And I'm against a dialog here to fine-tune the operation. Makes the workflow
> heavy.

If there is a dialog where you can mark that you want to compare formulas instead of values: I think that when comparing formulas rather than values, formulas that are repeated with only the relative references changed should be considered the same. I think that =A3*2 in row 5 should be the same as =A4*2 in row 6. Before checking whether they are equal, the relative parts of the references (those without a preceding $) should be removed from the formulas.
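
One way to make that comparison concrete is sketched below. Instead of removing the relative parts, it rewrites each reference as an offset from the cell holding the formula (R1C1-like notation), which is a different technique with the same intent. The function `normalize` and the regular expression are illustrative assumptions; the pattern is deliberately naive and would also match function names such as LOG10.

```python
import re

# Naive A1-style reference pattern: optional $ before column and/or row.
REF = re.compile(r"(\$?)([A-Z]+)(\$?)(\d+)")


def col_to_index(letters):
    """Convert column letters to a 1-based index (A -> 1, AA -> 27)."""
    n = 0
    for ch in letters:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def normalize(formula, row, col):
    """Rewrite references relative to the cell at (row, col), both 1-based."""
    def repl(match):
        col_abs, letters, row_abs, digits = match.groups()
        c, r = col_to_index(letters), int(digits)
        row_part = f"R{r}" if row_abs else f"R[{r - row}]"
        col_part = f"C{c}" if col_abs else f"C[{c - col}]"
        return row_part + col_part
    return REF.sub(repl, formula)


# =A3*2 in B5 and =A4*2 in B6 both point two rows up and one column left.
same = normalize("=A3*2", 5, 2) == normalize("=A4*2", 6, 2)
```

With this normalization, formulas filled down a column compare equal, while absolute references (with $) only compare equal when they point at the same cell.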
Comment 59 Mike Kaganski 2024-01-22 10:05:51 UTC
(In reply to Heiko Tietze from comment #57)
> And I'm against a dialog here to fine-tune the operation. Makes the workflow
> heavy.

Oh :-D LOL. You simply can't have this without a dialog. At all. The "duplicate" concept is *SO COMPLEX*, that you simply can't make all agree on your definition of it. See text import dialog for a similar complexity. Or sort.

People might want to remove duplicates based on some subset of columns (but remove all the cells in the area). They might want formulas to make the difference. They might want to treat equality of numbers with epsilon, or use "text as shown". They might want to work by rows or by columns. They might want to shift up or right.
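
To illustrate how quickly those options multiply, here is a rough sketch with three of them as parameters: which columns count, values versus formulas, and a numeric tolerance. The names `Cell`, `cells_equal` and `dedupe` are hypothetical, and the pairwise loop is chosen for clarity, not speed.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    formula: str   # text as entered, e.g. "=1+1"
    value: object  # computed result, e.g. 2


def cells_equal(a, b, by_formula=False, eps=0.0):
    """Compare two cells by formula text, or by value with tolerance eps."""
    if by_formula:
        return a.formula == b.formula
    if isinstance(a.value, (int, float)) and isinstance(b.value, (int, float)):
        return math.isclose(a.value, b.value, rel_tol=0.0, abs_tol=eps)
    return a.value == b.value


def dedupe(records, key_cols, by_formula=False, eps=0.0):
    """Keep the first record of every group that matches on key_cols."""
    kept = []
    for rec in records:
        is_dup = any(
            all(cells_equal(rec[c], old[c], by_formula, eps) for c in key_cols)
            for old in kept
        )
        if not is_dup:
            kept.append(rec)
    return kept


rows = [
    [Cell("=1+1", 2)],
    [Cell("=2", 2)],
    [Cell('="b"', "b")],
    [Cell("=CHAR(98)", "b")],
]
by_value = dedupe(rows, [0])
by_formula = dedupe(rows, [0], by_formula=True)
```

Comparing by value keeps two records here, while comparing by formula keeps all four, which is exactly the kind of choice the thread could not settle without a dialog.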
Comment 60 Heiko Tietze 2024-01-22 10:16:53 UTC
(In reply to Mike Kaganski from comment #59)
> The "duplicate" concept is *SO COMPLEX*...
This is exactly what I mean. You cannot implement a swiss-army knife for every scenario. If the one inbuilt function is not sufficient in some _rare_ use cases, those need to be accomplished by alternative methods. But the primary workflow should be supported as easily as possible, i.e. one click to remove duplicates, as the function label says.
Comment 61 Mike Kaganski 2024-01-22 10:39:21 UTC
(In reply to Heiko Tietze from comment #60)
> This is exactly what I mean. You cannot implement a swiss-army knife for
> every scenario.

Yes you can. And you need to. Just because we are the office suite, and not a tool for one single task. See how we *do* try to implement it in case of filtering, or text import, or file format support.

See how other office suites implement it. Excel provides a dialog. Google Sheets provides a dialog. People here expect a dialog. It is simply unavoidable.
Comment 62 Pedro 2024-01-22 10:47:06 UTC
> This is exactly what I mean. You cannot implement a swiss-army knife for every scenario. If the one inbuilt function is not sufficient in some _rare_ use cases, those need to be accomplished by alternative methods. But the primary workflow should be supported as easy as possible- ie. one click to remove duplicates, as the function label says.

Heiko did you even try to use this function in Excel or any other office suite before commenting?
A dialog is REQUIRED, if only because you need a "this data has headers" check mark. The button in Excel opens two sequential dialogs. Even the Calc extension requires a dialog.
If you don't feel this is essential because it doesn't fit your needs please don't try to cripple something that is crucial and sorely missing from Calc. 

The Excel function simplifies things by referring to Remove Duplicate Values. This does not include formulas since the formula is not a value but provides you with one.
Take into consideration that in Excel if you expand duplicate removal to multiple columns, you also need to select the columns that count for duplicates. If you select multiple columns, it will only remove rows when both selected columns have duplicates.
Comment 63 Heiko Tietze 2024-01-22 11:06:16 UTC Comment hidden (off-topic)
Comment 64 Sahil Gautam 2024-01-22 12:33:44 UTC
So We concluded on "dialog needed".
Comment 65 Rafael Lima 2024-01-22 12:36:41 UTC
(In reply to Pedro from comment #62)
> A dialog is REQUIRED if not even for the case that you need to have a "this
> data has headers" check mark. The button in Excel opens two sequential
> dialogs.

Indeed Excel opens a dialog before removing duplicates, but the dialog does not offer many options... f.i. there's nothing about how to handle cells with formulas.

I understand the need for a dialog here... but there should also be a way to simply click a button and remove duplicates without a dialog disrupting the workflow, similarly to what we have with "Sort" (which has a dialog) and "Sort ascending" (no dialog needed).
Comment 66 Mike Kaganski 2024-01-22 12:48:09 UTC
(In reply to Rafael Lima from comment #65)
> but there should also be a way to
> simply click a button and remove duplicates without a dialog disrupting the
> workflow, similarly to what we have with "Sort" (which has a dialog) and
> "Sort ascending" (no dialog needed).

No. This *might* turn out to be useful once you have implemented the dialog and then start seeing requests like "make my difficult sequence of clicking the toolbar and then pressing Enter easier, because pressing Enter is so tiresome". It may be justified by user demand, but it should not be done pro-actively: multiplying UNO commands without sizable demand is exactly the bloat that should be avoided, not the dialogs that users ask for.
Comment 67 Rafael Lima 2024-01-22 13:10:57 UTC
(In reply to Mike Kaganski from comment #66)
> multiplying UNO commands without the sizable demand is
> exactly the bloat that should be avoided, not the dialogs that user ask for.

AFAIK it would be possible to have both (dialog / non-dialog) functionalities with a single UNO command, depending on the arguments passed to it.

Another real use-case of removing duplicates without the dialog would be for writing macros, where all parameters of the UNO command would be provided by the macro and no dialog would be necessary.

In summary... I'm in favor of having a dialog here, but it would also be cool to have the ability to run this UNO command without showing the dialog as well.
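
A language-neutral sketch of that idea, written in Python for brevity: one entry point that skips the dialog when the caller (for example a macro) already supplies the settings. `run_command`, the `settings` dictionary and `ask_user` are hypothetical stand-ins and do not reflect the actual UNO command or its arguments.

```python
from typing import Callable, Optional


def run_command(rows, settings: Optional[dict] = None,
                ask_user: Optional[Callable[[], Optional[dict]]] = None):
    """Single entry point: use the given settings, or ask via a dialog."""
    if settings is None:
        # Interactive path: the dialog callback returns settings or None.
        settings = ask_user() if ask_user is not None else None
        if settings is None:
            return rows  # cancelled, nothing changes

    key_cols = settings["columns"]
    seen, kept = set(), []
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


data = [("a", 1), ("a", 2), ("b", 1)]
from_macro = run_command(data, settings={"columns": [0]})
from_dialog = run_command(data, ask_user=lambda: {"columns": [1]})
cancelled = run_command(data, ask_user=lambda: None)
```

The macro path never touches the dialog callback, so no second command is needed for the non-interactive case.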
Comment 68 Mike Kaganski 2024-01-22 13:56:11 UTC
(In reply to Rafael Lima from comment #67)
> AFAIK it would be possible to have both (dialog / non-dialog)
> functionalities with a single UNO command, depending on the arguments passed
> to it.

This is exactly why I asked for an optional argument for the UNO command when I reviewed the proposed patch.
Comment 69 Mike Kaganski 2024-03-01 09:16:20 UTC
*** Bug 159980 has been marked as a duplicate of this bug. ***
Comment 70 RV park 2024-09-27 12:51:38 UTC Comment hidden (no-value)
Comment 71 RV park 2024-09-27 12:52:04 UTC Comment hidden (no-value)
Comment 72 Commit Notification 2024-09-27 13:23:08 UTC
Sahil Gautam committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/29fd68bb682006ccaa5aaed516c064b5b6368463

tdf#85976 [RFE] Add a "Remove Duplicate Records" command

It will be available in 25.2.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 73 Regina Henschel 2024-09-28 14:22:16 UTC
I see these problems:
(1) On first call, "by column" was selected although that is the less common case.
(2) The meaning of "Orientation" is not really clear. Suggestion:
   Compare:  rows   columns
(3) The purpose of "Items" is unclear.
(4) Help page does not exist.
(5) If a database range is selected, the dialog does not consider the property "Contains column labels" of the database range.
(6) It is not usual to use "Okay", but other dialogs have it named "OK".
(7) Command has no extended tip.
Comment 74 Sahil Gautam 2024-09-28 15:21:34 UTC
The help page patch needs code review https://gerrit.libreoffice.org/c/help/+/173142

I asked Olivier & Ilmari to merge it, but they said it can only be done after the feature patch is released.
Comment 75 Commit Notification 2024-09-28 16:46:22 UTC
Sahil Gautam committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/help/commit/0d61990c85ed2135a6064b6caf1e989e820cd65c

tdf#85976 Help page for HandleDuplicateRecords dialog
Comment 76 Olivier Hallot 2024-09-28 16:57:10 UTC
The Help page was approved to provide a landing page for the dialog and UNO command.

A follow up is needed to fix typos and improve style.
Comment 77 Sahil Gautam 2024-09-29 15:45:08 UTC
@Regina, I fixed most of them in https://gerrit.libreoffice.org/c/core/+/174201
I didn't understand the 5th point about database range. I tested it with plain rows and columns in calc. Is it supposed to do more than that? Can you please provide some steps to reproduce the issue...
Comment 78 Buovjaga 2024-09-29 16:27:57 UTC
(In reply to Sahil Gautam from comment #77)
> @Regina, I fixed most of them in
> https://gerrit.libreoffice.org/c/core/+/174201
> I didn't understand the 5th point about database range. I tested it with
> plain rows and columns in calc. Is it supposed to do more than that? Can you
> please provide some steps to reproduce the issue...

https://help.libreoffice.org/latest/en-US/text/scalc/guide/database_define.html

Regina: are label ranges relevant as well?

https://help.libreoffice.org/latest/en-US/text/scalc/01/04070400.html
Comment 79 Regina Henschel 2024-09-29 17:36:25 UTC
Created attachment 196782 [details]
Database ranges examples

to (5):
A database range is a <table:database-range> element in the file format. It has the attribute 'table:contains-header' with values 'true' (default) and 'false', and the attribute 'table:orientation' with values 'row' (default) and 'column'. The attributes are only written to the file if the value is not the default.

This results in 4 combinations.  The attachment has for each one a database range. Each database range is on a sheet of its own. Database range and sheet are named according the desired combination.
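
For anyone who wants to check the settings in a file directly, the sketch below reads the two attributes with their defaults. The XML snippet and the range names RangeA to RangeD are invented for the example (real elements carry more attributes, such as the target range address), and `range_settings` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"

# Made-up fragment showing the four combinations; defaults are omitted.
SAMPLE = """<table:database-ranges xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0">
  <table:database-range table:name="RangeA"/>
  <table:database-range table:name="RangeB" table:contains-header="false"/>
  <table:database-range table:name="RangeC" table:orientation="column"/>
  <table:database-range table:name="RangeD" table:contains-header="false" table:orientation="column"/>
</table:database-ranges>"""


def range_settings(xml_text):
    """Map each database range name to (contains_header, orientation)."""
    root = ET.fromstring(xml_text)
    settings = {}
    for el in root.iter(f"{{{TABLE_NS}}}database-range"):
        name = el.get(f"{{{TABLE_NS}}}name")
        # A missing attribute means the default value applies.
        header = el.get(f"{{{TABLE_NS}}}contains-header", "true") == "true"
        orientation = el.get(f"{{{TABLE_NS}}}orientation", "row")
        settings[name] = (header, orientation)
    return settings


parsed = range_settings(SAMPLE)
```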

The 'table:orientation="column"' case is not fully implemented in LibreOffice, but only for sorting. It does not exist in the 'table'-feature of Excel.

You should check whether the settings are imported correctly, and adjust them if necessary, before you test the new "remove duplicate" feature.

To see the settings, put the cursor into the data range, then go to menu Data > Sort. The names of the database range and the sheet tell you which setting is intended.

You can see and alter the 'label'-settings too in menu Data > Define Range.
Select a range and open the "Options" part. Depending on the range, the option 'Contains column labels' should be checked or not.

To select a database range use menu Data > Select Range.

From a user's point of view, I expect that when I select a database range, the new "show/remove" duplicate records dialog has the orientation and label settings pre-selected according to the settings in the database range.


to (3): With renaming items -> records the purpose is still not clear. Perhaps you explain here in the bug report the purpose in detail. Then we can help to find a short description.
Comment 80 Regina Henschel 2024-09-29 17:54:18 UTC
(In reply to Buovjaga from comment #78)
> 
> Regina: are label ranges relevant as well?
> 
> https://help.libreoffice.org/latest/en-US/text/scalc/01/04070400.html

Yes, if the selection is not a database range but has cells which belong to a label range, then the dialog should open with the corresponding settings pre-selected. As user I would expect this, but I don't know how effortful it is to implement.


And a further remark:
(8) A rectangular cell range can have rows/columns which are directly hidden by the user, or hidden by a filter, or hidden in a collapsed group. Are such hidden rows/columns affected by "remove duplicates"? Are such hidden rows/columns evaluated for the criterion "duplicate"?
Comment 81 Sahil Gautam 2024-09-29 20:06:15 UTC
Created attachment 196787 [details]
duplicate records dialog evolution
Comment 82 Sahil Gautam 2024-09-29 20:17:31 UTC
(8) Yes, the dialog also considers the hidden rows/columns for duplicates comparison.

(3) The UI was designed by Heiko (please refer to the "duplicate records dialog evolution" attachment). It grouped the relevant controls/widgets into sections like "Actions:", "Items:" (now "Records:" in the latest patch (still not final)).

The design to me looked like "Howard Roark's designs from *The Fountainhead*" :), I just couldn't dare to change the masterpiece;
Comment 83 Rafael Lima 2024-09-30 12:04:11 UTC
(In reply to Regina Henschel from comment #79)
> From point of a user I expect, that when I select a database range, the new
> "show/remove" duplicate records dialog has pre-selected the orientation and
> label settings according to the settings in the database range.

Code pointer for Sahil... check ScTabPageSortFields::Init()

https://opengrok.libreoffice.org/xref/core/sc/source/ui/dbgui/tpsort.cxx?r=57c7269f#116

This is where the dialog checks the database range properties and sets them in the Sort dialog.
Comment 84 Regina Henschel 2024-10-02 17:12:06 UTC
In regard to (3) and the help text for it:

The current help has:
<paragraph role="paragraph" id="par_id61725963172527"><emph>Items:</emph> shows the headers for the selected records. If "data includes headers" checkbox is checked, then it contains the headers of the records, else it's either the row number or the column name depending on the orientation. The user can select/unselect the records to be compared. In the column header, it contains a checkbox to toggle state for all the records in the treeview.</paragraph>

From my tests, I would say that the sentence "The user can select/unselect the records to be compared." is wrong.

Content problem:
In the language of databases, the user does not select/unselect "records" but "fields".
Example:
Animal | Location | Year
Deer | West | 2023
Deer | West | 2021
Deer | East | 2021
Deer | East | 2023

If "all" is selected, all four records are different and thus retained.

If only "Animal" and "Location" are selected, then the field "Year" is not used in the comparisons. The comparisons actually use the shortened records:
Deer | West
Deer | West
Deer | East
Deer | East
Thus these records are retained: (assuming first in top-down direction)
Deer | West | 2023
Deer | East | 2021
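
The same example as a small sketch, assuming first-wins in top-down direction as stated above. The function `remove_duplicates` and its `compare_by` parameter are illustrative names, not the dialog's code.

```python
FIELDS = ("Animal", "Location", "Year")
RECORDS = [
    ("Deer", "West", 2023),
    ("Deer", "West", 2021),
    ("Deer", "East", 2021),
    ("Deer", "East", 2023),
]


def remove_duplicates(records, compare_by):
    """Keep whole records, but compare only the fields named in compare_by."""
    idx = [FIELDS.index(name) for name in compare_by]
    seen, kept = set(), []
    for rec in records:
        key = tuple(rec[i] for i in idx)  # the "shortened record"
        if key not in seen:
            seen.add(key)
            kept.append(rec)  # the full record is retained
    return kept


all_fields = remove_duplicates(RECORDS, FIELDS)
two_fields = remove_duplicates(RECORDS, ("Animal", "Location"))
```

The checkboxes therefore choose the fields that form the comparison key; the records themselves are always kept or removed as a whole.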

I have no solution for (3) because the term "field" is related to databases and very technical. But "record" is surely wrong. Perhaps talk to Heiko and Olivier?


Style problem:
The help addresses the user. Therefore "The user can select/unselect..." is an unsuitable wording. Common are wordings with imperative, e.g. "Select foo", "Choose bar", or descriptive "Selects ...", "Shows ...", "Displays", or wordings with "You ...", or constructs of the form "To get result foo, do bar."
Comment 85 Regina Henschel 2024-10-02 17:47:40 UTC
In regard to (8): If I use a filter (Auto, Standard and Advanced) on the range and that hides some rows, and when I then call the "Duplicate Records..." dialog, I get the error message "No Data found to operate on."

This does not happen, when I hide some rows manually or have some rows in a collapsed group.

Is it intended, that the feature does not work for a cell area with active filter? If yes, the error message and the help should tell this.
Comment 86 Regina Henschel 2024-10-02 18:02:04 UTC
If "By Row" is selected, the text for Headers is "Data contains row headers". That is misleading. If "By Row" is selected it compares and deletes rows. Thus the text should be "Range contains column labels". The text should be the same as in the Sort dialog and in the Options in the "Define Database Range" dialog.

And the other way round, if "By Column" is selected, it should be "Range contains row labels".
Comment 87 Sahil Gautam 2024-10-03 08:10:23 UTC
(In reply to Regina Henschel from comment #85)
> [...] I get the error message "No Data found to operate on."
> 
> This does not happen, when I hide some rows manually or have some rows in a
> collapsed group.
> 
> Is it intended, that the feature does not work for a cell area with active
> filter? If yes, the error message and the help should tell this.

The error message only shows up if the active cell is not on the data (or right next to it). I tried it with autofilter and couldn't reproduce it. I had 4 rows with 1231 abca (2 columns), then I hid the second row ( |2|b|) and called duplicate records when the cell focus was on |3|c(here)|; the dialog appeared as expected.
Comment 88 Regina Henschel 2024-10-03 09:54:14 UTC
Steps to reproduce the problem:
1. Open attachment 196782 [details] "Database ranges examples" and go to sheet "VertLabel".
2. Menu Data > Select Range. In that "Select Database Range" dialog select "VertWithLabel". OK. Now the range is selected and the cell cursor is on A1.
3. Menu Data > More Filters... > Standard Filter. Select Field name "Region", Condition = and Value "East". OK. Now the range shows only records with Region "East". Cell cursor is still on A1.
4. Menu Data > Duplicate Records... => Error message.
Comment 89 Commit Notification 2024-10-05 18:01:10 UTC
Sahil Gautam committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/7a1636a24f8a4c856348bb6781aef4a494227691

tdf#85976 change labels as suggested in comment 73 on the ticket

It will be available in 25.2.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 90 Pedro 2024-10-06 11:45:41 UTC
If possible, please add this action to the Data tab of the Tabbed UI. Thank you for your great work, Sahil!
Comment 91 Mike Kaganski 2024-10-06 12:16:18 UTC
The "Records" element group (the "All" and the list with checkboxes) is completely unclear from the user's point of view. If I didn't know what it means, I would never guess that it's "what to compare to decide if two records (which actually contain all columns, not only the checked ones) are duplicates or not".

The help (still mentioning the old "Items" term) doesn't help to understand that either. It describes what the list shows, but that's actually unimportant; what is important is that the checkboxes define what would be the "primary key" in SQL.
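
To illustrate the "primary key" reading of the checkboxes, here is a minimal Python sketch. It is not the Calc implementation; the function name `remove_duplicate_records`, the `key_columns` parameter, and the sample data are made up for illustration, and it assumes the first occurrence of each key is the one that is kept.

```python
from typing import Hashable, Sequence


def remove_duplicate_records(
    records: Sequence[Sequence[Hashable]],
    key_columns: Sequence[int],
) -> list:
    """Keep the first record for each distinct combination of key columns.

    Only the checked (key) columns decide whether two records are duplicates;
    the surviving record still carries all of its columns.
    """
    seen = set()
    unique = []
    for record in records:
        key = tuple(record[i] for i in key_columns)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


rows = [
    ("East", "Pen", 10),
    ("East", "Pen", 25),  # same Region and Item as the first row
    ("West", "Pen", 10),
]

# Only columns 0 and 1 are "checked": the second row counts as a duplicate.
print(remove_duplicate_records(rows, key_columns=[0, 1]))

# All columns "checked": no two rows are identical, so nothing is removed.
print(remove_duplicate_records(rows, key_columns=[0, 1, 2]))
```

With only the first two columns checked, the second row is dropped even though its third value differs, which is the behavior the label needs to convey.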
Comment 92 Mike Kaganski 2024-10-06 12:19:24 UTC
Some variation of "Compare by:" or "Fields to compare:" would be a much better label.
Comment 93 Sahil Gautam 2024-10-27 20:10:13 UTC
Patch https://gerrit.libreoffice.org/c/core/+/175704 changes "Compare: " to "Compare by: " and "Records: " to "Rows: " (if compare by columns is selected), and "Columns: " (if compare by rows is selected). Suggestions for better labels are always welcome!
Comment 94 Sahil Gautam 2024-10-29 00:00:35 UTC
patch https://gerrit.libreoffice.org/c/core/+/175764 adds "Handle Duplicate Records" button to the notebookbar. The label "Handle Duplicate Records" is very long, takes so much space on the notebookbar. I have "Handle Duplicates" in my mind, need more suggestions...
Comment 95 Heiko Tietze 2024-10-29 09:00:38 UTC
(In reply to Sahil Gautam from comment #94)
> patch https://gerrit.libreoffice.org/c/core/+/175764 adds "Handle Duplicate
> Records" button to the notebookbar. The label "Handle Duplicate Records" is
> very long, takes so much space on the notebookbar. I have "Handle
> Duplicates" in my mind, need more suggestions...

Just "Duplicates"? Besides, bug 163117
Comment 96 Pedro 2024-10-29 09:14:49 UTC
I would keep the name as close as possible to other Office suites.
This is only a "Remove Duplicates" action ?
Then name it "Remove Duplicates", as that is the convention in other office suites and users will be used to looking for that name. Name of this action in:
Google Sheets - Remove Duplicates
Excel - Remove Duplicates
Only Office - Remove Duplicates

WPS Spreadsheets (more features besides removing duplicates) - Manage duplicates

Therefore call it "Remove Duplicates" or "Manage Duplicates". 

Just "Duplicates" doesn't name an action to take.
Comment 97 Sahil Gautam 2024-10-29 13:06:28 UTC
(In reply to Pedro from comment #96)
> I would keep the name as close as possible to other Office suites.
> This is only a "Remove Duplicates" action ?

It also selects duplicates (other than removing them) depending on which option is selected in the "Action" section of the dialog.
Comment 98 Pedro 2024-10-29 15:15:20 UTC
(In reply to Sahil Gautam from comment #97)
> (In reply to Pedro from comment #96)
> > I would keep the name as close as possible to other Office suites.
> > This is only a "Remove Duplicates" action ?
> 
> It also selects duplicates (other than removing them) depending on which
> option is selected in the "Action" section of the dialog.

I would name it "Manage Duplicates" like WPS Spreadsheets then.
Comment 99 Commit Notification 2024-10-30 18:40:05 UTC
Sahil Gautam committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/22949c16e65fccd40d6313c6b6c0d7906f72a999

tdf#85976 Change label from "Handle Duplicate Records" to "Duplicates"

It will be available in 25.2.0.

Comment 100 Commit Notification 2024-10-31 09:09:56 UTC
Sahil Gautam committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/965287a9edb982b4f1857e7a57a73f0bdfd7e330

tdf#85976 Make labels more intuitive in "Duplicate Records Dialog"

It will be available in 25.2.0.

Comment 101 Timur 2024-10-31 09:55:40 UTC
Please add to https://wiki.documentfoundation.org/ReleaseNotes/25.2.
Comment 102 Sahil Gautam 2024-11-01 01:58:58 UTC
(In reply to Mike Kaganski from comment #92)
> Some variation of "Compare by:" or "Fields to compare:" would be much better
> label.

@Mike I think you suggested these labels as a replacement for "Records: " and not the first label "Compare: " []rows []columns? I changed "Compare: " to "Compare by: ", and (while reading the updated labels for updating help page) I felt something was wrong.
Comment 103 Commit Notification 2024-11-06 19:51:23 UTC
Sahil Gautam committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/608e1452c51efa4f9bbcea8ed9a538ff974eed28

tdf#85976 Add 'Handle Duplicate Records' button to the notebookbar

It will be available in 25.2.0.
