165566 – EDITING: New "Remove Duplicate" feature in Release 24.2.1 is very slow

Status:	NEW

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Calc
Version: (earliest affected)	25.2.1.2 release
Hardware:	All All

Importance:	medium normal
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	Calc-Enhancements Performance

Reported:	2025-03-04 11:04 UTC by Hartmut
Modified:	2025-10-06 12:09 UTC
CC List:	6 users

See Also:	166121
Crash report or crash signature:

Attachments
ODS File with an example on which to run the "Remove Duplicates" function (104.32 KB, application/octet-stream) 2025-03-04 12:24 UTC, Hartmut
My Python Script for "Remove Duplicates" (27.75 KB, text/plain) 2025-03-05 09:29 UTC, Hartmut

Description Hartmut 2025-03-04 11:04:56 UTC

Description:
Referring to the new feature "Remove Duplicates" - see https://bugs.documentfoundation.org/show_bug.cgi?id=85976

This is more an enhancement request for the new feature than a bug. However, because the new feature is far too slow, this report can be treated as a bug as well.

I have my own version of Remove Duplicates, based on a self-developed Python script.

In an example with approx. 10,000 rows and 2 columns, my script needs 0.2 seconds.
The new built-in function needs 28 seconds for the same data. That is slower by a factor of about 140 (28 / 0.2).

A subsequent 'undo' takes practically no time with my version (let's assume 0.05 seconds), but 86 seconds with the built-in version (a factor of 1720).

It looks like the built-in function needs a big improvement to become usable.

I can share my Python script if needed. It works as follows: the data is read from the spreadsheet into a Python table, all work is done in that table, and then the table is written back to the spreadsheet.
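The read, process, write-back idea can be sketched in plain Python as follows. This is a minimal illustration, not the attached script: the function names `remove_duplicate_rows` and `pad_rows` are made up here, and the rows are assumed to arrive as a tuple of row tuples (the shape that a UNO call such as `getDataArray()` returns on a cell range). In a real macro the padded result could presumably be written back with `setDataArray()`, which expects the same dimensions as the target range.

```python
def remove_duplicate_rows(rows):
    """Return the rows with later duplicates dropped.

    The first occurrence of each row is kept and the original order is preserved.
    A set of already-seen rows makes each lookup cheap, so the work grows
    roughly linearly with the number of rows.
    """
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # tuples are hashable, so they can be stored in a set
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def pad_rows(rows, total, width):
    """Pad with blank rows so the result has as many rows as the original range."""
    blank = ("",) * width
    return tuple(rows) + (blank,) * (total - len(rows))


# Hypothetical in-memory data standing in for the spreadsheet contents.
data = (("a", 1.0), ("b", 2.0), ("a", 1.0), ("c", 3.0), ("b", 2.0))
unique = remove_duplicate_rows(data)
result = pad_rows(unique, len(data), 2)
```

All comparisons happen in memory and the sheet is touched only once for reading and once for writing, which is the point of the approach described above.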

It also contains an option to sort the data automatically, which gives a further big speed-up. With 10,000 rows and a low number of columns this is not needed. With 100,000 rows it should be used to avoid a long wait.
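Why sorting helps can be shown with a small sketch: once the rows are sorted, identical rows sit next to each other, so each row only has to be compared with its neighbour. This is an assumption about the general technique, not a copy of the attached script. The name `remove_duplicates_sorted` is hypothetical, the values in each column are assumed to be mutually comparable, and the original row order is not preserved.

```python
from itertools import groupby


def remove_duplicates_sorted(rows):
    """Sort the rows, then keep one row from each run of equal neighbours."""
    ordered = sorted(tuple(row) for row in rows)
    # groupby collapses consecutive equal rows into a single group.
    return [key for key, _group in groupby(ordered)]


# Hypothetical sample data with two duplicated rows.
data = [("b", 2.0), ("a", 1.0), ("b", 2.0), ("a", 1.0), ("c", 3.0)]
deduped = remove_duplicates_sorted(data)
```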

The only issue with my script is a limitation of the "selection.getData" function, which stops working with more than 262,144 rows (2^18).

If needed, the limitation of this function should be filed as a separate bug report.


Steps to Reproduce:
Use of "Remove Duplicate"
1. Create a spreadsheet with 10000 rows and 2 columns
2. Run the new built-in "Remove Duplicates" function
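For anyone who wants to build comparable test data rather than use the attached ODS file, a sheet of this size can be generated as a CSV file and opened in Calc. This is only a sketch. The function names, the file name and the value ranges are assumptions chosen so that duplicates are guaranteed (at most 5,000 distinct rows among 10,000).

```python
import csv
import random


def make_rows(n_rows=10_000, n_distinct=500, seed=0):
    """Build two-column rows drawn from a small pool so that many repeat."""
    rng = random.Random(seed)  # fixed seed keeps the data reproducible
    return [
        (f"name{rng.randrange(n_distinct)}", rng.randrange(10))
        for _ in range(n_rows)
    ]


def write_csv(path, rows):
    """Write the rows to a CSV file that Calc can open."""
    with open(path, "w", newline="") as fh:
        csv.writer(fh).writerows(rows)


rows = make_rows()
# Example call (hypothetical file name):
# write_csv("duplicates_test.csv", rows)
```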

Undo
1. Press Ctrl+Z directly after running Remove Duplicates as above

Actual Results:
Very slow: > 20 seconds for Remove Duplicates and > 80 seconds for undo

Expected Results:
Both steps should take less than one second


Reproducible: Always


User Profile Reset: No

Additional Info:
I think everything is already mentioned in the "Description".

Comment 1 Xisco Faulí 2025-03-04 11:34:21 UTC Comment hidden (obsolete)

Thank you for reporting the bug. Please attach a sample document, as this makes it easier for us to verify the bug. 
I have set the bug's status to 'NEEDINFO'. Please change it back to 'UNCONFIRMED' once the requested document is provided.
(Please note that the attachment will be public, remove any sensitive information before attaching it. 
See https://wiki.documentfoundation.org/QA/FAQ#How_can_I_eliminate_confidential_data_from_a_sample_document.3F for help on how to do so.)

Comment 2 Hartmut 2025-03-04 12:24:32 UTC

Created attachment 199598
ODS File with an example on which to run the "Remove Duplicates" function

Comment 3 m_a_riosv 2025-03-04 22:33:01 UTC

Really slow with
Version: 25.2.1.2 (X86_64) / LibreOffice Community
Build ID: d3abf4aee5fd705e4a92bba33a32f40bc4e56f49
CPU threads: 16; OS: Windows 11 X86_64 (10.0 build 26100); UI render: Skia/Raster; VCL: win
Locale: es-ES (es_ES); UI: en-US
Calc: CL threaded
Only a bit quicker with
Version: 25.8.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 1622d672b8cc721d5f9917931f6d8d999f218f7a
CPU threads: 16; OS: Windows 11 X86_64 (build 26100); UI render: Skia/Raster; VCL: win
Locale: en-US (es_ES); UI: en-GB
Calc: CL threaded

Comment 4 Rafael Lima 2025-03-05 00:37:51 UTC

(In reply to Hartmut from comment #0)
> I had an own version for Remove Duplicates based on a self-developed Python
> script.

Can you please share your Python script for analysis?

Comment 5 nobu 2025-03-05 08:35:25 UTC

On my PC, it can be done in about 1.7 seconds in Basic and 0.07 seconds in Python.
However, the Python version ignores cell formatting (the same as the UNIQUE function).
The built-in function should at least be faster than what can be done in Basic.

Comment 6 Hartmut 2025-03-05 09:29:08 UTC

Created attachment 199613
My Python Script for "Remove Duplicates"

This is my Python script for "Remove Duplicates". The dialog boxes it needs are created within the script.

I started with a BASIC script, but it was too slow. I then re-coded it in Python and used the project to learn Python.

Comment 7 Buovjaga 2025-10-06 12:09:20 UTC

Hartmut: you may enjoy the improvements committed toward bug 166121