Bug 77382 - Poor performance with find & replace with empty value on large data set
Summary: Poor performance with find & replace with empty value on large data set
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc (show other bugs)
(earliest affected) release
Hardware: Other All
: medium normal
Assignee: Not Assigned
Whiteboard: BSA
Keywords: bibisected, perf, regression
: 82013 (view as bug list)
Depends on:
Reported: 2014-04-13 02:42 UTC by wes
Modified: 2015-12-15 11:15 UTC (History)
7 users (show)

See Also:
Crash report or crash signature:

attachment-2559-0.html (3.86 KB, text/html)
2014-04-13 20:59 UTC, wes
attachment-2559-1.dat (1 bytes, multipart/alternative)
2014-04-13 20:59 UTC, wes
bug.77382.sample.file.ods (419.88 KB, application/vnd.oasis.opendocument.spreadsheet)
2014-04-13 20:59 UTC, wes
Sample file with data for verification. (87.52 KB, application/vnd.oasis.opendocument.spreadsheet)
2014-04-13 22:20 UTC, m_a_riosv
bibisectlog (1.44 KB, text/x-log)
2014-07-01 02:48 UTC, Bryan Quigley

Note You need to log in before you can comment on or make changes to this bug.
Description wes 2014-04-13 02:42:03 UTC
Problem description: 

Steps to reproduce:
1. Open on .ods file in Calc with large amount of data, e.g., 584 columns & 755 rows
2. Run Search & Replace to replace contents of ~20% of cells
3. Program freezes, requires a kill process (TERM command) to unfreeze

Current behavior:
So, that's one way to describe the issue in operationalizable terms, but it's really that when a large file is opened, many operations in Calc can be _greatly_ lagging.  E.g., scrolling right through columns can start to lag behind a continually-pressed key (viz., right arrow) such that Calc takes several seconds to catch up after the key is let up.  I'm not in any way a programmer, but seems like maybe an issue of how objects are held and manipulated in memory?

In a sense, I'm asking for a lot (and, yeah, sorry but probably through a very poorly-submitted & -formatted bug report; this isn't my bailiwick), but, yeah, I _do_ ask a lot of the spreadsheet program of an office suite.  I believe it's where most people make their decision about what office suite to use.  Word processors, presentation software, etc. are simpler and usually contain less importantly rich information that needs quicker processing.

Expected behavior:
Faster responsiveness on larger files.  Perfection right now!  :  )

Oh, and maybe knowing (& finding out how to) better submit bug reports and help improve LO qua end user.
Operating System: Ubuntu
Version: rc
Comment 1 m_a_riosv 2014-04-13 15:10:03 UTC
Hi Wes, thanks for reporting.

Please can you attach a sample file.
Comment 2 wes 2014-04-13 20:59:11 UTC
Created attachment 97320 [details]

Hi There,

Of course.

The original file contains highly sensitive information (personal,
identifiable data from a longitudinal study of at-risk adolescents), so
I've used the sed command to replace the the original contents with
nonsense.  But, I've replicated it with that new, attached file (and found
it with other large files, too).  I was trying to replace the cells
containing NA with empty cells (and vice-versa).

This is may be nothing more than the patient trying to tell the doctor how
to do his job, but it seems related to how RAM is used, and perhaps
includes elements from the document history: Larger files on which many,
complex operations are done (e.g., copying in formulae that operate on
several cells into _all_ cells in a column) will crash it, and this seems
both to get a bit worse as the file has been worked on for a while and to
get a bit better after LO is restarted.

Unrelatedly, I am interested in helping out with LO, but don't have many
skills that are useful.  So, if you know of things an average Joe end user
can do (beta testing, language clean-up in online guides, etc.), please let
me know.  I've looked around a bit on the site to find this, but probably
more through lack of sufficient effort haven't found things I personally
can do to help.

Sorry for the longiloquence, and thank you so much for the help!


On Sun, Apr 13, 2014 at 11:10 AM, <bugzilla-daemon@freedesktop.org> wrote:

>  m.a.riosv <mariosv@miguelangel.mobi> changed bug 77382<https://bugs.freedesktop.org/show_bug.cgi?id=77382>
>  What Removed Added  Status UNCONFIRMED NEEDINFO  CC
> mariosv@miguelangel.mobi  Ever confirmed   1
>  *Comment # 1 <https://bugs.freedesktop.org/show_bug.cgi?id=77382#c1> on
> bug 77382 <https://bugs.freedesktop.org/show_bug.cgi?id=77382> from
> m.a.riosv <mariosv@miguelangel.mobi> *
> Hi Wes, thanks for reporting.
> Please can you attach a sample file.
>  ------------------------------
> You are receiving this mail because:
>    - You reported the bug.
Comment 3 wes 2014-04-13 20:59:12 UTC
Created attachment 97321 [details]
Comment 4 wes 2014-04-13 20:59:12 UTC
Created attachment 97322 [details]
Comment 5 m_a_riosv 2014-04-13 22:19:47 UTC
Reproducible with:
Version: Build ID: 05dceb5d363845f2cf968344d7adab8dcfb2ba71
Version: Build ID: d7dbbd7842e6a58b0f521599204e827654e1fb8b
Version: Build ID: 3be8cda0bddd8e430d8cda1ebfd581265cca5a0f
Version: Build ID: 6c3586f855673fa6a1576797f575b31ac6fa0ba3

With the attached sample file in the next comment.
About  11 seconds to replace AN for nothing on 16384 cells
About  36 seconds to replace AN for nothing on 32768 cells
About 168 seconds to replace AN for nothing on 65536 cells

The number of replacements increases the time to replace in a exponential way.

With Version: Build ID: a59ce81388f477fc89db57f0c27f222f31884eb
about 15 seconds to replace 65536 cells
Comment 6 m_a_riosv 2014-04-13 22:20:35 UTC
Created attachment 97324 [details]
Sample file with data for verification.
Comment 7 Kohei Yoshida 2014-05-24 00:35:02 UTC
So, it's the "replace with empty" operation that's slow.  When you replace all with non-empty value, it's not slow.
Comment 8 Bryan Quigley 2014-07-01 02:48:18 UTC
Created attachment 102039 [details]

Bibisected to source-hash-022c54742e7997bf46a608f1ab0b500f2537f7f5, commit 0461728e0d5cd5e24954864881357a5f370c4714, WaE: private field 'mrCells' is not used
Comment 9 Bryan Quigley 2014-07-17 18:41:08 UTC
My tests indicate the slowdown is in the bool ScTable::Search function (or in how it's called) in sc/source/core/data/table6.cxx.  When comparing the speed of the function with FindAll vs ReplaceAll the performance for Replace All starts at about 141 times the Find all.   By the end of the search they are about at the same level though.
Comment 10 Bryan Quigley 2014-07-19 04:27:53 UTC
This call in (table6.cxx)
aCol[nCol].SetString(nRow, nTab, aString, pDocument->GetAddressConvention());  is very slow, and appears to be further to the cause... 

Also in a ReplaceAll, for an reason I don't understand SetString is also called somewhere here:
ScAddress aAdr( nCol, nRow, nTab );
aCell.commit(*pUndoDoc, aAdr);
Comment 11 Bryan Quigley 2014-07-21 22:36:20 UTC
Alright so I was wrong on the last count it's actually.
-> SearchCell -> The last else statement -> aCol[nCol].SetString(nRow, nTab, aString, pDocument->GetAddressConvention());
-> aNewCell.release(*this, nRow);  (In SetString, column3.cxx)
-> ScCellValue::release (cellvalue.cxx), it falls to default in the switch
-> rColumn.Delete(nRow); -> ScColumn::Delete( SCROW nRow ) (column3.css)
Final in LibO - maCells.set_empty(nRow, nRow); Which is actually defined in mdds

[1] Link to before the commits that changes this around for reference: http://cgit.freedesktop.org/libreoffice/core/tree/sc/source/core/data/column3.cxx?id=b139f6fedfcf3cbed0eadeb007e2155b576413d2
Comment 12 Bryan Quigley 2014-07-25 17:20:08 UTC
I believe the issue was introduced specifically in this commit (by Kohei Yoshida):

Specifically in the need to change how "Delete" works.
Comment 13 Kohei Yoshida 2014-08-02 01:27:34 UTC
*** Bug 82013 has been marked as a duplicate of this bug. ***
Comment 14 Kohei Yoshida 2014-08-02 01:29:00 UTC
(In reply to comment #12)
> I believe the issue was introduced specifically in this commit (by Kohei
> Yoshida):

Of course I broke it!  That's not news.  How to fix it is the question.
Comment 15 Jean-Baptiste Faure 2014-09-28 08:31:23 UTC
I think it is useless to fill a window with all cells changed when they are thousands, the user will not read them. I think it is enough to say him how many cells have been replaced when the amount exceeds something like 100 or 1000.

Best regards. JBF
Comment 17 Björn Michaelsen 2014-11-27 17:31:50 UTC
resolving as per comment 16.
Comment 18 Robinson Tryon (qubit) 2015-12-15 11:15:23 UTC
Migrating Whiteboard tags to Keywords: ( perf bibisected)