ODS file opens slow but after on/off Wrap text automatically it opens very fast
Steps to Reproduce:
1. Open file from attach
2. It opens slow
3. Select all (Ctrl+A) on sheet "результат"
4. Enable Wrap text then disable Wrap text (use Format->Cells->Alignment->Wrap text automatically or use icon on toolbar)
5. Save file with another name
6. Close current file
7. Open saved file -> it opens fast
file opens slow
file opens fast
User Profile Reset: No
Версия: 184.108.40.206.alpha0+ (x64)
ID сборки: ccfbe8b478f3daa8b5ec07a7e48dd5fbf8556811
Потоков ЦП: 4; ОС:Windows 10.0 Build 18362; Отрисовка ИП: по умолчанию; VCL: win;
Локаль: ru-RU (ru_RU); Язык интерфейса: ru-RU
Created attachment 155074 [details]
Problem ODS file
Julien, could you make a beautiful FlameGraph here?
No pb, I'll give a try after my daytime job.
Created attachment 155096 [details]
On pc Debian x86-64 with master sources updated today (+enable-symbols), I retrieved this Flameware.
Here are the steps I did:
- open Calc
- launch data collect
- open attached file
- close file
I noticed 2 parts in Flamegraph:
- ScMatrixImpl::GetDouble calling mdds::multi_type_matrix<matrix_trait>::get_numeric
- ScMatrixImpl::IsValueOrEmpty calling mdds::multi_type_matrix<matrix_trait>::get_type
Kohei: sorry to put you in cc, I know you've been working hard on Mdds for some time now and have less time for LO. Anyway, thought you might have some opinion about knowing if the bottleneck is in mdds or LO.
I can't repro it in LO 6.0.4 but I can in 6.1.4 =>regression
seems to have started with:
commit 693953dd4699887bd3f5bca2c3582b5fae1d6992 [log]
author Vasily Melenchuk <Vasily.Melenchuk@cib.de>
Fri Apr 06 20:19:10 2018 +0300
committer Katarina Behrens <Katarina.Behrens@cib.de>
Mon Oct 22 23:30:23 2018 +0200
parent fad764c02c7a9cd210bfa44ea0ce1ac5354d6427 [diff]
tdf#62268: allow row height recalculation on document load
During document load rows with style:use-optimal-row-height="true"
should recalculate it's height.
* includes: Row height tolerance level increase for unittest
* tdf#118086: calc: invalid row autoheight fixed
$ git bisect bad 40ab4a5cf85d27950e409bd4af0086cd98213719 is the first bad commit
Author: Norbert Thiebaud <email@example.com>
Date: Mon Oct 22 14:51:42 2018 -0700
:040000 040000 109429f0b4e7074293591b4bb614714854730480 80631ed7ef70e4de2eefa29d10c39c893f835b37 M instdir
$ git bisect log
# bad: [75d131082ce51ed5a898d97bdc2b7a9fe5ddb340] source 5b3765f4d881e7ddefd0c4aad6886a46f000b4fc
# good: [29d08f54c2f71ffee4fe12dbb24c5f5cbedecfd2] source 6eeac3539ea4cac32d126c5e24141f262eb5a4d9
git bisect start 'master' 'oldest'
# good: [6227e15df9be101688e37cd891817cd858b49e03] source b8b7f8a8f8d97088181d287bb75e74facece16c6
git bisect good 6227e15df9be101688e37cd891817cd858b49e03
# good: [50b236fe0d359b9d5cc9998d2e72009a90a11d08] source b6025e6cffe2024fefebd161ea739188b4b4fdaf
git bisect good 50b236fe0d359b9d5cc9998d2e72009a90a11d08
# good: [d59609a1bfbbb4f924492755719b7d340a51de1b] source eeaf4b0b7ad21da879554bdd93c9a9b97b8268d6
git bisect good d59609a1bfbbb4f924492755719b7d340a51de1b
# good: [75029ade8fd0bbe5abc530394b85b346b499bc55] source ec79d0127f90d65d722e46688b6cfcf2f5e59794
git bisect good 75029ade8fd0bbe5abc530394b85b346b499bc55
# bad: [ab0db3bf58bf2ed341d05f62c0beefc85c134af3] source 6ed498a71482109fea731bb84f288f978bea12dc
git bisect bad ab0db3bf58bf2ed341d05f62c0beefc85c134af3
# good: [7a121b163b7e4eab33c26d4d301b24f0a99d7bf8] source a42c65176f2791cf5e48578a8898bf03185adc89
git bisect good 7a121b163b7e4eab33c26d4d301b24f0a99d7bf8
# good: [5d779faf3839da9d060d6fb410ad30d0db1f3e66] source ac39aba9b2d08b061b0eef651f5ebc7a84391171
git bisect good 5d779faf3839da9d060d6fb410ad30d0db1f3e66
# bad: [e6b0819442248baf59e616887cf216f1da32be59] source 8b6f1d8460b931950b98b5968ff7734f3c128a4d
git bisect bad e6b0819442248baf59e616887cf216f1da32be59
# bad: [84f4d9d1c1b9d0a1c0f0307dafefb33cfbd78c65] source d6f563b37d8a694c6c1d4c9ef3ba746c7f019517
git bisect bad 84f4d9d1c1b9d0a1c0f0307dafefb33cfbd78c65
# good: [00b943e017c148697e3b4ab3c938edcb07e6a33a] source e67ca59e293c4dd37795150cf871e36ca1affb76
git bisect good 00b943e017c148697e3b4ab3c938edcb07e6a33a
# bad: [15b060764ee7934c58786891fab4d0f38a09498e] source 5de85be43198804573787d4186b156b5931c4a9f
git bisect bad 15b060764ee7934c58786891fab4d0f38a09498e
# good: [180656b1b8aed5295a44cbacded98f37e45f5f1d] source fad764c02c7a9cd210bfa44ea0ce1ac5354d6427
git bisect good 180656b1b8aed5295a44cbacded98f37e45f5f1d
# bad: [40ab4a5cf85d27950e409bd4af0086cd98213719] source 693953dd4699887bd3f5bca2c3582b5fae1d6992
git bisect bad 40ab4a5cf85d27950e409bd4af0086cd98213719
# first bad commit: [40ab4a5cf85d27950e409bd4af0086cd98213719] source 693953dd4699887bd3f5bca2c3582b5fae1d6992
I can still reproduce this issue on latest 7.1 branch and also 7.2 branch.
The slowness seems to be in adapting row height. I guess it was "on/off Wrap text" which removed the style:use-optimal-row-height='true', that is why at that case is opens fast.
The "adapting row height" is very annoying. For me a 5MB ODS file may take one minute to open which stops at the adapting row height stage.
I would argue this is not fixable (as a reported regression), since the underlying problem is the slow row height calculation. Now, _not_ re-calculating row height creates another set of bugs (see bug 62268), so how about we re-purpose this as a increase-row-height-calculation-speed?
As it stands now, I'd otherwise be inclined to close as WONTFIX (better be slow & correct in layout, than the other way round).
At least Calc should not add the style:use-optimal-row-height="true" attribute to each row for documents newly created.
Currently, Calc add this attribute by default to each row. This attribute should be set only if the user explicitly select a row / some rows, then right-click and select "Optimal Row Height", or if the user select rows and then double-click on the row headers. Otherwise this attribute is useless.
I see the main reason the row height get re-calculated is that if this attribute is true is that for programly generated documents if this attribute is explicitly added it is true that the row heights should be recalculated, see bug 62268.
The style:use-optimal-row-height attribute specifies that a row height should be recalculated automatically if content in the row changes.
The defined values for the style:use-optimal-row-height attribute are:
•false: row height should not be recalculated automatically if content in the row changes.
•true: row height should be recalculated automatically if content in the row changes.
Well, from this ODF standard, this attribute defines that the row height should be recalculated if the content in this row is changed (i.e., at the time of editing), rather than when the document is open. At time of editing when the row height is recalculated, a new row height value is determined and is saved to the ODF file.
At file open, the application should use the defined row height value directly rather than recalculating each of them. For all new ODF spreadsheet docs there is a defined row height for each row.
The rational of recalculating the row heights at file open as in the commit of bug 62268 is that for some mannully generated ODF files there is no row height value defined in the xml file, normally because those programs are not professional OpenDocument Producers and they simply want an ODF Viewer like Calc to calculate the row height for them when the file is open by the user. And I agree that in such case (if there is no defined row height value) Calc should help to recalculate.
So, consider both bug 62268 and the properly generated ODF filed, the fix should be:
if rowHeight and rowHeight>0:
finalRowHeight = rowHeight
(recalculate row height)
finalRowHeight = reCalculatedRowHeight
finalRowHeight = defaultRowHeight
The improvement of the speed for the recalculation of row height is another issue. However, for large spreadsheets even the fastest calculation may still cost a lot of time and CPU circles as it need to loop into each row and each cell.
There is a lot to unpack here, so this will be a bit lengthy.
First off, as Kevin correctly explained, that use-optimal-row-height flag corresponds to the flag to re-calculate optimal height when a cell value changes during editing, and it is not a flag to indicate whether the row heights should be re-calculated on load. Doing so would (as we now know) cause a noticeable performance degradation during file load for everyone, which is not a good look. If the standard is not clear about when this flag should trigger recalculation in my opinion that point should be further clarified in the standard.
As for improving performance on row height re-calculation, to me it’s a lost cause since that process is already known to be very expensive involving getting font metric information as well as other attributes of the text for every character involved. Caching certain font metric information was attempted in the past, which may improve performance in certain situations, but it’s unclear how much that would help with the row height re-calculation. Still, any attempt to speed it up would not come anywhere close to not running it.
As for a potential fix, the logic Kevin suggested is a reasonable approach, though I’m not sure whether we should check for the row height being 0. That’s a corner case that would not happen when Calc is the generator since setting the height to 0 would set the hidden flag while leaving the original height unchanged. I would just leave it as the generator’s responsibility to never set the height to zero, or leave out the value in case the desired value is not clear. But it’s just my opinion. I think either approach is fine.
Now, here is the bad news. The current ODF import filter code is notoriously hard to work with since it was built on (IMO) the wrong architectural basis of basing it on (mostly) UNO API. UNO API is designed for run-time automation with change notifications firing everywhere. Making it built on UNO API unfortunately resulted in significant performance issue not to mention making it very very difficult to follow, understand, and make significant design changes to the code since UNO promotes the idea of decoupling all the moving parts.
My hope at the time was to slowly switch from populating the content via UNO API to doing the same directly with ScDocument via its import-time specialized accessor ScDocumentImport. You see some trace of my earlier attempt in this part of Calc’s code. If someone is up for it, my suggestion would be to try to populate the content via ScDocumentImport instead of using UNO API. ScDocumentImport has direct access to ScDocument’s private parts, and is designed to populate the document content without unnecessary change notifications etc. I did make quite some inroads toward using ScDocumentImport to speed up loading for other, non-UNO based import filters, but unfortunately I only made small progress with the native ODF import filter. It may be a good idea for someone to pick up the torch to continue further.
Alternatively, it may be actually simpler to introduce an internal configuration option to toggle row height re-calculation on load (defaults to off, of course) to satisfy the use case in tdf#62268. Fixing the import filter code would be the ideal approach, but I’m not sure if anyone would want to even touch that code… I wouldn’t, at least not willingly. ;-) Only those with enough bandwidth could tackle that code, and I don’t have much bandwidth these days unfortunately.
As a final aside, my frustration with working with this code also motivated me to re-architect the ODF import filter in orcus, which can be turned on in Calc with some effort (right now it’s disabled). But that filter is only 10 to 15% complete, so using that would be a long shot. Maybe someday it will become somewhat feature complete, but who knows.
I was not able to reproduce original testcase: in my dev environment in debug build I let Calc to load document for ~ 30 mins without any success and then stopped.
But I do not think that this performance issue is related to row height recalculation. See previously attached perf.svg.bz2:
>ScFormulaCell::InterpretFormulaGroupThreading (917 samples, 56.57%)
>ScDocRowHeightUpdater::update (1 samples, 0.06%)
First one is a multiplication of matrix done in multiple threads most of the time.
Second is a row height calculation during load, it is almost impossible to find on this svg.
14 sec for the opening the file in
Version: 220.127.116.11.alpha0+ / LibreOffice Community
Build ID: e4d23c27288b99c3ed3cfa332ff308b31c01f97d
CPU threads: 4; OS: Linux 5.14; UI render: default; VCL: gtk3
Locale: ru-RU (ru_RU.UTF-8); UI: en-US
Calc: threaded Jumbo
Intel Core 2 Quad 9450 here