Bug 148010 - Large ODS open 100% slower compared to XLSX format
Summary: Large ODS open 100% slower compared to XLSX format
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc (show other bugs)
Version:
(earliest affected)
7.4.0.0 alpha0+
Hardware: All All
: medium minor
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: perf
Depends on:
Blocks: Performance
  Show dependency treegraph
 
Reported: 2022-03-15 15:39 UTC by Telesto
Modified: 2023-10-26 18:31 UTC (History)
4 users (show)

See Also:
Crash report or crash signature:


Attachments
Example file (27.48 MB, application/vnd.oasis.opendocument.spreadsheet)
2022-03-15 15:39 UTC, Telesto
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Telesto 2022-03-15 15:39:08 UTC
Description:
Large ODS open far slower compared to XLSX format

Steps to Reproduce:
1. Open the attached file (measure the time taken)
2. Save as XLSX
3. Close the file
4. Open the XLSX and measure the time taken

Actual Results:
XLSX opens 50% faster

Expected Results:
Ideally faster opening


Reproducible: Always


User Profile Reset: No



Additional Info:
Version: 7.4.0.0.alpha0+ (x64) / LibreOffice Community
Build ID: 3ccc4c123f5e78e0204d11abeab2d1a74278ca3e
CPU threads: 4; OS: Windows 6.3 Build 9600; UI render: Skia/Raster; VCL: win
Locale: nl-NL (nl_NL); UI: en-US
Calc: CL
Comment 1 Telesto 2022-03-15 15:39:57 UTC
Created attachment 178898 [details]
Example file
Comment 2 Buovjaga 2022-03-20 17:55:52 UTC
ODS 29 secs
XLSX 14 secs

Version: 7.4.0.0.alpha0+ (x64) / LibreOffice Community
Build ID: 59a7c40255b836ed75e64686fabb3e9938b755f0
CPU threads: 2; OS: Windows 10.0 Build 19042; UI render: Skia/Raster; VCL: win
Locale: fi-FI (fi_FI); UI: en-US
Calc: threaded Jumbo
Comment 3 Telesto 2022-03-20 19:25:15 UTC
(In reply to Buovjaga from comment #2)
> ODS 29 secs
> XLSX 14 secs
> 
> Version: 7.4.0.0.alpha0+ (x64) / LibreOffice Community
> Build ID: 59a7c40255b836ed75e64686fabb3e9938b755f0
> CPU threads: 2; OS: Windows 10.0 Build 19042; UI render: Skia/Raster; VCL:
> win
> Locale: fi-FI (fi_FI); UI: en-US
> Calc: threaded Jumbo

There was a bug about this somewhere.. However not able to find it..
There was something about XLSX storing less information (formatting?) compared to ODS

With jumbo sheet support the amount of data increases, but native file-format performance doesn't win awards..

So this might go in the direction of optimized version ods for large sheet. So this not the parser being slow, but more about the amount of data being the bottleneck..
Comment 4 Telesto 2022-04-12 09:06:50 UTC
@Michael
One my pokes :-). In my perception some long term goal should be improving the ODS format or having some (new) optimized format for large chunks of data

I wouldn't consider storing jumbo sheets in ODS format because it's - to me - unbearable slow (user perspective); especially when having some alternative format being a lot faster. [Side note: I'n not using Calc that often and not having hug amounts of data the parse]

In my view this becomes an even bigger bottleneck with Jumbo sheet support.

It could give some alternative frames/arguments
* It's not a real showcase for the ODF format
* It might increase e-waste (opting for more powerful machines to improve loading speed) and adding costs (by replacing existing stuff)
* The trend of right to repair; which theoretically increase the use of older hardware. So software optimizations do matter (more)
* It might add into the costs (more CPU time = more electricity) 
* The environmental perspective (CO2) for producing electricity 
* It increases the unproductive time of employees (waiting for file to open)

Note: I'm aware of contradictions in arguments: that older hardware not always that energy efficient. Or new CPU's having CPU Extension Instruction Sets which being far more efficient. Or that unproductive time can be used productively

There was (or is some ticket) which said it's there the ODS holds more data (elements) to parse compared to XLSX.

Other perspective: people are abusing Calc/Excel for some database of (science data). And actually should use a different tool instead of using Jumbo sheets..
But harm is already done with the existence of XLSX sheets..  

---

Sometimes - but likely idealizing  - the data should be store in something like SQLite format (or inspired by). This probably makes possible to incrementally save without writing everything to disk. Which makes save & auto-save a lot faster. Also the part of only loading data (different sheets in Calc file) on request sounds interesting to me.
Comment 5 Buovjaga 2022-04-12 09:56:20 UTC
(In reply to Telesto from comment #4)
> @Michael
> One my pokes :-). In my perception some long term goal should be improving
> the ODS format or having some (new) optimized format for large chunks of data
> 
> I wouldn't consider storing jumbo sheets in ODS format because it's - to me
> - unbearable slow (user perspective); especially when having some
> alternative format being a lot faster. [Side note: I'n not using Calc that
> often and not having hug amounts of data the parse]

But what do you have to show proving that ODS format itself makes it impossible to have performant implementations?
Comment 6 Telesto 2022-04-12 11:34:01 UTC
(In reply to Buovjaga from comment #5)
> But what do you have to show proving that ODS format itself makes it
> impossible to have performant implementations?

Lets start with: I'm don't know any of the specs of the ODS format :-) So don't know what's allowed to do within the file format itself.

I'm only having a *vague* recollection of XLSX being faster because of less data.. I think Kohei said so, but well can't find it.. 

However I can illustrate it: the ODS has a content.xml of 952.596.194 bytes (for the attached file) and the exported XLSX has an Sheet.xml of 406.909.860 bytes (+/-50% the size). Which is an reasonable explanation for the difference in import time. 

Parsing is smaller xml file faster by definition :-). You can optimize all you want, it always be slower. 

So XLSX does store less, but apparently this isn't big issue 

--

The SQLite topic is more some bold idea (of a novice). If you're reconsidering the file format anyhow, you should evaluate the alternatives. The SQLite database has some advantages.. 

But well this might be wishful thinking an unrealistic.. Third party software probably struggles with such a shift for example.  And likely really really complex to implement.

-----

The other end of the topic is the current implementation: the "architectural basis of basing it on (mostly) UNO API."

Which is mentioned in 
bug 128204 comment 13 (Noel)
bug 108284 comment 15 (Kohei)

But well I assume the same UNO API implementation is shared across ODS and XLSX, so this shouldn't make much difference, IMHO.

I find the XLSX speed reasonable, but likely someone disagrees
Comment 7 Michael Meeks 2022-04-12 16:31:35 UTC
I'm afraid the ODS format does make life much more difficult for many optimizations and for exposing parallelism - since it is a single, huge stream for a whole workbook - so each sheet cannot be parsed in separate threads (as with XLSX). It is certainly true ODS uses more verbose tags too -but- then our parsing (while it is faster then populating our core model) is effectively free since it happens in another thread so I don't think we need to get too hung up on long attribute names until we have a profile that says that tokenizing them is in fact really slow. Even in this case - possibly we could do something clever to parallelize the XML parsing into chunks if we wrote our own XML parser (not something I'm hyper-eager to do - though in reality we support a rather small subset of the XML feature-bloat in ODF files).

Anyhow - there are a very large number of ways to continue to significantly improve ODS import performance - but all of them are expensive in terms of developer time; and (as yet) I don't see significant demand for this from Collabora customers - but no problem with having this ticket open.

As/when someone has credible resources to invest here, and a profile - I'm really happy to help out with some ideas of how we can try to improve things substantially.

Beyond that tweaking the ODF format to use eg. 'R1C1' formulae, and to split up sheets, shared-formulae etc. into more ZIP streams, and (ideally) to save sheets vertically not horizontally would be great (though a standards process is usually extremely slow) - and I don't really see us being able to tweak our core to be able to defer loading of sheets in any sensible horizon: not sure it's even that useful in the modern world.
Comment 8 Telesto 2022-04-14 09:22:23 UTC
(In reply to Michael Meeks from comment #7)
>and (as yet) I don't see significant demand for this from
> Collabora customers - but no problem with having this ticket open.

Well demand is in this case more or less prerequisite :-(

I do ask myself, are SME & Enterprises users working around it by using XLSX or are because the are using XLSX anyhow.. Or is there really no demand..

The performance is surely not good for the adaption of the OpenDocument Format and or/ propagating Open Document Format as serious competitor in SME or Enterprise setups. 

Lovely chicken or egg situation. Is ODS adaption needed first, creating demand for performance improvements. Or needs ODS an improvement before adaption OpenDocument is even realistic. 

And the lovely mixture of ideology (making ODS competitive) and commercial aspects (SME/Enterprise). 

If it would be a tender, people will complain sponsoring SME/Enterprise issues.
However without ODS can't be taken seriously (maybe overstating it a little :-), slowing down adoption.

> Beyond that tweaking the ODF format to use eg. 'R1C1' formulae, and to split
> up sheets, shared-formulae etc. into more ZIP streams, and (ideally) to save
> sheets vertically not horizontally would be great (though a standards
> process is usually extremely slow) - and I don't really see us being able to
> tweak our core to be able to defer loading of sheets in any sensible
> horizon: not sure it's even that useful in the modern world.

I have no clue how the process of the ODF format change works. Who takes the initiative. OASIS on it's own. Or is always needed that some eco-system partner filing a change proposal, writing the drafting with proof of concept/benchmarks etc. (which likely pretty expensive)

I prefer those idea's being somewhere on the todo list of OASIS :-). However it's no use if OASIS acts based on detailed change proposals.
Comment 9 Michael Meeks 2022-04-14 10:19:44 UTC
> I do ask myself, are SME & Enterprises users working around it by
> using XLSX or are because the are using XLSX anyhow.

No idea - I would expect that if a document takes >3 seconds to load, people context switch and do something else while it happens and then tend to keep it open rather than closing it and re-opening it a lot. So not sure there is a huge issue here.

> The performance is surely not good for the adaption ...
> without ODS can't be taken seriously (maybe overstating it a little :-), 

Performance can always be improved - it is a bottomless pit. What hard data do you have to make me think this is a serious issue in the absence of any customer feedback.

The 16k columns issue - we get feedback on all the time - and is now being fixed by Collabora =)

> I have no clue how the process of the ODF format change works.

Very slowly, and with great effort - backed up by lots of technical time spent on proposals etc. the POC for R1C1 referencing is quite easy - and there should be a good impact on file-size too I think.

> I prefer those idea's being somewhere on the todo list of OASIS :-).
> However it's no use if OASIS acts based on detailed change proposals.

OASIS can't do anything without an implementation - ideally two; so its best to do something in LibreOffice first. To be fair - we could flush our ZIP stream compression at a point and store some binary check-points for streams in a separate stream - as a hack/overlay on what we do already that might do it - there are lots of options - but you need several sharp engineers and lots of their time to make progress.
Comment 10 Kevin Suo 2022-08-26 05:54:40 UTC
Should the slowness happen adapting row height? If that is the case, then it is bug 128204.
Comment 11 peterho0218 2023-10-26 17:56:29 UTC
I have reproduced the problem. Let me provide the system information first.

Version: 7.6.2.1 (X86_64) / LibreOffice Community
Build ID: 56f7684011345957bbf33a7ee678afaf4d2ba333
CPU threads: 12; OS: Windows 10.0 Build 19045; UI render: Skia/Vulkan; VCL: win
Locale: zh-HK (zh_HK); UI: zh-TW
Calc: CL threaded

I have also found that some people have discussed the problem in the forum.
https://ask.libreoffice.org/t/slow-loading-first-spredsheet/95202

Someone found that the loading time could be affected by the system default printer setting. When the default printer is offline, the files require a very long time to be loaded. Some solutions have been provided in that page but only 1 of them is working for me. Here is my findings:
1. Changing the default printer to an online printer or 'Microsoft Print to PDF' solves the problem.
2. Unchecking the box 'Load printer settings with the document' in the options doesn't solves the problem.
3. The problem occurs on ODS files but not XLSX files, even when the files are small. On my computer, opening an ODS file is much much slower (>1 munite) than opening a XLSX file (<1 second).
Comment 12 Michael Meeks 2023-10-26 18:31:08 UTC
This is quite a technical issue about file-format problems with ODS that make it somewhat slower as a format. If there is some un-related printer issue - please split that to another ticket to continue the discussion.
Thanks; and CC Regina who might be interested :-)