144521 – FILESAVE: Stop storing empty cells

Bug 144521 - FILESAVE: Stop storing empty cells

Summary: FILESAVE: Stop storing empty cells

Status:	RESOLVED WONTFIX

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Calc (show other bugs)
Version: (earliest affected)	7.2.0.4 release
Hardware:	All All

Importance:	medium enhancement
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2021-09-15 17:33 UTC by robert
Modified:	2024-04-16 04:28 UTC (History)
CC List:	4 users (show)

See Also:	153527 159598
Crash report or crash signature:

Attachments
ODS file showing empty cells (and having lots of other issues) (1.19 MB, application/vnd.oasis.opendocument.spreadsheet) 2021-09-15 20:38 UTC, robert	Details
Original organically grown ODS file (1.27 MB, application/vnd.oasis.opendocument.spreadsheet) 2021-09-28 14:41 UTC, robert	Details
Manually debloated copy of the organically grown ODS file (1.19 MB, application/vnd.oasis.opendocument.spreadsheet) 2021-09-28 14:42 UTC, robert	Details
Result of just saving the manually debloated ODS file (1.27 MB, application/vnd.oasis.opendocument.spreadsheet) 2021-09-28 14:43 UTC, robert	Details
Last rows of sheet "Rides" (45.56 KB, image/png) 2021-09-28 20:19 UTC, Mike Kaganski	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description robert 2021-09-15 17:33:46 UTC

This is not a bug, and it's very well hidden by the fact that ODS files are effectively ZIP files.

When working with a test file, I inserted columns/rows, deleted columns/rows, changed the format of entire columns/rows, and eventually decided to have a look at "content.xml" and where the XML writer does some (pretty minimal) optimisations, for storing repeated cells, but it misses out on plenty cases, where the resulting "content.xml" contains code like:

<table:table-cell table:number-columns-repeated="11"/>
<table:table-cell table:number-columns-repeated="45"/>
<table:table-cell table:number-columns-repeated="724"/>
</table:table-row>

In other words, a row that ends with 780 empty columns, that are still "stored" in the file, and need to be parsed when reading it in. Same happens for rows, with is an order of magnitude worse, as there can be a million of them.

It would/might be nice if the XML writing code could stop this from happening.

Comment 1 Mike Kaganski 2021-09-15 18:04:19 UTC

Please provide specific steps to reproduce, like "create an empty spreadsheet document; select region A1 to B2; apply formatting Bold on toolbar; select rows 2 to 3; use menu Foo to delete rows; ...". Having specific steps to reproduce specific results could enable to see, debug, and possibly fix it. Otherwise, there's nothing that could be done here.

Thanks.

Comment 2 robert 2021-09-15 20:36:49 UTC

I have absolutely no clue how to explicitly reproduce this. As I wrote, I just had a look at a test ODS file and I came across such cases. I'll see if I can recreate something in good time, and "good time" is for now undefined, as I just got pretty bad news from the other side of Europe about my mother.

I have however attached a "real" file, "lift & v100+.ods" that contains truckloads of 

<table:table-cell table:number-columns-repeated="nn"/>

and a fair amount of 

<table:table-row table:style-name="ro3" table:number-rows-repeated="104nnnn">
<table:table-cell table:number-columns-repeated="nn"/>
</table:table-row>

tags, and it's grown organically, starting in the days of OO, sometimes I added temporary columns, and set the format for all rows, and later deleted them again. Sometimes I put multiple tables on a sheet, only to later move them to a new sheet, etc, etc, etc. 

Another of the (many) other "problems" with this(?) particular sheet is that some columns are given the same width (Rides.Q and Rides.R), but that they end up with different widths in "contents.xml" ("co11": 8.895 cm, "co12": 8.901cm), but those are probably better left to another bug report.

Comment 3 robert 2021-09-15 20:38:56 UTC

Created attachment 175040 [details]
ODS file showing empty cells (and having lots of other issues)

Comment 4 robert 2021-09-28 14:40:54 UTC

I've attached three spreadsheets. All of them open and display exactly the same output, but unzip the ODS files, use, like I did, xmllint <http://xmlsoft.org/xmllint.html> to format them, and WinMerge to compare the "content.xml" files, and shiver.

The first file is "original-lift & v100+.ods" 

This is an organically grown spreadsheet. In other words, it's never been updated with anything other than Calc. I think, but I'm not sure any more that it started with a single sheet, and in the course of the past decade or so, it has grown. In many cases I've added tables to existing sheets, only to later move them to their own sheets, I've inserted columns, deleted columns, ditto for rows, used different fonts, updated entire columns with "ctrl-1", updated selected ranges with "ctrl-1", added conditional formats, deleted them, in other words, I've used a fair amount of the functions in Calc.

The second file is "manual-lift & v100+.ods"

This file is the result of having spent too much time in hospital waiting rooms. It's the result of unzipping "original-lift & v100+.ods", and manually editing "content.xml", after inserting "cr/lf" sequences between all "><", and (obviously) deleting them again before zipping the changed code.

Among the manual changes? 

- replacing nearly duplicate column widths with equal ones
- removing unused fonts
- removing "asian" and "complex" attributes
- removing lots of "<table:table-cell table:number-columns-repeated="XX"/>" immediately before "</table:table-row>" tags
- probably, not sure any more, changing times with 59.9999999 seconds, setting the seconds to 00 and adding one minute
- removed the "<table:database-ranges>" section (no clue how it got there, and even less how to "officially" remove it)

all in all removing about one megabyte of "bloat".

Then I opened the file in Calc, and ***without doing anything to it*** saved it as the third attached file, "saved-manual-lift & v100+.ods" gave that the unzip/xmllint treatment, and burst into tears...

Among the tear-inducing changes?

- more than 1,000 "<table:table-cell table:number-columns-repeated="2"/>" redundant entries
- the never used fonts are back
- columns widths are suddenly changed by minute amounts

And numerous others, the WinMerge generated diff files are huge.

I'd like to end this diatribe with 

1) "The fastest code is the code that is never executed" and 
2) <http://www.ncdm.com/bloat/bloat.htm>

Comment 5 robert 2021-09-28 14:41:46 UTC

Created attachment 175323 [details]
Original organically grown ODS file

Comment 6 robert 2021-09-28 14:42:36 UTC

Created attachment 175324 [details]
Manually debloated copy of the organically grown ODS file

Comment 7 robert 2021-09-28 14:43:11 UTC

Created attachment 175325 [details]
Result of just saving the manually debloated ODS file

Comment 8 Mike Kaganski 2021-09-28 15:14:39 UTC

(In reply to robert from comment #4)

TLDR; could you please provide something without unnecessary prose, giving some specific things to look at? Like: open file Foo.xml inside attachment ABC; remove ...; save into a new ODF; open in LO; save; check that the thing re-appeared; why?

Comment 9 robert 2021-09-28 15:24:34 UTC

Unzip all three ODS files, format the main "content.xml" of all three with xmllint, compare the files with WinMerge.

Comment 10 Mike Kaganski 2021-09-28 15:45:02 UTC

(In reply to robert from comment #9)
> Unzip all three ODS files, format the main "content.xml" of all three with
> xmllint, compare the files with WinMerge.

LOL. Compare something like *that* with WinMerge? Not everyone has that much time.
Every report should be reasonable, allowing one to see the problem.

Anyway, likely all that was about something as easy as:

1. Create a new spreadsheet.
2. Put "1" to C2.
3. Put "1" to D4.
4. Save and inspect content.xml, to see that:

a. It has first two rows with all four repeated columns:
> <table:table-row table:style-name="ro1"
                   table:number-rows-repeated="2">
>     <table:table-cell table:number-columns-repeated="4"/>
> </table:table-row>
b. It has second row with two trailing repeated columns:
> <table:table-row table:style-name="ro1">
>     <table:table-cell/>
>     <table:table-cell office:value-type="float"
                        office:value="1"
                        calcext:value-type="float">
>         <text:p>1</text:p>
>     </table:table-cell>
>     <table:table-cell table:number-columns-repeated="2"/>
> </table:table-row>
c. It has the third row without trailing repeated columns.

One might notice that all three rows have four elements each. One may conclude that there is one algorithm, taking the used rectangle, and outputting the rows one by one, merging completely empty rows and columns into a single "repeated" element.

One may imagine that not having those in the file would be problematic. But (1) having more code to avoid that would make more code that can have bugs; (2) it is unclear how robust that would be, given that e.g. ODF tells [1]:

> The behavior of a consumer when a cell is referenced but not declared is implementation-dependent.
which means that there *might* be complications when some cells are not declared in the file - if not for LibreOffice (are we sure in that?), then for some other standard-compliand consumers that have their "implementation-dependent" behavior for such case.

Anyway, I assume this issue in its current form, without anything specific, filled with rant, emotions, subjective opinions, but no manageable stuff, to be INVALID.

I'm not closing it; maybe erAck has something to say?

[1] https://docs.oasis-open.org/office/OpenDocument/v1.3/os/part3-schema/OpenDocument-v1.3-os-part3-schema.html#__RefHeading__1415630_253892949

Comment 11 Mike Kaganski 2021-09-28 15:46:17 UTC

(In reply to Mike Kaganski from comment #10)
> One may imagine that not having those in the file would be problematic.

A thinko: I meant "One may imagine that not having those in the file would be beneficial".

Comment 12 robert 2021-09-28 16:05:48 UTC

"... there *might* be complications when some cells are not declared..."

Such as here, in content.xml of the "original-lift & v100+.ods" original file? PMABIHTP!

<table:table-row table:style-name="ro1" table:number-rows-repeated="142">
<table:table-cell table:number-columns-repeated="5"/>
<table:table-cell table:style-name="ce12"/>
<table:table-cell table:number-columns-repeated="6"/>
<table:table-cell table:style-name="ce9"/>
<table:table-cell table:number-columns-repeated="2"/>
<table:table-cell table:style-name="ce153"/>
<table:table-cell table:style-name="ce11"/>
<table:table-cell table:number-columns-repeated="47"/>
</table:table-row>
<table:table-row table:style-name="ro1" table:number-rows-repeated="1043575">
<table:table-cell table:number-columns-repeated="64"/>
</table:table-row>
<table:table-row table:style-name="ro2" table:number-rows-repeated="11">
<table:table-cell table:number-columns-repeated="64"/>
</table:table-row>
<table:table-row table:style-name="ro2">
<table:table-cell table:number-columns-repeated="64"/>
</table:table-row>

I think a repetition of more than a million empty rows might very well cause problems for other software!

Comment 13 Mike Kaganski 2021-09-28 16:14:44 UTC

(In reply to robert from comment #12)
> I think a repetition of more than a million empty rows might very well cause
> problems for other software!

There was exactly *no* repetition, the data was defined in non-redundant way, providing the reference for different styles (referenced by table:style-name attributes) only as needed.

If a user has created elements having different styles in the very last rows of the sheet, it's not the software's task to discard that data.

Comment 14 robert 2021-09-28 19:15:34 UTC

Have you actually read <http://www.ncdm.com/bloat/bloat.htm>, especially the penultimate line: 

Bloat is not a technical issue, but verily a way of thinking, a "state of mind". Its cure is a simple refusal to accept, and a well directed, resounding "clean up your act and clean up your code!"

It's not just storing empty cells, or "If a user has created elements having different styles in the very last rows of the sheet, it's not the software's task to discard that data." The user hasn't created them, the user, in casu me, has deleted 1 million rows and it's Calc that fills them with "standard" styles. 

Or why do I get "asian" or "complex" styles? 

Or fonts that are never used?

"state of mind"...

Comment 15 Mike Kaganski 2021-09-28 20:19:24 UTC

Created attachment 175330 [details]
Last rows of sheet "Rides"

(In reply to robert from comment #14)
> Have you actually read <http://www.ncdm.com/bloat/bloat.htm>, especially the
> penultimate line: 
> 
> Bloat is not a technical issue, but verily a way of thinking, a "state of
> mind". Its cure is a simple refusal to accept, and a well directed,
> resounding "clean up your act and clean up your code!"

Please keep the issue clean of unrelated subjective stuff.

> It's not just storing empty cells, or "If a user has created elements having
> different styles in the very last rows of the sheet, it's not the software's
> task to discard that data." The user hasn't created them, the user, in casu
> me, has deleted 1 million rows and it's Calc that fills them with "standard"
> styles. 

LOL. Have you tried to look at *your* file? Let me show it - on the attached screenshot. Look at the heights of the rows 1048565 to 1048576 (last 12 rows) vs the rows above? *This* is what *you* did (and you *definitely*) had set the height manually, at least for the million+ rows 1048564 and above. How that happened is what only you can tell.

> Or why do I get "asian" or "complex" styles? 

Because every style includes full definition that allows you to type *any* character into the cell having this style; and for ODF, that means that to type an Asian character, LibreOffice would need to use a different font. That you didn't enable display of those settings doesn't mean that LibreOffice should save incomplete styles, making *your* file behave differently in two your friends both typing Chinese characters, simply because they would use some different defaults because your file didn't explicitly specify what to use for Asian text.

This is prefectly OK.

> Or fonts that are never used?

A style that is defined in your file is the *content* of the file, even if you didn't apply it to anything.

Comment 16 Eike Rathke 2021-09-29 11:55:07 UTC

So the document has a style applied to a couple of last rows that is different from all the preceding rows. It is expected that this is stored in file.
If a reading software can't cope with attribution of that many rows it's free to omit them and possibly inform the user. But certainly the generating/writing software should not suppress that information.

Comment 17 robert 2021-09-29 14:30:33 UTC

Both the replies of the two LO developers and their attitude are like those of Paul Eggert: 

https://www.theregister.com/2021/09/28/time_zone_database_controversy/

"We know best, now go away!" 

For what it's worth, Ctrl-End goes to the last cell in the rectangle of the used cells of a sheet. Nothing outside that rectangle to the right or bottom needs to be saved, because there is nothing to save!

PMABIHTP...

Comment 18 Mike Kaganski 2021-09-29 15:00:35 UTC

(In reply to robert from comment #17)
> PMABIHTP...

I have no idea what this might mean. But I clearly see how robert acts as if they surely know better, and anything that contradicts their opinion is indeed "We know best, now go away!". *This* is what called arrogant behaviour, when users don't tolerate any facts / opinions that don't align with their views.

> Nothing outside that rectangle to the right or bottom needs to be saved, because there is nothing to save!

Anyone who would prepare and save a template (say, a table for a year, to fill in e.g. fitness results for the following year, with pre-formatted number formats, row sizes, conditional formatting, etc.) would be very glad to hear that.

Comment 19 Goldtree Millers 2023-09-06 09:53:53 UTC Comment hidden (spam)

The Gold Tree Millers strongly believe in the strength, purity, and freshness of nature. Our business is a diversified, expansion-focused agribusiness. Our main businesses are selling grains, processing flour, and exporting rice. based in Lahore, Pakistan. We are a group of young experts dedicated to delivering to our clients foods of the highest calibre. We seek to strengthen local communities and broaden people's access to and involvement with real foods. one and only We provide Multigrain Flour with 16 Grains in Pakistan.

Comment 20 Stéphane Guillou (stragu) 2023-11-07 22:45:44 UTC

Given the analysis and the reference to the ODF specs given by Mike, and the opinions of Mike and Eike on the suggestion, I understand the current behaviour to be justified and I am closing as "won't fix".

Robert, please feel free to open a new report if you can suggest file format optimisation, but please be concise and refrain from using language that only you understand.

Thanks everyone.