Bug 99015 - FILESAVE: LO XML grows so complex it's not human-editable
Summary: FILESAVE: LO XML grows so complex it's not human-editable
Status: RESOLVED DUPLICATE of bug 86988
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 5.1.1.3 release (earliest affected)
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium major
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-04-01 07:00 UTC by Luke Kendall
Modified: 2017-05-26 11:50 UTC
CC List: 6 users

See Also:
Crash report or crash signature:


Attachments
A sample short document showing the XML fragmentation developing (18.29 KB, application/vnd.oasis.opendocument.text)
2016-04-26 09:11 UTC, Luke Kendall
Description Luke Kendall 2016-04-01 07:00:15 UTC
https://bugs.documentfoundation.org/show_bug.cgi?id=90540 reports horribly complex HTML (or XHTML) produced by LibreOffice.  Every version released since I reported the problem has been basically as bad as the previous one.  The HTML produced, and indeed the XML of the .odt file itself if you unzip it and examine it, is horribly complex.

Note that the person investigating could not reproduce the complex HTML.  Could someone please investigate further?  Perhaps it is not every single edit operation that generates fresh spans and fresh paragraph styles, but something is certainly causing "text fragmentation" reminiscent of the horrible disk fragmentation Windows was famous for (unlike Linux's clever anti-fragmentation logic).

I'm confident that if you edit any LO document for a while and then look at the XML used for its save format, it will be far, far more complex than it needs to be.

It appears that every(?) edit breaks text into new spans.  Many of these new spans are assigned a specially-created new paragraph style that's redundant (identical to the original). So in a long document you end up with thousands of paragraph styles, even though there may be genuinely only a handful of different styles. 
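One way to check this claim on a given file is a small script along the following lines. It is only a sketch under some assumptions: that the document body lives in content.xml inside the .odt zip, and that two styles count as redundant when they agree on everything except style:name. The helper names (read_content_xml, style_signature, find_redundant_styles) are made up for illustration and are not part of LO.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import defaultdict

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
STYLE_TAG = f"{{{STYLE_NS}}}style"
NAME_ATTR = f"{{{STYLE_NS}}}name"


def read_content_xml(odt_path):
    """Return the raw content.xml bytes from an .odt (which is a zip archive)."""
    with zipfile.ZipFile(odt_path) as archive:
        return archive.read("content.xml")


def style_signature(style):
    """Everything that defines a style except its name, in a hashable form."""
    own = tuple(sorted((k, v) for k, v in style.attrib.items() if k != NAME_ATTR))
    children = tuple(
        (child.tag, tuple(sorted(child.attrib.items()))) for child in style
    )
    return own, children


def find_redundant_styles(content_xml):
    """Group style names whose definitions are identical apart from the name."""
    root = ET.fromstring(content_xml)
    groups = defaultdict(list)
    for style in root.iter(STYLE_TAG):
        groups[style_signature(style)].append(style.get(NAME_ATTR))
    # Keep only groups with more than one member: those are the duplicates.
    return [names for names in groups.values() if len(names) > 1]
```

Calling find_redundant_styles(read_content_xml("some.odt")) on a real file (the file name here is a placeholder) would list each set of style names that could in principle collapse into one.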

The results of this are that:

- If you wish to go into the XML file to work around some LO bug (e.g. to avoid https://bugs.documentfoundation.org/show_bug.cgi?id=62603#add_comment), generally speaking it's not feasible: the text is so broken up into separate text spans, with paragraph styles defined elsewhere, that you can't do any useful regexp searches or fixes.

- The files are much bigger than they need to be, wasting bandwidth when transmitted (especially if LO was used to generate HTML pages) as well as local storage space.

- I believe this would also cause a performance hit within LO, since instead of having, say, a single paragraph with all the text upon which you're going to do some operation, and knowing it's genuinely a single style, the text may be broken up into dozens of separate spans which must be iterated over, no doubt often needing to check whether the style has changed (when usually it will not have).

- It confuses many other systems that expect relatively simple XML, HTML, or XHTML as input - especially since the auto-generated redundant paragraph styles used in the spans are defined elsewhere in the document, so if you paste just a selection of text, strange things can happen.

So please look into this matter.  I think there will be a lot of flow-on benefits from addressing it, even if it is just to introduce a new edit operation called "Defragment" (or "optimise" or something else that's less embarrassing than "Defragment").  Best of all, IMHO, would be if LO used similar smarts to the ext2 and later filesystems to avoid breaking units of data into smaller pieces unnecessarily.  E.g., one useful step would be not to create a new text style until it's needed: keep using the current style until some style change is actually made.  Better still would be not to split the text into a fresh span unless the style is changed.
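To make the "Defragment" idea concrete, here is a rough Python sketch of what such a pass might do to the saved XML. It works on a parsed paragraph element rather than inside LO, and it only merges neighbouring text:span elements that carry exactly the same attributes, have no text between them, and contain no child elements. The function name merge_adjacent_spans is hypothetical, and I don't know how LO represents spans internally, so treat this as an illustration of the intent only.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
SPAN = f"{{{TEXT_NS}}}span"


def merge_adjacent_spans(parent):
    """Merge runs of sibling text:span elements with identical attributes.

    Returns the number of spans that were folded into their predecessor.
    """
    removed = 0
    i = 0
    while i + 1 < len(parent):
        first, second = parent[i], parent[i + 1]
        mergeable = (
            first.tag == SPAN
            and second.tag == SPAN
            and first.attrib == second.attrib  # same style, nothing extra
            and not first.tail                 # no text between the spans
            and len(first) == 0                # simple leaf spans only
            and len(second) == 0
        )
        if mergeable:
            first.text = (first.text or "") + (second.text or "")
            first.tail = second.tail
            parent.remove(second)
            removed += 1
            # Stay on the same index: the merged span may absorb the next one.
        else:
            i += 1
    return removed
```

Spans that differ only in the name of an otherwise identical style are left alone by this sketch, so it would help most after redundant styles have been collapsed to a single name.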

Please, I beg someone, look into this!
Comment 1 How can I remove my account? 2016-04-12 11:53:59 UTC
Good one, you already had us for a moment!
Comment 2 Buovjaga 2016-04-12 11:58:56 UTC
If producing HTML via LibreOffice, we can't expect human-editable output. Closing as WONTFIX.
Comment 3 Luke Kendall 2016-04-12 12:18:36 UTC
(In reply to Buovjaga from comment #2)
> If producing HTML via LibreOffice, we can't expect human-editable output.
> Closing as WONTFIX.

I'm sorry to see you focused on just the human editability problems caused by the behavior, in making your decision.

I also think it's a wrong and short-sighted decision.

Perhaps, though, if someone addresses the more fundamental issue of the same "fragmentation" of text spans within LO's native XML save file format, the problem will go away.
Comment 4 Luke Kendall 2016-04-12 12:21:38 UTC
Oh: I just realised this *is* the bug report for LO's XML file format.

The HTML generation is a mere side issue.

Please reconsider.  If it would help, I can re-submit this bug after removing any mention of HTML and just focusing on the XML format?
Comment 5 Luke Kendall 2016-04-12 12:23:23 UTC
(In reply to Tor Lillqvist from comment #1)
> Good one, you already had us for a moment!

BTW, what did the above comment mean?  It sounds like you are treating this report as if it's a joke.  Have I misunderstood your comment, Tor?
Comment 6 How can I remove my account? 2016-04-12 12:27:21 UTC
Given that the bug was filed on April 1, and its contents, it was an understandable misunderstanding.
Comment 7 Luke Kendall 2016-04-12 12:35:39 UTC
(In reply to Tor Lillqvist from comment #6)
> Given that the bug was filed on April 1, and its contents, it was an
> understandable misunderstanding.

Oh!  I genuinely hadn't noticed.

Well, yes, I can see the possibility.  Thanks for clearing that up.  But I do think this problem is quite a serious one, in its flow-on effects.  In some ways, it makes the LO XML as unmanipulatable as a binary file format would be.

Regards,
luke
Comment 8 How can I remove my account? 2016-04-12 12:48:21 UTC
Yes, a more specific bug report with a minimal annotated example would be great. Of course, we cannot promise that anybody will actually work on the issue (developers work on what they are interested in, or what their paying customers want them to work on), or even see it as something that needs fixing, but it would help your case.
Comment 9 Luke Kendall 2016-04-12 13:06:40 UTC
Okay, will do.  I have some deadlines, so it probably won't be till this weekend, though.  Thanks for the suggestion, Tor.
Comment 10 Luke Kendall 2016-04-26 09:11:11 UTC
Created attachment 124640
A sample short document showing the XML fragmentation developing

The document is only a few pages, and the text of the document itself describes my experiments, which pretty quickly revealed what seems to be causing the problem: style changes (Ctrl-I).

I hope you don't mind that I've also made some simple suggestions which might be helpful in approaches to preventing the XML fragmentation.  Obviously I'm unaware of the internals though, so my suggestions may well be naive.

Anyway, I hope this example file is helpful.

If you can solve the problem, it will make the XML files produced by LO much more amenable to manipulation with traditional Unix tools, not to mention make the XML files smaller (especially if you avoid breaking spans to change from one style to an identical style, just because the text was once edited at that point).
Comment 11 Regina Henschel 2017-05-26 11:50:07 UTC
The cause is the RSIDs. See also bug 81420 and bug 86988.
To disable creating RSIDs go to Tools > Options > Writer > Comparison and disable saving random numbers.
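For files that were already saved with RSIDs, a rough post-processing sketch in Python could strip them from the parsed content.xml. This rests on an assumption: that the RSIDs appear as attributes whose local names end in "rsid" (for example officeooo:rsid and officeooo:paragraph-rsid). The helper strip_rsid_attributes is hypothetical and not something LibreOffice provides.

```python
import xml.etree.ElementTree as ET


def strip_rsid_attributes(root):
    """Delete every attribute whose local name ends in 'rsid'.

    Assumption: RSIDs are stored as attributes such as officeooo:rsid and
    officeooo:paragraph-rsid. Returns the number of attributes removed.
    """
    removed = 0
    for element in root.iter():
        for key in list(element.attrib):
            # ElementTree keys look like '{namespace}local'; keep the local part.
            local_name = key.rsplit("}", 1)[-1]
            if local_name.endswith("rsid"):
                del element.attrib[key]
                removed += 1
    return removed
```

After stripping, styles that differed only in their RSID values end up with identical properties, which is what makes them candidates for merging.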

*** This bug has been marked as a duplicate of bug 86988 ***