Bug 163789 - Excessive/gratuitous information saved in trivial FODT
Summary: Excessive/gratuitous information saved in trivial FODT
Status: UNCONFIRMED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 25.2.0.0 alpha0+ (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: ODF-Flat
Reported: 2024-11-06 18:36 UTC by Eyal Rozenberg
Modified: 2024-11-06 22:26 UTC (History)
0 users

See Also:
Crash report or crash signature:


Attachments
"hello world" FODT file (29.30 KB, application/vnd.oasis.opendocument.text-flat-xml)
2024-11-06 18:36 UTC, Eyal Rozenberg
Description Eyal Rozenberg 2024-11-06 18:36:50 UTC
Created attachment 197457
"hello world" FODT file

Suppose you create a new Writer document, write down "hello world", and save the document as an FODT. How long should this document be?

Well, as of (a nightly of) version 25.2, it is 302 lines, and 30 KB. Yikes!

There should be a lot less content in such a file.
Comment 1 Mike Kaganski 2024-11-06 19:01:26 UTC
Why?
It must have all the info needed to show the same document the same way regardless of the ODF reader and its defaults; and it must have all the metadata (including styles). What does a user's idea of how long it should be internally have to do with the real multitude of possible applications?
Comment 2 Eyal Rozenberg 2024-11-06 20:39:25 UTC
So, what do we have in there?

1. An opening XML header with 42 attributes, 40 of which are xmlns values. The other two are "office:version" and "office:mimetype".
2. A 30-line <config:config-item-set config:name="ooo:view-settings"> element
3. A ~115-line <config:config-item-set config:name="ooo:configuration-settings"> element.
4. A 4-line <scripts> element, linking to an ooo:libraries-wrapped URL.
5. A 9-line <font-face-decl> element, which declares 7 fonts, while only one is used
6. A ~105-line <styles> element, containing 17 distinct styles, including a graphics style, outline numbering, footnote and endnote configuration, and a color scheme that's unused.
7. A 13-line <body> element, mostly containing a 7-line <text:sequence-decls> element with unused sequences. 
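
A breakdown like the one above can be reproduced mechanically. The following is only a minimal sketch, assuming the file is well-formed XML whose sections are direct children of the root element; the function name `section_sizes` and the tiny `SAMPLE` document are made up for illustration and are not taken from the attachment. Line counts are approximate, since re-serializing an element collapses a start tag that was spread over several lines.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a saved FODT; the real attachment is far larger.
SAMPLE = b"""<office:document
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
 <office:styles>
  <style:style style:name="Standard" style:family="paragraph"/>
 </office:styles>
 <office:body>
  <office:text>
   <text:p>hello world</text:p>
  </office:text>
 </office:body>
</office:document>"""


def section_sizes(fodt_bytes: bytes) -> dict[str, int]:
    """Map each top-level section (by local tag name) to its line count."""
    root = ET.fromstring(fodt_bytes)
    sizes: dict[str, int] = {}
    for child in root:
        # Tags look like "{namespace-uri}local-name"; keep the local part.
        local_name = child.tag.rsplit("}", 1)[-1]
        # strip() drops the whitespace that follows the element in the file.
        serialized = ET.tostring(child, encoding="unicode").strip()
        sizes[local_name] = sizes.get(local_name, 0) + len(serialized.splitlines())
    return sizes


if __name__ == "__main__":
    for name, lines in section_sizes(SAMPLE).items():
        print(f"{name}: {lines} lines")
```

For a real file, the bytes could be read with `open(path, "rb").read()` and passed to the same function.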


(In reply to Mike Kaganski from comment #1)
> Why?

For several reasons:

1. Much of that is not necessary to reproduce the document as created and entered.
2. Some of that has no effect even in principle and is just redundant
3. The user may not intend for all of this information to be embedded into the document.

Now, reason (3.) is only true sometimes / for some users. In fact, whether to include information which is unnecessary, but which, when used, makes further edits to the document feel more like they would have on the original author's machine, is another matter about which users might have a preference. Pure redundancies are a separate matter, and I believe we also see a few of those.

> it must have all the metadata (including styles).

Why must it, for example, have meta-data that's not used, e.g. styles that aren't used?
Comment 3 Eyal Rozenberg 2024-11-06 20:54:52 UTC
(In reply to Mike Kaganski from comment #1)
> Why?

And there is another semi-related reason: The use of FODT for LibreOffice QA. I would like to generate documents with just enough stuff in them to reproduce a bug.
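
To make "just enough stuff" concrete, here is a sketch of what a hand-minimized "hello world" FODT might look like, wrapped in a small Python check that it is well-formed. This is an illustration under assumptions, not output from LibreOffice: the `office:version` value "1.3" is assumed, the names `MINIMAL_FODT` and `paragraphs` are hypothetical, and whether every ODF consumer renders such a file identically is exactly the point under dispute in this thread.

```python
import xml.etree.ElementTree as ET

# Hand-written sketch of a stripped-down flat ODT; office:version is an assumption.
MINIMAL_FODT = """<?xml version="1.0" encoding="UTF-8"?>
<office:document
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    office:version="1.3"
    office:mimetype="application/vnd.oasis.opendocument.text">
 <office:body>
  <office:text>
   <text:p>hello world</text:p>
  </office:text>
 </office:body>
</office:document>
"""

TEXT_NS = {"text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0"}


def paragraphs(fodt_bytes: bytes) -> list[str]:
    """Return the plain text of every text:p element in the document."""
    root = ET.fromstring(fodt_bytes)
    return ["".join(p.itertext()) for p in root.iterfind(".//text:p", TEXT_NS)]


if __name__ == "__main__":
    data = MINIMAL_FODT.encode("utf-8")
    print(len(data), "bytes;", paragraphs(data))
```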
Comment 4 Mike Kaganski 2024-11-06 21:44:33 UTC
(In reply to Eyal Rozenberg from comment #2)
> So, what do we have in there?
> 
> 1. An opening XML header with 42 attributes, 40 of which are xmlns values.
> The other two are "office:version" and "office:mimetype".

And what is the problem with these 40? In fact, this is one of the few items (there are two more below) that really could be minimized, because their set is basically fixed to simplify the code; but how do they do harm *in reality*? Note that I ask this as, possibly, one of a *couple* of people who really *do* clean up FODF files when preparing minimized bug docs and unit tests; so I really could benefit from that, but the benefit would never outweigh the complexity of the code.

> 2. A 30-line <config:config-item-set config:name="ooo:view-settings"> element
> 3. A ~115-line <config:config-item-set
> config:name="ooo:configuration-settings"> element.

And they must be: these settings are what defines the ... settings; including compatibility, view, etc. Some of them are controllable using settings (and may not be exported).

> 4. A 4-line <scripts> element, linking to an ooo:libraries-wrapped URL.

Again: this is a boilerplate for simplicity. Could be stripped, at the expense of code complexity. Doesn't hurt.

> 5. A 9-line <font-face-decl> element, which declares 7 fonts, while only one
> is used

No - please stop this useless argument. As long as the font is in a style, it is used. Period.

> 6. ~105-line <styles> element, containing 17 distinct styles, including a
> graphics style, outline numbering, footnote and endnote configuration, a
> color scheme that's unused, 

... And styles **are** the document content. No matter what you think.

> 7. A 13-line <body> element, mostly containing a 7-line
> <text:sequence-decls> element with unused sequences. 

The last boilerplate for simplicity. Could be stripped, at the expense of code complexity. Doesn't hurt.

> (In reply to Mike Kaganski from comment #1)
> > Why?
> 
> For several reasons:
> 
> 1. Much of that is not necessary to reproduce the document as created and
> entered.

Wrong - as said, not much. The three elements I marked as "could be stripped" constitute very little.

> 2. Some of that has no effect even in principle and is just redundant

No.

> 3. The user may not intend for all of this information to be embedded into
> the document.

Then they shouldn't use WYSIWYG tools.

In general, the repeated "I don't like much information that the user is not expected to look at" is tiresome and useless. It is a complex document format, created by complex software, which has much more interesting things to do than playing "let's make it eye-pleasing internal code to please one geek's eye" games. It's not even "if someone volunteers to do, then please" - for the most part, it is simply no-go. Remove a style, and you break someone's workflow. Remove a setting, and you break the document behavior / look.

I simply close this useless issue. I know that you like to reopen, and try to force your PoV no matter what. Of course, I won't engage in that again - if you do, it's your playground.
Comment 5 Eyal Rozenberg 2024-11-06 22:26:17 UTC
(In reply to Mike Kaganski from comment #4)
> And what is the problem in these 40? In fact, these are one of the few (two
> more down the road) that really could be minimized - because their set is
> basically fixed, to simplify the code; but how do they harm *in reality*?

They're redundant and gratuitous, at least to some extent. For example:

xmlns:dr3d="urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0"

when my document doesn't use 3-dimensional shapes. And many more.

> Note that I ask this, and - possibly - one of a *couple* people who really
> *does* clean up FODF files when preparing minimized bug docs and unit tests;
> so - I really could benefit from that - but it would never outweigh the
> complexity of the code.

Let's ignore for the moment the prospect of skipping namespaces which can be inferred using the office namespace, ODF version, MIME type and some ODF defaults, and just focus on unused namespaces.

Where's the great complexity in using the information which I'm sure we already collect, regarding whether or not a document has 3D drawings, or math formulae, or form controls etc? If we have any, we include the namespace in the office:document element attributes, and if we don't - then we don't.
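
As an illustration of the check being described (done here as external post-processing of a saved file, not as LibreOffice's export code), the sketch below lists declared namespace prefixes that no element or attribute name uses. The function name `unused_prefixes` and the `SAMPLE` document are hypothetical. One known limitation: prefixes that appear only inside attribute values, such as "ooo:view-settings", are not detected as used, so the result is a list of candidates rather than a safe-to-delete list.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: dr3d is declared but never used by any element or attribute.
SAMPLE = b"""<office:document
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    xmlns:dr3d="urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0"
    office:version="1.3">
 <office:body>
  <office:text>
   <text:p>hello world</text:p>
  </office:text>
 </office:body>
</office:document>"""


def unused_prefixes(fodt_bytes: bytes) -> set[str]:
    """Return declared prefixes whose namespace no element/attribute name uses."""
    declared: dict[str, str] = {}  # prefix -> namespace URI
    used_uris: set[str] = set()
    events = ("start-ns", "start")
    for event, payload in ET.iterparse(io.BytesIO(fodt_bytes), events=events):
        if event == "start-ns":
            prefix, uri = payload
            declared[prefix] = uri
        else:
            # Qualified names are expanded to "{uri}local"; attribute names
            # are available as soon as the start tag has been parsed.
            for name in (payload.tag, *payload.attrib):
                if name.startswith("{"):
                    used_uris.add(name[1:].split("}", 1)[0])
    # Caveat: prefixes used only inside attribute VALUES are not counted.
    return {prefix for prefix, uri in declared.items() if uri not in used_uris}


if __name__ == "__main__":
    print(sorted(unused_prefixes(SAMPLE)))
```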


> And they must be: these settings are what defines the ... settings;
> including compatibility, view, etc. Some of them are controllable using
> settings (and may be not exported).

"Must"? No, it's just a choice, as long as they don't affect the content.

For example, the view information. That might even be a privacy concern. Suppose I save a contract; I don't want people to determine which part of it I was last looking at; and at least I want to be asked about this. Or - take printing-related preferences; who said I want other people reading or editing my document to have my printing preferences? ... generally, and particularly my printer name?!
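
For anyone wanting to see what a given file actually stores, the settings can be listed from outside LibreOffice. This is a rough sketch: the function name `stored_settings` is hypothetical, and the sample values (including the printer name "ExamplePrinter") are invented for illustration. Real files nest some view settings in indexed maps where item names can repeat, in which case this flat dictionary keeps only the last value seen.

```python
import xml.etree.ElementTree as ET

CONFIG_NS = "urn:oasis:names:tc:opendocument:xmlns:config:1.0"

# Hypothetical sample; the item values are invented for illustration.
SAMPLE = b"""<office:document
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:config="urn:oasis:names:tc:opendocument:xmlns:config:1.0">
 <office:settings>
  <config:config-item-set config:name="ooo:view-settings">
   <config:config-item config:name="ViewAreaTop" config:type="long">0</config:config-item>
  </config:config-item-set>
  <config:config-item-set config:name="ooo:configuration-settings">
   <config:config-item config:name="PrinterName" config:type="string">ExamplePrinter</config:config-item>
  </config:config-item-set>
 </office:settings>
</office:document>"""


def stored_settings(fodt_bytes: bytes) -> dict[str, str]:
    """Return {setting name: stored text} for every config:config-item."""
    root = ET.fromstring(fodt_bytes)
    item_tag = f"{{{CONFIG_NS}}}config-item"
    name_attr = f"{{{CONFIG_NS}}}name"
    # Repeated names (possible in real files) overwrite earlier entries.
    return {item.get(name_attr): (item.text or "") for item in root.iter(item_tag)}


if __name__ == "__main__":
    for name, value in stored_settings(SAMPLE).items():
        print(f"{name} = {value!r}")
```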


> Again: this is a boilerplate for simplicity. Could be stripped, at the
> expense of code complexity. Doesn't hurt.

Let users have at least a toggle somewhere to not add boilerplate to their files.


> No - please stop this useless argument. As long as the font is in a style,
> it is used. Period.

1. There's duplication of fonts, resulting in dummies.
2. But the file is full of styles which _aren't_ used.

> > 6. ~105-line <styles> element, containing 17 distinct styles, including a
> > graphics style, outline numbering, footnote and endnote configuration, a
> > color scheme that's unused, 
> 
> ... And styles **are** the document content. No matter what you think.

1. I didn't create nor use any of these styles.
2. Some of them are latent styles which aren't even accessible in the UI, e.g.:

   <text:p text:style-name="P1">hello world</text:p>

and style P1, which really doesn't exist as far as the user is concerned, is:

  <style:style style:name="P1" style:family="paragraph" style:parent-style-name="Standard">

and style "Standard", which also doesn't exist, is:

 <style:style style:name="Standard" style:family="paragraph" style:class="text"/>

... and it's only then that we get to the Default Paragraph Style. Now that - we definitely should have in the document. But why the other stuff?
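
The chain of named styles quoted above can be followed programmatically. The sketch below is an assumption-laden illustration: `parent_chain` is a hypothetical helper, the `SAMPLE` document is reduced to the quoted snippets (with P1 written as an empty element), and style names are treated as globally unique although ODF only makes them unique per style family. It follows `style:parent-style-name` links between `style:style` elements only; `style:default-style` elements carry no name and so are not part of the chain it reports.

```python
import xml.etree.ElementTree as ET

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
NAME = f"{{{STYLE_NS}}}name"
PARENT = f"{{{STYLE_NS}}}parent-style-name"

# Hypothetical sample built from the snippets quoted in this comment.
SAMPLE = b"""<office:document
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
 <office:styles>
  <style:style style:name="Standard" style:family="paragraph" style:class="text"/>
 </office:styles>
 <office:automatic-styles>
  <style:style style:name="P1" style:family="paragraph" style:parent-style-name="Standard"/>
 </office:automatic-styles>
 <office:body>
  <office:text>
   <text:p text:style-name="P1">hello world</text:p>
  </office:text>
 </office:body>
</office:document>"""


def parent_chain(root: ET.Element, style_name: str) -> list[str]:
    """Follow style:parent-style-name links starting from style_name."""
    by_name = {s.get(NAME): s for s in root.iter(f"{{{STYLE_NS}}}style")}
    chain: list[str] = []
    current = style_name
    # Stop at an undefined style, a missing parent, or a cycle.
    while current is not None and current in by_name and current not in chain:
        chain.append(current)
        current = by_name[current].get(PARENT)
    return chain


if __name__ == "__main__":
    print(" -> ".join(parent_chain(ET.fromstring(SAMPLE), "P1")))
```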

> No.

A useful argument :-)

> > 3. The user may not intend for all of this information to be embedded into
> > the document.
> 
> Then they shouldn't use WYSIWYG tools.

LibreOffice is really not WYSIWYG. In fact, you're making my argument for me, since, if it were WYSIWYG, I should at least not get stuff I don't and can't see.

Moreover - there's no reason WYSIWYG tools should not be able to produce terse and more easily-readable output. Especially for textual formats like flat ODT.

> which has much more interesting
> things to do, than playing "let's make it eye-pleasing internal code to
> please one geek's eye" games.

Terse files with only relevant content also make parsing and automated processing easier.

Also, "more interesting things to do" just means low priority/severity.


> It's not even "if someone volunteers to do,
> then please" - for the most part, it is simply no-go. Remove a style, and
> you break someone's workflow. Remove a setting, and you break the document
> behavior / look.

1. Fortunately, that's not the case for much of what can be removed.
2. We can offer an option for more parsimonious saving/export, so that nobody's theoretical workflows are broken.

> I simply close this useless issue. I know that you like to reopen, and try
> to force your PoV no matter what.

To close a bug against the reporter's opinion, you need to make a convincing argument for why it merits closure. If you've not convinced the reporter, and they are not some troll/spammer, then you should establish there is wider agreement with your position.

For design bugs, this is "adjudicated", to some extent, by discussions in design meetings. For import/export issues I don't know what the custom is, but "A developer has a different perspective on the matter" cannot be sufficient when the bug report is about outward-facing behavior.