Bug 136434 - FORMATTING: redundancy in content.xml
Summary: FORMATTING: redundancy in content.xml
Status: UNCONFIRMED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): Inherited From OOo
Hardware: All All
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Duplicates: 146052
Depends on:
Blocks:
 
Reported: 2020-09-03 12:43 UTC by Christian Lehmann
Modified: 2024-01-22 03:12 UTC (History)
CC List: 8 users

See Also:
Crash report or crash signature:


Attachments
odt document that contains redundant markup (10.76 KB, application/vnd.oasis.opendocument.text)
2020-09-03 12:44 UTC, Christian Lehmann

Description Christian Lehmann 2020-09-03 12:43:06 UTC
Description:
In Bug 136409, I had remarked on redundant markup in content.xml which blows up ODT files. The attached file ‘redundancy_in_xml.odt’ illustrates some of the reasons for this:

File ‘content.xml’
The section “<office:font-face-decls>” contains 9 font faces only one of which is used in this document.
In the section ‘<office:automatic-styles>’, several text styles have the same definition, differing only in the value of the attribute ‘officeooo:rsid’; but this is devoid of useful effects. In this sense, the following styles are the same:
“T1” - “T6”
“T7” - “T12”
“T13” and “T14”.
Styles T4 – T6 contain specifications of fonts which are never used.
The section ‘<text:sequence-decls>’ contains five declarations none of which is used in the text.
The section ‘<office:text>’ contains 13 occurrences of the tag ‘<text:span text:style-name="Kommentarzeichen">’. A style of this name is also listed among the “Applied Styles” in the panel “Character Styles”. It is, in fact, not applied in the document. It has apparently been taken over from the source from which this text line was copied.

Steps to Reproduce:
Unpack the attached odt file and examine the content.xml.
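
A minimal sketch of this step in Python, assuming only that an ODT file is a ZIP archive with a member named content.xml. The function name `summarize_content_xml` is made up for illustration; it reports the size of content.xml and how often each element occurs, which gives a first impression of where the markup goes.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_content_xml(odt_source):
    """Return (size in bytes, Counter of element local names) for content.xml.

    odt_source may be a path or a binary file-like object.
    """
    # An ODT file is a ZIP archive; content.xml holds the document body
    # and the automatic styles.
    with zipfile.ZipFile(odt_source) as archive:
        raw = archive.read("content.xml")
    root = ET.fromstring(raw)
    # ElementTree reports tags as "{namespace-uri}local-name";
    # keep only the local name for a readable summary.
    tags = Counter(elem.tag.rsplit("}", 1)[-1] for elem in root.iter())
    return len(raw), tags
```

Calling `summarize_content_xml("redundancy_in_xml.odt")` and printing `tags.most_common(10)` would list the most frequent elements. The counts say nothing about whether an element is redundant; they only show what to look at.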

Actual Results:
The file contains markup not used in the document.

Expected Results:
The file should be clean.


Reproducible: Always


User Profile Reset: No



Additional Info:
The attached file was copied out from the larger file submitted as mentioned in Bug 136409. That is the source of some (though not all) of the redundant markup information. If copying such useless information cannot be avoided while the document is being edited, then LO should offer a function (> Tools > Purge) which streamlines the xml files of a stored odt file. Redundancy and overweight lead to errors and crashes.
Comment 1 Christian Lehmann 2020-09-03 12:44:02 UTC
Created attachment 165087
odt document that contains redundant markup
Comment 2 Telesto 2020-09-03 19:08:14 UTC
Two topics here
1. Redundant markup in content.xml being imported somehow (copy/paste)
2. LO should offer a function (> Tools > Purge) which streamlines the xml files of a stored odt file.
Comment 3 Regina Henschel 2020-09-20 12:14:36 UTC
To disable the creation of 'officeooo:rsid': Tools > Options > Writer > Comparison. Clear the checkbox 'Store it when changing the document'. As long as 'officeooo:rsid' attributes are written, the styles are indeed different.
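
One way to check how many automatic styles differ only in their rsid is sketched below in Python. It is an illustration under stated assumptions, not LibreOffice's own logic: the function name `group_equivalent_styles` is hypothetical, and the rsid attributes are matched by local name (`rsid`, `paragraph-rsid`) regardless of namespace. Two styles land in the same group when everything except their name and those attributes is identical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
# Assumption: attributes with these local names carry only revision IDs.
IGNORED_LOCAL_NAMES = {"rsid", "paragraph-rsid"}


def _signature(elem, skip=()):
    """Hashable description of an element, ignoring rsid-like attributes."""
    attrs = tuple(sorted(
        (key, value) for key, value in elem.attrib.items()
        if key not in skip
        and key.rsplit("}", 1)[-1] not in IGNORED_LOCAL_NAMES
    ))
    children = tuple(_signature(child) for child in elem)
    return (elem.tag, attrs, children)


def group_equivalent_styles(xml_bytes):
    """Return lists of automatic style names that differ only by rsid."""
    root = ET.fromstring(xml_bytes)
    name_attr = f"{{{STYLE_NS}}}name"
    groups = defaultdict(list)
    for auto in root.iter(f"{{{OFFICE_NS}}}automatic-styles"):
        for style in auto.findall(f"{{{STYLE_NS}}}style"):
            # The style's own name is skipped so that T1 and T2 can match.
            key = _signature(style, skip=(name_attr,))
            groups[key].append(style.get(name_attr))
    return [names for names in groups.values() if len(names) > 1]
```

The input is the raw bytes of content.xml (or of a FODT file). A group such as `["T1", "T2"]` only shows that the definitions are equal apart from the rsid; whether they may be joined depends on whether document comparison is needed.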

I support the idea of a "clean-up" tool, which removes all things from the file, which are not actually used.
Comment 4 Heiko Tietze 2020-09-21 09:46:35 UTC
Not a UX topic; and while purge/clean/streamline sounds nice I'm sceptical that it's easy to implement. There are just too many dependencies that require expert knowledge in order to make a decision whether a setting should remain or not.
Comment 5 Mike Kaganski 2020-09-21 10:39:03 UTC
(In reply to Christian Lehmann from comment #0)

To check the claims, please save the file as a FODT, and then inspect the resulting XML, for simplicity - you would have everything in one XML.
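
A rough Python sketch of such a check on the FODT, assuming that font faces are declared as `style:font-face` elements and referenced through the attributes `style:font-name`, `style:font-name-asian` and `style:font-name-complex`. The function name `font_face_usage` is invented for this example, and other ways of referencing a font are not covered, so the result is a heuristic only.

```python
import xml.etree.ElementTree as ET

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
# Attributes assumed to reference a declared font face
# (Western, Asian and complex-script fonts).
FONT_ATTRS = [
    f"{{{STYLE_NS}}}{name}"
    for name in ("font-name", "font-name-asian", "font-name-complex")
]


def font_face_usage(fodt_source):
    """Return (declared, referenced) sets of font-face names in a flat ODF file.

    fodt_source may be a path or a file-like object.
    """
    root = ET.parse(fodt_source).getroot()
    declared = {
        face.get(f"{{{STYLE_NS}}}name")
        for face in root.iter(f"{{{STYLE_NS}}}font-face")
    }
    referenced = set()
    for elem in root.iter():
        for attr in FONT_ATTRS:
            value = elem.get(attr)
            if value is not None:
                referenced.add(value)
    return declared, referenced
```

`declared - referenced` then lists font faces that nothing in the file refers to by these three attributes. A face referenced only by a style still counts as used here, which matches the reading given in this comment.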

> File ‘content.xml’
> The section “<office:font-face-decls>” contains 9 font faces only one of
> which is used in this document.

Wrong. All 9 are used either in styles, or in text.

> In the section ‘<office:automatic-styles>’, several text styles have the
> same definition, differing only in the value of the attribute
> ‘officeooo:rsid’; but this is devoid of useful effects. In this sense, the
> following styles are the same:
> “T1” - “T6”
> “T7” - “T12”
> “T13” and “T14”.

See comment 3.

> Styles T4 – T6 contain specifications of fonts which are never used.

This doesn't make sense. The fonts are "used" as soon as the style is used. If the characters present in a text run that uses the style (direct formatting, actually) don't need some script, that doesn't mean the style should be cleaned up, so that when the user finally decides to write some Arabic or Chinese characters, they would appear in something other than what had been defined originally.

> The section ‘<text:sequence-decls>’ contains five declarations none of which
> is used in the text.

The sequence definitions are data in their own right, just like, say, styles or macros. You won't want the macros in a document to disappear on save just because no buttons that use the macros were inserted in the document. Likewise, prepared sequence definitions or styles are part of the data that must be saved.

That there are several sequence definitions pre-created by default, is a different story ...

> The section ‘<office:text>’ contains 13 occurrences of the tag ‘<text:span
> text:style-name="Kommentarzeichen">’. A style of this name is also listed
> among the “Applied Styles” in the panel “Character Styles”. It is, in fact,
> not applied in the document.

In fact, it *is* applied in the document. E.g., to text "¿Bá		jé".

A separate cleanup tool might be an interesting thing, indeed.
Comment 6 Christian Lehmann 2020-10-13 16:58:48 UTC
(In reply to Heiko Tietze from comment #4)
> Not a UX topic; and while purge/clean/streamline sounds nice I'm sceptical
> that it's easy to implement. There are just too many dependencies that
> require expert knowledge in order to make a decision whether a setting
> should remain or not.

I'm pursuing and urging this because I'm trying to do serious work with LO, but it is unable to handle reliably the large file that I have submitted with other bug reports. This file weighs 18.8 MB in FODT format, but 1.6 MB when exported into DOCX format. This alone should alert developers about the inflated file size. I have no doubt that the task of correcting a basic design mistake - in this case, obnoxious redundancy in file structure - is deterrent. However, we want to stand the competition with MS Office, don't we? Then it is not a real help to classify the bug as 'not a UX topic' and to discourage people from tackling the task.
Comment 7 Mike Kaganski 2020-10-13 18:39:10 UTC
(In reply to Christian Lehmann from comment #6)
> This file weighs 18.8 MB in FODT format, but 1.6 MB
> when exported into DOCX format. This alone should alert developers about the
> inflated file size. I have no doubt that the task of correcting a basic
> design mistake - in this case, obnoxious redundancy in file structure - is
> deterrent.

Heh, it's not useful to compare apples to oranges. DOCX, as well as ODT, is a ZIP with XMLs inside. It would be interesting to know how much space the XMLs inside the DOCX take, or how much a zipped FODT would take. It would also be interesting to see how much a normal ODT takes, which indeed also has all the redundancy, compared to DOCX.
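
The comparison suggested here can be sketched in Python with the standard `zipfile` module, which works for both ODT and DOCX because both are ZIP packages. The helper names `xml_sizes` and `totals` are made up for illustration, and only members ending in `.xml` are counted, so parts such as `.rels` files in a DOCX are left out.

```python
import zipfile


def xml_sizes(package):
    """Map each XML member of a ZIP package to (uncompressed, compressed) bytes.

    package may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(package) as archive:
        return {
            info.filename: (info.file_size, info.compress_size)
            for info in archive.infolist()
            if info.filename.endswith(".xml")
        }


def totals(sizes):
    """Sum the uncompressed and compressed sizes returned by xml_sizes()."""
    raw = sum(uncompressed for uncompressed, _ in sizes.values())
    packed = sum(compressed for _, compressed in sizes.values())
    return raw, packed
```

Running `totals(xml_sizes(path))` on the ODT and on the exported DOCX gives like-for-like numbers: the uncompressed totals compare the XML itself, the compressed totals compare what ends up on disk.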

> However, we want to stand the competition with MS Office, don't we?

This is very argumentative.
Comment 8 Mike Kaganski 2020-10-13 18:48:36 UTC
(In reply to Mike Kaganski from comment #7)
> It's also interesting, how much normal ODT takes, which indeed also
> has all the redundancy, compared to DOCX.

And note that the XML schema of OOXML was designed with the goal of using short names for most often used XML elements (like 'w:p', 'w:r' in Word, or <c r="A1" t="s"><v>0</v></c> for cells in Excel), while for ODF, the names were chosen likely for easier readability (like longer prefix names for paragraphs/text runs in Writer, or <table:table-cell office:value-type="string" office:string-value="2020" calcext:value-type="string">... in Calc). This adds to the XML size, but is not related to "redundancy".
Comment 9 Christian Lehmann 2020-10-13 18:57:06 UTC
(In reply to Mike Kaganski from comment #5)
> (In reply to Christian Lehmann from comment #0)
> 
> To check the claims, please save the file as a FODT, and then inspect the
> resulting XML, for simplicity - you would have everything in one XML.
> 
> > File ‘content.xml’
> > The section “<office:font-face-decls>” contains 9 font faces only one of
> > which is used in this document.
> 
> Wrong. All 9 are used either in styles, or in text.

It appears we are talking past each other. What matters to the user is what he can see in the editing window. The nine font declarations mention fonts like 'Lohit Devanagari', 'Noto Sans CJK' and the like, none of which are visible to the user in this particular document, be it in direct formats, be it in the styles applied. I conclude that this information (whatever its origin) does not need to be stored in the file.
> 
> > In the section ‘<office:automatic-styles>’, several text styles have the
> > same definition, differing only in the value of the attribute
> > ‘officeooo:rsid’; but this is devoid of useful effects. In this sense, the
> > following styles are the same:
> > “T1” - “T6”
> > “T7” - “T12”
> > “T13” and “T14”.
> 
> See comment 3.
That is certainly a helpful hint. However, again, why would you want to store this information with the file, if it is inaccessible to the user?

> 
> > Styles T4 – T6 contain specifications of fonts which are never used.
> 
> This doesn't make sense. The fonts are "used" as soon as the style is used.
> If characters present in a text run that uses the style (DF actually) don't
> need some script, that doesn't mean "the style should be cleaned up, such as
> when user finally decides to write some Arabic or Chinese characters, they
> would appear in something else compared to what had been defined originally".

Again, this answer doesn't make sense to me. Let's consider an example: The XML alleges that a style named 'T4' is applied to the word 'block' in the file. This fact, however, is not visible to the user. From his perspective, the entire paragraph is formatted with the Default Character Style. "T4" is in no way accessible to him.
> 
> > The section ‘<text:sequence-decls>’ contains five declarations none of which
> > is used in the text.
> 
> The sequence definitions are data by their own, just like, say, styles or
> macros. You won't want your macros in a document to disappear on save just
> because there were no buttons inserted in the document that used the macros.
> Likewise, prepared sequence definitions or styles are part of data, that
> must be saved.
> 
> That there are several sequence definitions pre-created by default, is a
> different story ...
I was just referring to these.

> 
> > The section ‘<office:text>’ contains 13 occurrences of the tag ‘<text:span
> > text:style-name="Kommentarzeichen">’. A style of this name is also listed
> > among the “Applied Styles” in the panel “Character Styles”. It is, in fact,
> > not applied in the document.
> 
> It is, in fact, *is* applied in the document. E.g., to text "¿Bá		jé".

This is true. Since it is not an LO Writer Style, it probably stems from an earlier MS Word version of the document. Again, the question is why a style that was specified at the level of a block - in this case, a comment - is assigned to single elements contained in the block.
Comment 10 Christian Lehmann 2020-10-13 19:01:15 UTC
(In reply to Mike Kaganski from comment #7)
> (In reply to Christian Lehmann from comment #6)
> > This file weighs 18.8 MB in FODT format, but 1.6 MB
> > when exported into DOCX format. This alone should alert developers about the
> > inflated file size. I have no doubt that the task of correcting a basic
> > design mistake - in this case, obnoxious redundancy in file structure - is
> > deterrent.
> 
> Heh, it's not useful to compare apples to oranges. DOCX, as well as ODT, is
> a ZIP with XMLs inside.

Sorry, I was not aware of this. In the present case, the plain ODT file and the exported DOCX file have the same size. I remember that the original DOCX file (the source of the ODT file) was a bit smaller; but this is no longer verifiable.
Comment 11 Mike Kaganski 2020-10-13 19:06:51 UTC
(In reply to Christian Lehmann from comment #9)
> (In reply to Mike Kaganski from comment #5)
> > (In reply to Christian Lehmann from comment #0)
> > 
> > To check the claims, please save the file as a FODT, and then inspect the
> > resulting XML, for simplicity - you would have everything in one XML.
> > 
> > > File ‘content.xml’
> > > The section “<office:font-face-decls>” contains 9 font faces only one of
> > > which is used in this document.
> > 
> > Wrong. All 9 are used either in styles, or in text.
> 
> It appears we are talking past each other. What matters to the user is what
> he can see in the editing window. The nine font declarations mention fonts
> like 'Lohit Devanagari', 'Noto Sans CJK' and the like, none of which are
> visible to the user in this particular document, be it in direct formats, be
> it in the styles applied. I conclude that this information (whatever its
> origin) does not need to be stored in the file.

It is only "not visible" to you because you chose not to see it: namely, you likely don't have Asian and CTL support enabled in Options->Language Settings->Languages. It is *hidden* from people who don't choose to see it, but that doesn't mean it's inaccessible, or that it should be dropped. A style (or direct formatting) has the full set of settings that describe its appearance, which includes the possibility that someone later types Chinese or Arabic characters there in that paragraph.

> > 
> > > In the section ‘<office:automatic-styles>’, several text styles have the
> > > same definition, differing only in the value of the attribute
> > > ‘officeooo:rsid’; but this is devoid of useful effects. In this sense, the
> > > following styles are the same:
> > > “T1” - “T6”
> > > “T7” - “T12”
> > > “T13” and “T14”.
> > 
> > See comment 3.
> That is certainly a helpful hint. However, again, why would you want to
> store this information with the file, if it is inaccessible the user?

It is available when the user uses Edit->Track Changes->Compare Document.

> 
> > 
> > > Styles T4 – T6 contain specifications of fonts which are never used.
> > 
> > This doesn't make sense. The fonts are "used" as soon as the style is used.
> > If characters present in a text run that uses the style (DF actually) don't
> > need some script, that doesn't mean "the style should be cleaned up, such as
> > when user finally decides to write some Arabic or Chinese characters, they
> > would appear in something else compared to what had been defined originally".
> 
> Again, this answer doesn't make sense to me. Let's consider an example: The
> XML alleges that a style named 'T4' is applied to the word 'block' in the
> file. This fact, however, is not visible to the user. In his perspective,
> the entire paragraph is formatted with the Default Character Style. "T4" is
> in no way accessible to him.

Automatic styles are the LibreOffice way to express direct formatting. So T4 *is* available to the user, through the properties of the text that has this automatic style applied.

> > 
> > > The section ‘<text:sequence-decls>’ contains five declarations none of which
> > > is used in the text.
> > 
> > The sequence definitions are data by their own, just like, say, styles or
> > macros. You won't want your macros in a document to disappear on save just
> > because there were no buttons inserted in the document that used the macros.
> > Likewise, prepared sequence definitions or styles are part of data, that
> > must be saved.
> > 
> > That there are several sequence definitions pre-created by default, is a
> > different story ...
> I was just referring to these.

This is orthogonal to redundancy. If you want, you should file a separate report, something like "LibreOffice should not pre-create sequences for illustrations etc.".

> 
> > 
> > > The section ‘<office:text>’ contains 13 occurrences of the tag ‘<text:span
> > > text:style-name="Kommentarzeichen">’. A style of this name is also listed
> > > among the “Applied Styles” in the panel “Character Styles”. It is, in fact,
> > > not applied in the document.
> > 
> > It is, in fact, *is* applied in the document. E.g., to text "¿Bá		jé".
> 
> This is true. Since it is not an LO Writer Style, it probably stems from an
> earlier MS Word version of the document. Again, the question is why a style
> that was specified at the level of a block - in this case, a comment - is
> assigned to single elements contained in the block.

This is not related to this issue, and is something to ask the author.
Comment 12 Christian Lehmann 2020-10-13 19:33:39 UTC
(In reply to Mike Kaganski from comment #8)
> (In reply to Mike Kaganski from comment #7)
> > It's also interesting, how much normal ODT takes, which indeed also
> > has all the redundancy, compared to DOCX.
> 
> And note that the XML schema of OOXML was designed with the goal of using
> short names for most often used XML elements (like 'w:p', 'w:r' in Word, or
> <c r="A1" t="s"><v>0</v></c> for cells in Excel), while for ODF, the names
> were chosen likely for easier readability (like longer prefix names for
> paragraphs/text runs in Writer, or <table:table-cell
> office:value-type="string" office:string-value="2020"
> calcext:value-type="string">... in Calc). This adds to the XML size, but is
> not related to "redundancy".

Agreed. The real question is, of course, how much memory size the file occupies once loaded into the editor.
Comment 13 Christian Lehmann 2020-10-13 20:02:07 UTC
(In reply to Mike Kaganski from comment #11)
> > It appears we are talking past each other. What matters to the user is what
> > he can see in the editing window. The nine font declarations mention fonts
> > like 'Lohit Devanagari', 'Noto Sans CJK' and the like, none of which are
> > visible to the user in this particular document, be it in direct formats, be
> > it in the styles applied. I conclude that this information (whatever its
> > origin) does not need to be stored in the file.
> 
> It only is "not visible" to you because you chose not to see it: namely, you
> likely don't have Asian and CTL support enabled in Options->Language
> Settings->Languages. It is *hidden* from people that don't choose to see it,
> but it doesn't mean it's inaccessible, or that it should be dropped. The
> styles (and direct formatting) has the full set of settings that describes
> its appearance, which includes possibility that someone later types a
> Chinese or Arabic characters there in that paragraph.

Indeed, I had not. Even if I activate those languages, the 'Asian Text Font' and the 'CTL Font' offered for use are different. And the two fonts I mentioned are not even offered in the dropdown list.

Moreover, activating this option would introduce the possibility of using two additional fonts. Why should there be nine of them?

> > > See comment 3.
> > That is certainly a helpful hint. However, again, why would you want to
> > store this information with the file, if it is inaccessible the user?
> 
> It is available - when user uses Edit->Track Changes->Compare Document.
>
The point was not whether this can be made visible to the user (I have not succeeded using this method), but whether this information is of any use to him.

> > 
> > > 
> > > > Styles T4 – T6 contain specifications of fonts which are never used.
> > > 
> > > This doesn't make sense. The fonts are "used" as soon as the style is used.
> > > If characters present in a text run that uses the style (DF actually) don't
> > > need some script, that doesn't mean "the style should be cleaned up, such as
> > > when user finally decides to write some Arabic or Chinese characters, they
> > > would appear in something else compared to what had been defined originally".
> > 
> > Again, this answer doesn't make sense to me. Let's consider an example: The
> > XML alleges that a style named 'T4' is applied to the word 'block' in the
> > file. This fact, however, is not visible to the user. In his perspective,
> > the entire paragraph is formatted with the Default Character Style. "T4" is
> > in no way accessible to him.
> 
> The automatic styles is the LibreOffice way to express direct formatting. So
> T4 *is* available to user, through properties of the text that has this
> automatic style applied.
>
We are talking about parsimony. Why do we need T1 - T14, each with its own definition, if most of them appear as the same to the user?
 
> > 
> > This is true. Since it is not an LO Writer Style, it probably stems from an
> > earlier MS Word version of the document. Again, the question is why a style
> > that was specified at the level of a block - in this case, a comment - is
> > assigned to single elements contained in the block.
> 
> This is not related to this issue, and is something to ask the author.

No, it is an issue of how LO Writer stores a character style that the user specified for an entire paragraph. I had asked this in a different bug report and will take it up there.
Comment 14 Mike Kaganski 2020-10-13 20:32:17 UTC
(In reply to Christian Lehmann from comment #13)
> (In reply to Mike Kaganski from comment #11)
> > > It appears we are talking past each other. What matters to the user is what
> > > he can see in the editing window. The nine font declarations mention fonts
> > > like 'Lohit Devanagari', 'Noto Sans CJK' and the like, none of which are
> > > visible to the user in this particular document, be it in direct formats, be
> > > it in the styles applied. I conclude that this information (whatever its
> > > origin) does not need to be stored in the file.
> > 
> > It only is "not visible" to you because you chose not to see it: namely, you
> > likely don't have Asian and CTL support enabled in Options->Language
> > Settings->Languages. It is *hidden* from people that don't choose to see it,
> > but it doesn't mean it's inaccessible, or that it should be dropped. The
> > styles (and direct formatting) has the full set of settings that describes
> > its appearance, which includes possibility that someone later types a
> > Chinese or Arabic characters there in that paragraph.
> 
> Indeed, I had not. Even if I activate those languages, the 'Asian Text Font'
> and the 'CTL Font' offered for use are different. And the two fonts I
> mentioned are not even offered in the dropdown list.

Just tested with attachment 165087. The Default Paragraph Style has "Cambria", "Noto Serif CJK SC", and "Lohit Devanagari". Heading paragraph style adds "Liberation Serif" and "Noto Sans CJK SC". "ex_gloss" paragraph style adds "Times New Roman". "ex_a" style adds "Arial Unicode MS". And finally, "Lohit Devanagari" and "Cambria" are there in the fonts list in two variants each, having style:font-family-generic and style:font-pitch attributes for second copy; possibly that's needed for some compatibility settings related to Word import. I see in UI all 9 fonts used in the XML, in expected positions.

> 
> Moreover, activating this option would introduce the possibility of using
> two additional fonts. Why should there be nine of them?

Two for each style. And for each direct formatting.

> 
> > > > See comment 3.
> > > That is certainly a helpful hint. However, again, why would you want to
> > > store this information with the file, if it is inaccessible the user?
> > 
> > It is available - when user uses Edit->Track Changes->Compare Document.
> >
> The point was not whether this can be made visible to the user (I have not
> succeeded using this method), but whether this information is of any use to
> him.

If you know that you are not going to compare versions of documents, you have the option to disable this - see comment 3. Software cannot figure out whether the user needs it or not.

> 
> > > 
> > > > 
> > > > > Styles T4 – T6 contain specifications of fonts which are never used.
> > > > 
> > > > This doesn't make sense. The fonts are "used" as soon as the style is used.
> > > > If characters present in a text run that uses the style (DF actually) don't
> > > > need some script, that doesn't mean "the style should be cleaned up, such as
> > > > when user finally decides to write some Arabic or Chinese characters, they
> > > > would appear in something else compared to what had been defined originally".
> > > 
> > > Again, this answer doesn't make sense to me. Let's consider an example: The
> > > XML alleges that a style named 'T4' is applied to the word 'block' in the
> > > file. This fact, however, is not visible to the user. In his perspective,
> > > the entire paragraph is formatted with the Default Character Style. "T4" is
> > > in no way accessible to him.
> > 
> > The automatic styles is the LibreOffice way to express direct formatting. So
> > T4 *is* available to user, through properties of the text that has this
> > automatic style applied.
> >
> We are talking about parsimony. Why do we need T1 - T14, each with its own
> definition, if most of them appear as the same to the user?

Each of them differs at least in its officeooo:rsid. This allows versions of documents to be compared if the user needs that.

>  
> > > 
> > > This is true. Since it is not an LO Writer Style, it probably stems from an
> > > earlier MS Word version of the document. Again, the question is why a style
> > > that was specified at the level of a block - in this case, a comment - is
> > > assigned to single elements contained in the block.
> > 
> > This is not related to this issue, and is something to ask the author.
> 
> No, it is an issue of how LO Writer stores a character style that the user
> specified for an entire paragraph. I had asked this in a different bug
> report and will take it up there.

This is unrelated to this issue. Period.

Generally, all the issues you raised seem to stem from a lack of knowledge.
Comment 15 Christian Lehmann 2020-10-14 15:22:57 UTC
No doubt about that. It has certainly been meritorious to explain things to a simple user. It would be even more meritorious if you could engage in what is obviously the aim of my contribution, namely, to push on with parsimony in the structure of ODT files. I assume you agree that this is a virtue in programming.
Comment 16 Dieter 2021-12-29 17:56:42 UTC
(In reply to Regina Henschel from comment #3)
> Disable creating of 'officeooo:rsid': Tools > Options > Writer > Comparison.
> Clear the checkbox 'Store it when changing the document'. While
> 'officeooo:rsid' are written the styles are indeed different.
> 
> I support the idea of a "clean-up" tool, which removes all things from the
> file, which are not actually used.

Who should decide about this enhancement idea? Status is still UNCONFIRMED with no further comment for more than one year.
Comment 17 achim 2021-12-29 21:49:10 UTC
As described in tdf#146052 I'd like to have a cleanup function too.
Comment 18 Christian Lehmann 2022-01-05 19:26:46 UTC
If the goal of purging a document of redundant material were seriously pursued, I would volunteer to present a list of elements that seem superfluous to me in a given file. However, as Mike noticed long ago, I am not a specialist on technicalities. I can only see things such as this:

Have a paragraph (e.g. a heading) obeying a style which puts the entire paragraph in bold face. Then copy into this paragraph a string which is already formatted, by character format, as bold face. The code of this attribute is preserved in the XML file. The user, however, does not need this. Even if he had advanced methods of looking into the inner formatting of substrings of the paragraph in question, this would be of no interest to him. Consequence: Delete the character formatting of a string if it is copied into a context which already has this format.

This is the kind of redundancy I was referring to. I trust my file would be much smaller if it were suppressed.
Comment 19 Christian Lehmann 2022-04-20 07:56:09 UTC
(In reply to Regina Henschel from comment #3)
> Disable creating of 'officeooo:rsid': Tools > Options > Writer > Comparison.
> Clear the checkbox 'Store it when changing the document'. While
> 'officeooo:rsid' are written the styles are indeed different.
> 
I cleared this checkbox. The effect is null. All of those 'officeooo:rsid' remain in the document.
Comment 20 Regina Henschel 2022-04-20 08:22:29 UTC
(In reply to Christian Lehmann from comment #19)
> (In reply to Regina Henschel from comment #3)
> > Disable creating of 'officeooo:rsid': Tools > Options > Writer > Comparison.
> > Clear the checkbox 'Store it when changing the document'. While
> > 'officeooo:rsid' are written the styles are indeed different.
> > 
> I cleared this checkbox. The effect is null. All of those 'officeooo:rsid'
> remain in the document.

It does not affect old documents, but prevents writing these attributes in new documents.

If you do not use LO-specific features, you can save the existing document in the format "ODF 1.3". That removes these attributes. But I have not tested whether that is enough to merge styles and remove styles that are no longer used.
Comment 21 Christian Lehmann 2022-04-20 09:10:20 UTC
Thanks for taking this up. I have tested it with the file test_san.odt of Bug 148333. I now work on this file (i.e. its non-sanitized original) in LO Writer 7.3. This recommends using file format 1.3 Extended. I now chose format 1.3 [bare] and looked into its content.xml. It still contains tons of text spans named "T12358" and the like.
Comment 22 Buovjaga 2022-11-28 12:05:07 UTC
*** Bug 146052 has been marked as a duplicate of this bug. ***
Comment 23 Dieter 2023-12-06 08:56:22 UTC
Tested with

Version: 7.6.3.2 (X86_64) / LibreOffice Community
Build ID: 29d686fea9f6705b262d369fede658f824154cc0
CPU threads: 4; OS: Windows 10.0 Build 19045; UI render: Skia/Raster; VCL: win
Locale: en-GB (de_DE); UI: en-GB
Calc: CL threaded

Steps:

1. Open attachment 179343
2. Tools -> Options -> Load/Save -> General
3. ODF Format Version 1.3 (not 1.3 extended) -> OK
4. Save
5. Open content.xml

Couldn't find any T12358 entries (reported as problem in comment 21)

Christian, could you please retest and give a proper set of steps, if problem is still present?
=> NEEDINFO
Comment 24 Christian Lehmann 2024-01-21 23:12:40 UTC
I have finally gotten around to doing the test. As I understand it, there is more than one issue involved here:

1) Can one shrink the size of an ODT file by saving it in ODF Format Version 1.3?
2) Does the content.xml of an ODT file contain redundant styles of the kind "<style:style style:name="T2375" ..."?

Maybe two separate enhancement proposals should be created for discussion.

Ad 1: Yes. The file I had been working with weighs 2.523 KB. Saved in Format Version 1.3, it weighs 2.016 KB. However: 
a) The Help function on Load/Save options does not even mention this menu item, let alone explain it.
b) The menu item is provided with a warning "Not using ODF 1.3 Extended may cause information to be lost." Now this is malicious because the user is not told what kind of information is meant. In the case of the test file, no losses are visible. Thus: i) Saving a complex file in 1.3 format and then going on working with it may imply an incalculable risk. ii) If no information is lost, then what is the advantage of working with the 1.3 Extended format?

Ad 2: I will just mention two cases of style definitions and style marking in the content.xml of test_san.odt exported to 1.3 Format which seem redundant: 

a) <style:style style:name="T2373" style:family="text"><style:text-properties style:font-name="Cambria1" style:font-name-asian="MS Mincho1" style:language-asian="none" style:country-asian="none"/></style:style>

<style:style style:name="T2374" style:family="text"><style:text-properties style:font-name="Cambria1" style:font-name-asian="MS Mincho1" style:language-asian="none" style:country-asian="none"/></style:style>

<style:style style:name="T2375" style:family="text"><style:text-properties style:font-name="Cambria1" style:font-name-asian="MS Mincho1" style:language-asian="none" style:country-asian="none"/></style:style>

<style:style style:name="T2376" style:family="text"><style:text-properties style:font-name="Cambria1" style:font-name-asian="MS Mincho1" style:language-asian="none" style:country-asian="none"/></style:style>

<style:style style:name="T2377" style:family="text"><style:text-properties style:font-name="Cambria1" style:font-name-asian="MS Mincho1" style:language-asian="none" style:country-asian="none"/></style:style>

<style:style style:name="T2378" style:family="text"><style:text-properties style:font-name="Cambria1" style:font-name-asian="MS Mincho1" style:language-asian="none" style:country-asian="none"/></style:style>

[Paragraph breaks added for clarity!]

b) Adjacent stretches of running text are separately formatted by the same style, e.g.: 
<text:span text:style-name="T11515">xxx xxx xxx</text:span><text:span text:style-name="T11515">xxx xxx xxx</text:span>

This should be: 
<text:span text:style-name="T11515">xxx xxx xxxxxx xxx xxx</text:span>

I offer to continue this search for redundancies on condition that somebody is willing to take care of them.
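
The merge proposed under b) can be sketched in Python as follows. This is a minimal illustration, not a description of what LibreOffice does or should do on save: the function name `merge_adjacent_spans` is hypothetical, and the sketch assumes that two `text:span` siblings may be joined whenever they carry identical attributes and no text stands between them. Whether that is always safe in a real document is not established here.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
SPAN = f"{{{TEXT_NS}}}span"


def merge_adjacent_spans(parent):
    """Merge directly adjacent text:span siblings with identical attributes.

    Works in place on an ElementTree element and returns the number of merges.
    """
    merged = 0
    i = 0
    while i + 1 < len(parent):
        first, second = parent[i], parent[i + 1]
        # An empty tail means no text sits between the two spans.
        adjacent = not first.tail
        if (first.tag == SPAN and second.tag == SPAN
                and adjacent and first.attrib == second.attrib):
            # Append the second span's leading text to the end of the
            # first span's content.
            if len(first):
                first[-1].tail = (first[-1].tail or "") + (second.text or "")
            else:
                first.text = (first.text or "") + (second.text or "")
            first.extend(list(second))
            first.tail = second.tail
            parent.remove(second)
            merged += 1
            # Do not advance: the merged span may match the next sibling too.
        else:
            i += 1
    for child in parent:
        merged += merge_adjacent_spans(child)
    return merged
```

Applied to the paragraph in the example above, the two "T11515" spans become one span containing "xxx xxx xxxxxx xxx xxx". Spans separated by any text, including a single space, are left alone.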