Created attachment 53951
original doc (Word) and converted odt (3.3.4/3.4.1/3.4.4) and docx (Word)
An ODF import of a DOC file (created with the LO top page) produced the following structure in LO 3.3:
<text:p text:style-name="P3">Home of the LibreOffice Productivity Suite</text:p>
In LO 3.4.1 (and 3.4.4) it produces:
<text:p text:style-name="P3">Home<text:span text:style-name="T1"> </text:span>of<text:span text:style-name="T1"> </text:span>the<text:span text:style-name="T1"> </text:span>LibreOffice<text:span text:style-name="T1"> </text:span>Productivity<text:span text:style-name="T1"> </text:span>Suite</text:p>
which basically wraps every space between words in its own styled span.
Although the user might not see the difference, processes that rely on the XML structure of the document end up having to deal with all that noise.
For example, L10N/translation software (either commercial or free) usually relies on the XML structure of the document to create a similarly structured translated file. If there is too much XML noise in the document, the translator (or parser) will not be able to handle the contents properly and may end up losing data or creating an invalid XML document.
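To make the difference concrete, here is a minimal sketch of how a consumer could read the paragraph text from both variants. It is only an illustration, not how any particular translation tool works. The helper name `paragraph_text` and the wrapping `<root>` element are assumptions added so that the `text:` prefix is declared; it also ignores ODF elements such as `text:s` or `text:tab`, which a real tool would need to handle.

```python
import xml.etree.ElementTree as ET

# ODF namespace bound to the "text:" prefix.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def paragraph_text(fragment: str) -> str:
    """Return the plain text of the first text:p in an ODF fragment.

    Hypothetical helper: the fragment is wrapped in a dummy <root>
    element only so that the text: prefix is declared for the parser.
    """
    wrapped = f'<root xmlns:text="{TEXT_NS}">{fragment}</root>'
    root = ET.fromstring(wrapped)
    para = root.find(f"{{{TEXT_NS}}}p")
    # itertext() walks the paragraph and all nested spans in document order.
    return "".join(para.itertext())


clean = (
    '<text:p text:style-name="P3">'
    "Home of the LibreOffice Productivity Suite</text:p>"
)
space = '<text:span text:style-name="T1"> </text:span>'
words = ["Home", "of", "the", "LibreOffice", "Productivity", "Suite"]
noisy = '<text:p text:style-name="P3">' + space.join(words) + "</text:p>"

print(paragraph_text(clean))
print(paragraph_text(noisy))
```

Both calls yield the same string, so the text itself is intact. What changes is the number of inline elements a segmenting tool has to carry through translation: zero in the 3.3 output, five in the 3.4 output.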
That bug was introduced by this commit:
Caolan, I think you are the best one to fix that :)
There were follow-up commits which attempted to reduce the extra spans when those weak chars would have been assigned the same properties by the default script-assignment algorithm. Does this persist in 3.5?
Yes, I still see it in 3.5.
let's try and reduce the noise with
cherry-picked to 3-5 as well
I.e., if Writer would pick the same script for weak chars as Word wants to force on them, then don't create a new span for them. This should reduce the noise significantly. Fundamentally, though, the output is correct either way, and relying on an "unnoisy" structure is broken.
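The idea can be sketched roughly as follows. This is not the actual Writer code, only an illustration of the rule under stated assumptions: the function `split_runs`, the script names, and the `default_script_of` callback are all hypothetical.

```python
from itertools import groupby


def split_runs(chars, default_script_of):
    """Group characters into runs, marking only the runs that need a span.

    chars: list of (character, script forced by the source document).
    default_script_of: hypothetical callback returning the script the
    default algorithm would assign to a character on its own.
    Returns a list of (text, script_override); the override is None
    when no explicit span is needed.
    """
    def override(item):
        ch, forced = item
        # Only keep an explicit override when it differs from the default.
        return forced if forced != default_script_of(ch) else None

    return [
        ("".join(ch for ch, _ in group), script)
        for script, group in groupby(chars, key=override)
    ]


chars = [(c, "latin") for c in "Home of"]

# The default agrees with the forced script everywhere: a single plain run.
print(split_runs(chars, lambda c: "latin"))

# The default disagrees for the space only: just that run gets an override.
print(split_runs(chars, lambda c: "asian" if c == " " else "latin"))
```

In the first call nothing needs a span. In the second, only the space is split out, which mirrors the remaining cases where a span is still required.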
This is (at least on the surface) a Writer issue, so I changed the 'Component' field accordingly.