Bug 43337 - The XML structure of ODF imports from DOC adds too much noise
Summary: The XML structure of ODF imports from DOC adds too much noise
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 3.4.1 release (earliest affected)
Hardware: All All
Importance: medium major
Assignee: Caolán McNamara
Depends on:
Reported: 2011-11-29 06:28 UTC by Jean-Christophe Helary
Modified: 2012-05-07 09:25 UTC
CC List: 2 users

See Also:
Crash report or crash signature:

original doc (Word) and converted odt (3.3.4/3.4.1/3.4.4) and docx (Word) (54.96 KB, application/zip)
2011-11-29 06:28 UTC, Jean-Christophe Helary

Description Jean-Christophe Helary 2011-11-29 06:28:10 UTC
Created attachment 53951
original doc (Word) and converted odt (3.3.4/3.4.1/3.4.4) and docx (Word)

Opening a DOC file (created from the LO home page) in LO 3.3 and saving it as ODF produced the following structure:

<text:p text:style-name="P3">Home of the LibreOffice Productivity Suite</text:p>

In LO 3.4.1 (and 3.4.4) it produces:

<text:p text:style-name="P3">Home<text:span text:style-name="T1"> </text:span>of<text:span text:style-name="T1"> </text:span>the<text:span text:style-name="T1"> </text:span>LibreOffice<text:span text:style-name="T1"> </text:span>Productivity<text:span text:style-name="T1"> </text:span>Suite</text:p>

which basically wraps every space between words in its own styled span.

Although the user might not see the difference, processes that rely on the XML structure of the document end up having to deal with all that noise.

For example, L10N/translation software (either commercial or free) usually relies on the XML structure of the document to create a similarly structured translated file. If there is too much XML noise in the document, the translator (or parser) will not be able to handle the contents properly and may end up losing data or creating an invalid XML document.
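
To make the difference concrete, here is a minimal Python sketch that parses the two paragraphs quoted above and compares them. It is only an illustration, not part of any translation tool: the helper names (wrap, paragraph_text, span_count) are made up for this example, and the fragments are wrapped in a dummy root element so that the text: prefix resolves to the ODF text namespace.

```python
import xml.etree.ElementTree as ET

# ODF text namespace, needed so the "text:" prefix in the fragments resolves.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# The two paragraphs quoted in this report.
LO_33 = ('<text:p text:style-name="P3">'
         'Home of the LibreOffice Productivity Suite</text:p>')
LO_34 = ('<text:p text:style-name="P3">Home'
         '<text:span text:style-name="T1"> </text:span>of'
         '<text:span text:style-name="T1"> </text:span>the'
         '<text:span text:style-name="T1"> </text:span>LibreOffice'
         '<text:span text:style-name="T1"> </text:span>Productivity'
         '<text:span text:style-name="T1"> </text:span>Suite</text:p>')


def wrap(fragment):
    """Wrap a fragment in a dummy root that declares the text: prefix."""
    return f'<root xmlns:text="{TEXT_NS}">{fragment}</root>'


def paragraph_text(fragment):
    """Return the plain text of the paragraph, ignoring inline markup."""
    root = ET.fromstring(wrap(fragment))
    para = root.find(f"{{{TEXT_NS}}}p")
    return "".join(para.itertext())


def span_count(fragment):
    """Count the text:span elements inside the fragment."""
    root = ET.fromstring(wrap(fragment))
    return len(root.findall(f".//{{{TEXT_NS}}}span"))


if __name__ == "__main__":
    print(paragraph_text(LO_33) == paragraph_text(LO_34))
    print(span_count(LO_33), span_count(LO_34))
```

Both paragraphs carry the same text, but a tool that maps each inline element to a placeholder tag has five extra tags to track in the 3.4 output of this one short sentence.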
Comment 1 Cédric Bosdonnat 2011-12-07 08:17:45 UTC
That bug has been introduced by this commit:

Caolan, I think you are the best one to fix that :)
Comment 2 Caolán McNamara 2011-12-07 08:34:02 UTC
There were follow-on commits which attempted to reduce the extra spans when those weak chars would have been assigned the same properties by the default script-assignment algorithm. Does this persist in 3.5?
Comment 3 Cédric Bosdonnat 2011-12-08 00:03:59 UTC
Yes, I still see it in 3.5.
Comment 4 Caolán McNamara 2011-12-08 04:56:29 UTC
let's try and reduce the noise with 


cherry-picked to 3-5 as well

i.e. if Writer would pick the same script for weak chars as Word wants to force on them, then don't create a new span for them. This should reduce the noise significantly. Fundamentally, though, the output is correct either way, and relying on an "unnoisy" structure is broken.
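
The idea can be sketched roughly as follows. This is not the LibreOffice code and not the commit referred to above; it is a hypothetical Python illustration that assumes a run is a (text, script) pair, that "weak" means a run with no letters or digits, and that Writer's default is for a weak run to inherit the script of the preceding run. All names (Run, is_weak, default_script, merge_redundant_runs) are invented for the example.

```python
from typing import List, Tuple

# A run is (text, script the importer is asked to force), e.g. ("Home", "latin").
Run = Tuple[str, str]


def is_weak(text: str) -> bool:
    """Assumption: a run with no letters or digits (spaces, punctuation) is weak."""
    return bool(text) and not any(ch.isalnum() for ch in text)


def default_script(prev_script: str) -> str:
    """Hypothetical stand-in for the default choice: inherit the previous script."""
    return prev_script


def merge_redundant_runs(runs: List[Run]) -> List[Run]:
    """Fold a run into the previous one when forcing its script changes nothing."""
    merged: List[Run] = []
    for text, forced in runs:
        if merged:
            prev_text, prev_script = merged[-1]
            # Script that would be picked without any forcing.
            natural = default_script(prev_script) if is_weak(text) else forced
            if natural == forced == prev_script:
                # No new span needed: extend the previous run instead.
                merged[-1] = (prev_text + text, prev_script)
                continue
        merged.append((text, forced))
    return merged


if __name__ == "__main__":
    same = [("Home", "latin"), (" ", "latin"), ("of", "latin")]
    differs = [("Home", "latin"), (" ", "asian"), ("of", "latin")]
    print(merge_redundant_runs(same))
    print(merge_redundant_runs(differs))
```

In the first case the space would get the same script anyway, so the three runs collapse into one; in the second the forced script differs from the default, so the separate span has to stay.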
Comment 5 Roman Eisele 2012-05-07 09:25:12 UTC
This is (at least on the surface) a Writer issue, so I changed the 'Component' field accordingly.