Bug 142443 - Libre Office Writer breaks the text adding unwanted <span> tags in the content.xml file
Status: RESOLVED NOTABUG
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 6.3.0.4 release (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: needsDevAdvice
Depends on:
Blocks:
 
Reported: 2021-05-23 09:21 UTC by Tyco72
Modified: 2024-11-21 14:52 UTC (History)

See Also:
Crash report or crash signature:


Attachments
Example .odt file (12.07 KB, application/vnd.oasis.opendocument.text)
2021-05-23 09:23 UTC, Tyco72
Screenshot (22.45 KB, image/png)
2021-05-23 09:24 UTC, Tyco72
odt file v1.2 (11.63 KB, application/vnd.oasis.opendocument.text)
2021-05-24 16:14 UTC, Tyco72
Screenshot ODF v1.2 (123.79 KB, image/jpeg)
2021-05-24 16:16 UTC, Tyco72

Description Tyco72 2021-05-23 09:21:31 UTC
Description:
This issue has been present since at least LO 6.3.6, and it also happens with LO 7.x. It is a critical bug when you have to paste text into WordPress "Preformatted" blocks or otherwise work with HTML. My system runs 64-bit Windows.

The error is not exactly reproducible, but it always happens when you write new text or change existing text while applying formatting such as bold or different colors to individual words. It also happens when you insert new text between lines or modify existing sentences.

Almost every time, LO randomly adds unwanted <span> tags that break the text into hundreds of lines, some containing only a single character. In documents with many pages, the effect is noticeable even in the size of the .odt file, which grows more than expected because content.xml fills up with thousands of unnecessary <span> tags, and possibly also "<style:style style:name" tags.

You can see the effect of these tags when you paste the text into a "Preformatted" block in WordPress: the pasted text shows all the line breaks caused by the <span> tags.
Here is a small example of what happens in content.xml. The text is broken into many short fragments, filled with "<text:span text..." and "<style:style style:name=...." tags. Example:

<text:span text:style-name="T1">P</text:span>
<text:span text:style-name="T2">rob</text:span>
<text:span text:style-name="T3">a</text:span>
<text:span text:style-name="T2">bly, </text:span>
<text:span text:style-name="T1">collisions with other galaxies have </text:span>
<text:span text:style-name="T2">already happened to the Milky Way in the past, in</text:span>
<text:span text:style-name="T1">corporating </text:span>

This is a LibreOffice problem. I have attached the document "sample3.odt".
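
To gauge how fragmented a document is, the spans in content.xml can be counted per style name. The following is a minimal sketch using only the Python standard library; the function names and the hand-built SAMPLE paragraph (which mirrors the fragment quoted above) are illustrative assumptions, not part of any LibreOffice tool.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# Standard ODF namespace for text content (text:p, text:span, ...).
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def read_content_xml(odt_path):
    """Return the raw content.xml bytes from an .odt file (a ZIP archive)."""
    with zipfile.ZipFile(odt_path) as archive:
        return archive.read("content.xml")


def span_style_counts(content_xml):
    """Count text:span elements per text:style-name value."""
    root = ET.fromstring(content_xml)
    counts = Counter()
    for span in root.iter(f"{{{TEXT_NS}}}span"):
        counts[span.get(f"{{{TEXT_NS}}}style-name", "(none)")] += 1
    return counts


# Hand-built paragraph that mirrors the fragment quoted in the report.
SAMPLE = (
    f'<text:p xmlns:text="{TEXT_NS}">'
    '<text:span text:style-name="T1">P</text:span>'
    '<text:span text:style-name="T2">rob</text:span>'
    '<text:span text:style-name="T3">a</text:span>'
    '<text:span text:style-name="T2">bly, </text:span>'
    '<text:span text:style-name="T1">collisions with other galaxies have </text:span>'
    '<text:span text:style-name="T2">already happened to the Milky Way in the past, in</text:span>'
    '<text:span text:style-name="T1">corporating </text:span>'
    '</text:p>'
)

# Print the span count for each style name, most frequent first.
print(span_style_counts(SAMPLE).most_common())
```

For a real file, something like span_style_counts(read_content_xml("sample3.odt")) should give the same kind of summary, assuming the file is in the current directory.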
 
The HTML looks like this:

<p align="left"><span style="font-family: Arial, sans-serif;"><span style="font-size: small;"><a href="https://en.wikipedia.org/wiki/Andromeda_Galaxy"><b>Andromeda</b></a> is the galaxy more near to ours, the Milky Way. It is about 2,5 millions of light years from us (which is a little distance in the scale of universe). Andromeda is a big Galaxy, bigger that the Milky Way. It has a diameter of 220.000 light years and contains about <b>1000 billions of stars</b>.</span></span></p>
<p align="left"><span style="font-family: Arial, sans-serif;"><span style="font-size: small;">In comparison, the Milky Way is 170.000 - 200.000 light years large and contains 100-400 billion stars.</span></span></p>
<p align="left"><span style="font-family: Arial, sans-serif;"><span style="font-size: small;">The interesting part is that Andromeda is moving towards us and in about 4,5 billions years Andromeda will collide with the Milky Way. We can do nothing to stop this way! Collisions among galaxies are a common event in the universe.</span></span></p>

The attached example has been created with LO 6.x, but it happens also with LO 7.x.

Please provide feedback on this bug. All my existing .odt files are cluttered with these tags. If possible, a tool to clean up existing documents would be appreciated.

Steps to Reproduce:
See bug description

Actual Results:
Text in content.xml file is broken with many unwanted <span> tags

Expected Results:
Text is not broken with unwanted <span> tags


Reproducible: Always


User Profile Reset: Yes



Additional Info:
I am currently using LO 6.4, but it also happens with LO 7.x.
I would rather not update my main LO installation until this bug is fixed.
Comment 1 Tyco72 2021-05-23 09:23:25 UTC
Created attachment 172261
Example .odt file
Comment 2 Tyco72 2021-05-23 09:24:40 UTC
Created attachment 172262
Screenshot

Effect of the unwanted <span> tags
Comment 3 Regina Henschel 2021-05-23 12:42:40 UTC
The problem is likely caused by officeooo:rsid attributes. Those are written to improve change tracking. You can avoid them with two methods:
A) Go to menu Tools > Options > Writer > Comparison and uncheck the option "Store it when changing the document". To remove the attributes from existing documents, you need to save them once in format "1.3" (method B) or remove the attributes manually in the file.
B) Save in "ODF 1.3". Unfortunately this does not only avoid this attribute but all LibreOffice extensions. And LibreOffice has a lot of extensions. To change the file format for saving go to menu Tools > Options > Writer > Load/Save and select the version "1.3" from the drop-down list 'ODF format version:'.

Please report back whether removing the officeooo:rsid attributes solves the problem.
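
One way to check whether rsid attributes are what splits the text is to look for automatic styles whose text properties consist of nothing but rsid values. The sketch below is only an assumption-laden illustration: it assumes the officeooo prefix maps to the namespace "http://openoffice.org/2009/office" and that the attributes sit on style:text-properties; the function name and the SAMPLE styles (including the rsid values) are made up for the example.

```python
import xml.etree.ElementTree as ET

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
# Assumed namespace URI for the officeooo: prefix used by LibreOffice.
OFFICEOOO_NS = "http://openoffice.org/2009/office"
RSID_ATTRS = {
    f"{{{OFFICEOOO_NS}}}rsid",
    f"{{{OFFICEOOO_NS}}}paragraph-rsid",
}


def rsid_only_styles(content_xml):
    """Return names of styles whose text properties are only rsid attributes."""
    root = ET.fromstring(content_xml)
    names = []
    for style in root.iter(f"{{{STYLE_NS}}}style"):
        props = style.find(f"{{{STYLE_NS}}}text-properties")
        if props is None or not props.attrib:
            continue
        if all(attr in RSID_ATTRS for attr in props.attrib):
            names.append(style.get(f"{{{STYLE_NS}}}name"))
    return names


# Hand-written styles: T1 carries only an rsid, T2 also carries real formatting.
SAMPLE = f"""<office:automatic-styles
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:style="{STYLE_NS}"
    xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"
    xmlns:officeooo="{OFFICEOOO_NS}">
  <style:style style:name="T1" style:family="text">
    <style:text-properties officeooo:rsid="0011a2b3"/>
  </style:style>
  <style:style style:name="T2" style:family="text">
    <style:text-properties fo:font-weight="bold" officeooo:rsid="0012c4d5"/>
  </style:style>
</office:automatic-styles>"""

print(rsid_only_styles(SAMPLE))
```

Spans that reference a style reported here carry no visible formatting of their own, so they are the ones that methods A and B would be expected to eliminate. The sketch only reports; it deliberately does not rewrite the file, because ElementTree may change namespace prefixes on output.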
Comment 4 krumple_sodium 2021-05-23 17:10:56 UTC
I think I know what he is talking about.
If I write text in LibreOffice Writer (I do this since I do word corrections with Writer), then I copy and paste the text into certain websites, websites where I am responding to someone, it posts lots of tags, span and such and lots of newline characters. I use Firefox. It doesn’t happen with all websites. It happens with patheos.
The solution is to do Ctrl-Shift-V, which removes the formatting.
Or, paste the text into a simple text editor, like Notepad and Kate,  and then copy from there and paste it to the website.
Comment 5 Tyco72 2021-05-23 17:50:16 UTC
(In reply to krumple_sodium from comment #4)
> The solution is to do Ctrl-Shift-V, which removes the formatting.
> Or, paste the text into a simple text editor, like Notepad and Kate,  and
> then copy from there and paste it to the website.

Hi, thank you for the hint, but it is not a viable solution. If I convert the text to plain text, I also lose all the other formatting that I need to keep (bold, text color, underline, hyperlinks, etc.).
MS Office does not have this problem. I will run some tests with the option "Store it when changing the document" disabled and report back.
Comment 6 Aron Budea 2021-05-24 02:01:07 UTC
This probably isn't a regression.
Comment 7 Tyco72 2021-05-24 16:13:46 UTC
(In reply to Regina Henschel from comment #3)
> The problem is likely caused by officeooo:rsid attributes. Those are written
> to improve change tracking. You can avoid them with two methods:
> A) Go to menu Tools > Options > Writer > Comparison and uncheck the option
> "Store it when changing the document". To remove the attributes from
> existing documents, you need to save them once in format "1.3" (method B) or
> remove the attributes manually in the file.
> B) Save in "ODF 1.3". Unfortunately this does not only avoid this attribute
> but all LibreOffice extensions. And LibreOffice has a lot of extensions. To
> change the file format for saving go to menu Tools > Options > Writer >
> Load/Save and select the version "1.3" from the drop-down list 'ODF format
> version:'.
> 
> Please report back whether removing the officeooo:rsid attributes solves the
> problem.

Hello. I could do some tests, though not very extensive ones:
- Disabling the option "Store it when changing the document" in Tools > Options > Writer > Comparison
seems to reduce the number of unwanted line breaks, but not those before and after words in a different color or in bold.

- Saving the document in ODF 1.2 (not extended) format (LO 6.4 does not offer 1.3) removes some unwanted line breaks, but not those before and after words formatted in a different color ("star clusters") or in bold ("today").
I will attach the file saved in ODF 1.2 and a screenshot.
Comment 8 Tyco72 2021-05-24 16:14:59 UTC
Created attachment 172303
odt file v1.2
Comment 9 Tyco72 2021-05-24 16:16:12 UTC
Created attachment 172304
Screenshot ODF v1.2
Comment 14 Regina Henschel 2023-12-03 11:43:29 UTC
(In reply to Tyco72 from comment #7)
> (In reply to Regina Henschel from comment #3)

> Hello. I could do some tests, though not very extensive ones:
> - Disabling the option "Store it when changing the document" in Tools >
> Options > Writer > Comparison
> seems to reduce the number of unwanted line breaks, but not those before and
> after words in a different color or in bold.

In case of changing character styles inside a paragraph such <span> elements are essential. That is the necessary markup to describe a change in character style.
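
A small model makes the point concrete. In the sketch below a paragraph is a list of (text, formatting) runs; the run contents and the labels "plain" and "bold" are hypothetical, and this is not how Writer stores text internally. Adjacent runs with identical formatting can be merged, but a boundary has to remain wherever the formatting changes.

```python
from itertools import groupby


def merge_runs(runs):
    """Merge adjacent (text, formatting) runs that share the same formatting."""
    merged = []
    for fmt, group in groupby(runs, key=lambda run: run[1]):
        merged.append(("".join(text for text, _ in group), fmt))
    return merged


# Hypothetical fragmented paragraph: four plain fragments, one bold word, plain again.
RUNS = [
    ("P", "plain"),
    ("rob", "plain"),
    ("a", "plain"),
    ("bly, ", "plain"),
    ("today", "bold"),
    (" is a good day", "plain"),
]

for text, fmt in merge_runs(RUNS):
    print(repr(text), fmt)
```

The six runs collapse to three. The boundaries around the bold word survive the merge, which corresponds to the spans that cannot be removed.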
Comment 15 Tyco72 2023-12-03 17:09:56 UTC
(In reply to Regina Henschel from comment #14)
> In case of changing character styles inside a paragraph such <span> elements
> are essential. That is the necessary markup to describe a change in
> character style.

Thank you for the info. In which LO version has the bug been fixed?
But MS Office did not have this problem with span tags, even when changing styles (bold or color, for example). Or at least it did not produce unwanted line breaks in the text.
Comment 16 Mike Kaganski 2024-05-20 11:45:14 UTC
(In reply to Tyco72 from comment #15)
> In which LO version has the bug been fixed?

What made you think that comment 14 meant that something got fixed in the meantime? It only told you that these spans at the boundaries of the changed attributes are required, and are *not* going to be removed.

The whole issue amounts to: "There is a third-party tool (screenshots in attachment 172262 and attachment 172304) that can't handle spans properly. It is made specifically to work with the markup that Word generates (of course - how could anyone not tune for Word!), but not for correct HTML handling in general, which makes it choke on the normal output that LibreOffice produces. Change your output to please that broken tool."

My take: NOTABUG.
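
For illustration, here is a minimal sketch of a consumer that treats <span> as the inline element it is in HTML: only block-level tags start a new line, so heavily fragmented markup yields the same text as clean markup. The class name, the function name, and the BLOCK_TAGS set are assumptions made for this example; it does not reproduce what WordPress actually does.

```python
from html.parser import HTMLParser

# Tags treated as line-breaking here; <span>, <b>, <a> and similar stay inline.
BLOCK_TAGS = {"p", "div", "br", "li", "pre", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collect text content, starting a new line only at block-level tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def visible_text(html):
    """Return the text of an HTML fragment with inline tags ignored."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


FRAGMENTED = (
    "<p><span>P</span><span>rob</span><span>a</span>"
    "<span>bly, </span><b>today</b></p>"
)
CLEAN = "<p>Probably, <b>today</b></p>"

print(visible_text(FRAGMENTED))
print(visible_text(FRAGMENTED) == visible_text(CLEAN))
```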
Comment 17 Buovjaga 2024-11-21 14:52:00 UTC
There is bug 148543 for extra spans without any style or formatting changes (now bibisected, but to a huge commit range in 3.6), so let's close this per the last comment.