Bug 148543 - Excess character format code embedded in Writer documents
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 7.0.0.3 release (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: Formatting-Text-Diverse
Reported: 2022-04-12 22:31 UTC by Ernest Bywater
Modified: 2023-05-21 18:28 UTC (History)
CC List: 3 users

See Also:
Crash report or crash signature:


Attachments
original .odt file with edits in it (23.61 KB, application/vnd.oasis.opendocument.text)
2022-04-30 00:36 UTC, Ernest Bywater
original with edits saved as html (41.12 KB, text/html)
2022-04-30 00:36 UTC, Ernest Bywater
new version of .odt file (17.91 KB, application/vnd.oasis.opendocument.text)
2022-04-30 00:37 UTC, Ernest Bywater
new version of .odt file as html file (35.82 KB, text/html)
2022-04-30 00:38 UTC, Ernest Bywater

Description Ernest Bywater 2022-04-12 22:31:43 UTC
Description:
When I edit a document several days after first writing and saving it, the newly entered text is assigned its own format code, which is embedded in the document instead of the text inheriting the paragraph style. This makes the document larger and causes issues when converting the finished document from .ODT to other formats.

The way it SHOULD work is that when I type a paragraph with an applied style, all of the format code is stored within the document at the start of the paragraph, and any text entered into the paragraph later inherits the format of the paragraph. What IS happening is that the initial text is saved as it should be, but a later edit has the characters of the new text assigned their own format code, which is embedded into the document as well.

The original text is saved as:
< font type > < font size > < font color > < alignment > then the general text < end of paragraph >

The revised text is saved as:
< font type > < font size > < font color > < alignment > then the general < font type > < font size > < font color > revised text < end of paragraph >

Another later revision would be saved as:
< font type > < font size > < font color > < alignment > then the general < font type > < font size > < font color > revised text < font type > < font size > < font color > and more revised text < end of paragraph >

The brackets < > show the type of format code that is saved with the document. This code is not readily visible to the user in the .ODT format but does show up in other formats.

Thus there are two or more sets of format code where only one is needed, and it all gets transferred over when the document is converted from .ODT to another format like .HTML, where such code is more easily visible. This problem ONLY appears where edits have been made after the document was saved and returned to later. Unedited paragraphs do not have the extra format code.

I am now using an AMD system with Linux, but have noticed the same problem on Intel systems using Windows. I'm not sure when it started, but I have noticed it with every version of LO 7.

Steps to Reproduce:
1. Create an ODT document.
2. Save the document and close the file.
3. Reopen the document hours later and enter edits within the paragraphs previously typed.
4. Save the document as another format to make the hidden format code visible (it is most visible in HTML).

Actual Results:
<p class="western" style="line-height: 100%; margin-bottom: 0cm">Both James and John are fit and while their bags are big they aren't heavy. However, they're easy to carry on their backs by the leather straps they're using over their shoulders to hold the bags in place.
Their long confident strides eat up the miles along the road. In the late afternoon or early evening they intend to find local inns to stay in for the night. There is <font color="#000000"><span style="font-weight: normal">a</span></font><font color="#000000"><span style="font-weight: normal"> lot</span></font> of day left when they dock so John and James start walking home.</p>

Expected Results:
<p class="western" style="line-height: 100%; margin-bottom: 0cm">Both
James and John are fit and while their bags are big they aren't
heavy. However, they're easy to carry on their backs by the leather
straps they're using over their shoulders to hold the bags in place. Their long confident strides eat up the miles along the road. In the late afternoon or early evening they intend to find local inns to stay in for the night. There is a lot of day left when they dock so John and James start walking home.</p>


Reproducible: Always


User Profile Reset: Yes



Additional Info:
In the example above, the last sentence of the original text was:

There is half of the day left when they dock so John and James start walking home.

This was edited to:

There is a lot of the day left when they dock so John and James start walking home.

The extra format code was inserted and applied to the changed words instead of leaving them to inherit the format of the rest of the paragraph from its class style of 'western'. This is extremely troublesome to fix when converting to HTML and EPUB documents.
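To check an exported HTML file for this kind of leftover inline formatting, something like the following might help. It is only a minimal sketch using Python's standard `html.parser`; the class and function names are made up for illustration, and it reports every `<font>`/`<span>` run, including ones that were applied on purpose.

```python
from html.parser import HTMLParser


class InlineRunFinder(HTMLParser):
    """Collect the text found inside <font> or <span> wrappers."""

    INLINE = {"font", "span"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of <font>/<span> wrappers currently open
        self.current = []   # text pieces of the run being read
        self.runs = []      # finished runs, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.INLINE:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.INLINE and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Outermost wrapper closed: the run is complete.
                self.runs.append("".join(self.current))
                self.current = []

    def handle_data(self, data):
        if self.depth > 0:
            self.current.append(data)


def inline_runs(html):
    """Return the text of each top-level <font>/<span> run in the HTML."""
    finder = InlineRunFinder()
    finder.feed(html)
    finder.close()
    return finder.runs


# Fragment taken from the "Actual Results" paragraph in this report.
SAMPLE = (
    'There is <font color="#000000"><span style="font-weight: normal">a</span></font>'
    '<font color="#000000"><span style="font-weight: normal"> lot</span></font> of day left'
)
print(inline_runs(SAMPLE))  # ['a', ' lot']
```

Each entry in the result is a piece of text that carries its own formatting, so in a paragraph that should be uniformly styled, every entry points at an edit that picked up extra format code.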

This was NOT a problem in LO 3, 4, or 5, and I'm not sure about LO 6.

Version: 7.3.1.3 / LibreOffice Community
Build ID: 30(Build:3)
CPU threads: 12; OS: Linux 5.15; UI render: default; VCL: kf5 (cairo+xcb)
Locale: en-AU (en_AU.UTF-8); UI: en-US
7.3.1-1
Calc: threaded
Comment 1 Dieter 2022-04-29 08:45:22 UTC
I can't confirm with

Version: 7.3.3.1 (x64) / LibreOffice Community
Build ID: 1688991ca59a3ca1c74bc2176b274fba1b034928
CPU threads: 4; OS: Windows 10.0 Build 19044; UI render: Skia/Raster; VCL: win
Locale: de-DE (de_DE); UI: en-GB
Calc: CL

Perhaps you can add a sample document as it is after step 2 and give some specific steps on how to modify the document.

I saved the modified document as HTML.
Comment 2 Ernest Bywater 2022-04-30 00:36:17 UTC
Created attachment 179849 [details]
original .odt file with edits in it
Comment 3 Ernest Bywater 2022-04-30 00:36:51 UTC
Created attachment 179850 [details]
original with edits saved as html
Comment 4 Ernest Bywater 2022-04-30 00:37:48 UTC
Created attachment 179851 [details]
new version of .odt file
Comment 5 Ernest Bywater 2022-04-30 00:38:14 UTC
Created attachment 179852 [details]
new version of .odt file as html file
Comment 6 Ernest Bywater 2022-04-30 01:09:27 UTC
I just attached 4 files. 

My system recently updated to Libre Office 7.3.2.2 and that's what was used to create the new .odt file

The four files are in two groups, each an ODT file plus that file saved as an HTML file. The files with 'original' in the title are the original version of the file, which was first created years ago in Libre Office 4 or 5 or maybe earlier and edited a lot over the years. The files with 'new' in the title were recently created by taking the original file, saving it as a plain text file, then saving it again as an ODT file and creating the various paragraph styles and applying them.

The file 'z_layout-new.html' is what I would expect to see with every ODT that is saved as HTML. However, what I get is what you see in 'z_layout-original.html' in every case of a file being saved as HTML, other than this one example.

Since I saw the comment by Dieter I wondered if the issue is related to a change of Libre Office version. I can't rule this out as my most recent story was started in LO 7.3.1.3 and the edits show the problem.

The text in 'z_layout-original' from lines 118 to 121 has (with spaces added to stop the code running):

< blockquote class = "western" >< b >Note:< / b > Due to the main character
and the narrator being Australians UK English is used < font color = "#0000ff" >< i >< span style="font-weight: normal" >in< / span >< / i >< / font >
this story, except for dialogue by a US character where US English is
used in the dialogue and some nouns.< / blockquote >

The first iteration of that paragraph had the word 'for', which was later edited into 'in', and the extra format code appeared with the edit. The extra format code is NOT in the 'z_layout-new' file, as going via plain text removed it. However, that's a hell of a thing to do to a few hundred files adding up to over ten million words.
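As a possible stopgap on the HTML side, the wrappers could be stripped after export instead of round-tripping each file through plain text. The following is only a rough sketch, not a fix for the underlying bug: the function name is invented for illustration, and it removes ALL `<font>` and `<span>` tags, including any that were applied deliberately, so it is only suitable for documents where paragraph styles carry all of the formatting.

```python
import re

# Matches opening and closing <font ...> and <span ...> tags only;
# other inline tags such as <b> and <i> are left alone.
WRAPPER_TAG = re.compile(r"</?(?:font|span)\b[^>]*>", re.IGNORECASE)


def unwrap_inline(html):
    """Remove <font>/<span> tags but keep the text they enclose."""
    return WRAPPER_TAG.sub("", html)


# The edited word from the blockquote in this comment, with the
# anti-rendering spaces removed again.
SAMPLE = (
    'used <font color="#0000ff"><i><span style="font-weight: normal">in'
    '</span></i></font> this story'
)
print(unwrap_inline(SAMPLE))  # used <i>in</i> this story
```

Note that the `<i>` wrapper survives here, so a stray italic like the one in this example would still need separate handling.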

While Dieter's comment makes me hope the issue may vanish when I can upgrade to the latest version of Libre Office, which is an issue I'll cover below, I'm NOT holding my breath on it. I am waiting to see how things go as I complete and edit the stories I'm working on.

................

Libre Office upgrades - comment only as not an issue. Until about a year ago I was using either a version of Windows or a version of Linux based on Debian, so the standard downloads and upgrade routes of the latest versions from your website worked. However, due to an issue with a new graphics card and most Debian versions of Linux at that time, I switched to using Manjaro Linux, which is an Arch Linux based operating system, as it came with the latest AMD graphics drivers built-in. I soon found the instructions for the RPM and DEB packages of LO do NOT natively work on Manjaro. I have been told there is a complex workaround to get them to load, but I've not been able to make that work for me. Thus I now wait until the latest LO upgrade is made available via the Manjaro Repository. That means I'm not always on the very latest version of LO but am often one or two behind. Since the only issue I have with LO is the one in this bug, I can easily live with that.
Comment 7 Dieter 2022-05-24 15:37:27 UTC
I think I can confirm it with

Version: 7.3.4.1 (x64) / LibreOffice Community
Build ID: 13668373362b52f6e3ebcaaecb031bd59a3ac66b
CPU threads: 4; OS: Windows 10.0 Build 19044; UI render: Skia/Raster; VCL: win
Locale: de-DE (de_DE); UI: en-GB
Calc: CL

Steps to reproduce
1. Open a new document and write "This is a test". Save and close.
2. Reopen the document and insert "just" into the text so you now have "This is just a test."
3. Save with a different filename and close.
4. Open content.xml of the second file.

Actual result:
<text:p text:style-name="P1">
This is 
<text:span text:style-name="T1">just </text:span>
a test.
</text:p>

Expected result:
<text:p text:style-name="P1">This is just a test.</text:p>

This is at least similar to bug 142443, but I can't assess whether it is a duplicate.
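Step 4 above can also be scripted rather than done by hand. The following is a minimal sketch using only the Python standard library; the function names are invented for illustration, and it assumes the .odt is an ordinary zip archive with content.xml at the top level and the standard ODF text namespace. It lists every text:span found inside a text:p, whether or not the span is intentional.

```python
import zipfile
import xml.etree.ElementTree as ET

# ODF text namespace, as declared in content.xml.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def spans_in_paragraphs(content_xml):
    """Return (paragraph style, span style, span text) for each text:span in a text:p."""
    root = ET.fromstring(content_xml)
    found = []
    for para in root.iter(f"{{{TEXT_NS}}}p"):
        para_style = para.get(f"{{{TEXT_NS}}}style-name")
        for span in para.iter(f"{{{TEXT_NS}}}span"):
            span_style = span.get(f"{{{TEXT_NS}}}style-name")
            found.append((para_style, span_style, "".join(span.itertext())))
    return found


def spans_in_odt(path):
    """Read content.xml out of an .odt file (a zip archive) and list its spans."""
    with zipfile.ZipFile(path) as odt:
        return spans_in_paragraphs(odt.read("content.xml"))


# Reduced stand-in for content.xml holding the "Actual result" paragraph.
SAMPLE = (
    '<office:document-content'
    ' xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
    f' xmlns:text="{TEXT_NS}">'
    '<office:body><office:text>'
    '<text:p text:style-name="P1">This is '
    '<text:span text:style-name="T1">just </text:span>a test.</text:p>'
    '</office:text></office:body></office:document-content>'
)
print(spans_in_paragraphs(SAMPLE))  # [('P1', 'T1', 'just ')]
```

Running `spans_in_odt` on the file saved in step 3 should then show the T1 span around "just " if the problem is present, and an empty list if the paragraph was saved as expected.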