Bug 137783 - FORMATTING: HTML-EXPORT: Excess character font code added when editing inside a paragraph - appears to be individual character attribute settings being added
Status: RESOLVED DUPLICATE of bug 141498
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 7.0.0.1 rc (earliest affected)
Hardware: All Linux (All)
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: (X)HTML-Export
Reported: 2020-10-27 00:11 UTC by Ernest Bywater
Modified: 2021-04-05 16:10 UTC

See Also:
Crash report or crash signature:


Attachments
html file of the odt file saved as html without any code-cleaning editing (27.31 KB, text/html)
2020-10-27 00:11 UTC, Ernest Bywater
files of the issue, as requested (24.69 KB, application/vnd.oasis.opendocument.text)
2020-11-17 13:24 UTC, Ernest Bywater
2nd file (13.54 KB, text/html)
2020-11-17 13:24 UTC, Ernest Bywater
3rd file (25.51 KB, application/vnd.oasis.opendocument.text)
2020-11-17 13:25 UTC, Ernest Bywater
4th file (14.49 KB, text/html)
2020-11-17 13:25 UTC, Ernest Bywater
5th file (25.91 KB, application/vnd.oasis.opendocument.text)
2020-11-17 13:26 UTC, Ernest Bywater
6th file (15.72 KB, text/html)
2020-11-17 13:27 UTC, Ernest Bywater

Description Ernest Bywater 2020-10-27 00:11:35 UTC
Created attachment 166750 [details]
html file of the odt file saved as html without any code-cleaning editing

The problem doesn't show until I use "Save as HTML", which then shows individual font format code for the changed characters or words. It only affects changes made to paragraphs that were already part of the file when I opened it. Work on paragraphs created in the current session has only the paragraph format code, as usual. But when I go back to change a paragraph from an earlier session, after the file has been saved and reopened, the new characters or words are given format code that isn't visible until I save as HTML. Change one character and I get extra code for the dictionary to use and the font type.

I use Palatino Linotype in 10 point as my default font style, and that doesn't show in paragraphs where the default style was used when the paragraph was first written. But later changes have the individual characters or words given individual font format code. It gets messy when the font colour and weight are also added, which doesn't seem to be done in a uniform way.

Attached is a file of a story I'm working on that has some of the default paragraphs and some of the amended paragraphs. The only format code that should be within a paragraph is for items like bold or italics; otherwise the entire paragraph should be the same. The differences are clear.

I'm using LO 7.0.2.2 on Zorin Linux on a 64-bit AMD Ryzen CPU.

The About data is:

Version: 7.0.2.2
Build ID: 8349ace3c3162073abd90d81fd06dcfb6b36b994
CPU threads: 12; OS: Linux 5.4; UI render: default; VCL: gtk3
Locale: en-AU (en_AU.UTF-8); UI: en-US
Calc: threaded
Comment 1 Ernest Bywater 2020-10-31 13:17:33 UTC
I don't know when this problem started, but I didn't have it when I last did a lot of writing back in early 2019. If LO is going to throw in so much excess format code when I revise a paragraph, it becomes much harder to work with and I may as well stop using LO at all, since it'll be easier to use plain text than to clean out the excess code.
Comment 2 Dieter 2020-11-17 09:25:49 UTC
Thank you for reporting the bug. Please add the odt-file and some more detailed steps to reproduce like
1. Open attachment
2. Change font in first paragraph to ...
3. Change ....
...
4. Save as html

Thank you
=> NEEDINFO
Comment 3 Ernest Bywater 2020-11-17 13:22:20 UTC
(In reply to Dieter from comment #2)
> Thank you for reporting the bug. Please add the odt-file and some more
> detailed steps to reproduce like
> 1. Open attachment
> 2. Change font in first paragraph to ...
> 3. Change ....
> ...
> 4. Save as html
> 
> Thank you
> => NEEDINFO

Attached are 6 files: 3 as ODT and 3 as HTML. I took a file originally created some years ago and reduced it to only a dozen or so paragraphs, all of which use set styles. The original is example_file_1.odt, also saved as example_file_1.html. I then saved the file as example_file_2.odt, made a number of changes to it and saved it again, then saved it as example_file_2.html. This file shows additional format code included within some of the paragraphs for some of the changes, but not all of them. I then saved that file as example_file_3.odt and made further changes, then saved the file again as .odt and .html, with more extra format code within the paragraphs. I even made changes within the last paragraph, which was itself added in the latest round of changes, and extra format code was added to those changes too.

If I had made all of these changes back in early 2019 none of the changes within the paragraphs would have had any extra format code added. This is shown in the original file as there are changes in the paragraphs that had been made between the original file creation in 2017 and mid 2019 without any additional format code being added to the paragraphs.

The code changes made by LO are readily visible in the html files, especially when you compare the html files in text mode. All of the format code within paragraphs in the html files is excess and is visible when a file is viewed in text mode.
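The text-mode comparison described above could be scripted along these lines. This is only a minimal sketch using Python's standard difflib module; the function name diff_exports is made up for illustration, and the default file labels are simply the example file names mentioned in this comment.

```python
import difflib


def diff_exports(old_html: str, new_html: str,
                 old_name: str = "example_file_1.html",
                 new_name: str = "example_file_2.html") -> list[str]:
    """Return a unified diff of two HTML exports, compared as plain text.

    The file names are only labels for the diff header; they default to the
    example names used in this report.
    """
    return list(difflib.unified_diff(
        old_html.splitlines(),
        new_html.splitlines(),
        fromfile=old_name,
        tofile=new_name,
        lineterm="",
    ))


# Possible usage, assuming both exports sit in the current directory:
#   from pathlib import Path
#   old = Path("example_file_1.html").read_text(encoding="utf-8")
#   new = Path("example_file_2.html").read_text(encoding="utf-8")
#   print("\n".join(diff_exports(old, new)))
```

Lines starting with "+" in the output that contain font or span tags inside a paragraph are the additions being reported here.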

I hope this helps you to see what the issue is.
Comment 4 Ernest Bywater 2020-11-17 13:24:22 UTC
Created attachment 167362 [details]
files of the issue, as requested

6 new files as requested and detailed in the reply to the comment.
Comment 5 Ernest Bywater 2020-11-17 13:24:57 UTC
Created attachment 167363 [details]
2nd file

2nd file
Comment 6 Ernest Bywater 2020-11-17 13:25:25 UTC
Created attachment 167364 [details]
3rd file

3rd file
Comment 7 Ernest Bywater 2020-11-17 13:25:57 UTC
Created attachment 167365 [details]
4th file

4th file
Comment 8 Ernest Bywater 2020-11-17 13:26:21 UTC
Created attachment 167366 [details]
5th file

5th file
Comment 9 Ernest Bywater 2020-11-17 13:27:02 UTC
Created attachment 167367 [details]
6th file

6th file - that makes 3 of the odt files and 3 of the html files
Comment 10 Dieter 2020-11-22 06:01:59 UTC
Thank you for the files, but I'm still trying to find out what the problem is. So you are talking about the html-code itself? I can't see any problems when I compare the html documents and the odt documents. Since I only have a very basic knowledge of html-code, I can't help.
Comment 11 Ernest Bywater 2020-11-22 14:38:18 UTC
(In reply to Dieter from comment #10)
> Thank you for the files, but I still try to find out what the problem is. So
> you are talking about html-code itself? Because I can't see any problems if
> I compare html documents and odt documents. Since I only have a very basic
> knowledge about html-code I can't help.

G'day Dieter,

The issue is with the way LO 7 is adding html code when it saves as html. Unlike earlier versions, LO 7 is adding additional html code to individual characters when you make a later change within a paragraph. Take the 6th file in the attachments: the first paragraph has the relevant html code for the paragraph at its start as < p class = " western " style = " margin-bottom : 0cm ; line-height : 100% " > (spaces added to the code to stop it running), while drawing the rest of the required character format code from the < style type = " text / css " > area of the < head > section. In past versions of LO, when I entered the paragraph and made a change, LO did not add any extra html code within the paragraph, as all of the text within the paragraph would be as per the style applied to the paragraph. However, in LO 7, for some reason, when I make a later change to the paragraph LO is adding character format code to the new text. This is shown in the html code of the 4th long paragraph, where I changed the numbers; the html code now shows as:

  3  <font color = " # 000000 " > < span style = " font-weight : normal " > 70 < / span >< / font > ,1 < font color = " # 000000 " > < span style = " font-weight : normal " > 67 < / span >< / font >
words.

whereas what it should've done, and used to do, was include no extra html code due to the paragraph being altered already having a style applied and the code for that style already exists in the < style > area. The changed section should've appeared without any extra html code to be just -  370,167 words.
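Until this is fixed, the hand cleaning could be scripted. The following is a minimal sketch only, not a general HTML cleaner: it assumes the real export spells the wrappers as in the snippet above once the added spaces are removed, and that black text and normal weight are the paragraph defaults. The function name and the pattern list are made up for illustration.

```python
import re

# Wrappers that only restate the paragraph defaults named in this report
# (black text, normal weight). Assumption: adjust this list to the defaults
# of your own paragraph style before using it.
REDUNDANT_WRAPPERS = [
    re.compile(
        r'<span\s+style="font-weight\s*:\s*normal;?\s*"\s*>'
        r'((?:(?!</?span\b).)*?)</span\s*>',
        re.S | re.I,
    ),
    re.compile(
        r'<font\s+color="#000000"\s*>'
        r'((?:(?!</?font\b).)*?)</font\s*>',
        re.S | re.I,
    ),
]


def strip_redundant_wrappers(html: str) -> str:
    """Remove the wrappers listed above, keeping the text they enclose."""
    previous = None
    # Repeat until nothing changes, so nested wrappers are peeled off
    # from the inside out.
    while previous != html:
        previous = html
        for pattern in REDUNDANT_WRAPPERS:
            html = pattern.sub(r"\1", html)
    return html


sample = ('3<font color="#000000"><span style="font-weight: normal">70'
          '</span></font>,1<font color="#000000">'
          '<span style="font-weight: normal">67</span></font> words.')
print(strip_redundant_wrappers(sample))  # prints: 370,167 words.
```

Wrappers that carry a real change, such as a bold span, do not match these patterns and are left alone.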

If I go back to using LO 6 (yes, I found an old copy hanging around and tried it) then the extra code isn't added, as LO assumes the change within the paragraph is the same as the rest of the paragraph as set by the style.

Considering how often I make changes having to clean out such code additions by hand is a real pain.
Comment 12 Ernest Bywater 2020-11-22 14:45:34 UTC
A change between LO 6 and LO 7 has the system applying individual character format code to text typed into a paragraph with an applied style when that text is being altered, instead of just having it use the paragraph's existing style. This issue isn't noticed until the file is saved as html, where the extra code stands out.
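One way to put a number on the difference between exports from the two versions is to count the font and span elements that sit inside paragraphs. This is a minimal sketch using Python's standard html.parser module; the class and function names are made up for illustration, and it assumes every paragraph in the export is closed with an end tag.

```python
from html.parser import HTMLParser


class InlineWrapperCounter(HTMLParser):
    """Count <font> and <span> start tags that occur inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.paragraph_depth = 0
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "p":
            self.paragraph_depth += 1
        elif tag in ("font", "span") and self.paragraph_depth > 0:
            self.count += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.paragraph_depth > 0:
            self.paragraph_depth -= 1


def count_inline_wrappers(html: str) -> int:
    """Return how many inline font/span wrappers appear within paragraphs."""
    parser = InlineWrapperCounter()
    parser.feed(html)
    parser.close()
    return parser.count
```

Running it over exports of the same document from each version would show whether the newer export carries more inline wrappers. Intentional formatting written as a span is counted too, so the figure is only a rough indicator.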
Comment 13 Dieter 2020-11-23 07:43:31 UTC
Thanks again for clarification. As I said before; I'm not really an expert in reading html-code. And even if I confirm the behaviour, I'm not able to assess, if it is a bug or the expected behaviour. I hope somebody else can help.
Comment 14 Ernest Bywater 2020-11-23 09:20:13 UTC
(In reply to Dieter from comment #13)
> Thanks again for clarification. As I said before; I'm not really an expert
> in reading html-code. And even if I confirm the behaviour, I'm not able to
> assess, if it is a bug or the expected behaviour. I hope somebody else can
> help.

G'day Dieter,

While not an HTML expert either, I am familiar enough with it to work with it quite well, and I understand what is happening. However, the problem is NOT an HTML one. It is an issue with the LO 7 code itself: when a paragraph is amended at a later date, LO Writer assigns format code to the individual characters and words at the time the text is added to the paragraph. This is a bug, as LO should NOT be assigning any format code to the added text or characters as part of the typing of the text. Such extra text should be seen by LO and Writer as part of the paragraph, inheriting the existing style code of the paragraph, unless I take extra action to apply attributes to the text by selecting them. In normal LO operations such format code is not visible, but the conversion to HTML makes this additional format code visible.

In the specific example mentioned in my previous comment, all of the text in the paragraph carries the attributes of the paragraph style in the 'body' and 'css' data for the style. Those attributes include the text color of #000000, font weight of normal, font family of Palatino Linotype, and font size of 10pt, as those are the default attributes for the 'Default' paragraph style and the attributes for the 'default text' as defined in LO 7. Yet when the paragraph text is amended with LO 7 set to those 'default' text attributes, LO 7 adds character format code for the font color and the font weight to the new characters, as if I were applying attributes to change the text from the defaults to something else, except that it is done by LO 7 instead of by me using the attribute options.
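The redundancy described above can be stated as a simple check: an inline style is excess when every declaration in it only repeats a value the paragraph style already supplies. This is a minimal sketch with made-up function names; the defaults are the four attributes listed in this comment, and the comparison is plain string matching rather than real CSS parsing.

```python
def parse_declarations(style: str) -> dict[str, str]:
    """Split a CSS declaration string into a {property: value} mapping."""
    declarations = {}
    for part in style.split(";"):
        if ":" not in part:
            continue
        name, value = part.split(":", 1)
        declarations[name.strip().lower()] = value.strip().lower()
    return declarations


# The paragraph defaults as listed in this comment (assumed to be complete).
PARAGRAPH_DEFAULTS = parse_declarations(
    "color: #000000; font-weight: normal; "
    "font-family: Palatino Linotype; font-size: 10pt"
)


def is_redundant(inline_style: str,
                 inherited: dict[str, str] = PARAGRAPH_DEFAULTS) -> bool:
    """True when every inline declaration repeats an inherited value."""
    inline = parse_declarations(inline_style)
    return all(inherited.get(name) == value for name, value in inline.items())
```

For example, is_redundant("font-weight: normal") is True, while is_redundant("font-weight: bold") is False, since bold is a real change from the paragraph style.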

LO 7 Writer should NOT be attaching any format code to the text or individual characters of a paragraph amendment at the time of being typed at all as ALL text being typed should be as per the default text attributes for the paragraph UNTIL such time as I, the operator, highlight text and assign another attribute that is NOT a default attribute to that text.

This problem is within the LO 7 code. It may be within Writer only or in other components as well, but I only notice it in Writer as that's the only component I convert to html.

Interestingly, the more we discuss this the deeper insight I get into what is happening, but not how it's happening within the LO code. While you may not be the one to find and fix where this is happening within the LO 7 Writer code, I hope someone else on the team does know, or can find out, where this is happening and stop it.
Comment 15 Dieter 2020-11-23 14:15:24 UTC
I'm not a developer, so I can't say anything about LO code.
Comment 16 Ernest Bywater 2020-12-11 08:44:25 UTC
Further checking has established the issue relates to Writer adding character attribute settings to individual text changes within a paragraph even when they are the same as the attribute settings for the paragraph style of the paragraph being edited. Because of this I've amended the summary to reflect that information.
Comment 17 Ernest Bywater 2021-04-05 16:10:06 UTC
The fault is not well described in this report, so I have raised new bug 141498 with a better description.
Comment 18 Ernest Bywater 2021-04-05 16:10:57 UTC

*** This bug has been marked as a duplicate of bug 141498 ***