Bug 65925 - FILESAVE: Ugly HTML code when changing capital letters and bold text
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 3.6.0.4 release
Hardware: Other All
Importance: medium minor
Assignee: Not Assigned
URL:
Whiteboard: BSA
Keywords: bibisected, bisected, filter:html, regression
Duplicates: 148974
Depends on:
Blocks: (X)HTML-Export
Reported: 2013-06-19 09:02 UTC by matta2006
Modified: 2022-05-07 13:05 UTC (History)
CC List: 5 users

See Also:
Crash report or crash signature:


Attachments
Starting HTML file (513 bytes, text/html)
2013-06-19 09:02 UTC, matta2006
Screenshot from Chrome (97.79 KB, image/png)
2013-06-20 10:12 UTC, ign_christian

Description matta2006 2013-06-19 09:02:38 UTC
Created attachment 81056
Starting HTML file

Problem description: 

Starting from this HTML text:

<BODY LANG="en-US" DIR="LTR">
<P>This is Black: great.<BR><FONT COLOR="#b84747">This is Colored:
why not.</FONT><BR>This is Black again: fine.</P>
</BODY>

After making some of the text bold and lowercasing some capital letters, Writer produces this ugly code:

<BODY LANG="en-US" DIR="LTR">
<P>This is <B>b</B><B>lack</B>: great.<BR><FONT COLOR="#b84747">This
is </FONT><FONT COLOR="#b84747"><B>c</B></FONT><FONT COLOR="#b84747"><B>olored</B></FONT><FONT COLOR="#b84747">:
why not.</FONT><BR>This is Black again: fine.</P>
</BODY>

However, LO 3.5 (the best version so far, in my opinion) behaved as expected:

<BODY LANG="en-US" DIR="LTR">
<P><B>This is black</B>: great.<BR><FONT COLOR="#b84747"><B>This is
colored</B></FONT><FONT COLOR="#b84747">: why not.</FONT><BR>This is
Black again: fine.</P>
</BODY>

I will attach the simple starting html file.

Steps to reproduce:
1. Open the starting html file with LO Writer/Html
2. Change "This is black" and "This is colored" to bold.
3. Change "Black" and "Colored" to lowercase.
4. Save to another html file.

Do you mind fixing it?
Operating System: All
Version: 4.0.0.3 release
Last worked in: 3.5.0 release
Comment 1 Pedro 2013-06-19 14:18:47 UTC
Confirmed under Windows XP x86 using LO 4.0.3.3. I also tested under LO 3.6.5 and the same excessive code is added.

Editing under LO 3.5.7 produced clean HTML code, as expected.
Comment 2 matta2006 2013-06-20 09:21:12 UTC
This bug is pretty annoying. To avoid it I have to:
1) Remove formatting
2) Save
3) Reload
4) Change to bold
5) Save
6) Reload
7) Change colour
8) Save

or... I have to manually fix the html.


In addition, it also affects simpler scenarios (for example, just changing a letter from lowercase to uppercase in a bold/italic section).

I am actually wondering how this bug was not detected before. It would be worth adding a regression test for it.
Comment 3 ign_christian 2013-06-20 10:12:05 UTC
Created attachment 81103
Screenshot from Chrome

I think it's been fixed in LO 4.0.4.2 (Win7 32-bit).

Please mark as WORKSFORME if you agree.
Comment 4 matta2006 2013-06-20 12:13:18 UTC
LO 4.0.4.2 does not solve the problem.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
	<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=utf-8">
	<TITLE></TITLE>
	<META NAME="GENERATOR" CONTENT="LibreOffice 4.0.4.2 (Windows)">
	<META NAME="CREATED" CONTENT="20130619;10490201">
	<META NAME="CHANGED" CONTENT="20130620;14113793">
</HEAD>
<BODY LANG="en-US" DIR="LTR">
<P><B>This is </B><B>b</B><B>lack</B>: great.<BR><FONT COLOR="#b84747"><B>This
is </B></FONT><FONT COLOR="#b84747"><B>c</B></FONT><FONT COLOR="#b84747"><B>olored</B></FONT><FONT COLOR="#b84747">:
why not.</FONT><BR>This is black again: fine.</P>
</BODY>
</HTML>
Comment 5 Michael Stahl (allotropia) 2013-07-05 21:03:33 UTC
you seriously care about what the HTML code produced by Writer looks like,
as opposed to how the HTML document is rendered by Web browsers?

and don't mind the many elements that it inserts that no application other than Writer can read?

well the problem was obviously introduced by the RSIDs in 3.6
(commit 062eaeffe7cb986255063bb9b0a5f3fb3fc8e34c)
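
To illustrate how a per-edit revision ID could split otherwise identical text runs, here is a toy model in Python. It is only a sketch and not Writer's actual data structures: the run tuples, the "rsid" and "bold" keys, and the to_html and coalesce helpers are all hypothetical names chosen for this example. It assumes that each run carries a revision session ID as one of its attributes, and that an exporter which emits one element per run will therefore split a bold word wherever the ID changes.

```python
# Toy model (assumption): each run is (text, attributes), and "rsid" is an
# invisible attribute recording the editing session that produced the run.
runs = [
    ("This is ", {"rsid": 1}),
    ("b", {"bold": True, "rsid": 3}),     # letter retyped in a later session
    ("lack", {"bold": True, "rsid": 2}),
    (": great.", {"rsid": 1}),
]


def to_html(runs):
    """Naive export: one <B> element per bold run."""
    parts = []
    for text, attrs in runs:
        parts.append(f"<B>{text}</B>" if attrs.get("bold") else text)
    return "".join(parts)


def coalesce(runs):
    """Merge neighbouring runs whose visible attributes (all but rsid) match."""
    merged = []
    for text, attrs in runs:
        visible = {k: v for k, v in attrs.items() if k != "rsid"}
        if merged and merged[-1][1] == visible:
            merged[-1] = (merged[-1][0] + text, visible)
        else:
            merged.append((text, visible))
    return merged


print(to_html(runs))            # This is <B>b</B><B>lack</B>: great.
print(to_html(coalesce(runs)))  # This is <B>black</B>: great.
```

In this model, ignoring the invisible attribute before export is enough to get back the compact markup that LO 3.5 produced.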
Comment 6 Björn Michaelsen 2014-10-16 14:59:12 UTC Comment hidden (obsolete)
Comment 7 Rev. Bob 2015-02-03 00:55:57 UTC
(In reply to Michael Stahl from comment #5)
> you seriously care about what the HTML code produced by Writer looks like,
> as opposed to how the HTML document is rendered by Web browsers?

Absolutely. Bad code is bad code.

> and don't mind the many elements that it inserts that no application other
> than Writer can read?

When I save a copy of a Writer ODT file as HTML, this is by far the biggest problem I have with the output. Most of the other issues I have with that output pipeline amount to deleting the HTML > HEAD > STYLE element and tweaking the BODY tag.

Incidentally, this behavior persists in LO Writer 4.4.0.3, and I suspect that it's related to revision tracking. Steps to replicate:

1. Create a new Writer document. (If a template comes up, select everything, apply the Text Body paragraph style, and delete everything.)
2. Type the following line as the document's only content:

Take a look at this bug.

3. Highlight everything and use Ctrl-M to clear direct formatting, just to ensure a clean slate.
4. Select the word "this" and use Ctrl-I to italicize it.
5. Click in a random place to clear the selection, then replace the letters "is" in "this" with "at" - "Take a look at _that_ bug."
6. File > Save a copy > HTML format.

Open the file, and you'll see the paragraph rendered thus:

<p class="western">Take a look at <i>th</i><i>at</i> bug.</p>
Comment 8 Rev. Bob 2015-02-03 01:01:46 UTC
Comment 11 on bug 76021 appears to stem from the same issue:

"Moreover, if I take exactly the same document and add some text, then all these classes change!  Also note the strange duplication of classes that do exactly the same thing (.T13,.T14,.T15,.T18)"

This is notable in that 76021 relates to "export as XHTML" whereas 65925 involves "save as HTML" - implying that the root problem is in Writer rather than either of those filters.
Comment 9 Robinson Tryon (qubit) 2015-12-13 11:16:22 UTC Comment hidden (obsolete)
Comment 10 QA Administrators 2017-09-01 11:16:35 UTC Comment hidden (obsolete)
Comment 11 Xisco Faulí 2017-09-22 23:25:23 UTC
*** Bug 112563 has been marked as a duplicate of this bug. ***
Comment 12 Rev. Bob 2017-09-27 22:28:36 UTC
In response to comment 10's request for a retest, I performed the steps described in comment 7 and got the same result; nothing has changed.

This is on the Portable build of 5.4.1.2, under Windows 10 Home (32-bit, version 1703, build 15063.540). I have no reason to expect that the test would yield different results on other platforms.
Comment 13 matta2006 2017-09-28 13:02:22 UTC
Problem is still there.

For information, I still use LO 4.3.7.2 when I need to edit HTML files. It is not perfect, but the code it generates is clearly better than that of the newer versions.
Comment 14 QA Administrators 2019-03-02 03:51:58 UTC Comment hidden (obsolete)
Comment 15 QA Administrators 2021-03-18 04:17:45 UTC Comment hidden (obsolete)
Comment 16 Stéphane Guillou (stragu) 2021-05-23 14:04:02 UTC
Reproduced in 7.2 alpha1+ following the steps in comment 7. A closing tag directly followed by an exactly equivalent opening tag should be removed from the HTML code.

Version: 7.2.0.0.alpha1+ / LibreOffice Community
Build ID: e9da22d3308557640e0edc45f72b1897f016d19b
CPU threads: 8; OS: Linux 4.15; UI render: default; VCL: gtk3
Locale: en-AU (en_AU.UTF-8); UI: en-US
TinderBox: Linux-rpm_deb-x86_64@86-TDF, Branch:master, Time: 2021-05-21_07:07:00
Calc: threaded
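
The rule stated in this comment (drop a closing tag that is directly followed by an exactly equivalent opening tag) can be sketched as a small post-processing step in Python. This is a workaround sketch for already-exported files, not the export filter's code; merge_once, merge_adjacent and tag_name are hypothetical helper names. It assumes reasonably well-nested markup and treats two opening tags as equivalent only if their text is identical, so differences in case or attribute order prevent a merge.

```python
import re

TAG = re.compile(r"(<[^>]+>)")
VOID = {"br", "hr", "img", "meta", "input", "link"}  # elements with no closing tag


def tag_name(token):
    """Return the lowercase element name of a tag token, or None for text."""
    m = re.fullmatch(r"</?([A-Za-z][A-Za-z0-9]*)\b[^>]*>", token)
    return m.group(1).lower() if m else None


def merge_once(html):
    """One pass: drop '</x>' when the next token reopens the same element."""
    tokens = [t for t in TAG.split(html) if t]
    out, stack = [], []  # stack holds the opening tags still in effect
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        name = tag_name(tok)
        if name and tok.startswith("</"):
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            if stack and tag_name(stack[-1]) == name:
                if nxt == stack[-1]:
                    i += 2  # skip both tags; the element simply stays open
                    continue
                stack.pop()
        elif name and name not in VOID:
            stack.append(tok)
        out.append(tok)
        i += 1
    return "".join(out)


def merge_adjacent(html):
    """Repeat until stable, since merging an outer pair can expose an inner one."""
    while True:
        merged = merge_once(html)
        if merged == html:
            return merged
        html = merged


print(merge_adjacent('<p class="western">Take a look at <i>th</i><i>at</i> bug.</p>'))
# <p class="western">Take a look at <i>that</i> bug.</p>
```

Whitespace or any other text between the closing and opening tag is left alone, so `<i>a</i> <i>b</i>` is not merged.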
Comment 17 Michael Warner 2022-05-07 10:54:05 UTC
*** Bug 148974 has been marked as a duplicate of this bug. ***
Comment 18 Eldar 2022-05-07 13:05:02 UTC
(In reply to Michael Warner from comment #17)
> *** Bug 148974 has been marked as a duplicate of this bug. ***

I'm the author of bug 148974, and I'm surprised that this bug hasn't been fixed since 2013. Many users use LibreOffice to write articles and then paste the formatted text into a website CMS editor. This bug generates bloated HTML code that can be penalized by search engines.