Bug 90540 - Character codes 128 to 159 (U+0080 to U+009F) should not appear in HTML/XHTML export
Status: RESOLVED WORKSFORME
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: filters and storage
Version (earliest affected): 4.2.7.2 release
Hardware: x86-64 (AMD64) All
Importance: low trivial
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: (X)HTML-Export
 
Reported: 2015-04-10 08:26 UTC by Luke Kendall
Modified: 2024-01-17 12:43 UTC

See Also:
Crash report or crash signature:


Attachments
A simple writer test file, and the HTML export version of it. (16.03 KB, application/zip)
2015-04-10 08:26 UTC, Luke Kendall
sample document (16.49 KB, application/vnd.oasis.opendocument.text)
2016-04-04 07:48 UTC, David Tardon
HTML export (3.04 KB, text/html)
2016-04-04 07:48 UTC, David Tardon

Description Luke Kendall 2015-04-10 08:26:08 UTC
Created attachment 114714
A simple writer test file, and the HTML export version of it.

I belong to a writers' forum. I use LibreOffice to edit my text and then paste submissions into a web form that accepts HTML, to provide my writing for review.  One reviewer noted that in many places there was no space between words.  After investigation, it appeared that the HTML produced by LO was so complex that the (unknown) browser (my guess would be IE) did not render the text correctly.
Note that the HTML produced by LO is far more complex than a human would write.  It's full of spans and font changes, even when there is no actual change of font nor any need to introduce a new span.  After some investigation, I found that at every point where I had edited some text, the run of characters was broken by a new <span> and sometimes by a new font descriptor.  As a result, the exported HTML becomes much more complex than it needs to be and also exposes much of the editing history of the text.  This also means LO is exposing some of its internal data structuring in the HTML it produces.
The HTML produced is also much larger than it needs to be.  This means if you add up all the HTML web content produced from LO, LO is contributing to increased bandwidth use, and increased CPU cycles in processing the extra (redundant) HTML, consuming extra energy as well.
Please see the attached very simple .odt file and the .html exported from it.  Note the heavy use of <span> and redundant definitions of styles in the CSS:

E.g.
        .P1 { font-size:12pt; font-family:Liberation Serif; writing-mode:page; }
        .P2 { font-size:12pt; font-family:Liberation Serif; writing-mode:page; }
        .P3 { font-size:12pt; font-family:Liberation Serif; writing-mode:page; }

Perhaps there is some reason for making the generated HTML so complex.  If so, an option to export a minimal (or at least, simple) HTML would be very helpful.
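
As a rough illustration of the redundancy described above, the following Python sketch groups CSS rules whose declarations are identical and emits one combined rule per group. It is not how LibreOffice's exporter works; the function names are hypothetical, the regular expression only handles simple single-class rules like the ones quoted, and the .T1 rule is included only to show that a distinct rule survives.

```python
import re

# The three duplicate rules quoted in the report, plus one distinct rule
# (.T1) included only to show that non-duplicates are kept separate.
CSS = """\
.P1 { font-size:12pt; font-family:Liberation Serif; writing-mode:page; }
.P2 { font-size:12pt; font-family:Liberation Serif; writing-mode:page; }
.P3 { font-size:12pt; font-family:Liberation Serif; writing-mode:page; }
.T1 { font-style:italic; }
"""

# Matches simple rules of the form ".Name { declarations }" only.
RULE = re.compile(r"(\.[\w-]+)\s*\{([^}]*)\}")


def normalize(body):
    """Return the declarations as a sorted tuple, ignoring order and spacing."""
    decls = (d.strip() for d in body.split(";"))
    return tuple(sorted(d for d in decls if d))


def merge_duplicate_rules(css):
    """Combine selectors that share exactly the same set of declarations."""
    groups = {}  # dicts keep insertion order, so output order is stable
    for selector, body in RULE.findall(css):
        groups.setdefault(normalize(body), []).append(selector)
    lines = []
    for decls, selectors in groups.items():
        lines.append(", ".join(selectors) + " { " + "; ".join(decls) + "; }")
    return "\n".join(lines)


merged = merge_duplicate_rules(CSS)
print(merged)
```

With the sample above, the three .P rules collapse into a single rule with a grouped selector, while .T1 is left alone. A real cleanup would also need to rewrite or keep the class names used in the document body, which this sketch does not attempt.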

A final note, FWIW: when I ran "tidy" on the .html, it reported two illegal characters, too:

$ tidy -ashtml < LO-HTML-export.html > LO-HTML-tidied.html
line 20 column 605 - Warning: replacing invalid character code 134
line 20 column 606 - Warning: replacing invalid character code 146
Info: Document content looks like XHTML 1.0 Strict
2 warnings, 0 errors were found!


Character codes 128 to 159 (U+0080 to U+009F) are not allowed in HTML;
even if they were, they would likely be unprintable control characters.
Tidy assumed you wanted to refer to a character with the same byte value in the 
specified encoding and replaced that reference with the Unicode equivalent.

tidy - HTML syntax checker and reformatter
$ tidy --version
HTML Tidy for Linux released on 25 March 2009
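
For anyone wanting to check an export without tidy, here is a minimal Python sketch that scans already-decoded text for code points in the U+0080 to U+009F range and reports where they occur. The function names and the sample string are hypothetical. The Windows-1252 lookup is only an assumption about what tidy means by "the same byte value in the specified encoding"; it does not reproduce tidy's own logic.

```python
C1_RANGE = range(0x80, 0xA0)  # U+0080 .. U+009F inclusive


def find_c1_controls(text):
    """Return (line, column, code point) for each C1 control character.

    Lines are split on "\n" only, because str.splitlines() would also
    split on U+0085, which is itself one of the characters searched for.
    """
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) in C1_RANGE:
                hits.append((line_no, col_no, ord(ch)))
    return hits


def cp1252_guess(code_point):
    """Return the Windows-1252 character for this byte value, or None."""
    try:
        return bytes([code_point]).decode("cp1252")
    except UnicodeDecodeError:
        return None  # a few byte values are undefined in Windows-1252


# Hypothetical sample containing the two codes tidy complained about.
sample = "quick\u0086brown\nlazy\u0092dog"
hits = find_c1_controls(sample)
for line_no, col_no, code in hits:
    guess = cp1252_guess(code)
    print("line %d column %d: code %d (U+%04X), cp1252 guess: %r"
          % (line_no, col_no, code, code, guess))
```

Under that assumption, code 134 would correspond to a dagger (U+2020) and code 146 to a right single quotation mark (U+2019), which suggests the exported text contained Windows-1252 bytes that were carried over as raw code points.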


(It didn't simplify the HTML, though - as I'm sure it's technically correct.)

HTH,

luke
Comment 1 Buovjaga 2015-04-16 17:41:57 UTC
I tried reproducing from scratch in LibO 4.4, following in your footsteps by pasting "The quick brown fox jumps over the lazy dog." and then doing the changes.

My results were much cleaner.

Very minimal style definitions:

.P1 { font-size:12pt; font-family:Liberation Serif; writing-mode:page; }
.T1 { font-style:italic; }
<!-- ODF styles with no properties representable as CSS -->
{ }

No sign of the monstrous span soup breaking words apart:

<p class="P1">The quack brownish fax jumps over the <span class="T1">lazy</span> dog again.</p>

However, with htmltidy I got the same "Character codes 128 to 159 (U+0080 to U+009F) are not allowed in HTML;"

I'll change this report to be about those.

As we are only talking about warnings and not errors, lowering priority per https://wiki.documentfoundation.org/images/0/06/Prioritizing_Bugs_Flowchart.jpg

Win 7 Pro 64-bit, Version: 4.4.2.2
Build ID: c4c7d32d0d49397cad38d62472b0bc8acff48dd6
Locale: fi_FI
Comment 3 David Tardon 2016-04-04 07:47:33 UTC
The paragraph/text styles have different officeooo:rsid attribute. I think that means the document was created with change tracking enabled. I do not think the XHTML export tries to do any cleanup/deduplication of either styles or spans: it just exports all existing styles as CSS and produces an element for each paragraph/span in the source ODF. Anyway, the XHTML export is a horrible XSLT mess that nobody is willing to touch with a 3-meter-long pole, so I doubt it very much anyone is going to look into this...
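
To make the cleanup idea concrete, here is a small Python sketch of the kind of span merging the export does not do: adjacent text runs are joined when their styles differ only in a revision identifier. The data model is entirely hypothetical (a plain "rsid" key stands in for the officeooo:rsid attribute, and the style names and values are made up); it is not based on the XSLT filter's actual structure.

```python
from itertools import groupby

# Hypothetical automatic styles: T1/T2 differ only by rsid, as do T3/T4.
styles = {
    "T1": {"font-style": "italic", "rsid": "001a2b3c"},
    "T2": {"font-style": "italic", "rsid": "001d4e5f"},
    "T3": {"rsid": "0020aaaa"},
    "T4": {"rsid": "0020bbbb"},
}

# Hypothetical text runs as (style name, text), split at each edit point.
runs = [
    ("T3", "The qu"),
    ("T4", "ack brownish fax jumps over the "),
    ("T1", "la"),
    ("T2", "zy"),
    ("T3", " dog again."),
]


def visible_props(style):
    """Return the style's properties without the revision identifier."""
    return frozenset((k, v) for k, v in style.items() if k != "rsid")


def merge_runs(runs, styles):
    """Join adjacent runs whose styles are equal once rsid is ignored."""
    merged = []
    by_props = groupby(runs, key=lambda run: visible_props(styles[run[0]]))
    for props, group in by_props:
        merged.append((dict(props), "".join(text for _, text in group)))
    return merged


merged = merge_runs(runs, styles)
for props, text in merged:
    print(props, repr(text))
```

In this example the five runs collapse to three: plain text, one italic run for "lazy", and plain text again, which is roughly the shape of the cleaner output shown in comment 1.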
Comment 4 David Tardon 2016-04-04 07:48:14 UTC
Created attachment 124053
sample document
Comment 5 David Tardon 2016-04-04 07:48:35 UTC
Created attachment 124054
HTML export
Comment 6 Luke Kendall 2016-04-04 08:23:41 UTC
FYI, the problem is there, and with bigger consequences IMHO, for LO's own XML save file format.  Anyway, for that I submitted https://bugs.documentfoundation.org/show_bug.cgi?id=99015.
For this bug, if LO just produced compliant (X?)HTML, that would probably be enough.

BTW, is the Export to XHTML the same code that provides Save As HTML, or are they different?

I'm intrigued, too, as to why the generation of X/HTML should be "a horrible XSLT mess" - the function doesn't seem hugely different to producing XML; and having a "horrible mess" suggests there's an underlying problem, maybe even a design problem (I'm thinking especially of the creation of duplicate redundant styles).  Though maybe the XSLT "mess" is just due to historical chance, and so many things to do...

It's possible I had used change tracking on the document at some point, BTW.  I wouldn't have thought so (since I tend to only use it when exchanging documents with my editor, and the MS in question is nowhere near that stage); but it is a possibility.
Comment 7 Buovjaga 2016-04-04 08:28:42 UTC
(In reply to Luke Kendall from comment #6)
> BTW, is the Export to XHTML the same code that provides Save As HTML, or are
> they different?

At least based on the output, they are completely different.
Comment 8 David Tardon 2016-04-05 05:32:52 UTC
(In reply to Luke Kendall from comment #6)
> BTW, is the Export to XHTML the same code that provides Save As HTML, or are
> they different?

They are very much different.

> 
> I'm intrigued, too, as to why the generation of X/HTML should be "a horrible
> XSLT mess"

The XHTML export filter is written in XSLT, which is absolutely inappropriate for converting one complex format to another.
Comment 12 Mike Kaganski 2024-01-17 05:54:13 UTC
Does it still produce characters with these codes, after commit e4f53484d255f844169957c411dc3e872af7d3bb ?
Comment 13 Buovjaga 2024-01-17 12:43:46 UTC
(In reply to Mike Kaganski from comment #12)
> Does it still produce characters with these codes, after commit
> e4f53484d255f844169957c411dc3e872af7d3bb ?

I can't repro with any of the attachments, even with 7.2, before that commit. htmltidy version 5.8.0.