Created attachment 114714 [details]
A simple Writer test file, and the HTML export version of it.

I belong to a writers' forum and use LibreOffice to edit my text, which I then paste into a web form that accepts HTML in order to submit my writing for review. One reviewer noted that in many places there was no space between words. After investigation, it appeared that the HTML produced by LO was so complex that the (unknown) browser (my guess would be IE) did not render the text correctly.

The HTML produced by LO is far more complex than a human would write. It is full of spans and font changes, even where there is no actual change of font and no need to introduce a new span. I found that at every point where I had edited some text, the run of characters was broken by a new <span> and sometimes by a new font descriptor. As a result, the exported HTML is much more complex than it needs to be and also exposes much of the editing history of the text. It also means LO is exposing some of its internal data structuring in the HTML it produces.

The HTML produced is also much larger than it needs to be. Summed over all the web content produced from LO, that means extra bandwidth and extra CPU cycles spent processing redundant HTML, and therefore extra energy as well.

Please see the attached very simple .odt file and the .html exported from it. Note the heavy use of <span> and the redundant style definitions in the CSS, e.g.:

.P1 { font-size:12pt; font-family:Liberation Serif; writing-mode:page; }
.P2 { font-size:12pt; font-family:Liberation Serif; writing-mode:page; }
.P3 { font-size:12pt; font-family:Liberation Serif; writing-mode:page; }

Perhaps there is some reason for making the generated HTML so complex. If so, an option to export a minimal (or at least simple) HTML would be very helpful.
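To make the "minimal HTML" request concrete, here is a small Python sketch of the kind of clean-up being asked for: merging adjacent text runs that carry the same style, so that an edit point alone does not force a new <span>. It is an illustration only and says nothing about how LO's export is implemented; the (style, text) run representation and the name merge_runs are assumptions made for this example.

```python
from itertools import groupby


def merge_runs(runs):
    """Merge adjacent (style, text) runs that share the same style name.

    `runs` is a hypothetical, simplified stand-in for the spans of one
    paragraph: a list of (style_name, text) tuples in document order.
    """
    merged = []
    # groupby only groups *consecutive* items, which is what we want:
    # runs with the same style that are separated by another style stay apart.
    for style, group in groupby(runs, key=lambda run: run[0]):
        merged.append((style, "".join(text for _, text in group)))
    return merged


if __name__ == "__main__":
    runs = [("T1", "The qu"), ("T1", "ack "), ("T2", "lazy"), ("T1", " dog")]
    for style, text in merge_runs(runs):
        print(style, repr(text))
```

Only neighbouring runs are merged, so a genuine style change (such as the italic word) still produces its own span.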
A final note, FWIW: when I ran "tidy" on the .html, it reported two illegal characters, too:

$ tidy -ashtml < LO-HTML-export.html > LO-HTML-tidied.html
line 20 column 605 - Warning: replacing invalid character code 134
line 20 column 606 - Warning: replacing invalid character code 146
Info: Document content looks like XHTML 1.0 Strict
2 warnings, 0 errors were found!

Character codes 128 to 159 (U+0080 to U+009F) are not allowed in HTML; even if they were, they would likely be unprintable control characters. Tidy assumed you wanted to refer to a character with the same byte value in the specified encoding and replaced that reference with the Unicode equivalent.

tidy - HTML syntax checker and reformatter

$ tidy --version
HTML Tidy for Linux released on 25 March 2009

(It didn't simplify the HTML, though - as I'm sure it's technically correct.)

HTH,
luke
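For anyone retesting without tidy, the check behind those two warnings can be approximated with a short Python sketch. It looks for code points in the range U+0080 to U+009F, either as literal characters or as numeric character references, and shows the Windows-1252 character with the same byte value. The use of Windows-1252 is an assumption about which encoding tidy fell back to, and the function names are made up for this example.

```python
import re

# Literal C1 control characters (U+0080 to U+009F).
C1_CHAR = re.compile(r"[\u0080-\u009f]")
# Numeric character references such as &#146; or &#x92;
NUM_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def find_c1(text):
    """Return sorted (offset, code) pairs for every C1 code point in text."""
    hits = [(m.start(), ord(m.group())) for m in C1_CHAR.finditer(text)]
    for m in NUM_REF.finditer(text):
        code = int(m.group(1), 16) if m.group(1) else int(m.group(2))
        if 0x80 <= code <= 0x9F:
            hits.append((m.start(), code))
    return sorted(hits)


def cp1252_equivalent(code):
    """Character with the same byte value in Windows-1252, or None if undefined."""
    try:
        return bytes([code]).decode("cp1252")
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    sample = "dog\u0086s over the lazy fox&#146;s tail"
    for offset, code in find_c1(sample):
        print(offset, code, repr(cp1252_equivalent(code)))
```

Under that assumption, codes 134 and 146 correspond to a dagger (U+2020) and a right single quotation mark (U+2019) respectively.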
I tried reproducing from scratch in LibO 4.4, following in your footsteps by pasting "The quick brown fox jumps over the lazy dog." and then making the changes. My results were much cleaner.

Very minimal style definitions:

.P1 { font-size:12pt; font-family:Liberation Serif; writing-mode:page; }
.T1 { font-style:italic; }
<!-- ODF styles with no properties representable as CSS -->
{ }

No sign of the monstrous span soup breaking words apart:

<p class="P1">The quack brownish fax jumps over the <span class="T1">lazy</span> dog again.</p>

However, with htmltidy I got the same "Character codes 128 to 159 (U+0080 to U+009F) are not allowed in HTML;" warning. I'll change this report to be about those.

As we are only talking about warnings and not errors, lowering priority per https://wiki.documentfoundation.org/images/0/06/Prioritizing_Bugs_Flowchart.jpg

Win 7 Pro 64-bit, Version: 4.4.2.2
Build ID: c4c7d32d0d49397cad38d62472b0bc8acff48dd6
Locale: fi_FI
Migrating Whiteboard tags to Keywords: (needsDevEval) [NinjaEdit]
The paragraph/text styles have different officeooo:rsid attributes. I think that means the document was created with change tracking enabled. I do not think the XHTML export tries to do any cleanup/deduplication of either styles or spans: it just exports all existing styles as CSS and produces an element for each paragraph/span in the source ODF. Anyway, the XHTML export is a horrible XSLT mess that nobody is willing to touch with a 3-meter-long pole, so I very much doubt anyone is going to look into this...
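As an illustration of the style deduplication that the export does not attempt, here is a rough Python sketch that maps every CSS class to the first class with identical declarations. It is not the filter's code (which is XSLT) and it only handles the flat ".Name { prop:value; }" rules seen in the attachments; the regular expression and the function names are assumptions made for this example.

```python
import re

# Matches simple class rules such as ".P1 { font-size:12pt; }".
RULE = re.compile(r"\.([\w-]+)\s*\{([^}]*)\}")


def normalise(body):
    """Turn a declaration block into an order-independent, hashable key."""
    decls = []
    for decl in body.split(";"):
        if not decl.strip():
            continue
        prop, _, value = decl.partition(":")
        decls.append((prop.strip().lower(), value.strip()))
    return tuple(sorted(decls))


def dedupe_styles(css):
    """Map each class name to the first class that has the same declarations."""
    canonical = {}  # normalised declarations -> first class name seen
    mapping = {}    # every class name -> class name to use in its place
    for name, body in RULE.findall(css):
        mapping[name] = canonical.setdefault(normalise(body), name)
    return mapping


if __name__ == "__main__":
    css = """
    .P1 { font-size:12pt; font-family:Liberation Serif; writing-mode:page; }
    .P2 { font-size:12pt; font-family:Liberation Serif; writing-mode:page; }
    .P3 { font-size:12pt; font-family:Liberation Serif; writing-mode:page; }
    .T1 { font-style:italic; }
    """
    print(dedupe_styles(css))
```

With such a mapping, P2 and P3 would both be rewritten to P1, and spans that differ only in an rsid-induced style name would become candidates for merging.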
Created attachment 124053 [details] sample document
Created attachment 124054 [details] HTML export
FYI, the problem is there, and with bigger consequences IMHO, for LO's own XML save file format. Anyway, for that I submitted https://bugs.documentfoundation.org/show_bug.cgi?id=99015. For this bug, if LO just produced compliant (X?)HTML, that would probably be enough.

BTW, is the Export to XHTML the same code that provides Save As HTML, or are they different?

I'm intrigued, too, as to why the generation of X/HTML should be "a horrible XSLT mess" - the function doesn't seem hugely different to producing XML; and having a "horrible mess" suggests there's an underlying problem, maybe even a design problem (I'm thinking especially of the creation of duplicate redundant styles). Though maybe the XSLT "mess" is just due to historical chance, and so many things to do...

It's possible I had used change tracking on the document at some point, BTW. I wouldn't have thought so (since I tend to use it only when exchanging documents with my editor, and the MS in question is nowhere near that stage); but it is a possibility.
(In reply to Luke Kendall from comment #6)
> BTW, is the Export to XHTML the same code that provides Save As HTML, or are
> they different?

At least based on the output, they are completely different.
(In reply to Luke Kendall from comment #6)
> BTW, is the Export to XHTML the same code that provides Save As HTML, or are
> they different?

They are very much different.

> I'm intrigued, too, as to why the generation of X/HTML should be "a horrible
> XSLT mess"

The XHTML export filter is written in XSLT, which is absolutely inappropriate for converting one complex format to another.
** Please read this message in its entirety before responding **

To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.

There have been thousands of bug fixes and commits since anyone checked on this bug report. During that time, it's possible that the bug has been fixed, or the details of the problem have changed. We'd really appreciate your help in getting confirmation that the bug is still present.

If you have time, please do the following:

Test to see if the bug is still present with the latest version of LibreOffice from https://www.libreoffice.org/download/

If the bug is present, please leave a comment that includes the information from Help - About LibreOffice.

If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave a comment that includes the information from Help - About LibreOffice.

Please DO NOT
Update the version field
Reply via email (please reply directly on the bug tracker)
Set the bug's Status field to RESOLVED - FIXED (this status has a particular meaning that is not appropriate in this case)

If you want to do more to help you can test to see if your issue is a REGRESSION. To do so:
1. Download and install oldest version of LibreOffice (usually 3.3 unless your bug pertains to a feature added after 3.3) from http://downloadarchive.documentfoundation.org/libreoffice/old/
2. Test your bug
3. Leave a comment with your results.
4a. If the bug was present with 3.3 - set version to 'inherited from OOo';
4b. If the bug was not present in 3.3 - add 'regression' to keyword

Feel free to come ask questions or to say hello in our QA chat: https://kiwiirc.com/nextclient/irc.freenode.net/#libreoffice-qa

Thank you for helping us make LibreOffice even better for everyone!

Warm Regards,
QA Team

MassPing-UntouchedBug
Does it still produce characters with these codes, after commit e4f53484d255f844169957c411dc3e872af7d3bb ?
(In reply to Mike Kaganski from comment #12)
> Does it still produce characters with these codes, after commit
> e4f53484d255f844169957c411dc3e872af7d3bb ?

I can't repro with any of the attachments, even with 7.2, before that commit. htmltidy version 5.8.0.