Bug 76021 - FORMATTING: Libre Office Writer: save As HTML results in interlaced <strike> and <span> tags
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 4.2.1.1 release
Hardware: Other All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard: BSA
Keywords:
Depends on:
Blocks: (X)HTML-Export
Reported: 2014-03-11 09:51 UTC by Patrick Goetz
Modified: 2023-07-12 16:12 UTC (History)

See Also:
Crash report or crash signature:


Attachments
A Libre Office document which, when saved as HTML, produces interlaced <strike> and <span> tags. (23.57 KB, application/vnd.oasis.opendocument.text)
2014-03-11 09:51 UTC, Patrick Goetz
.docx file used for "Export to xhtml" example discussed in the comment. (13.60 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2014-03-15 10:21 UTC, Patrick Goetz
.odt document with abnormal line breaks and span tags (12.07 KB, application/vnd.oasis.opendocument.text)
2020-07-31 11:10 UTC, Tyco72
Screenshot of HTML source in Firefox 88 (61.71 KB, image/png)
2021-05-18 12:38 UTC, Stéphane Guillou (stragu)

Description Patrick Goetz 2014-03-11 09:51:25 UTC
Created attachment 95585
A Libre Office document which, when saved as HTML, produces interlaced <strike> and <span> tags.

Problem description: 

I am saving *.docx files as HTML using Libre Office 4.2.1.1.  Much to my surprise, I noticed that I'm getting horrifically invalid HTML, with interlaced tags.  As an experiment, I copied and pasted some of the offending text into a Libre Office document, saved it as ODT, and then saved it again as HTML.  The behavior appears to be the same.  The snippet under "Current behavior" is an example of what I'm talking about.

Notice that the <strike> and <span> tags are interlaced, something which should never happen and which makes the file impossible to parse, say using xslt.


Steps to reproduce:
1. See attached Libre Office document
2. Save as HTML
3. Check the resulting HTML document using a text editor

I will test this using the Linux version of Libre Office Writer.

Current behavior:

<p class="western" style="margin-bottom: 0in; line-height: 110%"><b>Advisor</b>&nbsp;shall
mean a person designated to support, assist, consult with<span style="display: inline-block; border: none; padding: 0in"><strike>&nbsp;</strike><strike>and</span></strike>,

Expected behavior:

There are various ways this might be formatted with HTML; it doesn't matter as long as the tags aren't interlaced.        
Operating System: Windows XP
Version: 4.2.1.1 release
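
A minimal sketch of how the interlacing could be detected automatically, assuming Python's standard `html.parser` module. The names `NestingChecker` and `find_interlaced_tags` are made up for illustration and are not part of LibreOffice. The checker keeps a stack of open elements and reports every closing tag that does not match the innermost open one. It would also flag HTML that legitimately omits optional end tags (such as `</p>`), so treat it as a rough check only.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class NestingChecker(HTMLParser):
    """Records closing tags that do not match the innermost open element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open element names
        self.problems = []    # (closing_tag, innermost_open_tag, line_number)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            return
        innermost = self.open_tags[-1] if self.open_tags else None
        self.problems.append((tag, innermost, self.getpos()[0]))
        # Recover by discarding the nearest matching open tag, if any.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break


def find_interlaced_tags(html):
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    return checker.problems


# Shortened version of the snippet under "Current behavior".
sample = ('<p>consult with<span style="border: none"><strike>&nbsp;</strike>'
          '<strike>and</span></strike>,</p>')
for closing, innermost, line in find_interlaced_tags(sample):
    print(f"line {line}: </{closing}> closes while <{innermost}> is still open")
```

For the shortened sample this reports a single problem: `</span>` is closed on line 1 while `<strike>` is still open.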
Comment 1 Urmas 2014-03-11 13:02:20 UTC
HTML is not XML and therefore doesn't require nested tags or XML document structure.
Comment 2 Patrick Goetz 2014-03-11 15:30:07 UTC
> HTML is not XML and therefore doesn't require nested tags or XML document structure.

While this might very well have been true in 1998, all modern versions of HTML are also valid XML with DTDs and doctypes.  In any case, users expect to get valid output, and often the reason someone uses Save as HTML in the first place is that the document is going to be parsed.  It makes no sense to start out with a document that must be valid XML and end up with invalid HTML.

This is quite embarrassing.  I've been recommending that people upgrade to Libre Office from MS Office, but in this case at least Microsoft is putting out valid HTML.  I don't understand what happened; I don't recall seeing this with previous versions of Open Office.
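
To illustrate why the interlaced output cannot be fed to XML tooling, here is a small sketch assuming Python's standard `xml.etree.ElementTree`. `is_well_formed_xml` is a hypothetical helper name. The fragments are shortened from the output in the description, with `&nbsp;` written as `&#160;` because XML does not predefine that entity.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(fragment):
    """Return (True, None) if the fragment parses as XML, else (False, message)."""
    try:
        ET.fromstring(fragment)
        return True, None
    except ET.ParseError as err:
        return False, str(err)


# <span> is closed while the second <strike> is still open.
interlaced = "<p>with<span><strike>&#160;</strike><strike>and</span></strike>,</p>"
# Same content with the closing tags in the right order.
nested = "<p>with<span><strike>&#160;</strike><strike>and</strike></span>,</p>"

print(is_well_formed_xml(interlaced))
print(is_well_formed_xml(nested))
```

The interlaced fragment is rejected by the parser, while the properly nested variant is accepted.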
Comment 3 Patrick Goetz 2014-03-11 15:56:14 UTC
I checked Google Docs as well, converting the same document to HTML and checking to see if the tag structure is xml-valid.  While the HTML output from Google Docs can best be described as bizarre (every possible text formatting is set up as a class and applied using <span class=>), the file is nevertheless valid xml.
Comment 4 Julien Nabet 2014-03-11 22:07:21 UTC
On pc Debian x86-64 with master sources updated today, I can reproduce this.
Comment 5 Tomaz Vajngerl 2014-03-12 09:18:00 UTC
Heh - it's even a bigger mess when you add bold, italics and underline into the mix.
Comment 6 Patrick Goetz 2014-03-12 09:26:18 UTC
I've been doing this -- in particular, coding, and working with XML/HTML -- for a long time.  This smells of horrifically bad coding that probably needs to be rewritten from scratch.  No sensible XML parser would start with valid XML and end up with invalid HTML -- that doesn't make sense.
Comment 7 Julien Nabet 2014-03-12 11:13:17 UTC
I wonder if export->xhtml and save as->html call the same code.
I think I read in a bug report that they could be 2 different parts (one uses an XSLT file).

Miklos: any idea?
Comment 8 Tomaz Vajngerl 2014-03-13 15:00:38 UTC
I agree that HTML export in LO is really bad. It hasn't been worked on since Netscape was king, and it probably needs rewriting to make better use of CSS and SVG, to stop using deprecated HTML features, and to use new HTML5 tags where appropriate (with an easy choice between HTML4 and HTML5). This will probably take some time.

However, if you are trying to parse HTML with an XML parser, then it is your own fault. HTML is not XML: there are subtle differences, such as tags being case sensitive in XML but not in HTML, no need for "/" if an element has no body (for example, <br> is valid HTML but not XML), and improperly nested tags being allowed in HTML. In other words, it is recommended today to write HTML as XML, but it is not mandated, so you cannot rely on that.

If you want a valid XML document export it as XHTML, which is actually using XML as a base.
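
The case-sensitivity and `<br>` differences mentioned in this comment can be seen with a short sketch, assuming Python's standard library (`html.parser` as a lenient HTML tokenizer and `xml.etree.ElementTree` as a strict XML parser). The helper names `parses_as_xml`, `TagLogger` and `html_events` are illustrative only.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def parses_as_xml(fragment):
    """True if the fragment is well-formed XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


class TagLogger(HTMLParser):
    """Collects the start/end tag events seen by the HTML tokenizer."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def html_events(fragment):
    logger = TagLogger()
    logger.feed(fragment)
    logger.close()
    return logger.events


print(parses_as_xml("<p>a<br/>b</p>"))  # self-closed <br/>
print(parses_as_xml("<p>a<br>b</p>"))   # bare <br>, never closed
print(parses_as_xml("<P>a</p>"))        # tag names differ only in case
print(html_events("<P>a<BR>b</p>"))     # HTML tokenizer lowercases names
```

Only the first fragment is well-formed XML. The HTML tokenizer accepts the mixed-case fragment and reports lowercase `p` and `br` events without complaint.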
Comment 9 Tomaz Vajngerl 2014-03-13 15:03:12 UTC
(In reply to comment #7)
> I wonder if export->xhtml and save as->html calls the same part.
> I think having read in a bug that it could be 2 different parts (one uses
> xslt file)
> 
> Miklos: any idea?

Yes, export->xhtml is using XSLT and they aren't using the same code paths.
Comment 10 Patrick Goetz 2014-03-15 10:21:04 UTC
Created attachment 95845
.docx file used for "Export to xhtml" example discussed in the comment.
Comment 11 Patrick Goetz 2014-03-15 10:26:39 UTC
> If you want a valid XML document export it as XHTML, which is actually using XML as a base.

The problem with this is that the xhtml I get when I use "Export to xhtml" is, in my opinion, quite bizarre (however, similar to what you get with "Publish to the Web" using Google Docs).  Using the attached .docx file as a starting point, this is what I get when I export to xhtml (snippet of file):

<p class="P1"><span class="T1">Complainant</span><span class="apple-converted-space"><span class="T2"> </span></span><span class="T2">shall mean (a)</span><span class="apple-converted-space"><span class="T2"> </span></span><span class="T3">the</span><span class="apple-converted-space"><span class="T2"> </span></span><span class="T4">any</span><span class="apple-converted-space"><span class="T2"> </span></span><span class="T2">person or persons from whom the Intake Officer receives information concerning an Offense</span><span class="apple-converted-space"><span class="T2"> </span></span><span class="T4">and who, upon consent of that person(s), is designated a Complainant by the Intake Officer</span><span class="apple-converted-space"><span class="T2"> </span></span><span class="T2">or (b) any Injured Person designated by the Bishop Diocesan who in the Bishop Diocesan’s discretion, should be afforded the status of a Complainant, provided, however, that any Injured Person so designated may decline such designation.</span></p>

(Ignoring that vim on the Windows XP machine I'm using is not reading the UTF-8 characters correctly), notice that common tags such as <b> and <i> are being inserted as classes using the <span> tag.  In this case, .T1 maps to a single CSS property:
	.T1 { font-weight:bold; }

In a longer version of the same document (i.e. including more text from the same original document) you get more complex classes:
	.T1 { font-size:10pt; font-weight:bold; }
	.T13 { font-style:italic; }
	.T14 { font-style:italic; }
	.T15 { font-style:italic; }
	.T16 { font-style:italic; text-decoration:underline; }
	.T17 { font-style:italic; text-decoration:underline; }
	.T18 { font-style:italic; }
	.T19 { font-style:italic; font-weight:bold; }
	.T20 { font-style:italic; font-weight:bold; }
	.T21 { font-style:italic; font-weight:bold; }
	.T22 { font-style:italic; font-weight:bold; }
	.T26 { padding:0in; border-style:none; }
	.T27 { text-decoration:underline; }
	.T28 { text-decoration:underline; padding:0in; border-style:none; }
	.T29 { font-style:italic; text-decoration:underline; }

This is both unreadable and hard to parse.  Moreover, if I take exactly the same document and add some text, then all these classes change!  Also note the strange duplication of classes that do exactly the same thing (.T13, .T14, .T15, .T18).
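
The duplication can be made visible with a small sketch, assuming Python and a deliberately naive regular expression that only handles simple `.name { ... }` rules like the ones listed here. `group_equivalent_classes` is a hypothetical helper, not something LibreOffice provides.

```python
import re
from collections import defaultdict

# Matches simple class rules such as ".T13 { font-style:italic; }".
RULE = re.compile(r"\.(\w+)\s*\{([^}]*)\}")


def group_equivalent_classes(css):
    """Group class names whose declarations are identical (order ignored)."""
    groups = defaultdict(list)
    for name, body in RULE.findall(css):
        declarations = frozenset(
            d.strip().replace(" ", "") for d in body.split(";") if d.strip()
        )
        groups[declarations].append(name)
    # Keep only declaration sets shared by more than one class.
    return {decls: names for decls, names in groups.items() if len(names) > 1}


css = """
.T13 { font-style:italic; }
.T14 { font-style:italic; }
.T15 { font-style:italic; }
.T16 { font-style:italic; text-decoration:underline; }
.T17 { font-style:italic; text-decoration:underline; }
.T18 { font-style:italic; }
"""

for declarations, names in group_equivalent_classes(css).items():
    print(sorted(declarations), "->", names)
```

On this excerpt it groups T13, T14, T15 and T18 together, and T16 with T17.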

In my application, what I need to do is extract the text, preserving simple formatting such as <p>, <b>, <i>, and (deprecated) <strike>, in order to paste this content into another XML document.  This is doable using the exported xhtml, but extremely onerous, since it will require at least 2 passes through a parser: first to add the simple xhtml tags I want (<b>, <i>) that weren't included in the first place, then another pass to strip out all the remaining classes and other xhtml coding that I don't want.

I can't fathom why KISS isn't being applied here:  use basic xhtml tags whenever possible in order to keep the output readable and sane. I've written a fair amount of XML parsing code myself, so I do know something about it.  I can't help but think this is an example of incredibly lazy programming (unless I'm missing something).
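
For what it's worth, the post-processing described above could be sketched roughly as follows, assuming Python's standard library and an exported fragment without XML namespaces (a real XHTML export would carry the XHTML namespace on every element). `PROPERTY_TAGS`, `class_to_tags` and `convert` are hypothetical names, and the property-to-tag table is my own choice, not anything the exporter defines. Spans whose class has no known mapping are simply unwrapped, and attributes on other elements are dropped.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Assumed mapping from single CSS declarations to simple tags.
PROPERTY_TAGS = {
    ("font-weight", "bold"): "b",
    ("font-style", "italic"): "i",
    ("text-decoration", "underline"): "u",
    ("text-decoration", "line-through"): "strike",
}


def class_to_tags(css):
    """Map each class name to the simple tags its declarations imply."""
    mapping = {}
    for name, body in re.findall(r"\.(\w+)\s*\{([^}]*)\}", css):
        tags = []
        for decl in body.split(";"):
            if ":" not in decl:
                continue
            prop, value = (part.strip() for part in decl.split(":", 1))
            tag = PROPERTY_TAGS.get((prop, value))
            if tag:
                tags.append(tag)
        mapping[name] = tags
    return mapping


def convert(elem, class_tags):
    """Serialize elem, replacing <span class=...> with simple tags."""
    inner = escape(elem.text or "") + "".join(
        convert(child, class_tags) + escape(child.tail or "") for child in elem
    )
    if elem.tag == "span":
        for tag in reversed(class_tags.get(elem.get("class"), [])):
            inner = f"<{tag}>{inner}</{tag}>"
        return inner
    return f"<{elem.tag}>{inner}</{elem.tag}>"


css = ".T1 { font-weight:bold; }"
xhtml = ('<p class="P1"><span class="T1">Complainant</span>'
         '<span class="apple-converted-space"><span class="T2"> </span></span>'
         '<span class="T2">shall mean (a)</span></p>')

print(convert(ET.fromstring(xhtml), class_to_tags(css)))
```

On the start of the exported paragraph this yields `<p><b>Complainant</b> shall mean (a)</p>` in a single pass, but it ignores everything the simple table does not cover (font sizes, borders, nested style inheritance).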
Comment 12 Patrick Goetz 2014-03-17 17:12:51 UTC
Intellectual curiosity leads me to add that I'd love for the person who wrote the "Export to xhtml" code to explain why they went with a purely CSS class-based approach, especially since the Google Docs people (who I know have plenty of resources) did the same thing.
Comment 13 Julien Nabet 2014-03-17 22:16:45 UTC
(In reply to comment #12)
> Intellectual curiosity leads me to add that I'd love for the person who
> wrote the "Export to xhtml" code to explain why they went with a purely CSS
> class-based approach; especially since the Google Docs people (who I know
> have plenty of resources) did the same thing.

Patrick: if it's ooo2wordml_text.xsl which does the job, it might be explained like this:
when we look at the history of this file (see http://opengrok.libreoffice.org/history/core/filter/source/xslt/export/wordml/ooo2wordml_text.xsl), we can see it was created in 2004 and, if you leave aside the license changes, the last change was in March 2005 (9 years ago!).
Comment 14 Patrick Goetz 2014-03-17 22:26:49 UTC
ooo2wordml_text.xsl sounds like an XSL script which converts ODF to OOXML -- surely this wouldn't be the same XSL used to export to xhtml?
Comment 15 Julien Nabet 2014-03-18 06:39:01 UTC
Patrick: Oops, you're right of course! :-)
Comment 16 Rev. Bob 2015-04-20 02:22:20 UTC
(In reply to Tomaz Vajngerl from comment #5)
> Heh - it's even a bigger mess when you add bold, italics and underline into
> the mix.

Something tells me this is related to the behavior I describe in bug 89069, especially where bold and italic are treated differently than the other inline formatting options. I was specifically looking at start-of-line behavior, but there may well be more to it...
Comment 17 QA Administrators 2016-09-20 09:32:47 UTC Comment hidden (obsolete)
Comment 18 Tyco72 2020-07-31 11:10:15 UTC
Created attachment 163803
.odt document with abnormal line breaks and span tags

I have the same issue, tested with LO 6.3.6 and 6.4.5. It is a critical bug when you have to paste the content into, for example, WordPress and have to work with HTML!
The code looks messed up, as shown in comment #11, and the pasted text is broken into many rows of only a few characters each (in the preformatted block of WordPress). It makes LO documents useless.
I use the .odt format. In the content.xml file, the text is broken into hundreds of mini-rows, filled with "<text:span text..." and "<style:style style:name=...." tags, for example:

<text:span text:style-name="T1">P</text:span>
<text:span text:style-name="T2">rob</text:span>
<text:span text:style-name="T3">a</text:span>
<text:span text:style-name="T2">bly, </text:span>
<text:span text:style-name="T1">collisions with other galaxies have </text:span>
<text:span text:style-name="T2">already happened to the Milky Way in the past, in</text:span>
<text:span text:style-name="T1">corporating </text:span>
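
As a rough illustration of how fragmented the paragraph is, the following sketch (assuming Python's standard `xml.etree.ElementTree`) counts the spans and joins their text back together. The wrapping `text:p` element and the helper name `paragraph_text` are added for the example. The namespace URI is the ODF text namespace. Real content.xml also uses elements such as `text:s` for repeated spaces, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# The spans from content.xml, wrapped in a paragraph so they parse on their own.
fragment = (
    f'<text:p xmlns:text="{TEXT_NS}">'
    '<text:span text:style-name="T1">P</text:span>'
    '<text:span text:style-name="T2">rob</text:span>'
    '<text:span text:style-name="T3">a</text:span>'
    '<text:span text:style-name="T2">bly, </text:span>'
    '<text:span text:style-name="T1">collisions with other galaxies have </text:span>'
    '<text:span text:style-name="T2">already happened to the Milky Way in the past, in</text:span>'
    '<text:span text:style-name="T1">corporating </text:span>'
    '</text:p>'
)


def paragraph_text(paragraph):
    """Concatenate all text inside a paragraph element, ignoring span boundaries."""
    return "".join(paragraph.itertext())


paragraph = ET.fromstring(fragment)
spans = paragraph.findall(f"{{{TEXT_NS}}}span")
print(len(spans), "spans")
print(repr(paragraph_text(paragraph)))
```

Seven spans with three different automatic styles are used for what reads as a single run of plain text.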

It is a problem in LibreOffice. I have attached the document "sample3.odt".

I don't know exactly how to reproduce the error, but it is a huge problem. It seems to happen more often when editing an already saved document, inserting new text among the rows, or modifying an existing sentence.
The HTML looks like this:

<p align="left"><span style="font-family: Arial, sans-serif;"><span style="font-size: small;"><a href="https://en.wikipedia.org/wiki/Andromeda_Galaxy"><b>Andromeda</b></a> is the galaxy more near to ours, the Milky Way. It is about 2,5 millions of light years from us (which is a little distance in the scale of universe). Andromeda is a big Galaxy, bigger that the Milky Way. It has a diameter of 220.000 light years and contains about <b>1000 billions of stars</b>.</span></span></p>
<p align="left"><span style="font-family: Arial, sans-serif;"><span style="font-size: small;">In comparison, the Milky Way is 170.000 - 200.000 light years large and contains 100-400 billion stars.</span></span></p>
<p align="left"><span style="font-family: Arial, sans-serif;"><span style="font-size: small;">The interesting part is that Andromeda is moving towards us and in about 4,5 billions years Andromeda will collide with the Milky Way. We can do nothing to stop this way! Collisions among galaxies are a common event in the universe.</span></span></p>

I will check whether it also happened with older versions of LO. It probably didn't happen with LO 6.3.0, but I can't be sure, because at that time I didn't use HTML.
Comment 19 Tyco72 2020-08-01 14:54:19 UTC
Update to my comment #18:
It seems that the bug didn't happen with LO 6.0. I have now rolled back to LO 6.0.2 and will report if it also happens with that version.
How is it possible that the bug is not yet fixed? It is a very critical bug for anyone who uses the HTML format, or who pastes the content into WordPress and edits it as HTML.
Comment 20 Stéphane Guillou (stragu) 2021-05-18 12:38:51 UTC
Created attachment 172129
Screenshot of HTML source in Firefox 88

Reproducible with LO 7.2 Alpha0+. Firefox 88 even highlights the offending closing tags in red (see attachment).

Version: 7.2.0.0.alpha0+ / LibreOffice Community
Build ID: 6b09276d157abada74e1a4989700139167207778
CPU threads: 8; OS: Linux 4.15; UI render: default; VCL: gtk3
Locale: en-AU (en_AU.UTF-8); UI: en-US
TinderBox: Linux-rpm_deb-x86_64@86-TDF, Branch:master, Time: 2021-05-14_04:32:30
Calc: threaded
Comment 21 Tyco72 2021-05-18 20:46:47 UTC
(In reply to stragu from comment #20)
Yes, you are right, and the bug actually also affects LO 6.x. I didn't update my description.
For now I am sticking with LO 6.4: for me it isn't worth the effort to update LO until this bug is fixed. It is a nightmare to use LO to work correctly with HTML text.

This bug hasn't even been assigned to anyone. Do you have any news about a fix for this? Is it possible to change the severity to at least 'major'?
Comment 22 Miklos Vajna 2021-05-19 08:20:45 UTC
If this is a regression, then please tag it as such and let's bisect to find the first bad commit. Thanks.
Comment 23 Tyco72 2021-05-19 11:30:19 UTC
(In reply to Miklos Vajna from comment #22)
Hello Miklos, I don't know whether it is a regression. I first noticed it when I started working with WordPress, at which time I was using LO 6.0. Since then, this bug has been present in all LO versions.
Comment 24 Aron Budea 2021-05-23 05:40:27 UTC
(In reply to Tyco72 from comment #18)
> I have the same issue, tested with LO 6.3.6 and 6.4.5. It is a critical bug
> when you have to paste the content for example in the Wordpress and have to
> work with  HTML!
> The code looks messed up as shown in the comment #11, and the pasted text
> looks broken in a lot of rows with few characters each one (in the
> preformatted block of Wordpress). It makes the LO documents useless.
Surely this isn't the same as the originally reported bug, please open a new bug report for your issue.
Comment 25 Tyco72 2021-05-23 09:37:36 UTC
> Surely this isn't the same as the originally reported bug, please open a new
> bug report for your issue.

Hi, thank you. I have created the bug:
Bug 142443
https://bugs.documentfoundation.org/show_bug.cgi?id=142443
Comment 26 QA Administrators 2023-05-24 03:14:45 UTC Comment hidden (obsolete)
Comment 27 Tyco72 2023-07-12 16:12:20 UTC
Hello,
I have tested it with LO 7.5.3.2 (Win 64-bit) and the bug is still present, as I described in comment #18:
https://bugs.documentfoundation.org/show_bug.cgi?id=76021#c18

It was also present in LO 7.4. I don't know about older versions, but comment #20 reports that it also happened with LO 7.2, so I would say that the bug is inherited.