76021 – FORMATTING: Libre Office Writer: save As HTML results in interlaced <strike> and <span> tags


Status:	RESOLVED DUPLICATE of bug 160017

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Writer
Version: (earliest affected)	4.2.1.1 release
Hardware:	All All

Importance:	medium normal
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2014-03-11 09:51 UTC by Patrick Goetz
Modified:	2024-05-20 11:11 UTC
CC List:	4 users

See Also:	142443
Crash report or crash signature:

Attachments
A Libre Office document which, when saved as HTML, produces interlaced <strike> and <span> tags. (23.57 KB, application/vnd.oasis.opendocument.text) 2014-03-11 09:51 UTC, Patrick Goetz
.docx file used for "Export to xhtml" example discussed in the comment. (13.60 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document) 2014-03-15 10:21 UTC, Patrick Goetz
.odt document with abnormal line breaks and span tags (12.07 KB, application/vnd.oasis.opendocument.text) 2020-07-31 11:10 UTC, Tyco72
Screenshot of HTML source in Firefox 88 (61.71 KB, image/png) 2021-05-18 12:38 UTC, Stéphane Guillou (stragu)


Description Patrick Goetz 2014-03-11 09:51:25 UTC

Created attachment 95585
A Libre Office document which, when saved as HTML, produces interlaced <strike> and <span> tags.

Problem description: 

I am saving *.docx files as html using Libre Office 4.2.1.1.  Much to my surprise, I noticed that I'm getting horrifically invalid html, with interlaced tags.  As an experiment, I copy&pasted some of the offending text into a Libre Office document, saved as ODT, and then saved again as HTML.  The behavior appears to be the same.  Here is an example of what I'm talking about under current behavior.

Notice that the <strike> and <span> tags are interlaced, something which should never happen and which makes the file impossible to parse, say using xslt.


Steps to reproduce:
1. See attached Libre Office document
2. Save as HTML
3. Check the resulting HTML document using a text editor

I will test this using the linux version of Libre Office Writer

Current behavior:

<p class="western" style="margin-bottom: 0in; line-height: 110%"><b>Advisor</b>&nbsp;shall
mean a person designated to support, assist, consult with<span style="display: inline-block; border: none; padding: 0in"><strike>&nbsp;</strike><strike>and</span></strike>,

Expected behavior:

There are various ways this might be formatted with HTML; it doesn't matter as long as the tags aren't interlaced.        
Operating System: Windows XP
Version: 4.2.1.1 release
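
The interlacing shown under "Current behavior" can be checked mechanically. The following is a minimal Python sketch, not part of the original report, that walks the markup with the standard library's html.parser, keeps a stack of open tags, and records every closing tag that does not match the innermost open element. The names NestingChecker, find_misnested and VOID_TAGS are made up for this illustration.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML (partial list, for illustration).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class NestingChecker(HTMLParser):
    """Collects closing tags that do not match the innermost open tag."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, outermost first
        self.errors = []   # (closing tag, expected tag, (line, column))

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
            return
        expected = self.stack[-1] if self.stack else None
        self.errors.append((tag, expected, self.getpos()))
        # Recover by dropping the most recent matching open tag, if there is one.
        if tag in self.stack:
            index = len(self.stack) - 1 - self.stack[::-1].index(tag)
            del self.stack[index]


def find_misnested(html):
    """Return a list of misnested closing tags found in `html`."""
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    return checker.errors
```

Fed the snippet above, this should report a single problem: a closing span where a closing strike was expected. Unclosed elements such as a trailing <p> are deliberately not reported, since HTML allows some end tags to be omitted.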

Comment 1 Urmas 2014-03-11 13:02:20 UTC

HTML is not XML and therefore doesn't require nested tags or XML document structure.

Comment 2 Patrick Goetz 2014-03-11 15:30:07 UTC

> HTML is not XML and therefore doesn't require nested tags or XML document structure.

While this might very well have been true in 1998, all modern versions of HTML are also valid XML with DTDs and Doctypes.  In any case, users expect to get valid output, and often the reason someone is doing Save as HTML in the first place is that the document is going to be parsed.  It makes no sense to start out with a document that must be valid XML and end up with invalid HTML.

This is quite embarrassing.  I've been recommending that people upgrade to Libre Office from MS Office, but in this case at least Microsoft is putting out valid HTML.  I don't understand what happened, I don't recall seeing this with previous versions of Open Office.

Comment 3 Patrick Goetz 2014-03-11 15:56:14 UTC

I checked Google Docs as well, converting the same document to HTML and checking to see if the tag structure is xml-valid.  While the HTML output from Google Docs can best be described as bizarre (every possible text formatting is set up as a class and applied using <span class=>), the file is nevertheless valid xml.

Comment 4 Julien Nabet 2014-03-11 22:07:21 UTC

On pc Debian x86-64 with master sources updated today, I can reproduce this.

Comment 5 Tomaz Vajngerl 2014-03-12 09:18:00 UTC

Heh - it's even a bigger mess when you add bold, italics and underline into the mix.

Comment 6 Patrick Goetz 2014-03-12 09:26:18 UTC

I've been doing this -- in particular, coding, and working with XML/HTML -- for a long time.  This smells of horrifically bad coding that probably needs to be rewritten from scratch.  No sensible XML parser would start with valid XML and end up with invalid HTML -- that doesn't make sense.

Comment 7 Julien Nabet 2014-03-12 11:13:17 UTC

I wonder if export->xhtml and save as->html call the same code.
I think I read in a bug report that they could be two different parts (one uses an XSLT file).

Miklos: any idea?

Comment 8 Tomaz Vajngerl 2014-03-13 15:00:38 UTC

I agree that HTML export in LO is really bad; it hasn't been worked on since Netscape was king, and it probably needs rewriting to make better use of CSS and SVG, to stop using deprecated HTML features, and to use new HTML5 tags where appropriate (with an easy choice between HTML4 and HTML5). This will probably take some time.

However, if you are trying to parse HTML with an XML parser then it is your own fault. HTML is not XML; there are subtle differences: tags are case sensitive in XML but not in HTML, there is no need for "/" if an element has no body (for example, <br> is valid HTML but not XML), and misnested tags are tolerated by HTML parsers. In other words, it is recommended today to write HTML as XML, but it is not mandated, so you cannot rely on that.

If you want a valid XML document export it as XHTML, which is actually using XML as a base.
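
The difference between the two parsing models can be tried out with a strict XML parser. This is a small sketch using Python's standard xml.etree.ElementTree; the helper name is_well_formed_xml and the wrapping <root> element are assumptions made for the example, and the fragments deliberately avoid entities such as &nbsp;, which a plain XML parser would also reject.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(fragment):
    """Return True if `fragment` parses as XML once wrapped in a single root."""
    try:
        ET.fromstring(f"<root>{fragment}</root>")
        return True
    except ET.ParseError:
        return False


# Interlaced tags, as in the reported output: rejected by an XML parser.
interlaced = "<span><strike>and</span></strike>"
# The same content with the closing tags in the right order: accepted.
nested = "<span><strike>and</strike></span>"

print(is_well_formed_xml(interlaced))
print(is_well_formed_xml(nested))
print(is_well_formed_xml("line one<br>line two"))    # HTML-style void element
print(is_well_formed_xml("line one<br/>line two"))   # XML-style empty element
```

An HTML parser accepts all four fragments in some form, which is why output that browsers display without complaint can still be unusable in an XSLT pipeline.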

Comment 9 Tomaz Vajngerl 2014-03-13 15:03:12 UTC

(In reply to comment #7)
> I wonder if export->xhtml and save as->html call the same code.
> I think I read in a bug report that they could be two different parts (one uses
> an XSLT file).
> 
> Miklos: any idea?

Yes, export->xhtml is using XSLT and they aren't using the same code paths.

Comment 10 Patrick Goetz 2014-03-15 10:21:04 UTC

Created attachment 95845
.docx file used for "Export to xhtml" example discussed in the comment.

Comment 11 Patrick Goetz 2014-03-15 10:26:39 UTC

> If you want a valid XML document export it as XHTML, which is actually using XML as a base.

The problem with this is that the xhtml I get when I use "Export to xhtml" is, in my opinion, quite bizarre (however, similar to what you get with "Publish to the Web" using Google Docs).  Using the attached .docx file as a starting point, this is what I get when I export to xhtml (snippet of file):

<p class="P1"><span class="T1">Complainant</span><span class="apple-converted-space"><span class="T2">Â </span></span><span class="T2">shall mean (a)</span><span class="apple-converted-space"><span class="T2">Â </span></span><span class="T3">the</span><span class="apple-converted-space"><span class="T2">Â </span></span><span class="T4">any</span><span class="apple-converted-space"><span class="T2">Â </span></span><span class="T2">person or persons from whom the Intake Officer receives information concerning an Offense</span><span class="apple-converted-space"><span class="T2">Â </span></span><span class="T4">and who, upon consent of that person(s), is designated a Complainant by the Intake Officer</span><span class="apple-converted-space"><span class="T2">Â </span></span><span class="T2">or (b) any Injured Person designated by the Bishop Diocesan who in the Bishop Diocesanâ€™s discretion, should be afforded the status of a Complainant, provided, however, that any Injured Person so designated may decline such designation.</span></p>

(Ignoring that vim on the Windows XP machine I'm using is not reading the UTF-8 characters correctly), notice that common tags such as <b> and <i> are being inserted as classes using the <span> tag.  In this case, .T1 maps to single CSS attribute:
	.T1 { font-weight:bold; }

In a longer version of the same document (i.e. including more text from the same original document) you get more complex classes:
	.T1 { font-size:10pt; font-weight:bold; }
	.T13 { font-style:italic; }
	.T14 { font-style:italic; }
	.T15 { font-style:italic; }
	.T16 { font-style:italic; text-decoration:underline; }
	.T17 { font-style:italic; text-decoration:underline; }
	.T18 { font-style:italic; }
	.T19 { font-style:italic; font-weight:bold; }
	.T20 { font-style:italic; font-weight:bold; }
	.T21 { font-style:italic; font-weight:bold; }
	.T22 { font-style:italic; font-weight:bold; }
	.T26 { padding:0in; border-style:none; }
	.T27 { text-decoration:underline; }
	.T28 { text-decoration:underline; padding:0in; border-style:none; }
	.T29 { font-style:italic; text-decoration:underline; }

This is both unreadable and hard to parse.  Moreover, if I take exactly the same document and add some text, then all these classes change!  Also note the strange duplication of classes that do exactly the same thing (.T13,.T14,.T15,.T18)

In my application, what I need to do is extract the text, preserving simple formatting such as <p>, <b>, <i>, and (deprecated) <strike> in order to paste this content into another xml document.  This is doable using the exported xhtml, but extremely onerous, since it will require at least 2 passes through a parser: first to add the simple xhtml tags I want (<b>, <i>) that weren't included in the first place, then another pass to strip out all the remaining classes and other xhtml coding that I don't want.

I can't fathom why KISS isn't being applied here:  use basic xhtml tags whenever possible in order to keep the output readable and sane. I've written a fair amount of XML parsing code myself, so do know something about it.  I can't help but think this is an example of incredibly lazy programming (unless I'm missing something).
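
As a rough illustration of the first pass described above, here is a Python sketch that reads the generated class rules and rewrites matching <span class="..."> elements as plain <b>, <i> and <u> tags. It is only a sketch under stated assumptions: the function names parse_css and simplify and the DECL_TO_TAG table are invented for this example, the CSS is assumed to consist of simple `.Name { prop:value; }` rules like the ones listed above, and the input is assumed to be a well-formed fragment without an XML namespace (real exported XHTML declares one, so tag names would need the namespace prefix).

```python
import re
import xml.etree.ElementTree as ET

# Assumed mapping from single CSS declarations to simple inline tags.
DECL_TO_TAG = {
    ("font-weight", "bold"): "b",
    ("font-style", "italic"): "i",
    ("text-decoration", "underline"): "u",
}


def parse_css(css):
    """Map each class name to the list of simple tags its rule implies."""
    rules = {}
    for name, body in re.findall(r"\.([\w-]+)\s*\{([^}]*)\}", css):
        tags = []
        for decl in body.split(";"):
            if ":" not in decl:
                continue
            prop, value = (part.strip() for part in decl.split(":", 1))
            tag = DECL_TO_TAG.get((prop, value))
            if tag:
                tags.append(tag)
        rules[name] = tags
    return rules


def simplify(xhtml, css):
    """Rewrite class-based spans as nested simple tags and return the markup."""
    rules = parse_css(css)
    root = ET.fromstring(xhtml)
    for span in list(root.iter("span")):
        tags = rules.get(span.get("class"))
        if not tags:
            continue  # unknown class or no simple formatting: leave untouched
        del span.attrib["class"]
        node = span
        node.tag = tags[0]
        # Each further tag wraps the existing content one level deeper.
        for tag in tags[1:]:
            inner = ET.Element(tag)
            inner.text = node.text
            for child in list(node):
                node.remove(child)
                inner.append(child)
            node.text = None
            node.append(inner)
            node = inner
    return ET.tostring(root, encoding="unicode")
```

Because the lookup is by declared formatting rather than by class name, duplicated classes such as .T13, .T14, .T15 and .T18 all collapse to the same <i> tag, and renumbered classes in a later export do not matter. Spans whose class carries no simple formatting are left alone and would still need the second, stripping pass.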

Comment 12 Patrick Goetz 2014-03-17 17:12:51 UTC

Intellectual curiosity leads me to add that I'd love for the person who wrote the "Export to xhtml" code to explain why they went with a purely CSS class-based approach, especially since the Google Docs people (who I know have plenty of resources) did the same thing.

Comment 13 Julien Nabet 2014-03-17 22:16:45 UTC

(In reply to comment #12)
> Intellectual curiosity leads me to add that I'd love for the person who
> wrote the "Export to xhtml" code to explain why they went with a purely CSS
> class-based approach; especially since the Google Docs people (who I know
> have plenty of resources) did the same thing.

Patrick: if it's ooo2wordml_text.xsl which does the job, it might be explained like this:
when we look at the history of this file (see http://opengrok.libreoffice.org/history/core/filter/source/xslt/export/wordml/ooo2wordml_text.xsl), we can see it was created in 2004 and, if you leave aside the license changes, the last change was in March 2005 (9 years ago!).

Comment 14 Patrick Goetz 2014-03-17 22:26:49 UTC

ooo2wordml_text.xsl sounds like an XSL script which converts ODF to OOXML -- surely this wouldn't be the same XSL used to export to xhtml?

Comment 15 Julien Nabet 2014-03-18 06:39:01 UTC

Patrick: Oops, you're right of course! :-)

Comment 16 Rev. Bob 2015-04-20 02:22:20 UTC

(In reply to Tomaz Vajngerl from comment #5)
> Heh - it's even a bigger mess when you add bold, italics and underline into
> the mix.

Something tells me this is related to the behavior I describe in bug 89069, especially where bold and italic are treated differently than the other inline formatting options. I was specifically looking at start-of-line behavior, but there may well be more to it...

Comment 17 QA Administrators 2016-09-20 09:32:47 UTC Comment hidden (obsolete)

** Please read this message in its entirety before responding **

To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.

There have been thousands of bug fixes and commits since anyone checked on this bug report. During that time, it's possible that the bug has been fixed, or the details of the problem have changed. We'd really appreciate your help in getting confirmation that the bug is still present.

If you have time, please do the following:

Test to see if the bug is still present on a currently supported version of LibreOffice
(5.1.5 or 5.2.1: https://www.libreoffice.org/download/)

If the bug is present, please leave a comment that includes the version of LibreOffice and
your operating system, and any changes you see in the bug behavior

If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave
a short comment that includes your version of LibreOffice and Operating System

Please DO NOT

Update the version field
Reply via email (please reply directly on the bug tracker)
Set the bug's Status field to RESOLVED - FIXED (this status has a particular meaning that is not
appropriate in this case)

If you want to do more to help you can test to see if your issue is a REGRESSION. To do so:
1. Download and install oldest version of LibreOffice (usually 3.3 unless your bug pertains to a feature added after 3.3)

http://downloadarchive.documentfoundation.org/libreoffice/old/

2. Test your bug
3. Leave a comment with your results.
4a. If the bug was present with 3.3 - set version to "inherited from OOo";
4b. If the bug was not present in 3.3 - add "regression" to keyword

Feel free to come ask questions or to say hello in our QA chat: http://webchat.freenode.net/?channels=libreoffice-qa

Thank you for helping us make LibreOffice even better for everyone!

Warm Regards,
QA Team

MassPing-UntouchedBug-20160920

Comment 18 Tyco72 2020-07-31 11:10:15 UTC Comment hidden (off-topic)

Created attachment 163803
.odt document with abnormal line breaks and span tags

I have the same issue, tested with LO 6.3.6 and 6.4.5. It is a critical bug when you have to paste the content into, for example, WordPress and have to work with HTML!
The code looks messed up as shown in comment #11, and the pasted text is broken into many rows of a few characters each (in the preformatted block of WordPress). It makes LO documents useless.
I use the .odt format. In the content.xml file, the text is broken into hundreds of mini-rows, filled with "<text:span text..." and "<style:style style:name=...." tags, for example:

<text:span text:style-name="T1">P</text:span>
<text:span text:style-name="T2">rob</text:span>
<text:span text:style-name="T3">a</text:span>
<text:span text:style-name="T2">bly, </text:span>
<text:span text:style-name="T1">collisions with other galaxies have </text:span>
<text:span text:style-name="T2">already happened to the Milky Way in the past, in</text:span>
<text:span text:style-name="T1">corporating </text:span>

It is a problem in LibreOffice. I have attached the document "sample3.odt".

I don't know exactly how to reproduce the error, but it is a huge problem. It seems to happen more often when editing an already saved document, inserting new text among the rows, or modifying an existing sentence.
The HTML looks like this:

<p align="left"><span style="font-family: Arial, sans-serif;"><span style="font-size: small;"><a href="https://en.wikipedia.org/wiki/Andromeda_Galaxy"><b>Andromeda</b></a> is the galaxy more near to ours, the Milky Way. It is about 2,5 millions of light years from us (which is a little distance in the scale of universe). Andromeda is a big Galaxy, bigger that the Milky Way. It has a diameter of 220.000 light years and contains about <b>1000 billions of stars</b>.</span></span></p>
<p align="left"><span style="font-family: Arial, sans-serif;"><span style="font-size: small;">In comparison, the Milky Way is 170.000 - 200.000 light years large and contains 100-400 billion stars.</span></span></p>
<p align="left"><span style="font-family: Arial, sans-serif;"><span style="font-size: small;">The interesting part is that Andromeda is moving towards us and in about 4,5 billions years Andromeda will collide with the Milky Way. We can do nothing to stop this way! Collisions among galaxies are a common event in the universe.</span></span></p>

I will check whether it also happened with older versions of LO. It probably didn't happen with LO 6.3.0, but I can't be sure, because at that time I didn't use HTML.

Comment 19 Tyco72 2020-08-01 14:54:19 UTC Comment hidden (off-topic)

Update to my comment #18:
It seems that the bug didn't happen with LO 6.0. I have now rolled back to LO 6.0.2 and will report if it also happens with that version.
How is it possible that the bug is not yet fixed? It is a very critical bug for anyone who uses the HTML format, or pastes content into WordPress and edits it as HTML.

Comment 20 Stéphane Guillou (stragu) 2021-05-18 12:38:51 UTC

Created attachment 172129
Screenshot of HTML source in Firefox 88

Reproducible with LO 7.2 Alpha0+. Firefox 88 even highlights the offending closing tags in red (see attachment).

Version: 7.2.0.0.alpha0+ / LibreOffice Community
Build ID: 6b09276d157abada74e1a4989700139167207778
CPU threads: 8; OS: Linux 4.15; UI render: default; VCL: gtk3
Locale: en-AU (en_AU.UTF-8); UI: en-US
TinderBox: Linux-rpm_deb-x86_64@86-TDF, Branch:master, Time: 2021-05-14_04:32:30
Calc: threaded

Comment 21 Tyco72 2021-05-18 20:46:47 UTC Comment hidden (off-topic)

(In reply to stragu from comment #20)
Yes, you are right, and the bug actually also affects LO 6.x. I didn't update my description.
For now I am still sticking with LO 6.4: for me it isn't worth the effort to update LO until this bug is fixed. It is a nightmare to use LO to work correctly with HTML texts.

This bug hasn't been even assigned to someone. Do you have any news about a fix for this? Is it possible to change the severity at least to 'major'?

Comment 22 Miklos Vajna 2021-05-19 08:20:45 UTC Comment hidden (obsolete)

If this is a regression then please tag it as such and let's bisect to find the first bad commit. Thanks.

Comment 23 Tyco72 2021-05-19 11:30:19 UTC Comment hidden (obsolete)

(In reply to Miklos Vajna from comment #22)
Hello Miklos, I don't know whether it is a regression. I have noticed it since I started working with WordPress, when I was using LO 6.0. Since then, this bug has been present in all LO versions.

Comment 24 Aron Budea 2021-05-23 05:40:27 UTC

(In reply to Tyco72 from comment #18)
> I have the same issue, tested with LO 6.3.6 and 6.4.5. It is a critical bug
> when you have to paste the content for example in the Wordpress and have to
> work with  HTML!
> The code looks messed up as shown in the comment #11, and the pasted text
> looks broken in a lot of rows with few characters each one (in the
> preformatted block of Wordpress). It makes the LO documents useless.
Surely this isn't the same as the originally reported bug, please open a new bug report for your issue.

Comment 25 Tyco72 2021-05-23 09:37:36 UTC

> Surely this isn't the same as the originally reported bug, please open a new
> bug report for your issue.

Hi, thank you. I have created the bug:
Bug 142443
https://bugs.documentfoundation.org/show_bug.cgi?id=142443

Comment 26 QA Administrators 2023-05-24 03:14:45 UTC Comment hidden (obsolete)

Dear Patrick Goetz,

To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.

If you have time, please do the following:

Test to see if the bug is still present with the latest version of LibreOffice from https://www.libreoffice.org/download/

If the bug is present, please leave a comment that includes the information from Help - About LibreOffice.

If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave a comment that includes the information from Help - About LibreOffice.

Please DO NOT

2. Test your bug
3. Leave a comment with your results.
4a. If the bug was present with 3.3 - set version to 'inherited from OOo';
4b. If the bug was not present in 3.3 - add 'regression' to keyword

Feel free to come ask questions or to say hello in our QA chat: https://web.libera.chat/?settings=#libreoffice-qa

Thank you for helping us make LibreOffice even better for everyone!

Warm Regards,
QA Team

MassPing-UntouchedBug

Comment 27 Tyco72 2023-07-12 16:12:20 UTC

Hello,
I have tested it with LO 7.5.3.2 (Win 64bit) and the bug is still present, as I described in comment #18
https://bugs.documentfoundation.org/show_bug.cgi?id=76021#c18

It was also present in LO 7.4. I don't know about older versions, but comment #20 reports that it also happened with LO 7.2, so I would say that the bug is inherited.

Comment 28 Stéphane Guillou (stragu) 2024-05-20 11:11:22 UTC

Reproduced with attachment 95585 in:

Version: 7.6.7.2 (X86_64) / LibreOffice Community
Build ID: dd47e4b30cb7dab30588d6c79c651f218165e3c5
CPU threads: 8; OS: Linux 6.5; UI render: default; VCL: gtk3
Locale: en-AU (en_AU.UTF-8); UI: en-US
Calc: threaded

Resolved in:

Version: 24.2.3.2 (X86_64) / LibreOffice Community
Build ID: 433d9c2ded56988e8a90e6b2e771ee4e6a5ab2ba
CPU threads: 8; OS: Linux 6.5; UI render: default; VCL: gtk3
Locale: en-AU (en_AU.UTF-8); UI: en-US
Calc: CL threaded

Resolved by 6ebe0eceb1ae4a3e544c733be37e5f02c5f46e80 in 24.2.2, cherrypick of:

commit 6d797c83d9fb891b783de39646b42d34a895c81e
author	Mike Kaganski 	Mon Mar 04 12:20:13 2024 +0600
committer	Mike Kaganski 	Mon Mar 04 13:26:06 2024 +0100
tdf160017: make sure to emit the closing tags in correct order
Reviewed-on: https://gerrit.libreoffice.org/c/core/+/164325

Thanks Mike!
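
For readers wondering what "emit the closing tags in correct order" means in practice, here is a small Python sketch of the general technique. It is not the code from the commit above, which has not been examined here; the class TagWriter and its methods are hypothetical. The idea is that when formatting ends while tags opened later are still open, the writer closes those inner tags first, closes the requested tag, and then reopens the inner ones.

```python
class TagWriter:
    """Writes inline tags so that they are always properly nested."""

    def __init__(self):
        self.open_tags = []  # currently open tags, outermost first
        self.out = []

    def open(self, tag):
        self.open_tags.append(tag)
        self.out.append(f"<{tag}>")

    def text(self, data):
        self.out.append(data)

    def close(self, tag):
        # Find the most recent occurrence of `tag` (ValueError if not open).
        index = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
        still_needed = self.open_tags[index + 1:]
        # Close the inner tags and then `tag` itself, innermost first.
        for name in reversed(self.open_tags[index:]):
            self.out.append(f"</{name}>")
        del self.open_tags[index:]
        # Reopen the formatting that has not actually ended yet.
        for name in still_needed:
            self.open(name)

    def result(self):
        # Close whatever is still open, innermost first, without mutating state.
        closing = "".join(f"</{name}>" for name in reversed(self.open_tags))
        return "".join(self.out) + closing


writer = TagWriter()
writer.open("span")
writer.open("strike")
writer.text("and")
writer.close("span")   # span ends while strike is still active
writer.text(",")
print(writer.result())
```

With the sequence from the original report (span ends while strike is still active), this yields markup in which every element closes inside its parent, at the cost of splitting the strike run into two elements.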

*** This bug has been marked as a duplicate of bug 160017 ***