Bug 153839 - Exporting to xhtml results in most of the tags and content in one single line
Status: VERIFIED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 4.0.6.2 release
Hardware: All All
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard: target:7.6.0 target:7.5.3 target:7.5.4
Keywords: bibisectRequest, regression
Duplicates: 154268
Depends on:
Blocks: (X)HTML-Export
Reported: 2023-02-26 03:55 UTC by Franklin Weng
Modified: 2023-05-09 16:28 UTC

See Also:
Crash report or crash signature:


Attachments
Exported HTML file by 3.6.7.2. (19.58 KB, text/html)
2023-02-26 03:55 UTC, Franklin Weng
Exported HTML file by 4.0.6.2 (35.25 KB, text/html)
2023-02-26 03:56 UTC, Franklin Weng
test ODT to export as XHTML (300.17 KB, application/vnd.oasis.opendocument.text)
2023-03-16 15:49 UTC, Stéphane Guillou (stragu)
XHTML export of text document following first commit (372.76 KB, text/html)
2023-03-16 15:50 UTC, Stéphane Guillou (stragu)

Description Franklin Weng 2023-02-26 03:55:06 UTC
Description:
When exporting a document to XHTML, almost all of the tags, including <!DOCTYPE>, <html>, <head>, <meta>, <body>, etc., have no newline \n (and/or carriage return \r) after them, so most of the content ends up on a single line.  This does not affect how the page renders in a browser, but it makes the XHTML file much harder to maintain by hand.

Steps to Reproduce:
1. Save a document (or export it) as an XHTML file

Actual Results:
Most of the content is on a single line.

Expected Results:
The output should be split across multiple lines, with a line break after each tag.  That makes it easier to maintain by hand when necessary.


Reproducible: Always


User Profile Reset: No

Additional Info:
In 3.6.7.2 (Version 3.6.7.2 (Build ID: e183d5b)) the tags and contents are split into multiple lines.

Since 4.0.6.2 (Version 4.0.6.2 (Build ID: 2e2573268451a50806fcd60ae2d9fe01dd0ce24), the second-oldest version installed on my system), the output has been almost entirely on a single line.
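
One rough way to quantify the difference between the two exports is to compare the number of lines and the length of the longest line. The sketch below is illustrative only: the helper name `line_stats` and the miniature sample strings are hypothetical stand-ins for the attached files, not part of LibreOffice.

```python
def line_stats(markup: str) -> dict:
    """Return the line count and the length of the longest line."""
    lines = markup.splitlines()
    return {
        "lines": len(lines),
        "longest": max((len(line) for line in lines), default=0),
    }


# Hypothetical miniature samples standing in for the two exports.
multi_line = (
    "<html>\n<head>\n<title>t</title>\n</head>\n"
    "<body>\n<p>x</p>\n</body>\n</html>\n"
)
single_line = "<html><head><title>t</title></head><body><p>x</p></body></html>"

print(line_stats(multi_line))
print(line_stats(single_line))
```

A file whose longest line is close to the size of the whole file shows the behaviour reported here.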
Comment 1 Franklin Weng 2023-02-26 03:55:48 UTC
Created attachment 185591
Exported HTML file by 3.6.7.2.

The contents are split into multiple lines.
Comment 2 Franklin Weng 2023-02-26 03:56:30 UTC
Created attachment 185592
Exported HTML file by 4.0.6.2

Most of the content is on a single line.  There is no newline after most of the HTML tags.
Comment 3 Stéphane Guillou (stragu) 2023-02-26 15:51:13 UTC
Thanks Franklin.
Confirmed. This has been bugging me for a while; I thought I had reported it, but I could not find the report.

Tagging as a regression.

Version: 7.6.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 6d9b9d1228cdee69e767833202442a1fed6174a6
CPU threads: 8; OS: Linux 5.15; UI render: default; VCL: gtk3
Locale: en-AU (en_AU.UTF-8); UI: en-US
Calc: threaded
Comment 4 Franklin Weng 2023-02-28 08:32:22 UTC
After digging in, I found that the XHTML export is defined in filter/source/xslt/odf2xhtml/export/xhtml.  I added <xsl:text>&#xa;</xsl:text> in several places in opendoc2xhtml.xsl, body.xsl and header.xsl, and can now produce XHTML files in which the <head> section elements and each paragraph are on separate lines.  However, I think we need someone who is familiar with XSLT to review these XSL files and decide how to fix this properly.
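
To illustrate the intended effect (a line feed emitted right after certain tags), here is a small Python sketch that post-processes an already serialized string. It is not the XSLT change described above and not the committed patch; the tag list `BLOCK_TAGS` and the function name `break_after_closing_tags` are hypothetical choices made for this example.

```python
import re

# Hypothetical set of elements after whose closing tag a line break is wanted.
BLOCK_TAGS = ("head", "title", "body", "p", "div", "table", "h1")

# Match a closing tag from the list that is not already followed by a newline.
_CLOSE = re.compile(r"(</(?:%s)>)(?!\n)" % "|".join(BLOCK_TAGS))


def break_after_closing_tags(markup: str) -> str:
    """Insert a line feed after each listed closing tag."""
    return _CLOSE.sub(r"\1\n", markup)


sample = "<head><title>t</title></head><body><p>a</p><p>b</p></body>"
print(break_after_closing_tags(sample))
```

Because only closing tags are matched, attribute values are left untouched, and running the function twice gives the same result as running it once.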
Comment 5 Commit Notification 2023-03-09 07:43:56 UTC
Franklin Weng committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/d2e8705c9cc503afdaed366b1f71ed012b0c568f

tdf#153839: add newline after certain tags

It will be available in 7.6.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 6 Commit Notification 2023-03-16 11:46:19 UTC
Franklin Weng committed a patch related to this issue.
It has been pushed to "libreoffice-7-5":

https://git.libreoffice.org/core/commit/5ee2f4ee7838401afdae5eef5669881601fb4ee6

tdf#153839: add newline after certain tags

It will be available in 7.5.3.

Comment 7 Stéphane Guillou (stragu) 2023-03-16 15:49:19 UTC
Created attachment 186006
test ODT to export as XHTML

Thanks for this work, Franklin.
I just tested and I think it's pretty close to being resolved; the HTML source is a lot more readable now.
I am attaching an example file that I used, to list a couple of extra tags that could be improved, if you feel like submitting follow-ups:

- Comments of the type <!--Next 'div' was a 'text:p'.--> are either kept inline or broken across multiple lines in odd places
- Table markup could be broken up better, as its lines run very long. But this might also have to do with the filter creating unnecessarily complicated table markup, I'm not sure.
- Note also the closing </table> tag directly followed by <h1> without breaking.

What do you think?
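
The last item in the list above, a closing tag immediately followed by an opening tag on the same line, can be searched for mechanically. The following is a minimal sketch; the function name `unbroken_boundaries` and the sample string are hypothetical, and the regular expression is only a heuristic, not a real XML parser.

```python
import re

# A closing tag directly followed by an opening tag, with nothing in between.
_ADJACENT = re.compile(r"</(\w+)><(\w+)")


def unbroken_boundaries(markup: str) -> list:
    """Return (closed_tag, opened_tag) pairs that share a line with no break."""
    return [(m.group(1), m.group(2)) for m in _ADJACENT.finditer(markup)]


sample = "<table><tr><td>x</td></tr></table><h1>T</h1>\n<p>y</p>"
print(unbroken_boundaries(sample))
```

Consecutive closing tags such as </td></tr> are deliberately not reported, since the pattern requires the second tag to be an opening tag.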
Comment 8 Stéphane Guillou (stragu) 2023-03-16 15:50:37 UTC
Created attachment 186007
XHTML export of text document following first commit

Exported with:

Version: 7.6.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 44837a12d12be3e525fa48b37c3dd2553cc97d94
CPU threads: 8; OS: Linux 5.15; UI render: default; VCL: gtk3
Locale: en-AU (en_AU.UTF-8); UI: en-US
Calc: threaded
Comment 9 Franklin Weng 2023-03-17 11:18:54 UTC
(In reply to Stéphane Guillou (stragu) from comment #7)
> Created attachment 186006
> test ODT to export as XHTML
> 
> Thanks for this work, Franklin.
> I just tested and I think it's pretty close to being resolved; the HTML source
> is a lot more readable now.
> I am attaching an example file that I used, to list a couple of extra tags
> that could be improved, if you feel like submitting follow-ups:
> 
> - Comments of the type <!--Next 'div' was a 'text:p'.--> are either kept
> inline or broken across multiple lines in odd places
> - Table markup could be broken up better, as its lines run very long. But
> this might also have to do with the filter creating unnecessarily complicated
> table markup, I'm not sure.
> - Note also the closing </table> tag directly followed by <h1> without
> breaking.
> 
> What do you think?

That looks a lot more complicated, but I can spend some time figuring it out and see whether it passes the unit tests.

However, Miklos commented on the commit:

> in general the XSL-based XHTML export is horrible, you should never use it. Instead, you can use the XHTML mode of the C++-based HTML export, like:
> soffice --convert-to "xhtml:HTML (StarWriter):XHTML" ...
> I just note this because this change is easy enough to review, but if you would want nontrivial changes in this XSL mess, I won't be able to review. 

I guess that for now we still need to stick with the XSLT solution if the C++-based HTML export can only be used from the command line, which is not practical for ordinary users.  I can test it as well, though.
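
For reference, the command quoted above could be scripted roughly as follows. This is a sketch under assumptions: only the filter string comes from the quoted comment, while the `--headless` and `--outdir` options, the `report.odt` and `out` names, and the helper `build_convert_command` are additions made for illustration. The command is only built and printed here, not executed.

```python
import shlex

# Filter string as quoted in the comment above.
XHTML_FILTER = "xhtml:HTML (StarWriter):XHTML"


def build_convert_command(input_path, outdir=None, soffice="soffice"):
    """Build the argument list for a headless XHTML conversion."""
    cmd = [soffice, "--headless", "--convert-to", XHTML_FILTER]
    if outdir is not None:
        cmd += ["--outdir", outdir]
    cmd.append(input_path)
    return cmd


# Hypothetical input file and output directory.
command = build_convert_command("report.odt", outdir="out")
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine where soffice is on the PATH.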
Comment 10 Stéphane Guillou (stragu) 2023-04-21 08:31:43 UTC
I tested https://gerrit.libreoffice.org/c/core/+/149280 and it looks good to me.
The table code still spreads horizontally too much but that might have to do with the filter unnecessarily repeating tags, a different issue.
Overall, this is a big improvement over the previous situation. Happy to have this marked as fixed once the second patch is merged.
Thanks Franklin!
Comment 11 Commit Notification 2023-04-21 09:35:27 UTC
Franklin Weng committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/ce4272c25426f0084e53735e80870b9339239078

tdf#153839 : Further handling for adding newlines

It will be available in 7.6.0.

Comment 12 Franklin Weng 2023-04-21 12:07:32 UTC
(In reply to Stéphane Guillou (stragu) from comment #10)
> I tested https://gerrit.libreoffice.org/c/core/+/149280 and it looks good to
> me.
> The table code still spreads horizontally too much but that might have to do
> with the filter unnecessarily repeating tags, a different issue.
> Overall, this is a big improvement over the previous situation. Happy to
> have this marked as fixed once the second patch is merged.
> Thanks Franklin!

Some places couldn't be fixed: whenever I tried to insert a line break there (for example, before <h1>), it caused a unit test failure.

But let's live with this for now.
Comment 13 Stéphane Guillou (stragu) 2023-04-21 13:17:05 UTC
Verified with my own build. Thanks again!
Comment 14 Stéphane Guillou (stragu) 2023-04-26 23:43:11 UTC
*** Bug 154268 has been marked as a duplicate of this bug. ***
Comment 15 Commit Notification 2023-05-08 07:41:08 UTC
Michael Stahl committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/63ac36893ad7f3b1c73cb46667fbfd5384a747dc

tdf#153839 XHTML export: fix syntax error in table.xsl

It will be available in 7.6.0.

Comment 16 Commit Notification 2023-05-08 07:42:11 UTC
Michael Stahl committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/ab85fd73a52256da6feb4fabd1b188f4f0fb7ce4

tdf#153839 XHTML export: do not add newlines to attribute values

It will be available in 7.6.0.

Comment 17 Commit Notification 2023-05-09 16:27:35 UTC
Franklin Weng committed a patch related to this issue.
It has been pushed to "libreoffice-7-5":

https://git.libreoffice.org/core/commit/c910a1320c7247c111d4f7e2a61540fc646938ff

tdf#153839 : Further handling for adding newlines

It will be available in 7.5.4.

Comment 18 Commit Notification 2023-05-09 16:28:37 UTC
Michael Stahl committed a patch related to this issue.
It has been pushed to "libreoffice-7-5":

https://git.libreoffice.org/core/commit/fc4b4d007e41192c21d2979e45ac73541935c00e

tdf#153839 XHTML export: fix syntax error in table.xsl

It will be available in 7.5.4.

Comment 19 Commit Notification 2023-05-09 16:28:39 UTC
Michael Stahl committed a patch related to this issue.
It has been pushed to "libreoffice-7-5":

https://git.libreoffice.org/core/commit/35fe68188e984d32d3f21db81e633743ca06f67c

tdf#153839 XHTML export: do not add newlines to attribute values

It will be available in 7.5.4.
