Bug 141187 - LO produces messy HTML in EPUB export
Summary: LO produces messy HTML in EPUB export
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 7.1.1.2 release
Hardware: All Linux (All)
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Duplicates: 147392
Depends on: 141664
Blocks: EPUB-Export
Reported: 2021-03-22 23:29 UTC by Coburn Ingram
Modified: 2024-03-17 02:45 UTC (History)
CC List: 3 users

See Also:
Crash report or crash signature:


Attachments
odt document in order to test this bug (10.01 KB, application/vnd.oasis.opendocument.text)
2021-07-31 06:24 UTC, BogdanB

Description Coburn Ingram 2021-03-22 23:29:44 UTC
Description:
I have tried exporting book-length material to EPUB from Writer and then editing it with Calibre.

What happens is that every place I have edited my text in LO (and I edit a lot, because that is what writers do), I find a separate span. For example, if I change a word in a paragraph after I have written it, that word will be in its own span. That makes for very messy HTML that is hard to edit afterward.

Thank you for your attention to this. I apologize for an amateur bug report. I think the fix should be fairly simple.

Steps to Reproduce:
1. Create a document.
2. Go back and change text within the document.
3. Export to EPUB.

Actual Results:
Text within a paragraph, sentence, or even a single word is separated into multiple spans. A new span is created whenever even a single letter is changed.

Expected Results:
I would like every paragraph to be kept in one single span, as long as no styles are changed, that is, as long as it would be in its own single span if it had not been edited.


Reproducible: Always


User Profile Reset: No



Additional Info:
[Information automatically included from LibreOffice]
Locale: en-US
Module: TextDocument
[Information guessed from browser]
OS: Linux (All)
OS is 64bit: yes
Comment 1 Coburn Ingram 2021-04-07 15:16:47 UTC
I guess what I am asking for, to simplify, is that adjacent identical tags would be merged. And that text that is of the same WYSIWYG style would also have the same tag style, and therefore be merged.

This is asking LO to behave like an HTML cleaner, but it is also asking for the app to not create new spans where none are needed.
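
As a rough illustration of the "merge adjacent identical tags" idea, here is a minimal Python sketch that post-processes an exported XHTML fragment. It is not LibreOffice code. The function name `merge_adjacent_spans` is made up for this example, and it only handles text-only spans with nothing between them.

```python
import xml.etree.ElementTree as ET

SPAN = "{http://www.w3.org/1999/xhtml}span"

def merge_adjacent_spans(parent):
    """Merge runs of sibling <span>s that have identical attributes.

    Hypothetical helper: only text-only spans with no text between them
    are merged; anything more complex is left untouched.
    """
    for child in list(parent):
        merge_adjacent_spans(child)
    kept = []
    for child in list(parent):
        prev = kept[-1] if kept else None
        if (prev is not None
                and prev.tag == SPAN and child.tag == SPAN
                and prev.attrib == child.attrib
                and not prev.tail                       # nothing between the spans
                and len(prev) == 0 and len(child) == 0):  # no nested elements
            prev.text = (prev.text or "") + (child.text or "")
            prev.tail = child.tail
            parent.remove(child)
        else:
            kept.append(child)

# Example input: a shortened paragraph in the shape the export produces.
p = ET.fromstring(
    '<p xmlns="http://www.w3.org/1999/xhtml" class="para1">'
    '<span class="span2">lying in </span>'
    '<span class="span2">my</span>'
    '<span class="span2"> the middle of the sidewalk.</span></p>'
)
merge_adjacent_spans(p)
print(len(p), repr(p[0].text))
# 1 'lying in my the middle of the sidewalk.'
```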

At the risk of being annoying, this seems related to the observed behavior that LO likes to default to a basic text style that may or may not be the user-defined style. If I have changed my style (e.g. to 10 pt.) in Options, and perform an Undo operation, sometimes the changed text reverts to the basic out-of-the-box style (e.g. 12 pt.) instead of the user-defined style. It seems to me there should only be one place that styles are defined, to avoid confusion. Not asking for a separate bugfix, just trying to add helpful information that may help locate the source of this behavior.
Comment 2 BogdanB 2021-07-31 06:20:42 UTC
I added "my" in the middle of a sentence, and the export creates a new span for it.

----
in </span><span class="span2">my</span><span class="span2"> the
----



<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head><link href="../styles/stylesheet.css" rel="stylesheet" type="text/css"/></head><body class="body0" xmlns:epub="http://www.idpf.org/2007/ops"><p class="para2"><span class="span1">Chapter 1</span></p><p class="para1"><span class="span2">He heard quiet steps behind him lying in </span><span class="span2">my</span><span class="span2"> the middle of the sidewalk.  Would this door save his hide?</span></p><p class="para1"><span class="span2">Another paragraf</span></p><p class="para1"> </p></body></html>

Confirmed with
Version: 7.1.5.2 / LibreOffice Community
Build ID: 85f04e9f809797b8199d13c421bd8a2b025d52b5
CPU threads: 4; OS: Linux 5.8; UI render: default; VCL: gtk3
Locale: ro-RO (ro_RO.UTF-8); UI: en-US
Calc: threaded
Comment 3 BogdanB 2021-07-31 06:24:57 UTC
Created attachment 173984
odt document in order to test this bug
Comment 4 Dieter 2022-03-13 15:45:00 UTC
*** Bug 147392 has been marked as a duplicate of this bug. ***
Comment 5 QA Administrators 2024-03-13 03:15:59 UTC Comment hidden (obsolete)
Comment 6 Tex2002ans 2024-03-17 02:45:45 UTC
Yes, this is still an issue in:

Version: 24.2.1.2 (X86_64) / LibreOffice Community
Build ID: db4def46b0453cc22e2d0305797cf981b68ef5ac
CPU threads: 8; OS: Windows 10.0 Build 22631; UI render: Skia/Raster; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: CL threaded

- - -

I tested using BogdanB's attachment 173984 in comment 3.

0. Open file.
1. Add a random word or two inside the text.
2. File > Export As > Export as EPUB.
3. Press OK.
4. Unzip the EPUB and look inside the HTML.
   - Or use an EPUB editing program like Sigil or Calibre.
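
Step 4 can also be scripted. The sketch below counts `<span>` opening tags per content file without unzipping by hand, using only Python's standard `zipfile` and `re` modules. The function name `count_spans` and the file name in the usage comment are made up for this example.

```python
import re
import zipfile

SPAN_OPEN = re.compile(rb"<span\b")

def count_spans(epub):
    """Return {member name: number of <span> opening tags} for each (X)HTML file.

    `epub` may be a path or a binary file object; an EPUB is a ZIP archive.
    """
    counts = {}
    with zipfile.ZipFile(epub) as zf:
        for name in zf.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                counts[name] = len(SPAN_OPEN.findall(zf.read(name)))
    return counts

# Usage (hypothetical file name):
#   for name, n in count_spans("exported.epub").items():
#       print(n, name)
```

Comparing the counts before and after adding a word to the document makes the extra spans easy to see.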

You'll see extra <span>s in the EPUB:

> Like lightning he darted off to the left and disappeared between the two warehouses almost falling over the trash can lying în </span><span class="span2">my</span><span class="span2"> the middle of the sidewalk. He tried to nervously tap his way along in </span><span class="span2">my </span><span class="span2">the inky darkness and suddenly stiffened:

with this blank class in the EPUB's CSS:

> .span2 {
> }

= = = = = = = = = = = =

I believe part of the root cause is spurious:

- officeooo:rsid

attributes inside the ODT file, which get carried over into the HTML/EPUB export.

(I believe these RSIDs are "Random Session IDs"—to know when a certain text was edited for Comparison / Tracked Changes reasons.)

- - -

If you take the ODT and:

- File > Save As
- Dropdown for "Save as Type:"
   - Choose "Flat XML ODF Text Document"

You can open the FODT up in a text editor and see code along these lines:

> <text:p text:style-name="P1">He heard quiet steps behind him. [...] almost falling over the trash can lying în <text:span text:style-name="T1">my</text:span> the middle of the sidewalk. He tried to nervously tap his way along in <text:span text:style-name="T2">my </text:span>the inky darkness and suddenly stiffened: it was a dead-end, [...]

where extra <text:span>s appear around everything you insert/edit.

Higher in the FODT document, you can see what "T1" and "T2" were equivalent to:

>  <style:style style:name="T1" style:family="text">
>   <style:text-properties officeooo:rsid="00019890"/>
>  </style:style>
>  <style:style style:name="T2" style:family="text">
>   <style:text-properties officeooo:rsid="0003570a"/>
>  </style:style>

The only thing these <text:span>s were there for was:

- officeooo:rsid

they didn't supply any other info.
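
To check a whole FODT for such styles, something like the following Python sketch could be used. It assumes the usual ODF namespace URIs for the `style` and `officeooo` prefixes. The function name `rsid_only_styles` is made up for this example, and a style with a parent style is deliberately treated as carrying real formatting.

```python
import xml.etree.ElementTree as ET

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
OOO_NS = "http://openoffice.org/2009/office"   # assumed URI for officeooo:

STYLE = f"{{{STYLE_NS}}}style"
NAME = f"{{{STYLE_NS}}}name"
FAMILY = f"{{{STYLE_NS}}}family"
PARENT = f"{{{STYLE_NS}}}parent-style-name"
TEXT_PROPS = f"{{{STYLE_NS}}}text-properties"
RSID = f"{{{OOO_NS}}}rsid"

def rsid_only_styles(root):
    """Return names of text styles whose only content is an officeooo:rsid."""
    names = set()
    for style in root.iter(STYLE):
        if style.get(FAMILY) != "text" or PARENT in style.attrib:
            continue
        children = list(style)
        if (len(children) == 1
                and children[0].tag == TEXT_PROPS
                and set(children[0].attrib) == {RSID}):
            names.add(style.get(NAME))
    return names

# Usage (hypothetical file name):
#   root = ET.parse("document.fodt").getroot()
#   print(sorted(rsid_only_styles(root)))
```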

- - -

There was a similar issue with "single URLs" getting split into "multiple identical ones" here:

- Bug #112429 : "officeooo:rsid multiplies the links"
- Bug #148198 : "Editing single hyperlink breaks it into smaller ones"
   - Which got fixed in 7.5.0 and 7.4.0.2.

Mike Kaganski then came up with a patch to "merge identical hyperlinks of adjacent text ranges on ODF export":

- https://bugs.documentfoundation.org/show_bug.cgi?id=148198#c19

= = = = = = = = = = = =

So, on EPUB Export, I would probably do some logic along these lines:

Case 1: Before export

- If ODT's "text:span text:style-name" only has "officeooo:rsid":
   - Do not export this <span> to EPUB at all.
- If 2 "text:span"s are right next to each other and the only difference is "officeooo:rsid":
   - Merge them together before HTML/EPUB export.
      - Similar to Bug 148198 above!
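
The decision logic in Case 1 could be sketched like this in Python. This is only an illustration of the rule, not how the exporter works internally. The function name `canonical_styles` is made up, the styles are plain dicts rather than real ODF objects, and "T3"/"T4" are invented bold styles added to show the merge case.

```python
RSID = "officeooo:rsid"

def canonical_styles(styles):
    """Map each automatic text style to a representative, ignoring rsid.

    A style that carries nothing but an rsid maps to None (its span can be
    dropped). Styles that differ only in rsid map to the same name, so
    adjacent spans using them can be merged.
    """
    first_seen = {}   # properties without rsid -> first style name seen
    mapping = {}
    for name, props in styles.items():
        key = frozenset((k, v) for k, v in props.items() if k != RSID)
        if not key:
            mapping[name] = None
        else:
            mapping[name] = first_seen.setdefault(key, name)
    return mapping

styles = {
    "T1": {RSID: "00019890"},
    "T2": {RSID: "0003570a"},
    "T3": {"fo:font-weight": "bold", RSID: "00019890"},   # hypothetical
    "T4": {"fo:font-weight": "bold", RSID: "0003570a"},   # hypothetical
}
print(canonical_styles(styles))
# {'T1': None, 'T2': None, 'T3': 'T3', 'T4': 'T3'}
```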

Case 2: After export

You could have a pass that says:

- If the CSS class is empty/blank on the other end:
   - Delete that <span> out of the HTML/CSS/EPUB export completely.
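
A cleanup pass like Case 2 could look roughly like the Python sketch below, run on the exported stylesheet and XHTML. The function names are made up, the CSS matching is a simple regex that only recognises rules with an empty body, and only text-only spans whose sole attribute is the empty class are unwrapped.

```python
import re
import xml.etree.ElementTree as ET

SPAN = "{http://www.w3.org/1999/xhtml}span"
EMPTY_RULE = re.compile(r"\.([A-Za-z_][\w-]*)\s*\{\s*\}")

def empty_classes(css):
    """Return class names whose CSS rule has an empty body, e.g. '.span2 {}'."""
    return set(EMPTY_RULE.findall(css))

def unwrap_empty_spans(parent, classes):
    """Replace each text-only <span class="X">, X in `classes`, by its text."""
    prev = None   # last sibling that was kept before the current child
    for child in list(parent):
        unwrap_empty_spans(child, classes)
        if (child.tag == SPAN and len(child) == 0
                and len(child.attrib) == 1
                and child.get("class") in classes):
            text = (child.text or "") + (child.tail or "")
            if prev is None:
                parent.text = (parent.text or "") + text
            else:
                prev.tail = (prev.tail or "") + text
            parent.remove(child)
        else:
            prev = child

css = ".span1 {\n  font-weight: bold;\n}\n.span2 {\n}\n"
p = ET.fromstring(
    '<p xmlns="http://www.w3.org/1999/xhtml" class="para1">'
    '<span class="span2">lying in </span>'
    '<span class="span2">my</span>'
    '<span class="span2"> the middle</span></p>'
)
unwrap_empty_spans(p, empty_classes(css))
print(len(p), repr(p.text))
# 0 'lying in my the middle'
```

The empty `.span2` rule itself would still need to be deleted from the stylesheet afterwards.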

= = = = = = = = = = = =

Note 1: Calibre's EPUB Editor has a fantastic feature called:

- "Remove Unused CSS"
- https://manual.calibre-ebook.com/edit.html#removing-unused-css-rules

which can do this type of thing in one button push:

- Tools > "Remove unused CSS"

It:

- Finds and purges all CSS and related HTML tags that are blank / not in use

making the leftover HTML *much* easier to work with.

- - -

Note 2: I've also written many topics about this type of HTML+CSS cleanup over the years. Most recently:

2023: "Nested span, clean"
- https://www.mobileread.com/forums/showthread.php?p=4342160#post4342160

2023: "removing excessive <class> and other formatting horrors on epub"
- https://www.mobileread.com/forums/showthread.php?p=4312194#post4312194

2022: "Convert text formating from CSS to HTML"
- https://www.mobileread.com/forums/showthread.php?p=4188132#post4188132