Bug 135282 - A showcase of HTML import, editing and export bugs in an HTML5 era
Status: REOPENED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 7.0.0.0.beta1+
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: HTML-Export HTML-Import
Reported: 2020-07-29 15:12 UTC by Eyal Rozenberg
Modified: 2021-10-23 15:04 UTC (History)

See Also:
Crash report or crash signature:


Attachments
The page saved using "Save Page WE" to a single HTML (1.18 MB, text/html)
2020-07-29 15:14 UTC, Eyal Rozenberg
Screenshot 01 - HTML meta tags as comments (93.74 KB, image/png)
2020-07-29 15:16 UTC, Eyal Rozenberg
Screenshot 02 - Layout in LO Writer vs a browser (552.52 KB, image/png)
2020-07-29 15:23 UTC, Eyal Rozenberg
Screenshot 03 - Three copies of same image (zoom 80%) (375.49 KB, image/png)
2020-07-29 15:27 UTC, Eyal Rozenberg

Description Eyal Rozenberg 2020-07-29 15:12:42 UTC
This is filed as a single bug, because it involves a single document, but in fact likely involves many bugs, some dupes and some probably not.

Instructions:

1. Have the Firefox web browser installed (I use v78.0.2 though it shouldn't matter much)
2. Install the "Save Page WE" extension.
3. Visit this page: https://medium.com/datadriveninvestor/what-are-time-series-databases-a3e847608f91
4. Press the "Save Page WE" button
5. Choose a path for the saved copy of the document; let's say you choose /path/to/doc.html  (you should get a 1.2 MB file, unfortunately)
6. Open /path/to/doc.html in LibreOffice Writer

Now let's enumerate the issues.

HTML meta tags and comments
------------------------------

(See screenshot 01)

1. Many HTML meta tags show up as comments. Most/all of them shouldn't; they should instead go into Document Properties - if nothing else, then as custom properties. Some tags _do_ show up there, but not nearly all.

2. Many HTML meta tags have their name stripped for some reason. Examples: "twitter:app:name:iphone", "al:ios:app_name".

3. The comment-overflow UI is quite inadequate: an idiosyncratic pair of buttons which don't really look like buttons, and whose up-triangle and down-triangle are also incredibly small. This stands in stark contrast to:

4. Choice of a relatively-large comment font, and layout of comment balloons, so that only about 25% of each comment balloon is actually used for comment text. That's not just an issue with this document or with HTML; it's something I've noticed elsewhere as well.

5. Comments exhibit entry time, although the comments are _not_ timed. They exhibit an author even though they have no identified author. Now, if _some_ comments had an author and some didn't, that might perhaps make sense, but not when no comments have authors. This also relates to issue (4.), since a lot of balloon real-estate is used for the dummy time and author listing.

6. If you want to make HTML meta tags into comments - already a questionable idea - why have the comment text be HTML _code_? Especially since all you have about these meta tags is a name and a content string, or perhaps even _just_ a content string (issue (2.))? Shouldn't the comments be just:

    somenamehere: The tag content here

  and that's that?
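
To illustrate the "name: content" form suggested in issue (6.), here is a minimal Python sketch using only the standard library. It is not how LO's import filter works; the class and function names are made up for illustration, and it assumes that a tag may carry its name in either a name= or a property= attribute.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects (name, content) pairs from <meta> elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Assumption: some tags use property= instead of name=.
        name = attributes.get("name") or attributes.get("property")
        content = attributes.get("content")
        if name and content is not None:
            self.pairs.append((name, content))


def meta_as_plain_lines(html_text):
    """Returns one 'name: content' line per named meta tag."""
    collector = MetaTagCollector()
    collector.feed(html_text)
    return [f"{name}: {content}" for name, content in collector.pairs]
```

The same (name, content) pairs could just as well be stored as custom document properties, as suggested in issue (1.), rather than as comments.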

Viewing the document:

(Screenshots 02, 03)

7. The document is opened for editing in Normal view rather than Web view. This doesn't make sense to me - it's an HTML document, saved from the web.

8. The rendered document looks nothing like its rendering in a browser; and its rendering in different browsers is almost-identical. Now, ok, LO Writer is an editor, not a browser, so one can expect a few inaccuracies here and there. But:

  8.1 The basic block/frame layout is very different, up to and including vertical-vs-horizontal
  8.2 Font sizes and colors are different
  8.3 Areas which are supposed to be demarcated, aren't (e.g. the "Why data will transform investment management" block)
  8.4 Navigation bar background is missing
  8.5 The centralized, width-limited style for rendering the content is not respected.
  8.6 The size of (all copies of) each chart is larger than the size of the chart when displayed in a browser.

9. Instead of a single crisp chart, we're seeing three (!) copies of the chart, two of which are blurry to different degrees. 

10. As you get to the end of the document (after the line saying "Original Source"), LO Writer viewport repainting gets messed up, and scrolling up-and-down results in the same parts of the content being repeated several times, while other parts disappear (despite some of them having been visible before).

Saving/exporting:

11. Saving the document - with no changes - to another HTML file results in 22 files in addition to the new HTML file. That should not happen: LO Writer should be able to maintain data within "data:" URIs inside the HTML document.

12. Opening the newly-exported HTML document in a browser shows a document very similar to what we saw in LO - not similar to the original document. That means many/most of the changes we saw weren't just cosmetic, and carried forward to the output.
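
Regarding issue (11.): a "data:" URI carries the resource's bytes inside the HTML itself, so in principle it can be decoded on import and re-encoded on export without creating any extra files. The following Python sketch shows that round trip; the function names are hypothetical and this is not LO code, just an illustration of the encoding.

```python
import base64
from urllib.parse import unquote_to_bytes


def encode_data_uri(raw, mime_type):
    """Embeds raw bytes in a base64 data: URI."""
    payload = base64.b64encode(raw).decode("ascii")
    return f"data:{mime_type};base64,{payload}"


def decode_data_uri(uri):
    """Returns (mime_type, raw_bytes) for a data: URI."""
    if not uri.startswith("data:") or "," not in uri:
        raise ValueError("not a data: URI")
    header, payload = uri[len("data:"):].split(",", 1)
    mime_type = header.split(";")[0]
    if header.endswith(";base64"):
        return mime_type, base64.b64decode(payload)
    # Without ;base64 the payload is percent-encoded.
    return mime_type, unquote_to_bytes(payload)
```

Decoding and then re-encoding with the same MIME type yields the original URI string, so nothing forces the exporter to split embedded images out into separate files.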
Comment 1 Eyal Rozenberg 2020-07-29 15:14:53 UTC
Created attachment 163738 [details]
The page saved using "Save Page WE" to a single HTML

You can save this attachment instead of following instructions 1 through 5.
Comment 2 Eyal Rozenberg 2020-07-29 15:16:27 UTC
Created attachment 163739 [details]
Screenshot 01 - HTML meta tags as comments
Comment 3 Eyal Rozenberg 2020-07-29 15:23:48 UTC
Created attachment 163740 [details]
Screenshot 02 - Layout in LO Writer vs a browser
Comment 4 Eyal Rozenberg 2020-07-29 15:27:37 UTC
Created attachment 163741 [details]
Screenshot 03 - Three copies of same image (zoom 80%)

This is issue (9.). You'll note three different "copies" of the same image/chart, with different degrees of blurriness.

Note it's not impossible that the three images exist within the .HTML file - but even if they do, they overlap, and some sort of JavaScript magic ensures the crisp one shows and the others don't.
Comment 5 Eyal Rozenberg 2020-08-13 06:28:00 UTC
Some of the issues I brought up regard the LO UI, so adding needsUXEval
Comment 6 V Stuart Foote 2020-08-13 17:09:06 UTC
Work on the Writer Web module ended at HTML 4.0 Transitional--while not officially deprecated the feature is essentially abandoned.

Import and Export (save to HTML) works reasonably well for inline CSS2 HTML 4.0 markup--that is it. 

The default import filter mode for opening a .HTML document with LibreOffice is into the Writer Web module, into its 'Web' (un-paged) view. I cannot confirm the reported issue of import opening to Writer Web 'Normal' (i.e. page view).

Clear your user profile to defaults to resolve.

The CSS of the js based HTML5/CSS3 web page linked is simply not renderable, and excess content/meta is filter import captured as comments. 

The Writer Web mode allows those spurious (to HTML 4.0) comments to be toggled off--or better to simply delete in bulk from the HTML file.

Point is this is as good as it gets, and we have bug 95861 open to consider work to make the Writer Web module HTML5 and CSS3 aware if not functional. With some devs opining it would be better to drop the Writer Web module completely and only filter import to Writer, and export to styled XHTML.

*** This bug has been marked as a duplicate of bug 95861 ***
Comment 7 Eyal Rozenberg 2020-08-13 21:40:07 UTC
(In reply to V Stuart Foote from comment #6)

You've made several points in your comment; but I'll begin by stressing that this bug is not a duplicate of 95861. That bug regards HTML5 and CSS3, like you yourself said; but this bug has nothing in particular to do with CSS3. While it's quite possible that the HTML I attached has some CSS3-specific selectors or attributes - most of the issues listed here have nothing to do with that. The appearance of the document may involve mis-handling or non-handling of CSS3, but I'm not even sure that's the case; and again - it's 2 out of 10 issues.

It's important, IMHO, not to "kill" this bug as a dupe exactly because it showcases many issues at once.

Oh, also - IIANM, the HTML itself in the attached document is plain-vanilla. Nothing beyond HTML 4.0 and probably earlier.

> Work on the Writer Web module ended at HTML 4.0 Transitional--while not
> officially deprecated the feature is essentially abandoned.

I'm not sure I see why this is relevant. Bugs are bugs. If the feature was experimental, or unavailable by default etc. then it might be argued that bugs should not be reported and addressed. I understand that nobody is springing into action to fix this, and that is ok (well, maybe).

> Import and Export (save to HTML) works reasonably well for inline CSS2 HTML
> 4.0 markup--that is it. 

First note that this issue is not merely about the importation and the exportation but also about what LO does with what's been imported.

Having said that - import and export don't work reasonably well in some cases. There are significant issues - as I have demonstrated. That is another reason why it is inappropriate to close this bug.

> The default import filter mode for opening a .HTML document with LibreOffice
> is into the Writer Web module, into its 'Web' (un-paged view). I can not
> confirm reported issue of import opening to Writer Web 'Normal' (i.e. page
> view).

I'll try to get others to confirm.

> 
> Clear your user profile to defaults to resolve.

I've never cleaned my LO user profile before. I'll try it and report the result.

> The CSS of the js based HTML5/CSS3 web page linked is simply not renderable,

The web page is not "JS-based"; and it is quite renderable. In fact, its script elements are mostly empty. The URIs are actually not in src= attributes but in data-savepage-src attributes. And if you delete the script tags, you still get basically the same rendering in a browser and the same mis-rendering in LibreOffice.
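
For anyone wanting to repeat the "delete the script tags" experiment, here is a rough Python sketch. It is regex-based, which is adequate for a one-off test on a saved page like this one but is not a general HTML sanitizer; the function name is made up for illustration.

```python
import re

# Matches a whole <script ...>...</script> element, including empty ones.
SCRIPT_ELEMENT = re.compile(
    r"<script\b[^>]*>.*?</script\s*>",
    flags=re.DOTALL | re.IGNORECASE,
)


def strip_script_elements(html_text):
    """Returns the HTML text with all script elements removed."""
    return SCRIPT_ELEMENT.sub("", html_text)
```

Writing the result to a second file and opening both files in a browser and in LO Writer lets you compare the renderings with and without the script elements.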

> and excess content/meta is filter import captured as comments. 

... which is a bug, or several bugs, as I've described.

> Point is this is as good as it gets

With respect - that is unacceptable. That is, you are of course under no personal obligation to fix things, but LO's current handling of HTML documents is not nearly what it should be, and there is no reason to lower users' expectations to the current state of the implementation.

>, and we have bug 95861 open to consider
> work to make the Writer Web module HTML5 and CSS3 aware if not functional.

It's possible that work on that may help with some of the issues here, but probably at most the two issues which may be caused by the lack of CSS3 support. Possibly not even those.

> With some devs opining it would be better to drop the Writer Web module
> completely and only filter import to Writer, and export to styled XHTML.

Only 2 of the issues I've reported regard saving the edited file. And they too are valid issues, I believe, while writing HTML files is supported. Also, are you certain that saving this document to XHTML would yield reasonable output? I am somewhat doubtful.
Comment 8 Heiko Tietze 2021-04-08 13:49:16 UTC
Please don't forget to add the keyword needsUXEval when CC'ing libreoffice-ux-advise.
Comment 9 Heiko Tietze 2021-04-09 11:18:20 UTC
Would rather drop HTML support than put effort in. After 10 (or 20) years there are better suited tools, and HTML/CSS develops so fast that we never catch up. But anyway, nothing to discuss for UX.
Comment 10 Eyal Rozenberg 2021-04-09 12:08:58 UTC
(In reply to Heiko Tietze from comment #9)
> But anyway, nothing to discuss for UX.

Actually, several points are UX/UI relevant:

* "Real-estate" distribution of comments - generally and for comments which are the result of an imported piece of text not placed in the body of the document.
* Named-author vs no-named-author comments - why should the latter say "no author" rather than not saying anything?
* Undated comments - why don't we have them?
* Possibility of hiding comment authors/dates, manually or automatically when we have many comments.
* The many-comments scrolling mechanism
* The in-comment-balloon scroll bar when it's super-small (see screenshot 1)

All of these issues are not really about HTML support.

> Would rather drop HTML support than put effort in. After 10 (or 20)
> years there are better suited tools

If there's a tool which takes an HTML and produces an ODT, maybe we can just use it / its code for importing? :-|

>  and HTML/CSS develops so fast that we never catch up.

That's fair enough, but this bug is really not about catching up to new fancy CSS. The example is not the ACID test...
Comment 11 Heiko Tietze 2021-04-09 12:15:21 UTC
We have several tickets about a large number of comments, eg. bug 38295.
Comment 12 Michael Warner 2021-10-23 14:30:40 UTC
(In reply to Eyal Rozenberg from comment #10)
> If there's a tool which takes an HTML and produces an ODT, maybe we can just
> use it / its code for importing? :-|

There is pandoc (https://pandoc.org/) which claims to support both html5 and odt, but I have not used it myself, so I have no idea how well it works.
Comment 13 Eyal Rozenberg 2021-10-23 15:04:15 UTC
(In reply to Michael Warner from comment #12)
> There is pandoc (https://pandoc.org/) which claims to support both html5 and
> odt, but I have not used it myself, so I have no idea how well it works.

Perhaps someone closer to LO development than I am might want to open an issue about exploring the possibility of integrating some pandoc import-filter+ODT-output-filter pairs into LO.
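
For reference, a standalone pandoc conversion from HTML to ODT could be scripted roughly as below. This is only a sketch: the Python function names are hypothetical, it assumes a pandoc executable is on the PATH, and nobody in this thread has checked how well the resulting ODT matches the browser rendering of the attached document.

```python
import shutil
import subprocess


def build_pandoc_command(html_path, odt_path):
    """Builds the pandoc command line for an HTML-to-ODT conversion."""
    return [
        "pandoc", html_path,
        "--from", "html",
        "--to", "odt",
        "--output", odt_path,
    ]


def convert_html_to_odt(html_path, odt_path):
    """Runs pandoc; raises if it is missing or the conversion fails."""
    if shutil.which("pandoc") is None:
        raise FileNotFoundError("pandoc executable not found on PATH")
    subprocess.run(build_pandoc_command(html_path, odt_path), check=True)
```

For example, convert_html_to_odt("/path/to/doc.html", "/path/to/doc.odt") would produce an ODT that can then be opened in Writer and compared against LO's own HTML import.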