Bug 114856 - HTML parser in Writer broken
Summary: HTML parser in Writer broken
Status: RESOLVED DUPLICATE of bug 114428
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version: 3.3 all versions (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
Keywords: filter:html
Reported: 2018-01-05 18:05 UTC by Dirk
Modified: 2018-01-07 16:45 UTC
Description Dirk 2018-01-05 18:05:20 UTC
When trying to import an HTML document into Writer, it seems to insist on <html> being on the first line. Otherwise the file is imported as plain text.

This is broken, because nowadays^W for at least 20 years (since HTML 3.2, as far as I can recall) HTML has required a DOCTYPE header before the <html> tag.

My expectation is that the parser needs to be adjusted so that it is more flexible.
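
To make the request concrete, here is a minimal sketch of what more flexible detection could look like. It is not LibreOffice's actual type-detection code: the helper name `looks_like_html` and the list of items allowed before the root tag (byte order mark, XML declaration, comments, DOCTYPE) are assumptions made for illustration, and a DOCTYPE with an internal subset containing `>` is not handled.

```python
import re

# Hypothetical helper, not LibreOffice's actual type-detection code.
# Items that may appear before the root <html> tag (an assumed list).
_PROLOG = re.compile(
    r"""\s*(?:
        <\?xml\b.*?\?>      # XML declaration, as used by XHTML
      | <!--.*?-->          # comments
      | <!DOCTYPE\b[^>]*>   # document type declaration (no internal subset)
    )""",
    re.IGNORECASE | re.DOTALL | re.VERBOSE,
)


def looks_like_html(text: str) -> bool:
    """Return True if the first real element in `text` is <html>."""
    # Skip a leading byte order mark, if any.
    pos = 1 if text.startswith("\ufeff") else 0
    # Consume prolog items one at a time until none is left.
    while True:
        match = _PROLOG.match(text, pos)
        if match is None:
            break
        pos = match.end()
    rest = text[pos:].lstrip()
    return re.match(r"<html[\s>]", rest, re.IGNORECASE) is not None
```

With this approach a file is treated as HTML whether `<html>` is on the first line or preceded by a DOCTYPE, while plain text and other XML documents are still rejected.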
Comment 1 V Stuart Foote 2018-01-05 19:01:07 UTC
Hmm, not clear this is valid.

The GUI allows you to select the filter for importing or opening a document into an LO module, and HTML files with .htm/.html extensions are opened in Writer Web by default.

The Writer Web module provides an HTML source view mode to directly adjust content and markup.

And if you want to open HTML with formatting into Writer, selecting the "HTML Document Writer (*.html, *.htm, *.xhtml)" filter with the GUI will correctly handle it.

Beyond that, not clear there is an issue. Perhaps provide a sample document you believe is not opening correctly into Writer Web, or with filter selection into Writer.
Comment 2 Dirk 2018-01-06 10:24:41 UTC
> And if you want to open HTML with formatting into Writer, selecting the "HTML Document Writer (*.html, *.htm, *.xhtml)" filter with the GUI will correctly handle it.

As a user, my expectation is that I don't have to fiddle with a GUI to tell a program what the expected behavior is.

> Beyond that, not clear there is an issue.

For me and others it is an issue, as people stumble over this (sorry) brain-dead behavior. I needed to google it to find out how to import a freaking HTML file correctly.

> Perhaps provide a sample document you believe is not opening correctly into Writer Web, or with filter selection into Writer.

As I said, the filter is not an option for me.

Simple test:

prompt > wget eiklaut.net 
prompt > libreoffice index.html

Instead of the second step you can also use the GUI, via File --> Open.

It works as expected if I remove the first two lines from index.html.
Comment 3 Maxim Monastirsky 2018-01-06 21:20:16 UTC
(In reply to Dirk from comment #2)
> Simple test:
> 
> prompt > wget eiklaut.net 
> prompt > libreoffice index.html
This is *not* about the DOCTYPE header (which should work with current releases of LO), but rather about the additional <?xml version="1.0" encoding="utf-8" ?> line. LO 6.1 will include a fix for that too.
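
For releases without the fix, one possible workaround is to strip the leading XML declaration before opening the file. The sketch below is only an illustration: `strip_xml_declaration` is a hypothetical helper, not part of LibreOffice, and it assumes the declaration is the only thing preventing detection.

```python
import re

# Hypothetical workaround helper; the name and approach are assumptions,
# not part of LibreOffice.
# Matches an optional byte order mark, optional whitespace, and a leading
# <?xml ... ?> declaration together with the whitespace that follows it.
_XML_DECL = re.compile(r"\A\ufeff?\s*<\?xml\b.*?\?>\s*", re.DOTALL)


def strip_xml_declaration(text: str) -> str:
    """Remove a leading <?xml ... ?> declaration, if present."""
    return _XML_DECL.sub("", text, count=1)
```

Note that the declaration may carry the only encoding hint in the file, so a file that is not UTF-8 and has no other charset declaration might be decoded differently after stripping.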

*** This bug has been marked as a duplicate of bug 114428 ***
Comment 4 Dirk 2018-01-07 15:45:24 UTC
This was not a feature request but a bug report. Can we have a solution for the current versions as well, please?

My use case scenario is that I deal with a lot of PROPER HTML documents which libreoffice refuses to treat as such.
Comment 5 V Stuart Foote 2018-01-07 16:45:27 UTC
This issue is bug 114428 (or its dupe, bug 37753), which concerns the handling of XHTML, _not_ HTML.

Corrected for 6.1.0 on current master, and unlikely to be a candidate for backport to 5.4.

*** This bug has been marked as a duplicate of bug 114428 ***