Bug 36977 - Improved HTML5/XHTML Import
Summary: Improved HTML5/XHTML Import
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Hardware: Other All
Importance: medium enhancement
Assignee: Not Assigned
Depends on:
Blocks: 101772
Reported: 2011-05-08 13:12 UTC by gleppert
Modified: 2022-11-24 21:39 UTC

See Also:
Crash report or crash signature:
Regression By:

XSLT 2.0 second iteration of XHTML import sample (28.68 KB, application/xslt+xml)
2011-06-03 18:31 UTC, gleppert

Description gleppert 2011-05-08 13:12:42 UTC
HTML import in OpenOffice.org/LibreOffice has not been considerably improved for many years and lacks many features. The import filter must be completely overhauled. Two possible solutions were mentioned on the EasyHacks wiki page:

The first idea:
Use libXML and libcroco in the HTML import
The current HTML import filter is pretty bad and uses homemade parsers, while there are nice HTML and CSS parsers out in the FOSS world. Beating that custom code into using libxml (HTML/XML parsing lib) and libcroco (CSS parsing lib) would help increase the quality of that filter.

The second idea:
Development of an XHTML import filter had already been started by Vyacheslav Sedov, and the patch submitted to the OOo bugtracker on 2009-02-01 is already better than the default HTML import filter in OOo/LO. The patch can be found here: http://qa.openoffice.org/issues/show_bug.cgi?id=83494. There are still many problems and unimplemented features. One particular issue is that the patch does not yet consider external Cascading Style Sheets (CSS). Access to external CSS for import/export filters has been possible since mid-2009, as XSLT 2.0 can now use external XSLT (XQuery) processors (closed OOo Issue 83500).
Comment 1 gleppert 2011-06-03 18:31:10 UTC
Created attachment 47508 [details]
XSLT 2.0 second iteration of XHTML import sample

I attached the XHTML import filter by Vyacheslav Sedov, which he had submitted to the OOo bugtracker on 2009-02-01.

He offered to finish the XHTML import filter for some subsistence allowance. Please refer to the bug entry at the OpenOffice.org bugtracker.
Comment 2 Peter Jentsch 2011-08-03 13:51:21 UTC
Hi there, 

I just stumbled across this EasyHack and want to note that we started to replace the Java-based XSLT import with a version that uses libxslt. libxslt supports XSLT 1.0 with exslt extensions (which IMHO is all you ever want from XSLT).

The attached XSLT script seems to use only one dubious XSLT feature, the tunnel attribute, which doesn't seem to be strictly required as far as I can see. 

Please keep that in mind when going further in that direction. Beyond that, I'd like to point to http://xhtml2odt.org/, which is a set of XSLT stylesheets which do, hm, xhtml2odt conversion. Even though they make a few assumptions about the structure of the HTML they import, which could be difficult to sell as a generic XHTML import facility in LO, those look *very* promising. So it'd be great to bundle those in a LO XSLT import filter.

Instead of importing XSLT 2.0 as a requirement (which inadvertently pulls in Java, because there's only one decent XSLT 2.0 implementation out there, which is Michael Kay's saxon9, written in Java), I'd like to assist in providing libxslt extension functions to keep calculations which cannot be performed well in XPath/XSLT out of the scripts.
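
As an illustration of the kind of calculation that is awkward in XSLT 1.0 and might be moved into an extension function, here is a minimal Python sketch that converts absolute CSS lengths to centimetres. This is only an assumed example of such a calculation: the thread does not say which calculations are meant, the function name `css_length_to_cm` is hypothetical, and this is not libxslt or LibreOffice code. It relies only on the fixed ratios between absolute CSS units (1in = 2.54cm = 96px = 72pt, 1pc = 12pt).

```python
# Absolute CSS length units expressed in inches
# (1in = 2.54cm = 25.4mm = 72pt = 6pc = 96px).
INCHES_PER_UNIT = {
    "in": 1.0,
    "cm": 1.0 / 2.54,
    "mm": 1.0 / 25.4,
    "pt": 1.0 / 72.0,
    "pc": 1.0 / 6.0,
    "px": 1.0 / 96.0,
}


def css_length_to_cm(value):
    """Convert an absolute CSS length such as '12pt' to centimetres.

    Hypothetical helper for illustration; relative units (em, %, ...)
    are rejected because they need layout context to resolve.
    """
    text = value.strip().lower()
    for unit, inches in INCHES_PER_UNIT.items():
        if text.endswith(unit):
            number = float(text[: -len(unit)])
            # Round to 4 decimals to hide floating-point noise.
            return round(number * inches * 2.54, 4)
    raise ValueError("unsupported CSS length: %r" % value)


if __name__ == "__main__":
    for sample in ("96px", "72pt", "10mm"):
        print(sample, "->", css_length_to_cm(sample), "cm")
```

In a real filter the same logic would have to be written in C and registered with the XSLT processor; the Python version only shows the arithmetic.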
Comment 3 Peter Jentsch 2011-08-03 13:54:25 UTC
Ah, and another one: while an XSLT-based solution will make developers happy (because you can use it on the server without requiring a running LO instance), copy/paste from the browser will most definitely require some C/C++ code within LO. So in order to really *replace* the current import, I'd favor the suggestion using libxml/libcroco (although I don't know libcroco).
Comment 4 gleppert 2011-08-05 00:50:12 UTC
Hi, will the combination of libxml/libcroco/libxslt be able to import XHTML with external style sheets (CSS in separate files)? This would be fabulous!
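
For illustration only: whichever parser ends up being used, the first step towards honouring external style sheets is finding the `<link rel="stylesheet">` references in the XHTML document. The sketch below uses Python's standard-library `xml.etree.ElementTree` as a stand-in for libxml; it is not LibreOffice filter code, and the function name `find_external_stylesheets` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def find_external_stylesheets(xhtml_text):
    """Return the href of every <link> whose rel contains 'stylesheet'."""
    root = ET.fromstring(xhtml_text)
    hrefs = []
    # XHTML elements live in the XHTML namespace, so qualify the tag name.
    for link in root.iter("{%s}link" % XHTML_NS):
        rel_tokens = link.get("rel", "").lower().split()
        href = link.get("href")
        # 'alternate stylesheet' also matches, since rel is a token list.
        if "stylesheet" in rel_tokens and href:
            hrefs.append(href)
    return hrefs


SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Sample</title>
    <link rel="stylesheet" type="text/css" href="style.css" />
    <link rel="alternate stylesheet" href="print.css" />
    <link rel="icon" href="favicon.ico" />
  </head>
  <body><p>Hello</p></body>
</html>"""

if __name__ == "__main__":
    print(find_external_stylesheets(SAMPLE))
```

The returned paths would still have to be resolved against the document's base URL, fetched, and handed to a CSS parser such as libcroco; none of that is shown here.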
Comment 5 Björn Michaelsen 2011-12-23 12:03:21 UTC Comment hidden (obsolete)
Comment 6 gleppert 2012-03-29 23:12:08 UTC
This request for enhancement seems to be still valid for LO 3.5.1
Comment 7 Caolán McNamara 2012-07-15 09:21:41 UTC
Comment on attachment 47508 [details]
XSLT 2.0 second iteration of XHTML import sample

Unsetting the patch flag on this attachment, because it's Vyacheslav Sedov's patch from http://qa.openoffice.org/issues/show_bug.cgi?id=83494, submitted to what was then OpenOffice.org, rather than a patch directly contributed to LibreOffice.
Comment 8 Michael Warner 2021-12-17 14:42:53 UTC
There is a C/C++ version of Saxon now:

However, it was auto-translated from the Java source using a commercial product that has been discontinued. One of the Saxon developers said they would continue maintaining Saxon/C, but I don't know what the future holds for that.