HTML import in OpenOffice.org/LibreOffice has not been substantially improved for many years and lacks many features. The import filter needs a complete overhaul. Two possible solutions were mentioned on the EasyHacks wiki page:

The first idea: Use libXML and libcroco in the HTML import. The current HTML import filter is pretty bad and uses home-made parsers, while there are nice HTML and CSS parsers out in the FOSS world. Porting that custom code to libxml (HTML/XML parsing lib) and libcroco (CSS parsing lib) would help increase the quality of that filter.

The second idea: Development of an XHTML import filter was already started by Vyacheslav Sedov, and the patch submitted to the OOo bugtracker on 2009-02-01 is already better than the default HTML import filter in OOo/LO. The patch can be found here: http://qa.openoffice.org/issues/show_bug.cgi?id=83494. There are still many problems and unimplemented features. One particular issue is that the patch does not yet consider external Cascading Style Sheets (CSS). Access to external CSS for import/export filters is now possible, as XSLT 2.0 can use external XSLT (XQuery) processors since mid-2009 (closed OOo Issue 83500).
Created attachment 47508 [details] XSLT 2.0 second iteration of XHTML import sample

I attached the XHTML import filter by Vyacheslav Sedov, which he had submitted to the OOo bugtracker on 2009-02-01. He offered to finish the XHTML import filter for some subsistence allowance. Please refer to the bug entry at the OpenOffice.org bugtracker.
Hi there, I just stumbled across this EasyHack and want to note that we started to replace the Java-based XSLT import with a version that uses libxslt. libxslt supports XSLT 1.0 with EXSLT extensions (which IMHO is all you ever want from XSLT). The attached XSLT script seems to use only one dubious XSLT feature, the tunnel attribute, which doesn't seem to be strictly required as far as I can see. Please keep that in mind when going further in that direction. Beyond that, I'd like to point to http://xhtml2odt.org/, which is a set of XSLT stylesheets which do, hm, xhtml2odt conversion. Even though they make a few assumptions about the structure of the HTML they import, which could be difficult to sell as a generic XHTML import facility in LO, those look *very* promising. So it'd be great to bundle those in a LO XSLT import filter. Instead of introducing XSLT 2.0 as a requirement (which inadvertently pulls in Java, because there's only one decent XSLT 2.0 implementation out there, which is Michael Kay's saxon9, written in Java), I'd like to assist in providing libxslt extension functions to keep calculations which cannot be performed well in XPath/XSLT out of the scripts.
Ah, and another one: while an XSLT-based solution will make developers happy (because you can use it on the server without requiring a running LO instance), copy/paste from the browser will most definitely require some C/C++ code within LO. So in order to really *replace* the current import I'd favor the suggestion using libxml/libcroco (although I don't know libcroco).
Hi, will the combination of libxml/libcroco/libxslt be able to import XHTML with external style sheets (CSS in separate files)? This would be fabulous!
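For illustration only, here is a minimal sketch of the first step any such importer would need for external CSS: finding out which style sheets an XHTML document references. This is not libxml/libcroco code. It uses Python's standard library `xml.etree.ElementTree` as a stand-in parser, and the function name `external_stylesheets`, the sample document, and the example.org URL are made up for this example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Namespace that well-formed XHTML documents declare on <html>.
XHTML_NS = "http://www.w3.org/1999/xhtml"


def external_stylesheets(xhtml_text, base_url):
    """Return absolute URLs of the style sheets linked from an XHTML document.

    Hypothetical helper: a real importer would go on to fetch and parse
    these files with a CSS parser, which is not shown here.
    """
    root = ET.fromstring(xhtml_text)
    urls = []
    for link in root.iter("{%s}link" % XHTML_NS):
        # rel may hold several space-separated tokens, e.g. "alternate stylesheet".
        rel = link.get("rel", "").lower().split()
        href = link.get("href")
        if "stylesheet" in rel and href:
            # Resolve relative references against the document location.
            urls.append(urljoin(base_url, href))
    return urls


# Made-up sample document with one style sheet link and one unrelated link.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Sample</title>
    <link rel="stylesheet" type="text/css" href="css/main.css"/>
    <link rel="icon" href="favicon.ico"/>
  </head>
  <body><p class="note">Hello</p></body>
</html>"""

if __name__ == "__main__":
    print(external_stylesheets(SAMPLE, "http://example.org/docs/page.xhtml"))
```

The sketch only covers `<link>` elements. Style sheets pulled in through `@import` rules or `<?xml-stylesheet?>` processing instructions would need separate handling.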
[This is an automated message.] This bug was filed before the changes to Bugzilla on 2011-10-16. Thus it started right out as NEW without ever being explicitly confirmed. The bug is changed to state NEEDINFO for this reason. To move this bug from NEEDINFO back to NEW please check if the bug still persists with the 3.5.0 beta1 or beta2 prereleases. Details on how to test the 3.5.0 beta1 can be found at: http://wiki.documentfoundation.org/QA/BugHunting_Session_3.5.0.-1 more detail on this bulk operation: http://nabble.documentfoundation.org/RFC-Operation-Spamzilla-tp3607474p3607474.html
This request for enhancement still seems to be valid for LO 3.5.1.
Comment on attachment 47508 [details] XSLT 2.0 second iteration of XHTML import sample

Unsetting the patch flag on this attachment, because it is Vyacheslav Sedov's patch from http://qa.openoffice.org/issues/show_bug.cgi?id=83494, submitted to what was then OpenOffice.org, rather than a patch contributed directly to LibreOffice.
There is a C/C++ version of Saxon now: https://www.saxonica.com/saxon-c/index.xml Although, it was auto-translated from the Java source using a commercial product that has been discontinued. One of the Saxon developers said they would continue maintaining Saxon/C, but I don't know what the future holds for that.
*** This bug has been marked as a duplicate of bug 95861 ***