Bug 119944 - Writer does not resolve some/most HTML entities.
Summary: Writer does not resolve some/most HTML entities.
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: Inherited From OOo (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: filter:html
Depends on:
Blocks: HTML-Import
Reported: 2018-09-18 03:21 UTC by Jens Troeger
Modified: 2019-04-09 14:40 UTC
CC List: 1 user

See Also:
Crash report or crash signature:


Attachments
HTML file containing a few unresolved HTML entities. (620 bytes, text/html)
2018-09-18 03:21 UTC, Jens Troeger
HTML file containing all HTML5 entities. (280.26 KB, text/html)
2018-09-28 00:22 UTC, Jens Troeger
HTML file containing all HTML5 entities. (402.86 KB, text/html)
2018-09-28 06:49 UTC, Jens Troeger

Description Jens Troeger 2018-09-18 03:21:01 UTC
Created attachment 144968
HTML file containing a few unresolved HTML entities.

See the attached HTML document. Loading it does not resolve all HTML entities to their respective Unicode characters. There are probably (many) more entities that I haven’t tried…

Reference chart: https://dev.w3.org/html5/html-author/charref
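
For illustration, the expected behavior can be checked against Python's standard library: the named, decimal, and hexadecimal forms of one entity should all resolve to the same Unicode character. This is only a sketch; `&hellip;` is an arbitrary sample chosen here and is not necessarily one of the entities in the attached file.

```python
import html

# Three spellings of the same character (U+2026, horizontal ellipsis).
# The choice of entity is an assumption made for this example.
samples = ["&hellip;", "&#8230;", "&#x2026;"]

# html.unescape resolves named and numeric character references.
resolved = [html.unescape(s) for s in samples]

for source, text in zip(samples, resolved):
    print(f"{source!r} -> {text!r} (U+{ord(text):04X})")
```

A correct HTML import would be expected to show the same character for all three spellings.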
Comment 1 Alex Thurgood 2018-09-19 09:15:11 UTC
Confirming with

Version: 6.2.0.0.alpha0+
Build ID: 694a433d5fbc9ab77dd37e7be9e79f3d3776eb24
CPU threads: 4; OS: Mac OS X 10.13.6; UI render: default; 
Locale: fr-FR (fr_FR.UTF-8); Calc: threaded
Comment 2 Jens Troeger 2018-09-28 00:21:13 UTC
I’ve updated the HTML file: the new one is generated from the w3 reference webpage and should include all HTML entities in their text/hex/dec encodings.

The Python script I used to generate that file is included as a comment in that same file. Note, however, that Python’s html5 entity lookup also appears to be incomplete, resulting in a "???" string rather than the proper text.
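
A minimal sketch of how such a test file could be generated is shown here. It is not the script embedded in the attachment: the function names `entity_rows` and `build_test_page` and the output file name `entities.html` are hypothetical, and the source of the entity list is assumed to be Python's `html.entities.html5` table rather than the w3 page.

```python
import html
from html.entities import html5


def entity_rows():
    """Yield (named, decimal, hexadecimal, expected_text) per HTML5 entity."""
    for name, text in sorted(html5.items()):
        if not name.endswith(";"):
            continue  # skip legacy spellings without the trailing semicolon
        named = "&" + name
        # Some entities expand to more than one code point, so emit one
        # numeric reference per code point.
        decimal = "".join(f"&#{ord(ch)};" for ch in text)
        hexadecimal = "".join(f"&#x{ord(ch):X};" for ch in text)
        yield named, decimal, hexadecimal, text


def build_test_page():
    """Return an HTML document with one table row per entity."""
    lines = [
        "<!DOCTYPE html>",
        '<html><head><meta charset="utf-8"></head><body>',
        "<table>",
    ]
    for named, decimal, hexadecimal, _text in entity_rows():
        label = html.escape(named)  # renders as the literal "&name;" text
        lines.append(
            f"<tr><td>{label}</td><td>{named}</td>"
            f"<td>{decimal}</td><td>{hexadecimal}</td></tr>"
        )
    lines += ["</table>", "</body></html>"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical output file name.
    with open("entities.html", "w", encoding="utf-8") as handle:
        handle.write(build_test_page())
```

Opening the resulting file in Writer and in a browser side by side makes it easy to spot rows where the three encodings do not render as the same character.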

Poked around a bit here:

    https://github.com/LibreOffice/core/blob/master/svtools/source/svhtml/parhtml.cxx#L394-L622

but it seems that the entity-aware string object messes things up. The entity parser itself looks OK to me.
Comment 3 Jens Troeger 2018-09-28 00:22:38 UTC
Created attachment 145234
HTML file containing all HTML5 entities.
Comment 4 Jens Troeger 2018-09-28 06:49:20 UTC
Created attachment 145242
HTML file containing all HTML5 entities.

Ugh 😒 Due to a bug in the Python code that generated the file, I wrongly concluded that Python doesn’t contain all entities. After checking the Python code and then my own code, I smacked my head and fixed everything.

Updated test HTML file attached to this comment.
Comment 5 Xisco Faulí 2018-10-15 16:04:48 UTC
Also reproduced in

LibreOffice 3.3.0 
OOO330m19 (Build:6)
tag libreoffice-3.3.0.4