Bug 119944 - Writer does not resolve some/most HTML entities.
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): Inherited From OOo
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: filter:html
Depends on:
Blocks: HTML-Import
 
Reported: 2018-09-18 03:21 UTC by Jens Troeger
Modified: 2023-04-10 03:17 UTC (History)

See Also:
Crash report or crash signature:


Attachments
HTML file containing a few unresolved HTML entities. (620 bytes, text/html)
2018-09-18 03:21 UTC, Jens Troeger
HTML file containing all HTML5 entities. (280.26 KB, text/html)
2018-09-28 00:22 UTC, Jens Troeger
HTML file containing all HTML5 entities. (402.86 KB, text/html)
2018-09-28 06:49 UTC, Jens Troeger

Description Jens Troeger 2018-09-18 03:21:01 UTC
Created attachment 144968
HTML file containing a few unresolved HTML entities.

See the attached HTML document. Loading it does not resolve all HTML entities to their respective Unicode characters. There are probably (many) more entities that I haven’t tried…

Reference chart: https://dev.w3.org/html5/html-author/charref
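
For comparison, the expected resolution can be checked outside Writer. The following is a minimal sketch, assuming Python 3's standard `html` module as a reference decoder; it says nothing about LibreOffice's own code path, and the three sample references are illustrative picks, not necessarily the ones in the attachment.

```python
import html

# Three ways of writing the same character (BLACK HEART SUIT, U+2665):
# a named reference, a decimal reference and a hexadecimal reference.
# These samples are illustrative and not taken from the attached file.
samples = ["&hearts;", "&#9829;", "&#x2665;"]

# html.unescape resolves all three forms to the Unicode character.
decoded = [html.unescape(s) for s in samples]

for reference, character in zip(samples, decoded):
    print(f"{reference!r} -> U+{ord(character):04X}")
```

An importer that resolves entities correctly should produce the same character for all three spellings.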
Comment 1 Alex Thurgood 2018-09-19 09:15:11 UTC
Confirming with

Version: 6.2.0.0.alpha0+
Build ID: 694a433d5fbc9ab77dd37e7be9e79f3d3776eb24
CPU threads: 4; OS: Mac OS X 10.13.6; UI render: default; 
Locale: fr-FR (fr_FR.UTF-8); Calc: threaded
Comment 2 Jens Troeger 2018-09-28 00:21:13 UTC
I’ve updated the HTML file: the new one is generated from the w3 reference webpage and should include all HTML entities in their text/hex/dec encodings.

The Python script I used to generate that file is included as a comment in that same file; note, however, that Python’s html5 entity lookup also appears to be incomplete, resulting in a "???" string rather than the proper text.

Poked around a bit here:

    https://github.com/LibreOffice/core/blob/master/svtools/source/svhtml/parhtml.cxx#L394-L622

but it seems that the entity-aware string object messes things up.  The entity parser itself looks ok to me.
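
A test file like the one described above could be generated roughly as follows. This is a hypothetical sketch, not the script embedded in the attachment: it assumes Python 3's `html.entities.html5` table as the entity source instead of the w3 reference page, and the helper names `build_rows` and `build_html` are made up for illustration.

```python
import html
from html.entities import html5


def build_rows():
    """Return (named, decimal, hexadecimal, expected_text) for each entity."""
    rows = []
    for name, text in sorted(html5.items()):
        if not name.endswith(";"):
            continue  # skip legacy spellings without the trailing semicolon
        named = "&" + name
        # Some entities expand to two code points, so encode each one.
        dec = "".join("&#%d;" % ord(ch) for ch in text)
        hexa = "".join("&#x%X;" % ord(ch) for ch in text)
        rows.append((named, dec, hexa, text))
    return rows


def build_html(rows):
    """Build a small HTML table: escaped name, then the three encodings."""
    lines = [
        "<!DOCTYPE html>",
        '<html><head><meta charset="utf-8"><title>HTML5 entities</title></head>',
        "<body><table>",
    ]
    for named, dec, hexa, _text in rows:
        # The first cell shows the reference literally; the other three
        # cells should all render as the same character(s) once imported.
        lines.append(
            "<tr><td>%s</td><td>%s</td><td>%s</td><td>%s</td></tr>"
            % (html.escape(named), named, dec, hexa)
        )
    lines.append("</table></body></html>")
    return "\n".join(lines)


rows = build_rows()
document = build_html(rows)
```

Writing `document` to a file encoded as UTF-8 and opening it in Writer would show, row by row, which of the three encodings is left unresolved.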
Comment 3 Jens Troeger 2018-09-28 00:22:38 UTC
Created attachment 145234
HTML file containing all HTML5 entities.
Comment 4 Jens Troeger 2018-09-28 06:49:20 UTC
Created attachment 145242
HTML file containing all HTML5 entities.

Ugh 😒 Due to a bug in the Python code that generated the file, I wrongly concluded that Python doesn’t contain all entities. After checking the Python code and then my code, I smacked my head and fixed everything.

Updated test HTML file attached to this comment.
Comment 5 Xisco Faulí 2018-10-15 16:04:48 UTC
Also reproduced in

LibreOffice 3.3.0 
OOO330m19 (Build:6)
tag libreoffice-3.3.0.4
Comment 6 QA Administrators 2021-04-09 03:46:55 UTC Comment hidden (obsolete)
Comment 7 Jens Troeger 2021-04-09 03:52:41 UTC
Bug still exists with LibreOffice 7.0.4.2 dcf040e67528d9187c66b2379df5ea4407429775
Comment 8 QA Administrators 2023-04-10 03:17:56 UTC
Dear Jens Troeger,

To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.

There have been thousands of bug fixes and commits since anyone checked on this bug report. During that time, it's possible that the bug has been fixed, or the details of the problem have changed. We'd really appreciate your help in getting confirmation that the bug is still present.

If you have time, please do the following:

Test to see if the bug is still present with the latest version of LibreOffice from https://www.libreoffice.org/download/

If the bug is present, please leave a comment that includes the information from Help - About LibreOffice.
 
If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave a comment that includes the information from Help - About LibreOffice.

Please DO NOT

Update the version field
Reply via email (please reply directly on the bug tracker)
Set the bug's Status field to RESOLVED - FIXED (this status has a particular meaning that is not appropriate in this case)


If you want to do more to help you can test to see if your issue is a REGRESSION. To do so:
1. Download and install oldest version of LibreOffice (usually 3.3 unless your bug pertains to a feature added after 3.3) from https://downloadarchive.documentfoundation.org/libreoffice/old/

2. Test your bug
3. Leave a comment with your results.
4a. If the bug was present with 3.3 - set version to 'inherited from OOo';
4b. If the bug was not present in 3.3 - add 'regression' to keyword


Feel free to come ask questions or to say hello in our QA chat: https://web.libera.chat/?settings=#libreoffice-qa

Thank you for helping us make LibreOffice even better for everyone!

Warm Regards,
QA Team

MassPing-UntouchedBug