Bug 152204 - Many HTML5 entity names not supported
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: filters and storage
Version (earliest affected): Inherited From OOo
Hardware: All All
Importance: medium minor
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: filter:html
Depends on:
Blocks: Writer-Web-Layout HTML-Import
Reported: 2022-11-24 17:03 UTC by Robert Margulski
Modified: 2022-11-24 21:53 UTC (History)
CC List: 3 users

See Also:
Crash report or crash signature:


Attachments
html unicode megalist (68.13 KB, text/html)
2022-11-24 20:55 UTC, Stéphane Guillou (stragu)
LO ODF text document sample with mix of named character entity values (7.05 KB, text/html)
2022-11-24 21:42 UTC, V Stuart Foote

Description Robert Margulski 2022-11-24 17:03:35 UTC
Description:
Open the HTML file in LibreOffice 7.4.3.2.
&abreve; - LOffice does not know what to do with this
&#259; displays ok
&#x0103; displays ok
All three should display the same.

Steps to Reproduce:
1. &abreve; - LOffice does not know what to do with this
2. &#259; displays ok
3. &#x0103; displays ok

Actual Results:
LOffice displays the literal text "&abreve;"

Expected Results:
Should be one character: a lowercase "a" with a breve (ă)



Reproducible: Always


User Profile Reset: No

Additional Info:
<html>
<head>
	<meta name="Author" CONTENT="Robert Margulski">
	<meta name="Description" CONTENT="Romanian - a breve">
</head>
<body>

<h3>
Romanian - a breve
<br />
<br />&amp;abreve; = &abreve; (does NOT display correctly in LOffice 7.4.3.2)
<br />&amp;#259;   = &#259; (Okay)
<br />&amp;#x0103; = &#x0103; (Okay)
</h3>

</body>
</html>
Comment 1 raal 2022-11-24 17:50:24 UTC
Confirmed. Also reproducible in Version 4.1.0.0.alpha0+ (Build ID: efca6f15609322f62a35619619a6d5fe5c9bd5a)

Version: 7.5.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 7b23c53232245a1f61c3e8ddff59d049a49fe975
CPU threads: 4; OS: Linux 5.15; UI render: default; VCL: gtk3
Locale: cs-CZ (cs_CZ.UTF-8); UI: en-US
Calc: threaded
Comment 2 Stéphane Guillou (stragu) 2022-11-24 20:23:19 UTC
To make it explicit, it's the Unicode HTML entity:

Char	Dec	Hex	Entity	Name
ă	259	0103	&abreve;	Latin Small Letter A with Breve

https://www.compart.com/en/unicode/U+0103
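
As a quick cross-check of the three spellings from the description, here is a minimal sketch using Python's standard `html` module. It serves only as a reference decoder and is not anything LibreOffice itself uses; the variable names are illustrative.

```python
import html
import unicodedata

# The three spellings from the report: named, decimal, and hexadecimal.
spellings = ["&abreve;", "&#259;", "&#x0103;"]

decoded = [html.unescape(s) for s in spellings]

for s, ch in zip(spellings, decoded):
    # Show the code point in both bases next to the official Unicode name.
    print(f"{s:10} -> {ch!r} dec={ord(ch)} hex={ord(ch):04X} {unicodedata.name(ch)}")

# All three should resolve to the same single character, U+0103.
assert len(set(decoded)) == 1
```

A conforming HTML5 decoder yields one identical character for all three, which is the behaviour the report expects from the import filter.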
Comment 3 Stéphane Guillou (stragu) 2022-11-24 20:55:32 UTC
Created attachment 183757
html unicode megalist

This actually extends to dozens of Unicode HTML entity names (an ampersand followed only by letters); see the attached list.
Comment 4 V Stuart Foote 2022-11-24 21:39:52 UTC
Confirming.

Version: 7.5.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 651658d37bcb3f493942dd5d0b9a0d65c96f105c
CPU threads: 8; OS: Windows 10.0 Build 19044; UI render: Skia/Vulkan; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: threaded

The LibreOffice import filter does not handle the additional HTML / XML named character entities added for HTML5, not just &abreve; or &Abreve; as here. The unhandled entities are not converted to the appropriate glyph on the LibreOffice document canvas and remain plain text.

Dante did add the handling needed for MathML support with https://gerrit.libreoffice.org/c/core/+/108333

But something similar is needed to support Writer Web parsing the characters from HTML/XHTML or XML.

The attached test ODF text doc shows an example of the named entities not being handled by the LibreOffice Writer Web import.

=-ref-=
https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references
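
A rough way to enumerate the names that HTML5 added on top of HTML 4.01 is to compare the two tables shipped in Python's standard library. This is only a sketch: it ASSUMES that the HTML 4.01 set approximates what the import filter already handles, which has not been checked against the LibreOffice source.

```python
from html.entities import html5, name2codepoint

# html5 maps names such as "abreve;" to their replacement text; some legacy
# names also appear without the trailing semicolon, so keep only the ";" forms.
html5_names = {name[:-1] for name in html5 if name.endswith(";")}

# name2codepoint holds the HTML 4.01 entity names (no semicolon).
html4_names = set(name2codepoint)

# ASSUMPTION: names outside the HTML 4.01 set are the likely unhandled ones.
html5_only = sorted(html5_names - html4_names)

print(len(html5_only), "names are in the HTML5 table but not in HTML 4.01")
print("abreve" in html5_only, "Abreve" in html5_only)
```

The resulting list could serve as a starting point for test documents; the authoritative table remains the WHATWG named character references list linked above.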
Comment 5 V Stuart Foote 2022-11-24 21:42:50 UTC
Created attachment 183759
LO ODF text document sample with mix of named character entity values

Named character entities that were added in HTML5, and so have no handling in the LO import filter, are left in their "& <name> ;" form on import.
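
Such leftovers could also be spotted mechanically. The following is a minimal sketch, assuming the text of the imported document has been saved as plain text; `leftover_entities` and `sample` are hypothetical names for illustration, and Python's `html.entities.html5` table stands in for the HTML5 name list.

```python
import re
from html.entities import html5

# A named reference: ampersand, a letter, then letters or digits, then ";".
NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def leftover_entities(text):
    """Return {reference: character} for valid HTML5 names still present in text."""
    found = {}
    for match in NAMED_REF.finditer(text):
        key = match.group(1) + ";"
        if key in html5:
            found[match.group(0)] = html5[key]
    return found

# Hypothetical sample: two real HTML5 names and one made-up name.
sample = "Romanian: &abreve; and &Abreve; but &notarealname; stays unknown"
print(leftover_entities(sample))
```

Only references whose names exist in the HTML5 table are reported, so arbitrary "&word;" sequences in ordinary text are ignored.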
Comment 6 V Stuart Foote 2022-11-24 21:53:10 UTC
(In reply to V Stuart Foote from comment #5)
> Created attachment 183759
> LO ODF text document sample with mix of named character entity values
> 
> Named character entities that were added in HTML5, and so have no handling
> in the LO import filter, are left in their "& <name> ;" form on import.

Oops, sorry, that is actually an HTML file generated from LO 7.5, then edited a bit to clean up the HTML formatting and put each stanza on its own row.

So once it is opened in LibreOffice, select the HTML source view in the Writer Web module to see which entities the import filter is missing.