Description:
Open an HTML file in LibreOffice 7.4.3.2.
&abreve; - LOffice does not know what to do with this
&#259; displays ok
&#x0103; displays ok
All three should display the same.

Steps to Reproduce:
1. &abreve; - LOffice does not know what to do with this
2. &#259; displays ok
3. &#x0103; displays ok

Actual Results:
LOffice displays "&abreve;"

Expected Results:
Should be one character: a lowercase "a" with the breve symbol.

Reproducible: Always

User Profile Reset: No

Additional Info:
<html>
<head>
<meta name="Author" CONTENT="Robert Margulski">
<meta name="Description" CONTENT="Romanian - a breve">
</head>
<body>
<h3>
Romanian - a breve
<br />
<br />&breve = ă (does NOT display correctly in LOffice 7.4.3.2)
<br />&#259 = ă (Okay)
<br />&#x0103 = ă (Okay)
</h3>
</body>
</html>
Confirmed.

Version 4.1.0.0.alpha0+ (Build ID: efca6f15609322f62a35619619a6d5fe5c9bd5a)

Version: 7.5.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 7b23c53232245a1f61c3e8ddff59d049a49fe975
CPU threads: 4; OS: Linux 5.15; UI render: default; VCL: gtk3
Locale: cs-CZ (cs_CZ.UTF-8); UI: en-US
Calc: threaded
To make it explicit, it's the Unicode HTML entity:

Char  Dec  Hex   Entity    Name
ă     259  0103  &abreve;  Latin Small Letter A with Breve

https://www.compart.com/en/unicode/U+0103
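For reference, a minimal sketch that checks the three spellings from the report against Python's standard-library HTML5 entity table. It only illustrates what an HTML5-aware decoder returns for these references; it says nothing about how LibreOffice's filter is implemented.

```python
import html
import unicodedata

# The three spellings from the report: named, decimal, hexadecimal.
forms = ["&abreve;", "&#259;", "&#x0103;"]

# html.unescape() resolves named references using the HTML5 entity list.
decoded = [html.unescape(form) for form in forms]

for form, char in zip(forms, decoded):
    # Print the code point and the official Unicode character name.
    print(f"{form:10} -> U+{ord(char):04X} {unicodedata.name(char)}")
```

All three should resolve to the same single code point, which is the behavior the report expects from the import filter.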
Created attachment 183757
html unicode megalist

This actually extends to dozens of Unicode HTML entity names (one ampersand followed by only letters); see the list.
Confirming.

Version: 7.5.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 651658d37bcb3f493942dd5d0b9a0d65c96f105c
CPU threads: 8; OS: Windows 10.0 Build 19044; UI render: Skia/Vulkan; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: threaded

The LibreOffice import filter does not handle the additional HTML / XML named character entities added for HTML5, not just ă or Ă as here. The unhandled entities are not converted to the appropriate glyph on the LibreOffice document canvas and remain plain text.

Dante did add the handling needed for MathML support with
https://gerrit.libreoffice.org/c/core/+/108333
but something similar is needed for Writer Web to parse the characters from HTML/XHTML or XML.

The attached test ODF text doc shows examples of named entities not being handled in LibreOffice Writer Web import.

=-ref-=
https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references
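To get a rough idea of which names are HTML5 additions, the two entity tables in Python's standard library can be compared. This is a sketch only: it assumes the set the filter already handles is close to the HTML 4.01 list, which this thread does not confirm, and `is_html5_only` is a hypothetical helper name.

```python
from html.entities import html5, name2codepoint

# name2codepoint holds the HTML 4.01 names (keys have no trailing ';').
html4_names = set(name2codepoint)

# html5 holds the HTML5 names; most keys end in ';', so strip it to compare.
html5_names = {name.rstrip(";") for name in html5}

# Names that exist only in the HTML5 list.
html5_only = html5_names - html4_names


def is_html5_only(name: str) -> bool:
    """Return True if `name` (without '&' and ';') was added in HTML5."""
    return name in html5_only


print(len(html5_only), "names appear only in the HTML5 table")
print("abreve:", is_html5_only("abreve"), "| eacute:", is_html5_only("eacute"))
```

Under that assumption, `abreve` and `Abreve` fall in the HTML5-only group, while older names such as `eacute` do not.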
Created attachment 183759
LO ODF text document sample with mix of named character entity values

Named character entities added in HTML5 have no handling in the LO import filter, so an unrecognized entity is left in its "& <name> ;" form on import.
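The observed behavior can be imitated with a small decoder that only knows a limited table and passes unknown names through unchanged. This is a hypothetical illustration, not LibreOffice code: `decode_known` and `leftover_entities` are made-up names, and the HTML 4.01 table from Python's standard library stands in for whatever table the filter actually uses.

```python
import re
from html.entities import name2codepoint

# Matches named references such as "&abreve;" (a letter, then letters/digits, then ';').
NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def decode_known(text: str, table: dict = name2codepoint) -> str:
    """Replace named references found in `table`; leave all others untouched."""
    def repl(match):
        codepoint = table.get(match.group(1))
        # Unknown names are returned verbatim, i.e. they stay as "&name;".
        return chr(codepoint) if codepoint is not None else match.group(0)
    return NAMED_REF.sub(repl, text)


def leftover_entities(text: str) -> list:
    """List the names of references that survive decoding, in order of appearance."""
    return NAMED_REF.findall(decode_known(text))


sample = "caf&eacute; &abreve; &Abreve;"
print(decode_known(sample))
print(leftover_entities(sample))
```

Running something like `leftover_entities` over the attached sample would give a quick list of names that a limited table cannot resolve.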
(In reply to V Stuart Foote from comment #5)
> Created attachment 183759
> LO ODF text document sample with mix of named character entity values
>
> For named character entities added at HTML5 so without LO import filter
> handling, the unrecognized entity is left in its "& <name> ;" format on
> import.

Oops, sorry, that is actually an HTML file generated from LO 7.5, then edited a bit to clean up the HTML formatting and put each stanza on its own row.

So when it is opened in LibreOffice, select the HTML source view from the Writer Web module to see which entities are missing from the filter import.