Bug 124588 - FILEOPEN DOC DOCX RTF: U+00AD should not be treated as soft hyphen in Word documents
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): Inherited From OOo
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
Keywords: dataLoss
Depends on:
Blocks: RTF Formatting-Mark DOCX DOC
Reported: 2019-04-07 13:40 UTC by Phil Krylov
Modified: 2019-04-07 15:58 UTC (History)
CC List: 1 user

See Also:
Crash report or crash signature:

Document to reproduce the bug (43.59 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2019-04-07 13:41 UTC, Phil Krylov
Font to reproduce the bug (44.71 KB, application/x-font-ttf)
2019-04-07 13:42 UTC, Phil Krylov
Word screenshot (8.00 KB, image/png)
2019-04-07 13:43 UTC, Phil Krylov
Writer screenshot (8.40 KB, image/png)
2019-04-07 13:44 UTC, Phil Krylov

Description Phil Krylov 2019-04-07 13:40:03 UTC
Word treats U+00AD as a normal character, and there are actual fonts that map a non-hyphen glyph to this codepoint. For soft hyphens, Word uses 0x1F in DOC, <w:softHyphen/> in DOCX, and \- in RTF. On import, Writer converts all of these to U+00AD, so normal use of the U+00AD character is not possible. Even worse, one can't distinguish a normal U+00AD character from a soft hyphen after import, so non-Unicode-compliant usages can't be changed to some other codepoint.
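
The loss of information described above can be illustrated with a minimal Python sketch. It is not Writer's import code: the run fragment, the `import_lossy` name, and the flattening logic are assumptions made up for illustration. It only shows that once `<w:softHyphen/>` is mapped to U+00AD, a literal U+00AD in the text and a real soft hyphen become indistinguishable.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
SOFT_HYPHEN = "\u00ad"

# Hypothetical DOCX run: a literal U+00AD inside w:t, followed by a
# real soft hyphen element and more text.
RUN_XML = (
    f'<w:r xmlns:w="{W}">'
    "<w:t>a\u00adb</w:t><w:softHyphen/><w:t>c</w:t>"
    "</w:r>"
)

def import_lossy(run_xml: str) -> str:
    """Flatten a run to plain text, mapping w:softHyphen to U+00AD."""
    out = []
    for child in ET.fromstring(run_xml):
        if child.tag == f"{{{W}}}t":
            out.append(child.text or "")
        elif child.tag == f"{{{W}}}softHyphen":
            out.append(SOFT_HYPHEN)
    return "".join(out)

text = import_lossy(RUN_XML)
# Both occurrences are now the same codepoint; their origin is gone.
print([hex(ord(ch)) for ch in text])
```

After the flattening step, the result contains two U+00AD characters and nothing records which one came from the text and which from the soft hyphen element.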

Steps to Reproduce:
Install the attached font and open the attached document

Actual Results:
You see a soft hyphen in the sample

Expected Results:
A diacritic from the font should be displayed

Reproducible: Always

User Profile Reset: No

Additional Info:
Comment 1 Phil Krylov 2019-04-07 13:41:48 UTC
Created attachment 150579
Document to reproduce the bug
Comment 2 Phil Krylov 2019-04-07 13:42:25 UTC
Created attachment 150580
Font to reproduce the bug
Comment 3 Phil Krylov 2019-04-07 13:43:15 UTC
Created attachment 150581
Word screenshot
Comment 4 Phil Krylov 2019-04-07 13:44:03 UTC
Created attachment 150582
Writer screenshot
Comment 5 Mike Kaganski 2019-04-07 14:14:32 UTC
But U+00AD *is* the soft hyphen? At least that is what Unicode says: https://www.unicode.org/charts/PDF/U0080.pdf
Comment 6 Phil Krylov 2019-04-07 14:20:47 UTC
Yes, it is, per the Unicode spec. But in Word documents, U+00AD is a normal character. So the problem is how to allow U+00AD to be used as a normal character in LibreOffice (if we remap such characters on import to some other codepoint, they won't be displayed with the proper glyph). Perhaps a special character attribute could be added for verbatim usages of special characters.
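
The "special character attribute" idea might look roughly like the sketch below. Everything in it is hypothetical (the `Char` type, the `verbatim` flag, and `import_doc_text` are invented names, not LibreOffice structures). It assumes DOC-style input text in which 0x1F marks a soft hyphen and U+00AD is a literal character, as described in this report.

```python
from dataclasses import dataclass

SOFT_HYPHEN = "\u00ad"

@dataclass(frozen=True)
class Char:
    code: str
    # Hypothetical attribute: True means "render the font's glyph as-is,
    # do not apply soft hyphen behaviour".
    verbatim: bool = False

def import_doc_text(raw: str) -> list:
    """Import DOC-style text, keeping literal U+00AD apart from soft hyphens."""
    result = []
    for ch in raw:
        if ch == "\x1f":
            # Word's soft hyphen marker in DOC becomes a real soft hyphen.
            result.append(Char(SOFT_HYPHEN))
        elif ch == SOFT_HYPHEN:
            # Literal U+00AD keeps its codepoint but is flagged verbatim.
            result.append(Char(SOFT_HYPHEN, verbatim=True))
        else:
            result.append(Char(ch))
    return result

chars = import_doc_text("a\u00adb\x1fc")
```

With this shape the codepoint is unchanged, so the font's own glyph for U+00AD would still be selected, while the flag preserves the distinction that plain text loses.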
Comment 7 Phil Krylov 2019-04-07 15:58:03 UTC
Another option could be adding a user-changeable import filter preference to convert U+00AD to some other codepoint/string. Ugly, right.
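
For comparison, the import-preference option could be sketched as follows. The function name and the `literal_replacement` parameter are hypothetical, and the replacement used in the example (U+0301) is an arbitrary choice for illustration, not a recommendation. As noted in comment 6, a remapped character only displays correctly if the font has a suitable glyph at the target codepoint.

```python
SOFT_HYPHEN = "\u00ad"

def import_with_remap(raw: str, literal_replacement: str) -> str:
    """Map DOC 0x1F to U+00AD and literal U+00AD to a user-chosen string."""
    # str.translate maps each input character once, so a freshly produced
    # U+00AD is not remapped a second time.
    table = {0x1F: SOFT_HYPHEN, 0xAD: literal_replacement}
    return raw.translate(table)

# Hypothetical user preference: replace literal U+00AD with U+0301.
remapped = import_with_remap("a\u00adb\x1fc", "\u0301")
```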