Bug 124588 - FILEOPEN DOC DOCX RTF: U+00AD should not be treated as soft hyphen in Word documents
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): Inherited From OOo
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: dataLoss
Depends on:
Blocks: RTF Formatting-Mark DOCX DOC
Reported: 2019-04-07 13:40 UTC by Phil Krylov
Modified: 2022-08-22 23:02 UTC (History)

See Also:
Crash report or crash signature:


Attachments
Document to reproduce the bug (43.59 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2019-04-07 13:41 UTC, Phil Krylov
Font to reproduce the bug (44.71 KB, application/x-font-ttf)
2019-04-07 13:42 UTC, Phil Krylov
Word screenshot (8.00 KB, image/png)
2019-04-07 13:43 UTC, Phil Krylov
Writer screenshot (8.40 KB, image/png)
2019-04-07 13:44 UTC, Phil Krylov
comparison MSO 2010 and LibreOffice 6.5 Master (56.17 KB, image/png)
2019-11-21 13:16 UTC, Xisco Faulí

Description Phil Krylov 2019-04-07 13:40:03 UTC
Description:
Word treats U+00AD as a normal character, and real fonts exist that map a non-hyphen glyph to this codepoint. For soft hyphens, Word uses 0x1F in DOC, <w:softHyphen/> in DOCX, and \- in RTF. On import, Writer converts all of these to U+00AD, so U+00AD can no longer be used as a normal character. Even worse, after import a literal U+00AD cannot be distinguished from a real soft hyphen, so the non-Unicode-compliant usages cannot be moved to some other codepoint.
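
The conflation can be illustrated with a minimal sketch. This is not Writer's import code: the paragraph XML is a simplified, made-up DOCX fragment, and `naive_import` is a hypothetical helper that only mimics the mapping described above.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
SOFT_HYPHEN = "\u00ad"

# Simplified paragraph: the first run holds a literal U+00AD character,
# the second run holds a real soft hyphen element.
PARAGRAPH = (
    f'<w:p xmlns:w="{W}">'
    "<w:r><w:t>a\u00adb</w:t></w:r>"
    "<w:r><w:t>co</w:t><w:softHyphen/><w:t>op</w:t></w:r>"
    "</w:p>"
)


def naive_import(xml_text):
    """Flatten the runs to one string, mapping <w:softHyphen/> to U+00AD."""
    out = []
    for node in ET.fromstring(xml_text).iter():
        if node.tag == f"{{{W}}}t":
            out.append(node.text or "")
        elif node.tag == f"{{{W}}}softHyphen":
            out.append(SOFT_HYPHEN)
    return "".join(out)


text = naive_import(PARAGRAPH)
# Both occurrences are now the same codepoint; nothing records their origin.
print([hex(ord(c)) for c in text if c == SOFT_HYPHEN])
```

After this step, the literal character from the first run and the soft hyphen from the second run are the same codepoint in the imported text.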

Steps to Reproduce:
Install the attached font, then open the attached document in Writer.

Actual Results:
The sample shows a soft hyphen.

Expected Results:
The sample should show the diacritic glyph that the font maps to U+00AD.


Reproducible: Always


User Profile Reset: No



Additional Info:
Comment 1 Phil Krylov 2019-04-07 13:41:48 UTC
Created attachment 150579
Document to reproduce the bug
Comment 2 Phil Krylov 2019-04-07 13:42:25 UTC
Created attachment 150580
Font to reproduce the bug
Comment 3 Phil Krylov 2019-04-07 13:43:15 UTC
Created attachment 150581
Word screenshot
Comment 4 Phil Krylov 2019-04-07 13:44:03 UTC
Created attachment 150582
Writer screenshot
Comment 5 Mike Kaganski 2019-04-07 14:14:32 UTC
But U+00AD *is* the soft hyphen, isn't it? At least that is what Unicode says: https://www.unicode.org/charts/PDF/U0080.pdf
Comment 6 Phil Krylov 2019-04-07 14:20:47 UTC
Yes, it is, according to the Unicode spec. But in Word documents, 0x00AD is a normal character. So the problem is how to allow 0x00AD to be used as a normal character in LibreOffice (if we remap such characters to some other codepoint on import, they won't be displayed with the proper glyph). Perhaps a special character attribute could be added for verbatim usages of special characters.
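
A rough sketch of the attribute idea, assuming DOC-style input where 0x1F marks a soft hyphen. `ImportedChar`, its `verbatim` flag, `import_doc_text`, and `is_hyphenation_point` are hypothetical names for illustration only; nothing like them is claimed to exist in LibreOffice.

```python
from dataclasses import dataclass

SOFT_HYPHEN = "\u00ad"
DOC_SOFT_HYPHEN = "\x1f"  # soft hyphen marker in DOC, per the description


@dataclass(frozen=True)
class ImportedChar:
    char: str
    verbatim: bool = False  # hypothetical attribute: show the font's glyph as-is


def import_doc_text(raw):
    """Convert raw DOC text, tagging literal U+00AD as verbatim."""
    result = []
    for ch in raw:
        if ch == DOC_SOFT_HYPHEN:
            # A real soft hyphen: store it as U+00AD without the flag.
            result.append(ImportedChar(SOFT_HYPHEN))
        elif ch == SOFT_HYPHEN:
            # A literal U+00AD: keep the codepoint, but mark it verbatim.
            result.append(ImportedChar(SOFT_HYPHEN, verbatim=True))
        else:
            result.append(ImportedChar(ch))
    return result


def is_hyphenation_point(c):
    """Only unflagged U+00AD characters act as soft hyphens."""
    return c.char == SOFT_HYPHEN and not c.verbatim


chars = import_doc_text("a\u00adb\x1fc")
print([is_hyphenation_point(c) for c in chars])
```

The codepoint stays U+00AD in both cases, so the font's glyph is still reachable, and the flag keeps the two usages distinguishable after import.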
Comment 7 Phil Krylov 2019-04-07 15:58:03 UTC
Another option would be a user-configurable import filter preference that converts U+00AD to some other codepoint or string. Ugly, admittedly.
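
A minimal sketch of what such a preference could do, again assuming DOC-style input where 0x1F marks a soft hyphen. `import_with_remap` and its `literal_replacement` parameter are hypothetical, and the private-use codepoint U+E0AD is an arbitrary example value.

```python
SOFT_HYPHEN = "\u00ad"
DOC_SOFT_HYPHEN = "\x1f"  # soft hyphen marker in DOC, per the description


def import_with_remap(raw, literal_replacement=None):
    """Map 0x1F to U+00AD, optionally remapping literal U+00AD first.

    The literal characters must be remapped before the soft hyphen
    markers are converted, otherwise the two can no longer be told apart.
    """
    if literal_replacement is not None:
        raw = raw.replace(SOFT_HYPHEN, literal_replacement)
    return raw.replace(DOC_SOFT_HYPHEN, SOFT_HYPHEN)


# Preference unset: literal U+00AD and the soft hyphen collapse together.
print(ascii(import_with_remap("a\u00adb\x1fc")))
# Preference set: the literal character moves to the chosen codepoint.
print(ascii(import_with_remap("a\u00adb\x1fc", "\ue0ad")))
```

The remapped character is only displayed correctly if the font also has the intended glyph at the replacement codepoint.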
Comment 8 Xisco Faulí 2019-10-16 11:46:49 UTC
@Khaled, I thought you might be interested in this issue...
Comment 9 ⁨خالد حسني⁩ 2019-10-17 13:03:58 UTC
(In reply to Xisco Faulí from comment #8)
> @Khaled, I thought you might be interested in this issue...

What Word is doing is not Unicode-conformant and is probably legacy behavior kept for backward compatibility. What LibreOffice should do when reading Word files is not something I’m qualified to answer.
Comment 10 Xisco Faulí 2019-11-21 13:16:15 UTC
Created attachment 156001
comparison MSO 2010 and LibreOffice 6.5 Master
Comment 11 Xisco Faulí 2019-11-21 13:16:41 UTC
Reproduced in

Version: 6.5.0.0.alpha0+
Build ID: 60b1a93a990a9978a30dee929526faf8db629a7f
CPU threads: 4; OS: Linux 4.15; UI render: default; VCL: gtk3; 
Locale: ca-ES (ca_ES.UTF-8); UI-Language: en-US
Calc: threaded