Bug 161514 - Invalid Unicode mappings in PDF output for combining diacritics
Summary: Invalid Unicode mappings in PDF output for combining diacritics
Status: RESOLVED DUPLICATE of bug 158329
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Printing and PDF export
Version (earliest affected): 7.5.0.3 release
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: bibisected, bisected, regression
Depends on:
Blocks:
 
Reported: 2024-06-11 16:51 UTC by David Huggins-Daines
Modified: 2024-09-15 01:33 UTC
CC List: 3 users

See Also:
Crash report or crash signature:


Attachments
PDF from 24.2 with incorrect Unicode (6.85 KB, application/pdf)
2024-06-11 16:53 UTC, David Huggins-Daines
PDF from 7.4 with correct Unicode mapping (6.26 KB, application/pdf)
2024-06-11 16:54 UTC, David Huggins-Daines
Document used to create the two PDFs (10.49 KB, application/vnd.oasis.opendocument.text)
2024-06-11 16:54 UTC, David Huggins-Daines

Description David Huggins-Daines 2024-06-11 16:51:45 UTC
Description:
When exporting documents containing Unicode combining diacritics from Writer 24.2 to PDF, invalid character mappings are generated.  This means that copying text from the PDF or converting it to text gives incorrect output.  This is because there is a mismatch between the text content stream and the unicode mapping in the output of 24.2.  We'll use 

The Unicode mapping itself is probably okay (though different from 7.4.7.2): it has regrouped the grapheme cluster into a single code, which is probably a good thing.  The relevant parts are:

/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
2 beginbfchar
<01> <0078030C>
<03> <0075>
endbfchar

Here we see <01> is mapped to U+0078 U+030C, that is, the grapheme x̌, while <03> is mapped to U+0075, that is, the grapheme u.

The PDF is internally very different in 24.2 as the text "x̌ux̌ux̌ux̌" has been tagged as 4 separate spans (though all in the same marked content section).  For each instance of "x̌u" (which is U+0078 U+030C U+0075) we get something like (note the hex UTF-16-BE in ActualText):

/Span<</ActualText<FEFF0078030C>>>
BDC
56.8 668.1 Td /F1 72 Tf[<01>243<02>]TJ
EMC
1 0 0 1 92.8 668.1 Tm
/F1 72 Tf<03>Tj

The problem is that <02> is not defined in the unicode map, so right away we get an undefined character (space or tofu) when extracting text, and I'm not at all sure what 243 is supposed to correspond to.
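
To make the mismatch concrete, here is a rough Python sketch (not LibreOffice code) that decodes the character codes from the 24.2 content stream against the 24.2 ToUnicode entries quoted above. The helper names parse_bfchar and decode_codes are made up for illustration, and it assumes one-byte codes, as the <00> <FF> codespace range suggests.

```python
import re

# ToUnicode entries copied from the 24.2 CMap quoted in this report.
BFCHAR_24_2 = """
<01> <0078030C>
<03> <0075>
"""

def parse_bfchar(block):
    """Map each source code (int) to its Unicode string (UTF-16BE hex)."""
    mapping = {}
    for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
        mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def decode_codes(codes, mapping, missing="\ufffd"):
    """Decode one-byte character codes, marking unmapped ones with U+FFFD."""
    return "".join(mapping.get(code, missing) for code in codes)

cmap = parse_bfchar(BFCHAR_24_2)

# Codes shown by the two text operators above: [<01>243<02>]TJ, then <03>Tj.
codes = [0x01, 0x02, 0x03]

unmapped = [code for code in codes if code not in cmap]
text = decode_codes(codes, cmap)

print("unmapped codes:", unmapped)
print("decoded text:", repr(text))
```

Code <02> comes out as unmapped, which would correspond to the space or tofu that shows up between x̌ and u in the extracted text.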

This is a regression as 7.4.7.2 does not show this behaviour.  The PDF internals there are much more straightforward, the cmap contains:

/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
3 beginbfchar
<01> <0078>
<02> <030C>
<03> <0075>
endbfchar
endcmap

And the text stream is just:

56.8 668.1 Td /F1 72 Tf[<01>243<02>-242<0301>243<02>-242<0301>243<02>-242<0301>243<02>]TJ
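
For comparison, a similar sketch decodes the 7.4 TJ array with the 7.4 CMap. It assumes one-byte codes again, so a hex string such as <0301> is read as the two codes <03> and <01>. The function name tj_codes is hypothetical, and it only handles hex strings with an even number of digits.

```python
import re

# ToUnicode entries from the 7.4 CMap quoted above.
CMAP_7_4 = {0x01: "x", 0x02: "\u030c", 0x03: "u"}

# TJ array from the 7.4 content stream quoted above.
TJ_7_4 = "[<01>243<02>-242<0301>243<02>-242<0301>243<02>-242<0301>243<02>]"

def tj_codes(tj_array):
    """Return the one-byte codes of every hex string in a TJ array, in order.

    The numbers between the strings are positioning adjustments and are skipped.
    """
    codes = []
    for hex_string in re.findall(r"<([0-9A-Fa-f]+)>", tj_array):
        codes.extend(bytes.fromhex(hex_string))
    return codes

codes_7_4 = tj_codes(TJ_7_4)
text_7_4 = "".join(CMAP_7_4[code] for code in codes_7_4)

print(codes_7_4)
print(text_7_4)
```

Every code in the 7.4 stream has an entry in the map, so the decoded string is the original text with no stray characters.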

The problem doesn't seem to be related to Tagged PDF or PDF/A, since I get the same weird output from 24.2 when I disable them in exporting.

Steps to Reproduce:
1. Create a document with some Unicode combining diacritics, e.g. x̌ux̌ux̌ux̌ (x + U+030C Combining Caron)
2. Export to PDF
3. Copy and paste text from the PDF (or run pdftotext)

Actual Results:
got the output: x̌ ux̌ ux̌ ux̌

(either space or tofu between x̌ and u, corresponding to the missing <02> character)

Expected Results:
expect the output: x̌ux̌ux̌ux̌
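
The comparison between actual and expected output could be scripted roughly as below. This is only a sketch: it assumes the pdftotext command from poppler-utils is on PATH, and the function names pdf_text and stray_characters are invented for this example.

```python
import subprocess
from difflib import SequenceMatcher

# The test string from the steps above: x + U+030C, then u, repeated.
EXPECTED = "x\u030cu" * 3 + "x\u030c"

def pdf_text(path):
    """Extract text with the pdftotext CLI (assumed to be installed)."""
    result = subprocess.run(
        ["pdftotext", path, "-"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()

def stray_characters(extracted, expected=EXPECTED):
    """Return the characters in `extracted` that have no match in `expected`."""
    matcher = SequenceMatcher(None, expected, extracted, autojunk=False)
    stray = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            stray.extend(extracted[j1:j2])
    return stray

# Text as reported under "Actual Results", with a space after each caron.
reported = "x\u030c ux\u030c ux\u030c ux\u030c"
print(stray_characters(reported))
```

With a real file, stray_characters(pdf_text("export.pdf")) would be the call to make, where export.pdf is a placeholder name for the exported PDF.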


Reproducible: Always


User Profile Reset: No

Additional Info:
Version: 24.2.3.2 (X86_64) / LibreOffice Community
Build ID: 420(Build:2)
CPU threads: 4; OS: Linux 6.1; UI render: default; VCL: gtk3
Locale: en-CA (en_CA.UTF-8); UI: en-US
Debian package version: 4:24.2.3-1~bpo12+1
Calc: threaded
Comment 1 David Huggins-Daines 2024-06-11 16:53:49 UTC
Created attachment 194661 [details]
PDF from 24.2 with incorrect Unicode
Comment 2 David Huggins-Daines 2024-06-11 16:54:10 UTC
Created attachment 194662 [details]
PDF from 7.4 with correct Unicode mapping
Comment 3 David Huggins-Daines 2024-06-11 16:54:24 UTC
Created attachment 194663 [details]
Document used to create the two PDFs
Comment 4 David Huggins-Daines 2024-06-11 16:55:49 UTC
Just an extra comment ... I am always rather lost in PDF internals ... the 242/243 are not of any consequence here, since you can see them in 7.4 as well.  The problem is the <02> that doesn't map to anything.
Comment 5 David Huggins-Daines 2024-06-11 17:00:48 UTC
And upon looking at this again, it seems pretty clear what's happening: the text span has been created as <01><02>, with one-to-one mappings to code points <01> = U+0078 and <02> = U+030C.

Then later on some other code has coalesced these into a single mapping (since it is a single grapheme) <01> = U+0078 U+030C

But the text span itself is not actually getting updated to reflect this, and that's why the <02> is still there.
Comment 6 David Huggins-Daines 2024-06-11 17:06:27 UTC
Right, 243 and -242 are character spacing commands, as in section 9.4.3 of PDF 1.7:

array TJ: Show one or more text strings, allowing individual glyph positioning. Each
element of array shall be either a string or a number. If the element is a
string, this operator shall show the string. If it is a number, the operator
shall adjust the text position by that amount; that is, it shall translate the
text matrix, Tm. The number shall be expressed in thousandths of a unit
of text space (see 9.4.4, "Text Space Details"). This amount shall be
subtracted from the current horizontal or vertical coordinate, depending
on the writing mode. In the default coordinate system, a positive
adjustment has the effect of moving the next glyph painted either to the
left or down by the given amount. Figure 46 shows an example of the
effect of passing offsets to TJ.

[ (A) 120 (W) 120 (A) 95 (Y again) ] TJ
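
As a worked example of that rule, the sketch below converts the 243 and -242 adjustments from the streams above into horizontal shifts at the 72 pt font size used there. The function name tj_shift is made up, and character spacing, word spacing and the glyph's own advance are left out, so this is only the part of the displacement contributed by the TJ number.

```python
def tj_shift(adjustment, font_size, horizontal_scaling=1.0):
    """Horizontal displacement caused by one number in a TJ array.

    The number is in thousandths of a unit of text space, scaled by the
    font size, and is subtracted from the current position, so a positive
    value moves the next glyph to the left in horizontal writing mode.
    """
    return -adjustment / 1000.0 * font_size * horizontal_scaling

shift_after_base = tj_shift(243, 72)   # the 243 that follows <01>
shift_after_mark = tj_shift(-242, 72)  # the -242 that follows <02>

print(shift_after_base, shift_after_mark)
```

So 243 pulls the caron back by roughly 17.5 units so that it sits over the x, and -242 pushes the position forward again by roughly 17.4 units before the u.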

Anyway the problem seems pretty straightforward to fix, but I have no knowledge of the relevant code, so perhaps that's optimistic :)
Comment 7 David Huggins-Daines 2024-06-11 17:22:29 UTC
Ah, okay.  In actual fact the "coalescing" should just not be done, because the font embedded in the PDF still contains the three separate characters <01>(=U+0078) <02>(=U+030C) and <03>(=U+0075) for display.  The <02> character is not there by mistake, it is the actual character in the font.

I suggest finding whatever change caused entries in the ToUnicode CMap to be clustered in this way and simply reverting it. Otherwise the extracted text can never be valid except via the /ActualText tag, which none of the PDF viewers I've tried actually looks at.
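
To illustrate the difference between a reader that honours /ActualText and one that relies on the ToUnicode CMap alone, here is a small Python model of one "x̌u" unit from the 24.2 stream. The data layout and the extract function are invented for this sketch and are not how any particular viewer works.

```python
# ToUnicode entries from the 24.2 CMap in the description.
CMAP_24_2 = {0x01: "x\u030c", 0x03: "u"}

# One "x̌u" unit as (ActualText hex or None, character codes) pairs,
# following the /Span ... BDC ... EMC section and the Tj that comes after it.
UNIT = [
    ("FEFF0078030C", [0x01, 0x02]),  # tagged span showing <01> and <02>
    (None, [0x03]),                  # untagged <03>
]

def extract(items, cmap, use_actual_text):
    """Extract text, optionally preferring ActualText over the CMap."""
    out = []
    for actual_text_hex, codes in items:
        if use_actual_text and actual_text_hex is not None:
            # ActualText is UTF-16 with a leading byte order mark (FEFF),
            # which the "utf-16" codec consumes.
            out.append(bytes.fromhex(actual_text_hex).decode("utf-16"))
        else:
            out.append("".join(cmap.get(code, "\ufffd") for code in codes))
    return "".join(out)

with_actual_text = extract(UNIT, CMAP_24_2, use_actual_text=True)
cmap_only = extract(UNIT, CMAP_24_2, use_actual_text=False)

print(repr(with_actual_text))
print(repr(cmap_only))
```

In this model the ActualText path gives the correct text, while the CMap-only path still hits the unmapped <02>.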
Comment 8 David Huggins-Daines 2024-06-11 17:22:42 UTC Comment hidden (obsolete)
Comment 9 Dieter 2024-06-26 19:25:55 UTC
David, thank you for reporting the bug. Unfortunately I'm not an expert with unicode, so I can only follow your steps.

I can't confirm the problem with
Version: 24.8.0.0.alpha1+ (X86_64) / LibreOffice Community
Build ID: d2eab48f697a1e6097778158f623f11306ac7a3d
CPU threads: 4; OS: Windows 10 X86_64 (10.0 build 19045); UI render: Skia/Raster; VCL: win
Locale: de-DE (de_DE); UI: en-GB
Calc: CL threaded

And when I open attachment 194661 (PDF with wrong Unicode), copy the text and paste it into LO Writer, it also gives the correct result.
Comment 10 David Huggins-Daines 2024-06-26 19:44:58 UTC
(In reply to Dieter from comment #9)
> David, thank you for reporting the bug. Unfortunately I'm not an expert with
> unicode, so I can only follow your steps.

Thanks for checking this out - as mentioned below you may be experiencing the problem anyway, but you have a smarter PDF reader which is able to repair the broken unicode map.

> I can't confirm the problem with
> Version: 24.8.0.0.alpha1+ (X86_64) / LibreOffice Community
> Build ID: d2eab48f697a1e6097778158f623f11306ac7a3d

Still present for me with
Version: 24.8.0.0.beta1 (X86_64) / LibreOffice Community
Build ID: 318462181c709ed29c01eb3239b4d600d7b82ecc
CPU threads: 4; OS: macOS 13.6.7; UI render: Skia/Metal; VCL: osx
Locale: fr-CA (fr_CA.UTF-8); UI: en-US
Calc: threaded

> And when I open attachment 194661 (PDF with wrong Unicode), copy the text and
> paste it into LO Writer, it also gives the correct result.

Interesting, what PDF viewer are you using?  Apple Preview and the GNOME document viewer insert a blank (or "tofu") character for the character <02> which is missing in the unicode mapping, giving:

x̌ ux̌ ux̌ ux̌
Comment 11 Dieter 2024-06-27 20:32:13 UTC
(In reply to David Huggins-Daines from comment #10)
> Interesting, what PDF viewer are you using?
I'm using Adobe Acrobat Reader
Comment 12 Buovjaga 2024-08-21 06:02:37 UTC
Bibisected with linux-64-7.5 to 09c076c3f29c28497f162d3a5b7baab040725d56
tdf#151350: Fix extraneous gaps before marks

I could see the problem with Okular, which had the same empty spaces in copied content as mentioned in the description. Adobe Acrobat gave an even more broken copy:

x̌
u
u
u
Comment 13 ⁨خالد حسني⁩ 2024-09-15 01:33:55 UTC
The PDF has valid character data, but some PDF readers don't support /ActualText tagging and only use the limited CMap mapping, which can't handle all cases. Previously we would put the base and combining marks in separate clusters, and this worked better for this specific case since no /ActualText tags were needed and the CMap was enough. This was changed: we now put the base and mark in the same cluster, so /ActualText tags became necessary. This is a duplicate of bug 158329 as the root cause is the same change:
https://gerrit.libreoffice.org/c/core/+/140994

*** This bug has been marked as a duplicate of bug 158329 ***