Description:
When exporting documents containing Unicode combining diacritics from Writer 24.2 to PDF, invalid character mappings are generated. As a result, copying text from the PDF or converting it to text gives incorrect output.

The cause is a mismatch between the text content stream and the Unicode mapping in the output of 24.2. The Unicode mapping itself is probably okay (though different from 7.4.7.2): it has regrouped the grapheme cluster into a single code, which is probably a good thing. The relevant parts are:

    /CMapType 2 def
    1 begincodespacerange
    <00> <FF>
    endcodespacerange
    2 beginbfchar
    <01> <0078030C>
    <03> <0075>
    endbfchar

Here <01> is mapped to U+0078 U+030C, that is, the grapheme x̌, while <03> is mapped to U+0075, that is, the grapheme u.

The PDF is internally very different in 24.2, as the text "x̌ux̌ux̌ux̌" has been tagged as 4 separate spans (though all in the same marked content section). For each instance of "x̌u" (which is U+0078 U+030C U+0075) we get something like the following (note the hex UTF-16-BE in ActualText):

    /Span<</ActualText<FEFF0078030C>>>
    BDC
    56.8 668.1 Td /F1 72 Tf[<01>243<02>]TJ
    EMC
    1 0 0 1 92.8 668.1 Tm /F1 72 Tf<03>Tj

The problem is that <02> is not defined in the Unicode map, so right away we get an undefined character (space or tofu) when extracting text, and I'm not at all sure what 243 is supposed to correspond to.

This is a regression, as 7.4.7.2 does not show this behaviour. The PDF internals there are much more straightforward. The CMap contains:

    /CMapType 2 def
    1 begincodespacerange
    <00> <FF>
    endcodespacerange
    3 beginbfchar
    <01> <0078>
    <02> <030C>
    <03> <0075>
    endbfchar
    endcmap

And the text stream is just:

    56.8 668.1 Td /F1 72 Tf[<01>243<02>-242<0301>243<02>-242<0301>243<02>-242<0301>243<02>]TJ

The problem doesn't seem to be related to Tagged PDF or PDF/A, since I get the same weird output from 24.2 when I disable them in exporting.

Steps to Reproduce:
1. Create a document with some Unicode combining diacritics, e.g. x̌ux̌ux̌ux̌ (x + U+030C Combining Caron)
2. Export to PDF
3. Copy and paste text from the PDF (or run pdftotext)

Actual Results:
got the output: x̌ ux̌ ux̌ ux̌ (either space or tofu between x̌ and u, corresponding to the missing <02> character)

Expected Results:
expect the output: x̌ux̌ux̌ux̌

Reproducible: Always

User Profile Reset: No

Additional Info:
Version: 24.2.3.2 (X86_64) / LibreOffice Community
Build ID: 420(Build:2)
CPU threads: 4; OS: Linux 6.1; UI render: default; VCL: gtk3
Locale: en-CA (en_CA.UTF-8); UI: en-US
Debian package version: 4:24.2.3-1~bpo12+1
Calc: threaded
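The mismatch described in this report can be checked with a small Python sketch. It is only an illustration, not how any real PDF reader works: it assumes a one-byte codespace, looks only at bfchar entries, and is fed just the text-showing operators (the /ActualText dictionary is left out on purpose, as if the reader ignored it). The function names parse_bfchar, codes_in and extract are made up for this example.

```python
import re

# CMap fragments and text-showing operators copied from the report.
CMAP_24_2 = "2 beginbfchar <01> <0078030C> <03> <0075> endbfchar"
SHOW_24_2 = "[<01>243<02>]TJ <03>Tj"

CMAP_7_4 = "3 beginbfchar <01> <0078> <02> <030C> <03> <0075> endbfchar"
SHOW_7_4 = ("[<01>243<02>-242<0301>243<02>-242<0301>"
            "243<02>-242<0301>243<02>]TJ")

HEX_STRING = r"<([0-9A-Fa-f]+)>"


def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} for the bfchar entries."""
    body = re.search(r"beginbfchar(.*?)endbfchar", cmap_text, re.S).group(1)
    pairs = re.findall(HEX_STRING + r"\s*" + HEX_STRING, body)
    # Destination values are UTF-16BE, as in the ToUnicode CMap.
    return {int(src, 16): bytes.fromhex(dst).decode("utf-16-be")
            for src, dst in pairs}


def codes_in(show_ops):
    """Yield one-byte character codes from every hex string shown."""
    for hex_string in re.findall(HEX_STRING, show_ops):
        yield from bytes.fromhex(hex_string)


def extract(cmap_text, show_ops):
    """Map each code through the CMap; unmapped codes become U+FFFD."""
    mapping = parse_bfchar(cmap_text)
    return "".join(mapping.get(code, "\ufffd") for code in codes_in(show_ops))


if __name__ == "__main__":
    print(repr(extract(CMAP_24_2, SHOW_24_2)))
    print(repr(extract(CMAP_7_4, SHOW_7_4)))
```

With the 24.2 data, code <02> has no entry and comes out as U+FFFD between x̌ and u. With the 7.4 data, every code resolves and the original string is recovered.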
Created attachment 194661: PDF from 24.2 with incorrect Unicode
Created attachment 194662: PDF from 7.4 with correct Unicode mapping
Created attachment 194663: Document used to create the two PDFs
Just an extra comment ... I am always rather lost in PDF internals ... the 242/243 are not of any consequence here, since you can see them in 7.4 as well. The problem is the <02> that doesn't map to anything.
And upon looking at this again, it seems pretty clear what's happening: the text span has been created as <01><02>, with one-to-one mappings to code points <01> = U+0078 and <02> = U+030C. Then later on some other code has coalesced these into a single mapping (since it is a single grapheme), <01> = U+0078 U+030C. But the text span itself is not actually getting updated to reflect this, and that's why the <02> is still there.
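To make that hypothesis concrete, here is a toy Python model of the suspected sequence of events. Nothing in it is taken from the LibreOffice source: the coalesce function and the cluster list are invented for illustration, and the real exporter may work quite differently.

```python
# Hypothetical model only; not LibreOffice code.

# Codes written to the text stream for one "x̌u".
glyph_codes = [0x01, 0x02, 0x03]

# One-to-one mapping, as seen in the 7.4 output.
one_to_one = {0x01: "x", 0x02: "\u030c", 0x03: "u"}

# Grapheme clusters over those codes: x + caron, then u.
clusters = [[0x01, 0x02], [0x03]]


def coalesce(mapping, clusters):
    """Keep only the first code of each cluster, mapped to the whole cluster."""
    return {cluster[0]: "".join(mapping[code] for code in cluster)
            for cluster in clusters}


coalesced = coalesce(one_to_one, clusters)

# Codes still present in the text stream but missing from the new map.
orphans = [code for code in glyph_codes if code not in coalesced]

if __name__ == "__main__":
    print(coalesced)
    print(orphans)
```

In this model the coalesced map matches the one in the 24.2 PDF, and the only orphaned code is <02>.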
Right, 243 and -242 are character spacing commands, as in section 9.4.3 of PDF 1.7:

    array TJ

    Show one or more text strings, allowing individual glyph positioning. Each element of array shall be either a string or a number. If the element is a string, this operator shall show the string. If it is a number, the operator shall adjust the text position by that amount; that is, it shall translate the text matrix, Tm. The number shall be expressed in thousandths of a unit of text space (see 9.4.4, "Text Space Details"). This amount shall be subtracted from the current horizontal or vertical coordinate, depending on the writing mode. In the default coordinate system, a positive adjustment has the effect of moving the next glyph painted either to the left or down by the given amount. Figure 46 shows an example of the effect of passing offsets to TJ.

    [ (A) 120 (W) 120 (A) 95 (Y again) ] TJ

Anyway the problem seems pretty straightforward to fix, but I have no knowledge of the relevant code, so perhaps that's optimistic :)
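As a rough worked example of the quoted rule, the Python sketch below computes the horizontal shift that a single number in a TJ array contributes. It assumes horizontal writing mode and ignores glyph widths, character spacing and word spacing. The function name tj_displacement is made up here.

```python
def tj_displacement(adjustment, font_size, horizontal_scaling=1.0):
    """Horizontal shift, in text space units, caused by one TJ number.

    The number is in thousandths of a unit of text space and is
    subtracted from the current position, so a positive value moves
    the next glyph to the left.
    """
    return -adjustment / 1000.0 * font_size * horizontal_scaling


if __name__ == "__main__":
    # Values from the text streams in this report, at /F1 72 Tf.
    print(tj_displacement(243, 72))   # negative: shift to the left
    print(tj_displacement(-242, 72))  # positive: shift to the right
```

At a font size of 72, the 243 moves the following glyph about 17.5 units to the left and the -242 moves it about 17.4 units to the right.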
Ah, okay. In actual fact the "coalescing" should just not be done, because the font embedded in the PDF still contains the three separate characters <01> (= U+0078), <02> (= U+030C) and <03> (= U+0075) for display. The <02> character is not there by mistake; it is the actual character in the font. I suggest finding whatever change caused entries in the ToUnicode CMap to be clustered in this sense and just reverting it, because there is no way the extracted text can ever be valid aside from using the /ActualText tag, which none of the PDF viewers I've tried this on actually looks at.
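The difference between a reader that honours /ActualText and one that relies only on the ToUnicode CMap can be sketched in Python as follows. This is a simplified model with invented names (Run, extract), assuming a marked-content span simply replaces the text of the glyphs it encloses. Real viewers are more complicated and do not all behave alike.

```python
from dataclasses import dataclass
from typing import Optional

# ToUnicode mapping from the 24.2 PDF: there is no entry for code 0x02.
TO_UNICODE = {0x01: "x\u030c", 0x03: "u"}


@dataclass
class Run:
    """A sequence of glyph codes, optionally wrapped in an /ActualText span."""
    codes: bytes
    actual_text: Optional[str] = None


# One "x̌u" as written by 24.2: a tagged span for x̌, then a bare u.
RUNS = [
    # The "utf-16" codec consumes the FEFF byte order mark.
    Run(b"\x01\x02", bytes.fromhex("FEFF0078030C").decode("utf-16")),
    Run(b"\x03"),
]


def extract(runs, honor_actual_text):
    """Extract text, using /ActualText only if the reader supports it."""
    out = []
    for run in runs:
        if honor_actual_text and run.actual_text is not None:
            out.append(run.actual_text)
        else:
            out.extend(TO_UNICODE.get(code, "\ufffd") for code in run.codes)
    return "".join(out)


if __name__ == "__main__":
    print(repr(extract(RUNS, honor_actual_text=True)))
    print(repr(extract(RUNS, honor_actual_text=False)))
```

In this model only the reader that honours /ActualText recovers "x̌u". The CMap-only reader emits a replacement character for <02>.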
David, thank you for reporting the bug. Unfortunately I'm not an expert with Unicode, so I can only follow your steps.

I can't confirm the problem with

Version: 24.8.0.0.alpha1+ (X86_64) / LibreOffice Community
Build ID: d2eab48f697a1e6097778158f623f11306ac7a3d
CPU threads: 4; OS: Windows 10 X86_64 (10.0 build 19045); UI render: Skia/Raster; VCL: win
Locale: de-DE (de_DE); UI: en-GB
Calc: CL threaded

And when I open attachment 194661 (PDF with wrong Unicode), copy the text and paste it into LO Writer, it also gives the correct result.
(In reply to Dieter from comment #9)
> David, thank you for reporting the bug. Unfortunately I'm not an expert with
> unicode, so I can only follow your steps.

Thanks for checking this out - as mentioned below you may be experiencing the problem anyway, but you have a smarter PDF reader which is able to repair the broken Unicode map.

> I can't confirm the problem with
> Version: 24.8.0.0.alpha1+ (X86_64) / LibreOffice Community
> Build ID: d2eab48f697a1e6097778158f623f11306ac7a3d

Still present for me with

Version: 24.8.0.0.beta1 (X86_64) / LibreOffice Community
Build ID: 318462181c709ed29c01eb3239b4d600d7b82ecc
CPU threads: 4; OS: macOS 13.6.7; UI render: Skia/Metal; VCL: osx
Locale: fr-CA (fr_CA.UTF-8); UI: en-US
Calc: threaded

> And when I open attachment 194661 (PDF with wrong unicode), copy the text and
> paste it into LO writer, it also gives the correct result.

Interesting, what PDF viewer are you using? Apple Preview and the GNOME document viewer insert a blank (or "tofu") character for the character <02>, which is missing in the Unicode mapping, giving:

x̌ ux̌ ux̌ ux̌
(In reply to David Huggins-Daines from comment #10)
> Interesting, what PDF viewer are you using?

I'm using Adobe Acrobat Reader
Bibisected with linux-64-7.5 to 09c076c3f29c28497f162d3a5b7baab040725d56

tdf#151350: Fix extraneous gaps before marks

I could see the problem with Okular, which had the same empty spaces in copied content as mentioned in the description.

Adobe Acrobat gave an even more broken copy:
x̌ u u u
The PDF has valid character data, but some PDF readers don't support /ActualText tagging and only use the limited CMap mapping, which can't handle all cases. Previously we would put the base and combining marks in separate clusters, and this worked better for this specific case since no /ActualText tags were needed and the CMap was enough. This was changed: now we put the base and mark in the same cluster, and /ActualText tags became necessary.

This is a duplicate of bug 158329, as the root cause is the same change:
https://gerrit.libreoffice.org/c/core/+/140994

*** This bug has been marked as a duplicate of bug 158329 ***