Description:
Text containing a specially formatted number looks good in the PDF, but copying and pasting the number from the PDF fails: there are some weird characters in the clipboard.

Steps to Reproduce:
1. Open the attached file. It contains some figures. Note that they stem from an old document I edited in 2109 (sorry, I don't know which version I used back then). I can't remember what I did to get the figures grouped into 3-piece groups as they currently are. Double-click the number and press Ctrl+C. Paste it into e.g. a plain-text editor; note that the same number appears in the text editor.
2. Export the document to PDF.
3. View the PDF (I used SumatraPDF for that).
4. Select the number in the PDF, copy it, and paste it into the text editor on the line below for comparison.
5. (Optional) Open a new Writer document, type "012345", export to PDF, open it in the PDF viewer, then copy and paste again.

Actual Results:
After doing steps 1..4, the two pasted numbers differ ("0123456789" vs. "01�231��61���").
After doing step 5, both numbers are equal.

Expected Results:
In both cases I'd expect the numbers to be equal.

Reproducible: Always

User Profile Reset: No

Additional Info:
Version: 7.4.6.2 (x64) / LibreOffice Community
Build ID: 5b1f5509c2decdade7fda905e3e1429a67acd63d
CPU threads: 8; OS: Windows 10.0 Build 19045; UI render: Skia/Raster; VCL: win
Locale: de-DE (de_DE); UI: de-DE
Calc: threaded

During PDF export, I chose to create a hybrid PDF (embedded ODF file).
Created attachment 188107
Writer file containing some specially formatted number figures.
I just opened the PDF in another viewer (PDF-XChange Viewer). The copied text contains the same digits, but the weird non-digit characters differ.
Sorry, I meant 2019, not 2109. Back then I also created a PDF (not with only the single number, but with more surrounding text). That PDF also yields some weird characters, but the pasted text more closely resembles the original number: all digits are there, but there is an additional "3" between every 3-piece group of digits. That is arguably worse, because the failure is less noticeable.
Tested with:
Version: 24.2.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 609a1567d0e60ca11800df56059b97b6a61ad117
CPU threads: 8; OS: Linux 5.15; UI render: default; VCL: gtk3
Locale: en-AU (en_AU.UTF-8); UI: en-US
Calc: threaded

- In Evince/GNOME Document Viewer: the odd characters only appear when I select the text in the PDF. When copied from Evince, the paste is fine and gives the digits without the special triplet formatting.
- In Okular: no odd characters in the PDF, nor when selecting. Paste is the same as Evince: correct digits without triplet formatting.
- In Firefox: copying and pasting does result in odd characters and duplicated digits: 0123161

Copy-pasted from a PDF created with LO 6.0 (same in Evince, Okular, Firefox): 0112314461777

In OOo 3.3, the spacing wasn't there in LO nor in export, but pasting would be true to the original.

In summary: the situation has improved over previous versions in that the copy-paste only misses the spacing, and the missing characters / duplicated digits issue seems to be PDF reader-dependent.

Khaled, what is your take on this?
Copying from Adobe Reader, I get:
012345 6789
(no funny characters, but there is an extra space, which is not surprising, as many PDF readers will interpret a large gap between glyphs as a space even if the PDF does not have a space character there)

If I use pdftotext, I get:
0123456789

The number grouping is a “feature” of the Linux Libertine G font, but it is done in a very odd way that affects PDF export.

$ hb-shape LinBiolinum_R_G.ttf "0123456789" --no-positions
[zero=0|uni202F=1|one=1|two=2|three=3|uni202F=4|four=4|five=4|six=6|uni202F=7|seven=7|eight=7|nine=7]

(the text before the equals sign is the glyph name, and the number after it is the index in the input string that the glyph corresponds to)

The font outputs zero fine, no funny business. Then it outputs the glyph for NNBSP, then the glyph for one, and gives both the same input string index. Then two and three come out normally. Then NNBSP, four and five, and all three of them get the same input string index. Then six normally. Then NNBSP, seven, eight and nine, and all four of them get the same input string index.

This funny business with the input string index leads us to group the output into the following mapping between glyphs and input characters:
zero => "0"
uni202F,one => "1"
two => "2"
three => "3"
uni202F,four,five => "45"
six => "6"
uni202F,seven,eight,nine => "789"

This mapping of multiple glyphs to multiple input characters is problematic for text extraction in PDF, since PDF can represent only a one-glyph-to-one-character or a one-glyph-to-multiple-characters mapping. To keep the text copy-able we have to resort to tagging the problematic glyph groups using /ActualText spans, and not all PDF viewers support this.

So this is a combination of an oddly built font and buggy PDF viewers. We are doing our best, and there is not much more we can do about this.
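To illustrate the grouping described above, here is a minimal Python sketch that rebuilds the glyph-to-character mapping from the hb-shape output quoted in the comment. It is not LibreOffice's implementation: the function names (`parse_hb_shape`, `group_by_cluster`, `actual_text_header`) are hypothetical, it assumes left-to-right text with non-decreasing cluster values, and the `/Span ... BDC` line only shows the general shape of a PDF /ActualText marked-content header, not what the exporter actually writes.

```python
from itertools import groupby

TEXT = "0123456789"
# hb-shape --no-positions output, copied from the comment above.
SHAPED = ("[zero=0|uni202F=1|one=1|two=2|three=3|uni202F=4|four=4|five=4"
          "|six=6|uni202F=7|seven=7|eight=7|nine=7]")


def parse_hb_shape(shaped):
    """Turn '[name=cluster|...]' into a list of (glyph_name, cluster)."""
    glyphs = []
    for item in shaped.strip("[]").split("|"):
        name, cluster = item.rsplit("=", 1)
        glyphs.append((name, int(cluster)))
    return glyphs


def group_by_cluster(text, glyphs):
    """Group consecutive glyphs sharing a cluster value.

    Each group covers the input characters from its cluster index up to
    the next group's cluster index (assumes LTR, monotonic clusters).
    """
    runs = [(cluster, [name for name, _ in run])
            for cluster, run in groupby(glyphs, key=lambda g: g[1])]
    groups = []
    for i, (start, names) in enumerate(runs):
        end = runs[i + 1][0] if i + 1 < len(runs) else len(text)
        groups.append((names, text[start:end]))
    return groups


def actual_text_header(chars):
    """Illustrative /ActualText header: UTF-16BE hex string with a BOM."""
    return "/Span << /ActualText <FEFF%s> >> BDC" % (
        chars.encode("utf-16-be").hex().upper())


groups = group_by_cluster(TEXT, parse_hb_shape(SHAPED))
for names, chars in groups:
    # More than one glyph in a group cannot be expressed as a plain
    # glyph-to-character mapping, so it would need an /ActualText span.
    note = actual_text_header(chars) if len(names) > 1 else "plain mapping"
    print(",".join(names), "=>", repr(chars), "|", note)
```

In this sketch, three of the seven groups (the ones starting with uni202F) contain more than one glyph, which matches the groups that the comment says need /ActualText tagging.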
(In reply to Stéphane Guillou (stragu) from comment #4)
> In OOo 3.3, the spacing wasn't there in LO nor in export, but pasting would
> be true to the original.

This is a Graphite font, so either you are not using the same font or that version of OOo is not Graphite-enabled.
Let's see if we can fix the font.
No progress on the font front: I couldn’t get its sources to build no matter what I tried. Closing again until someone figures out how to fix the font.