Created attachment 179989 [details]
1.pdf

When the attached PDF document is opened in Draw, some characters within the formulas are shown as garbage characters.

Steps to Reproduce:
1. Open the attached 1.pdf with Draw.

Current Result:
Garbage characters are shown. For instance, the 2nd list paragraph shows "将方程^2 − 6�-1 = 0 配方后,原方程变形为( )" rather than "将方程x^2 − 6x-1 = 0 配方后,原方程变形为( )".

Expected Result:
No garbage characters in the imported PDF. For instance, the above paragraph should show as "将方程x^2 − 6x-1 = 0 配方后,原方程变形为( )".

Additional Info:
Running xpdfimport directly already produces the garbage characters:

$ /opt/libreofficedev7.4/program/xpdfimport ./1.pdf

updateFont 8 0 4 0 0 1045.000000 0 SimSun
drawChar 111.000000 259.009000 121.450000 259.009000 0.050000 0.000000 0.000000 -0.050000 209.000000 将
drawChar 121.439850 259.009000 131.889850 259.009000 0.050000 0.000000 0.000000 -0.050000 209.000000 方
drawChar 131.999250 259.009000 142.449250 259.009000 0.050000 0.000000 0.000000 -0.050000 209.000000 程
endTextObject
restoreState
saveState
updateFillColor 0.000000 0.000000 0.000000 1.000000
updateFillColor 0.000000 0.000000 0.000000 1.000000
updateStrokeColor 0.000000 0.000000 0.000000 1.000000
updateFont 24 0 4 0 0 1045.000000 0 CambriaMath
drawChar 142.440000 259.730000 147.999400 259.730000 0.050000 0.000000 0.000000 -0.050000 209.000000 �
endTextObject
restoreState
saveState

So the garbage characters are introduced as early as the xpdfwrapper stage: https://cgit.freedesktop.org/libreoffice/core/tree/sdext/source/pdfimport/xpdfwrapper.

If you open the PDF with Evince (i.e. the PDF viewer in Fedora Linux / GNOME) and copy and paste the paragraph, the pasted content also contains garbage characters. Since Evince also uses the poppler library, I guess this is a bug on the poppler side.
Bug reproduced in:
Version: 7.3.3.2 / LibreOffice Community
Build ID: d1d0ea68f081ee2800a922cac8f79445e4603348
CPU threads: 4; OS: Mac OS X 10.14.6; UI render: default; VCL: osx
Locale: en-GB (en_GB.UTF-8); UI: en-GB
Calc: threaded

Adobe Reader 11.0.23 and Mac OS Preview Version 10.1 (944.6.16.1) both seem to display the PDF correctly. Opening the file in LO Draw (using File > Open) results in multiple characters displaying as �.

Bug also reproduced with:
Version: 6.4.4.2
Build ID: 3d775be2011f3886db32dfd395a6a6d1ca2630ff
CPU threads: 4; OS: Mac OS X 10.14.6; UI render: default; VCL: osx; Locale: en-GB (en_GB.UTF-8); UI-Language: en-GB
Calc: threaded

Status set to NEW, earliest version affected to 6.4.4.2.
Created attachment 180138 [details] 1.pdf, uncompressed with qpdf --stream-data=uncompress
e.g.
---
/FT8 209 Tf
/GS13 gs
0.05 0 0 -0.05 153.959 742.609 Tm
<1C5F>Tj
208.797 -0 TD<0430>Tj
211.188 -0 TD<0773>Tj
208.797 -0 TD<04BC>Tj
211.188 -0 TD<2151>Tj
208.797 -0 TD<1BE9>Tj
211.188 -0 TD<303B>Tj
ET
Q
q
BT
0 0 0 rg
/FT24 209 Tf
/GS13 gs
0.05 0 0 -0.05 227.4 742.85 Tm
<0754>Tj
ET
Q
q
BT
0 0 0 rg
/FT24 149 Tf
/GS13 gs
0.05 0 0 -0.05 233.04 739.13 Tm
<0374>Tj
ET
Q
q
BT
0 0 0 rg
/FT24 209 Tf
/GS13 gs
0.05 0 0 -0.05 239.88 742.85 Tm
<0D46>Tj
ET
---
<2151> = U+6B21 = '次'
<1BE9> = U+65B9 = '方'
<303B> = U+7A0B = '程'
<0754> = <D835>
<0374> = U+0032 = '2'
<0D46> = U+2212 = '−'

When I copied the <D835> character in Firefox Nightly and pasted it into the text editor I normally use, I got the surrogate pair D835 DC00 (U+1D400).
When I tried the same thing with PDF-XChange, the <D835> part was just a blank.
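For reference, the surrogate arithmetic behind the D835 DC00 = U+1D400 observation can be checked with a few lines of Python. This is only a sketch of standard UTF-16 decoding; the helper name combine_surrogates is made up for illustration and is not part of poppler or LibreOffice. The last part shows that a lone D835 is not decodable on its own, and that a decoder which substitutes instead of failing yields U+FFFD, which is at least consistent with the � seen above (whether xpdfimport takes that exact path is an assumption).

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 bits of the code point above U+FFFF.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair pasted from Firefox decodes to a single code point.
print(f"U+{combine_surrogates(0xD835, 0xDC00):X}")  # prints U+1D400

# A lone high surrogate, as found in the PDF, is not valid UTF-16.
lone = bytes.fromhex("D835")
try:
    lone.decode("utf-16-be")
except UnicodeDecodeError as exc:
    print("strict decode failed:", exc.reason)

# A lenient decoder substitutes the replacement character instead.
replaced = lone.decode("utf-16-be", errors="replace")
print([f"U+{ord(ch):04X}" for ch in replaced])
```

The low surrogate selects one of 1024 code points in the block chosen by the high surrogate, so D835 alone narrows the character down to U+1D400..U+1D7FF but cannot identify it.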
5 0 obj
<<
(snip)
/FT24 10 0 R
(snip)
/FT8 13 0 R
>>
/XObject << /IM39 14 0 R /IM41 15 0 R >>
>>
/Rotate 0
/TrimBox [ 0 0 595.3 841.9 ]
/Type /Page
>>

10 0 obj
<< /BaseFont /DCWGQU+CambriaMath /DescendantFonts [ 20 0 R ] /Encoding /Identity-H /Subtype /Type0 /ToUnicode 21 0 R /Type /Font >>
endobj

13 0 obj
<< /BaseFont /LNUHNF+SimSun /DescendantFonts [ 26 0 R ] /Encoding /Identity-H /Subtype /Type0 /ToUnicode 27 0 R /Type /Font >>
endobj
The PDF contains mangled text; surrogate pairs are all missing the low surrogate part, making the original text unrecoverable. Garbage in, garbage out.
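To check other PDFs for the same defect, a minimal sketch along these lines can flag Unicode values that contain an unpaired surrogate. It assumes the values are given as UTF-16BE hex strings, as in the <0754> = <D835> mapping quoted earlier; the function name find_unpaired_surrogates is hypothetical, and the D835DC65 entry is an illustrative well-formed pair, not something taken from 1.pdf.

```python
def find_unpaired_surrogates(hex_utf16be: str) -> list[int]:
    """Return the indices of UTF-16 code units that are unpaired surrogates."""
    data = bytes.fromhex(hex_utf16be)
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

    bad = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed directly by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            bad.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate that was not consumed by a preceding high one.
            bad.append(i)
        i += 1
    return bad


# Character code -> Unicode value, both as hex strings.
samples = {
    "0754": "D835",      # from this report: high surrogate only
    "0374": "0032",      # from this report: '2'
    "0D46": "2212",      # from this report: minus sign
    "FFFF": "D835DC65",  # illustrative only: a complete surrogate pair
}
for code, value in samples.items():
    status = "unpaired surrogate" if find_unpaired_surrogates(value) else "ok"
    print(f"<{code}> -> <{value}>: {status}")
```

Any entry reported as an unpaired surrogate has lost information at PDF creation time, so no importer can restore the intended character from it.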