Bug 148980 - sdext xpdfimport (poppler): Garbage characters shown when open certain PDF in Draw
Status: RESOLVED NOTOURBUG
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Draw
Version (earliest affected): 6.4.4.2 release
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: PDF-Import-Draw
Reported: 2022-05-08 02:44 UTC by Kevin Suo
Modified: 2022-12-25 00:06 UTC (History)

See Also:
Crash report or crash signature:


Attachments
1.pdf (1.10 MB, application/pdf)
2022-05-08 02:44 UTC, Kevin Suo
1.pdf, uncompressed with qpdf --stream-data=uncompress (2.36 MB, application/pdf)
2022-05-16 15:51 UTC, himajin100000

Description Kevin Suo 2022-05-08 02:44:31 UTC
Created attachment 179989: 1.pdf

When the attached PDF document is opened in Draw, some characters within the formulas are shown as garbage characters.

Steps to Reproduce:
1. Open the attached 1.pdf with Draw.

Current Result:
Garbage characters are shown. For instance, the 2nd list paragraph shows "将方程^2 − 6�-1 = 0 配方后,原方程变形为( )" rather than "将方程x^2 − 6x-1 = 0 配方后,原方程变形为( )".

Expected result:
No garbage characters in the imported PDF. For instance, the above paragraph should show as "将方程x^2 − 6x-1 = 0 配方后,原方程变形为( )".

Additional Info:

If I do:
$ /opt/libreofficedev7.4/program/xpdfimport  ./1.pdf

Then I already get the garbage characters:
updateFont 8 0 4 0 0 1045.000000 0 SimSun
drawChar 111.000000 259.009000 121.450000 259.009000 0.050000 0.000000 0.000000 -0.050000 209.000000 将
drawChar 121.439850 259.009000 131.889850 259.009000 0.050000 0.000000 0.000000 -0.050000 209.000000 方
drawChar 131.999250 259.009000 142.449250 259.009000 0.050000 0.000000 0.000000 -0.050000 209.000000 程
endTextObject
restoreState
saveState
updateFillColor 0.000000 0.000000 0.000000 1.000000
updateFillColor 0.000000 0.000000 0.000000 1.000000
updateStrokeColor 0.000000 0.000000 0.000000 1.000000
updateFont 24 0 4 0 0 1045.000000 0 CambriaMath
drawChar 142.440000 259.730000 147.999400 259.730000 0.050000 0.000000 0.000000 -0.050000 209.000000 �
endTextObject
restoreState
saveState

So the garbage characters are already produced by the xpdfimport wrapper: https://cgit.freedesktop.org/libreoffice/core/tree/sdext/source/pdfimport/xpdfwrapper.

If you open the PDF with Evince (the PDF viewer in Fedora Linux / GNOME) and copy and paste the paragraph, the pasted content also contains garbage characters. Since Evince also uses the poppler library, I guess this is a bug on the poppler side.
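To check more systematically whether the replacement characters already appear in the xpdfimport output, the drawChar records can be scanned for U+FFFD. This is only a minimal sketch. It assumes the line layout seen in the excerpt above (record name first, decoded text last, font name as the last field of updateFont), and the helper name `find_garbage_chars` is hypothetical and not part of LibreOffice.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER, rendered as "�"


def find_garbage_chars(lines):
    """Return (line_number, font_name) for drawChar records whose text is U+FFFD.

    Assumes whitespace-separated records in which the last field of
    'updateFont' is the font name and the last field of 'drawChar' is
    the decoded text, as in the excerpt quoted in this report.
    """
    hits = []
    current_font = None
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "updateFont":
            current_font = fields[-1]
        elif fields[0] == "drawChar" and REPLACEMENT in fields[-1]:
            hits.append((number, current_font))
    return hits


sample = [
    "updateFont 8 0 4 0 0 1045.000000 0 SimSun",
    "drawChar 111.000000 259.009000 121.450000 259.009000 "
    "0.050000 0.000000 0.000000 -0.050000 209.000000 将",
    "updateFont 24 0 4 0 0 1045.000000 0 CambriaMath",
    "drawChar 142.440000 259.730000 147.999400 259.730000 "
    "0.050000 0.000000 0.000000 -0.050000 209.000000 \ufffd",
]

print(find_garbage_chars(sample))  # [(4, 'CambriaMath')]
```

On the sample lines taken from the output above, only the CambriaMath glyph is flagged, which matches the observation that the formula characters are the ones affected.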
Comment 1 Andrew Watson 2022-05-10 09:23:34 UTC
Bug reproduced in:

Version: 7.3.3.2 / LibreOffice Community
Build ID: d1d0ea68f081ee2800a922cac8f79445e4603348
CPU threads: 4; OS: Mac OS X 10.14.6; UI render: default; VCL: osx
Locale: en-GB (en_GB.UTF-8); UI: en-GB
Calc: threaded

Adobe Reader 11.0.23 and Mac OS Preview Version 10.1 (944.6.16.1) both seem to display the PDF correctly. Opening it in Draw (using File > Open) results in multiple characters displaying as �.

Bug also reproduced with:

Version: 6.4.4.2
Build ID: 3d775be2011f3886db32dfd395a6a6d1ca2630ff
CPU threads: 4; OS: Mac OS X 10.14.6; UI render: default; VCL: osx; 
Locale: en-GB (en_GB.UTF-8); UI-Language: en-GB
Calc: threaded

Status set to NEW, earliest version affected to 6.4.4.2.
Comment 2 himajin100000 2022-05-16 15:51:41 UTC
Created attachment 180138: 1.pdf, uncompressed with qpdf --stream-data=uncompress
Comment 3 himajin100000 2022-05-16 16:10:29 UTC
e.g.
---
/FT8 209 Tf
/GS13 gs
0.05 0 0 -0.05 153.959 742.609 Tm
<1C5F>Tj 208.797 -0 TD<0430>Tj 211.188 -0 TD<0773>Tj 208.797 -0 TD<04BC>Tj 211.188 -0 TD<2151>Tj 208.797 -0 TD<1BE9>Tj 211.188 -0 TD<303B>Tj
ET
Q
q
BT
0 0 0 rg
/FT24 209 Tf
/GS13 gs
0.05 0 0 -0.05 227.4 742.85 Tm
<0754>Tj
ET
Q
q
BT
0 0 0 rg
/FT24 149 Tf
/GS13 gs
0.05 0 0 -0.05 233.04 739.13 Tm
<0374>Tj
ET
Q
q
BT
0 0 0 rg
/FT24 209 Tf
/GS13 gs
0.05 0 0 -0.05 239.88 742.85 Tm
<0D46>Tj
ET
---
<2151> = U+6B21 = '次'
<1BE9> = U+65B9 = '方'
<303B> = U+7A0B = '程'
<0754> = <D835> 
<0374> = U+0032 = '2' 
<0D46> = U+2212 = '−'

When I copied <D835> in Firefox Nightly and pasted it into the text editor I normally use,
I got the surrogate pair D835 DC00 (U+1D400).

When I tried the same thing with PDF-XChange,
the <D835> part was just blank.
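For reference, the surrogate arithmetic behind "D835 DC00 = U+1D400" can be sketched as follows. This is a minimal illustration of standard UTF-16 encoding, not code from poppler or LibreOffice. The function names are hypothetical, and U+1D465 (MATHEMATICAL ITALIC SMALL X) is used purely as an assumed example of a character that needs the high surrogate D835; the thread does not confirm which character the PDF was meant to contain.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %04X" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %04X" % low)
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_code_point(cp):
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point: %X" % cp)
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# D835 followed by the smallest possible low surrogate gives U+1D400.
print(hex(combine_surrogates(0xD835, 0xDC00)))        # 0x1d400

# Assumed example: U+1D465 also starts with the high surrogate D835.
print([hex(u) for u in split_code_point(0x1D465)])    # ['0xd835', '0xdc65']
```

Every code point from U+1D400 to U+1D7FF shares the high surrogate D835, so a lone <D835> does not identify which character was meant.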
Comment 4 himajin100000 2022-05-16 16:16:18 UTC
5 0 obj
<< (snip) /FT24 10 0 R (snip) /FT8 13 0 R >> /XObject << /IM39 14 0 R /IM41 15 0 R >> >> /Rotate 0 /TrimBox [ 0 0 595.3 841.9 ] /Type /Page >>

10 0 obj
<< /BaseFont /DCWGQU+CambriaMath /DescendantFonts [ 20 0 R ] /Encoding /Identity-H /Subtype /Type0 /ToUnicode 21 0 R /Type /Font >>
endobj

13 0 obj
<< /BaseFont /LNUHNF+SimSun /DescendantFonts [ 26 0 R ] /Encoding /Identity-H /Subtype /Type0 /ToUnicode 27 0 R /Type /Font >>
endobj
Comment 5 ⁨خالد حسني⁩ 2022-12-25 00:06:22 UTC
The PDF contains mangled text; surrogate pairs are all missing the low surrogate part, making the original text unrecoverable. Garbage in, garbage out.
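A check for this kind of damage can be sketched as below: given the UTF-16 code units that a ToUnicode mapping yields, report every surrogate that is not part of a well-formed pair. This is a minimal sketch under the assumption that the mapped values are available as a list of integers; the function name `lone_surrogates` is hypothetical, and the sample sequence is assembled from the mappings quoted in comment 3.

```python
def lone_surrogates(units):
    """Return the indices of unpaired surrogates in a list of UTF-16 code units."""
    lone = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate is valid only if a low surrogate follows.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            lone.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            lone.append(i)
        i += 1
    return lone


# 方, 程, <D835>, '2', U+2212, following the mappings listed in comment 3.
mapped = [0x65B9, 0x7A0B, 0xD835, 0x0032, 0x2212]
print(lone_surrogates(mapped))              # [2]

# A well-formed pair is not reported.
print(lone_surrogates([0xD835, 0xDC65]))    # []
```

Once the low surrogate is gone, the index only tells where a character is missing; the character itself cannot be recovered from the remaining data.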