Bug 82163 - sdext.pdfimport: xpdfimport failed to detect certain characters in the attached PDF
Status: RESOLVED NOTOURBUG
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: filters and storage
Version (earliest affected): 4.3.0.4 release
Hardware: All All
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Duplicates: 100966 115390 142673
Depends on:
Blocks: PDF-Import-Draw
Reported: 2014-08-05 01:40 UTC by Doug
Modified: 2022-12-25 02:43 UTC (History)
CC List: 14 users

See Also:
Crash report or crash signature:


Attachments
Magazine front page PDF (259.96 KB, application/pdf)
2014-08-08 01:11 UTC, Doug
Screenshot of the formating issue referenced in this ticket (449.37 KB, image/png)
2020-11-07 06:50 UTC, unwantedbox

Description Doug 2014-08-05 01:40:47 UTC
Opened the PDF in Draw (version 4.3.0.4); the magazine title was rendered as 34567. It should have been The WATCHTOWER, with ANNOUNCING JEHOVAH'S KINGDOM below.
This also happened in 4.2; I don't remember the rest of the version number.
Ubuntu 14.04 x64
Here is a snapshot of the PDF in Draw: http://imgur.com/lh2BVpY
Comment 1 Jean-Baptiste Faure 2014-08-05 17:12:21 UTC Comment hidden (obsolete)
Comment 2 Doug 2014-08-05 19:03:59 UTC Comment hidden (obsolete)
Comment 3 Doug 2014-08-08 01:11:48 UTC
Created attachment 104251
Magazine front page PDF
Comment 4 Doug 2014-08-08 01:15:04 UTC Comment hidden (obsolete)
Comment 5 Owen Genat (retired) 2014-08-09 08:12:28 UTC
Thanks for providing the cut-down example. Confirmed under GNU/Linux using v4.3.0.3 Build ID: 08ebe52789a201dd7d38ef653ef7a48925e7f9f7. Related forum thread:

http://en.libreofficeforum.org/node/8707

Status set to NEW. Severity set to enhancement (as this functionality has likely never existed). The request is to include something like the PDF Viewer (pdf.js) plug-in used in Firefox to allow use of embedded font subsets for accurate rendering:

https://support.mozilla.org/en-US/kb/view-pdf-files-firefox-without-downloading-them#w_using-the-pdf-viewer-extension
Comment 6 Jouni Järvinen 2015-02-09 23:57:05 UTC
Reproducible on 4.4.0.3, Win7 x64. The PDF renders fine in Foxit Reader, but LO misses content. It looks the same as the screenshot in comment 1.
Comment 7 Andrey Skvortsov 2017-05-31 14:03:54 UTC
The problem still exists in LO 5.2.7 on GNU/Linux amd64.
Comment 8 Buovjaga 2018-02-24 15:34:18 UTC
*** Bug 115390 has been marked as a duplicate of this bug. ***
Comment 9 scottie 2019-10-14 05:52:42 UTC
In version 6.3.2, text still overflows its boundaries in LibreOffice Draw after the PDF file is opened.

In a PDF reader (Adobe Reader), the PDF is displayed normally with its embedded fonts.

Fixing this would be a big improvement to LibreOffice Draw, and it would enable a very common and basic use case.

Thanks to the software engineers for their diligent work.
Comment 10 Gibtnix 2019-10-23 16:00:21 UTC
Besides using the embedded font (which would obviously be the best option), it might also be an idea to implement functionality similar to Inkscape's. When you import a PDF into Inkscape, you can choose whether to import it using Inkscape itself (which replaces fonts, as LO Draw does) or to open it using Poppler. The latter only lets you modify the text as shapes, not as text, but it does not change any fonts or layouts. Since Inkscape only works with single-page documents, a similar dual-import PDF function in LO Draw would also be pretty useful.
Comment 11 David 2020-01-27 18:23:42 UTC
Hi, 

Still happening in 6.3.4.2. I have this problem at least once a week. Is there a way I can help with this?
Comment 12 jeremia 2020-08-21 21:54:08 UTC
The bug remains in v7.0.0.3.
Because the font changes when the PDF is opened in LibreOffice Draw, the layout becomes incorrect: text boxes change size with the substituted font, which results in overlapping text. Very annoying!
Comment 13 unwantedbox 2020-11-07 06:48:56 UTC
This issue still exists in 7.0.3.1 - Ubuntu 20.04

Example file used: https://www.irs.gov/pub/irs-pdf/fw4.pdf

When imported, the formatting is destroyed. I believe this is due to missing the embedded fonts. This also isn't some fancy document, just a government form.

While I understand why this was labeled enhancement due to a feature that never existed, it sure feels like a bug, and of course would be great to get this functionality, especially since it looks like it has been a request for 6 years.

Note, if I'm able to, I would like to update the ticket to Draw instead of "filters and storage"? If I'm wrong in doing so, please feel free to change it back.
Comment 14 unwantedbox 2020-11-07 06:50:23 UTC
Created attachment 167071
Screenshot of the formating issue referenced in this ticket
Comment 15 Timur 2021-06-07 08:03:26 UTC
*** Bug 142673 has been marked as a duplicate of this bug. ***
Comment 16 Michael Warner 2021-06-23 03:17:06 UTC
Repro in a development build from recent master:
Version: 7.3.0.0.alpha0+ / LibreOffice Community
Build ID: 736e100c516ed5326f4cccd6d22205264df51914
CPU threads: 12; OS: Linux 4.15; UI render: default; VCL: gtk3
Locale: en-US (en_US.UTF-8); UI: en-US
Calc: CL

I've noticed this message about an unknown attribute being spammed on the console:

warn:xmloff:8364:8364:xmloff/source/text/txtparai.cxx:137: unknown attribute urn:oasis:names:tc:opendocument:xmlns:text:1.0 text:style-name value=text101
warn:xmloff:8364:8364:xmloff/source/text/txtparai.cxx:137: unknown attribute urn:oasis:names:tc:opendocument:xmlns:text:1.0 text:style-name value=text174
warn:xmloff:8364:8364:xmloff/source/text/txtparai.cxx:137: unknown attribute urn:oasis:names:tc:opendocument:xmlns:text:1.0 text:style-name value=text174
warn:xmloff:8364:8364:xmloff/source/text/txtparai.cxx:137: unknown attribute urn:oasis:names:tc:opendocument:xmlns:text:1.0 text:style-name value=text174
warn:xmloff:8364:8364:xmloff/source/text/txtparai.cxx:137: unknown attribute urn:oasis:names:tc:opendocument:xmlns:text:1.0 text:style-name value=text174
warn:xmloff:8364:8364:xmloff/source/text/txtparai.cxx:137: unknown attribute urn:oasis:names:tc:opendocument:xmlns:text:1.0 text:style-name value=text178
warn:xmloff:8364:8364:xmloff/source/text/txtparai.cxx:137: unknown attribute urn:oasis:names:tc:opendocument:xmlns:text:1.0 text:style-name value=text178
warn:xmloff:8364:8364:xmloff/source/text/txtparai.cxx:137: unknown attribute urn:oasis:names:tc:opendocument:xmlns:text:1.0 text:style-name value=text178
warn:xmloff:8364:8364:xmloff/source/text/txtparai.cxx:137: unknown attribute urn:oasis:names:tc:opendocument:xmlns:text:1.0 text:style-name value=text178
warn:xmloff:8364:8364:xmloff/source/text/txtparai.cxx:137: unknown attribute urn:oasis:names:tc:opendocument:xmlns:text:1.0 text:style-name value=text178
warn:xmloff:8364:8364:xmloff/source/text/txtparai.cxx:137: unknown attribute urn:oasis:names:tc:opendocument:xmlns:text:1.0 text:style-name value=text178
warn:xmloff:8364:8364:xmloff/source/text/txtparai.cxx:137: unknown attribute urn:oasis:names:tc:opendocument:xmlns:text:1.0 text:style-name value=text172
warn:xmloff:8364:8364:xmloff/source/text/txtparai.cxx:137: unknown attribute urn:oasis:names:tc:opendocument:xmlns:text:1.0 text:style-name value=text172
warn:xmloff:8364:8364:xmloff/source/text/txtparai.cxx:137: unknown attribute urn:oasis:names:tc:opendocument:xmlns:text:1.0 text:style-name value=text101
warn:xmloff:8364:8364:xmloff/source/text/txtparai.cxx:137: unknown attribute urn:oasis:names:tc:opendocument:xmlns:text:1.0 text:style-name value=text101
warn:xmloff:8364:8364:xmloff/source/text/txtparai.cxx:137: unknown attribute urn:oasis:names:tc:opendocument:xmlns:text:1.0 text:style-name value=text101


Might be related.
Comment 17 Timur 2021-08-20 07:12:58 UTC
*** Bug 143959 has been marked as a duplicate of this bug. ***
Comment 18 Buovjaga 2021-08-22 11:34:19 UTC
*** Bug 100966 has been marked as a duplicate of this bug. ***
Comment 19 V Stuart Foote 2021-08-23 14:13:50 UTC
LibreOffice is not a PDF editor--we make no claims to be!

Rather we provide a functional filter import to XML stream and rendering as ODF document, with some legitimate issues. 

We also provide a pdfium based renderer that has high fidelity to the original PDF source, including correct rendering of embedded fonts for text runs. As has been noted, if you want to work with a filter imported PDF, you must have the full font(s) installed on the system.

The pdfium based insert filter will reconstruct the PDF fonts (either embedded or by toUnicode handling of the paths).

The import filter does not handle subset fonts well, but isn't it actually better to perform font fallback? That is, when a PDF is filter imported rather than inserted as an image, there is some expectation that one would want to edit the resulting Draw (or Writer, or Impress) document.

Consider what happens with the new document (that is what it is) when you attempt to edit it and the characters you want are not present in the embedded subset. That's right, you'd be depending on font fallback to substitute for the missing glyphs.

Better to not use the subset fonts at all and provide more reliable fallback and identification of fonts with odd PS names.
Comment 20 Emily Bowman 2021-08-23 22:16:05 UTC
That it always falls back to Libre Sans, basically a new Arial, for all missing fonts is probably a worse crime than that it can't do enough with subset fonts, although it has become better at matching letter height metrics than it used to be. Horizontal metrics are still terrible, but the result is more legible than vertical overlap.

As long as we're still in enhancement mode, it would be nice if the fonts were simply left alone until a block was actually unlocked to be edited. After all, fonts can be embedded, and there are no restrictions on how many glyphs a font requires.
Comment 21 Kevin Suo 2021-10-10 12:45:24 UTC
For the PDF document provided in attachment 104251, if you run:
./instdir/program/xpdfimport /home/suokunlong/lo/bugs/xpdfimport/WTR.pdf -f /tmp/tmp.txt

you will see output like this:
...
updateFont 15 0 0 0 0 5975.099000 0 WtAtArtwork38JBRbw
drawChar 55.375425 112.999280 169.440065 112.999280 59.750990 0.000000 0.000000 -59.750990 1.000000 3
endTextObject
setTextRenderMode 0
updateLineWidth 0.300000
updateFont 15 0 0 0 0 5975.099000 0 WtAtArtwork38JBRbw
drawChar 170.049525 112.999280 226.036203 112.999280 59.750990 0.000000 0.000000 -59.750990 1.000000 4
endTextObject
setTextRenderMode 0
updateLineWidth 0.300000
updateFont 15 0 0 0 0 5975.099000 0 WtAtArtwork38JBRbw
drawChar 226.167655 112.999280 290.459720 112.999280 59.750990 0.000000 0.000000 -59.750990 1.000000 5
endTextObject
setTextRenderMode 0
updateLineWidth 0.300000
updateFont 15 0 0 0 0 5975.099000 0 WtAtArtwork38JBRbw
drawChar 290.650923 112.999280 364.025139 112.999280 59.750990 0.000000 0.000000 -59.750990 1.000000 6
endTextObject
setTextRenderMode 0
updateLineWidth 0.300000
updateFont 15 0 0 0 0 5975.099000 0 WtAtArtwork38JBRbw
drawChar 364.276093 112.999280 423.130818 112.999280 59.750990 0.000000 0.000000 -59.750990 1.000000 7
...

sdext.pdfimport uses this drawChar information from xpdfimport to build up an ODF XML document, which it then shows to the user. Thus the bug is not in font rendering. It is either in xpdfimport (sdext/source/pdfimport/xpdfwrapper/pdfioutdev_gpl.{hxx, cxx}) or, more likely, in poppler (i.e. the third-party library that LibreOffice uses in the above-mentioned xpdfimport part).
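
To check what text xpdfimport hands over for a given PDF, the dump can be scanned for drawChar lines. The following is only a minimal Python sketch. It assumes that the last whitespace-separated field of each drawChar line is the decoded character, which matches the excerpt above but is not taken from the xpdfimport source. The function name decoded_chars is made up for this example, and the sample is a shortened copy of the dump above.

```python
def decoded_chars(dump_lines):
    """Join the last field of every drawChar line in an xpdfimport dump.

    Assumption: the last whitespace-separated field is the decoded character.
    A character that is itself whitespace would be lost by split(), so this
    is only good enough for a quick look at the dump.
    """
    chars = []
    for line in dump_lines:
        fields = line.split()
        if fields and fields[0] == "drawChar":
            chars.append(fields[-1])
    return "".join(chars)


# Shortened copy of the dump excerpt from this comment.
sample = """\
updateFont 15 0 0 0 0 5975.099000 0 WtAtArtwork38JBRbw
drawChar 55.375425 112.999280 169.440065 112.999280 59.750990 0.000000 0.000000 -59.750990 1.000000 3
endTextObject
drawChar 170.049525 112.999280 226.036203 112.999280 59.750990 0.000000 0.000000 -59.750990 1.000000 4
endTextObject
drawChar 226.167655 112.999280 290.459720 112.999280 59.750990 0.000000 0.000000 -59.750990 1.000000 5
endTextObject
drawChar 290.650923 112.999280 364.025139 112.999280 59.750990 0.000000 0.000000 -59.750990 1.000000 6
endTextObject
drawChar 364.276093 112.999280 423.130818 112.999280 59.750990 0.000000 0.000000 -59.750990 1.000000 7
"""

print(decoded_chars(sample.splitlines()))  # prints 34567
```

For the full dump, the lines of /tmp/tmp.txt could be passed in the same way. If the title already reads 34567 at this stage, the wrong characters are present before any ODF document or font substitution is involved.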
Comment 22 Kevin Suo 2021-10-10 12:50:27 UTC
(In reply to Michael Warner from comment #16)

> I've noticed this message about an unknown attribute being spammed on the console:
> warn:xmloff:8364:8364:xmloff/source/text/txtparai.cxx:137: unknown attribute urn:oasis:names:tc:opendocument:xmlns:text:1.0 text:style-name value=text101

No, these warnings are not related at all, and these warnings have already been fixed by commit:

author	Kevin Suo <suokunlong@126.com>	2021-07-14 09:44:30 +0800
committer	Noel Grandin <noel.grandin@collabora.co.uk>	2021-07-14 18:53:44 +0200
commit b1ca6d3aae3b75ec3e5c1ef17d582bcec01fc7eb (patch)
    sdext.pdfimport: <text:s> and <text:tab> don't have "text:style-name" attribute
Comment 23 ⁨خالد حسني⁩ 2022-12-25 02:43:13 UTC
The magazine title is not embedded as text, but as a logo split into 5 glyphs and the Unicode mapping for them gives: 34567, so there is nothing we can do here to get the actual text for the title since the PDF does not have it.
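
A toy model of why the importer cannot do better: text extraction only sees character codes and the code-to-Unicode table stored in the PDF, never the meaning of the drawn shapes. The Python sketch below is illustrative only. The glyph codes and the table are invented (the real values in the attached PDF were not inspected), and they are chosen so that the result matches the digits reported in this bug.

```python
# Hypothetical character codes for the five logo glyphs.
LOGO_GLYPH_CODES = [1, 2, 3, 4, 5]

# Hypothetical code-to-Unicode table. Each outline draws part of the logo,
# but the table only knows these digits.
TO_UNICODE = {1: "3", 2: "4", 3: "5", 4: "6", 5: "7"}


def extract_text(codes, to_unicode):
    """Map character codes to Unicode, using U+FFFD for unmapped codes."""
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


print(extract_text(LOGO_GLYPH_CODES, TO_UNICODE))  # prints 34567
```

Nothing in this data says "The WATCHTOWER", so an importer that converts glyphs to editable text has nothing else to show.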