59870 – FILEOPEN PDF: Incorrect text encoding

Status:	VERIFIED NOTOURBUG

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Writer
Version: (earliest affected)	3.6.4.3 release
Hardware:	All All

Importance:	medium normal
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	PDF-Import-Writer

Reported:	2013-01-25 22:43 UTC by Dennis Roczek
Modified:	2025-04-11 22:47 UTC
CC List:	7 users

See Also:	78427
Crash report or crash signature:

Attachments
pdf missing text, images (181.50 KB, image/png) 2013-05-19 12:11 UTC, Brenda Granados
incorrect text codification (Libo 4.4) (68.99 KB, image/jpeg) 2014-08-11 11:29 UTC, Xisco Faulí
PDF for reproducing the problem (basis for previous screenshots) (1.38 MB, application/pdf) 2025-01-19 12:27 UTC, Eyal Rozenberg
Screenshot of #198612 with LibreOffice 25.8 nightly (587.61 KB, image/png) 2025-01-19 12:30 UTC, Eyal Rozenberg
first page of PDF inserted to ODF with good fidelity (2.40 MB, application/vnd.oasis.opendocument.text) 2025-04-11 16:05 UTC, V Stuart Foote

Description Dennis Roczek 2013-01-25 22:43:10 UTC

See http://www.computerwoche.de/archiv/pdf/2007/leseprobe_37.pdf 

PDF-XChange Viewer also cannot copy and paste the text, so this is something harder than the usual text problems.

Comment 1 gounistat 2013-05-06 00:33:06 UTC

When I open a PDF document with LibreOffice 4.0.2.2, it opens but shows double images instead of a single one, as if viewed with astigmatism. It is very difficult, if not impossible, to read.
However, I found that I can delete the copy on top on each page and then read normally. LibreOffice seems to open the same PDF document twice in a single view.

Comment 2 Brenda Granados 2013-05-19 12:11:14 UTC

Created attachment 79523
pdf missing text, images

Comment 3 Brenda Granados 2013-05-19 12:12:33 UTC

When I open this PDF, I cannot even see the text. Other PDFs with images and text do not have the same issue. I wonder why this one is different.

Version: 4.0.3.1 (Build ID: a67943cd4d125208f4ea7fa29439551825cfb39)
Platform: Ubuntu 13.04 

-------------------------------

LibreOffice is powered by a team of volunteers, every bug is confirmed (triaged) by human beings who mostly give their time for free. We invite you to join our triaging by checking out this link: https://wiki.documentfoundation.org/QA/BugTriage
There are also other ways to get involved including with marketing, UX, documentation, and of course developing -  http://www.libreoffice.org/get-help/mailing-lists

Comment 4 Dennis Roczek 2013-08-15 17:04:47 UTC

Heh, I analyzed the file further and searched some forums and blogs. It turned out that this is a DRM-protected file, and thus the decoding doesn't work. Moreover, I realized that PDF readers (it doesn't matter which one, Adobe Reader or the free ones) can actually copy text out of this PDF document, but the copied text is either "rubbish" or white boxes with a black border (as seen on Asian websites in older web browsers without the special fonts installed)...

LibO should at least warn the user (that LibO can't handle some fonts / texts) when the user tries to open or edit a DRM file.

Comment 5 Xisco Faulí 2014-08-11 11:29:20 UTC

Created attachment 104433
incorrect text codification (Libo 4.4)

How it looks with Version: 4.4.0.0.alpha0+
Build ID: 33fd0d8ae6a6b4e5226991e39fe755d84cb78280
TinderBox: Win-x86@51-TDF, Branch:MASTER, Time: 2014-07-14_10:10:0

Comment 6 Xisco Faulí 2014-08-11 13:22:28 UTC

I updated the title, as the images are displayed correctly but the text isn't.
It looks like a commit in the range b3f41543851e9985c6c7ba133c32753c9bc732c1..b1f7dd66b898b03cb4bd8d434b6370310ea95946 fixed the missing text problem, but not completely: the text encoding is still wrong.

Comment 7 QA Administrators 2015-09-04 02:48:33 UTC Comment hidden (obsolete)

** Please read this message in its entirety before responding **

To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.

There have been thousands of bug fixes and commits since anyone checked on this bug report. During that time, it's possible that the bug has been fixed, or the details of the problem have changed. We'd really appreciate your help in getting confirmation that the bug is still present.

If you have time, please do the following:

Test to see if the bug is still present on a currently supported version of LibreOffice (5.0.0.5 or later)
https://www.libreoffice.org/download/

If the bug is present, please leave a comment that includes the version of LibreOffice and your operating system, and any changes you see in the bug behavior

If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave a short comment that includes your version of LibreOffice and Operating System

Please DO NOT

Update the version field
Reply via email (please reply directly on the bug tracker)
Set the bug's Status field to RESOLVED - FIXED (this status has a particular meaning that is not appropriate in this case)

If you want to do more to help you can test to see if your issue is a REGRESSION. To do so:

1. Download and install oldest version of LibreOffice (usually 3.3 unless your bug pertains to a feature added after 3.3)

http://downloadarchive.documentfoundation.org/libreoffice/old/

2. Test your bug
3. Leave a comment with your results.

4a. If the bug was present with 3.3 - set version to "inherited from OOo";
4b. If the bug was not present in 3.3 - add "regression" to keyword

Feel free to come ask questions or to say hello in our QA chat: http://webchat.freenode.net/?channels=libreoffice-qa

Thank you for your help!

-- The LibreOffice QA Team This NEW Message was generated on: 2015-09-03

Comment 8 Xisco Faulí 2015-09-04 08:28:19 UTC Comment hidden (obsolete)

This issue is still present in

Version: 5.0.1.2
Build ID: 81898c9f5c0d43f3473ba111d7b351050be20261
Locale: es-ES (es_ES)

on Windows 7 (64-bit)

Comment 9 QA Administrators 2016-09-20 10:29:37 UTC Comment hidden (obsolete)

Comment 10 Dennis Roczek 2016-09-26 20:07:23 UTC

still repro

Version: 5.2.1.2 (x64)
Build ID: 31dd62db80d4e60af04904455ec9c9219178d620
CPU threads: 4; OS version: Windows 6.19; UI render: default; 
Locale: de-DE (de_DE); Calc: group

@xisco: you changed the title of this bug; did you read my comment #4? (It's not a text encoding problem, but DRM!)

Comment 11 Xisco Faulí 2017-09-29 08:50:07 UTC Comment hidden (obsolete)

Comment 12 Dennis Roczek 2017-10-08 15:32:10 UTC Comment hidden (obsolete)

still repro with
Version: 5.4.1.2 (x64)
Build ID: ea7cb86e6eeb2bf3a5af73a8f7777ac570321527
CPU threads: 4; OS: Windows 6.19; UI render: default; 
Locale: de-DE (de_DE); Calc: group

Comment 13 Urmas 2017-10-16 08:17:09 UTC

The text is extracted fine, it's just encoded.

You can simply replace the symbols with corresponding letters.

Comment 14 Timur 2018-04-19 10:15:34 UTC Comment hidden (obsolete)

Could someone write what this bug is about, what's "expected"?

Comment 15 Dennis Roczek 2018-04-24 08:47:19 UTC

(In reply to Timur from comment #14)
> Could someone write what this bug is about, what's "expected"?

For example, the biggest headline in the middle should display "IBM steigt bei OpenOffice ein" and not some "rubbish".

Since my analysis a few years ago showed that this is DRM stuff, I do not believe that we can resolve this...

Comment 16 QA Administrators 2019-10-28 03:30:10 UTC Comment hidden (obsolete)

Comment 17 Timur 2019-12-15 19:21:19 UTC Comment hidden (obsolete)

No file to test.

Comment 18 Dennis Roczek 2019-12-15 21:42:34 UTC

archive.org copy:
http://web.archive.org/web/20160304091821/https://www.computerwoche.de/archiv/pdf/2007/leseprobe_37.pdf

Comment 19 QA Administrators 2019-12-16 03:32:20 UTC Comment hidden (obsolete)

[Automated Action] NeedInfo-To-Unconfirmed

Comment 20 Dieter 2019-12-16 09:24:16 UTC Comment hidden (obsolete)

Status back to NEW as it was before comment 17.

Comment 21 Timur 2020-11-02 14:14:39 UTC

Hi Khaled. Could you please take a look at this FILEOPEN PDF bug? Opinions on the cause differ: DRM, encoding, ToUnicode. I ask you because of the seemingly related bug 66597. Thanks.

Comment 22 QA Administrators 2024-08-22 03:16:00 UTC Comment hidden (obsolete)

Comment 23 Eyal Rozenberg 2025-01-19 12:27:02 UTC

Created attachment 198612
PDF for reproducing the problem (basis for previous screenshots)

Making sure we have the document here on our bugzilla....

Comment 24 Eyal Rozenberg 2025-01-19 12:30:55 UTC

Created attachment 198613
Screenshot of #198612 with LibreOffice 25.8 nightly

The bug still manifests with a recent nightly:

Version: 25.8.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 2305fe302e12c4256e452589e2533772d4213e59
CPU threads: 4; OS: Linux 6.6; UI render: default; VCL: gtk3
Locale: en-IL (en_IL); UI: en-US

_What_ aspects of the bug manifest?

* Mis-decoded text / misdetected character set / encoding scheme. Example text run: "ŐƠØ ŃɎĠƠŐŃĺØŃƽƚ Ɓ)«ØƄ ĊØƧƽØ«įƽƉ Gĝ"
* Text boxes exceed document boundaries and often overlap (probably because of mis-decoding)
* Images are fine, no doubling or anything like that.

I wonder, though - how common is this situation? In 2014, and today?

Comment 25 Khaled Hosny 2025-04-10 21:59:42 UTC

The PDF metadata shows that it was produced by Ghostscript. The PDF font dictionaries contain no ToUnicode CMaps, nor do they use any standard PDF font encoding. As such, there is not much textual data that can be extracted from the PDF. This is a case of a bad PDF producer (or at least a PDF not intended to preserve textual data), and we can’t do anything to extract data that does not exist.
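
The check described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the font dictionaries are modelled as plain Python dicts (the names `fonts`, `F1`, `F2` and the function `can_map_to_unicode` are made up here and are not LibreOffice or pdfium code), and `/Encoding` is only handled when it is a simple name, not a dictionary with `/Differences`.

```python
# Predefined simple-font encodings named in the PDF specification.
STANDARD_ENCODINGS = {
    "/StandardEncoding",
    "/MacRomanEncoding",
    "/WinAnsiEncoding",
    "/MacExpertEncoding",
}


def can_map_to_unicode(font: dict) -> bool:
    """Return True if the font dictionary gives a way to recover characters."""
    # A ToUnicode CMap maps character codes directly to Unicode.
    if "/ToUnicode" in font:
        return True
    # Otherwise a predefined encoding name is the next best hint.
    encoding = font.get("/Encoding")
    return isinstance(encoding, str) and encoding in STANDARD_ENCODINGS


# Hypothetical font dictionaries, as a PDF library might expose them.
fonts = {
    "F1": {"/Subtype": "/Type0", "/Encoding": "/Identity-H"},
    "F2": {"/Subtype": "/TrueType", "/Encoding": "/WinAnsiEncoding"},
}

# Fonts whose text runs cannot be decoded reliably.
unmappable = [name for name, font in fonts.items() if not can_map_to_unicode(font)]
print(unmappable)
```

In this sketch a font like `F1` is reported as unmappable, which corresponds to the situation described for the attached PDF: no ToUnicode CMap and no standard encoding.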

Comment 26 Eyal Rozenberg 2025-04-11 13:12:43 UTC

(In reply to Khaled Hosny from comment #25)
> The PDF metadata shows that it was produced by Ghostscript. The PDF font
> dictionaries contain no ToUnicode CMaps, nor do they use any standard PDF
> font encoding. As such, there is not much textual data that can be extracted
> from the PDF.

But the text is _there_... I'm no PDF expert (nor do I even have a decent tool for exploring PDF files' raw structure), but if the encoding is iso-8859-1, or something similar, should we not be able to figure this out? Especially given that the hint is a lack of CMaps, rather than garbled CMaps?

> This is a case of a bad PDF producer (or at least a PDF not
> intended to preserve textual data), and we can’t do anything to extract
> data that does not exist.

But there is text, isn't there? So, can we really not do anything?

Comment 27 Khaled Hosny 2025-04-11 15:19:41 UTC

(In reply to Eyal Rozenberg from comment #26)
> (In reply to Khaled Hosny from comment #25)
> > The PDF metadata shows that it was produced by Ghostscript. The PDF font
> > dictionaries contain no ToUnicode CMaps, nor do they use any standard PDF
> > font encoding. As such, there is not much textual data that can be extracted
> > from the PDF.
> 
> But the text is _there_... I'm no PDF expert (nor do I even have a decent tool
> for exploring PDF files' raw structure), but if the encoding is
> iso-8859-1, or something similar, should we not be able to figure this out?
> Especially given that the hint is a lack of CMaps, rather than garbled CMaps?

A PDF text stream often contains glyph indices (from a subset font) and positions. The glyph indices are arbitrary and differ from font subset to font subset. A PDF font subset contains at most 256 glyphs. When there is no ToUnicode CMap for a given font, PDF tools will assume the glyph indices are codepoints and will try to use them for text extraction, and that is the garbled text you are seeing. For example, the string “Nr.” (as it appears when looking at the PDF, at the top left corner) is encoded in the PDF as:

<002200CF>41.1893<00BD>

The hex numbers are glyph IDs and the decimal number is kerning. The hex numbers mean glyph index 34 (0x0022), glyph index 207 (0x00CF), and glyph index 189 (0x00BD).

If the font had a ToUnicode CMap, it would have mapped 0x0022 to “N”, 0x00CF to “r”, and 0x00BD to “.”, but there isn’t one, and when interpreting these numbers as codepoints we get:

"Ï½

Which just makes no sense. There is no text encoding where “"Ï½” is “Nr.”, and even if there were one, it would be pure coincidence, and the next string or the next font would be broken.
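
The example above can be reproduced with a few lines of Python. This is a minimal sketch, not how any PDF tool is actually implemented: the helper `glyph_ids` and the `to_unicode` table are made up for illustration, the kerning value between the two hex strings is simply dropped, and the table only contains the three mappings named in this comment.

```python
def glyph_ids(hex_string: str) -> list[int]:
    """Split a PDF hex string such as <002200CF> into 2-byte glyph IDs."""
    digits = hex_string.strip("<>")
    return [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]


# The "Nr." run from the PDF; the kerning number between the strings is ignored.
ids = glyph_ids("<002200CF>") + glyph_ids("<00BD>")

# Without a ToUnicode CMap, a tool falls back to treating glyph IDs as codepoints.
naive = "".join(chr(g) for g in ids)

# Hypothetical per-font table standing in for the missing ToUnicode CMap.
to_unicode = {0x0022: "N", 0x00CF: "r", 0x00BD: "."}
mapped = "".join(to_unicode[g] for g in ids)

print(ids)     # glyph indices 34, 207, 189
print(naive)   # the garbled fallback text
print(mapped)  # what a ToUnicode CMap would have produced
```

Because such a table is specific to each font subset, one recovered by hand for this run would not carry over to the next font in the document.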

Comment 28 V Stuart Foote 2025-04-11 16:05:57 UTC

Created attachment 200303
first page of PDF inserted to ODF with good fidelity

(In reply to Eyal Rozenberg from comment #26)
> 
> But there is text, isn't there? So, can we really not do anything?

Sure, use the pdfium based filter and work with the PDF pages as high fidelity images. Attached.

Comment 29 Eyal Rozenberg 2025-04-11 17:05:52 UTC

(In reply to Khaled Hosny from comment #27)

Ok, if the glyph indices are indeed obscure, then, it is what it is. Thanks for taking the time to elaborate.

Comment 30 Dennis Roczek 2025-04-11 22:47:30 UTC

(In reply to V Stuart Foote from comment #28)
> Created attachment 200303
> first page of PDF inserted to ODF with good fidelity
> 
> (In reply to Eyal Rozenberg from comment #26)
> > 
> > But there is text, isn't there? So, can we really not do anything?
> 
> Sure, use the pdfium based filter and work with the PDF pages as high
> fidelity images. Attached.

then OCR and put it back in the source code... 🤪