Bug 59870 - FILEOPEN PDF: Incorrect text encoding
Summary: FILEOPEN PDF: Incorrect text encoding
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 3.6.4.3 release
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: PDF-Import-Writer
Reported: 2013-01-25 22:43 UTC by Dennis Roczek
Modified: 2022-08-22 23:18 UTC (History)
CC List: 4 users

See Also:
Crash report or crash signature:


Attachments
pdf missing text, images (181.50 KB, image/png)
2013-05-19 12:11 UTC, Brenda Granados
incorrect text codification (Libo 4.4) (68.99 KB, image/jpeg)
2014-08-11 11:29 UTC, Xisco Faulí

Description Dennis Roczek 2013-01-25 22:43:10 UTC
See http://www.computerwoche.de/archiv/pdf/2007/leseprobe_37.pdf 

PDF-XChange Viewer cannot copy and paste the text either, so this is harder than the usual text problems.
Comment 1 gounistat 2013-05-06 00:33:06 UTC
When I open a PDF document with LibreOffice 4.0.2.2,
it opens but shows every image doubled instead of once, as if viewed with astigmatism. It is very difficult, if not impossible, to read.
However, I also found that I can delete the copy on top on each page and then read normally. LibreOffice seems to load the same PDF document twice into a single view.
Comment 2 Brenda Granados 2013-05-19 12:11:14 UTC
Created attachment 79523
pdf missing text, images
Comment 3 Brenda Granados 2013-05-19 12:12:33 UTC
When I open this PDF, I cannot even see the text. Other PDFs with images and text do not have the same issue. I wonder why this one is different.

Version: 4.0.3.1 (Build ID: a67943cd4d125208f4ea7fa29439551825cfb39)
Platform: Ubuntu 13.04 

-------------------------------

LibreOffice is powered by a team of volunteers; every bug is confirmed (triaged) by human beings who mostly give their time for free. We invite you to join our triaging by checking out this link: https://wiki.documentfoundation.org/QA/BugTriage
There are also other ways to get involved, including marketing, UX, documentation, and of course development: http://www.libreoffice.org/get-help/mailing-lists
Comment 4 Dennis Roczek 2013-08-15 17:04:47 UTC
Heh, I analyzed the file further and searched some forums and blogs. It turns out that this is a DRM-protected file, which is why the decoding doesn't work. I also noticed that PDF readers (it doesn't matter which one, Adobe Reader or any of the free ones) can copy text out of this PDF document, but the copied text is either "rubbish" or white boxes with black borders (as seen on Asian websites in older web browsers without the required fonts installed)...

LibO should at least warn the user (that it can't handle some fonts/text) when the user tries to open or edit a DRM-protected file.
Comment 5 Xisco Faulí 2014-08-11 11:29:20 UTC
Created attachment 104433
incorrect text codification (Libo 4.4)

How it looks with Version: 4.4.0.0.alpha0+
Build ID: 33fd0d8ae6a6b4e5226991e39fe755d84cb78280
TinderBox: Win-x86@51-TDF, Branch:MASTER, Time: 2014-07-14_10:10:0
Comment 6 Xisco Faulí 2014-08-11 13:22:28 UTC
I have updated the title, as the images are now displayed correctly but the text is not.
It looks like a commit in the range b3f41543851e9985c6c7ba133c32753c9bc732c1..b1f7dd66b898b03cb4bd8d434b6370310ea95946 fixed the missing-text problem, but not completely: the text encoding is still wrong.
Comment 7 QA Administrators 2015-09-04 02:48:33 UTC Comment hidden (obsolete)
Comment 8 Xisco Faulí 2015-09-04 08:28:19 UTC Comment hidden (obsolete)
Comment 9 QA Administrators 2016-09-20 10:29:37 UTC Comment hidden (obsolete)
Comment 10 Dennis Roczek 2016-09-26 20:07:23 UTC
Still reproducible.

Version: 5.2.1.2 (x64)
Build ID: 31dd62db80d4e60af04904455ec9c9219178d620
CPU Threads: 4; OS Version: Windows 6.19; UI Render: default;
Locale: de-DE (de_DE); Calc: group

@xisco: you changed the title of this bug. Did you read my comment #4 (this is not a text encoding problem, but DRM)?
Comment 11 Xisco Faulí 2017-09-29 08:50:07 UTC Comment hidden (obsolete)
Comment 12 Dennis Roczek 2017-10-08 15:32:10 UTC Comment hidden (obsolete)
Comment 13 Urmas 2017-10-16 08:17:09 UTC
The text is extracted fine; it's just encoded.

You can simply replace the symbols with corresponding letters.
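
As a rough illustration of that kind of replacement, here is a minimal Python sketch. It assumes the garbled text is a consistent one-to-one substitution and that the table can be learned by aligning a garbled sample with its known plaintext. The garbled characters below are made up for illustration and are not taken from the PDF; `build_mapping` and `decode` are hypothetical helper names, not part of LibreOffice.

```python
def build_mapping(garbled: str, plain: str) -> dict[int, str]:
    """Learn a symbol-to-letter table from an aligned garbled/plain pair."""
    if len(garbled) != len(plain):
        raise ValueError("samples must have the same length")
    mapping: dict[int, str] = {}
    for g, p in zip(garbled, plain):
        # A consistent substitution never maps one symbol to two letters.
        if mapping.setdefault(ord(g), p) != p:
            raise ValueError(f"inconsistent mapping for {g!r}")
    return mapping


def decode(text: str, mapping: dict[int, str]) -> str:
    """Replace every known symbol; unknown symbols are left unchanged."""
    return text.translate(mapping)


# Made-up sample: these "garbled" symbols are NOT taken from the PDF.
sample_garbled = "\u2460\u2461\u2462"
sample_plain = "IBM"
table = build_mapping(sample_garbled, sample_plain)
result = decode("\u2462\u2461\u2460?", table)
```

This only works if every font in the document uses the same substitution; if each embedded font has its own table, one mapping per font would be needed.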
Comment 14 Timur 2018-04-19 10:15:34 UTC Comment hidden (obsolete)
Comment 15 Dennis Roczek 2018-04-24 08:47:19 UTC
(In reply to Timur from comment #14)
> Could someone write what this bug is about, what's "expected"?

For example, the biggest headline in the middle should read "IBM steigt bei OpenOffice ein" and not some "rubbish".

As my analysis a few years ago showed that this is DRM stuff, I do not believe that we can resolve this...
Comment 16 QA Administrators 2019-10-28 03:30:10 UTC Comment hidden (obsolete)
Comment 17 Timur 2019-12-15 19:21:19 UTC Comment hidden (obsolete)
Comment 19 QA Administrators 2019-12-16 03:32:20 UTC Comment hidden (obsolete)
Comment 20 Dieter 2019-12-16 09:24:16 UTC Comment hidden (obsolete)
Comment 21 Timur 2020-11-02 14:14:39 UTC
Hi Khaled, could you please take a look at this FILEOPEN PDF bug? Opinions differ on the cause: DRM, encoding, or ToUnicode. I am asking you because of the seemingly related bug 66597. Thanks.