Bug 115117 - Incorrect PDF cmap entries for ligatures and broken text extraction
Summary: Incorrect PDF cmap entries for ligatures and broken text extraction
Status: VERIFIED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Printing and PDF export
Version (earliest affected): 5.3.0.3 release
Hardware: All All
Importance: medium minor
Assignee: Khaled Hosny
URL:
Whiteboard: target:6.1.0 target:6.0.4
Keywords:
Duplicates: 116056 116284 116490 117451
Depends on:
Blocks:
 
Reported: 2018-01-19 22:52 UTC by Jacob Barhak
Modified: 2018-05-09 08:57 UTC
CC List: 4 users

See Also:
Crash report or crash signature:


Attachments
Archive containing original document and constructed pdf files (358.37 KB, application/x-zip-compressed)
2018-01-19 22:52 UTC, Jacob Barhak

Description Jacob Barhak 2018-01-19 22:52:19 UTC
Created attachment 139227
Archive containing original document and constructed pdf files

When exporting a document with the Export as PDF menu option, some links cannot be copied correctly from the resulting PDF using Adobe Acrobat Reader DC version 2018.009.20050 under Windows 10.

Links copied from the PDF have some letters changed or added and are therefore corrupted.

Similar misbehavior happens when printing to a PDF file using the printer driver in Windows 10.

Attached are:
1. Example docx document - test.docx
2. PDF created using the LibreOffice PDF export - test.pdf
3. PDF created using the Windows print-to-PDF driver - test_print_as_pdf.pdf

To reproduce the problem, copy the text from test.pdf into Notepad. It will result in:

 2012 - Present: The Reference Model: A Disease Model for Diabeti disease progression based
on use of iomputng power and literature referenies. See:
htps://simtk.org/proeeits/therefmodel

Do the same for test_print_as_pdf.pdf and it will result in:
 2012 - Present: The Reference Model: A Disease Model for Diabe􀆟c disease progression based
on use of compu􀆟ng power and literature references. See:
h􀆩ps://simtk.org/projects/therefmodel

Clearly there is some incompatibility between software components, since different conversions to PDF produce different outcomes. It is also possible that this is a bug in Adobe's copy function.

Hopefully this description is sufficient to reproduce the issue.

            Jacob
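
To pin down exactly which fragments were corrupted in a copy-and-paste test like the one above, the expected and copied strings can be compared programmatically. The following is only a minimal sketch using Python's standard `difflib`; the helper name `corrupted_spans` is hypothetical and not part of any tool mentioned in this report.

```python
import difflib


def corrupted_spans(expected, copied):
    """Return (expected_fragment, copied_fragment) pairs where the texts differ."""
    matcher = difflib.SequenceMatcher(None, expected, copied, autojunk=False)
    return [
        (expected[i1:i2], copied[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# The link as typed in test.docx versus the link as copied from test.pdf.
expected = "https://simtk.org/projects/therefmodel"
copied = "htps://simtk.org/proeeits/therefmodel"

for want, got in corrupted_spans(expected, copied):
    print(f"expected {want!r} but copied {got!r}")
```

An empty result means the copied text matches the source exactly; any pair in the list points at a glyph or glyph sequence worth inspecting in the PDF.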
Comment 1 Timur 2018-01-23 19:04:21 UTC
The issue is related to the fonts used. I see it happen with Calibri and Carlito.
It can be seen when text from the PDF is copied and pasted back.
Started in 5.3.0.
Comment 2 Khaled Hosny 2018-01-23 21:00:50 UTC
What is happening here is that Calibri has a "ti" ligature that is enabled by default, and the PDF we produce has problems when copying ligatures from fonts built in certain ways. Before the switch to HarfBuzz we didn't enable ligatures for Latin text at all, so this issue was masked.

Not actually a regression: copying text with ligatures and other advanced text layout features has always been broken.

A simple workaround is to disable ligatures; the proper fix is tracked in bug 66597.
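
For readers unfamiliar with how a PDF maps glyphs back to text: a font's ToUnicode CMap can map a single glyph code to several Unicode code points, which is what a ligature glyph needs. The sketch below only illustrates the shape of such `bfchar` entries; it is not the LibreOffice code, the helper name `bfchar_entry` is hypothetical, and the glyph IDs are made up for illustration.

```python
def bfchar_entry(glyph_id, text):
    """Format one ToUnicode bfchar line: 2-byte glyph code -> UTF-16BE text."""
    src = f"{glyph_id:04X}"
    dst = text.encode("utf-16-be").hex().upper()
    return f"<{src}> <{dst}>"


# Hypothetical glyph IDs, chosen only for illustration.
print(bfchar_entry(0x0057, "t"))   # <0057> <0074>
print(bfchar_entry(0x01A5, "ti"))  # <01A5> <00740069>
```

If the entry for a ligature glyph carries only one code point, or the wrong ones, a viewer that relies on the CMap will copy the wrong text even though the page renders correctly.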

*** This bug has been marked as a duplicate of bug 66597 ***
Comment 3 Khaled Hosny 2018-03-20 02:23:38 UTC
*** Bug 116284 has been marked as a duplicate of this bug. ***
Comment 4 Khaled Hosny 2018-03-20 02:24:37 UTC
Bug 66597 is becoming a kind of meta bug with different issues lumped together; let's separate the different issues.
Comment 5 Khaled Hosny 2018-03-20 02:25:57 UTC
*** Bug 116056 has been marked as a duplicate of this bug. ***
Comment 6 Khaled Hosny 2018-03-20 02:26:38 UTC
*** Bug 116490 has been marked as a duplicate of this bug. ***
Comment 7 Commit Notification 2018-03-21 17:32:50 UTC
Khaled Hosny committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=b94a66ebc8db6c5ca9c7dcfdfbb06b49deae4939

tdf#115117: Fix PDF ToUnicode CMAP for ligatures

It will be available in 6.1.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 8 Timur 2018-03-22 16:22:18 UTC
With fix we get "ti " and "htt ps" i.e. all chars but with some space:
on use of computi ng power and literature references. See: 
htt ps://simtk.org/projects/therefmodel  

Surely better than it was, but can you please explain the space?
Comment 9 Commit Notification 2018-03-23 08:26:41 UTC
Khaled Hosny committed a patch related to this issue.
It has been pushed to "libreoffice-6-0":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=90fb652ebbc4b16ae5001140076f52209e913345&h=libreoffice-6-0

tdf#115117: Fix PDF ToUnicode CMAP for ligatures

It will be available in 6.0.4.

Comment 10 Harry.dent 2018-03-23 15:49:31 UTC
Confirmed fixed for me on 6.0.4. Thanks!
Comment 11 Khaled Hosny 2018-03-23 20:31:23 UTC
(In reply to Timur from comment #8)
> With fix we get "ti " and "htt ps" i.e. all chars but with some space:
> on use of computi ng power and literature references. See: 
> htt ps://simtk.org/projects/therefmodel  

Which file is this?
Comment 12 Timur 2018-03-26 07:35:15 UTC
(In reply to Khaled Hosny from comment #11)
> Which file is this?
DOCX from this bug. 
And also in duplicate 116490 with word "final" that becomes "fi nal".
Comment 13 Khaled Hosny 2018-03-26 10:23:31 UTC
(In reply to Timur from comment #12)
> (In reply to Khaled Hosny from comment #11)
> > Which file is this?
> DOCX from this bug. 
> And also in duplicate 116490 with word "final" that becomes "fi nal".

I cannot reproduce that, here is the text extracted with pdftotext:



2012 - Present: The Reference Model: A Disease Model for Diabetic disease progression based
on use of computing power and literature references. See:
https://simtk.org/projects/therefmodel
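
A check like this can be scripted so it is independent of any particular viewer's copy behaviour. The following is a rough sketch that assumes the Poppler `pdftotext` command-line tool is installed and on the PATH; the helper names and the list of expected phrases are illustrative assumptions, not part of the fix.

```python
import subprocess
import sys


def extract_text(pdf_path):
    """Run `pdftotext <pdf> -` and return the text it writes to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def missing_phrases(text, phrases):
    """Return the expected phrases that do not appear in the extracted text."""
    flat = " ".join(text.split())  # collapse line breaks so wrapping does not matter
    return [p for p in phrases if p not in flat]


# Phrases from test.docx that contain the affected ligatures.
EXPECTED = ["Diabetic", "computing power", "https://simtk.org/projects/therefmodel"]

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_pdf.py test.pdf
    print(missing_phrases(extract_text(sys.argv[1]), EXPECTED))
```

An empty list means every expected phrase survived extraction.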
Comment 14 Khaled Hosny 2018-03-26 10:26:43 UTC
(In reply to Khaled Hosny from comment #13)
> (In reply to Timur from comment #12)
> > (In reply to Khaled Hosny from comment #11)
> > > Which file is this?
> > DOCX from this bug. 
> > And also in duplicate 116490 with word "final" that becomes "fi nal".
> 
> I cannot reproduce that, here is the text extracted with pdftotext:
> 
> 
> 
> 2012 - Present: The Reference Model: A Disease Model for Diabetic disease
> progression based
> on use of computing power and literature references. See:
> https://simtk.org/projects/therefmodel

The text copied from Acrobat Reader DC:

 2012 - Present: The Reference Model: A Disease Model for Diabetic disease progression based
on use of computing power and literature references. See:
https://simtk.org/projects/therefmodel
Comment 15 Timur 2018-03-26 10:52:32 UTC
I copy the text from the LO-exported test.pdf in PDF-XChange Viewer 2.5 into Notepad or back into LO and I see the space. But when I copy text from the same test.pdf from within Adobe Reader or Master PDF Editor, it's fine. Sorry I didn't test both.

Maybe it's about my viewer (https://www.tracker-software.com/product/pdf-xchange-viewer/download?fileid=446). But text copied from the MSO-exported test.pdf from within the same PDF-XChange Viewer is fine, so I thought it was an LO issue. Something is different; I don't say wrong.
Comment 16 Timur 2018-03-26 11:07:37 UTC
Let me write a conclusion: looks like a Viewer bug. 
I use an old version because I have a license. 
New version from https://www.tracker-software.com/product/pdf-xchange-editor/download?fileid=613 doesn't copy a space. 
Thank you.
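
The two kinds of symptom seen in this report (wrong characters before the fix, an extra space in one viewer after it) can be told apart mechanically. This is a small sketch under the assumption that whitespace-only differences come from a viewer's word-spacing heuristics rather than from the character mapping; the function name `classify` is hypothetical.

```python
def classify(expected, copied):
    """Classify how copied text differs from the expected text."""
    if expected == copied:
        return "identical"
    # Drop all whitespace and compare what is left.
    squeeze = lambda s: "".join(s.split())
    if squeeze(expected) == squeeze(copied):
        return "spacing only"
    return "characters differ"


print(classify("final", "fi nal"))          # spacing only
print(classify("computing", "iomputng"))    # characters differ
print(classify("computing", "computing"))   # identical
```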
Comment 17 Khaled Hosny 2018-05-09 08:57:32 UTC
*** Bug 117451 has been marked as a duplicate of this bug. ***