Bug 124191 - Text copied from a PDF exported using Linux Libertine G Graphite font is missing characters. (comment 24)
Status: RESOLVED DUPLICATE of bug 66597
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Printing and PDF export
Version (earliest affected): 6.0.0.3 release
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-03-19 00:13 UTC by Frank Zimmerman
Modified: 2019-04-25 13:23 UTC
CC List: 5 users

See Also:
Crash report or crash signature:


Attachments
Print to PDF using Foxit Phantom Printer Driver (17.18 KB, application/pdf)
2019-03-20 18:32 UTC, Frank Zimmerman
Print to PDF using MS PDF printer driver (157.10 KB, application/pdf)
2019-03-20 18:33 UTC, Frank Zimmerman
Print to XPS using MS XPS printer driver (162.25 KB, application/zip)
2019-03-20 18:34 UTC, Frank Zimmerman
zipped 55MB pg10 of ref with its streams uncompressed with qpdf (22.62 MB, application/x-zip-compressed)
2019-03-20 19:47 UTC, V Stuart Foote
extracted text portion of page 10 of the writer 6.0 exported PDF (9.22 KB, text/plain)
2019-03-20 19:49 UTC, V Stuart Foote
Font features Linux Libertine (v5.3.0) (15.75 KB, image/png)
2019-03-22 18:19 UTC, V Stuart Foote
Graphite Font features Linux Libertine G (V5.1.3) (44.14 KB, image/png)
2019-03-22 18:20 UTC, V Stuart Foote
Font features Linux Biolinum (v1.1.8) (14.23 KB, image/png)
2019-03-22 18:21 UTC, V Stuart Foote
Graphite Font features Linux Biolinum G (v1.1.0) (41.44 KB, image/png)
2019-03-22 18:22 UTC, V Stuart Foote
Word Doc with Ligatures exported to PDF (152.29 KB, application/pdf)
2019-04-24 19:33 UTC, Frank Zimmerman
Word Doc with Ligatures used to create PDF (11.83 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2019-04-24 19:35 UTC, Frank Zimmerman

Description Frank Zimmerman 2019-03-19 00:13:37 UTC
Description:
I use the Linux Libertine G font extensively in LibreOffice Writer to prepare PDF eBooks. A while ago, I noticed that while trying to copy a paragraph from such a PDF, the resulting text was missing characters, or sometimes had duplicated characters.

Steps to Reproduce:
1. Create a new document.
2. Type in the following line: "The fire flying coffee left Quickly."
3. Make sure this line is using Linux Libertine G font.
4. Export the document to PDF.
5. Open the PDF. The text looks fine.
6. Select the text line in the PDF, and paste into Writer, or into any text editor.
7. You will see something like this: "The fir flying coffe lft Quiickl."

Actual Results:
The fir flying coffe lft Quiickl.


Expected Results:
The fire flying coffee left Quickly.
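
To quantify the damage, the two strings can be compared character by character. The following is only a minimal Python sketch, not part of the original report; the variable names are illustrative.

```python
from collections import Counter

expected = "The fire flying coffee left Quickly."
actual = "The fir flying coffe lft Quiickl."

# Counter subtraction keeps only positive counts, so each result shows
# what one string has in excess of the other.
missing = Counter(expected) - Counter(actual)
extra = Counter(actual) - Counter(expected)

print("missing:", dict(missing))
print("extra:", dict(extra))
```

For this test line it reports three lost "e", one lost "y", and one duplicated "i".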


Reproducible: Always


User Profile Reset: No



Additional Info:
The underlying text in the PDF, or perhaps some kind of translation layer or lookup table for ligatures, is causing strange problems in text copied out of a PDF that was created using fonts that support ligatures.

I found a book that I created with LibreOffice 3, back in the days when ligatures were not supported, and it exports fine.

I also tried the same test with Microsoft Word (with ligature support enabled) and it worked fine.

The problems with the text get worse as the document size grows. I have one book of about 600 pages where the copy/pasted text is quite awful, but if you take just that one page out of the document and put it in a new document, the copy/pasted text is significantly better (not perfect though).

So it seems to be a problem that compounds as the document size grows.
Comment 1 V Stuart Foote 2019-03-19 05:02:20 UTC
I cannot reproduce on Windows 10 Pro 64-bit en-US (1803) with LinLibertine_R_G.ttf, v5.10 (mod 13-1-2012), of Linux Libertine G installed to system. Testing:

Version: 6.1.4.2 (x64)
Build ID: 9d0f32d1f0b509096fd65e0d4bec26ddd1938fd3
CPU threads: 8; OS: Windows 10.0; UI render: GL; 
Locale: en-US (en_US); Calc: group threaded

or current master/6.3.0alpha0+

with or without OpenGL rendering enabled.

The only issue I notice is that the "Linux Libertine G" font name gets exported to PDF as "LinuxLibertineG" and is treated as a missing font on paste back to Writer. So this could be a font fallback issue.

But I guess the Linux Libertine G font could have issues with ligature tables causing the missing glyphs from the text strings "copied" to the clipboard. I'm just not seeing it with this build.

There are a couple of past issues, as in the see-also bug 115117, but I did not want to dupe to that without some review.

I believe the proposed work implementing tagged PDF with /ActualText structs for bug 117428 would improve this workflow, giving better fidelity of text copied out from the body of a PDF.

@Khaled, Miklos any thoughts?
Comment 2 Frank Zimmerman 2019-03-19 17:06:55 UTC
Cannot reproduce? I hope that's a good sign.

Here are some examples from my website. The books on the following page were produced with LibreOffice 5. Try downloading one of the PDF's and copy/paste a paragraph out into a text editor. Let me know what you get.

https://www.practicaprophetica.com/books/edward-irving/

The following book was produced with LibreOffice 6, not too long ago:

https://www.practicaprophetica.com/books/ftw/#Last-Day-Events

Go to the 10th page (part of the intro pages before the numbering starts), the "Cover Picture" description, and copy out the first paragraph.

BTW I'm also using Win 10-64 bit, Version 1809. My current LibreOffice is 6.2.0.3 64 bit, but as shown from the first books linked in the page above, this has been around since LibreOffice 5, so I'm not sure that makes any difference.

I'm also pasting directly into a text editor (EditPad), so the fontname issue you mention below is not applicable in that scenario (although I noticed it also when pasting into LO).
Comment 3 Frank Zimmerman 2019-03-19 17:08:18 UTC
Correction: for the first webpage link - only the first book was created with LibreOffice 5. The rest are all tagged LibreOffice 6 (I think they were started with LO 5 and finished with LO 6).
Comment 4 ⁨خالد حسني⁩ 2019-03-20 13:28:01 UTC
(In reply to Frank Zimmerman from comment #2)
> https://www.practicaprophetica.com/books/ftw/#Last-Day-Events
> 
> Go to the 10th page (part of the intro pages before the numbering starts),
> the "Cover Picture" description, and copy out the first paragraph.

I can reproduce this with the PDF on the website, but not with the PDF I generate locally from the ODT file using LibreOffice 6.2. So please make sure the PDFs are generated with the latest LibreOffice version; if you still have a problem, please test with different PDF readers and report which one(s) have issues with copying text.
Comment 5 Frank Zimmerman 2019-03-20 16:47:41 UTC
(In reply to Khaled Hosny (inactive) from comment #4)
I have two laptops currently, both Dell Precision, so high-end business class. Both show the problem, using the latest (6.2.0.3) LibreOffice, x64, on the latest W10 x64.

I have tried Foxit and Acrobat. Both show the copy/paste problem.

The fact that you can reproduce it with a PDF I generated means that it's not a problem with the PDF reader, but rather something happening in the writing process.

I have Foxit Phantom (PDF Editor) also installed, but I can't see how this would affect LibreOffice's PDF writing routines. Nevertheless, I will try the export from another unrelated computer in the office (later today if I can get to it), and report back.
Comment 6 Frank Zimmerman 2019-03-20 18:32:03 UTC
Created attachment 150127
Print to PDF using Foxit Phantom Printer Driver

Try to copy the text from this PDF and paste into a text editor. I am seeing ligature-related problems.
Comment 7 Frank Zimmerman 2019-03-20 18:33:39 UTC
Created attachment 150128
Print to PDF using MS PDF printer driver

Try to copy the text from this PDF and paste into a text editor. I am seeing Ligature-related problems.
Comment 8 Frank Zimmerman 2019-03-20 18:34:38 UTC
Created attachment 150129
Print to XPS using MS XPS printer driver

Try to copy the text from this XPS and paste into a text editor. I am seeing Ligature-related problems.
Comment 9 Frank Zimmerman 2019-03-20 18:37:55 UTC
I've attached three sample outputs using PDF and XPS printer drivers. These all have the same problems, when I try to copy text from them and paste into a text editor. This shows that it's not strictly related to the PDF output routines in the LibreOffice PDF export.
Comment 10 V Stuart Foote 2019-03-20 19:41:58 UTC
Poking at this I extracted page 10 of the Last-days-events PDF with Acrobat Pro DC v.2019.008.20080, and uncompressed its streams with qpdf v.8.4.0

While the page is extracted using Acrobat--I think it is correct to say its original structure was LibreOffice generated with rdf--xmp CreatorTool Writer, and Producer LibreOffice 6.0. 

The Linux Libertine G regular, along with other fonts, get recorded as a /BaseFont struct [1], while its /ToUnicode map is also created [2].

What is odd is that the character <01> is mapped to unicode glyphs "005400680065" or "The"; there is no <02> in the map, and the U+0065 "e" is never defined as a single glyph.  Character <26> is "ffe" --suffering--, <36> is "tte", and <52> is "Que"
 
Attaching the uncompressed Stream, where the BT & ET bracketed strings with /F6 font are the Linux Libertine G. The /F2 stanza is the opening "Cover Picture" in Linux Biolinum G. 

While there is a character <02> used in the passages, it does not appear in the /ToUnicode lookup table.

Since LibreOffice should have written out the /ToUnicode struct for subsetted fonts, believe issue could be there. Some of the original PDF stuff in pdfwriter_impl?


=-ref-=
[1] 
<< /BaseFont /GAAAAA+LinuxLibertineG /FirstChar 0 /FontDescriptor 58 0 R /LastChar 89 /Subtype /TrueType /ToUnicode 59 0 R /Type /Font /Widths [ 500 1047 446 250 427 503 496 371 518 270 315 530 746 456 389 511 541 537 337 530 500 263 789 309 464 464 464 492 505 514 219 219 559 423 476 489 484 587 581 235 645 548 586 596 694 296 698 267 694 838 729 953 271 828 613 525 595 701 574 502 700 464 464 464 375 375 464 464 660 464 704 321 539 741 464 434 235 297 297 748 651 287 1250 636 814 547 327 603 701 587 ] >>

[2]
59 0 obj
<< /Length 1454 >>
stream
/CIDInit/ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo<<
/Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName/Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
88 beginbfchar
<01> <005400680065>
<03> <0020>
<04> <0063>
<05> <006F>
<06> <0076>
<07> <0072>
<08> <0070>
<09> <0069>
<0A> <0074>
<0B> <0075>
<0C> <0077>
<0D> <0061>
<0E> <0073>
<0F> <006B>
<10> <006E>
<11> <0068>
<12> <002D>
<13> <0050>
<14> <0067>
<15> <006C>
<16> <006D>
<17> <0066>
<18> <0031>
<19> <0039>
<1A> <0030>
<1B> <0062>
<1C> <0064>
<1D> <0079>
<1E> <002E>
<1F> <002C>
<20> <006600690072>
<21> <007A>
<22> <0046>
<23> <0078>
<24> <0053>
<25> <0042>
<26> <006600660065>
<27> <003A>
<28> <0043>
<29> <0045>
<2A> <0052>
<2B> <0054>
<2C> <0041>
<2D> <0049>
<2E> <004E>
<2F> <2019>
<30> <0047>
<31> <004D>
<32> <0048>
<33> <0057>
<34> <006A>
<35> <0066006600690063>
<36> <007400740065>
<37> <004C>
<38> <006600740020>
<39> <004F>
<3A> <0059>
<3B> <0071>
<3C> <0044>
<3D> <0032>
<3E> <0037>
<3F> <0033>
<40> <201C>
<41> <201D>
<42> <0036>
<43> <0038>
<44> <0055>
<45> <0034>
<46> <0026>
<47> <004A>
<48> <0066006C0069>
<49> <2014>
<4A> <0035>
<4B> <003F>
<4C> <003B>
<4D> <0028>
<4E> <0029>
<4F> <2026>
<50> <0056>
<51> <0021>
<52> <005100750065>
<53> <004B>
<54> <00660066006C0069>
<55> <2013>
<56> <0066>
<57> <005A>
<58> <0051>
<59> <2020>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
endstream
endobj
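
The multi-character entries and the missing <02> can be checked mechanically. The following is a minimal Python sketch, not LibreOffice code: it assumes a single bfchar section with UTF-16BE destination values, uses only an excerpt of the entries quoted above, and the helper name parse_bfchar is hypothetical.

```python
import re

# Excerpt of the bfchar section quoted above (8 of the 88 entries).
CMAP_EXCERPT = """
8 beginbfchar
<01> <005400680065>
<03> <0020>
<04> <0063>
<20> <006600690072>
<26> <006600660065>
<52> <005100750065>
<56> <0066>
<58> <0051>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Return {glyph code: Unicode string} for the first bfchar section."""
    section = re.search(r"beginbfchar(.*?)endbfchar", cmap_text, re.S)
    if section is None:
        return {}
    mapping = {}
    pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section.group(1))
    for src, dst in pairs:
        # Destination values are sequences of UTF-16BE code units.
        mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


to_unicode = parse_bfchar(CMAP_EXCERPT)

# Glyph codes that map to more than one character (ligature-like entries).
multi_char = {code: text for code, text in to_unicode.items() if len(text) > 1}

for code, text in sorted(multi_char.items()):
    print(f"<{code:02X}> -> {text!r}")
print("<02> mapped:", 0x02 in to_unicode)
```

Run against the full stream instead of the excerpt, the same loop would list every glyph code that carries more than one character.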
Comment 11 V Stuart Foote 2019-03-20 19:47:16 UTC
Created attachment 150130
zipped 55MB pg10 of ref with its streams uncompressed with qpdf
Comment 12 V Stuart Foote 2019-03-20 19:49:20 UTC
Created attachment 150131
extracted text portion of page 10 of the writer 6.0 exported PDF
Comment 13 Frank Zimmerman 2019-03-20 21:22:00 UTC
Thanks for all the testing and comments from everyone!

I was able to get to another laptop in the office, identical to mine (also Dell Precision M4800).

I installed LibreOffice 6.2.1 (it had never been installed on this laptop) and ran some tests. Both the original text line posted in the first comment at the top of this report, and one of my books from the website were loaded and exported to PDF.

And...they worked. No problems with copy/paste into a text editor.

Since I'm seeing the problem on both my own laptops, I'm wondering at this point if there is some software that I have installed that is interfering? ...or some setting?

My next tests will be:

1. Upgrade to the very latest (6.2.1) LibreOffice and try again.
2. Remove Foxit Phantom to see if this makes any difference.
3. Uninstall LibreOffice entirely, wiping out all user data, and reinstall again.

I'm wondering at this point if some setting might have been migrated from an older install (and carried over from version to version) that could be messing with the export.

I'll post back after I've done these tests.
Comment 14 V Stuart Foote 2019-03-20 21:59:46 UTC
(In reply to Frank Zimmerman from comment #9)
> I've attached three sample outputs using PDF and XPS printer drivers. These
> all have the same problems, when I try to copy text from them and paste into
> a text editor. This shows that it's not strictly related to the PDF output
> routines in the LibreOffice PDF export.

No, that is correct. If you are able to set the text editor to use Linux Libertine G you probably would not see corruption. Not certain, but I think the issue was with your LibreOffice's generation of the /ToUnicode mappings.

Below are the /ToUnicode charts for the MS PDF and Phantom PDF for the test string "The fire flying coffee left Quickly"

With Linux Libertine G, the string is encoded by both PDF generators as

'<E049>e <FB01>re <FB02>ying co<FB00>ee le<E039> <E048>ickly.'

glyph positioning omitted of course.

When that text is copied to a text editor, the text editor must be able to use the same Linux Libertine G font. If not, the PUA glyphs (E039, E048, E049) cannot be rendered; the same goes for needing support for the Unicode Alphabetic Presentation Forms (here just FB00 ff, FB01 fi, FB02 fl).

Both PDF generators look to have correctly generated /ToUnicode charts: the PUA code points are mapped, and with a text editor using Linux Libertine G (I prefer BabelPad for this) the strings are fully rendered. Notice that the sequence of glyphs added to the /ToUnicode chart depends on the PDF generator.

MS PDF
<0003> <0020> <sp>
<0011> <002E> .
<0046> <0063> c
<0048> <0065> e
<004A> <0067> g
<004C> <0069> i
<004E> <006B> k
<004F> <006C> l
<0051> <006E> n
<0052> <006F> o
<0055> <0072> r
<005C> <0079> y
<093D> <E039> ft
<094C> <E048> Qu
<094D> <E049> Th
<0A98> <FB00> ff
<0A99> <FB01> fi
<0A9A> <FB02> fl


Phantom PDF
<0001> <E049> Th
<0002> <0065> e 
<0003> <0020> <sp>
<0004> <FB01> fi
<0005> <0072> r
<0006> <FB02> fl
<0007> <0079> y
<0008> <0069> i
<0009> <006E> n
<000A> <0067> g
<000B> <0063> c
<000C> <006F> o
<000D> <FB00> ff
<000E> <006C> l
<000F> <E039> ft
<0010> <E048> Qu
<0011> <006B> k
<0012> <002E> .
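
For anyone who needs plain text from such a PDF, the copied string can be post-processed. The following is a minimal Python sketch under stated assumptions: the PUA table holds only the three assignments listed in the charts above (they are specific to Linux Libertine G), and the function name to_plain_text is hypothetical.

```python
import unicodedata

# PUA assignments taken from the two /ToUnicode charts above. They are
# specific to Linux Libertine G and mean nothing in other fonts.
LIBERTINE_G_PUA = {"\uE039": "ft", "\uE048": "Qu", "\uE049": "Th"}


def to_plain_text(copied):
    # NFKC expands the standard ligatures U+FB00..U+FB02 to ff, fi, fl.
    expanded = unicodedata.normalize("NFKC", copied)
    # PUA code points have no decomposition, so they need a font-specific table.
    return "".join(LIBERTINE_G_PUA.get(ch, ch) for ch in expanded)


# The test string as encoded by both printer drivers.
copied = "\uE049e \uFB01re \uFB02ying co\uFB00ee le\uE039 \uE048ickly."
print(to_plain_text(copied))
```

The normalization step handles the Alphabetic Presentation Forms generically; only the PUA glyphs require knowledge of the font.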
Comment 15 Frank Zimmerman 2019-03-20 22:15:35 UTC
Okay, that's interesting. I was pasting into Notepad, as I wanted a plain text representation of the text, irrespective of ligatures. 

I went back to the computer in our office which is working properly, and with a PDF generated directly from LibreOffice, and which displayed the ligatures, I was able to copy/paste that line into Notepad and it displayed correctly. All the ligature-related characters were translated into their proper plain-text equivalents.

The MS PDF generated file was missing a few ligature-related characters when I copy/pasted into Notepad.
Comment 16 V Stuart Foote 2019-03-20 22:19:06 UTC
Yes, I'm pretty certain we have an issue in building the /ToUnicode map.

Here is the result from a 6.2.1 build. Note the addition of the /ActualText structure, which helps with fidelity of pasted text. But the LibreOffice-generated /ToUnicode does look to have problems: characters <02>, <05>, <07>, <10> are used but are not mapped.

<01> <005400680065>  -- The
<03> <0020> -- <sp>
<04> <006600690072>  -- fir
<06> <0066006C0079> -- fly
<08> <0069> -- i
<09> <006E> -- n
<0A> <0067> -- g
<0B> <0063> -- c
<0C> <006F> -- o
<0D> <006600660065> -- ffe
<0E> <006C> -- l
<0F> <006600740020> -- ft<sp>
<10> <005100750069> -- Qui
<11> <006B> -- k


And the text run, now with /ActualText

BT
/Span<</ActualText<FEFF005400680065>>>
BDC
56.8 724.5 Td /F1 12 Tf[<01>5<02>]TJ
EMC
1 0 0 1 74.8 724.5 Tm
/F1 12 Tf<03>Tj
/Span<</ActualText<FEFF006600690072>>>
BDC
1 0 0 1 77.8 724.5 Tm
/F1 12 Tf[<04>1<05>]TJ
EMC
/Span<</ActualText<FEFF0065>>>
BDC
1 0 0 1 88.8 724.5 Tm
/F1 12 Tf<02>Tj
EMC
1 0 0 1 94.2 724.5 Tm
/F1 12 Tf<03>Tj
/Span<</ActualText<FEFF0066006C0079>>>
BDC
1 0 0 1 97.2 724.5 Tm
/F1 12 Tf[<06>-2<07>]TJ
EMC
1 0 0 1 109.8 724.5 Tm
/F1 12 Tf[<08>-4<090A030B>2<0C>]TJ
/Span<</ActualText<FEFF006600660065>>>
BDC
1 0 0 1 139.8 724.5 Tm
/F1 12 Tf[<0D>-1<02>]TJ
EMC
/Span<</ActualText<FEFF0065>>>
BDC
1 0 0 1 152.2 724.5 Tm
/F1 12 Tf<02>Tj
EMC
1 0 0 1 157.6 724.5 Tm
/F1 12 Tf<030E>Tj
/Span<</ActualText<FEFF0065>>>
BDC
1 0 0 1 163.7 724.5 Tm
/F1 12 Tf<02>Tj
EMC
/Span<</ActualText<FEFF006600740020>>>
BDC
1 0 0 1 169.1 724.5 Tm
/F1 12 Tf[<0F>3<03>]TJ
EMC
/Span<</ActualText<FEFF005100750069>>>
BDC
1 0 0 1 179.2 724.5 Tm
/F1 12 Tf<1008>Tj
EMC
1 0 0 1 197.5 724.5 Tm
/F1 12 Tf[<0B>2<11>3<0E>]TJ
/Span<</ActualText<FEFF0079>>>
BDC
1 0 0 1 211.9 724.5 Tm
/F1 12 Tf<07>Tj
EMC
ET
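
The effect of the /ActualText spans can be illustrated with a small simulation. This is only a sketch in Python, not how any particular PDF reader is implemented: the runs are transcribed by hand from the stream above, and it assumes that a reader which ignores /ActualText simply drops glyph codes that have no /ToUnicode entry.

```python
# /ToUnicode entries from the 6.2.1 chart above.
TO_UNICODE = {
    0x01: "The", 0x03: " ", 0x04: "fir", 0x06: "fly", 0x08: "i",
    0x09: "n", 0x0A: "g", 0x0B: "c", 0x0C: "o", 0x0D: "ffe",
    0x0E: "l", 0x0F: "ft ", 0x10: "Qui", 0x11: "k",
}

# Text runs from the content stream, in order: (ActualText or None, glyph codes).
RUNS = [
    ("The", [0x01, 0x02]),
    (None, [0x03]),
    ("fir", [0x04, 0x05]),
    ("e", [0x02]),
    (None, [0x03]),
    ("fly", [0x06, 0x07]),
    (None, [0x08, 0x09, 0x0A, 0x03, 0x0B, 0x0C]),
    ("ffe", [0x0D, 0x02]),
    ("e", [0x02]),
    (None, [0x03, 0x0E]),
    ("e", [0x02]),
    ("ft ", [0x0F, 0x03]),
    ("Qui", [0x10, 0x08]),
    (None, [0x0B, 0x11, 0x0E]),
    ("y", [0x07]),
]


def extract(runs, to_unicode, honor_actual_text):
    """Simulate text extraction with or without /ActualText support."""
    out = []
    for actual_text, glyphs in runs:
        if honor_actual_text and actual_text is not None:
            # The span's replacement text stands in for all of its glyphs.
            out.append(actual_text)
        else:
            # Assumption: unmapped glyph codes contribute nothing.
            out.append("".join(to_unicode.get(g, "") for g in glyphs))
    return "".join(out)


with_actual_text = extract(RUNS, TO_UNICODE, honor_actual_text=True)
to_unicode_only = extract(RUNS, TO_UNICODE, honor_actual_text=False)
print(repr(with_actual_text))
print(repr(to_unicode_only))
```

Under these assumptions the /ActualText path yields the intended sentence (this excerpt of the stream ends before the final period), while the /ToUnicode-only path yields "The fir flying coffe lft  Quiickl", which closely matches the corruption in the original description.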
Comment 17 Frank Zimmerman 2019-03-21 00:53:42 UTC
Okay folks. The problem has been found!

I tried many things (clean install, removing Foxit Phantom, new user, etc). None of these worked. Then I thought of checking the Fonts folder.

I had a selection of Linux Libertine fonts installed.

1. The full set of (almost new) fonts. These were 5.x series, but not the very latest download (the latest is 5.3.0).

2. Three older 4.x series Libertine fonts.

I removed the whole lot, and installed the very latest, and IT WORKS!

I'm not sure now how I picked up those three older ones, as I was usually pretty careful setting up my system.

Somehow, when working with the fonts in LibreOffice, I was able to see the full set (without duplicates), but the older fonts were getting preferred over the newer ones, and also getting embedded into the PDF.
Comment 18 Frank Zimmerman 2019-03-21 00:56:10 UTC
Should I mark this as Resolved?

Is there some way LibreOffice could prefer the newer fonts over any older ones that were installed? Or is this handled by Windows?

I'll check my home computer later, I'm pretty sure it has the same problem.
Comment 19 Frank Zimmerman 2019-03-21 02:34:01 UTC
So I got home and checked my laptop there.

It did not have any old fonts, but the Libertine were the 5.1.x series, and Biolinum were 1.1.0. These still caused the problems. I tried updating to LibreOffice 6.2.1 first (I was at 6.2.0.3), but that did not help.

But once I uninstalled those fonts, and installed the latest (5.3.0 for Libertine, and 1.1.8 for Biolinum) all was well there also.

I also noticed that the latest Linux Libertine and Biolinum are named without the "G" character, so: "Linux Libertine" instead of "Linux Libertine G".

Maybe it would be a good idea to force-install these latest versions when LibreOffice is installed, seeing as they work perfectly? (and apparently have "ttfautohint" applied, which I'm not sure the older ones have).
Comment 20 ⁨خالد حسني⁩ 2019-03-21 09:38:54 UTC
(In reply to Frank Zimmerman from comment #6)
> Created attachment 150127 [details]
> Print to PDF using Foxit Phantom Printer Driver
> 
> Try to copy the text from this PDF and paste into a text editor. I am seeing
> ligature-related problems.

Printing (regardless of the driver) and exporting to PDF use two different code paths on Windows, so issues in the two are most likely unrelated. Printing uses Windows GDI APIs, IIRC, while PDF export is done internally by LibreOffice.

Let's keep things focused: this bug report should be about PDF export; issues with printing should be in a different bug report.
Comment 21 ⁨خالد حسني⁩ 2019-03-21 09:45:09 UTC
(In reply to V Stuart Foote from comment #16)
> Yes, I'm pretty certain we have an issue in building the /ToUnicode
> 
> Here is result from a 6.2.1 build--note addition of the /ActualText
> structure, which helps with fidelity of pasted text. But that the
> LibreOffice generated /ToUnicode does look to have problems. Character
> <02>,<05>,<07>,<10> are used but are not mapped.

That is fine, it means there is no unique one to one, or one to many mapping between these glyphs (not characters) and the input text, so no /ToUnicode and /ActualText tagging is used for them.
Comment 22 ⁨خالد حسني⁩ 2019-03-21 09:48:51 UTC
(In reply to Frank Zimmerman from comment #19)
> So I got home and checked my laptop there.
> 
> It did not have any old fonts, but the Libertine were the 5.1.x series, and
> Biolinum were 1.1.0. These still caused the problems. I tried updating to
> LibreOffice 6.2.1 first (I was at 6.2.0.3), but that did not help.
> 
> But once I uninstalled those fonts, and installed the latest (5.3.0 for
> Libertine, and 1.1.8 for Biolinum) all was well there also.
> 
> I also noticed that the latest Linux Libertine and Biolinum are named
> without the "G" character, so: "Linux Libertine" instead of "Linux Libertine
> G".

There are two different Libertine fonts: the original OpenType fonts (without G) and the modified Graphite fonts (with G). Your document is using the latter, and they come bundled with LibreOffice. I suggest you uninstall LibreOffice, and any of the fonts that you use that possibly come with LibreOffice, then reinstall LibreOffice and the latest version of any missing fonts you use.

> Maybe it would be a good idea to force-install these latest versions when
> LibreOffice is installed, seeing as they work perfectly? (and apparently
> have "ttfautohint" applied, which I'm not sure the older ones have).
Comment 23 Frank Zimmerman 2019-03-21 21:27:38 UTC
> There are two different Libertine fonts; the original OpenType fonts
> (without G) and the modified Graphite fonts (with G). Your document is using
> the later and they come bundled with LibreOffice. I suggest you uninstall
> LibreOffice, and any of the fonts that you use that possibly come with
> LibreOffice, then reinstall LibreOffice again and the latest version of any
> missing fonts you use.

I just checked the office computer on which I put a clean installation of LibreOffice yesterday. This was a laptop with a fresh install of Windows 10, and with no previous installations of LO. It has the following versions of the Libertine fonts:

Linux Libertine - 5.1.3
Linux Biolinum - 1.1.0

Unfortunately these fonts (and any earlier ones) are the ones that have problems when trying to copy/paste text from an exported PDF.

I had to migrate to the following:

Linux Libertine - 5.3.0
Linux Biolinum - 1.1.8

Those are the ones I downloaded from the Libertine SourceForge page, the latest ones. Those are the ones that SHOULD BE distributed with LO, but are not, so far as I can see.
Comment 24 V Stuart Foote 2019-03-21 21:35:07 UTC
The see-also bug 62846 relates to the specific issue here of handling Graphite fonts, which did seem to have issues with the /ToUnicode mapping. It was merged with bug 66597 for work implementing the /ActualText tagging that came in for the 6.1 release, removing dependence on the /ToUnicode tables, where the Graphite mappings are still wrong. So this is actually fixed for 6.1.

Additional work for bug 117428 to make the /ActualText word-boundary aware will further improve fidelity of copy/paste for all LO content exported to PDF, and should move us closer to supporting tagged PDF/UA (bug 45636) output.

*** This bug has been marked as a duplicate of bug 66597 ***
Comment 25 V Stuart Foote 2019-03-21 22:10:38 UTC
@Frank, could you dig out the build of 6.0 you used to export "Last-Day-Events.pdf"? And please verify that if you open the ODF document with a 6.1 or 6.2 build and export to PDF, with the Graphite Linux Libertine G font build 5.1.3 present, the copy & paste of page 10 is clean.

Also, as Khaled was indicating, there is a difference in font features between the Graphite-modified Libertine G (5.1.3) and Biolinum G (1.1.0) fonts provided to LibreOffice by László Németh (http://numbertext.org/linux/), and the last Libertine font project 5.3 build.

So installing the 5.3.0/1.1.8 builds is fine, but only if you do not need the SIL Graphite font features. Improvements to font feature handling at the LO 6.2 release improve OTF/TTF so the Graphite fonts are not as necessary.
Comment 26 Frank Zimmerman 2019-03-22 04:56:32 UTC
Thanks, I will try that.

I realized after my last post that there is still one contradiction. I have three laptops in testing here: my office laptop, my home laptop, and a coworker's office laptop. They are all Dell Precision (M4800 and M6800), so not much difference in hardware; all running W10 64 latest release.

On the coworker's laptop, where I did a first-time install of LibreOffice 6.2, the graphite fonts were present, yet the export was fine.

On my home laptop, the graphite fonts were present, yet the export was not good. Changing to the 5.3 Libertine fonts fixed it, at least for now.

On my office laptop, the graphite fonts were present, along with three older fonts (4.x releases of Libertine, I can't remember if they were graphite or not, but most likely they were, as I did not see duplicate fonts in LO). The export problem was also present there. I also changed to the latest 5.3 Libertine fonts and it seemed to fix the problem.

So I still have one laptop in the equation which has the graphite fonts, and yet the export works.

Could there be something else going on here, like a font-cache issue?
Comment 27 Frank Zimmerman 2019-03-22 07:03:26 UTC
Here are the results of the tests you requested.

I think it was LO 6.0.7.3 that exported the PDF for the Last-Day-Events book. At least I found that listed in the meta.xml.

1. I removed LO 6.2 and all the fonts related.
2. I installed LO 6.0.7.3. I verified that the graphite fonts were installed.
3. I loaded the ODT, and exported the PDF.
4. I copied out the first paragraph of p. 10 into a text editor. Here is the result:

The covr pictur was takn in north-wstrn Portugal in th summr of 1990 whn a trribl drought had th land in its dadly grip.
The grass was tindr dry, th strams had drid up, grass and forst
firrs wr blazing out of control, and th sun unmaskd by clouds,
burnd with dstructiv powr. Fortunatly rlif firnally cam
whn rain vntually arrivd.

Definitely all the "e" characters are missing. There is an extra "r" in the word "fires" and the word "finally". I tested with Foxit Phantom, Adobe Reader DC, Chrome, Edge. All yielded pretty much the same result as above.

1. I removed LO 6.0.7.3 and all the fonts related.
2. I installed LO 6.1.5. I verified that the graphite fonts were installed.
3. I loaded the ODT, and exported the PDF.
4. I copied out the first paragraph of p. 10 into a text editor. Here is the result:

Acrobat: text copies fine
Foxit Phantom: text is missing characters
Chrome: text copies fine
Edge: text is missing characters

I repeated the same test with LO 6.2.2 and saw the same results as 6.1.5.

This explains the anomaly of the one office computer that worked...I obviously was only testing the copy/paste with Acrobat Reader, whereas on my own laptop, I was testing with Foxit Phantom. Likewise when I did test with Acrobat, it was only with the older PDF's made with LO 6.0.7 (which failed, of course).

Next I removed the graphite fonts and installed the latest non-graphite Libertine fonts. I exported the PDF. Here is the result:

Acrobat: text copies fine
Foxit Phantom: text copies fine
Chrome: text copies fine
Edge: text copies fine

So the PDF's with non-graphite fonts work with all PDF readers, whereas the PDF's with graphite fonts only work with some PDF readers.

I did one more test. I loaded page 10 of the four PDF's into Inkscape. Here are the results:

1. In the PDF's made with graphite fonts, all the "e" characters were missing, and the second letter of each ligature-set. In the non-graphite PDF, only the second letter of each ligature-set was missing.

2. All the PDF's using graphite fonts came into Inkscape as small sections of text (a few words in an object), whereas the non-graphite version came in as whole lines of text (much easier to edit).

3. The non-graphite PDF had the font name changed to "LinLibertine", which required a search/replace to set back to "Linux Libertine" so the font could display properly. The graphite PDF's imported with the correct fontname.

So other than the fontname issue, the non-graphite PDF imported into Inkscape in a much nicer way.

This puts me in a bit of a dilemma. Do I continue with the graphite fonts, knowing that there will be issues with some PDF readers, and that import into other programs (like Inkscape) could be problematic? Or do I switch to the non-graphite fonts, which could be confusing to others who download my ODT files but  only have the graphite fonts that LO installed?

It would be nice if we could have the same comprehensive support in the graphite fonts that the non-graphite ones seem to have.
Comment 28 ⁨خالد حسني⁩ 2019-03-22 08:54:16 UTC
(In reply to Frank Zimmerman from comment #27) 
> It would be nice if we could have the same comprehensive support in the
> graphite fonts that the non-graphite ones seem to have.

That is not something we can fix, unfortunately. The way Graphite works and the way these fonts are built requires using /ActualText tags for some glyphs, and the faulty applications most likely don’t support /ActualText tagging. So you either change the fonts or the faulty applications (or report to the faulty applications and hope they get fixed), there is no other option AFAIK.
Comment 29 V Stuart Foote 2019-03-22 13:10:14 UTC
@Khaled, László

Can the Graphite-built fonts be salvaged?

With changes to font handling, and dropping Uniscribe on Windows, maybe it is worth revisiting Jonathan & László's correction to building the Unicode mapping for Graphite fonts? Tor had to back it out:

https://cgit.freedesktop.org/libreoffice/core/commit/?id=0b70e4ea4fcf0adccdfdf4886e5cc45d46479692 

https://cgit.freedesktop.org/libreoffice/core/commit/?id=d664f279602ae6ea9275b222f3f33634aeec97b3
Comment 30 Frank Zimmerman 2019-03-22 17:28:37 UTC
I'm interested to know what the Graphite fonts offer that the non-graphite ones don't? In other words, what do I stand to lose by switching to non-graphite (other than convenience)?
Comment 31 V Stuart Foote 2019-03-22 18:04:44 UTC
(In reply to Khaled Hosny (inactive) from comment #21)
> > Here is result from a 6.2.1 build--note addition of the /ActualText
> > structure, which helps with fidelity of pasted text. But that the
> > LibreOffice generated /ToUnicode does look to have problems.

> That is fine, it means there is no unique one to one, or one to many mapping
> between these glyphs (not characters) and the input text, so no /ToUnicode
> and /ActualText tagging is used for them.

Things are much improved with HarfBuzz and moving the font handling into CommonSalLayout, but I'm still not sure this is correct, at least not in handling digraphs for the Graphite fonts.

When LO exports "The fire flying coffee left Quickly." to PDF with Graphite font(s), the /ToUnicode struct is getting an additional glyph added to the digraphs (both PUA and Alphabetic Presentation Forms), and then is not mapping that glyph when it probably should.

Use the below /ToUnicode chart with annotations, and read out the Tf[.*]TJ text runs (from LO 6.2.1) in comment 16

<01> <005400680065>  --> "The", but maybe should be just "Th"?
x <02> -- "e" not mapped
<03> <0020> -- <sp>
<04> <006600690072>  -- "fir", but maybe should be just "fi"?
x <05> -- "r" not mapped
<06> <0066006C0079> -- "fly", but maybe should be just "fl"?
x <07> -- "y" not mapped
<08> <0069> -- i
<09> <006E> -- n
<0A> <0067> -- g
<0B> <0063> -- c
<0C> <006F> -- o
<0D> <006600660065> -- "ffe", but maybe should be just "ff"?
<0E> <006C> -- l
<0F> <006600740020> -- "ft<sp>", but maybe should be just "ft"?
<10> <005100750069> -- "Qui", but maybe should be just "Qu"?
<11> <006B> -- k

Seems consistently incorrect. A logic flaw in building the map(s)? Would that be our pdfwriter_impl, or now the graphite2 hb shaper?
Comment 32 V Stuart Foote 2019-03-22 18:19:24 UTC
Created attachment 150199 [details]
Font features Linux Libertine (v5.3.0)

(In reply to Frank Zimmerman from comment #30)
> I'm interested to know what the Graphite fonts offer that the non-graphite
> ones don't? In other words, what do I stand to lose by switching to
> non-graphite (other than convenience)?

Considerable support for typography.

Use a 6.1 or 6.2 build of LibreOffice. Create a new paragraph with both the Graphite 5.1.3 and the last unmodified 5.3.0 build of Libertine fonts.

Open the Format -> Character dialog and select the Features button.

Clips attached.
Comment 33 V Stuart Foote 2019-03-22 18:20:21 UTC
Created attachment 150200 [details]
Graphite Font features Linux Libertine G (V5.1.3)
Comment 34 V Stuart Foote 2019-03-22 18:21:52 UTC
Created attachment 150201 [details]
Font features Linux Biolinum (v1.1.8)
Comment 35 V Stuart Foote 2019-03-22 18:22:36 UTC
Created attachment 150202 [details]
Graphite Font features Linux Biolinum G (v1.1.0)
Comment 36 Frank Zimmerman 2019-03-22 18:46:23 UTC
OK, thanks! I didn't even notice that "Features" button before. I'll have to spend some time assessing if I need all those extra features.
Comment 37 ⁨خالد حسني⁩ 2019-03-22 19:18:26 UTC
(In reply to V Stuart Foote from comment #31)
> (In reply to Khaled Hosny (inactive) from comment #21)
> > > Here is result from a 6.2.1 build--note addition of the /ActualText
> > > structure, which helps with fidelity of pasted text. But that the
> > > LibreOffice generated /ToUnicode does look to have problems.
> 
> > That is fine, it means there is no unique one to one, or one to many mapping
> > between these glyphs (not characters) and the input text, so no /ToUnicode
> > and /ActualText tagging is used for them.
> 
> While things are much improved with HarfBuzz and moving the font handling
> into CommonSalLayout. But I'm still not sure this is correct, at least not
> in handling digraphs for the Graphite fonts. 
> 
> When LO exports to PDF the mapping of "The fire flying coffee left
> Quickly.", with Graphite font(s), the /ToUnicode stuct is getting an
> additional glyph added to the digraphs (both PUA and , and then is not
> mapping that glyph when it probably should.
> 
> Use the below /ToUnicode chart with annotations, and read out the Tf[.*]TJ
> text runs (from LO 6.2.1) in comment 16
> 
> <01> <005400680065>  --> "The", but maybe should be just "Th"?
> x <02> -- "e" not mapped

That is how the fonts are built:
$ hb-shape /usr/share/fonts/TTF/LinLibertine_R_G.ttf "The fire" --no-positions
[T_h=0|e=0|space=3|f_i=4|r=4|e=7]

The number after each glyph is the index of the character it belongs to in the input string. Here both the <T_h> and <e> glyphs get index 0 and the next glyph, <space>, gets index 3. So for us this means that the first three characters, “The”, make a two-glyph cluster, <T_h><e>, and we can’t tell which of the three characters belongs to which glyph, and thus bundle them as a single unit.

Now, /ToUnicode allows only one to one and one to many mappings, but not the many to many mapping that we need here, so we use an /ActualText tag.

For maximum compatibility with PDF readers not supporting /ActualText we also add, as a last resort, a /ToUnicode entry for the first glyph, <T_h>, mapping it to the three characters and skip the second glyph <e>. This is not ideal, but at least one gets some text (and spurious chars for the unmapped glyphs) on such readers.

So basically it is a faulty font: that <e> should have gotten index 2, not 0, and we are doing our best to accommodate the limitations of the PDF format and PDF readers.
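
To make the cluster reasoning above concrete, here is a minimal Python sketch that takes the hb-shape output quoted in this comment, groups glyphs by cluster index, and applies the fallback described above (first glyph of a multi-glyph cluster gets the whole cluster text, the rest get nothing). It is only an illustration: the function names are made up for this example, it is not LibreOffice's pdfwriter code, and it assumes left-to-right text and cluster values that are character indices.

```python
def parse_hb_shape(output):
    """Parse hb-shape --no-positions output into (glyph_name, cluster) pairs."""
    pairs = []
    for item in output.strip().strip("[]").split("|"):
        name, _, cluster = item.rpartition("=")
        pairs.append((name, int(cluster)))
    return pairs


def group_clusters(text, pairs):
    """Return (glyph_names, characters) for each cluster, in text order."""
    # Assumption: cluster values are character indices into `text`.
    starts = sorted({cluster for _, cluster in pairs})
    groups = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(text)
        names = [name for name, cluster in pairs if cluster == start]
        groups.append((names, text[start:end]))
    return groups


def fallback_plan(text, pairs):
    """For each glyph: (name, ToUnicode text or None, cluster needs /ActualText)."""
    plan = []
    for names, chars in group_clusters(text, pairs):
        # More than one glyph in a cluster cannot be expressed in /ToUnicode.
        multi = len(names) > 1
        for position, name in enumerate(names):
            # Last-resort mapping: only the first glyph carries the text.
            mapped = chars if position == 0 else None
            plan.append((name, mapped, multi))
    return plan


pairs = parse_hb_shape("[T_h=0|e=0|space=3|f_i=4|r=4|e=7]")
for name, mapped, needs_actual_text in fallback_plan("The fire", pairs):
    print(name, repr(mapped), needs_actual_text)
```

For "The fire" this gives <T_h> the text "The" and leaves the following <e> unmapped, which matches the first two rows of the /ToUnicode chart in comment 31.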
Comment 38 V Stuart Foote 2019-03-22 19:53:42 UTC
(In reply to Khaled Hosny (inactive) from comment #37)

> So basically it is a faulty font, that <e> should have gotten index 2 not 0,
> and us are doing our best to accommodate limitations of PDF format and PDF
> readers.

Thanks, and that makes perfect sense now. So is there a chance the hb graphite2 shaper will be tweaked--or would László (or another Graphite expert) have to build the Liberation fonts over again and tweak the Graphite support?
Comment 39 V Stuart Foote 2019-03-22 20:39:58 UTC
(In reply to V Stuart Foote from comment #38)
s/Liberation/Libertine
Comment 40 ⁨خالد حسني⁩ 2019-03-22 21:05:24 UTC
(In reply to V Stuart Foote from comment #38)
> (In reply to Khaled Hosny (inactive) from comment #37)
> 
> > So basically it is a faulty font, that <e> should have gotten index 2 not 0,
> > and us are doing our best to accommodate limitations of PDF format and PDF
> > readers.
> 
> Thanks, and that makes perfect sense now. So is there a chance the hb
> graphite2 shaper will be tweaked--or would László (or another Graphite
> expert) have to build the Liberation fonts over again and tweak the Graphite
> support?

I’m pretty sure this is a bug in the font, I can’t reproduce this behavior with other graphite fonts.

$ hb-shape /usr/share/fonts/TTF/CharisSIL-R.ttf "file flow suffice" --no-positions --shaper=graphite2
[f_i=0|l=2|e=3|space=4|f_l=5|o=7|w=8|space=9|s=10|u=11|f_f_i=12|c=15|e=16]
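
The same comparison can be done mechanically. The sketch below (Python, hypothetical helper name, fed with the two hb-shape outputs quoted in this thread) lists the clusters that contain more than one glyph. A multi-glyph cluster is not always a font bug, since combining marks can legitimately share a cluster with their base, so treat the result as a hint about where to look rather than as proof.

```python
from collections import defaultdict


def multi_glyph_clusters(hb_shape_output):
    """Map cluster index -> glyph names, for clusters holding several glyphs."""
    by_cluster = defaultdict(list)
    for item in hb_shape_output.strip().strip("[]").split("|"):
        name, _, cluster = item.rpartition("=")
        by_cluster[int(cluster)].append(name)
    return {c: names for c, names in by_cluster.items() if len(names) > 1}


# Outputs copied from comments 37 and 40 (hb-shape ... --no-positions).
libertine_g = "[T_h=0|e=0|space=3|f_i=4|r=4|e=7]"
charis_sil = ("[f_i=0|l=2|e=3|space=4|f_l=5|o=7|w=8|space=9|"
              "s=10|u=11|f_f_i=12|c=15|e=16]")

print("Linux Libertine G:", multi_glyph_clusters(libertine_g))
print("Charis SIL:", multi_glyph_clusters(charis_sil))
```

With these inputs, Linux Libertine G reports clusters 0 and 4 (the ligature plus the letter after it), while Charis SIL reports none.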
Comment 41 Frank Zimmerman 2019-04-24 19:29:50 UTC
Fellows,

I wanted to revisit this issue and add one more interesting detail.

I recently did a bug report for Foxit Phantom, so they could hopefully modify their code to include support for the /ActualText tags mentioned above.

During the testing procedure, I realized that using Linux Libertine G in Microsoft Word (2016) with Ligatures turned on, I could export to PDF (using the MS Word Export function), and the resulting PDF worked fine for copying out text, with all PDF readers (Foxit, Acrobat, Chrome, Edge).

So, how is it that the Word PDF export can bypass the /ActualText tagging problem? Does it have some internal way of preparing the PDF that avoids this or substitutes a more compatible method?

I also realized that the resulting PDF is about 10 times larger (in file size) than a PDF printed using a PDF printer driver.

I'm going to attach the Word Doc and Exported PDF. Maybe someone can examine the PDF and see what is going on here?

If they (MS Word) can do it, why can't we?
Comment 42 Frank Zimmerman 2019-04-24 19:33:20 UTC
Created attachment 150987 [details]
Word Doc with Ligatures exported to PDF

Here is a PDF created from a simple Word doc (with Ligatures turned on, and using Linux Libertine G). This was created using the Word export function.

This PDF can be loaded into any PDF reader, and the text copies out correctly.

A similar LibreOffice doc exported to PDF only works with PDF readers that support /ActualText tagging (Acrobat, Chrome) and not with others (Foxit, Edge).

What is Word doing differently?
Comment 43 Frank Zimmerman 2019-04-24 19:35:24 UTC
Created attachment 150988 [details]
Word Doc with Ligatures used to create PDF

Here is the Word doc used to create the PDF referred to in the comments.
Comment 44 ⁨خالد حسني⁩ 2019-04-25 13:23:25 UTC
(In reply to Frank Zimmerman from comment #41)
> During the testing procedure, I realized that using Linux Libertine G in
> Microsoft Word (2016) with Ligatures turned on, I could export to PDF (using
> the MS Word Export function), and the resulting PDF worked fine for copying
> out text, with all PDF readers (Foxit, Acrobat, Chrome, Edge).
> 
> So, how is it that the Word PDF export can bypass the /ActualText tagging
> problem? Does it have some internal way of preparing the PDF that avoids
> this or substitutes a more compatible method?

Word (like most applications) does not support Graphite fonts, so it is using the OpenType layout tables in the font (which is basically like using Linux Libertine O) and these don’t have the same problem. The problem is specifically in the Graphite part of the font which seems to be misbehaving.