Bug Hunting Session
Bug 62846 - Incorrect glyph to Unicode mappings in PDFs (Graphite)
Summary: Incorrect glyph to Unicode mappings in PDFs (Graphite)
Status: RESOLVED DUPLICATE of bug 66597
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Printing and PDF export
Version: 4.1.0.0.alpha0+ Master (earliest affected)
Hardware: Other All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard: BSA target:4.2.0
Keywords:
Depends on:
Blocks: Font-Rendering
Reported: 2013-03-28 05:45 UTC by Jonathan
Modified: 2019-03-21 21:35 UTC (History)

See Also:
Crash report or crash signature:


Attachments
Patch so that trailing multicharacter glyph is correctly mapped to multiple Unicode characters (1.27 KB, patch)
2013-03-28 06:00 UTC, Jonathan
Experimental patch that seems to make multicharacter glyph Unicode mappings work (669 bytes, text/plain)
2013-03-28 06:01 UTC, Jonathan
Test document (8.75 KB, application/vnd.oasis.opendocument.text)
2013-04-08 07:54 UTC, László Németh
Second attempt at a patch. Supersedes previous patches. (6.67 KB, patch)
2013-04-15 00:32 UTC, Jonathan
problem with the cursive gy (14.54 KB, application/pdf)
2013-04-15 14:42 UTC, László Németh
problem with the cursive gy (source document) (9.14 KB, application/vnd.oasis.opendocument.text)
2013-04-15 14:43 UTC, László Németh
test file for comment 29, wrong number export to PDF (11.18 KB, application/vnd.oasis.opendocument.text)
2015-04-25 14:49 UTC, Gerry
test result on master (41.05 KB, application/pdf)
2015-04-26 04:19 UTC, martin_hosken
Awami Nastaliq Type Sample generated by LibreOffice 5.4 (135.00 KB, application/pdf)
2017-09-06 08:37 UTC, Volga
Problem with Cyrillic (34.63 KB, application/pdf)
2017-09-06 08:44 UTC, Volga

Description Jonathan 2013-03-28 05:45:03 UTC
Problem description: Glyphs incorrectly mapped to Unicode characters in PDFs produced using Graphite fonts

Steps to reproduce:
1. Create a text document containing the text "This is the official version." using the font Libertine G.
2. Export the document to PDF.
3. Cut the text from the produced PDF and paste it somewhere.

Current behavior:

The pasted text says "This sthif"

Expected behavior:

The pasted text says "This stiff"
Operating System: All
Version: 4.1.0.0.alpha0+ Master
Comment 1 Jonathan 2013-03-28 06:00:13 UTC
Created attachment 77143 [details]
Patch so that trailing multicharacter glyph is correctly mapped to multiple Unicode characters
Comment 2 Jonathan 2013-03-28 06:01:28 UTC
Created attachment 77144 [details]
Experimental patch that seems to make multicharacter glyph Unicode mappings work
Comment 3 Jonathan 2013-03-28 06:02:03 UTC
I have debugged this a little and found (at least) two problems.

The first is simple: if the PDF ends with a multi-character glyph, only the first character is put into the character mapping (that's why my demonstration ends with a single 'f'). I've attached a patch to fix this, which is simple enough that I'm pretty confident it is correct.

The effect of the second bug is easy to see: only the first character of a multi-character glyph is mapped to that glyph, and subsequent characters are mapped onto the next glyph (that's what produces the unwanted 'h' in 'sthif'). I've made a patch that seems to fix this problem, but since the code is not only used for producing PDFs, I cannot tell whether the behaviour that is a bug for PDF production is actually correct for other functions. I've attached that patch too.
Comment 4 Jonathan 2013-03-28 06:04:03 UTC
> 1. Create a text document containing the text "This is the official
> version." using the font Libertine G.

Oops, sorry, I changed my test case. That should read "This stiff"
Comment 5 Michael Stahl (CIB) 2013-03-28 12:43:22 UTC
Martin, could you please give your opinion of the patch to graphite_layout.cxx ?
Comment 6 László Németh 2013-04-05 12:46:25 UTC
I have tested the patches, but I have got a similar bad result:

“This sthiff the official version. the offichial vershion.”

Maybe this documentation will help to fix the problem:

http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/5411.ToUnicode.pdf

Also the Graphite developer SIL's font Charis SIL has got the same problem in its PDF export.
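
For readers unfamiliar with the ToUnicode document linked above, here is a minimal Python sketch of the part that matters for this bug: a `bfchar` entry in a ToUnicode CMap may map one glyph code to several UTF-16BE code units, which is what a ligature glyph needs. The glyph IDs and the helper names (`to_utf16be_hex`, `bfchar_section`) are made up for illustration; this is not LibreOffice code, and a real writer would also have to emit the surrounding CMap boilerplate and split long sections.

```python
def to_utf16be_hex(text):
    """Hex-encode text as UTF-16BE code units, the form used in ToUnicode CMaps."""
    return text.encode("utf-16-be").hex().upper()


def bfchar_section(glyph_to_text):
    """Build one beginbfchar/endbfchar section for 2-byte glyph codes.

    glyph_to_text maps a glyph ID (int) to the Unicode string it stands for.
    """
    lines = [f"{len(glyph_to_text)} beginbfchar"]
    for gid, text in sorted(glyph_to_text.items()):
        lines.append(f"<{gid:04X}> <{to_utf16be_hex(text)}>")
    lines.append("endbfchar")
    return "\n".join(lines)


# Hypothetical glyph IDs; the real ones depend on the font.
mapping = {0x0010: "T", 0x0011: "st", 0x0012: "ff"}
print(bfchar_section(mapping))
```

The point of the sketch is only that the file format itself can carry "one glyph, several characters"; the difficulty discussed in this bug is getting the right character count to the PDF writer.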
Comment 7 Jonathan 2013-04-08 00:53:26 UTC
I tried to replicate that with my patches and it worked for me. Perhaps you could attach an .odt file that causes the problem?
Comment 8 László Németh 2013-04-08 07:54:47 UTC
Created attachment 77575 [details]
Test document

Test document with the text “This stiff the official version.”
Comment 9 Jonathan 2013-04-08 08:02:28 UTC
I got something much more strange from that attached file:

Thisstifftheofficialversion.stifftheofficial version.stifftheofficialversion.thestifftheofficialversion.officialstifftheofficialversion.version.

At least we are agreed there is a problem! Strange that when I typed the same text into a blank Writer document the PDF came out perfectly.

I'll get the debugger out when I have a free hour or two.
Comment 10 László Németh 2013-04-08 08:15:37 UTC
(In reply to comment #9)
I have got problems with the new empty file, too:

This sthif the offichial vershion.

(It probably makes no difference, but my build uses the internal PDF backend:

libo$ grep -1 pdf\ backend config.log
configure:31658: result: yes
configure:31662: checking which pdf backend to use
configure:31833: result: internal)
Comment 11 Jonathan 2013-04-09 00:31:36 UTC
1. I get the same result so my build is also using the internal PDF backend. The only explanation I can find is that you have somehow not incorporated the patches. That is, the behaviour you are showing is exactly that of the unpatched sources.

2. I've found out why I got strange behaviour. The problem is related to my first patch (trailing multicharacter glyph...). You can't take the last glyph of a sequence and map it to all the remaining characters in the input, because the sequence may not be the final sequence in the file.

I can see no obvious solution. The problem (or at least one problem) is that pCharPosAry, which PDFWriterImpl::drawLayout uses to construct the Unicode mapping, isn't really right for the job. The 'Experimental patch' seems to go some way to improving this, but there is still the problem of knowing the number of characters that correspond to the final glyph in a sequence. It's not just a matter of fixing this in graphite_layout.cxx, because it must remain compatible with GenericSalLayout::GetNextGlyphs.

I suspect the answer is in the comments in pdfwriter_impl.cxx to the effect that the Unicode mapping has been implemented with a quick fix and really needs to be done properly.
Comment 12 László Németh 2013-04-09 15:59:32 UTC
@Jonathan. You are right, I had a problem with the second patch. Your patch works well for the first paragraph of a newly filled document.
Comment 13 Jonathan 2013-04-10 06:11:03 UTC
Looking at the configuration files, it seems to me that the internal/external PDF configuration is for PDF importing, not exporting. Can anyone confirm?

So, since this really is a killer bug for me, I'd like to try to fix it. Can anyone advise on what is at stake if I slightly modify the calling interface for the GetNextGlyphs function? The caller needs to be able to calculate the number of Unicode characters that are represented by the final glyph in a sequence. An easy way to do this would be to fill in one further item in the pCharPosAry array. This will also have the effect of reducing by one the maximum number of glyphs in a sequence.

It also means that any other GetNextGlyphs functions will need to do the same thing. As far as I can tell there is at present only one other such function GenericSalLayout::GetNextGlyphs  But is it possible that there will be others in the future?

Anyway, it all seems harmless enough to me. But then a little knowledge is a dangerous thing.
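
The "one further item" idea can be sketched in a few lines of Python. This is a toy model, not the actual GetNextGlyphs interface: it assumes left-to-right text, assumes the position array holds the starting character index of each glyph, and uses made-up names (`chars_per_glyph`, `glyph_texts`). The shaping of "stiff" into three glyphs is also hypothetical.

```python
def chars_per_glyph(char_pos, n_glyphs):
    """char_pos holds n_glyphs start indices plus one trailing end index."""
    if len(char_pos) != n_glyphs + 1:
        raise ValueError("expected one extra entry after the last glyph")
    # Each glyph covers the characters up to the next glyph's start;
    # the extra entry supplies the "next start" for the final glyph.
    return [char_pos[i + 1] - char_pos[i] for i in range(n_glyphs)]


def glyph_texts(text, char_pos, n_glyphs):
    """Return the source characters covered by each glyph."""
    counts = chars_per_glyph(char_pos, n_glyphs)
    return [text[start:start + n] for start, n in zip(char_pos, counts)]


text = "stiff"
# Hypothetical shaping: "st" ligature, "i", "ff" ligature -> 3 glyphs.
char_pos = [0, 2, 3, 5]  # the last entry (5) is the extra item
print(glyph_texts(text, char_pos, 3))  # ['st', 'i', 'ff']
```

Without the trailing 5, the caller can only guess how many characters the final "ff" glyph covers, which is the situation described above.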
Comment 14 László Németh 2013-04-10 06:46:49 UTC
(In reply to comment #13)
> Anyway, it all seems harmless enough to me. But then a little knowledge is a
> dangerous thing.

I strongly support the fix, it is one of the annoying problems of Graphite (the other ones are the following regressions: character duplication at the hyphenation between ligatures, and the not optimal hyphenation at the explicit hyphens).
Comment 15 Jonathan 2013-04-10 23:27:38 UTC
> I strongly support the fix, it is one of the annoying problems of Graphite
> (the other ones are the following regressions: character duplication at the
> hyphenation between ligatures, and the not optimal hyphenation at the
> explicit hyphens).

Have there been bugs reported for these problems and/or can you give some example files that demonstrate them?
Comment 16 László Németh 2013-04-11 00:49:44 UTC
(In reply to comment #15)
> > I strongly support the fix, it is one of the annoying problems of Graphite
> > (the other ones are the following regressions: character duplication at the
> > hyphenation between ligatures, and the not optimal hyphenation at the
> > explicit hyphens).
> 
> Have there been bugs reported for these problems and/or can you give some
> example files that demonstrate them?

Completely bad hyphenation at Graphite ligatures: Bug 52540 (I will attach an English test file there). (Maybe related Bug 53245 with Graphite ligature handling, also the possible root of these problems, a new feature: Bug 52028.)

I haven't found the test files, but here is a description about the second problem: http://lists.freedesktop.org/archives/libreoffice/2011-October/019232.html (I had to modify this libhyphen patch later).

Thanks. László
Comment 17 Jonathan 2013-04-15 00:32:50 UTC
Created attachment 77964 [details]
Second attempt at a patch. Supersedes previous patches.

- All instances of GetNextGlyphs modified to fill in an extra item in the character indexes array so that the Unicode mapping for the final glyph can be calculated correctly.
- All calls to GetNextGlyphs have the length of the character index increased by one to allow space for the final item.
Comment 18 Jonathan 2013-04-15 00:38:20 UTC
OK I've had another attempt. This one works in all my test cases so far. Some comments:

- I've created the corresponding patches for the Windows code but do not have the environment to test them.
- Presumably a similar patch in GraphiteLayout::fillFrom is required to handle RtL. Anyone out there familiar with a right-to-left script?
- This patch modifies (very slightly) the calling interface for GetNextGlyphs, and should presumably be documented somewhere.

I haven't looked at the other Graphite bugs yet.

Jonathan
Comment 19 László Németh 2013-04-15 14:42:42 UTC
Created attachment 77996 [details]
problem with the cursive gy

Original text: 

Magyar maffia is paffan.
A final version.

Result:

Magyagr magffiag is pagffagn.
A finagl version.
Comment 20 László Németh 2013-04-15 14:43:59 UTC
Created attachment 77997 [details]
problem with the cursive gy (source document)
Comment 21 László Németh 2013-04-15 14:46:20 UTC
The recent patch is quite good (I have checked only under Linux), many thanks for it! I have found only the attached problem.
Comment 22 Jonathan 2013-04-16 01:38:35 UTC
Interesting. When you run the unpatched LibreOffice, the cursive gy is handled properly.

How do you even generate that cursive gy? When I tried to create the same file I got g and y as separate characters.
Comment 23 László Németh 2013-04-16 06:50:01 UTC
(In reply to comment #22)
> Interesting. When you run the unpatched LibreOffice, the cursive gy is
> handled properly.

Sometimes it is handled well by the patched version, too.

> 
> How do you even generate that cursive gy? When I tried to create the same
> file I got g and y as separate characters.

This is a language-dependent Graphite ligature, it works only in Hungarian language documents. There are also some similar German replacements, for example, small capital ß.
Comment 24 Commit Notification 2013-05-22 12:14:58 UTC
Jonathan Schultz committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=0b70e4ea4fcf0adccdfdf4886e5cc45d46479692

fdo#62846 incorrect glyph to Unicode mappings in PDFs



The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 25 Commit Notification 2013-05-22 12:40:30 UTC
Tor Lillqvist committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=d664f279602ae6ea9275b222f3f33634aeec97b3

Revert "fdo#62846 incorrect glyph to Unicode mappings in PDFs"



The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 26 László Németh 2013-05-22 14:01:42 UTC
(In reply to comment #18)
I tried to commit the patch, but you were right, the patch for Windows Uniscribe still has a bug:

Breaks the build for Windows: vcl/win/source/gdi/winlayout.cxx(1897) : error C2065: 'nCharPos' : undeclared identifier
Comment 27 Gerry 2015-04-24 17:42:54 UTC
@László @Tor:

I wanted to ask whether this bug on "Incorrect glyph to Unicode mappings in PDFs" is actually fixed or not, or whether the patches can be applied for future versions?
Comment 28 Jonathan 2015-04-25 06:07:43 UTC
As far as I can tell the patch was never quite right, as it didn't work with the cursive gy. I just tried with the version I currently have running - Debian build Version 4.3.3.2 Build ID: 430m0(Build:2) - and the behaviour has changed but still isn't correct: "This stiff" becomes "This stif".
Comment 29 Gerry 2015-04-25 14:48:34 UTC
I did some further tests in LO 4.4.2.2 (Windows 7) and most of the problems are solved. However, there is a wrong mapping of glyphs to unicode in PDF export in all Graphite fonts (tested with Linux Libertine G and Linux Biolinum G).

If you export numbers to PDF and then export the text from the PDF you get:

Times New Roman (no problem here)
69115
12345
99999
12121
21212
11111
Linux Libertine G
691115
121445
991999
121121
211212
111111
Linux Biolinum G
691115
121445
991999
121121
211212
111111

Please see attached .odt for testing.
Comment 30 Gerry 2015-04-25 14:49:53 UTC
Created attachment 115088 [details]
test file for comment 29, wrong number export to PDF
Comment 31 martin_hosken 2015-04-25 15:49:39 UTC
Works for me against master, based on the patch for bug #52540.
Comment 32 martin_hosken 2015-04-25 15:51:11 UTC
That is, the PDF generated correctly reflects the document. It looks like the bug is fixed in master.
Comment 33 Gerry 2015-04-25 20:32:01 UTC
(In reply to martin_hosken from comment #32)
> That is, the PDF generated correctly reflects the document. It looks like the
> bug is fixed in master.

@Martin: I tested the problem as described in comment 29 with LO 5.0.0.0.alpha1 and the problem still exists with Graphite fonts. Can you please try the test with 11111,12211 (numbers with more than three digits) against master?

You can use attached .odt file for testing: https://bugs.documentfoundation.org/attachment.cgi?id=115088
Comment 34 martin_hosken 2015-04-26 04:19:59 UTC
Created attachment 115108 [details]
test result on master
Comment 35 martin_hosken 2015-04-26 04:22:44 UTC
I also tried 111111 111111 1111111, all of which rendered fine in the PDF, complete with appropriate spacings.
Comment 36 Gerry 2015-04-26 07:45:27 UTC
(In reply to martin_hosken from comment #34)
> Created attachment 115108 [details]
> test result on master

@Martin: Thanks for testing the document, but the bug is unfortunately not fixed. I opened your PDF attachment and copied the text to a text editor. I found the problem as described in comment 29: Times is correct, but Libertine and Biolinum are wrong. There is always a "1" too much (e.g. 99999-> 991999) in the Unicode text which I copied from the PDF to the text editor. Please see:

Times New Roman (okay)
69115
12345
99999
12121
21212
11111
Linux Libertine G (bug appears)
691115
121445
991999
121121
211212
111111
Linux Biolinum G (bug appears)
691115
121445
991999
121121
211212
111111

Thanks
Comment 37 martin_hosken 2015-04-27 06:07:12 UTC
I'm sorry. Yes, this is about extracting text from a PDF.

tldr; The basic way to fix this font is to get the associations correct. But this requires both a compiler change and a serious fixup of the GDL in the font.

OK. What's going on here is a combination of things. Firstly, a font riddled with bugs. Secondly, a compiler that is far too kind and that therefore outputs a font that is less than ideal. And thirdly, a pdf engine that doesn't give us any help.

The compiler has problems at the moment with glyph associations for deletion and insertion. We will fix that. But when we do, we will also make it an error for a rule to have a different number of slots on its left hand side than on its right hand side or in its context. You really really need to fix those.

There is a work around but it will take a lot of work. If you make all the associations in deletions explicit as in:

gEscape ga gb > _ _ gab:(1 3) / ^ _ _ _;

then the compiler will output a font that the engine will accept. You should do this anyway. IOW try seriously to get your font down to there being no warnings except ignored ones. The warnings are trying to tell you something that you really should listen to.

Why is it outputting a 1 all the time instead of nothing? As a text is converted to PDF each glyph's association with its underlying Unicode is tracked and stored as the glyph mapping in the font's ToUnicode table. Since the inserted narrow non-breaking space is associated with one of the digits in the underlying text, it gets associated with a digit in the ToUnicode table for the font. The last such association is taken and that is in a line of 1s hence using 1 everywhere. Ideally it should output nothing. This is why associations in a font are important, and to be honest, tricky.
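
The "last association wins" effect described above can be reproduced with a toy model in Python. Everything here is an assumption made for illustration: the glyph name `nnbsp`, the rule that a spacing glyph is inserted before each trailing group of three digits, and the choice to associate that glyph with the digit before it. It is not how the font or the PDF writer is actually implemented; it only shows why a single per-glyph table produces an extra "1".

```python
NNBSP = "nnbsp"  # stand-in name for the inserted narrow no-break space glyph


def shape_number(digits):
    """Return (glyph, associated_char) pairs for a string of digits.

    A spacing glyph is inserted before each trailing group of three digits
    and (assumption) associated with the digit that precedes it.
    """
    out = []
    for i, d in enumerate(digits):
        out.append((d, d))
        remaining = len(digits) - i - 1
        if remaining and remaining % 3 == 0:
            out.append((NNBSP, d))
    return out


def build_to_unicode(lines):
    """One table per font: later associations overwrite earlier ones."""
    table = {}
    for line in lines:
        for glyph, ch in shape_number(line):
            table[glyph] = ch
    return table


def extract(line, table):
    """Simulate text extraction: look every glyph up in the table."""
    return "".join(table[g] for g, _ in shape_number(line))


lines = ["69115", "12345", "99999", "12121", "21212", "11111"]
table = build_to_unicode(lines)
print(extract("99999", table))  # 991999
print(extract("69115", table))  # 691115
```

Because the last line of the test file is 11111, the spacing glyph ends up mapped to "1" for the whole font, whichever digit it was associated with earlier in the document.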

Does anyone know a way to get the pdf writer to store ActualText elements in the generated PDF containing the actual Unicode for a string rather than trying to back infer it from a sequence of glyphs?
Comment 38 martin_hosken 2015-04-27 10:07:55 UTC
I would like to point out that this same bug exists in OpenType fonts as well. If an OT font creates a ligature it will do the wrong associations with the glyphs. There is no reliable way to use just a ToUnicode cmap to ensure correct reconstruction of text from a pdf file. Instead if you want reliability the only way is to use ActualText. This will involve development work on the pdf writer to generate.

The reason we can't get the results we want is that the information used by the pdf writer is the same information required for cursor positioning and when it comes to ligatures (in the pdf case to ignore the glyph and in the cursor tracking case to map to the underlying character) these values are inherently at odds with each other.

This is a wider bug than just graphite integration.

For example, test the following string ពពកកឿ in an OT font you will get this out from text extraction ពពកកកក.

If you want a workaround for this particular case in Linux Libertine G, then I would use kerning to introduce the space rather than trying to insert a glyph.
Comment 39 László Németh 2015-04-27 10:16:10 UTC
(In reply to martin_hosken from comment #38)
> If you want a workaround for this particular case in Linux Libertine G, then
> I would use kerning to introduce the space rather than trying to insert a
> glyph.

Thanks, I will use this in the next release, within a few months. Many thanks for your investigations!
Comment 40 martin_hosken 2015-04-27 10:25:44 UTC
@László in that case may I beg you to try to get the warnings count as close to 0 as you can. If there are any that you find particularly difficult, please feel free to contact me.

In addition, rules of the kind:

x > y / a _ b run much faster than a x b > a y b

In addition, the engine will look backwards for the a and will start processing at the b for the next rule. You can still use ^ if you want to reprocess your results:

x > y / a ^ _ b

notice that you can do multiple replacements in complex contexts:

x y z > u v w / a _ b _ c _ d

if you are just setting attributes then:

x > @x{attr=3} / a _=x b

or x > @2{attr=3} / a _ b

the less work you can get your font to do the faster it will run.

Another trick is to split big passes into smaller ones with pass constraints. E.g. how about all those \ replacement rules going in their own pass with a pass constraint saying: don't do this pass if the feature isn't enabled. That'll make the font smaller and faster.

There are other speed up tricks that you can do. But let's start with these simple ones first. Happy hacking.

You may also like graide as a good test and development environment. Look for it on graphite.sil.org
Comment 41 László Németh 2015-04-27 12:48:16 UTC
@Martin: Many thanks for your great help! I will try to avoid the warning messages, also simplify the rules. Graide seems to be a very useful tool for these goals. Thanks for the link, too!

@Martin, @Gerry: Many thanks for the tests. I think, it's possible to close this issue now, thanks to Martin's LibreOffice fix, and I will fix the Graphite font problem with numbers in the next Linux Libertine/Biolinum G release in the near future.
Comment 42 Gerry 2015-08-10 09:26:49 UTC
(In reply to László Németh from comment #41)
> @Martin, @Gerry: Many thanks for the tests. I think, it's possible to close
> this issue now, thanks to Martin's LibreOffice fix, and I will fix the
> Graphite font problem with numbers in the next Linux Libertine/Biolinum G
> release in the near future.

@László: I just wanted to ask you when you plan to update the Linux Libertine/Biolinum G fonts to fix the wrong glyph mapping in the PDF output.

Shall the bug be closed already now or after the new font versions are out?

Thanks!
Comment 43 QA Administrators 2016-09-20 10:21:19 UTC Comment hidden (obsolete)
Comment 44 Jonathan 2016-09-20 15:52:37 UTC
I can confirm that the bug is present and the behaviour unchanged in version 5.2.0.4 (Debian build ID 1:5.2.0-2) which is the version installed on my work notebook. I am away from my development machine and unable to test a more recent or upstream version for another 2 weeks.
Comment 45 Vera 2016-10-17 16:05:55 UTC
I can confirm that the bug is present in LibreOffice 5.2.2.2 in Ubuntu 16.04.
Comment 46 Volga 2017-06-25 04:29:22 UTC
This bug still affects LO 5.3. The SIL Awami Nastaliq website has a font type sample that was produced with LibreOffice 5.3.1.2. When I open the file and copy the Urdu text from page 5, I get the following text:

Awami Running Text
One paragraph from Urdu UDHR
À Ã ¢ Õ Œ œ ö – — ý “ ” ¢ ‘ ö Õ ’ ÷ ô ◊ ÿ ∞ Ÿ / " ý ⁄ ¤ ∞ ý ~ “ ” ö ‹ — õ
[the pasted text continues in the same form for several hundred more lines, one or two unrelated Latin or symbol characters per line]

This document is available directly at http://software.sil.org/awami/design/ and also from their download page http://software.sil.org/awami/download/
Comment 47 martin_hosken 2017-06-28 11:12:35 UTC
Looks like this is fixed in 5.4. I ran a test and for the 3 fonts NotoNastaliqUrdu, Awami Nastaliq and Scheherazade, the text copied from the PDF was Arabic (even with correct characters with nuqtas). Which is all pretty amazing given the Awami font doesn't have appropriately named glyphs and also decomposes its nuqtas.
Comment 48 martin_hosken 2017-06-28 15:04:37 UTC
I lied. It's not producing good text, even if it is somewhat Arabic like. For a start the text seems to be backwards.

Here's what is going on. Inside the PDF there is a 1:n mapping between glyphs and characters. That's destined for failure just there because if you break off your nuqtas, you are in for trouble. So, while libo does the best it can, the results are going to be really bad regardless.

This has nothing to do with graphite vs harfbuzz, since by the time the pdf writing is happening, everything has been shaped into the same structures. It's just the nature of the problem that PDF cannot map n:1 glyphs:chars on output, especially for the case [xy]:z and x:w. The only way to do this properly is to output the unicode text along with the glyphed text as part of the PDF page stream.

One way might be in vcl/source/gdi/pdf_impl.cxx to have another MARK() function that takes an OUString&, nIndex and nLen and outputs that as the /ActualText as part of the structure element dictionary in the /Span. This would only get output if structured marking was turned on. I'm not sure if there would need to be any other limiting factors like: the text contains CTL codepoints.

Suffice it to say that libo isn't up to handling CTL text for text export from PDF. But let's not blame libo too much. This is really a bug in PDF since the PDF specification only allows 1:n glyph:char mapping. All very Latin-centric ;)
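
To make the /ActualText suggestion concrete, here is a rough Python sketch of what such a span could look like in a page content stream. It uses an inline marked-content property list rather than a structure element dictionary, which is a simplification; the function name `actual_text_span` and the glyph operator string are placeholders, and nothing here reflects how the LibreOffice PDF writer is organised.

```python
def actual_text_span(text, glyph_ops):
    """Wrap glyph-showing operators in a /Span that carries /ActualText.

    text      -- the Unicode string the glyphs stand for
    glyph_ops -- content-stream operators that draw the glyphs (placeholder)
    """
    # PDF text strings may be written as UTF-16BE with a leading
    # byte-order mark (FEFF), here given as a hex string.
    hex_text = ("\ufeff" + text).encode("utf-16-be").hex().upper()
    return f"/Span << /ActualText <{hex_text}> >> BDC\n{glyph_ops}\nEMC"


# Hypothetical: glyph 0x0012 is an "ffi" ligature in the embedded font.
print(actual_text_span("ffi", "<0012> Tj"))
```

A viewer that honours /ActualText should return the supplied string for the whole span and ignore the per-glyph ToUnicode lookups inside it, which is why this route sidesteps the n:1 mapping problem.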
Comment 49 Volga 2017-09-06 08:37:55 UTC
Created attachment 136057 [details]
Awami Nastaliq Type Sample generated by LibreOffice 5.4

I have already downloaded the font package from SIL (noted in comment 46), extracted the sample ODF, opened it with LibreOffice 5.4.1 and exported it as PDF. When I copy the Urdu UDHR text from the resulting PDF again, the character mapping seems better, but many words are deformed and their direction is not handled correctly.

Version: 5.4.1.2 (x64)
Build ID: ea7cb86e6eeb2bf3a5af73a8f7777ac570321527
CPU threads: 4; OS: Windows 6.19; UI render: default;
Locale: zh-CN (zh_CN); Calc: group

Here is what I copied from the PDF I generated.
Awami Running Text
One paragraph from Urdu UDHR
ے ن
ن
ی ل ب
ڋ
مس
نرل ا
ن
ی جڋ ک حڋک ه ت
ق
م
قِوما
ق
ا ۰۱ ؍دسمڋنر ۸۴۹۱ ؁
با۔
ي
ا اعلان عا مکک ک ے اس ک
ک
ږ ک ک ظوک ر
ن
ن
ن
ور" م
ش
ش
ن
ن
م يم ل ا اع ک
قوک ق
ق
ح
ین
ن
اس
ن
ن
ا " ا ک ء ک
بږ زور الك پ ما ممڋنر مم مت
ق
ے
ي
پ
پ
ن
ے ا ن
ن
ی ل ب
ڋ
مس
عڋ ا ب ے ڋ
ک
ک ے
اب م
ن
را ک
خي ک
ن
ی
ي
بار
ق
ے۔ سا ہ
ن درج
ن
ت
ق
ل م
م ک
م ا ک ور ک ش
ش
ن
م بږ سا
ات پ
ح ف
ن
ص ے
ل گ کا
باں
ي
تما
ن
ے یہ ککہ ا س
ي
ً
بلا
ش
نن۔ م
ي
ہ ل
ّ
ص ح نت
ب م
ق
عا ش
ش
شر و ا
ن
ن
ی ک ین اور اس ک
ي
ږ ک عام ک
ِ
ا اعلان ک ہاں اس ک
ہ
ے
ي
پ
پ
ن
ے ا
ي
پ
پ
ھي ا ب
ڋ
با ککہ هو
ي
د
ی ک ے اور اس ک ن
◌
ا ج
ڋ
اب
ي
اب
ن
س ږ ک ک ھږ
ڋ
ب
ے پ
تن ا س
ي
مي اداروں م ی
ي
ل ع ب
ق
ولوں اور ک
ک
ش بږ ا اص طور پ ج
ن
ے۔ اور ن
◌
ا ج
ڋ
اب
ي
ږا ںکک
ن
ب
ي
بږ وآ قامات پ
ق
م
اب
ق
یہ ڋبږ
ن
باز
ي
ت
ق
م نی ا
◌
و ک ک ے اظ س ح ل ے
ک
ک ب
ق
ثيب
ش
یي
ح
بايس
س ی ک ک ے
ق
ق
با علا
ي
ملك
ي س ک
ک نت
ي
ن م
ن
م ض
ن
نن، اور سا
ي
ن
◌
ا ج
ڋ
ی ک ضج ک
ن
بلات او
ي
ص ف
ن
ت
ق
ے ن
◌
ا ج
ڋ
Comment 50 Volga 2017-09-06 08:44:49 UTC
Created attachment 136058 [details]
Problem with Cyrillic

The problem still appears with Cyrillic. I installed Ponomar Unicode and its TTF version (Ponomar Unicode TT) on my computer, pasted a sample text from http://sci.ponomar.net/fonts.html twice, and set one copy to each font. After exporting to PDF and copying the text, I get the following result:

Ponomar Unicode
Хрⷭ ҇ то́ съ воскре́ се и҆ з̾ ме́ ртвыхъ, сме́
ртїю сме́ рть попра́ въ, и҆
сꙋ́
щымъ во гробѣхъ иво́ тъ дарова́ въ.
Ponomar Unicode TT
Хртоосъ воскреосе иизз еортвыхъ, с еортїю с еорть попраовъ, ии
сꙋ
о
щы ъ во гробѣхъ живоотъ дароваовъ.
Comment 51 martin_hosken 2017-09-06 10:28:14 UTC
Sorry to be somewhat brutal. But until we get the PDF writer to produce the necessary PDF to allow for data extraction, using tagged PDF, it doesn't matter what magic we do with our fonts, it isn't going to work. You can give example after example, it won't help fix the problem.

One of the difficulties with attaching text to a PDF text run is that the text has to be output before the glyphs that give the presentation. So there are a number of tradeoffs we can employ in resolving this. So I'll ask, which do you prefer:

Speed vs size? Do you want to make small PDFs that only output Unicode strings for runs that really need them, but take a bit longer to produce (since the strings have to be analysed to make the decision), or are you OK with having a complete copy of the text in your PDF?

Do we want to make this an option that says: make me extractable PDF or do we always want to generate extractable PDF even if the result is bigger or slower to produce?
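
As a rough illustration of the "small but slower" option, the analysis step could look something like the Python sketch below. It is deliberately conservative and entirely hypothetical: a run is modelled as a list of (glyph IDs, source text) clusters, and a run is flagged as needing its own Unicode string if any cluster uses more than one glyph or if a glyph stands for different text in different places. Right-to-left reordering and other real-world cases are not modelled.

```python
def plan_runs(runs):
    """Decide, per run, whether a per-glyph table is enough.

    runs -- list of runs; each run is a list of (glyph_ids, text) clusters,
            where glyph_ids is a tuple of ints and text is a str.
    Returns a list of booleans: True means the run needs explicit text.
    """
    seen = {}           # glyph ID -> text it stood for the first time
    conflicted = set()  # glyph IDs that stand for more than one text
    for run in runs:
        for glyphs, text in run:
            if len(glyphs) == 1:
                gid = glyphs[0]
                if seen.setdefault(gid, text) != text:
                    conflicted.add(gid)

    def run_needs_text(run):
        return any(len(glyphs) != 1 or glyphs[0] in conflicted
                   for glyphs, _ in run)

    return [run_needs_text(run) for run in runs]


runs = [
    [((1,), "T"), ((2,), "st")],  # ligature: one glyph, several characters
    [((3, 4), "x")],              # one character drawn with two glyphs
    [((5,), "a")],                # glyph 5 stands for "a" here ...
    [((5,), "b")],                # ... and for "b" here
]
print(plan_runs(runs))  # [False, True, True, True]
```

The "always extractable" option corresponds to skipping this analysis and attaching the text to every run, at the cost of a larger file.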
Comment 52 Volga 2017-09-06 14:52:31 UTC
(In reply to martin_hosken from comment #51)
> One of the difficulties with attaching text to a PDF text run is that the
> text has to be output before the glyphs that give the presentation. So there
> are a number of tradeoffs we can employ in resolving this. So I'll ask,
> which you prefer:
No, I had no preference when I reported here. I just reproduced the problem by clicking “Export to PDF” in the toolbar. Sorry.
Comment 53 Jonathan 2017-09-06 22:53:52 UTC
Thanks for the update martin_hosken@sil.org. Personally I concur with the previous comment in that I don't have a strong preference. Neither space nor time is a constraint, but having a searchable PDF is essential. Perhaps if it came to it, getting the PDF right is more important than speed, so I'd go with the slow and small option.

I might repeat that by manually editing my PDF (I forget how I did it, this was years ago) I managed to fix the glyph mapping and make it correctly searchable. I'm not sure what this says about the time/space trade-off you mention, but to my naive interpretation, it does make the current implementation look more like a bug than a design flaw.
Comment 54 Shree Devi Kumar 2018-01-17 15:01:32 UTC
The problem of copying text from pdfs created with unicode fonts for complex scripts has been solved by Jonathan Kew by use of actualtext in xelatex.


It uses the new \XeTeXgenerateactualtext feature - please see http://tug.org/pipermail/xetex/2016-February/026445.html for the announcement.

Is it possible to use a similar approach for LibreOffice?
Comment 55 Shree Devi Kumar 2018-01-18 04:40:56 UTC
Please also see https://bugs.documentfoundation.org/show_bug.cgi?id=66597#c20

Comment # 20 on bug 66597 from Khaled Hosny

LibreOffice has limited support for actual text already and I think it shouldn’t
be hard to extend it and make it an option at least. If someone is interested
in giving this a try, check SetActualText() calls in
sw/source/core/text/EnhancedPDFExportHelper.cxx.
Comment 56 martin_hosken 2018-01-18 07:23:25 UTC
(In reply to shreeshrii from comment #54)
> The problem of copying text from pdfs created with unicode fonts for complex
> scripts has been solved by Jonathan Kew by use of actualtext in xelatex.
> 
> 
> It uses the new \XeTeXgenerateactualtext feature - please see
> http://tug.org/pipermail/xetex/2016-February/026445.html for the
> announcement.
> 
> Is it possible to use a similar approach for LibreOffice?

No. XeTeX is XeTeX and libo, libo. They are completely different animals with completely different processing engines, pdf output mechanisms. There is no overlap. All XeTeX is doing is inserting \actualText elements just as I suggested a while back (see comment #48). This will require some programming from someone who has the time to do it. Either that or you can pay one of the consulting companies to do it. Since this is a new feature, no amount of complaining or trying to say it's a regression on some font or other is going to fix it.

The only way forward on this bug is for someone to commit code to add the capability to libo.
Comment 57 Khaled Hosny 2018-01-25 12:29:12 UTC
We have one common code path for Graphite and non-Graphite fonts now, so whatever the fix for bug 66597 it should work here too.

*** This bug has been marked as a duplicate of bug 66597 ***