Bug 66597 - Problems with copying and extracting text from generated PDF
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Printing and PDF export
Version (earliest affected): Inherited From OOo
Hardware: Other All
Importance: medium normal
Assignee: ⁨خالد حسني⁩
URL:
Whiteboard: BSA target:6.1.0
Keywords:
Duplicates: 62846 124191
Depends on:
Blocks: Font-Rendering 117428
Reported: 2013-07-04 19:23 UTC by Steve White
Modified: 2019-03-21 21:35 UTC (History)
CC List: 9 users

See Also:
Crash report or crash signature:


Attachments
More thorough description of the problem. (6.64 KB, text/plain)
2013-07-04 19:23 UTC, Steve White
LOWriter doc as described in report (13.00 KB, application/msword)
2013-07-04 19:27 UTC, Steve White
PDF as exported on my system (53.88 KB, application/pdf)
2013-07-04 19:31 UTC, Steve White
LOWriter document with a modified set of devanagari fonts (14.00 KB, application/msword)
2016-09-18 03:26 UTC, Shree Devi Kumar
Exported PDF for LOWriter document with a modified set of devanagari fonts (170.59 KB, application/pdf)
2016-09-18 03:27 UTC, Shree Devi Kumar
Copied text from PDF for LOWriter document with a modified set of devanagari fonts (1.28 KB, text/plain)
2016-09-18 03:28 UTC, Shree Devi Kumar
Copied text from PDF for the new LOWriter document (3.08 KB, text/plain)
2016-09-18 03:34 UTC, Shree Devi Kumar
Sample text in multiple Indian scripts - ODT file (18.71 KB, application/vnd.oasis.opendocument.text)
2018-01-24 09:16 UTC, Shree Devi Kumar
Sample text in multiple Indian scripts - Exported PDF (81.11 KB, application/pdf)
2018-01-24 09:17 UTC, Shree Devi Kumar
Sample text in multiple Indian scripts - Text copied from exported pdf (1.92 KB, text/plain)
2018-01-24 09:18 UTC, Shree Devi Kumar
Sample text in multiple Indian scripts - Original Text copied from ODT (2.29 KB, text/plain)
2018-01-24 09:28 UTC, Shree Devi Kumar
Devanagari QA files (165.11 KB, application/x-zip-compressed)
2018-04-28 12:39 UTC, Shree Devi Kumar
Devanagari QA2 (30.65 KB, application/x-zip-compressed)
2018-04-29 12:54 UTC, Shree Devi Kumar
Devanagari QA3 files (178.86 KB, application/x-zip-compressed)
2018-04-30 11:25 UTC, Shree Devi Kumar
Indic including Devanagari - QA4 (343.58 KB, application/x-zip-compressed)
2018-05-01 11:17 UTC, Shree Devi Kumar
RTL Languages - Arabic, Hebrew QA5 (80.39 KB, application/x-zip-compressed)
2018-05-01 11:45 UTC, Shree Devi Kumar

Description Steve White 2013-07-04 19:23:51 UTC
Created attachment 82038 [details]
More thorough description of the problem.

Problem description: 

Steps to reproduce:
1. In a LOWriter doc, put several copies of the lines (Article 1 of the UDHR)
सभी मनुष्यों को गौरव और अधिकारों के मामले में जन्मजात स्वतन्त्रता और समानता प्राप्त है । 
उन्हें बुद्धि और अन्तरात्मा की देन प्राप्त है और परस्पर उन्हें भाईचारे के भाव से बर्ताव करना चाहिए ।
Format with a different font supporting Hindi.  I used
distro Lohit Hindi and Gargi, as well
as GNU FreeSerif and GNU FreeSans (latest versions from SVN).
2. Export as PDF
3. Open the resulting file with Adobe Reader.
Select and copy the text from the PDF file,
and paste it into a text editor.

Current behavior:
Lohit Hindi
सभी मनुष्यों को गौरव और अधिधिकारों के मामले मे जन्मजात स्वतन्त्रता और समानता प्राप्त है ।
उन्हे बुि औद्धि और अधन्तरात्मा की देन प्राप्त है और परस्पर उन्हे भाईचारे के भाव से बतार्ताव करना चाि औहए ।
FreeSerif
सभी मनुष्यो को गौरव और अधिधिकारो के मामले मे जन्मजात स्वतन्त्रता और समानता प्राप्त है ।
उन्हे बुिद और अधन्तरात्मा की देन प्राप्त है और परस्पर उन्हे भाईचारे के भाव से बतारव करना चािहए ।
FreeSans
सभी मनुष्यों को गौरव और अधिधिकारों के मामले में जन्मजात स्वतन्त्रता और समानता प्ाप्त है ।
उन्हें बुिद्धि और अधन्तरात्मा की देन प्ाप्त है और परस्पर उन्हें भाईचारे के भाव से बताव करना चािहए ।
Gargi
सभी मनुष्यो को गौरव और अधिधिकारो के मामले मे जन्मजात स्वतन्त्रता और समानता प्राप्त है ।
उन्हे बुिद और अधन्तरात्मा की देन प्राप्त है और परस्पर उन्हे भाईचारे के भाव से बताव करना चािहए ।

Expected behavior:
Should get something more like the original text back.

Operating System: All
Version: 4.0.2.2 release
Comment 1 Steve White 2013-07-04 19:27:30 UTC
Created attachment 82039 [details]
LOWriter doc as described in report
Comment 2 Steve White 2013-07-04 19:31:02 UTC
Created attachment 82040 [details]
PDF as exported on my system
Comment 3 ⁨خالد حسني⁩ 2013-07-04 21:05:29 UTC
Text extraction from PDF is a very unreliable process. Glyph names play an important role, and using proper glyph names in accordance with the Adobe Glyph Naming convention (http://www.adobe.com/devnet/opentype/archives/glyph.html) should help the extractability of text set in GNU FreeFont, which currently contains glyph names that are useless for text extraction, like dev_rakaar and aasigndeva. Glyph names do not help with re-ordering, and there are probably some LibreOffice bugs in setting ToUnicode values in PDF, but proper glyph names are the start.
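
To illustrate why names such as dev_rakaar cannot be used for extraction, here is a minimal Python sketch of how a consumer might turn a glyph name into text using the "uniXXXX" / "uXXXXX" naming rules of the Adobe glyph naming convention. The function name glyph_name_to_text and the optional agl lookup table are hypothetical, and the sketch omits details of the real specification (for example, rejecting surrogate code points).

```python
import re
from typing import Optional


def glyph_name_to_text(name: str, agl: Optional[dict] = None) -> str:
    """Map a glyph name to a Unicode string (simplified AGL-style rules)."""
    agl = agl or {}  # hypothetical name -> text table, e.g. {"a": "a"}
    # Drop any suffix after the first period ("uni0915.half" -> "uni0915").
    base = name.split(".", 1)[0]
    out = []
    # A ligature name lists its components separated by underscores.
    for part in base.split("_"):
        if part in agl:
            out.append(agl[part])
        elif re.fullmatch(r"uni(?:[0-9A-F]{4})+", part):
            digits = part[3:]
            # "uni" may be followed by several 4-digit code points.
            out.append("".join(chr(int(digits[i:i + 4], 16))
                               for i in range(0, len(digits), 4)))
        elif re.fullmatch(r"u[0-9A-F]{4,6}", part):
            out.append(chr(int(part[1:], 16)))
        # Any other component maps to nothing, so a name like
        # "dev_rakaar" yields an empty string.
    return "".join(out)


print(repr(glyph_name_to_text("uni0915_uni094D")))  # KA + VIRAMA
print(repr(glyph_name_to_text("dev_rakaar")))       # empty string
```

A name that follows the convention decomposes into its characters, while an ad hoc name gives the extractor nothing to work with.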
Comment 4 ⁨خالد حسني⁩ 2013-07-04 21:15:42 UTC
Gargi and Lohit Hindi (at least my version of them) have some wrong glyph names as well.
Comment 5 Steve White 2013-07-04 21:25:48 UTC
Hi Khaled.

Of course we're aware that copying text from PDF is unreliable.
In fact, with the current technology, based on ToUnicode, it is impossible to reproduce the original text.

I am sure however, in the case of Indic scripts, it could be done in such a way that results in mostly readable text.

The reason I submitted this report to LibreOffice is that this product does the best job of the several approaches I tested.  I think it could be improved with the least effort, and serve as a model for other systems.

Regarding the AGLFN, as I said, it could be used to break a tie, but otherwise, you should reconsider your statements.  The AGLFN cannot carry more information than the ToUnicode stream does, and OpenType feature tables carry more information than either can.  The best approach would be to judiciously use the OpenType features to populate the ToUnicode stream.

As I said, the AGLFN could be used to break a tie in OpenType feature tables.  But if it conflicts with the feature tables, it cannot be right.  (And in fact, that's what my tests showed: technologies that relied on AGLFN often showed mistakes because of failure to code a glyph name...which is a pity because correct info was available.) It would be better to drop the technology.

Cheers!
Comment 6 Steve White 2013-07-05 08:31:50 UTC
Khaled,

Several of the bugs pointed out are logic errors in the generation code (for sure the duplicated characters, and I think also the disappearing/reappearing one).  These have nothing to do with glyph naming.

I also pointed out that although Gargi and Lohit attempt (different) AGLFN schemes, each has bugs in that regard.  This is part of my complaint with the AGLFN.  In each case, there was sufficient information in the font's feature tables to produce ToUnicode entries which would have correctly decomposed the glyphs. Although often LibreOffice PDF generation algorithms use OpenType tables to populate ToUnicode, here the algorithms instead fell back to AGLFN, and failed.

It would be best to prefer the OpenType features in building ToUnicode, and fall back to AGLFN only to break a tie, in case those features would specify more than one character string for a given glyph.

Another thought:

How to tackle the re-ordering of glyphs (especially, the 'i' and 'ii' vowel signs) using ToUnicode?  (I don't know if LibreOffice attempts something like this, I just see it's mostly wrong.)  The idea is based on making compound glyphs in the internal representation of the PDF file  -- they need not correspond to slots in the original font.

When a glyph that needs re-ordering (as 'i' and 'ii') is detected, it should be possible to identify the following consonant cluster.  The entire group, including the vowel and cluster, could be made a single glyph.  Then the fake entry for that glyph in the ToUnicode stream would specify characters for the decomposed cluster, with the vowel re-ordered to the end of the cluster.

Of course, identifying the cluster could be tricky in some cases, but in modern Devanagari at least, it usually consists of a few half-form consonants followed by a consonant, or else a single consonant ligature.  (That may be all--need to consult Unicode ch. 9)

And of course, there are other ways to do it!
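
The re-ordering idea in this comment can be sketched in Python. The sketch only computes the string that the "fake" ToUnicode entry for a compound glyph would carry; it does not write a PDF. It assumes that each glyph is already known by the Unicode string it stands for, that only the short 'i' sign (U+093F) is drawn before its cluster, and that a cluster is zero or more half-forms followed by one base consonant or ligature. The names visual_to_logical and PREBASE_VOWELS are hypothetical.

```python
VIRAMA = "\u094D"
# Assumption: only DEVANAGARI VOWEL SIGN I is drawn before its cluster.
PREBASE_VOWELS = {"\u093F"}


def visual_to_logical(units):
    """Reorder glyph units from visual order into logical Unicode order.

    `units` is a list of strings, one per glyph, in the order the glyphs
    are drawn. A half-form is a unit that ends with a virama.
    """
    out = []
    i = 0
    while i < len(units):
        if units[i] in PREBASE_VOWELS:
            vowel = units[i]
            j = i + 1
            # Consume the half-form consonants that start the cluster.
            while j < len(units) and units[j].endswith(VIRAMA):
                out.append(units[j])
                j += 1
            # Then the base consonant (or a single consonant ligature).
            if j < len(units):
                out.append(units[j])
                j += 1
            # The vowel sign goes to the end of the cluster.
            out.append(vowel)
            i = j
        else:
            out.append(units[i])
            i += 1
    return "".join(out)


# Glyphs of the word from the report, as drawn: BA, sign U, sign I,
# half DA, DHA.
drawn = ["\u092C", "\u0941", "\u093F", "\u0926\u094D", "\u0927"]
print(visual_to_logical(drawn))
```

The same function handles the ligature case, where the half-form and base consonant arrive as one unit.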
Comment 7 Shriramana Sharma 2014-03-28 16:19:50 UTC
Khaled, is this perhaps related to bug 62728, since adding support for PDF/A-2U will/should fix the problem? I also find that any Indic text does not get copied correctly from PDFs exported by LibO. Using latest release LibO 4.2.2 on Kubuntu Saucy.
Comment 8 ⁨خالد حسني⁩ 2014-03-28 21:20:16 UTC
I can’t find a complete specification of PDF/A-2 level U, but it seems to require preserving the Unicode representation of the text, which is indeed a goal shared with this bug as well.
Comment 9 QA Administrators 2016-02-21 08:36:55 UTC Comment hidden (obsolete)
Comment 10 Shree Devi Kumar 2016-09-18 03:24:33 UTC
This bug is still present. 

Tested on Windows 10 with 
LibreOffice Writer 5.2.1.2
and Adobe Acrobat Reader 11.0.17.

I tested the same Devanagari text as reported by Steve White with a slightly different mix of fonts. The LOWriter document, resulting pdf and utf-8 text document with text copied from the pdf in Adobe Acrobat Reader and pasted in Notepad++ are attached.

FYI, this problem has been solved in Xetex with the new \XeTeXgenerateactualtext feature - please see http://tug.org/pipermail/xetex/2016-February/026445.html for the announcement.
Comment 11 Shree Devi Kumar 2016-09-18 03:26:27 UTC
Created attachment 127393 [details]
LOWriter document with a modified set of devanagari fonts
Comment 12 Shree Devi Kumar 2016-09-18 03:27:14 UTC
Created attachment 127394 [details]
Exported PDF for LOWriter document with a modified set of devanagari fonts
Comment 13 Shree Devi Kumar 2016-09-18 03:28:03 UTC
Created attachment 127395 [details]
Copied text from PDF for LOWriter document with a modified set of devanagari fonts
Comment 14 Shree Devi Kumar 2016-09-18 03:30:10 UTC
Comment on attachment 127395 [details]
Copied text from PDF for LOWriter document with a modified set of devanagari fonts

Sorry, this is the output from Save as text from Adobe Acrobat Reader. The copied text is being added in a different attachment.
Comment 15 Shree Devi Kumar 2016-09-18 03:34:07 UTC
Created attachment 127398 [details]
Copied text from PDF for the new LOWriter document

This is the attachment with the text copied from the pdf and pasted in Notepad++.
Comment 16 Xisco Faulí 2017-09-29 08:51:34 UTC Comment hidden (obsolete)
Comment 17 Shree Devi Kumar 2018-01-17 14:49:13 UTC
I tested this again today with Version: 5.4.4.2 (x64)
Build ID: 2524958677847fb3bb44820e40380acbe820f960
CPU threads: 4; OS: Windows 6.19; UI render: default; 
Locale: hi-IN (en_IN); Calc: group

The problem still exists.

Please let me know what additional information is required.

As I had mentioned earlier in this thread, this problem has been solved in Xetex with the new \XeTeXgenerateactualtext feature - please see http://tug.org/pipermail/xetex/2016-February/026445.html for the announcement.
Comment 18 Shree Devi Kumar 2018-01-17 15:29:53 UTC Comment hidden (obsolete)
Comment 19 Shree Devi Kumar 2018-01-17 15:40:40 UTC
> 
> As I had mentioned earlier in this thread, this problem has been solved in
> Xetex with the new \XeTeXgenerateactualtext feature - please see
> http://tug.org/pipermail/xetex/2016-February/026445.html for the
> announcement.

Here is a link to the actualtext branch for xetex on sourceforge.

https://sourceforge.net/p/xetex/code/ci/actualtext/tree/
Comment 20 ⁨خالد حسني⁩ 2018-01-17 21:53:06 UTC
LibreOffice has limited support for actual text already and I think it shouldn’t be hard to extend it and make it an option at least. If someone is interested in giving this a try, check SetActualText() calls in sw/source/core/text/EnhancedPDFExportHelper.cxx.
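
For readers unfamiliar with /ActualText: in PDF it is an entry in a marked-content property list that supplies replacement text for the glyphs enclosed between BDC and EMC. The following Python sketch shows roughly what such a span could look like in a content stream. The helper name actual_text_span and the glyph operator string are hypothetical, and this is not the code in EnhancedPDFExportHelper.cxx.

```python
def actual_text_span(text: str, glyph_ops: str) -> str:
    """Wrap glyph-showing operators in a /Span carrying /ActualText."""
    # PDF text strings may be UTF-16BE with a leading byte order mark
    # (FE FF); here the bytes are written as a hexadecimal string.
    hex_text = ("\ufeff" + text).encode("utf-16-be").hex().upper()
    return f"/Span << /ActualText <{hex_text}> >> BDC\n{glyph_ops}\nEMC"


# KA + VOWEL SIGN I in logical order, wrapped around made-up glyph IDs.
print(actual_text_span("\u0915\u093F", "<00A3004F> Tj"))
```

A viewer that honours /ActualText returns the logical string for the whole span, regardless of how the glyphs inside it map through ToUnicode.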
Comment 21 Shree Devi Kumar 2018-01-18 11:40:10 UTC Comment hidden (obsolete)
Comment 22 Shree Devi Kumar 2018-01-18 13:50:48 UTC
Code referred by Khaled can be viewed at https://github.com/LibreOffice/core/blob/master/sw/source/core/text/EnhancedPDFExportHelper.cxx#L761
Comment 23 ⁨خالد حسني⁩ 2018-01-19 12:35:19 UTC
(In reply to shreeshrii from comment #21)
> Thank you @Khaled Hosny for your response and pointer to  SetActualText()
> calls.
> 
> I think this must be a problem not just for Hindi but for all complex
> scripts.
> 
> Do you know whether the text copy paste from pdf works correctly for Arabic?

Copying Arabic from PDF can work without /ActualText if the fonts are carefully prepared: only one-to-one or many-to-one glyph substitutions, and glyphs named following Adobe Glyph Names (https://github.com/adobe-type-tools/agl-specification). This works only because no re-ordering happens in Arabic, and even then there are still issues with text direction.
Comment 24 Shree Devi Kumar 2018-01-19 13:08:12 UTC Comment hidden (obsolete)
Comment 25 ⁨خالد حسني⁩ 2018-01-19 17:44:32 UTC Comment hidden (obsolete)
Comment 26 ⁨خالد حسني⁩ 2018-01-23 21:00:50 UTC
*** Bug 115117 has been marked as a duplicate of this bug. ***
Comment 27 Shree Devi Kumar 2018-01-24 09:16:48 UTC
Created attachment 139317 [details]
Sample text in multiple Indian scripts - ODT file
Comment 28 Shree Devi Kumar 2018-01-24 09:17:40 UTC
Created attachment 139318 [details]
Sample text in multiple Indian scripts - Exported PDF
Comment 29 Shree Devi Kumar 2018-01-24 09:18:27 UTC
Created attachment 139319 [details]
Sample text in multiple Indian scripts - Text copied from exported pdf
Comment 30 Shree Devi Kumar 2018-01-24 09:22:28 UTC
This problem is not limited to just Hindi. Rather it applies to all Indian language scripts, other Indic scripts and probably other complex scripts too.

I have attached a sample showing the errors in copied text in Devanagari, Bengali, Gujarati, Gurmukhi, Kannada, Malayalam, Tamil and Telugu scripts. 

Sample text in multiple Indian scripts - ODT file 
Sample text in multiple Indian scripts - Exported PDF 
Sample text in multiple Indian scripts - Text copied from exported pdf
Comment 31 Shree Devi Kumar 2018-01-24 09:28:16 UTC
Created attachment 139320 [details]
Sample text in multiple Indian scripts - Original Text copied from ODT

This is the original text - ground truth that can be compared with the exported text from the pdf.

Extending the Actualtext feature in pdfwriter can fix the issue. However, I do not know enough about the code or C++ to provide a patch.
Comment 32 ⁨خالد حسني⁩ 2018-01-25 12:29:12 UTC
*** Bug 62846 has been marked as a duplicate of this bug. ***
Comment 33 Shree Devi Kumar 2018-02-01 17:12:39 UTC
Just FYI, for users looking for a solution.

https://www.wikihow.com/index.php?title=Create-a-Searchable-Hindi-PDF-Using-Lyx-with-Xetex
Comment 34 Timur 2018-03-02 17:50:10 UTC
*** Bug 116056 has been marked as a duplicate of this bug. ***
Comment 35 Timur 2018-03-08 10:44:31 UTC
*** Bug 116284 has been marked as a duplicate of this bug. ***
Comment 36 Jim Avera 2018-03-08 18:04:24 UTC Comment hidden (obsolete)
Comment 37 ⁨خالد حسني⁩ 2018-03-20 01:37:07 UTC
*** Bug 116490 has been marked as a duplicate of this bug. ***
Comment 38 ⁨خالد حسني⁩ 2018-04-23 09:52:49 UTC Comment hidden (obsolete)
Comment 39 Commit Notification 2018-04-27 09:24:43 UTC
Khaled Hosny committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=c688b01d9102832226251fc84045408afe392459

tdf#66597 Fix PDF text extraction for complex text

It will be available in 6.1.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 40 Timur 2018-04-27 16:33:15 UTC Comment hidden (obsolete)
Comment 41 ⁨خالد حسني⁩ 2018-04-27 17:45:20 UTC
(In reply to Timur from comment #40)
> Congrats to Khaled and Miklos and Tomaž for all those patch sets and related
> changes. 
> I guess backport to 6.0 is not to be expected.

Too many changes to backport, also strictly speaking this is a new feature not a bug fix.
Comment 42 Shree Devi Kumar 2018-04-27 17:52:41 UTC Comment hidden (obsolete)
Comment 43 Volga 2018-04-28 04:47:57 UTC Comment hidden (obsolete)
Comment 44 Shree Devi Kumar 2018-04-28 12:22:19 UTC Comment hidden (obsolete)
Comment 45 Shree Devi Kumar 2018-04-28 12:39:54 UTC
Created attachment 141740 [details]
Devanagari QA files
Comment 46 Shree Devi Kumar 2018-04-28 12:52:33 UTC
(In reply to shreeshrii from comment #45)
> Created attachment 141740 [details]
> Devanagari QA files

The zip file has the original text file, same copied to a Libre Office document in different fonts and exported to pdf without any PDF options checked.

I will post below the first two lines of the file under different scenarios:

1. Original text

Mangal
नित्यानन्दकरी वराभयकरी सौन्दर्यरत्नाकरी । निर्धूताखिलघोरपावनकरी प्रत्यक्षमाहेश्वरी ।। 

2. Saving from Adobe Reader DC to text does not extract any Devanagari text. Some control characters are output along with ... where Devanagari should be.

Mangal 
............ ........ ............ .. 

3. Copying from Adobe Reader DC to pasting in Notepad++ utf-8 document transfers the Devanagari text. There are the additional control characters as in the above case.

Mangal
नित्यानन्दकरी वराभयकरी सौन्दर्यरत्नाकरी । निर्धूताखिलघोरपावनकरी प्रत्यक्षमाहेश्वरी ।।

4. Copying from Chrome Browser to pasting in Notepad++ utf-8 document transfers the Devanagari text. There are the additional control characters, some are different from the Adobe Reader case. Some Devanagari characters are missing.

Mangal
नित्यान्दकरी वराभयकरी सौन्दर्यरत्ाकरी । निर्धूताखि  लघोरपावकरी प्रत्यक्षमाहेश्वरी ।।

5. Copying from Microsoft Edge Browser to pasting in Notepad++ utf-8 document transfers the Devanagari text. There are fewer control characters added. More Devanagari characters are missing.

Mangal नित्यान्दकरी वराभयकरी सौन्दयरत्ाकरी । निर्धूताखिलघोरपावकरी प्रत्यक्षमाहेश्वरी ।। 

I am reopening the bug. 

Please let me know if any additional information is needed. I have NOT tested with any other script/language.
Comment 47 ⁨خالد حسني⁩ 2018-04-28 23:07:36 UTC
(In reply to shreeshrii from comment #46)
> (In reply to shreeshrii from comment #45)
> > Created attachment 141740 [details]
> > Devanagari QA files
> 
> The zip file has the original text file, same copied to a Libre Office
> document in different fonts and exported to pdf without any PDF options
> checked.
> 
> I will post below the first two lines of the file under different scenarios:
> 
> 1. Original text
> 
> Mangal
> नित्यानन्दकरी वराभयकरी सौन्दर्यरत्नाकरी । निर्धूताखिलघोरपावनकरी
> प्रत्यक्षमाहेश्वरी ।। 
> 
> 2. Saving from Adobe Reader DC to text does not extract any Devanagari text.
> Some control characters are output along with ... where Devanagari should be.
> 
> Mangal 
> ............ ........ ............ .. 

That is a bug in Adobe Reader.

> 3. Copying from Adobe Reader DC to pasting in Notepad++ utf-8 document
> transfers the Devanagari text. There are the additional control characters
> as in the above case.
> 
> Mangal
> नित्यानन्दकरी वराभयकरी सौन्दर्यरत्नाकरी । निर्धूताखिलघोरपावनकरी
> प्रत्यक्षमाहेश्वरी ।।

Thanks for testing! There was a typo in the code that seems to have gotten into some late iterations that I failed to test properly. It should be fixed shortly (with a test case, to prevent such breakage in the future).

> 4. Copying from Chrome Browser to pasting in Notepad++ utf-8 document
> transfers the Devanagari text. There are the additional control characters,
> some are different from the Adobe Reader case. Some Devanagari characters
> are missing.
> 
> Mangal
> नित्यान्दकरी वराभयकरी सौन्दर्यरत्ाकरी । निर्धूताखि  लघोरपावकरी
> प्रत्यक्षमाहेश्वरी ।।

Chrome’s PDF reader does not support /ActualText, so the changes here are unlikely to help that much. There is nothing we can do about it, unfortunately (apart from reporting to Chrome developers, of course).

> 5. Copying from Microsoft Edge Browser to pasting in Notepad++ utf-8
> document transfers the Devanagari text. There are fewer control characters
> added. More Devanagari characters are missing.
> 
> Mangal नित्यान्दकरी वराभयकरी सौन्दयरत्ाकरी । निर्धूताखिलघोरपावकरी
> प्रत्यक्षमाहेश्वरी ।। 

I don’t know if this supports /ActualText or not, please wait for the next fix and re-test.
Comment 48 ⁨خالد حسني⁩ 2018-04-29 09:01:03 UTC
This should be fixed now, please retest.
Comment 49 Shree Devi Kumar 2018-04-29 12:54:26 UTC Comment hidden (obsolete)
Comment 50 ⁨خالد حسني⁩ 2018-04-29 15:02:19 UTC Comment hidden (obsolete)
Comment 51 ⁨خالد حسني⁩ 2018-04-30 09:20:19 UTC Comment hidden (obsolete)
Comment 52 Shree Devi Kumar 2018-04-30 11:25:40 UTC
Created attachment 141772 [details]
Devanagari QA3 files

I tested with 
Version: 6.1.0.0.alpha1+ (x64)
Build ID: 5f2073fbc995fb619f398a55187413813578b62e
CPU threads: 4; OS: Windows 10.0; UI render: default; 
TinderBox: Win-x86_64@42, Branch:master, Time: 2018-04-30_00:51:08
Locale: en-IN (en_IN); Calc: group

Thank you, Khaled Hosny. Your new patch is applied in this build. The results are much improved.

1. In Adobe Reader, the control characters have disappeared. All Devanagari characters and glyphs are displaying. However, there are certain extra spaces within words.

These seem related to certain consonant conjunct glyphs and the combining i mark (which is repositioned before the consonants). Their location seems to change based on the fonts used.

I have created a wdiff file with the original text vs the text copied from Adobe Reader. Here are top few lines in it:

Mangal
[-नित्यानन्दकरी-]
{+नि त्यानन्दकरी+} वराभयकरी सौन्दर्यरत्नाकरी । [-निर्धूताखिलघोरपावनकरी-] {+नि र्धूताखि लघोरपावनकरी+} प्रत्यक्षमाहेश्वरी ।। 
[-अग्निशामक अभिज्ञान-]
{+अग्नि शामक अभि ज्ञान+} अनुक्रम काष्ठवाद्य [-अंतर्राष्ट्रीय-] {+अंतर्रा ष्ट्रीय+} ख़ूँखार [-मूत्रविज्ञान द्विध्रुव-] {+मूत्रवि ज्ञान द्वि ध्रुव+}

2. Chrome is still displaying some control characters. wdiff is included.

3. The pdf generated by xelatex allows text to be copied correctly.
Comment 53 Shree Devi Kumar 2018-04-30 11:27:53 UTC
A few issues still remain with Devanagari text being copy-pasted. I have not tested with other Indian scripts yet.
Comment 54 ⁨خالد حسني⁩ 2018-04-30 22:46:51 UTC
(In reply to Shree Devi Kumar from comment #52)
> Created attachment 141772 [details]
> Devanagari QA3 files
> 
> I tested with 
> Version: 6.1.0.0.alpha1+ (x64)
> Build ID: 5f2073fbc995fb619f398a55187413813578b62e
> CPU threads: 4; OS: Windows 10.0; UI render: default; 
> TinderBox: Win-x86_64@42, Branch:master, Time: 2018-04-30_00:51:08
> Locale: en-IN (en_IN); Calc: group
> 
> Thank you, Khaled Hosny. Your new patch is applied in this build. The
> results are much improved.
> 
> 1. In Adobe Reader, the control characters have disappeared. All Devanagari
> characters and glyphs are displaying. However, there are certain extra spaces
> within words.

That is a bug in the reader, it tries to guess spaces based on some threshold distances between glyphs. It is a heuristic and it often fails. The only way I know to fix this is to use /ActualText per word, but this completely breaks the ability to select individual characters inside the word, so it is out of question, at least by default. It might be a good idea to have an option to do this, please open a new issue if you are interested in such an option.
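
A minimal Python sketch of the kind of gap-threshold heuristic described here, to show how it can split a word. This is an assumed illustration, not the algorithm of any particular reader; the function name, the tuple layout, and the 0.25 threshold are all hypothetical.

```python
def extract_with_guessed_spaces(glyphs, threshold=0.25):
    """Join glyph texts, inserting a space wherever the horizontal gap
    between two glyphs exceeds `threshold` times the font size.

    `glyphs` is a list of (text, x_start, x_end, font_size) tuples in
    reading order.
    """
    out = []
    prev_end = None
    for text, x_start, x_end, size in glyphs:
        if prev_end is not None and x_start - prev_end > threshold * size:
            out.append(" ")  # the viewer's guess; the PDF has no space here
        out.append(text)
        prev_end = x_end
    return "".join(out)


# One word whose first cluster happens to be followed by a wide gap.
word = [("\u0928\u093F", 0.0, 10.0, 12.0),
        ("\u0924\u094D\u092F\u093E", 14.0, 30.0, 12.0)]
print(extract_with_guessed_spaces(word))
```

Because the decision depends only on glyph positions, the outcome changes with the font's advance widths, which is consistent with the extra spaces moving around from font to font.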

> 
> 2. Chrome is still displaying some control characters. wdiff is included.

Again bug(s) in the reader, not sure if there is anything we can do here.

> 3. The pdf generated by xelatex allows text to be copied correctly.

But you can’t select individual characters (or grapheme clusters) as it embeds /ActualText per word (see above).
Comment 55 Volga 2018-05-01 04:44:05 UTC
(In reply to Shree Devi Kumar from comment #53)
> A few issues still remain with Devanagari text being copy-pasted. I have not
> tested with other Indian scripts yet.

You can get the sample from here: http://www.gnu.org/software/freefont/ranges/
Comment 56 Shree Devi Kumar 2018-05-01 10:42:16 UTC
(In reply to Volga from comment #55)
> (In reply to Shree Devi Kumar from comment #53)
> > A few issues still remain with Devanagari text being copy-pasted. I have not
> > tested with other Indian scripts yet.
> 
> You can get the sample from here:
> http://www.gnu.org/software/freefont/ranges/

Volga,
Thank you for the link. I have created a test document from the same.
Comment 57 Shree Devi Kumar 2018-05-01 10:55:33 UTC
(In reply to Khaled Hosny from comment #54)

> That is a bug in the reader, it tries to guess spaces based on some
> threshold distances between glyphs. It is a heuristic and it often fails.

I was using Adobe Reader as the best case scenario. 

Is there any other viewer which works correctly to copy and extract text from generated pdfs for complex scripts?

Which viewer do you test with?

> The only way I know to fix this is to use /ActualText per word, but this
> completely breaks the ability to select individual characters inside the
> word, so it is out of question, at least by default. 

> It might be a good idea
> to have an option to do this, please open a new issue if you are interested
> in such an option.

I think such an option should be used internally by the program based on the languages/scripts, since a number of Indic/Complex scripts are having the same problem.

I will add a zip file with test cases for various Indic scripts including Devanagari.

While looking for a viewer/reader of pdfs, I read that LibreOffice supports opening of pdf files. I tried opening the generated pdf through the daily build and it showed a number of errors (it was opened in LibreDraw). I can open a new issue for that, though I haven't quite figured out how to copy the text from text box in it.

Thanks, Khaled. Appreciate your efforts in fixing this issue which has been open for 5 years!
Comment 58 Shree Devi Kumar 2018-05-01 11:17:01 UTC
Created attachment 141808 [details]
Indic including Devanagari - QA4

Marked Older attachments as Obsolete.

This zip file has two .odt documents, one has the text used earlier in Devanagari QA1-3, the other uses samples from the freefont site, suggested by Volga.

wdiff is provided for both sample documents. The summary below gives an idea of differences.

indic-freefont-sample-qa4.txt: 
536 words  
371 69% common  
0 0% deleted  
165 31% changed
indic-freefont-sample-qa4.adobe-reader.txt: 
814 words  
371 46% common  
0 0% inserted  
443 54% changed

indic-pdf-export-qa4.txt: 
242 words  
151 62% common  
0 0% deleted  
91 38% changed
indic-pdf-export-qa4.adobe-reader.txt: 
362 words  
151 42% common  
0 0% inserted  
211 58% changed
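
For anyone wanting to reproduce this kind of comparison without wdiff, a rough Python sketch using the standard difflib module follows. It is an approximation: wdiff distinguishes deleted, inserted and changed words, while this sketch only counts common words and the remainder on each side. The function name word_stats is hypothetical.

```python
import difflib


def word_stats(original: str, extracted: str) -> dict:
    """Count words common to both texts and words unique to each side."""
    a, b = original.split(), extracted.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    # Sum the sizes of all matching runs of words.
    common = sum(block.size for block in matcher.get_matching_blocks())
    return {
        "original_words": len(a),
        "extracted_words": len(b),
        "common": common,
        "original_only": len(a) - common,
        "extracted_only": len(b) - common,
    }


# A stray space inside one word turns 1 original word into 2 extracted ones.
print(word_stats("abc def ghi", "a bc def ghi"))
```

This also shows why the extracted files contain more words than the originals while the common count is identical on both sides: each spurious space splits one original word into two that match nothing.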

Languages Tested
----------------
Devanagari Script – Hindi, Sanskrit, Marathi, Nepali languages

Bengali Script - Assamese, Bengali

Gurmukhi script – Panjabi/Punjabi language

Gujarati

Kannada

Malayalam

Oriya

Telugu

Burmese

Khmer

Sinhala

Tamil

Thaana
Comment 59 Shree Devi Kumar 2018-05-01 11:45:05 UTC
Created attachment 141809 [details]
RTL Languages - Arabic, Hebrew QA5

This zip file has an .odt with Arabic and Hebrew samples taken from the freefont page. .txt, .pdf and the text copied from adobe reader are included along with the wdiff.

For Arabic, most of the errors seem to be related to usage of ( and ) with the Arabic text. The number of errors is much smaller.


RTL-pdf-export-QA5.txt: 
407 words  
347 85% common  
0 0% deleted  
60 15% changed
RTL-pdf-export-QA5.adobe-reader.txt: 
424 words  
347 82% common  
2 0% inserted  
75 18% changed
Comment 60 ⁨خالد حسني⁩ 2018-05-02 10:00:27 UTC
(In reply to Shree Devi Kumar from comment #57)
> (In reply to Khaled Hosny from comment #54)
> 
> > That is a bug in the reader, it tries to guess spaces based on some
> > threshold distances between glyphs. It is a heuristic and it often fails.
> 
> I was using Adobe Reader as the best case scenario. 
> 
> Is there any other viewer which works correctly to copy and extract text from
> generated pdfs for complex scripts?
> 
> Which viewer do you test with?

Viewers based on Poppler seem to be good, comparable to Adobe’s at least.
 
> > The only way I know to fix this is to use /ActualText per word, but this
> > completely breaks the ability to select individual characters inside the
> > word, so it is out of question, at least by default. 
> 
> > It might be a good idea
> > to have an option to do this, please open a new issue if you are interested
> > in such an option.
> 
> I think such an option should be used internally by the program based on the
> languages/scripts, since a number of Indic/Complex scripts are having the
> same problem.

The extra space issue can happen to any script, even Latin; I have certainly seen it with purely Latin text. The solution comes with a big downside, so I’d not want to do it automatically.

> I will add a zip file with test cases for various Indic scripts including
> Devanagari.
> 
> While looking for a viewer/reader of pdfs, I read that LibreOffice supports
> opening of pdf files. I tried opening the generated pdf through the daily
> build and it showed a number of errors (it was opened in LibreDraw). I can
> open a new issue for that, though I haven't quite figured out how to copy
> the text from text box in it.

There are several issues open about this, but it is a completely different matter. LibreOffice is trying to convert PDFs into editable documents (which is a lost cause, if you ask for my opinion), and that has its own set of issues.
Comment 61 ⁨خالد حسني⁩ 2018-05-02 10:05:30 UTC
(In reply to Shree Devi Kumar from comment #59)
> Created attachment 141809 [details]
> RTL Languages - Arabic, Hebrew QA5
> 
> This zip file has an .odt with Arabic and Hebrew samples taken from the
> freefont page. .txt, .pdf and the text copied from adobe reader are included
> along with the wdiff.
> 
> For Arabic, most of the errors seem to be related to usage of ( and ) with
> the Arabic text. The number of errors is much smaller.
> 
> 
> RTL-pdf-export-QA5.txt: 
> 407 words  
> 347 85% common  
> 0 0% deleted  
> 60 15% changed
> RTL-pdf-export-QA5.adobe-reader.txt: 
> 424 words  
> 347 82% common  
> 2 0% inserted  
> 75 18% changed

That is much better than I’d have expected for RTL, which is a totally different beast. PDF documents contain the final visual result (after applying the bidirectional algorithm), so the logical direction of the text is totally lost. The viewer has to re-apply the bidirectional algorithm in reverse, which will almost always fail for some cases. The only way to preserve the original text in its entirety is by using /ActualText with the whole paragraph (not just words or lines).
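
A deliberately simplified Python sketch of why reversing visual order is fragile, and of one plausible way parentheses can go wrong. It assumes a pure right-to-left run whose extracted characters arrive left to right as drawn, with brackets reported by their drawn (mirrored) shape. Both helper functions and the MIRROR table are hypothetical, and real text with mixed directions or digits needs far more than this.

```python
import unicodedata

# Hypothetical table of bracket pairs that are mirrored in RTL runs.
MIRROR = {"(": ")", ")": "(", "[": "]", "]": "[", "{": "}", "}": "{"}


def naive_logical(visual: str) -> str:
    """Recover logical order by reversing the visual order only."""
    return visual[::-1]


def logical_with_mirroring(visual: str) -> str:
    """Reverse the visual order and swap mirrored brackets back."""
    return "".join(MIRROR.get(ch, ch) for ch in reversed(visual))


# Hebrew letters are strongly RTL; parentheses are neutral and mirrored.
print(unicodedata.bidirectional("\u05D0"))  # R
print(unicodedata.bidirectional("("))       # ON
print(unicodedata.mirrored("("))            # 1

# GIMEL ( BET ) ALEF as drawn, left to right.
visual = "\u05D2(\u05D1)\u05D0"
print(naive_logical(visual) == "\u05D0)\u05D1(\u05D2")           # True: brackets reversed
print(logical_with_mirroring(visual) == "\u05D0(\u05D1)\u05D2")  # True
```

Even this small case needs knowledge beyond glyph order, and the viewer cannot know whether a bracket in the PDF was stored by shape or by original character.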
Comment 62 ⁨خالد حسني⁩ 2018-05-02 10:12:44 UTC
(In reply to Shree Devi Kumar from comment #58)
> Created attachment 141808 [details]
> Indic including Devanagari - QA4
> 
> Marked Older attachments as Obsolete.
> 
> This zip file has two .odt documents, one has the text used earlier in
> Devanagari QA1-3, the other uses samples from the freefont site, suggested
> by Volga.
> 
> wdiff is provided for both sample documents. The summary below gives an idea
> of differences.
> 
> indic-freefont-sample-qa4.txt: 
> 536 words  
> 371 69% common  
> 0 0% deleted  
> 165 31% changed
> indic-freefont-sample-qa4.adobe-reader.txt: 
> 814 words  
> 371 46% common  
> 0 0% inserted  
> 443 54% changed
> 
> indic-pdf-export-qa4.txt: 
> 242 words  
> 151 62% common  
> 0 0% deleted  
> 91 38% changed
> indic-pdf-export-qa4.adobe-reader.txt: 
> 362 words  
> 151 42% common  
> 0 0% inserted  
> 211 58% changed 


Thanks for testing, really appreciate it. All the changes seem to be space related. Tamil looks the worst, but after careful examination it seems to be also space related, it just happens to be in very unfortunate places that wreak havoc with the cluster formation.
Comment 63 Shree Devi Kumar 2018-05-04 08:42:14 UTC
Khaled,
Thank you for your responses.

What do you suggest as next steps to resolve this issue?

Is it possible to implement /ActualText at word level based on the language/script being used, or based on Unicode range?
Comment 64 ⁨خالد حسني⁩ 2018-05-04 13:35:00 UTC
(In reply to Shree Devi Kumar from comment #63)
> Khaled,
> Thank you for your responses.
> 
> What do you suggest as next steps to resolve this issue?
> 
> Is it possible to implement /ActualText at word level based on the
> language/script being used, or based on Unicode range?

My suggestion is to add an option to PDF export dialog to do ActualText per word (per sentence might be harder with the current implementation).

Please open a new issue for this; this one is getting too long and further fixes are out of scope of the original issue. We would also need the UX team to give their opinion about the UI.
Comment 65 Shree Devi Kumar 2018-05-04 14:42:52 UTC
(In reply to Khaled Hosny from comment #64)
> (In reply to Shree Devi Kumar from comment #63)
> > Khaled,
> > Thank you for your responses.
> > 
> > What do you suggest as next steps to resolve this issue?
> > 
> > Is it possible to implement /ActualText at word level based on the
> > language/script being used, or based on Unicode range?
> 
> My suggestion is to add an option to PDF export dialog to do ActualText per
> word (per sentence might be harder with the current implementation).
> 
> Please open a new issue for this; this one is getting too long and further
> fixes are out of scope of the original issue. We would also need the UX team
> to give their opinion about the UI.

OK, opened a new bug at https://bugs.documentfoundation.org/show_bug.cgi?id=117428
Comment 66 V Stuart Foote 2019-03-21 21:35:07 UTC
*** Bug 124191 has been marked as a duplicate of this bug. ***