66597 – Problems with copying and extracting text from generated PDF

Summary: Problems with copying and extracting text from generated PDF

Status:	RESOLVED FIXED

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Printing and PDF export
Version: (earliest affected)	Inherited From OOo
Hardware:	Other All

Importance:	medium normal
Assignee:	Khaled Hosny

URL:
Whiteboard:	BSA target:6.1.0
Keywords:

Duplicates (2):	62846 124191
Depends on:
Blocks:	Font-Rendering 117428

Reported:	2013-07-04 19:23 UTC by Steve White
Modified:	2019-03-21 21:35 UTC
CC List:	9 users

See Also:	115117
Crash report or crash signature:

Attachments
More thorough description of the problem. (6.64 KB, text/plain) 2013-07-04 19:23 UTC, Steve White
LOWriter doc as described in report (13.00 KB, application/msword) 2013-07-04 19:27 UTC, Steve White
PDF as exported on my system (53.88 KB, application/pdf) 2013-07-04 19:31 UTC, Steve White
LOWriter document with a modified set of devanagari fonts (14.00 KB, application/msword) 2016-09-18 03:26 UTC, Shree Devi Kumar
Exported PDF for LOWriter document with a modified set of devanagari fonts (170.59 KB, application/pdf) 2016-09-18 03:27 UTC, Shree Devi Kumar
Copied text from PDF for LOWriter document with a modified set of devanagari fonts (1.28 KB, text/plain) 2016-09-18 03:28 UTC, Shree Devi Kumar
Copied text from PDF for the new LOWriter document (3.08 KB, text/plain) 2016-09-18 03:34 UTC, Shree Devi Kumar
Sample text in multiple Indian scripts - ODT file (18.71 KB, application/vnd.oasis.opendocument.text) 2018-01-24 09:16 UTC, Shree Devi Kumar
Sample text in multiple Indian scripts - Exported PDF (81.11 KB, application/pdf) 2018-01-24 09:17 UTC, Shree Devi Kumar
Sample text in multiple Indian scripts - Text copied from exported pdf (1.92 KB, text/plain) 2018-01-24 09:18 UTC, Shree Devi Kumar
Sample text in multiple Indian scripts - Original Text copied from ODT (2.29 KB, text/plain) 2018-01-24 09:28 UTC, Shree Devi Kumar
Devanagari QA files (165.11 KB, application/x-zip-compressed) 2018-04-28 12:39 UTC, Shree Devi Kumar
Devanagari QA2 (30.65 KB, application/x-zip-compressed) 2018-04-29 12:54 UTC, Shree Devi Kumar
Devanagari QA3 files (178.86 KB, application/x-zip-compressed) 2018-04-30 11:25 UTC, Shree Devi Kumar
Indic including Devanagari - QA4 (343.58 KB, application/x-zip-compressed) 2018-05-01 11:17 UTC, Shree Devi Kumar
RTL Languages - Arabic, Hebrew QA5 (80.39 KB, application/x-zip-compressed) 2018-05-01 11:45 UTC, Shree Devi Kumar

Description Steve White 2013-07-04 19:23:51 UTC

Created attachment 82038 [details]
More thorough description of the problem.

Problem description: 

Steps to reproduce:
1. In a LOWriter doc, put several copies of the lines (Article 1 of the UDHR)
सभी मनुष्यों को गौरव और अधिकारों के मामले में जन्मजात स्वतन्त्रता और समानता प्राप्त है । 
उन्हें बुद्धि और अन्तरात्मा की देन प्राप्त है और परस्पर उन्हें भाईचारे के भाव से बर्ताव करना चाहिए ।
Format with a different font supporting Hindi.  I used
distro Lohit Hindi and Gargi, as well
as GNU FreeSerif and GNU FreeSans (latest versions from SVN).
2. Export as PDF
3. Open the resulting file with Adobe Reader.
Select and copy the text from the PDF file,
and paste it into a text editor.

Current behavior:
Lohit Hindi
सभी मनुष्यों को गौरव और अधिधिकारों के मामले मे जन्मजात स्वतन्त्रता और समानता प्राप्त है ।
उन्हे बुि औद्धि और अधन्तरात्मा की देन प्राप्त है और परस्पर उन्हे भाईचारे के भाव से बतार्ताव करना चाि औहए ।
FreeSerif
सभी मनुष्यो को गौरव और अधिधिकारो के मामले मे जन्मजात स्वतन्त्रता और समानता प्राप्त है ।
उन्हे बुिद और अधन्तरात्मा की देन प्राप्त है और परस्पर उन्हे भाईचारे के भाव से बतारव करना चािहए ।
FreeSans
सभी मनुष्यों को गौरव और अधिधिकारों के मामले में जन्मजात स्वतन्त्रता और समानता प्ाप्त है ।
उन्हें बुिद्धि और अधन्तरात्मा की देन प्ाप्त है और परस्पर उन्हें भाईचारे के भाव से बताव करना चािहए ।
Gargi
सभी मनुष्यो को गौरव और अधिधिकारो के मामले मे जन्मजात स्वतन्त्रता और समानता प्राप्त है ।
उन्हे बुिद और अधन्तरात्मा की देन प्राप्त है और परस्पर उन्हे भाईचारे के भाव से बताव करना चािहए ।

Expected behavior:
Should get something more like the original text back.

Operating System: All
Version: 4.0.2.2 release

Comment 1 Steve White 2013-07-04 19:27:30 UTC

Created attachment 82039 [details]
LOWriter doc as described in report

Comment 2 Steve White 2013-07-04 19:31:02 UTC

Created attachment 82040 [details]
PDF as exported on my system

Comment 3 Khaled Hosny 2013-07-04 21:05:29 UTC

Text extraction from PDF is a very unreliable process. Glyph names play an important role: using proper glyph names in accordance with the Adobe Glyph Naming convention (http://www.adobe.com/devnet/opentype/archives/glyph.html) should help the extractability of text set in GNU FreeFont, which currently contains glyph names that are useless for text extraction, like dev_rakaar and aasigndeva. Glyph names do not help with re-ordering, and there are probably some LibreOffice bugs in setting ToUnicode values in PDF, but proper glyph names are the start.
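
To illustrate why names like dev_rakaar are useless here, below is a minimal Python sketch of the name-to-Unicode rules from the Adobe Glyph List specification (drop any suffix after the first period, split on underscores, then try the glyph list, the uniXXXX form and the uXXXX form). AGL_SUBSET is a tiny made-up stand-in for the real list, and the function names are hypothetical; this is not what LibreOffice or any PDF reader actually runs.

```python
import re

# Tiny illustrative stand-in for the Adobe Glyph List (the real list is much larger).
AGL_SUBSET = {"f": "f", "i": "i", "space": " "}

UNI_RE = re.compile(r"uni((?:[0-9A-F]{4})+)")   # uni0915, uni0915094D, ...
U_RE = re.compile(r"u([0-9A-F]{4,6})")          # u0915, u1F600, ...


def _is_scalar(cp):
    """True for code points that are valid Unicode scalar values."""
    return cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def component_to_unicode(component):
    """Map one underscore-separated component of a glyph name to text."""
    if component in AGL_SUBSET:
        return AGL_SUBSET[component]
    m = UNI_RE.fullmatch(component)
    if m:
        digits = m.group(1)
        cps = [int(digits[k:k + 4], 16) for k in range(0, len(digits), 4)]
        return "".join(map(chr, cps)) if all(_is_scalar(c) for c in cps) else ""
    m = U_RE.fullmatch(component)
    if m:
        cp = int(m.group(1), 16)
        return chr(cp) if _is_scalar(cp) else ""
    return ""  # unknown name: contributes nothing


def glyph_name_to_unicode(name):
    """Apply the rules to a whole glyph name such as 'uni0915_uni094D.half'."""
    base = name.split(".", 1)[0]
    return "".join(component_to_unicode(c) for c in base.split("_"))
```

With this subset, glyph_name_to_unicode("uni0915_uni094D.half") yields KA followed by VIRAMA, while "dev_rakaar" and "aasigndeva" yield an empty string, so a reader that relies on names has nothing to extract for those glyphs.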

Comment 4 Khaled Hosny 2013-07-04 21:15:42 UTC

Gargi and Lohit Hindi (at least my version of them) have some wrong glyph names as well.

Comment 5 Steve White 2013-07-04 21:25:48 UTC

Hi Khaled.

Of course we're aware that copying text from PDF is unreliable.
In fact, with the current technology, based on ToUnicode, it is impossible to reproduce the original text.

I am sure however, in the case of Indic scripts, it could be done in such a way that results in mostly readable text.

The reason I submitted this report to LibreOffice is that this product does the best job of the several approaches I tested.  I think it could be improved with the least effort, and serve as a model for other systems.

Regarding the AGLFN, as I said, it could be used to break a tie, but otherwise, you should reconsider your statements.  The AGLFN cannot carry more information than the ToUnicode stream does, and OpenType feature tables carry more information than either can.  The best approach would be to judiciously use the OpenType features to populate the ToUnicode stream.

As I said, the AGLFN could be used to break a tie in OpenType feature tables.  But if it conflicts with the feature tables, it cannot be right.  (And in fact, that's what my tests showed: technologies that relied on AGLFN often showed mistakes because of failure to code a glyph name...which is a pity because correct info was available.) It would be better to drop the technology.

Cheers!

Comment 6 Steve White 2013-07-05 08:31:50 UTC

Khaled,

Several of the bugs pointed out are logic errors in the generation code (for sure the duplicated characters, and I think also the disappearing/reappearing one). These have nothing to do with glyph naming.

I also pointed out that although Gargi and Lohit attempt (different) AGLFN schemes, each has bugs in that regard. This is part of my complaint with the AGLFN. In each case, there was sufficient information in the font's feature tables to produce ToUnicode entries which would have correctly decomposed the glyphs. Although often LibreOffice PDF generation algorithms use OpenType tables to populate ToUnicode, here the algorithms instead fell back to AGLFN, and failed.

It would be best to prefer the OpenType features in building ToUnicode, and fall back to AGLFN only to break a tie, in case those features would specify more than one character string for a given glyph.
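
The preference order proposed above can be written down as a small Python sketch. Everything here is hypothetical: feature_candidates stands for whatever character strings were derived for one glyph from the font's OpenType tables, and name_guess for the string derived from its glyph name. It only illustrates the policy, not any existing LibreOffice code.

```python
def choose_to_unicode(feature_candidates, name_guess=None):
    """Pick the ToUnicode string for one glyph.

    feature_candidates: iterable of strings derived from OpenType tables
                        (hypothetical input).
    name_guess:         string derived from the glyph name, or None.
    Returns the chosen string, or None if the tables give no answer.
    """
    candidates = sorted(set(feature_candidates))
    if not candidates:
        return None                  # the name alone is not trusted
    if len(candidates) == 1:
        return candidates[0]         # unambiguous: the name is ignored
    if name_guess in candidates:
        return name_guess            # the name only breaks the tie
    return candidates[0]             # name conflicts: deterministic fallback
```

A glyph name that disagrees with every candidate is simply ignored, which matches the argument that a name conflicting with the feature tables cannot be right.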

Another thought:

How to tackle the re-ordering of glyphs (especially, the 'i' and 'ii' vowel signs) using ToUnicode? (I don't know if LibreOffice attempts something like this, I just see it's mostly wrong.) The idea is based on making compound glyphs in the internal representation of the PDF file -- they need not correspond to slots in the original font.

When a glyph that needs re-ordering (as 'i' and 'ii') is detected, it should be possible to identify the following consonant cluster. The entire group, including the vowel and cluster, could be made a single glyph. Then the fake entry for that glyph in the ToUnicode stream would specify characters for the decomposed cluster, with the vowel re-ordered to the end of the cluster.

Of course, identifying the cluster could be tricky in some cases, but in modern Devanagari at least, it usually consists of a few half-form consonants followed by a consonant, or else a single consonant ligature. (That may be all--need to consult Unicode ch. 9)
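
A rough Python sketch of the re-ordering idea, working on characters rather than on real glyph IDs: it takes text in visual order, finds a short i sign (U+093F) drawn before a cluster, and moves it to the end of that cluster. The cluster pattern (half-forms followed by a full consonant, with optional nukta) is the simplified assumption from the paragraph above, and only U+093F is handled, since that is the Devanagari vowel sign drawn before its consonant. The names are made up for illustration.

```python
import re

VIRAMA = "\u094D"
NUKTA = "\u093C"
I_SIGN = "\u093F"                                   # DEVANAGARI VOWEL SIGN I
CONSONANT = "[\u0915-\u0939\u0958-\u095F]"          # KA..HA plus QA..YYA

# Simplified cluster: zero or more "consonant + virama", then one consonant.
CLUSTER = f"(?:{CONSONANT}{NUKTA}?{VIRAMA})*{CONSONANT}{NUKTA}?"
VISUAL_I = re.compile(f"{I_SIGN}({CLUSTER})")


def visual_to_logical(text):
    """Move each pre-base i sign behind the consonant cluster that follows it."""
    return VISUAL_I.sub(lambda m: m.group(1) + I_SIGN, text)
```

For example, the visual sequence i-sign, DA, VIRAMA, DHA becomes DA, VIRAMA, DHA, i-sign, which is the logical order a ToUnicode entry for the compound glyph would need to carry.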

And of course, there are other ways to do it!

Comment 7 Shriramana Sharma 2014-03-28 16:19:50 UTC

Khaled, is this perhaps related to bug 62728, since adding support for PDF/A-2U will/should fix the problem? I also find that any Indic text does not get copied correctly from PDFs exported by LibO. Using latest release LibO 4.2.2 on Kubuntu Saucy.

Comment 8 Khaled Hosny 2014-03-28 21:20:16 UTC

I can’t find a complete specification of PDF/A-2 level U, but it seems to require preserving the Unicode representation of the text, which is indeed a goal shared with this bug as well.

Comment 9 QA Administrators 2016-02-21 08:36:55 UTC Comment hidden (obsolete)

** Please read this message in its entirety before responding **

To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.

There have been thousands of bug fixes and commits since anyone checked on this bug report. During that time, it's possible that the bug has been fixed, or the details of the problem have changed. We'd really appreciate your help in getting confirmation that the bug is still present.

If you have time, please do the following:

Test to see if the bug is still present on a currently supported version of LibreOffice
(5.0.5 or 5.1.0) https://www.libreoffice.org/download/

If the bug is present, please leave a comment that includes the version of LibreOffice and
your operating system, and any changes you see in the bug behavior

If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave
a short comment that includes your version of LibreOffice and Operating System

Please DO NOT

Update the version field
Reply via email (please reply directly on the bug tracker)
Set the bug's Status field to RESOLVED - FIXED (this status has a particular meaning that is not
appropriate in this case)

If you want to do more to help you can test to see if your issue is a REGRESSION. To do so:
1. Download and install oldest version of LibreOffice (usually 3.3 unless your bug pertains to a feature added after 3.3)

http://downloadarchive.documentfoundation.org/libreoffice/old/

2. Test your bug
3. Leave a comment with your results.
4a. If the bug was present with 3.3 - set version to "inherited from OOo";
4b. If the bug was not present in 3.3 - add "regression" to keyword

Feel free to come ask questions or to say hello in our QA chat: http://webchat.freenode.net/?channels=libreoffice-qa

Thank you for your help!

-- The LibreOffice QA Team
This NEW Message was generated on: 2016-02-21

Comment 10 Shree Devi Kumar 2016-09-18 03:24:33 UTC

This bug is still present. 

Tested on Windows 10 with 
LibreOffice Writer 5.2.1.2
and Adobe Acrobat Reader 11.0.17.

I tested the same Devanagari text as reported by Steve White with a slightly different mix of fonts. The LOWriter document, resulting pdf and utf-8 text document with text copied from the pdf in Adobe Acrobat Reader and pasted in Notepad++ are attached.

FYI, this problem has been solved in Xetex with the new \XeTeXgenerateactualtext feature - please see http://tug.org/pipermail/xetex/2016-February/026445.html for the announcement.

Comment 11 Shree Devi Kumar 2016-09-18 03:26:27 UTC

Created attachment 127393 [details]
LOWriter document with a modified set of devanagari fonts

Comment 12 Shree Devi Kumar 2016-09-18 03:27:14 UTC

Created attachment 127394 [details]
Exported PDF for LOWriter document with a modified set of devanagari fonts

Comment 13 Shree Devi Kumar 2016-09-18 03:28:03 UTC

Created attachment 127395 [details]
Copied text from PDF for LOWriter document with a modified set of devanagari fonts

Comment 14 Shree Devi Kumar 2016-09-18 03:30:10 UTC

Comment on attachment 127395 [details]
Copied text from PDF for LOWriter document with a modified set of devanagari fonts

Sorry, this is the output from Save as text from Adobe Acrobat Reader. The copied text is being added in a different attachment.

Comment 15 Shree Devi Kumar 2016-09-18 03:34:07 UTC

Created attachment 127398 [details]
Copied text from PDF for the new LOWriter document

This is the attachment with the text copied from the pdf and pasted in Notepad++.

Comment 16 Xisco Faulí 2017-09-29 08:51:34 UTC Comment hidden (obsolete)

** Please read this message in its entirety before responding **

To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.

If you have time, please do the following:

Test to see if the bug is still present on a currently supported version of LibreOffice
(5.4.1 or 5.3.6) https://www.libreoffice.org/download/

If the bug is present, please leave a comment that includes the version of LibreOffice and
your operating system, and any changes you see in the bug behavior

If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave
a short comment that includes your version of LibreOffice and Operating System

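Please DO NOT

Update the version field
Reply via email (please reply directly on the bug tracker)
Set the bug's Status field to RESOLVED - FIXED (this status has a particular meaning that is not
appropriate in this case)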

If you want to do more to help you can test to see if your issue is a REGRESSION. To do so:
1. Download and install oldest version of LibreOffice (usually 3.3 unless your bug pertains to a feature added after 3.3)

http://downloadarchive.documentfoundation.org/libreoffice/old/

2. Test your bug
3. Leave a comment with your results.
4a. If the bug was present with 3.3 - set version to "inherited from OOo";
4b. If the bug was not present in 3.3 - add "regression" to keyword

Feel free to come ask questions or to say hello in our QA chat: http://webchat.freenode.net/?channels=libreoffice-qa

Thank you for helping us make LibreOffice even better for everyone!

Warm Regards,
QA Team

MassPing-UntouchedBug-20170929

Comment 17 Shree Devi Kumar 2018-01-17 14:49:13 UTC

I tested this again today with Version: 5.4.4.2 (x64)
Build ID: 2524958677847fb3bb44820e40380acbe820f960
CPU threads: 4; OS: Windows 6.19; UI render: default; 
Locale: hi-IN (en_IN); Calc: group

The problem still exists.

Please let me know what additional information is required.

As I had mentioned earlier in this thread, this problem has been solved in Xetex with the new \XeTeXgenerateactualtext feature - please see http://tug.org/pipermail/xetex/2016-February/026445.html for the announcement.

Comment 18 Shree Devi Kumar 2018-01-17 15:29:53 UTC Comment hidden (obsolete)

Also tested with the pre-release version
Version: 6.0.0.2 (x64)
Build ID: 06b618bb6f431d27fd2def25aa19c833e29b61cd
CPU threads: 4; OS: Windows 10.0; UI render: default; 
Locale: hi-IN (en_IN); Calc: group

Problem has NOT been addressed in that also.

Comment 19 Shree Devi Kumar 2018-01-17 15:40:40 UTC

> 
> As I had mentioned earlier in this thread, this problem has been solved in
> Xetex with the new \XeTeXgenerateactualtext feature - please see
> http://tug.org/pipermail/xetex/2016-February/026445.html for the
> announcement.

Here is a link to the actualtext branch for xetex on sourceforge.

https://sourceforge.net/p/xetex/code/ci/actualtext/tree/

Comment 20 Khaled Hosny 2018-01-17 21:53:06 UTC

LibreOffice has limited support for actual text already and I think it shouldn’t be hard to extend it and make it an option at least. If someone is interested in giving this a try, check SetActualText() calls in sw/source/core/text/EnhancedPDFExportHelper.cxx.
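
For readers unfamiliar with /ActualText: in the PDF content stream it is a property attached to a marked-content span that tells the reader which text to extract for the glyphs inside the span. The Python sketch below only builds such a span as a string; the helper names are invented, show_ops stands for whatever glyph-showing operators the exporter already emits, and this is not how SetActualText() is implemented.

```python
def actual_text_hex(text):
    """Encode text as a PDF hex string: UTF-16BE with a leading byte order mark."""
    return "<" + ("\ufeff" + text).encode("utf-16-be").hex().upper() + ">"


def wrap_with_actual_text(show_ops, text):
    """Wrap existing text-showing operators in a /Span carrying /ActualText."""
    return f"/Span <</ActualText {actual_text_hex(text)}>> BDC\n{show_ops}\nEMC"


# Hypothetical glyph IDs 0x0041 and 0x0012 drawn in visual order (i sign first),
# while the extracted text is the logical order NA + VOWEL SIGN I.
span = wrap_with_actual_text("<00410012> Tj", "\u0928\u093F")
```

Here span reads "/Span <</ActualText <FEFF0928093F>>> BDC", then the Tj line, then "EMC", so a reader that honours /ActualText returns the two characters in logical order regardless of how the glyphs are named or ordered.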

Comment 21 Shree Devi Kumar 2018-01-18 11:40:10 UTC Comment hidden (obsolete)

Thank you @Khaled Hosny for your response and pointer to  SetActualText() calls.

I think this must be a problem not just for Hindi but for all complex scripts.

Do you know whether the text copy paste from pdf works correctly for Arabic?

Comment 22 Shree Devi Kumar 2018-01-18 13:50:48 UTC

Code referred by Khaled can be viewed at https://github.com/LibreOffice/core/blob/master/sw/source/core/text/EnhancedPDFExportHelper.cxx#L761

Comment 23 Khaled Hosny 2018-01-19 12:35:19 UTC

(In reply to shreeshrii from comment #21)
> Thank you @Khaled Hosny for your response and pointer to  SetActualText()
> calls.
> 
> I think this must be a problem not just for Hindi but for all complex
> scripts.
> 
> Do you know whether the text copy paste from pdf works correctly for Arabic?

Copying Arabic from PDF can work without /ActualText if the fonts are carefully prepared, that is, if they use only one-to-one or many-to-one glyph substitutions and name glyphs following Adobe Glyph Names (https://github.com/adobe-type-tools/agl-specification), but this is only because no re-ordering happens in Arabic. But even then there are still issues with text direction.
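
To show why many-to-one substitutions are the easy case: a ToUnicode CMap maps each glyph code to a string of one or more characters, so a ligature glyph can simply map to all of its source characters. The Python sketch below writes such entries in the bfchar syntax; the glyph ID is made up, 2-byte glyph codes are assumed, and the limit on entries per block that real CMaps observe is ignored.

```python
def to_unicode_bfchar(mapping):
    """Build a ToUnicode 'bfchar' block from {glyph_id: text}.

    Glyph ids are assumed to be 2-byte codes; destination strings are UTF-16BE.
    """
    lines = [f"{len(mapping)} beginbfchar"]
    for glyph_id, text in sorted(mapping.items()):
        dst = text.encode("utf-16-be").hex().upper()
        lines.append(f"<{glyph_id:04X}> <{dst}>")
    lines.append("endbfchar")
    return "\n".join(lines)


# Hypothetical glyph 3 is a lam-alef ligature: one glyph, two characters.
block = to_unicode_bfchar({3: "\u0644\u0627"})
```

The middle line of block is "<0003> <06440627>". What this table cannot express is one character spread over several glyphs, or characters whose order differs from the glyph order, which is where the Indic cases in this bug break down.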

Comment 24 Shree Devi Kumar 2018-01-19 13:08:12 UTC Comment hidden (obsolete)

(In reply to Khaled Hosny from comment #20)
> LibreOffice has limited support for actual text already and I think it
> shouldn’t be hard to extend it and make it an option at least. If someone is
> interested in giving this a try, check SetActualText() calls in
> sw/source/core/text/EnhancedPDFExportHelper.cxx.

I see that you had committed the code regarding soft hyphens using Actualtext. https://github.com/LibreOffice/core/commit/4dba6f5837539746293ef6808ea39a764ab7654d

Since you are already familiar with the code base, would it be possible for you to extend it? It would really help out many users. Thanks!

Comment 25 Khaled Hosny 2018-01-19 17:44:32 UTC Comment hidden (obsolete)

(In reply to shreeshrii from comment #24)
> (In reply to Khaled Hosny from comment #20)
> > LibreOffice has limited support for actual text already and I think it
> > shouldn’t be hard to extend it and make it an option at least. If someone is
> > interested in giving this a try, check SetActualText() calls in
> > sw/source/core/text/EnhancedPDFExportHelper.cxx.
> 
> I see that you had committed the code regarding soft hyphens using
> Actualtext.
> https://github.com/LibreOffice/core/commit/
> 4dba6f5837539746293ef6808ea39a764ab7654d
> 
> Since you are already familiar with the code base, would it be possible for
> you to extend it? It would really help out many users. Thanks!

No time, unfortunately.

Comment 26 Khaled Hosny 2018-01-23 21:00:50 UTC

*** Bug 115117 has been marked as a duplicate of this bug. ***

Comment 27 Shree Devi Kumar 2018-01-24 09:16:48 UTC

Created attachment 139317 [details]
Sample text in multiple Indian scripts - ODT file

Comment 28 Shree Devi Kumar 2018-01-24 09:17:40 UTC

Created attachment 139318 [details]
Sample text in multiple Indian scripts - Exported PDF

Comment 29 Shree Devi Kumar 2018-01-24 09:18:27 UTC

Created attachment 139319 [details]
Sample text in multiple Indian scripts - Text copied from exported pdf

Comment 30 Shree Devi Kumar 2018-01-24 09:22:28 UTC

This problem is not limited to just Hindi. Rather it applies to all Indian language scripts, other Indic scripts and probably other complex scripts too.

I have attached a sample showing the errors in copied text in Devanagari, Bengali, Gujarati, Gurmukhi, Kannada, Malayalam, Tamil and Telugu scripts. 

Sample text in multiple Indian scripts - ODT file 
Sample text in multiple Indian scripts - Exported PDF 
Sample text in multiple Indian scripts - Text copied from exported pdf

Comment 31 Shree Devi Kumar 2018-01-24 09:28:16 UTC

Created attachment 139320 [details]
Sample text in multiple Indian scripts - Original Text copied from ODT

This is the original text - ground truth that can be compared with the exported text from the pdf.

Extending the Actualtext feature in pdfwriter can fix the issue. However, I do not know enough about the code or C++ to provide a patch.

Comment 32 Khaled Hosny 2018-01-25 12:29:12 UTC

*** Bug 62846 has been marked as a duplicate of this bug. ***

Comment 33 Shree Devi Kumar 2018-02-01 17:12:39 UTC

Just FYI, for users looking for a solution.

https://www.wikihow.com/index.php?title=Create-a-Searchable-Hindi-PDF-Using-Lyx-with-Xetex

Comment 34 Timur 2018-03-02 17:50:10 UTC

*** Bug 116056 has been marked as a duplicate of this bug. ***

Comment 35 Timur 2018-03-08 10:44:31 UTC

*** Bug 116284 has been marked as a duplicate of this bug. ***

Comment 36 Jim Avera 2018-03-08 18:04:24 UTC Comment hidden (obsolete)

Even pure Latin script (e.g. U.S. English) can give wrong copy-and-paste results.  Reportedly has something to do with ligatures, e.g. "tt" becoming "t" etc. but only in tables or some specific constructs.

Can someone confirm that that is the same underlying bug as this one?

Ref: bug 116284 https://bugs.documentfoundation.org/show_bug.cgi?id=116284

Comment 37 Khaled Hosny 2018-03-20 01:37:07 UTC

*** Bug 116490 has been marked as a duplicate of this bug. ***

Comment 38 Khaled Hosny 2018-04-23 09:52:49 UTC Comment hidden (obsolete)

https://gerrit.libreoffice.org/#/c/53315/

Comment 39 Commit Notification 2018-04-27 09:24:43 UTC

Khaled Hosny committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=c688b01d9102832226251fc84045408afe392459

tdf#66597 Fix PDF text extraction for complex text

It will be available in 6.1.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.

Comment 40 Timur 2018-04-27 16:33:15 UTC Comment hidden (obsolete)

Congrats to Khaled and Miklos and Tomaž for all those patch sets and related changes. 
I guess backport to 6.0 is not to be expected.

Comment 41 Khaled Hosny 2018-04-27 17:45:20 UTC

(In reply to Timur from comment #40)
> Congrats to Khaled and Miklos and Tomaž for all those patch sets and related
> changes. 
> I guess backport to 6.0 is not to be expected.

Too many changes to backport, also strictly speaking this is a new feature not a bug fix.

Comment 42 Shree Devi Kumar 2018-04-27 17:52:41 UTC Comment hidden (obsolete)

(In reply to Commit Notification from comment #39)
> Khaled Hosny committed a patch related to this issue.
> It has been pushed to "master":
>
> It will be available in 6.1.0.
> 
> The patch should be included in the daily builds available at
> http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
> information about daily builds can be found at:
> http://wiki.documentfoundation.org/Testing_Daily_Builds
> 
> Affected users are encouraged to test the fix and report feedback.

Thank you, Khaled Hosny, for implementing this.

Which version should I download to test this on windows10?

Comment 43 Volga 2018-04-28 04:47:57 UTC Comment hidden (obsolete)

(In reply to shreeshrii from comment #42)
> Which version should I download to test this on windows10?
You can get the build from here:
https://dev-builds.libreoffice.org/daily/master/
Mind the build channel and date.

Comment 44 Shree Devi Kumar 2018-04-28 12:22:19 UTC Comment hidden (obsolete)

(In reply to Volga from comment #43)
> (In reply to shreeshrii from comment #42)
> > Which version should I download to test this on windows10?
> You can get the build from here:
> https://dev-builds.libreoffice.org/daily/master/
> Mind the build channel and date.

Thank you, @Volga.

I installed Version: 6.1.0.0.alpha1+ (x64)
Build ID: a6a38c6de9c18fd1269fc8cfc0e070ef429c8e2f
on Windows 10

Comment 45 Shree Devi Kumar 2018-04-28 12:39:54 UTC

Created attachment 141740 [details]
Devanagari QA files

Comment 46 Shree Devi Kumar 2018-04-28 12:52:33 UTC

(In reply to shreeshrii from comment #45)
> Created attachment 141740 [details]
> Devanagari QA files

The zip file has the original text file, same copied to a Libre Office document in different fonts and exported to pdf without any PDF options checked.

I will post below the first two lines of the file under different scenarios:

1. Original text

Mangal
नित्यानन्दकरी वराभयकरी सौन्दर्यरत्नाकरी । निर्धूताखिलघोरपावनकरी प्रत्यक्षमाहेश्वरी ।। 

2. Saving from Adobe Reader DC to text does not extract any Devanagari text. Some control characters are output along with ... where Devanagari should be.

Mangal 
............ ........ ............ .. 

3. Copying from Adobe Reader DC to pasting in Notepad++ utf-8 document transfers the Devanagari text. There are the additional control characters as in the above case.

Mangal
नित्यानन्दकरी वराभयकरी सौन्दर्यरत्नाकरी । निर्धूताखिलघोरपावनकरी प्रत्यक्षमाहेश्वरी ।।

4. Copying from Chrome Browser to pasting in Notepad++ utf-8 document transfers the Devanagari text. There are the additional control characters, some are different from the Adobe Reader case. Some Devanagari characters are missing.

Mangal
नित्यान्दकरी वराभयकरी सौन्दर्यरत्ाकरी । निर्धूताखि  लघोरपावकरी प्रत्यक्षमाहेश्वरी ।।

5. Copying from Microsoft Edge Browser to pasting in Notepad++ utf-8 document transfers the Devanagari text. There are fewer control characters added. More Devanagari characters are missing.

Mangal नित्यान्दकरी वराभयकरी सौन्दयरत्ाकरी । निर्धूताखिलघोरपावकरी प्रत्यक्षमाहेश्वरी ।। 

I am reopening the bug. 

Please let me know if any additional information is needed. I have NOT tested with any other script/language.
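
For anyone repeating these tests, the stray control characters are easier to report if they are listed by code point. A small Python sketch using the standard unicodedata module follows; the function name is made up, and treating ZWJ/ZWNJ as legitimate is an assumption that suits Indic text.

```python
import unicodedata

# Zero-width non-joiner and joiner are format characters but valid in Indic text.
ALLOWED_FORMAT = {"\u200C", "\u200D"}
SUSPICIOUS_CATEGORIES = {"Cc", "Cf", "Co", "Cn"}   # control, format, private use, unassigned


def suspicious_characters(text):
    """Return (index, 'U+XXXX', category) for characters that should not be in copied text."""
    found = []
    for index, ch in enumerate(text):
        if ch in "\n\r\t" or ch in ALLOWED_FORMAT:
            continue
        category = unicodedata.category(ch)
        if category in SUSPICIOUS_CATEGORIES:
            found.append((index, f"U+{ord(ch):04X}", category))
    return found
```

Running it over the text pasted from each reader gives a list that can be compared between Adobe Reader, Chrome and Edge.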

Comment 47 Khaled Hosny 2018-04-28 23:07:36 UTC

(In reply to shreeshrii from comment #46)
> (In reply to shreeshrii from comment #45)
> > Created attachment 141740 [details]
> > Devanagari QA files
> 
> The zip file has the original text file, same copied to a Libre Office
> document in different fonts and exported to pdf without any PDF options
> checked.
> 
> I will post below the first two lines of the file under different scenarios:
> 
> 1. Original text
> 
> Mangal
> नित्यानन्दकरी वराभयकरी सौन्दर्यरत्नाकरी । निर्धूताखिलघोरपावनकरी
> प्रत्यक्षमाहेश्वरी ।। 
> 
> 2. Saving from Adobe Reader DC to text does not extract any Devanagari text.
> Some control characters are output along with ... where Devanagari should be.
> 
> Mangal 
> ............ ........ ............ .. 

That is a bug in Adobe Reader.

> 3. Copying from Adobe Reader DC to pasting in Notepad++ utf-8 document
> transfers the Devanagari text. There are the additional control characters
> as in the above case.
> 
> Mangal
> नित्यानन्दकरी वराभयकरी सौन्दर्यरत्नाकरी । निर्धूताखिलघोरपावनकरी
> प्रत्यक्षमाहेश्वरी ।।

Thanks for testing! There was a typo in the code that seems to have gotten into some late iterations that I failed to test properly. Should be fixed shortly (with a test case, to prevent such breakage in the future).

> 4. Copying from Chrome Browser to pasting in Notepad++ utf-8 document
> transfers the Devanagari text. There are the additional control characters,
> some are different from the Adobe Reader case. Some Devanagari characters
> are missing.
> 
> Mangal
> नित्यान्दकरी वराभयकरी सौन्दर्यरत्ाकरी । निर्धूताखि  लघोरपावकरी
> प्रत्यक्षमाहेश्वरी ।।

Chrome’s PDF reader does not support /ActualText, so the changes here are unlikely to help that much. There is nothing we can do about it, unfortunately (apart from reporting to Chrome developers, of course).

> 5. Copying from Microsoft Edge Browser to pasting in Notepad++ utf-8
> document transfers the Devanagari text. There are fewer control characters
> added. More Devanagari characters are missing.
> 
> Mangal नित्यान्दकरी वराभयकरी सौन्दयरत्ाकरी । निर्धूताखिलघोरपावकरी
> प्रत्यक्षमाहेश्वरी ।। 

I don’t know if this supports /ActualText or not, please wait for the next fix and re-test.

Comment 48 Khaled Hosny 2018-04-29 09:01:03 UTC

This should be fixed now, please retest.

Comment 49 Shree Devi Kumar 2018-04-29 12:54:26 UTC Comment hidden (obsolete)

Created attachment 141756 [details]
Devanagari QA2

I tested with 
Version: 6.1.0.0.alpha1+ (x64)
Build ID: 3bb3a9849c4262946013684495b18c0aa07e33c8
CPU threads: 4; OS: Windows 10.0; UI render: default; 
TinderBox: Win-x86_64@42, Branch:master, Time: 2018-04-29_02:56:02
Locale: en-IN (en_IN); Calc: group

The problem of additional control characters and missing characters still exists. The copied text is exactly the same as posted in the earlier zip file.

To help isolate the problem, I am attaching a new zip file for QA2. It has the same Devanagari text but only in one font. I have included the original utf-8 text, odt file and the generated pdf. For comparison I have also included a xelatex source file which has the same Devanagari text and the generated pdf with that.

Using Adobe Reader DC to open the xelatex generated pdf, when selecting the text the highlighting is not correct, however the copied and pasted text is complete and without the extra  control characters.

Using Adobe Reader DC to open the Libre Office generated pdf, when selecting the text the highlighting is complete, however the copied and pasted text is with the extra  control characters as well as missing Devanagari characters. There are differences in the control characters etc based on the font.

Devanagari may not have standard ToUnicode tables in different fonts. So it may be better to use ActualText fully.

You can install the Devanagari font used from https://packages.ubuntu.com/bionic/fonts-sahadeva

Please let me know if you need any additional information.

Comment 50 Khaled Hosny 2018-04-29 15:02:19 UTC Comment hidden (obsolete)

(In reply to shreeshrii from comment #49)
> Created attachment 141756 [details]
> Devanagari QA2
> 
> I tested with 
> Version: 6.1.0.0.alpha1+ (x64)
> Build ID: 3bb3a9849c4262946013684495b18c0aa07e33c8
> CPU threads: 4; OS: Windows 10.0; UI render: default; 
> TinderBox: Win-x86_64@42, Branch:master, Time: 2018-04-29_02:56:02
> Locale: en-IN (en_IN); Calc: group

That build does not have the fix yet. The patch was submitted just this morning, please wait a bit more and try again.

Comment 51 Khaled Hosny 2018-04-30 09:20:19 UTC Comment hidden (obsolete)

(In reply to Khaled Hosny from comment #50)
> (In reply to shreeshrii from comment #49)
> > Created attachment 141756 [details]
> > Devanagari QA2
> > 
> > I tested with 
> > Version: 6.1.0.0.alpha1+ (x64)
> > Build ID: 3bb3a9849c4262946013684495b18c0aa07e33c8
> > CPU threads: 4; OS: Windows 10.0; UI render: default; 
> > TinderBox: Win-x86_64@42, Branch:master, Time: 2018-04-29_02:56:02
> > Locale: en-IN (en_IN); Calc: group
> 
> That build does not have the fix yet. The patch was submitted just this
> morning, please wait a bit more and try again.

Please try now, the last build should have the fix.

Comment 52 Shree Devi Kumar 2018-04-30 11:25:40 UTC

Created attachment 141772 [details]
Devanagari QA3 files

I tested with 
Version: 6.1.0.0.alpha1+ (x64)
Build ID: 5f2073fbc995fb619f398a55187413813578b62e
CPU threads: 4; OS: Windows 10.0; UI render: default; 
TinderBox: Win-x86_64@42, Branch:master, Time: 2018-04-30_00:51:08
Locale: en-IN (en_IN); Calc: group

Thank you, Khaled Hosny. Your new patch is applied in this build. The results are much improved.

1. In Adobe Reader, the control characters have disappeared. All Devanagari characters and glyphs are displaying. However, there are certain extra spaces within words.

These seem related to certain consonant conjunct glyphs and the combining i mark (which is repositioned before the consonants). Their location seems to change based on the fonts used.

I have created a wdiff file with the original text vs the text copied from Adobe Reader. Here are top few lines in it:

Mangal
[-नित्यानन्दकरी-]
{+नि त्यानन्दकरी+} वराभयकरी सौन्दर्यरत्नाकरी । [-निर्धूताखिलघोरपावनकरी-] {+नि र्धूताखि लघोरपावनकरी+} प्रत्यक्षमाहेश्वरी ।। 
[-अग्निशामक अभिज्ञान-]
{+अग्नि शामक अभि ज्ञान+} अनुक्रम काष्ठवाद्य [-अंतर्राष्ट्रीय-] {+अंतर्रा ष्ट्रीय+} ख़ूँखार [-मूत्रविज्ञान द्विध्रुव-] {+मूत्रवि ज्ञान द्वि ध्रुव+}

2. Chrome is still displaying some control characters. wdiff is included.

3. The pdf generated by xelatex allows text to be copied correctly.

Comment 53 Shree Devi Kumar 2018-04-30 11:27:53 UTC

A few issues still remain with Devanagari text being copy-pasted. I have not tested with other Indian scripts yet.

Comment 54 Khaled Hosny 2018-04-30 22:46:51 UTC

(In reply to Shree Devi Kumar from comment #52)
> Created attachment 141772 [details]
> Devanagari QA3 files
> 
> I tested with 
> Version: 6.1.0.0.alpha1+ (x64)
> Build ID: 5f2073fbc995fb619f398a55187413813578b62e
> CPU threads: 4; OS: Windows 10.0; UI render: default; 
> TinderBox: Win-x86_64@42, Branch:master, Time: 2018-04-30_00:51:08
> Locale: en-IN (en_IN); Calc: group
> 
> Thank you, Khaled Hosny. Your new patch is applied in this build. The
> results are much improved.
> 
> 1. In Adobe Reader, The control characters have disappeared. All Devanagari
> characters and glyphs are displaying. However, there certain extra spaces
> within words.

That is a bug in the reader: it tries to guess where spaces are based on threshold distances between glyphs. This is a heuristic and it often fails. The only way I know to fix this is to use /ActualText per word, but that completely breaks the ability to select individual characters inside the word, so it is out of the question, at least by default. It might be a good idea to have an option to do this; please open a new issue if you are interested in such an option.
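
To illustrate the kind of heuristic described above, here is a minimal sketch in Python. It is not taken from any real viewer: the `Glyph` record, the `extract_text` function, and the `gap_ratio` threshold are all hypothetical names and values chosen for illustration. It only shows why a distance threshold can insert a space inside a word when a glyph has been repositioned.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    text: str       # Unicode text mapped to this glyph
    x: float        # left edge of the glyph on the line
    advance: float  # horizontal advance of the glyph

def extract_text(glyphs, font_size, gap_ratio=0.2):
    """Join glyph texts, guessing a space wherever the gap to the
    previous glyph exceeds gap_ratio * font_size (assumed threshold)."""
    out = []
    prev_end = None
    for g in glyphs:
        if prev_end is not None and g.x - prev_end > gap_ratio * font_size:
            out.append(" ")  # the guess: a large gap is taken as a word break
        out.append(g.text)
        prev_end = g.x + g.advance
    return "".join(out)

# Two glyphs of one word; in the second case the second glyph is shifted right.
tight = [Glyph("a", 0, 5), Glyph("b", 5, 5)]
loose = [Glyph("a", 0, 5), Glyph("b", 8, 5)]
print(extract_text(tight, font_size=10))
print(extract_text(loose, font_size=10))
```

With these made-up numbers, the second call reports a space inside the word, because the gap of 3 units exceeds the assumed threshold of 2 units.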

> 
> 2. Chrome is still displaying some control characters. wdiff is included.

Again, bug(s) in the reader; I am not sure there is anything we can do here.

> 3. The pdf generated by xelatex allows text to be copied correctly.

But you can’t select individual characters (or grapheme clusters) as it embeds /ActualText per word (see above).

Comment 55 Volga 2018-05-01 04:44:05 UTC

(In reply to Shree Devi Kumar from comment #53)
> A few issues still remain with Devanagari text being copy-pasted. I have not
> tested with other Indian scripts yet.

You can get the sample from here: http://www.gnu.org/software/freefont/ranges/

Comment 56 Shree Devi Kumar 2018-05-01 10:42:16 UTC

(In reply to Volga from comment #55)
> (In reply to Shree Devi Kumar from comment #53)
> > A few issues still remain with Devanagari text being copy-pasted. I have not
> > tested with other Indian scripts yet.
> 
> You can get the sample from here:
> http://www.gnu.org/software/freefont/ranges/

Volga,
Thank you for the link. I have created a test document from the same.

Comment 57 Shree Devi Kumar 2018-05-01 10:55:33 UTC

(In reply to Khaled Hosny from comment #54)

> That is a bug in the reader, it tries to guess spaces based on some
> threshold distances between glyphs. It is a heuristic and it often fails.

I was using Adobe Reader as the best case scenario. 

Is there any other viewer that works correctly for copying and extracting text from generated PDFs for complex scripts?

Which viewer do you test with?

> The only way I know to fix this is to use /ActualText per word, but this
> completely breaks the ability to select individual characters inside the
> word, so it is out of question, at least by default. 

> It might be a good idea
> to have an option to do this, please open a new issue if you are interested
> in such an option.

I think such an option should be applied internally by the program based on the languages/scripts, since a number of Indic/complex scripts have the same problem.

I will add a zip file with test cases for various Indic scripts including Devanagari.

While looking for a PDF viewer/reader, I read that LibreOffice supports opening PDF files. I tried opening the generated PDF in the daily build and it showed a number of errors (it was opened in LibreDraw). I can open a new issue for that, though I haven't quite figured out how to copy the text from the text box in it.

Thanks, Khaled. I appreciate your efforts in fixing this issue, which has been open for 5 years!

Comment 58 Shree Devi Kumar 2018-05-01 11:17:01 UTC

Created attachment 141808 [details]
Indic including Devanagari - QA4

Marked older attachments as obsolete.

This zip file has two .odt documents: one has the text used earlier in Devanagari QA1-3; the other uses samples from the freefont site, as suggested by Volga.

A wdiff is provided for both sample documents. The summary below gives an idea of the differences.

indic-freefont-sample-qa4.txt: 536 words  371 69% common  0 0% deleted  165 31% changed
indic-freefont-sample-qa4.adobe-reader.txt: 814 words  371 46% common  0 0% inserted  443 54% changed

indic-pdf-export-qa4.txt: 242 words  151 62% common  0 0% deleted  91 38% changed
indic-pdf-export-qa4.adobe-reader.txt: 362 words  151 42% common  0 0% inserted  211 58% changed
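
For readers who want to reproduce this kind of summary without wdiff, the following is a rough sketch in Python. It uses `difflib.SequenceMatcher` from the standard library, which is not the algorithm wdiff uses, so the counts may differ from wdiff's; the `word_stats` function name and the returned keys are hypothetical.

```python
from difflib import SequenceMatcher

def word_stats(original: str, copied: str) -> dict:
    """Count whitespace-separated words in both texts and the words
    that a word-level diff considers common to both."""
    a, b = original.split(), copied.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    common = sum(block.size for block in matcher.get_matching_blocks())
    return {
        "original_words": len(a),
        "copied_words": len(b),
        "common": common,
        # Share of each text made up of common words, as whole percentages.
        "common_pct_original": round(100 * common / len(a)) if a else 0,
        "common_pct_copied": round(100 * common / len(b)) if b else 0,
    }

print(word_stats("a b c d", "a b x y d"))
```

An extra space inside a word turns one word into two, which is why the copied text has more words than the original and a lower share of common words.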

Languages Tested
----------------
Devanagari Script – Hindi, Sanskrit, Marathi, Nepali languages

Bengali Script - Assamese, Bengali

Gurmukhi script – Panjabi/Punjabi language

Gujarati

Kannada

Malayalam

Oriya

Telugu

Burmese

Khmer

Sinhala

Tamil

Thaana

Comment 59 Shree Devi Kumar 2018-05-01 11:45:05 UTC

Created attachment 141809 [details]
RTL Languages - Arabic, Hebrew QA5

This zip file has an .odt with Arabic and Hebrew samples taken from the freefont page. The .txt, the .pdf, and the text copied from Adobe Reader are included along with the wdiff.

For Arabic, most of the errors seem to be related to the use of ( and ) with the Arabic text. The number of errors is much smaller.


RTL-pdf-export-QA5.txt: 407 words  347 85% common  0 0% deleted  60 15% changed
RTL-pdf-export-QA5.adobe-reader.txt: 424 words  347 82% common  2 0% inserted  75 18% changed

Comment 60 Khaled Hosny 2018-05-02 10:00:27 UTC

(In reply to Shree Devi Kumar from comment #57)
> (In reply to Khaled Hosny from comment #54)
> 
> > That is a bug in the reader, it tries to guess spaces based on some
> > threshold distances between glyphs. It is a heuristic and it often fails.
> 
> I was using Adobe Reader as the best case scenario. 
> 
> Is there any other viewer which work correctly to copy and extract text from
> generated pdfs for complex scripts?
> 
> Which viewer do you test with?

Viewers based on Poppler seem to be good, comparable to Adobe’s at least.
 
> > The only way I know to fix this is to use /ActualText per word, but this
> > completely breaks the ability to select individual characters inside the
> > word, so it is out of question, at least by default. 
> 
> > It might be a good idea
> > to have an option to do this, please open a new issue if you are interested
> > in such an option.
> 
> I think such an option should be used internally by the program based on the
> languages/scripts, since a number of Indic/Complex scripts are having the
> same problem.

The extra-space issue can happen with any script, even Latin; I have certainly seen it with purely Latin text. The solution comes with a big downside, so I’d not want to do it automatically.

> I will add a zip file with test cases for various Indic scripts including
> Devanagari.
> 
> While looking for a viewer/reader of pdfs, I read that LibreOffice supports
> opening of pdf files. I tried opening the generated pdf through the daily
> build and it showed a number of errors (it was opened in LibreDraw). I can
> open a new issue for that, though I haven't quite figured out how to copy
> the text from text box in it.

There are several issues open about this, but it is a completely different matter. LibreOffice tries to convert PDFs into editable documents (which is a lost cause, if you ask for my opinion), and that has its own set of issues.

Comment 61 Khaled Hosny 2018-05-02 10:05:30 UTC

(In reply to Shree Devi Kumar from comment #59)
> Created attachment 141809 [details]
> RTL Languages - Arabic, Hebrew QA5
> 
> This zip file has an .odt with Arabic and Hebrew samples taken from the
> freefont page. .txt, .pdf and the text copied from adobe reader are included
> alongwith the wdiff.
> 
> For Arabic, most of the errors seem to be related to usage of ( and ) with
> the Arabic text. The number of errors is much smaller.
> 
> 
> RTL-pdf-export-QA5.txt: 
> 407 words  
> 347 85% common  
> 0 0% deleted  
> 60 15% changed
> RTL-pdf-export-QA5.adobe-reader.txt: 
> 424 words  
> 347 82% common  
> 2 0% inserted  
> 75 18% changed

That is much better than I’d have expected for RTL, which is a totally different beast. PDF documents contain the final visual result (after the bidirectional algorithm has been applied), so the logical direction of the text is lost entirely. The viewer has to re-apply the bidirectional algorithm in reverse, which will almost always fail in some cases. The only way to preserve the original text in its entirety is to use /ActualText on the whole paragraph (not just words or lines).
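
As a rough illustration of what attaching /ActualText to a run of content looks like at the PDF level, here is a small Python sketch. It is not LibreOffice's implementation: the `actual_text_span` function and the `content_ops` placeholder are hypothetical. It relies only on the PDF conventions that a marked-content sequence is written as `/Span << ... >> BDC ... EMC` and that a text string may be given as a hex string of UTF-16BE code units with a leading FEFF byte order mark.

```python
def actual_text_span(logical_text: str, content_ops: str) -> str:
    """Wrap content-stream operators in a /Span marked-content sequence
    whose /ActualText carries the text in logical order."""
    # PDF text string: byte order mark followed by UTF-16BE, written as hex.
    hex_text = (b"\xfe\xff" + logical_text.encode("utf-16-be")).hex().upper()
    return f"/Span << /ActualText <{hex_text}> >> BDC\n{content_ops}\nEMC"

# content_ops stands in for the operators that draw the glyphs in visual order.
print(actual_text_span("\u05e9\u05dc\u05d5\u05dd", "% glyph-drawing operators go here"))
```

A viewer that honours /ActualText returns the stored logical text for the whole span, which is also why selection inside the span stops working at a finer granularity.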

Comment 62 Khaled Hosny 2018-05-02 10:12:44 UTC

(In reply to Shree Devi Kumar from comment #58)
> Created attachment 141808 [details]
> Indic including Devanagari - QA4
> 
> Marked Older attachments as Obsolete.
> 
> This zip file has two .odt documents, one has the text used earlier in
> Devanagari QA1-3, the other uses samples from the freefont site, suggested
> by Volga.
> 
> wdiff is provided for both sample documents. The summary below gives an idea
> of differences.
> 
> indic-freefont-sample-qa4.txt: 
> 536 words  
> 371 69% common  
> 0 0% deleted  
> 165 31% changed
> indic-freefont-sample-qa4.adobe-reader.txt: 
> 814 words  
> 371 46% common  
> 0 0% inserted  
> 443 54% changed
> 
> indic-pdf-export-qa4.txt: 
> 242 words  
> 151 62% common  
> 0 0% deleted  
> 91 38% changed
> indic-pdf-export-qa4.adobe-reader.txt: 
> 362 words  
> 151 42% common  
> 0 0% inserted  
> 211 58% changed 


Thanks for testing, I really appreciate it. All the changes seem to be space-related. Tamil looks the worst, but after careful examination it also seems to be space-related; the spaces just happen to fall in very unfortunate places that wreak havoc with cluster formation.
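
One quick way to check whether a difference is only space-related is to compare the two texts with all whitespace removed. The sketch below is a hypothetical helper, not part of any tool mentioned in this report, and it ignores Unicode normalization differences.

```python
def differs_only_in_whitespace(original: str, copied: str) -> bool:
    """True if the texts differ, but become identical once all
    whitespace is removed from both."""
    def squeeze(s: str) -> str:
        return "".join(s.split())  # split() drops every run of whitespace
    return original != copied and squeeze(original) == squeeze(copied)

word = "नित्यानन्दकरी"
with_gap = word[:2] + " " + word[2:]  # simulate an extra space after the first cluster
print(differs_only_in_whitespace(word, with_gap))
```

A check like this says nothing about where the space landed, so it cannot tell a harmless gap from one that splits a cluster.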

Comment 63 Shree Devi Kumar 2018-05-04 08:42:14 UTC

Khaled,
Thank you for your responses.

What do you suggest as next steps to resolve this issue?

Is it possible to implement /ActualText at the word level, based on the language/script being used or on the Unicode range?
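
To make the Unicode-range idea concrete, here is a minimal sketch in Python. The `COMPLEX_SCRIPT_RANGES` table and the `uses_complex_script` function are hypothetical; the ranges are the Unicode blocks for some of the scripts tested in comment 58, and the list is not complete. Note that, as comment 60 points out, the extra-space problem is not limited to these scripts.

```python
# Assumed, incomplete table of Unicode block ranges for illustration only.
COMPLEX_SCRIPT_RANGES = [
    (0x0780, 0x07BF),  # Thaana
    (0x0900, 0x0DFF),  # Devanagari through Sinhala (adjacent Indic blocks)
    (0x1000, 0x109F),  # Myanmar (Burmese)
    (0x1780, 0x17FF),  # Khmer
]

def uses_complex_script(word: str) -> bool:
    """True if any character of the word falls in one of the listed ranges."""
    return any(
        low <= ord(ch) <= high
        for ch in word
        for low, high in COMPLEX_SCRIPT_RANGES
    )

print(uses_complex_script("\u0905\u0917\u094d\u0928\u093f"))  # a Devanagari word
print(uses_complex_script("hello"))
```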

Comment 64 Khaled Hosny 2018-05-04 13:35:00 UTC

(In reply to Shree Devi Kumar from comment #63)
> Khaled,
> Thank you for your responses.
> 
> What do you suggest as next steps to resolve this issue?
> 
> Is it possible to implement /Actualtext at word level based on
> language/script being used/ based on unicode range.

My suggestion is to add an option to the PDF export dialog to do /ActualText per word (per sentence might be harder with the current implementation).

Please open a new issue for this; this one is getting too long and further fixes are out of scope of the original issue. We would also need the UX team to give their opinion about the UI.

Comment 65 Shree Devi Kumar 2018-05-04 14:42:52 UTC

(In reply to Khaled Hosny from comment #64)
> (In reply to Shree Devi Kumar from comment #63)
> > Khaled,
> > Thank you for your responses.
> > 
> > What do you suggest as next steps to resolve this issue?
> > 
> > Is it possible to implement /Actualtext at word level based on
> > language/script being used/ based on unicode range.
> 
> My suggestion is to add an option to PDF export dialog to do ActualText per
> word (per sentence might be harder with the current implementation).
> 
> Please open a new issue for this, this one is getting too long and further
> fixes are out of scope of the original issue. We would need also the UX team
> to give there opinion about the UI.

OK, opened a new bug at https://bugs.documentfoundation.org/show_bug.cgi?id=117428

Comment 66 V Stuart Foote 2019-03-21 21:35:07 UTC

*** Bug 124191 has been marked as a duplicate of this bug. ***