Created attachment 82038 [details] More thorough description of the problem. Problem description: Steps to reproduce: 1. In a LOWriter doc, put several copies of the lines (Article 1 of the UDHR) सभी मनुष्यों को गौरव और अधिकारों के मामले में जन्मजात स्वतन्त्रता और समानता प्राप्त है । उन्हें बुद्धि और अन्तरात्मा की देन प्राप्त है और परस्पर उन्हें भाईचारे के भाव से बर्ताव करना चाहिए । Format with a different font supporting Hindi. I used distro Lohit Hindi and Gargi, as well as GNU FreeSerif and GNU FreeSans (latest versions from SVN). 2. Export as PDF 3. Open the resulting file with Adobe Reader. Select and copy the text from the PDF file, and paste it into a text editor. Current behavior: Lohit Hindi सभी मनुष्यों को गौरव और अधिधिकारों के मामले मे जन्मजात स्वतन्त्रता और समानता प्राप्त है । उन्हे बुि औद्धि और अधन्तरात्मा की देन प्राप्त है और परस्पर उन्हे भाईचारे के भाव से बतार्ताव करना चाि औहए । FreeSerif सभी मनुष्यो को गौरव और अधिधिकारो के मामले मे जन्मजात स्वतन्त्रता और समानता प्राप्त है । उन्हे बुिद और अधन्तरात्मा की देन प्राप्त है और परस्पर उन्हे भाईचारे के भाव से बतारव करना चािहए । FreeSans सभी मनुष्यों को गौरव और अधिधिकारों के मामले में जन्मजात स्वतन्त्रता और समानता प्ाप्त है । उन्हें बुिद्धि और अधन्तरात्मा की देन प्ाप्त है और परस्पर उन्हें भाईचारे के भाव से बताव करना चािहए । Gargi सभी मनुष्यो को गौरव और अधिधिकारो के मामले मे जन्मजात स्वतन्त्रता और समानता प्राप्त है । उन्हे बुिद और अधन्तरात्मा की देन प्राप्त है और परस्पर उन्हे भाईचारे के भाव से बताव करना चािहए । Expected behavior: Should get something more like the original text back. Operating System: All Version: 4.0.2.2 release
Created attachment 82039 [details] LOWriter doc as described in report
Created attachment 82040 [details] PDF as exported on my system
Text extraction from PDF is a very unreliable process. Glyph names play an important role, and using proper glyph names in accordance with the Adobe Glyph Naming convention (http://www.adobe.com/devnet/opentype/archives/glyph.html) should help the extractability of text set in GNU FreeFont, which currently contains glyph names that are useless for text extraction, like dev_rakaar and aasigndeva. Glyph names do not help with re-ordering, and there are probably some LibreOffice bugs in setting ToUnicode values in PDF, but proper glyph names are the start.
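To illustrate why names like dev_rakaar carry no extractable information, here is a minimal Python sketch of how a reader might derive text from a glyph name. It handles only the "uniXXXX" and "uXXXX" forms, underscore-separated ligature components, and dotted suffixes; a full implementation would also consult the Adobe Glyph List itself. The function name glyph_name_to_text is hypothetical and is not taken from LibreOffice or any PDF reader.

```python
import re

def glyph_name_to_text(name: str) -> str:
    """Derive text from a glyph name, or return '' if the name is opaque.

    Simplified sketch: only 'uniXXXX...' and 'uXXXX'..'uXXXXXX' components
    are understood; lookups in the Adobe Glyph List are deliberately omitted.
    """
    base = name.split(".", 1)[0]           # drop a suffix such as '.half'
    chars = []
    for part in base.split("_"):           # ligature components
        if re.fullmatch(r"uni(?:[0-9A-F]{4})+", part):
            digits = part[3:]
            chars.extend(chr(int(digits[i:i + 4], 16))
                         for i in range(0, len(digits), 4))
        elif re.fullmatch(r"u[0-9A-F]{4,6}", part):
            chars.append(chr(int(part[1:], 16)))
        else:
            return ""                      # name says nothing about the text
    return "".join(chars)

print(repr(glyph_name_to_text("uni0930_uni094D")))  # RA + VIRAMA
print(repr(glyph_name_to_text("dev_rakaar")))       # opaque name
```

A name in the "uni" form decomposes into its code points, while a descriptive name gives the reader nothing to work with.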
Gargi and Lohit Hindi (at least my version of them) have some wrong glyph names as well.
Hi Khaled. Of course we're aware that copying text from PDF is unreliable. In fact, with the current technology, based on ToUnicode, it is impossible to reproduce the original text. I am sure, however, that in the case of Indic scripts it could be done in such a way that results in mostly readable text. The reason I submitted this report to LibreOffice is that this product does the best job of the several approaches I tested. I think it could be improved with the least effort, and serve as a model for other systems.

Regarding the AGLFN, as I said, it could be used to break a tie, but otherwise, you should reconsider your statements. The AGLFN cannot carry more information than the ToUnicode stream does, and OpenType feature tables carry more information than either can. The best approach would be to judiciously use the OpenType features to populate the ToUnicode stream. As I said, the AGLFN could be used to break a tie in OpenType feature tables. But if it conflicts with the feature tables, it cannot be right. (And in fact, that's what my tests showed: technologies that relied on AGLFN often showed mistakes because of failure to code a glyph name...which is a pity because correct info was available.) It would be better to drop the technology. Cheers!
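For readers unfamiliar with the ToUnicode stream discussed here: it is a CMap whose bfchar entries map a glyph code to a string of UTF-16BE code units, so one glyph (for example a half form) can map to several characters. The Python sketch below generates such a section; the glyph IDs are made up for illustration, and the helper bfchar_section is hypothetical rather than LibreOffice code.

```python
def bfchar_section(mapping: dict[int, str]) -> str:
    """Format glyph-code -> text pairs as one ToUnicode 'bfchar' section.

    Each destination is written as hex-encoded UTF-16BE, so a single glyph
    may decompose into several characters.
    """
    lines = [f"{len(mapping)} beginbfchar"]
    for glyph_id, text in sorted(mapping.items()):
        dst = text.encode("utf-16-be").hex().upper()
        lines.append(f"<{glyph_id:04X}> <{dst}>")
    lines.append("endbfchar")
    return "\n".join(lines)

# Hypothetical glyph IDs: 3 = half-form KA (KA + VIRAMA), 4 = plain KA.
print(bfchar_section({3: "\u0915\u094D", 4: "\u0915"}))
```

Note that such a table can only say what each glyph stands for; it cannot change the order in which the reader encounters the glyphs, which is why re-ordered vowel signs remain a problem.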
Khaled,

Several of the bugs pointed out are logic errors in the generation code (for sure the duplicated characters, and I think also the disappearing/reappearing one). These have nothing to do with glyph naming. I also pointed out that although Gargi and Lohit attempt (different) AGLFN schemes, each has bugs in that regard. This is part of my complaint with the AGLFN. In each case, there was sufficient information in the font's feature tables to produce ToUnicode entries which would have correctly decomposed the glyphs. Although LibreOffice PDF generation algorithms often use OpenType tables to populate ToUnicode, here the algorithms instead fell back to AGLFN, and failed. It would be best to prefer the OpenType features in building ToUnicode, and fall back to AGLFN only to break a tie, in case those features would specify more than one character string for a given glyph.

Another thought: How to tackle the re-ordering of glyphs (especially, the 'i' and 'ii' vowel signs) using ToUnicode? (I don't know if LibreOffice attempts something like this, I just see it's mostly wrong.) The idea is based on making compound glyphs in the internal representation of the PDF file -- they need not correspond to slots in the original font. When a glyph that needs re-ordering (as 'i' and 'ii') is detected, it should be possible to identify the following consonant cluster. The entire group, including the vowel and cluster, could be made a single glyph. Then the fake entry for that glyph in the ToUnicode stream would specify characters for the decomposed cluster, with the vowel re-ordered to the end of the cluster. Of course, identifying the cluster could be tricky in some cases, but in modern Devanagari at least, it usually consists of a few half-form consonants followed by a consonant, or else a single consonant ligature. (That may be all--need to consult Unicode ch. 9) And of course, there are other ways to do it!
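The compound-glyph idea in this comment can be sketched in Python as follows. The sketch assumes the glyph stream is already annotated with the text of each glyph and a coarse class ("prebase", "half", "base", "other"); that annotation, the class names, and the function logical_units are all hypothetical. It handles only the pre-base vowel sign and the simple cluster shape described above (half forms followed by one consonant or ligature).

```python
def logical_units(glyphs):
    """Group visually ordered glyphs into units whose text is in logical order.

    glyphs: list of (text, kind) pairs in visual order.
    Returns one string per unit; a pre-base vowel sign and the cluster that
    follows it become a single unit with the vowel moved to the end.
    """
    units = []
    i = 0
    while i < len(glyphs):
        text, kind = glyphs[i]
        if kind == "prebase":
            j = i + 1
            cluster = []
            while j < len(glyphs) and glyphs[j][1] == "half":
                cluster.append(glyphs[j][0])
                j += 1
            if j < len(glyphs) and glyphs[j][1] == "base":
                cluster.append(glyphs[j][0])
                units.append("".join(cluster) + text)  # vowel goes last
                i = j + 1
                continue
        units.append(text)  # ordinary glyph, or a vowel with no cluster found
        i += 1
    return units

# Visual order of the word from the report: BA, U sign, I sign, half DA, DHA.
visual = [("\u092C", "base"), ("\u0941", "other"), ("\u093F", "prebase"),
          ("\u0926\u094D", "half"), ("\u0927", "base")]
print("".join(logical_units(visual)))
```

Each returned unit would then get its own entry in the ToUnicode stream, so the reader never sees the vowel sign ahead of its cluster.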
Khaled, is this perhaps related to bug 62728, since adding support for PDF/A-2U will/should fix the problem? I also find that any Indic text does not get copied correctly from PDFs exported by LibO. Using latest release LibO 4.2.2 on Kubuntu Saucy.
I can’t find a complete specification of PDF/A-2 level U, but it seems to require preserving the Unicode representation of the text, which is indeed a goal shared with this bug as well.
** Please read this message in its entirety before responding **

To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.

There have been thousands of bug fixes and commits since anyone checked on this bug report. During that time, it's possible that the bug has been fixed, or the details of the problem have changed. We'd really appreciate your help in getting confirmation that the bug is still present.

If you have time, please do the following:

Test to see if the bug is still present on a currently supported version of LibreOffice (5.0.5 or 5.1.0) https://www.libreoffice.org/download/

If the bug is present, please leave a comment that includes the version of LibreOffice and your operating system, and any changes you see in the bug behavior

If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave a short comment that includes your version of LibreOffice and Operating System

Please DO NOT
   Update the version field
   Reply via email (please reply directly on the bug tracker)
   Set the bug's Status field to RESOLVED - FIXED (this status has a particular meaning that is not appropriate in this case)

If you want to do more to help you can test to see if your issue is a REGRESSION. To do so:
1. Download and install oldest version of LibreOffice (usually 3.3 unless your bug pertains to a feature added after 3.3) http://downloadarchive.documentfoundation.org/libreoffice/old/
2. Test your bug
3. Leave a comment with your results.
4a. If the bug was present with 3.3 - set version to "inherited from OOo";
4b. If the bug was not present in 3.3 - add "regression" to keyword

Feel free to come ask questions or to say hello in our QA chat: http://webchat.freenode.net/?channels=libreoffice-qa

Thank you for your help!

-- The LibreOffice QA Team

This NEW Message was generated on: 2016-02-21
This bug is still present. Tested on Windows 10 with LibreOffice Writer 5.2.1.2 and Adobe Acrobat Reader 11.0.17. I tested the same Devanagari text as reported by Steve White with a slightly different mix of fonts. The LOWriter document, resulting pdf and utf-8 text document with text copied from the pdf in Adobe Acrobat Reader and pasted in Notepad++ are attached. FYI, this problem has been solved in Xetex with the new \XeTeXgenerateactualtext feature - please see http://tug.org/pipermail/xetex/2016-February/026445.html for the announcement.
Created attachment 127393 [details] LOWriter document with a modified set of devanagari fonts
Created attachment 127394 [details] Exported PDF for LOWriter document with a modified set of devanagari fonts
Created attachment 127395 [details] Copied text from PDF for LOWriter document with a modified set of devanagari fonts
Comment on attachment 127395 [details] Copied text from PDF for LOWriter document with a modified set of devanagari fonts Sorry, this is the output from Save as text from Adobe Acrobat Reader. The copied text is being added in a different attachment.
Created attachment 127398 [details] Copied text from PDF for the new LOWriter document This is the attachment with the text copied from the pdf and pasted in Notepad++.
I tested this again today with Version: 5.4.4.2 (x64) Build ID: 2524958677847fb3bb44820e40380acbe820f960 CPU threads: 4; OS: Windows 6.19; UI render: default; Locale: hi-IN (en_IN); Calc: group The problem still exists. Please let me know what additional information is required. As I had mentioned earlier in this thread, this problem has been solved in Xetex with the new \XeTeXgenerateactualtext feature - please see http://tug.org/pipermail/xetex/2016-February/026445.html for the announcement.
Also tested with the pre-release version Version: 6.0.0.2 (x64) Build ID: 06b618bb6f431d27fd2def25aa19c833e29b61cd CPU threads: 4; OS: Windows 10.0; UI render: default; Locale: hi-IN (en_IN); Calc: group Problem has NOT been addressed in that also.
> As I had mentioned earlier in this thread, this problem has been solved in
> Xetex with the new \XeTeXgenerateactualtext feature - please see
> http://tug.org/pipermail/xetex/2016-February/026445.html for the
> announcement.

Here is a link to the actualtext branch for xetex on sourceforge: https://sourceforge.net/p/xetex/code/ci/actualtext/tree/
LibreOffice already has limited support for actual text, and I think it shouldn’t be hard to extend it and make it an option at least. If someone is interested in giving this a try, check the SetActualText() calls in sw/source/core/text/EnhancedPDFExportHelper.cxx.
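For context, /ActualText is attached to a marked-content sequence in the page content stream: the glyph-showing operators are wrapped in BDC ... EMC, and the property list carries the replacement text as a UTF-16BE string with a byte order mark. The Python sketch below only formats such a fragment; it is not what SetActualText() emits, and the function name and the glyph code in the example are assumptions made for illustration.

```python
def actual_text_span(text: str, show_ops: str) -> str:
    """Wrap content-stream operators in a /Span carrying /ActualText.

    The replacement text is written as a hex string: the UTF-16BE byte
    order mark (FEFF) followed by the UTF-16BE encoding of the text.
    """
    payload = ("\ufeff" + text).encode("utf-16-be").hex().upper()
    return f"/Span <</ActualText <{payload}>>> BDC\n{show_ops}\nEMC"

# Hypothetical glyph code <0003> standing for the letter KA.
print(actual_text_span("\u0915", "<0003> Tj"))
```

A reader that honours /ActualText copies the payload instead of whatever it would have derived from the glyphs inside the span, which is what makes the approach robust against re-ordering.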
Thank you @Khaled Hosny for your response and pointer to SetActualText() calls. I think this must be a problem not just for Hindi but for all complex scripts. Do you know whether the text copy paste from pdf works correctly for Arabic?
Code referred by Khaled can be viewed at https://github.com/LibreOffice/core/blob/master/sw/source/core/text/EnhancedPDFExportHelper.cxx#L761
(In reply to shreeshrii from comment #21)
> Thank you @Khaled Hosny for your response and pointer to SetActualText()
> calls.
>
> I think this must be a problem not just for Hindi but for all complex
> scripts.
>
> Do you know whether the text copy paste from pdf works correctly for Arabic?

Copying Arabic from PDF can work without /ActualText if the fonts are carefully prepared: they must use only one-to-one or many-to-one glyph substitutions, and name glyphs following the Adobe Glyph List specification (https://github.com/adobe-type-tools/agl-specification). This works only because no re-ordering happens in Arabic, and even then there are still issues with text direction.
(In reply to Khaled Hosny from comment #20)
> LibreOffice has limited support for actual text already and I think it
> shouldn’t be hard to extend it and make it an option at least. If someone is
> interested in giving this a try, check SetActualText() calls in
> sw/source/core/text/EnhancedPDFExportHelper.cxx.

I see that you had committed the code regarding soft hyphens using ActualText: https://github.com/LibreOffice/core/commit/4dba6f5837539746293ef6808ea39a764ab7654d

Since you are already familiar with the code base, would it be possible for you to extend it? It would really help out many users. Thanks!
(In reply to shreeshrii from comment #24)
> (In reply to Khaled Hosny from comment #20)
> > LibreOffice has limited support for actual text already and I think it
> > shouldn’t be hard to extend it and make it an option at least. If someone is
> > interested in giving this a try, check SetActualText() calls in
> > sw/source/core/text/EnhancedPDFExportHelper.cxx.
>
> I see that you had committed the code regarding soft hyphens using
> ActualText.
> https://github.com/LibreOffice/core/commit/4dba6f5837539746293ef6808ea39a764ab7654d
>
> Since you are already familiar with the code base, would it be possible for
> you to extend it? It would really help out many users. Thanks!

No time, unfortunately.
*** Bug 115117 has been marked as a duplicate of this bug. ***
Created attachment 139317 [details] Sample text in multiple Indian scripts - ODT file
Created attachment 139318 [details] Sample text in multiple Indian scripts - Exported PDF
Created attachment 139319 [details] Sample text in multiple Indian scripts - Text copied from exported pdf
This problem is not limited to just Hindi. Rather it applies to all Indian language scripts, other Indic scripts and probably other complex scripts too. I have attached a sample showing the errors in copied text in Devanagari, Bengali, Gujarati, Gurmukhi, Kannada, Malayalam, Tamil and Telugu scripts. Sample text in multiple Indian scripts - ODT file Sample text in multiple Indian scripts - Exported PDF Sample text in multiple Indian scripts - Text copied from exported pdf
Created attachment 139320 [details] Sample text in multiple Indian scripts - Original Text copied from ODT This is the original text - ground truth that can be compared with the exported text from the pdf. Extending the Actualtext feature in pdfwriter can fix the issue. However, I do not know enough about the code or C++ to provide a patch.
*** Bug 62846 has been marked as a duplicate of this bug. ***
Just FYI, for users looking for a solution. https://www.wikihow.com/index.php?title=Create-a-Searchable-Hindi-PDF-Using-Lyx-with-Xetex
*** Bug 116056 has been marked as a duplicate of this bug. ***
*** Bug 116284 has been marked as a duplicate of this bug. ***
Even pure Latin script (e.g. U.S. English) can give wrong copy-and-paste results. Reportedly has something to do with ligatures, e.g. "tt" becoming "t" etc. but only in tables or some specific constructs. Can someone confirm that that is the same underlying bug as this one? Ref: bug 116284 https://bugs.documentfoundation.org/show_bug.cgi?id=116284
*** Bug 116490 has been marked as a duplicate of this bug. ***
https://gerrit.libreoffice.org/#/c/53315/
Khaled Hosny committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=c688b01d9102832226251fc84045408afe392459 tdf#66597 Fix PDF text extraction for complex text It will be available in 6.1.0. The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
Congrats to Khaled and Miklos and Tomaž for all those patch sets and related changes. I guess backport to 6.0 is not to be expected.
(In reply to Timur from comment #40) > Congrats to Khaled and Miklos and Tomaž for all those patch sets and related > changes. > I guess backport to 6.0 is not to be expected. Too many changes to backport, also strictly speaking this is a new feature not a bug fix.
(In reply to Commit Notification from comment #39) > Khaled Hosny committed a patch related to this issue. > It has been pushed to "master": > > It will be available in 6.1.0. > > The patch should be included in the daily builds available at > http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More > information about daily builds can be found at: > http://wiki.documentfoundation.org/Testing_Daily_Builds > > Affected users are encouraged to test the fix and report feedback. Thank you, Khaled Hosny, for implementing this. Which version should I download to test this on windows10?
(In reply to shreeshrii from comment #42) > Which version should I download to test this on windows10? You can get the build from here: https://dev-builds.libreoffice.org/daily/master/ Mind the build channel and date.
(In reply to Volga from comment #43) > (In reply to shreeshrii from comment #42) > > Which version should I download to test this on windows10? > You can get the build from here: > https://dev-builds.libreoffice.org/daily/master/ > Mind the build channel and date. Thank you, @Volga. I installed Version: 6.1.0.0.alpha1+ (x64) Build ID: a6a38c6de9c18fd1269fc8cfc0e070ef429c8e2f on Windows 10
Created attachment 141740 [details] Devanagari QA files
(In reply to shreeshrii from comment #45) > Created attachment 141740 [details] > Devanagari QA files The zip file has the original text file, same copied to a Libre Office document in different fonts and exported to pdf without any PDF options checked. I will post below the first two lines of the file under different scenarios: 1. Original text Mangal नित्यानन्दकरी वराभयकरी सौन्दर्यरत्नाकरी । निर्धूताखिलघोरपावनकरी प्रत्यक्षमाहेश्वरी ।। 2. Saving from Adobe Reader DC to text does not extract any Devanagari text. Some control characters are output along with ... where Devanagari should be. Mangal ............ ........ ............ .. 3. Copying from Adobe Reader DC to pasting in Notepad++ utf-8 document transfers the Devanagari text. There are the additional control characters as in the above case. Mangal नित्यानन्दकरी वराभयकरी सौन्दर्यरत्नाकरी । निर्धूताखिलघोरपावनकरी प्रत्यक्षमाहेश्वरी ।। 4. Copying from Chrome Browser to pasting in Notepad++ utf-8 document transfers the Devanagari text. There are the additional control characters, some are different from the Adobe Reader case. Some Devanagari characters are missing. Mangal नित्यान्दकरी वराभयकरी सौन्दर्यरत्ाकरी । निर्धूताखि लघोरपावकरी प्रत्यक्षमाहेश्वरी ।। 5. Copying from Microsoft Edge Browser to pasting in Notepad++ utf-8 document transfers the Devanagari text. There are fewer control characters addded. More Devanagari characters are missing. Mangal नित्यान्दकरी वराभयकरी सौन्दयरत्ाकरी । निर्धूताखिलघोरपावकरी प्रत्यक्षमाहेश्वरी ।। I am reopening the bug. Please let me know if any additional information is needed. I have NOT tested with any other script/language.
(In reply to shreeshrii from comment #46) > (In reply to shreeshrii from comment #45) > > Created attachment 141740 [details] > > Devanagari QA files > > The zip file has the original text file, same copied to a Libre Office > document in different fonts and exported to pdf without any PDF options > checked. > > I will post below the first two lines of the file under different scenarios: > > 1. Original text > > Mangal > नित्यानन्दकरी वराभयकरी सौन्दर्यरत्नाकरी । निर्धूताखिलघोरपावनकरी > प्रत्यक्षमाहेश्वरी ।। > > 2. Saving from Adobe Reader DC to text does not extract any Devanagari text. > Some control characters are output along with ... where Devanagari should be. > > Mangal > ............ ........ ............ .. That is a bug in Adobe Reader. > 3. Copying from Adobe Reader DC to pasting in Notepad++ utf-8 document > transfers the Devanagari text. There are the additional control characters > as in the above case. > > Mangal > नित्यानन्दकरी वराभयकरी सौन्दर्यरत्नाकरी । निर्धूताखिलघोरपावनकरी > प्रत्यक्षमाहेश्वरी ।। Thanks for testing! there was a typo in the code that seems to have gotten into some late iterations that I failed to test properly. Should be fixed shortly (with a test case, to prevent such breakage in the future). > 4. Copying from Chrome Browser to pasting in Notepad++ utf-8 document > transfers the Devanagari text. There are the additional control characters, > some are different from the Adobe Reader case. Some Devanagari characters > are missing. > > Mangal > नित्यान्दकरी वराभयकरी सौन्दर्यरत्ाकरी । निर्धूताखि लघोरपावकरी > प्रत्यक्षमाहेश्वरी ।। Chrome’s PDF reader does not suport /ActualText, so the changes here are unlikely to help that much. There is nothing we can do about it, unfortunately (apart from reporting to Chrome developers, of course). > 5. Copying from Microsoft Edge Browser to pasting in Notepad++ utf-8 > document transfers the Devanagari text. There are fewer control characters > addded. More Devanagari characters are missing. 
> > Mangal नित्यान्दकरी वराभयकरी सौन्दयरत्ाकरी । निर्धूताखिलघोरपावकरी > प्रत्यक्षमाहेश्वरी ।। I don’t know if this supports /ActualText or not, please wait for the next fix and re-test.
This should be fixed now, please retest.
Created attachment 141756 [details] Devanagari QA2

I tested with
Version: 6.1.0.0.alpha1+ (x64)
Build ID: 3bb3a9849c4262946013684495b18c0aa07e33c8
CPU threads: 4; OS: Windows 10.0; UI render: default;
TinderBox: Win-x86_64@42, Branch:master, Time: 2018-04-29_02:56:02
Locale: en-IN (en_IN); Calc: group

The problem of additional control characters and missing characters still exists. The copied text is exactly the same as posted in the earlier zip file.

To help isolate the problem, I am attaching a new zip file for QA2. It has the same Devanagari text but only in one font. I have included the original utf-8 text, the odt file and the generated pdf. For comparison I have also included a xelatex source file which has the same Devanagari text, and the pdf generated from it.

Using Adobe Reader DC to open the xelatex-generated pdf, the highlighting is not correct when selecting the text; however, the copied and pasted text is complete and without the extra control characters.

Using Adobe Reader DC to open the LibreOffice-generated pdf, the highlighting is complete when selecting the text; however, the copied and pasted text has the extra control characters as well as missing Devanagari characters.

There are differences in the control characters etc. depending on the font. Different Devanagari fonts may not have standard ToUnicode tables, so it may be better to rely on ActualText fully.

You can install the Devanagari font used from https://packages.ubuntu.com/bionic/fonts-sahadeva

Please let me know if you need any additional information.
(In reply to shreeshrii from comment #49) > Created attachment 141756 [details] > Devanagari QA2 > > I tested with > Version: 6.1.0.0.alpha1+ (x64) > Build ID: 3bb3a9849c4262946013684495b18c0aa07e33c8 > CPU threads: 4; OS: Windows 10.0; UI render: default; > TinderBox: Win-x86_64@42, Branch:master, Time: 2018-04-29_02:56:02 > Locale: en-IN (en_IN); Calc: group That build does not have the fix yet. The patch was submitted just this morning, please wait a bit more and try again.
(In reply to Khaled Hosny from comment #50) > (In reply to shreeshrii from comment #49) > > Created attachment 141756 [details] > > Devanagari QA2 > > > > I tested with > > Version: 6.1.0.0.alpha1+ (x64) > > Build ID: 3bb3a9849c4262946013684495b18c0aa07e33c8 > > CPU threads: 4; OS: Windows 10.0; UI render: default; > > TinderBox: Win-x86_64@42, Branch:master, Time: 2018-04-29_02:56:02 > > Locale: en-IN (en_IN); Calc: group > > That build does not have the fix yet. The patch was submitted just this > morning, please wait a bit more and try again. Please try now, the last build should have the fix.
Created attachment 141772 [details] Devanagari QA3 files

I tested with
Version: 6.1.0.0.alpha1+ (x64)
Build ID: 5f2073fbc995fb619f398a55187413813578b62e
CPU threads: 4; OS: Windows 10.0; UI render: default;
TinderBox: Win-x86_64@42, Branch:master, Time: 2018-04-30_00:51:08
Locale: en-IN (en_IN); Calc: group

Thank you, Khaled Hosny. Your new patch is applied in this build. The results are much improved.

1. In Adobe Reader, the control characters have disappeared. All Devanagari characters and glyphs are displaying. However, there are certain extra spaces within words. These seem related to certain consonant conjunct glyphs and the combining i mark (which is repositioned before the consonants). Their location seems to change based on the fonts used. I have created a wdiff file with the original text vs the text copied from Adobe Reader. Here are the top few lines in it:

Mangal
[-नित्यानन्दकरी-] {+नि त्यानन्दकरी+} वराभयकरी सौन्दर्यरत्नाकरी । [-निर्धूताखिलघोरपावनकरी-] {+नि र्धूताखि लघोरपावनकरी+} प्रत्यक्षमाहेश्वरी ।।
[-अग्निशामक अभिज्ञान-] {+अग्नि शामक अभि ज्ञान+} अनुक्रम काष्ठवाद्य [-अंतर्राष्ट्रीय-] {+अंतर्रा ष्ट्रीय+} ख़ूँखार [-मूत्रविज्ञान द्विध्रुव-] {+मूत्रवि ज्ञान द्वि ध्रुव+}

2. Chrome is still displaying some control characters. wdiff is included.

3. The pdf generated by xelatex allows text to be copied correctly.
A few issues still remain with Devanagari text being copy-pasted. I have not tested with other Indian scripts yet.
(In reply to Shree Devi Kumar from comment #52)
> Created attachment 141772 [details]
> Devanagari QA3 files
>
> I tested with
> Version: 6.1.0.0.alpha1+ (x64)
> Build ID: 5f2073fbc995fb619f398a55187413813578b62e
> CPU threads: 4; OS: Windows 10.0; UI render: default;
> TinderBox: Win-x86_64@42, Branch:master, Time: 2018-04-30_00:51:08
> Locale: en-IN (en_IN); Calc: group
>
> Thank you, Khaled Hosny. Your new patch is applied in this build. The
> results are much improved.
>
> 1. In Adobe Reader, The control characters have disappeared. All Devanagari
> characters and glyphs are displaying. However, there certain extra spaces
> within words.

That is a bug in the reader: it tries to guess spaces based on some threshold distances between glyphs. It is a heuristic and it often fails. The only way I know to fix this is to use /ActualText per word, but this completely breaks the ability to select individual characters inside the word, so it is out of the question, at least by default. It might be a good idea to have an option to do this; please open a new issue if you are interested in such an option.

> 2. Chrome is still displaying some control characters. wdiff is included.

Again bug(s) in the reader, not sure if there is anything we can do here.

> 3. The pdf generated by xelatex allows text to be copied correctly.

But you can’t select individual characters (or grapheme clusters) as it embeds /ActualText per word (see above).
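The space-guessing heuristic described in this reply can be pictured with the Python sketch below. It is only a toy model: the threshold value, the coordinate units, and the function guess_text are assumptions, and real readers use more elaborate rules. It does show why a glyph that is shifted during shaping (such as a re-positioned vowel sign) can end up with a spurious space next to it.

```python
def guess_text(glyphs, threshold=0.25):
    """Rebuild text from positioned glyphs, inserting a space at large gaps.

    glyphs: list of (text, x_start, x_end) in reading order, in units where
    the font size is 1. A gap wider than `threshold` is taken to be a space.
    """
    out = []
    prev_end = None
    for text, x_start, x_end in glyphs:
        if prev_end is not None and x_start - prev_end > threshold:
            out.append(" ")
        out.append(text)
        prev_end = x_end
    return "".join(out)

# The third glyph starts 0.4 units after the second one ends.
print(guess_text([("a", 0.0, 0.5), ("b", 0.5, 1.0), ("c", 1.4, 1.9)]))
```

Because the PDF stores no explicit word boundaries in this model, any threshold is a trade-off between missing real spaces and inventing false ones.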
(In reply to Shree Devi Kumar from comment #53) > A few issues still remain with Devanagari text being copy-pasted. I have not > tested with other Indian scripts yet. You can get the sample from here: http://www.gnu.org/software/freefont/ranges/
(In reply to Volga from comment #55) > (In reply to Shree Devi Kumar from comment #53) > > A few issues still remain with Devanagari text being copy-pasted. I have not > > tested with other Indian scripts yet. > > You can get the sample from here: > http://www.gnu.org/software/freefont/ranges/ Volga, Thank you for the link. I have created a test document from the same.
(In reply to Khaled Hosny from comment #54)
> That is a bug in the reader, it tries to guess spaces based on some
> threshold distances between glyphs. It is a heuristic and it often fails.

I was using Adobe Reader as the best-case scenario. Is there any other viewer that works correctly for copying and extracting text from generated pdfs for complex scripts? Which viewer do you test with?

> The only way I know to fix this is to use /ActualText per word, but this
> completely breaks the ability to select individual characters inside the
> word, so it is out of question, at least by default.
> It might be a good idea
> to have an option to do this, please open a new issue if you are interested
> in such an option.

I think such an option should be applied internally by the program based on the languages/scripts, since a number of Indic/complex scripts have the same problem. I will add a zip file with test cases for various Indic scripts including Devanagari.

While looking for a viewer/reader of pdfs, I read that LibreOffice supports opening pdf files. I tried opening the generated pdf with the daily build and it showed a number of errors (it was opened in LibreDraw). I can open a new issue for that, though I haven't quite figured out how to copy the text from a text box in it.

Thanks, Khaled. I appreciate your efforts in fixing this issue, which has been open for 5 years!
Created attachment 141808 [details] Indic including Devanagari - QA4

Marked Older attachments as Obsolete.

This zip file has two .odt documents: one has the text used earlier in Devanagari QA1-3, the other uses samples from the freefont site, suggested by Volga. wdiff is provided for both sample documents. The summary below gives an idea of the differences.

indic-freefont-sample-qa4.txt: 536 words, 371 (69%) common, 0 (0%) deleted, 165 (31%) changed
indic-freefont-sample-qa4.adobe-reader.txt: 814 words, 371 (46%) common, 0 (0%) inserted, 443 (54%) changed

indic-pdf-export-qa4.txt: 242 words, 151 (62%) common, 0 (0%) deleted, 91 (38%) changed
indic-pdf-export-qa4.adobe-reader.txt: 362 words, 151 (42%) common, 0 (0%) inserted, 211 (58%) changed

Languages Tested
----------------
Devanagari Script – Hindi, Sanskrit, Marathi, Nepali languages
Bengali Script - Assamese, Bengali
Gurmukhi script – Panjabi/Punjabi language
Gujarati
Kannada
Malayalam
Oriya
Telugu
Burmese
Khmer
Sinhala
Tamil
Thaana
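For anyone who wants to reproduce this kind of comparison without wdiff, a rough Python equivalent using the standard difflib module is sketched below. It counts only the words common to both texts, so its numbers will not necessarily match the wdiff statistics above exactly; the function name word_stats and the sample strings are illustrative.

```python
import difflib

def word_stats(original: str, extracted: str) -> dict:
    """Count words in each text and the words matched in order by difflib."""
    a, b = original.split(), extracted.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    common = sum(block.size for block in matcher.get_matching_blocks())
    return {"original_words": len(a),
            "extracted_words": len(b),
            "common": common}

# A spurious space splits one word in two, as in the wdiff excerpts above.
print(word_stats("alpha beta gamma", "al pha beta gamma"))
```

A word broken by an inserted space counts as changed on both sides, which is why the extracted file reports more words than the original while the number of common words stays the same.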
Created attachment 141809 [details] RTL Languages - Arabic, Hebrew QA5

This zip file has an .odt with Arabic and Hebrew samples taken from the freefont page. The .txt, the .pdf and the text copied from Adobe Reader are included along with the wdiff.

For Arabic, most of the errors seem to be related to the use of ( and ) with the Arabic text. The number of errors is much smaller.

RTL-pdf-export-QA5.txt: 407 words, 347 (85%) common, 0 (0%) deleted, 60 (15%) changed
RTL-pdf-export-QA5.adobe-reader.txt: 424 words, 347 (82%) common, 2 (0%) inserted, 75 (18%) changed
(In reply to Shree Devi Kumar from comment #57)
> (In reply to Khaled Hosny from comment #54)
>
> > That is a bug in the reader, it tries to guess spaces based on some
> > threshold distances between glyphs. It is a heuristic and it often fails.
>
> I was using Adobe Reader as the best case scenario.
>
> Is there any other viewer which work correctly to copy and extract text from
> generated pdfs for complex scripts?
>
> Which viewer do you test with?

Viewers based on Poppler seem to be good, comparable to Adobe’s at least.

> > The only way I know to fix this is to use /ActualText per word, but this
> > completely breaks the ability to select individual characters inside the
> > word, so it is out of question, at least by default.
>
> > It might be a good idea
> > to have an option to do this, please open a new issue if you are interested
> > in such an option.
>
> I think such an option should be used internally by the program based on the
> languages/scripts, since a number of Indic/Complex scripts are having the
> same problem.

The extra space issue can happen with any script, even Latin; I have certainly seen it with purely Latin text. The solution comes with a big downside, so I’d not want to do it automatically.

> I will add a zip file with test cases for various Indic scripts including
> Devanagari.
>
> While looking for a viewer/reader of pdfs, I read that LibreOffice supports
> opening of pdf files. I tried opening the generated pdf through the daily
> build and it showed a number of errors (it was opened in LibreDraw). I can
> open a new issue for that, though I haven't quite figured out how to copy
> the text from text box in it.

There are several issues open about this, but it is a completely different matter. LibreOffice is trying to convert PDFs into editable documents (which is a lost cause, if you ask for my opinion), and that has its own set of issues.
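To illustrate the kind of gap-threshold heuristic discussed in this comment, here is a minimal Python sketch. It is a toy model, not the actual algorithm of Adobe Reader or Poppler; the Glyph class, the extract_text function and the gap_ratio value are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    text: str     # Unicode text the glyph maps to
    x: float      # left edge on the baseline, in points
    width: float  # advance width, in points


def extract_text(glyphs, font_size: float, gap_ratio: float = 0.2) -> str:
    """Rebuild a line of text, guessing a space wherever the horizontal gap
    between two glyphs exceeds gap_ratio * font_size (hypothetical threshold)."""
    out = []
    prev_end = None
    for glyph in sorted(glyphs, key=lambda g: g.x):
        if prev_end is not None and glyph.x - prev_end > gap_ratio * font_size:
            out.append(" ")
        out.append(glyph.text)
        prev_end = glyph.x + glyph.width
    return "".join(out)


# Tightly set glyphs come back as one word.
tight = [Glyph("a", 0, 5), Glyph("b", 5, 5), Glyph("c", 11, 5)]
# A positioning adjustment of a few points inside a word is enough to
# cross the threshold and produce a space that was never in the text.
loose = [Glyph("a", 0, 5), Glyph("b", 5, 5), Glyph("c", 13, 5)]
print(extract_text(tight, font_size=10))  # abc
print(extract_text(loose, font_size=10))  # ab c
```

Any fixed threshold of this kind is a trade-off: lowering it splits words whose glyphs were shifted by positioning, raising it merges words that are set tightly. Nothing in it is specific to one script.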
(In reply to Shree Devi Kumar from comment #59)
> Created attachment 141809 [details]
> RTL Languages - Arabic, Hebrew QA5
>
> This zip file has an .odt with Arabic and Hebrew samples taken from the
> freefont page. .txt, .pdf and the text copied from adobe reader are included
> alongwith the wdiff.
>
> For Arabic, most of the errors seem to be related to usage of ( and ) with
> the Arabic text. The number of errors is much smaller.
>
>
> RTL-pdf-export-QA5.txt:
> 407 words
> 347 85% common
> 0 0% deleted
> 60 15% changed
> RTL-pdf-export-QA5.adobe-reader.txt:
> 424 words
> 347 82% common
> 2 0% inserted
> 75 18% changed

That is much better than I’d have expected for RTL, which is a totally different beast: PDF documents contain the final visual result (after applying the bidirectional algorithm), so the logical order of the text is lost. The viewer has to re-apply the bidirectional algorithm in reverse, which will almost always fail for some cases. The only way to preserve the original text in its entirety is by using /ActualText with the whole paragraph (not just words or lines).
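A small Python sketch of why undoing the visual order is guesswork. It uses the convention common in bidi examples where uppercase ASCII letters stand in for RTL letters, and it assumes the extractor sees glyphs in left-to-right page order and maps brackets by their visible shape. Both functions are made up for this example and are far simpler than what a real viewer does.

```python
import re


def naive_reverse(visual: str) -> str:
    """Treat the whole line as RTL and simply reverse the visual order."""
    return visual[::-1]


def reverse_keep_digits(visual: str) -> str:
    """Reverse the line, then restore digit runs, which are laid out
    left to right even inside RTL text."""
    reversed_text = visual[::-1]
    return re.sub(r"\d+", lambda m: m.group(0)[::-1], reversed_text)


# Uppercase letters stand in for RTL letters (illustrative convention).
logical = "ABC 123 DEF"   # order in which the text was typed
visual = "FED 123 CBA"    # left-to-right order of glyphs on the page

print(naive_reverse(visual))        # ABC 321 DEF  (number is mangled)
print(reverse_keep_digits(visual))  # ABC 123 DEF  (matches logical)

# Brackets are mirrored when displayed in RTL text, so a heuristic that
# handles digits still gets "ABC (DEF)" wrong.
print(reverse_keep_digits("(FED) CBA"))  # ABC )DEF(
```

Each special case that gets patched exposes another one, which fits the observation that most of the Arabic errors involve ( and ).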
(In reply to Shree Devi Kumar from comment #58)
> Created attachment 141808 [details]
> Indic including Devanagari - QA4
>
> Marked Older attachments as Obsolete.
>
> This zip file has two .odt documents, one has the text used earlier in
> Devanagari QA1-3, the other uses samples from the freefont site, suggested
> by Volga.
>
> wdiff is provided for both sample documents. The summary below gives an idea
> of differences.
>
> indic-freefont-sample-qa4.txt:
> 536 words
> 371 69% common
> 0 0% deleted
> 165 31% changed
> indic-freefont-sample-qa4.adobe-reader.txt:
> 814 words
> 371 46% common
> 0 0% inserted
> 443 54% changed
>
> indic-pdf-export-qa4.txt:
> 242 words
> 151 62% common
> 0 0% deleted
> 91 38% changed
> indic-pdf-export-qa4.adobe-reader.txt:
> 362 words
> 151 42% common
> 0 0% inserted
> 211 58% changed

Thanks for testing, I really appreciate it. All the changes seem to be space-related. Tamil looks the worst, but after careful examination it also seems to be space-related; the extra spaces just happen to land in very unfortunate places that wreak havoc with cluster formation.
Khaled,
Thank you for your responses.

What do you suggest as next steps to resolve this issue?

Is it possible to implement /ActualText at word level, based on the language/script being used or on the Unicode range?
(In reply to Shree Devi Kumar from comment #63)
> Khaled,
> Thank you for your responses.
>
> What do you suggest as next steps to resolve this issue?
>
> Is it possible to implement /Actualtext at word level based on
> language/script being used/ based on unicode range.

My suggestion is to add an option to the PDF export dialog to do ActualText per word (per sentence might be harder with the current implementation).

Please open a new issue for this; this one is getting too long and further fixes are out of scope of the original issue. We would also need the UX team to give their opinion about the UI.
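For readers unfamiliar with /ActualText, here is a minimal Python sketch of what a per-word wrapper could look like in a PDF content stream. It only builds the marked-content sequence as a string and is not LibreOffice's implementation. The function name actual_text_span and the glyph operators passed to it ("<0012> Tj") are placeholders made up for this example.

```python
def actual_text_span(text: str, glyph_ops: str) -> str:
    """Wrap glyph-showing operators in a marked-content sequence whose
    /ActualText entry carries the original Unicode text.

    The text is written as a hex string in UTF-16BE with a leading byte
    order mark (FEFF), one of the encodings PDF accepts for text strings.
    """
    hex_text = ("\ufeff" + text).encode("utf-16-be").hex().upper()
    return f"/Span << /ActualText <{hex_text}> >> BDC\n{glyph_ops}\nEMC"


# One wrapper per word; "<0012> Tj" is a placeholder for the operators
# that actually paint the word's glyphs.
print(actual_text_span("\u0915", "<0012> Tj"))
# /Span << /ActualText <FEFF0915> >> BDC
# <0012> Tj
# EMC
```

A viewer that honors /ActualText returns the wrapped string as a unit when the word is copied, which is why selecting individual characters inside the word stops working.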
(In reply to Khaled Hosny from comment #64)
> (In reply to Shree Devi Kumar from comment #63)
> > Khaled,
> > Thank you for your responses.
> >
> > What do you suggest as next steps to resolve this issue?
> >
> > Is it possible to implement /Actualtext at word level based on
> > language/script being used/ based on unicode range.
>
> My suggestion is to add an option to PDF export dialog to do ActualText per
> word (per sentence might be harder with the current implementation).
>
> Please open a new issue for this, this one is getting too long and further
> fixes are out of scope of the original issue. We would need also the UX team
> to give there opinion about the UI.

OK, opened a new bug at https://bugs.documentfoundation.org/show_bug.cgi?id=117428
*** Bug 124191 has been marked as a duplicate of this bug. ***