Bug 117428 - add an option to PDF export dialog to do ActualText per word
Summary: add an option to PDF export dialog to do ActualText per word
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Printing and PDF export
Version (earliest affected): 6.1.0.0.alpha1+
Hardware: All All
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on: 66597
Blocks: RTL-CTL PDF-Export PDF-Export-Options-Dialog
Reported: 2018-05-04 14:41 UTC by Shree Devi Kumar
Modified: 2023-08-25 15:26 UTC (History)
CC List: 7 users

See Also:
Crash report or crash signature:


Attachments
Result of OP STR as pasted to Notepad++ UTF-8 (31.96 KB, image/png)
2021-07-20 15:10 UTC, V Stuart Foote
results of testing on Ubuntu 18.04 with LO 7.3 alpha and Evince as PDF viewer (54.78 KB, application/vnd.oasis.opendocument.text)
2021-07-21 13:03 UTC, Stéphane Guillou (stragu)
PDF as exported by LO 7.3 on Ubuntu 18.04 (30.63 KB, application/pdf)
2021-07-21 13:04 UTC, Stéphane Guillou (stragu)

Description Shree Devi Kumar 2018-05-04 14:41:11 UTC
Description:
A new feature added to 6.1.0 by Khaled Hosny allows text to be copied and extracted from PDFs using ActualText. However, it does not work completely for complex scripts.

ActualText per word has been suggested as a possible solution. Khaled has suggested that this be done via an option in the PDF export dialog to write ActualText per word, rather than by default.

Steps to Reproduce:
1. Use the following text for testing.
Devanagari Script – 
Hindi, Sanskrit, Marathi, Nepali languages
नित्यानन्दकरी वराभयकरी सौन्दर्यरत्नाकरी । निर्धूताखिलघोरपावनकरी प्रत्यक्षमाहेश्वरी ।। 
अग्निशामक अभिज्ञान अनुक्रम काष्ठवाद्य अंतर्राष्ट्रीय ख़ूँखार मूत्रविज्ञान द्विध्रुव 
2. Open a new .odt file in LibreOffice, then copy and paste the above text.
3. Export to PDF.
4. Open the PDF in Acrobat Reader.
5. Copy the text and paste it into a text editor.
6. Compare with the original UTF-8 text.

Actual Results:  
Devanagari Script –
Hindi, Sanskrit, Marathi, Nepali languages
नि त्यानन्दकरी वराभयकरी सौन्दर्यरत्नाकरी । नि र्धूताखि लघोरपावनकरी प्रत्यक्षमाहेश्वरी ।।
अग्नि शामक अभि ज्ञान अनुक्रम काष्ठवाद्य अंतर्रा ष्ट्र ीय ख़ूँखार मूत्रवि ज्ञान द्वि ध्रुव

Expected Results:
The text should be the same as original.

Devanagari Script – 
Hindi, Sanskrit, Marathi, Nepali languages
नित्यानन्दकरी वराभयकरी सौन्दर्यरत्नाकरी । निर्धूताखिलघोरपावनकरी प्रत्यक्षमाहेश्वरी ।। 
अग्निशामक अभिज्ञान अनुक्रम काष्ठवाद्य अंतर्राष्ट्रीय ख़ूँखार मूत्रविज्ञान द्विध्रुव 


Reproducible: Always


User Profile Reset: No



Additional Info:
The following wdiff output shows the difference.

======================================================================

[-नित्यानन्दकरी-]
{+नि त्यानन्दकरी+}
======================================================================
 [-निर्धूताखिलघोरपावनकरी-] {+नि र्धूताखि लघोरपावनकरी+}
======================================================================
 
[-अग्निशामक अभिज्ञान-]
{+अग्नि शामक अभि ज्ञान+}
======================================================================
 [-अंतर्राष्ट्रीय-] {+अंतर्रा ष्ट्र ीय+}
======================================================================
 [-मूत्रविज्ञान द्विध्रुव-] {+मूत्रवि ज्ञान द्वि ध्रुव+}
======================================================================
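
A comparison like the wdiff output above can be approximated with Python's standard library. This is only a sketch, not the tool used for this report: the function names `word_diff` and `differs_only_by_spaces` are made up for illustration, and splitting on whitespace is an assumed stand-in for proper word segmentation.

```python
import difflib


def word_diff(original: str, extracted: str) -> list[str]:
    """Return wdiff-style markers for runs of words that differ."""
    a = original.split()
    b = extracted.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        removed = " ".join(a[i1:i2])
        added = " ".join(b[j1:j2])
        changes.append(f"[-{removed}-] {{+{added}+}}")
    return changes


def differs_only_by_spaces(original: str, extracted: str) -> bool:
    """True if the two strings are identical once all whitespace is removed."""
    return "".join(original.split()) == "".join(extracted.split())


if __name__ == "__main__":
    original = "अग्निशामक अभिज्ञान अनुक्रम"
    extracted = "अग्नि शामक अभि ज्ञान अनुक्रम"
    for change in word_diff(original, extracted):
        print(change)
    print(differs_only_by_spaces(original, extracted))
```

The second helper is useful for confirming that a reader only inserted spurious spaces inside words, rather than dropping or reordering characters.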

Please see https://bugs.documentfoundation.org/attachment.cgi?id=141808 for more examples with many other Indic/Complex scripts.

I tested with:
> Version: 6.1.0.0.alpha1+ (x64)
> Build ID: 5f2073fbc995fb619f398a55187413813578b62e
> CPU threads: 4; OS: Windows 10.0; UI render: default; 
> TinderBox: Win-x86_64@42, Branch:master, Time: 2018-04-30_00:51:08
> Locale: en-IN (en_IN); Calc: group



User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36
Comment 1 Heiko Tietze 2018-05-06 09:20:52 UTC Comment hidden (off-topic)
Comment 2 Heiko Tietze 2018-05-06 09:22:23 UTC
What input from UX do you expect, Khaled?
Comment 3 V Stuart Foote 2018-05-06 15:08:28 UTC
The change is internal to the PDF export filter; when enabled, it will produce a much larger PDF. But that PDF will be more useful to users needing to copy out text with reasonable word bounds, especially for complex-script languages, which is what drove bug 66597.

The UX questions now are whether the work on tagging PDFs with /ActualText should:

1. be done only for CTL?  => NO (see 6.)

2. be toggled active by default? => NO (all PDFs would balloon in size)

3. receive its own check box control on the PDF export dialog?  => NO

4. alternatively, be merged into the "Tagged PDF (add document structure)" checkbox?  => YES (pending export dialog work needed for bug 45636)

5. rely only on ICU lib recognition of the intended language/script? => NO (insufficient granularity as to /Lang tagging for non-CTL scripts)

6. or, be recognized as a component of supporting a11y? Eventual support of ISO 14289-1 PDF/UA (bug 45636) will require accurate /Lang tagging, so ICU lib Unicode block detection must be coordinated with the locale/language (BCP 47/ISO 639 [1][2][3]) as set by locale or by Paragraph from the GUI, for fidelity of non-CTL scripts.  => YES


=-ref-=
[1] https://opengrok.libreoffice.org/xref/core/i18nlangtag/source/isolang/isolang.cxx#170
[2] https://opengrok.libreoffice.org/xref/core/include/i18nlangtag/mslangid.hxx
[3] https://opengrok.libreoffice.org/xref/core/include/rtl/locale.h
Comment 4 ⁨خالد حسني⁩ 2018-05-06 15:11:50 UTC
(In reply to Heiko Tietze from comment #2)
> What input from UX do you expect, Khaled?

What V Stuart Foote said, plus: 1) Do we want this option or not? 2) What exact wording should we use? /ActualText is jargon (it refers to a specific PDF construct) that I’m not sure should be exposed in the user UI.
Comment 5 Shree Devi Kumar 2018-05-06 15:51:08 UTC
Regarding the expected ballooning in PDF size, please see:

http://tug.org/pipermail/xetex/2016-February/026445.html

On 23/2/16 02:54, Andrew Cunningham wrote:
> It would probably more than double, i was under the impression that
> ActualText was a tag attrubute, so extensive tagging would be needed,
> and actual text added to the tags.

The ActualText tagging is highly compressible, so in practice the 
increase in overall PDF size is not all that great.
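
The compressibility claim can be explored with a toy measurement. The sketch below is an assumption-laden illustration, not a benchmark: `span_open` and `tagging_overhead` are hypothetical helper names, only the tagging overhead is counted (no glyph operators), and the repeated sample words exaggerate how well real text would compress.

```python
import zlib


def span_open(word: str) -> bytes:
    """Opening marked-content operator with the word as /ActualText.

    The text is written as a PDF hex string in UTF-16BE with a leading BOM.
    """
    payload = ("\ufeff" + word).encode("utf-16-be").hex().upper()
    return f"/Span <</ActualText <{payload}>>> BDC\n".encode("ascii")


def tagging_overhead(words: list[str]) -> tuple[int, int]:
    """Return (raw, deflated) byte counts of the per-word tagging alone."""
    raw = b"".join(span_open(w) + b"EMC\n" for w in words)
    return len(raw), len(zlib.compress(raw))


if __name__ == "__main__":
    # Four words from the test text, repeated to simulate a longer document.
    sample = "अग्निशामक अभिज्ञान अनुक्रम काष्ठवाद्य".split() * 50
    raw_size, deflated_size = tagging_overhead(sample)
    print(f"raw: {raw_size} bytes, deflated: {deflated_size} bytes")
```

Content streams in PDF files are commonly Flate-compressed, which is why the deflated figure is the more relevant one when estimating file growth.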
Comment 6 V Stuart Foote 2018-05-06 20:39:29 UTC
(In reply to Khaled Hosny from comment #4)
> 1) Do we want this option or not

IIUC, https://cgit.freedesktop.org/libreoffice/core/commit/?id=c688b01d9102832226251fc84045408afe392459 gets us /ActualText tags in the PDF at the Unicode glyph cluster level where needed.

This additional work would be to expand PDF export to include generation of the /ActualText at Unicode Word boundaries for text in all scripts/fonts. Helpful for fidelity of CTL script content by word, but also for extending our Tagged PDF content in general to include tagged words for entire text. Good for a11y and AT tools that can parse the tags.

So I think it is worth doing.


> 2) What exact wording to use, /ActualText is a jargon (it refers to specific PDF
> construct) that I’m not sure should be exposed in user UI.


True "Actual Text", a counterpoint to "Alternate Text" or "Extended Text" commenting for accessibility, could include other lexical aspects of rendering a document--e.g. exposing in PDF the meaning of an Emoji, from its Unicode point name or drawn from a substitution table. 

But here, if we were to enable/disable Unicode word boundary tags by simply adding this to the "Tagged PDF (add document structure)" check box, the specific PDF /Lang & /ActualText tags would not need to appear in the UI.  "Tagged PDF" would simply include actual text tagging by default.

The Help item for the checkbox would include details regarding the tagging of text, with the /Lang and /ActualText tags mentioned for completeness, but with no need to refer to them in the GUI otherwise.
Comment 7 Shree Devi Kumar 2018-05-07 18:30:13 UTC
> "Tagged PDF" would simply include actual text tagging by default.

That would be great!
Comment 8 Heiko Tietze 2018-05-08 07:07:19 UTC
(In reply to V Stuart Foote from comment #3)
> 1. done only for CTL?  => NO (see 6.)
No

> 2. toggled active by default? => NO (all PDFs would balloon in size)
We have a direct export command with just the file dialog, so considering Tools > Options > Print also makes sense. But I agree with No because of KISS.
 
> 3. receive its own check box control on the PDF export dialog?  => NO
> 4. alternatively, be merged into the generate "Tagged PDF (add document
> structure)" checkbox?  => YES (pending export dialog work needed for bug
> 45636) 
We have many options in this dialog and one more doesn't spoil the party. The problem with Tagged PDF is that this option is formally used for the structure.
=> Maybe ("[ ] Export raw text" underneath "[ ] Export comments")

> 5. perform ICU lib only recognition of intended language/script? => NO
> (insufficient granularity as to /Lang tagging for non-CTL scripts)
ACK
 
> 6. or, recognize /ActualText as a component of supporting a11y--and that
> eventual support of ISO 14289-1 PDF/UA (bug 45636) will require accurate
> /Lang tagging--so coordination of ICU lib Unicode block detection with the
> locale/language (BCP 47/ISO 639 [1][2][3]) as set by locale or by Paragraph
> from the GUI must be implemented for fidelity of non-CTL scripts.  => YES
Sounds to me like a checkbox is set on or off by default.

(In reply to Khaled Hosny from comment #4)
> 2) What exact wording to use, /ActualText is a jargon
"Export raw text", "Export actual text", "Export source"...
Comment 9 ⁨خالد حسني⁩ 2018-05-09 09:07:08 UTC
(In reply to Heiko Tietze from comment #8)
>
> (In reply to Khaled Hosny from comment #4)
> > 2) What exact wording to use, /ActualText is a jargon
> "Export raw text", "Export actual text", "Export source"...

We do export the text already, using a clever algorithm that minimizes file size impact and keeps individual characters selectable (as much as possible), but it fails in minor ways, with some readers second-guessing us and inserting random spaces in the middle of a word. The keyword for the proposed changes is “per word”: the new option would skip the algorithm and tag the glyphs of each word with its text, as a complete unit. This fixes the issue, but introduces a new one: you can no longer select parts of the word, since it is now a single unit. The option text needs to convey some of this to the user.
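
As a rough illustration of what "per word" tagging could look like in a page content stream, the sketch below wraps each word in a `/Span` marked-content sequence carrying `/ActualText`. It is not LibreOffice's export code: the helper names `actualtext_hex` and `tag_per_word` are hypothetical, splitting on spaces is an assumed stand-in for real Unicode word-boundary analysis, and the glyph-showing operators are left as a placeholder comment.

```python
def actualtext_hex(text: str) -> str:
    """Encode text as a PDF hex string in UTF-16BE with a leading BOM."""
    return "<" + ("\ufeff" + text).encode("utf-16-be").hex().upper() + ">"


def tag_per_word(line: str) -> str:
    """Emit one /Span marked-content sequence per space-separated word."""
    parts = []
    for index, word in enumerate(w for w in line.split(" ") if w):
        parts.append(f"/Span <</ActualText {actualtext_hex(word)}>> BDC")
        # Placeholder: the real exporter would emit the glyph operators here.
        parts.append(f"  % glyphs of word {index} go here")
        parts.append("EMC")
    return "\n".join(parts)


if __name__ == "__main__":
    print(tag_per_word("अग्निशामक अभिज्ञान"))
```

Because a reader replaces everything between `BDC` and `EMC` with the `/ActualText` string, the word is extracted intact, which is also why it can then only be selected as a whole.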
Comment 10 Shree Devi Kumar 2018-05-16 10:25:41 UTC
(In reply to Khaled Hosny from comment #9)
>
> We do export the text already, but using a clever algorithm that minimizes
> file size impact and keeps individual characters selectable (as much as
> possible), but it fails in minor ways with some readers second guessing us
> and inserting random spaces in the middle of the word. 

For Indic languages this was happening in ALL readers that I tested. 

> The keyword for the
> proposed changes is “per word”: the new option would skip the algorithm and
> tag the glyphs of each word with its text, as a complete unit.

@Khaled Any update on this? Can you create a patch for this option so that it can be tested?
Comment 11 V Stuart Foote 2018-05-16 13:12:35 UTC
I don't believe Khaled has volunteered to tackle the needed refactoring of the PDF export filter and GUI.  Check History: the bug is clearly not assigned, as Khaled removed himself, and it is back to NEW.

Otherwise, is there any objection that implementing an /ActualText flag "per word" will mean string selection to copy from PDF will be limited to word bounds? Personally I think we need the tagging more than the partial string copy. 

Assuring correct handling of combining glyphs and Unicode scripts--and presumably OTF font features when implemented (as for bug 58941)--is the desired outcome.

Justified from a11y perspective, and needed for accuracy supporting CTL scripts. 

Is that the UX consensus?
Comment 12 ⁨خالد حسني⁩ 2018-05-16 14:23:58 UTC
(In reply to Shree Devi Kumar from comment #10)
> (In reply to Khaled Hosny from comment #9)
> > The keyword for the
> > proposed changes is “per word”: the new option would skip the algorithm and
> > tag the glyphs of each word with its text, as a complete unit.
> 
> @Khaled Any update on this? Can you create a patch for this option so that
> it can be tested?

I don’t currently have time to work on this, unfortunately.
Comment 13 Shree Devi Kumar 2018-05-20 16:21:41 UTC
(In reply to V Stuart Foote from comment #11)
> I don't believe Khaled has volunteered to tackle the needed refactoring to
> the PDF export filter and GUI.  Check History--clearly not assigned as
> Khaled removed himself, back to NEW

OK. Since he had suggested opening a new bug for this, I had incorrectly assumed that he was planning to work on it.

> 
> Otherwise, is there any objection that implementing an /ActualText flag "per
> word" will mean string selection to copy from PDF will be limited to word
> bounds? Personally I think we need the tagging more than the partial string
> copy. 
> 
> Assuring correct handling combining glyphs and Unicode script--and
> presumably OTF font features when implemented (as for bug 58941)--is the
> desired outcome.
> 
> Justified from a11y perspective, and needed for accuracy supporting CTL
> scripts. 
> 
> Is that the UX consensus?

As a user, the ability to copy text from PDFs is important. Currently, except for XeLaTeX, I am not aware of any other method of doing so for Devanagari and other Indic scripts.

Please see https://www.wikihow.com/index.php?title=Create-a-Searchable-Hindi-PDF-Using-Lyx-with-Xetex which is a workaround for users who are not comfortable with XeLaTeX to create these searchable/copyable PDFs.

It will be a great benefit to users if this option can be implemented in LibreOffice.

Thank You!
Comment 14 Shree Devi Kumar 2018-05-20 16:23:48 UTC
(In reply to Khaled Hosny from comment #12)
> (In reply to Shree Devi Kumar from comment #10)
> > (In reply to Khaled Hosny from comment #9)
> > > The keyword for the
> > > proposed changes is “per word”: the new option would skip the algorithm and
> > > tag the glyphs of each word with its text, as a complete unit.
> > 
> > @Khaled Any update on this? Can you create a patch for this option so that
> > it can be tested?
> 
> I don’t currently have time to work on this, unfortunately.

OK. Thank you for your work on /ActualText; it is a step in the right direction toward getting fully copyable text from PDFs.
Comment 15 Heiko Tietze 2018-05-27 08:32:45 UTC
Putting all comments together, UX recommends implementing an option for this /ActualText feature. I suggest the caption "Improve non-latin text export" (with default off, meaning nothing changes for western users) and explaining the details in the help pages.
Comment 16 ⁨خالد حسني⁩ 2018-05-27 15:42:53 UTC
(In reply to Heiko Tietze from comment #15)
> Putting all comments together UX recommends to implement an option for this
> /Actualtext feature. I suggest the caption "Improve non-latin text export"
> (with default off, meaning nothing changes for western users) and explain
> details at the help pages.

Nothing is “non-latin”-specific about the proposed option.
Comment 17 Heiko Tietze 2018-05-28 19:42:32 UTC
(In reply to Khaled Hosny from comment #16)
> Nothing is “non-latin”-specific about the proposed option.

What would you call CTL and the like in a way that average users understand? IMHO, "Latin" is understood as A..Z, maybe including some special characters like umlauts, but definitely not Arabic, Hebrew, and Asian scripts.
Comment 18 flywire 2018-06-02 02:00:12 UTC
I consider it a serious bug that characters displayed in a LibreOffice PDF go missing when the text is copied. In my case the letter 'i' is displayed in the PDF file but is often missing when the text is copied and pasted into another program, e.g. computer commands are pasted incorrectly.

I have also noticed Text To Speech (TTS) does not work with missing characters in the pdf. Especially when it is a vowel!
Comment 19 ⁨خالد حسني⁩ 2018-06-19 08:51:29 UTC
(In reply to flywire0 from comment #18)
> I consider libre word pdf characters displayed missing when text copied is a
> serious bug. In my instance the letter 'i' is displayed in the pdf file but
> often missing when text is copied and pasted to another program. eg computer
> commands are pasted incorrectly.

This should be fixed in bug 66597. If you still have an issue with builds that include that fix, please open a new bug; it should be independent of the issue being discussed here.
Comment 20 Stéphane Guillou (stragu) 2021-07-20 13:44:27 UTC
I just tested the steps described in the Description, and couldn't reproduce the same issue:

On Ubuntu 18.04, using LO 7.0.6 and 7.3 alpha0+, I could copy the text, paste it in Writer, export to PDF, open the PDF in Evince 3.28.4, copy the text and paste it back in Writer or gedit: the result was the same as the original text (as far as I can see).

Not sure if something changed in PDF export along the way? Could you please test again with a recent version of LO?
Comment 21 V Stuart Foote 2021-07-20 15:05:47 UTC
(In reply to stragu from comment #20)

> Not sure if something changed in PDF export along the way? Could you please
> test again with a recent version of LO?

Hmm, strange. Following the OP's STR with Writer 7.3.0alpha, exporting to PDF, opening in Acrobat Reader (ver 2021.005.20058) and copying to Notepad++ (bld 7.9.5) in UTF-8 encoding, I get exactly the same malformed Devanagari.

The glyph clusters are not formed correctly, so the words cannot be copied out of the PDF.

The /ActualText structures, when present, would supplement the incorrect ToUnicode strings that drop lexical details.  Parsing the actual text runs with Unicode word boundary iterators would provide better fidelity to the original text when enabled and embedded into the PDF export.

=-testing-=
Version: 7.3.0.0.alpha0+ (x64) / LibreOffice Community
Build ID: 213430e0bdac0786b30a76a68b43d35647e93912
CPU threads: 8; OS: Windows 10.0 Build 19043; UI render: Skia/Vulkan; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: threaded
Comment 22 V Stuart Foote 2021-07-20 15:10:20 UTC
Created attachment 173712
Result of OP STR as pasted to Notepad++ UTF-8
Comment 23 Stéphane Guillou (stragu) 2021-07-21 13:03:23 UTC
Created attachment 173741
results of testing on Ubuntu 18.04 with LO 7.3 alpha and Evince as PDF viewer

Interesting indeed!

Here are the results of my tests using:
- LO 7.3 alpha0+
- Ubuntu 18.04
- Evince 3.28.4
- gedit 3.28.1

I can't spot any difference with the original text.

This makes me wonder if the issue is specific to Windows, or if Acrobat Reader is the culprit?

Version: 7.3.0.0.alpha0+ / LibreOffice Community
Build ID: 113d308155e4b6a67a8510098a7db5f4a6632bdc
CPU threads: 8; OS: Linux 4.15; UI render: default; VCL: gtk3
Locale: en-AU (en_AU.UTF-8); UI: en-US
TinderBox: Linux-rpm_deb-x86_64@86-TDF, Branch:master, Time: 2021-07-16_21:27:22
Calc: threaded
Comment 24 Stéphane Guillou (stragu) 2021-07-21 13:04:11 UTC
Created attachment 173742
PDF as exported by LO 7.3 on Ubuntu 18.04

Also attaching the resulting PDF for completeness' sake.
Comment 25 Eyal Rozenberg 2023-08-25 15:26:20 UTC
Can someone summarize the state of this bug at the moment?