117428 – add an option to PDF export dialog to do ActualText per word

Status:	NEW

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Printing and PDF export
Version: (earliest affected)	6.1.0.0.alpha1+
Hardware:	All All

Importance:	medium enhancement
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:

Depends on:	66597
Blocks:	PDF-Export PDF-Export-Options-Dialog CTL

Reported:	2018-05-04 14:41 UTC by Shree Devi Kumar
Modified:	2025-01-16 14:23 UTC
CC List:	9 users

See Also:	118370 39667 152143
Crash report or crash signature:

Attachments
Result of OP STR as pasted to Notepad++ UTF-8 (31.96 KB, image/png) 2021-07-20 15:10 UTC, V Stuart Foote
results of testing on Ubuntu 18.04 with LO 7.3 alpha and Evince as PDF viewer (54.78 KB, application/vnd.oasis.opendocument.text) 2021-07-21 13:03 UTC, Stéphane Guillou (stragu)
PDF as exported by LO 7.3 on Ubuntu 18.04 (30.63 KB, application/pdf) 2021-07-21 13:04 UTC, Stéphane Guillou (stragu)

Description Shree Devi Kumar 2018-05-04 14:41:11 UTC

Description:
A new feature has been added to 6.1.0 by Khaled Hosny that allows text to be copied and extracted from pdfs using ActualText. However it does not work completely for complex scripts.

ActualText per word has been suggested as a possible solution. Khaled has suggested that this be done via an option to PDF export dialog to do ActualText per word rather than as a default.

Steps to Reproduce:
1. Use the following text for testing.
Devanagari Script – 
Hindi, Sanskrit, Marathi, Nepali languages
नित्यानन्दकरी वराभयकरी सौन्दर्यरत्नाकरी । निर्धूताखिलघोरपावनकरी प्रत्यक्षमाहेश्वरी ।। 
अग्निशामक अभिज्ञान अनुक्रम काष्ठवाद्य अंतर्राष्ट्रीय ख़ूँखार मूत्रविज्ञान द्विध्रुव 
2. Open a new .odt file in LibreOffice, then copy and paste the above text.
3. Export to PDF.
4. Open the PDF in Acrobat Reader.
5. Copy the text and paste it into a text editor.
6. Compare with the original UTF-8 text.

Actual Results:  
Devanagari Script –
Hindi, Sanskrit, Marathi, Nepali languages
नि त्यानन्दकरी वराभयकरी सौन्दर्यरत्नाकरी । नि र्धूताखि लघोरपावनकरी प्रत्यक्षमाहेश्वरी ।।
अग्नि शामक अभि ज्ञान अनुक्रम काष्ठवाद्य अंतर्रा ष्ट्र ीय ख़ूँखार मूत्रवि ज्ञान द्वि ध्रुव

Expected Results:
The text should be the same as original.

Devanagari Script – 
Hindi, Sanskrit, Marathi, Nepali languages
नित्यानन्दकरी वराभयकरी सौन्दर्यरत्नाकरी । निर्धूताखिलघोरपावनकरी प्रत्यक्षमाहेश्वरी ।। 
अग्निशामक अभिज्ञान अनुक्रम काष्ठवाद्य अंतर्राष्ट्रीय ख़ूँखार मूत्रविज्ञान द्विध्रुव 


Reproducible: Always


User Profile Reset: No



Additional Info:
The following wdiff output shows the difference.

======================================================================

[-नित्यानन्दकरी-]
{+नि त्यानन्दकरी+}
======================================================================
 [-निर्धूताखिलघोरपावनकरी-] {+नि र्धूताखि लघोरपावनकरी+}
======================================================================
 
[-अग्निशामक अभिज्ञान-]
{+अग्नि शामक अभि ज्ञान+}
======================================================================
 [-अंतर्राष्ट्रीय-] {+अंतर्रा ष्ट्र ीय+}
======================================================================
 [-मूत्रविज्ञान द्विध्रुव-] {+मूत्रवि ज्ञान द्वि ध्रुव+}
======================================================================
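
The same comparison can be scripted. The following is a minimal Python sketch, not part of the original report: it assumes the original and the pasted text are available as strings, and the function name `split_words` is made up for illustration. It pairs each run of original words with the differing run that came back from the PDF.

```python
import difflib


def split_words(original: str, extracted: str) -> list[tuple[str, str]]:
    """Pair each differing run of original words with its extracted counterpart."""
    orig_words = original.split()
    ext_words = extracted.split()
    matcher = difflib.SequenceMatcher(a=orig_words, b=ext_words, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append(
                (" ".join(orig_words[i1:i2]), " ".join(ext_words[j1:j2]))
            )
    return differences


if __name__ == "__main__":
    original = "अग्निशामक अभिज्ञान अनुक्रम"
    extracted = "अग्नि शामक अभि ज्ञान अनुक्रम"
    # Print in a wdiff-like notation: [-old-] {+new+}
    for before, after in split_words(original, extracted):
        print(f"[-{before}-] {{+{after}+}}")
```

Words are compared as whole units, so a word that was split by a stray space shows up as one changed run rather than as many character-level edits.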

Please see https://bugs.documentfoundation.org/attachment.cgi?id=141808 for more examples with many other Indic/Complex scripts.

 I tested with 
> Version: 6.1.0.0.alpha1+ (x64)
> Build ID: 5f2073fbc995fb619f398a55187413813578b62e
> CPU threads: 4; OS: Windows 10.0; UI render: default; 
> TinderBox: Win-x86_64@42, Branch:master, Time: 2018-04-30_00:51:08
> Locale: en-IN (en_IN); Calc: group



User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36

Comment 1 Heiko Tietze 2018-05-06 09:20:52 UTC Comment hidden (off-topic)

needsUXEval needs CC @ ux-advice

Comment 2 Heiko Tietze 2018-05-06 09:22:23 UTC

What input from UX do you expect, Khaled?

Comment 3 V Stuart Foote 2018-05-06 15:08:28 UTC

The change is internal to the PDF export filter; when enabled, it will produce a much larger PDF. But that PDF will be more useful to users who need to copy out text with reasonable word bounds, especially for the complex-script languages that drove bug 66597.

The UX issues now are if the work on tagging PDF with /ActualText should be: 

1. done only for CTL?  => NO (see 6.)

2. toggled active by default? => NO (all PDFs would balloon in size)

3. receive its own check box control on the PDF export dialog?  => NO

4. alternatively, be merged into the generate "Tagged PDF (add document structure)" checkbox?  => YES (pending export dialog work needed for bug 45636) 

5. perform ICU lib only recognition of intended language/script? => NO (insufficient granularity as to /Lang tagging for non-CTL scripts)

6. or, recognize /ActualText as a component of supporting a11y--and that eventual support of ISO 14289-1 PDF/UA (bug 45636) will require accurate /Lang tagging--so coordination of ICU lib Unicode block detection with the locale/language (BCP 47/ISO 639 [1][2][3]) as set by locale or by Paragraph from the GUI must be implemented for fidelity of non-CTL scripts.  => YES


=-ref-=
[1] https://opengrok.libreoffice.org/xref/core/i18nlangtag/source/isolang/isolang.cxx#170
[2] https://opengrok.libreoffice.org/xref/core/include/i18nlangtag/mslangid.hxx
[3] https://opengrok.libreoffice.org/xref/core/include/rtl/locale.h

Comment 4 Khaled Hosny 2018-05-06 15:11:50 UTC

(In reply to Heiko Tietze from comment #2)
> What input from UX do you expect, Khaled?

What V Stuart Foote said, plus: 1) Do we want this option or not? 2) What exact wording should we use? /ActualText is jargon (it refers to a specific PDF construct) that I’m not sure should be exposed in the user UI.

Comment 5 Shree Devi Kumar 2018-05-06 15:51:08 UTC

Regarding expected ballooning in pdf size, please see

http://tug.org/pipermail/xetex/2016-February/026445.html

On 23/2/16 02:54, Andrew Cunningham wrote:
> It would probably more than double, i was under the impression that
> ActualText was a tag attrubute, so extensive tagging would be needed,
> and actual text added to the tags.

The ActualText tagging is highly compressible, so in practice the 
increase in overall PDF size is not all that great.
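
The compressibility claim can be checked roughly with a few lines of Python. This is only a sketch under stated assumptions: the glyph operator string is a placeholder rather than real font output, the helper names are hypothetical, and the sample text is repetitive, so the ratio it reports is optimistic. It relies on the fact that PDF content streams are commonly stored Flate-compressed, which `zlib` approximates.

```python
import zlib


def actual_text_span(word: str, glyph_ops: str) -> bytes:
    """Wrap placeholder glyph operators in a /Span carrying /ActualText for `word`."""
    hex_text = "FEFF" + word.encode("utf-16-be").hex().upper()
    return f"/Span <</ActualText <{hex_text}>>> BDC {glyph_ops} EMC\n".encode("ascii")


def tagged_stream_sizes(words: list[str]) -> tuple[int, int]:
    """Return (raw size, zlib-compressed size) of a per-word tagged stream."""
    glyph_ops = "<00120034> Tj"  # placeholder glyph IDs, not from a real font
    stream = b"".join(actual_text_span(word, glyph_ops) for word in words)
    return len(stream), len(zlib.compress(stream))


if __name__ == "__main__":
    sample = "अग्निशामक अभिज्ञान अनुक्रम काष्ठवाद्य".split() * 50
    raw, packed = tagged_stream_sizes(sample)
    print(f"raw: {raw} bytes, compressed: {packed} bytes")
```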

Comment 6 V Stuart Foote 2018-05-06 20:39:29 UTC

(In reply to Khaled Hosny from comment #4)
> 1) Do we want this option or not

IIUC, https://cgit.freedesktop.org/libreoffice/core/commit/?id=c688b01d9102832226251fc84045408afe392459 gets us /ActualText tags in the PDF at the Unicode glyph cluster level where needed.

This additional work would be to expand PDF export to include generation of the /ActualText at Unicode Word boundaries for text in all scripts/fonts. Helpful for fidelity of CTL script content by word, but also for extending our Tagged PDF content in general to include tagged words for entire text. Good for a11y and AT tools that can parse the tags.

So I think it is worth doing.


> 2) What exact wording to use, /ActualText is a jargon (it refers to specific PDF
> construct) that I’m not sure should be exposed in user UI.


True "Actual Text", a counterpoint to "Alternate Text" or "Extended Text" commenting for accessibility, could include other lexical aspects of rendering a document--e.g. exposing in PDF the meaning of an Emoji, from its Unicode point name or drawn from a substitution table. 

But here if we were to enable/disable Unicode Word boundary tags by simply adding it to the "Tagged PDF (add document structure)" check box the specific PDF /Lang & /ActualText tags would not be needed in the UI.  "Tagged PDF" would simply include actual text tagging by default.

The Help item for the checkbox would include details regarding the tagging of text, with the /Lang and /ActualText tags mentioned for completeness, but with no need to refer to them in the GUI otherwise.

Comment 7 Shree Devi Kumar 2018-05-07 18:30:13 UTC

> "Tagged PDF" would simply include actual text tagging by default.

That would be great!

Comment 8 Heiko Tietze 2018-05-08 07:07:19 UTC

(In reply to V Stuart Foote from comment #3)
> 1. done only for CTL?  => NO (see 6.)
No

> 2. toggled active by default? => NO (all PDFs would balloon in size)
We have a direct export command with just the file dialog. So considering Tools > Options > Print also makes sense. But I agree with No because of KISS.
 
> 3. receive its own check box control on the PDF export dialog?  => NO
> 4. alternatively, be merged into the generate "Tagged PDF (add document
> structure)" checkbox?  => YES (pending export dialog work needed for bug
> 45636) 
We have many options in this dialog and one more doesn't spoil the party. The problem with Tagged PDF is that this option is formally used for the structure.
=> Maybe ("[ ] Export raw text" underneath "[ ] Export comments")

> 5. perform ICU lib only recognition of intended language/script? => NO
> (insufficient granularity as to /Lang tagging for non-CTL scripts)
ACK
 
> 6. or, recognize /ActualText as a component of supporting a11y--and that
> eventual support of ISO 14289-1 PDF/UA (bug 45636) will require accurate
> /Lang tagging--so coordination of ICU lib Unicode block detection with the
> locale/language (BCP 47/ISO 639 [1][2][3]) as set by locale or by Paragraph
> from the GUI must be implemented for fidelity of non-CTL scripts.  => YES
Sounds to me like a checkbox is set on or off by default.

(In reply to Khaled Hosny from comment #4)
> 2) What exact wording to use, /ActualText is a jargon
"Export raw text", "Export actual text", "Export source"...

Comment 9 Khaled Hosny 2018-05-09 09:07:08 UTC

(In reply to Heiko Tietze from comment #8)
>
> (In reply to Khaled Hosny from comment #4)
> > 2) What exact wording to use, /ActualText is a jargon
> "Export raw text", "Export actual text", "Export source"...

We do export the text already, using a clever algorithm that minimizes file size impact and keeps individual characters selectable (as much as possible), but it fails in minor ways, with some readers second-guessing us and inserting random spaces in the middle of a word. The keyword for the proposed changes is “per word”: the new option would skip the algorithm and tag the glyphs of each word with its text, as a complete unit. This fixes the issue but introduces a new one: you can no longer select parts of the word, since it is now a single unit. The option text needs to relay some of this to the user.
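
To make the "per word" idea concrete, here is a minimal Python sketch of what such tagging could look like in a PDF content stream. It is not LibreOffice's implementation: the function names are hypothetical, `fake_glyphs` stands in for the real shaper output, and whitespace splitting is a simplification of proper Unicode word-boundary detection. The `/Span ... BDC` / `EMC` marked-content operators and the UTF-16BE hex string with a `FEFF` byte-order mark are standard PDF constructs.

```python
def actual_text_hex(text: str) -> str:
    """Encode text as a PDF hex string in UTF-16BE with a leading byte-order mark."""
    return "<FEFF" + text.encode("utf-16-be").hex().upper() + ">"


def tag_per_word(line: str, glyphs_for) -> str:
    """Emit one /Span marked-content sequence per whitespace-separated word."""
    chunks = []
    for word in line.split():
        chunks.append(
            f"/Span <</ActualText {actual_text_hex(word)}>> BDC\n"
            f"{glyphs_for(word)}\n"
            "EMC"
        )
    return "\n".join(chunks)


def fake_glyphs(word: str) -> str:
    """Stand-in for the shaper: real output would be font-specific glyph IDs."""
    return f"<{'0001' * len(word)}> Tj"


if __name__ == "__main__":
    print(tag_per_word("नित्यानन्दकरी वराभयकरी", fake_glyphs))
```

Because each word's glyphs sit inside a single marked-content sequence, a reader that honors /ActualText returns the whole word as one unit, which is also why partial selection within a word is lost.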

Comment 10 Shree Devi Kumar 2018-05-16 10:25:41 UTC

(In reply to Khaled Hosny from comment #9)
>
> We do export the text already, but using a clever algorithm that minimizes
> file size impact and keeps individual characters selectable (as much as
> possible), but it fails in minor ways with some readers second guessing us
> and inserting random spaces in the middle of the word. 

For Indic languages this was happening in ALL readers that I tested. 

> The keyword for the
> proposed changes is “per word”, the new option would skip the algorithm and
> tag the glyphs of each word with its text, as a complete unit. 

@Khaled Any update on this? Can you create a patch for this option so that it can be tested?

Comment 11 V Stuart Foote 2018-05-16 13:12:35 UTC

I don't believe Khaled has volunteered to tackle the needed refactoring to the PDF export filter and GUI.  Check History--clearly not assigned as Khaled removed himself, back to NEW

Otherwise, is there any objection that implementing an /ActualText flag "per word" will mean string selection to copy from PDF will be limited to word bounds? Personally I think we need the tagging more than the partial string copy. 

Assuring correct handling of combining glyphs and Unicode script (and presumably OTF font features when implemented, as for bug 58941) is the desired outcome.

Justified from a11y perspective, and needed for accuracy supporting CTL scripts. 

Is that the UX consensus?

Comment 12 Khaled Hosny 2018-05-16 14:23:58 UTC

(In reply to Shree Devi Kumar from comment #10)
> (In reply to Khaled Hosny from comment #9)
> > The keyword for the
> > proposed changes is “per word”, the new option would skip the algorithm and
> > tag the glyphs of each word with its text, as a complete unit. 
> 
> @Khaled Any update on this? Can you create a patch for this option so that
> it can be tested?

I don’t currently have time to work on this, unfortunately.

Comment 13 Shree Devi Kumar 2018-05-20 16:21:41 UTC

(In reply to V Stuart Foote from comment #11)
> I don't believe Khaled has volunteered to tackle the needed refactoring to
> the PDF export filter and GUI.  Check History--clearly not assigned as
> Khaled removed himself, back to NEW

OK. Since he had suggested opening a new bug for this, I had incorrectly assumed that he was planning to work on it. 

> 
> Otherwise, is there any objection that implementing an /ActualText flag "per
> word" will mean string selection to copy from PDF will be limited to word
> bounds? Personally I think we need the tagging more than the partial string
> copy. 
> 
> Assuring correct handling combining glyphs and Unicode script--and
> presumably OTF font features when implemented (as for bug 58941)--is the
> desired outcome.
> 
> Justified from a11y perspective, and needed for accuracy supporting CTL
> scripts. 
> 
> Is that the UX consensus?

As a user the ability to copy text from pdf is important. Currently, except for xelatex, I am not aware of any other method of doing so for Devanagari and other Indic scripts.

Please see https://www.wikihow.com/index.php?title=Create-a-Searchable-Hindi-PDF-Using-Lyx-with-Xetex which is a workaround for users who are not comfortable with XeLatex to create these searchable/copyable pdfs.

It will be a great benefit to users if this option can be implemented in LibreOffice.

Thank You!

Comment 14 Shree Devi Kumar 2018-05-20 16:23:48 UTC

(In reply to Khaled Hosny from comment #12)
> (In reply to Shree Devi Kumar from comment #10)
> > (In reply to Khaled Hosny from comment #9)
> > > The keyword for the
> > > proposed changes is “per word”, the new option would skip the algorithm and
> > > tag the glyphs of each word with its text, as a complete unit. 
> > 
> > @Khaled Any update on this? Can you create a patch for this option so that
> > it can be tested?
> 
> I don’t currently have time to work on this, unfortunately.

OK. Thank you for your work on /ActualText; it is a step in the right direction toward getting fully copyable text from PDFs.

Comment 15 Heiko Tietze 2018-05-27 08:32:45 UTC

Putting all comments together, UX recommends implementing an option for this /ActualText feature. I suggest the caption "Improve non-latin text export" (default off, meaning nothing changes for western users), with the details explained in the help pages.

Comment 16 Khaled Hosny 2018-05-27 15:42:53 UTC

(In reply to Heiko Tietze from comment #15)
> Putting all comments together UX recommends to implement an option for this
> /Actualtext feature. I suggest the caption "Improve non-latin text export"
> (with default off, meaning nothing changes for western users) and explain
> details at the help pages.

Nothing is “non-latin”-specific about the proposed option.

Comment 17 Heiko Tietze 2018-05-28 19:42:32 UTC

(In reply to Khaled Hosny from comment #16)
> Nothing is “non-latin”-specific about the proposed option.

What would you call CTL and the like in a way that average users understand? IMHO, "Latin" is understood as A..Z, maybe including some special characters like umlauts, but definitely not Arabic, Hebrew, and Asian.

Comment 18 flywire 2018-06-02 02:00:12 UTC

I consider it a serious bug that characters displayed in a LibreOffice PDF go missing when the text is copied. In my case the letter 'i' is displayed in the PDF file but is often missing when text is copied and pasted into another program, e.g. computer commands are pasted incorrectly.

I have also noticed Text To Speech (TTS) does not work with missing characters in the pdf. Especially when it is a vowel!

Comment 19 Khaled Hosny 2018-06-19 08:51:29 UTC

(In reply to flywire0 from comment #18)
> I consider libre word pdf characters displayed missing when text copied is a
> serious bug. In my instance the letter 'i' is displayed in the pdf file but
> often missing when text is copied and pasted to another program. eg computer
> commands are pasted incorrectly.

This should be fixed in bug 66597; if you still have an issue with builds including that fix, please open a new bug. This should be independent of the issue being discussed here.

Comment 20 Stéphane Guillou (stragu) 2021-07-20 13:44:27 UTC

I just tested the steps described in the Description, and couldn't reproduce the same issue:

On Ubuntu 18.04, using LO 7.0.6 and 7.3 alpha0+, I could copy the text, paste it in Writer, export to PDF, open in Evince 3.28.4, copy the text and paste it back in Writer or gedit: the result was the same as the original text (as far as I can see).

Not sure if something changed in PDF export along the way? Could you please test again with a recent version of LO?

Comment 21 V Stuart Foote 2021-07-20 15:05:47 UTC

(In reply to stragu from comment #20)

> Not sure if something changed in PDF export along the way? Could you please
> test again with a recent version of LO?

Hmm, strange. Following the OP's STR, I exported to PDF from Writer 7.3.0alpha, opened it in Acrobat Reader (ver 2021.005.20058), and copied to Notepad++ (bld 7.9.5) in UTF-8 encoding. I get exactly the same malformed Devanagari.

The glyph clusters are not formed correctly, so the words can not be copied out of the PDF.

The /ActualText structures when present would supplement the incorrect ToUnicode strings that drop lexical details.  Parsing the actual text runs would, if done at Unicode word bound iterators, provide better fidelity to original text when enabled and embedded into the PDF export.
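
A toy Python model may help show how /ActualText can supplement glyph-by-glyph ToUnicode lookup. Everything in it is an assumption for illustration: the glyph IDs are invented, the function name is hypothetical, and the failure it simulates (glyphs stored in visual order, with the Devanagari vowel sign I drawn before its consonant) is only one way extraction can go wrong, not a diagnosis of this bug.

```python
# Hypothetical glyph IDs mapped to code points, standing in for a font's ToUnicode CMap.
TO_UNICODE = {
    101: "\u093F",  # glyph for DEVANAGARI VOWEL SIGN I, drawn left of its consonant
    102: "\u0928",  # glyph for DEVANAGARI LETTER NA
}


def extract_text(glyph_run, actual_text=None):
    """Return ActualText when supplied; otherwise map glyphs one by one."""
    if actual_text is not None:
        return actual_text
    # Unmapped glyphs become U+FFFD REPLACEMENT CHARACTER.
    return "".join(TO_UNICODE.get(glyph, "\uFFFD") for glyph in glyph_run)


if __name__ == "__main__":
    visual_order = [101, 102]  # vowel sign first, as it appears on the page
    logical_text = "\u0928\u093F"  # NA followed by VOWEL SIGN I, as typed
    print(extract_text(visual_order) == logical_text)                 # glyph order only
    print(extract_text(visual_order, logical_text) == logical_text)   # with ActualText
```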

=-testing-=
Version: 7.3.0.0.alpha0+ (x64) / LibreOffice Community
Build ID: 213430e0bdac0786b30a76a68b43d35647e93912
CPU threads: 8; OS: Windows 10.0 Build 19043; UI render: Skia/Vulkan; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: threaded

Comment 22 V Stuart Foote 2021-07-20 15:10:20 UTC

Created attachment 173712 [details]
Result of OP STR as pasted to Notepad++ UTF-8

Comment 23 Stéphane Guillou (stragu) 2021-07-21 13:03:23 UTC

Created attachment 173741 [details]
results of testing on Ubuntu 18.04 with LO 7.3 alpha and Evince as PDF viewer

Interesting indeed !

Here are the results of my tests using:
- LO 7.3 alpha0+
- Ubuntu 18.04
- Evince 3.28.4
- gedit 3.28.1

I can't spot any difference with the original text.

This makes me wonder if the issue is specific to Windows, or if Acrobat Reader is the culprit?

Version: 7.3.0.0.alpha0+ / LibreOffice Community
Build ID: 113d308155e4b6a67a8510098a7db5f4a6632bdc
CPU threads: 8; OS: Linux 4.15; UI render: default; VCL: gtk3
Locale: en-AU (en_AU.UTF-8); UI: en-US
TinderBox: Linux-rpm_deb-x86_64@86-TDF, Branch:master, Time: 2021-07-16_21:27:22
Calc: threaded

Comment 24 Stéphane Guillou (stragu) 2021-07-21 13:04:11 UTC

Created attachment 173742 [details]
PDF as exported by LO 7.3 on Ubuntu 18.04

Also attaching the resulting PDF for completeness' sake.

Comment 25 Eyal Rozenberg 2023-08-25 15:26:20 UTC Comment hidden (obsolete)

Can someone summarize the state of this bug at the moment?

Comment 26 Eyal Rozenberg 2024-07-22 08:20:32 UTC

(In reply to Shree Devi Kumar from comment #5)
> Regarding expected ballooning in pdf size, please see
> 
> http://tug.org/pipermail/xetex/2016-February/026445.html

This link suggests the size increase is ~10% on a PDF of size 22 KB. Is that characteristic? If so, it's not terrible.

(In reply to Eyal Rozenberg from comment #25)
> Can someone summarize the state of this bug at the moment?

Repeating this question. Where does this stand? Also, should it really block RTL-CTL?

Comment 27 V Stuart Foote 2024-07-22 15:28:36 UTC

(In reply to Eyal Rozenberg from comment #26)

> Repeating this question. Where does this stand? Also, should it really block
> RTL-CTL?

Implementing PDF /ActualText tagging (comment 3, comment 6) has not been picked up for dev effort, but remains a valid enhancement to quality/function of our PDF export.

Especially so for the Complex Text Languages (CTL) [that *you* had tagged (bug 43808)]. It is obviously less of an issue for handling simple RTL scripts, but essential for embedding text runs of the truly complex scripts, e.g. the Indic scripts. [1]

@Miklos, Tomaž -- have either of you looked through this? Khaled had done the initial work on the /ActualText structure in the PDF export filters, and commented on the merit of more complete text tagging within our PDF export.

=-ref-=
[1] https://en.wikipedia.org/wiki/Brahmic_scripts#Unicode_of_Brahmic_scripts

Comment 28 Jonathan Clark 2025-01-16 14:23:11 UTC

To update this bug, I briefly investigated the current state of text extraction. I performed the following tests using a trivial Devanagari Writer document containing only "नित्यानन्दकरी", then exported to PDF using our filter:

Adobe Acrobat Reader now extracts the correct text. This is an improvement over the original report.

Evince also extracts the correct text. The macOS preview app crashed when I tried to click on the text to select it, but using the keyboard I was able to copy and paste the correct text.

Current stable Firefox (pdf.js) and Google Chrome do not seem to handle ActualText at all. Both programs seem to replace glyphs without ToUnicode mappings with an index, whether or not ActualText is specified. I also tested with quick-and-dirty hacks to simulate ActualText per word, forcing ActualText for every cluster, and using ActualText with no ToUnicode mappings; none of these fixes improved the situation.

As noted above, ActualText per-word could have other benefits. Currently, however, I don't think it would improve the text extraction situation. The major blocker seems to be the readers that don't implement any ActualText support at all, whether it's done per-word or per-cluster.