Bug 89471 - FILEOPEN pdf: When opening a PDF with RTL language text in Draw, text gets mirrored
Status: RESOLVED DUPLICATE of bug 104597
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version (earliest affected): 4.1 all versions
Hardware: Other All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: bibisected, bisected, regression
Depends on:
Blocks: RTL-CTL PDF-Import-Draw
Reported: 2015-02-19 14:11 UTC by bar.hofesh
Modified: 2022-10-15 17:30 UTC
CC List: 11 users

See Also:
Crash report or crash signature:


Attachments
Hebrew in PDF (429.50 KB, application/pdf)
2015-02-19 14:11 UTC, bar.hofesh
Non-LO-generated PDF with Hebrew text (38.77 KB, application/pdf)
2018-03-01 13:11 UTC, Eyal Rozenberg
PDF with Arabic text triggering the bug (24.24 KB, application/pdf)
2018-03-01 13:18 UTC, Eyal Rozenberg
simple PDF with RTL text inside text box (7.34 KB, application/pdf)
2019-03-30 08:30 UTC, Jack

Description bar.hofesh 2015-02-19 14:11:40 UTC
Created attachment 113500 [details]
Hebrew in PDF

This happens in Windows and Linux alike.
Using LibreOffice 4.4.0.3

Attached is an example PDF; we have tested multiple files with multiple fonts.
Comment 1 A (Andy) 2015-02-22 10:21:50 UTC
Reproducible with LO 4.4.0.3, Win 8.1.
Comment 2 Erez Hadad 2015-08-30 13:39:57 UTC
Reproducible on Ubuntu 14.04, LibreOffice 5.0.1.2
Comment 3 Eyal Rozenberg 2016-03-15 21:11:55 UTC
Indeed, this seems to always be the behavior - both with PDFs created by LibreOffice (Export to PDF) and PDFs created elsewhere. If you want another PDF (not created in LO) with Hebrew, try this:

http://elyon1.court.gov.il/heb/forms/docs/loMeyutzagim_new.pdf

Reproduced with 5.0.5.2 release.
Comment 4 QA Administrators 2017-05-22 13:19:32 UTC Comment hidden (obsolete)
Comment 5 Eyal Rozenberg 2018-03-01 13:11:06 UTC
Created attachment 140246 [details]
Non-LO-generated PDF with Hebrew text

The link to a document in my previous comment is dead, so instead here's an attachment with a PDF file not created with LO (originally it's InDesign CS4 and Adobe Distiller 10). Opening it in LO Draw results in the same issue.
Comment 6 Eyal Rozenberg 2018-03-01 13:18:06 UTC
Created attachment 140247 [details]
PDF with Arabic text triggering the bug

It seems this happens with Arabic text as well, so it's an RTL language issue rather than Hebrew issue specifically.
Comment 7 Buovjaga 2018-03-20 12:32:51 UTC
*** Bug 116318 has been marked as a duplicate of this bug. ***
Comment 8 Buovjaga 2018-03-20 12:33:50 UTC
*** Bug 97131 has been marked as a duplicate of this bug. ***
Comment 9 Eyal Rozenberg 2018-09-17 20:58:34 UTC
Bug still manifests with:

Version: 6.1.1.2
Build ID: 5d19a1bfa650b796764388cd8b33a5af1f5baa1b
CPU threads: 4; OS: Linux 4.15; UI render: default; VCL: gtk2; 
Locale: en-GB (en_GB.UTF-8); Calc: group threaded
Comment 10 Buovjaga 2018-12-04 08:09:12 UTC
*** Bug 84797 has been marked as a duplicate of this bug. ***
Comment 11 Jack 2019-03-30 08:30:08 UTC
Created attachment 150410 [details]
simple PDF with RTL text inside text box

Bug 84797 is both older and specifies RTL text inside text boxes.
Comment 12 Eyal Rozenberg 2019-03-30 10:14:09 UTC
As a member of the LO RTL QA "team", I would like to point out this is currently one of the most significant RTL bugs in LibreOffice, as we see it (there was an actual discussion rating prominence of RTL bugs on our Telegram group). This bug essentially means LO cannot be used to work with PDFs with RTL content.

In light of the latter sentence, I ask that the severity be raised to major; and in light of the former sentence, I ask its importance be raised as well.
Comment 13 Xisco Faulí 2019-03-30 16:10:49 UTC
@Mark, I thought you might be interested in this RTL issue...
Comment 14 Zev Spitz 2019-06-16 10:38:20 UTC
*** Bug 125951 has been marked as a duplicate of this bug. ***
Comment 15 V Stuart Foote 2019-06-16 16:36:12 UTC
Hmm, clearly an issue against the pdfio filter import to Draw (or other modules). The ipdf (pdfium based) cleanly inserts all RTL text from PDF--but fails miserably on break (bad font fallback from looks of it).

Anyhow, adding the PDF Import filter meta.

@Miklos, can anything be done? Is the issue that Unicode for glyphs are not recorded to PDF, so no ICU bidi is applied when creating the Draw text boxes to hold the text?
Comment 16 Eyal Rozenberg 2019-06-16 20:29:10 UTC
(In reply to V Stuart Foote from comment #15)
> the ipdf (pdfium based) cleanly inserts all RTL text from PDF--but
> fails miserably on break (bad font fallback from looks of it).

What's the "ipdf" as opposed to "pdfio"?

> Unicode for glyphs are not recorded to PDF

How could this be possible? Could you elaborate?
Comment 17 V Stuart Foote 2019-06-16 22:12:27 UTC
(In reply to Eyal Rozenberg from comment #16)
> (In reply to V Stuart Foote from comment #15)
> > the ipdf (pdfium based) cleanly inserts all RTL text from PDF--but
> > fails miserably on break (bad font fallback from looks of it).
> 
> What's the "ipdf" as opposed to "pdfio"?

ipdf using Google chrome project's pdfium
https://opengrok.libreoffice.org/xref/core/vcl/source/filter/ipdf/

pdfio (legacy from the Oracle PDF extension, takes things through poppler)
https://opengrok.libreoffice.org/xref/core/sdext/source/pdfimport/

> 
> > Unicode for glyphs are not recorded to PDF
> 
> How could this be possible? Could you elaborate?

Adobe defined Glyph lists (AGLFN) as opposed to OpenType tables--with results added to the /ToUnicode table structure of glyphs used in the PDF. So Unicode is always involved--should have thought a bit more about that. The glyphs are recorded into the PDF in the sequence they occur. What is mishandled in the pdfio filter is recognition that the text run is from a script entered RTL. IIUC the Hebrew and Arabic Unicode ranges carry bidirectional classes identifying them as RTL--that they are not recognized as such on import suggests the filter is not making use of the ICU bidi library.
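
As a rough illustration of the detection step described above (not the filter's actual code): every Unicode character carries a bidirectional class, and strong RTL letters are class R (Hebrew-type) or AL (Arabic-type). A minimal Python sketch using only the standard unicodedata module; the function name dominant_direction is hypothetical and the majority-vote rule is an assumption made for illustration:

```python
import unicodedata

# Strong right-to-left bidi classes: R (e.g. Hebrew) and AL (Arabic letters).
RTL_CLASSES = {"R", "AL"}


def dominant_direction(text: str) -> str:
    """Guess the direction of a text run from its strong characters.

    Hypothetical helper: counts strong RTL and strong LTR characters and
    returns "rtl", "ltr", or "neutral" when the run has no strong characters.
    """
    rtl = sum(1 for ch in text if unicodedata.bidirectional(ch) in RTL_CLASSES)
    ltr = sum(1 for ch in text if unicodedata.bidirectional(ch) == "L")
    if rtl == 0 and ltr == 0:
        return "neutral"
    return "rtl" if rtl > ltr else "ltr"
```

A run of Hebrew or Arabic letters is classified as "rtl", plain Latin text as "ltr", and a run of digits alone as "neutral", since digits are not strong characters.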

@Khaled, have you ever looked at the PDF import filters? Do they use the ICU libs? And if not, they probably should, right?
Comment 18 ⁨خالد حسني⁩ 2019-06-17 01:39:20 UTC
PDF outputs glyphs in visual order, so the original (logical) order of the text is lost, so the reverse of the bidi algorithm needs to be applied to the text extracted from the PDF, but there is no reliable or documented algorithm to do this.

ICU's bidi implementation has modes for applying the reverse of bidi algorithm (see the last three modes documented in http://icu-project.org/apiref/icu4c/ubidi_8h.html#afe123acc1196c4d7363f968ca6af6faa), I have never used them myself but if someone wants to work on this they may prove to be useful.
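
To make the visual-versus-logical problem concrete, here is a deliberately naive Python sketch of undoing visual order for a line that is assumed to have RTL base direction: reverse the whole line, then flip runs of Latin letters and digits back. This is not ICU's inverse bidi mode and not what the import filter does; the function name and the set of classes treated as LTR are assumptions for illustration only.

```python
import unicodedata
from itertools import groupby

# Assumption: characters in these bidi classes keep left-to-right order
# inside an RTL line (Latin letters, European and Arabic-Indic digits).
LTR_CLASSES = {"L", "EN", "AN"}


def _is_ltr(ch: str) -> bool:
    return unicodedata.bidirectional(ch) in LTR_CLASSES


def visual_to_logical_naive(visual: str) -> str:
    """Heuristically convert a visual-order RTL line to logical order.

    Step 1: reverse the whole line (RTL base direction is assumed).
    Step 2: reverse each contiguous LTR run again so that words and
            numbers read left to right.
    """
    reversed_line = visual[::-1]
    parts = []
    for is_ltr, group in groupby(reversed_line, key=_is_ltr):
        run = "".join(group)
        parts.append(run[::-1] if is_ltr else run)
    return "".join(parts)
```

With this heuristic, a visual-order line of digits followed by reversed Hebrew letters is restored correctly, but a two-word Latin phrase comes back with its words swapped, because the space between the words is not treated as part of the LTR run. That kind of failure is why a simple heuristic cannot be called reliable.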
Comment 19 ⁨خالد حسني⁩ 2019-06-17 01:49:29 UTC
My own recommendation is to not try to edit PDFs in LibreOffice, PDF is not an editable format (despite what some tools would lead you to believe), and what you get is just some complex hacks. Making PDF editable in LibreOffice was a misguided mistake to begin with, and if it were for me I'd just deprecate and eventually remove support for it.
Comment 20 Miklos Vajna 2019-06-17 06:49:45 UTC
> ipdf using Google chrome project's pdfium

Correct, and the result is currently a bitmap. Better call it just "pdfium-based import", I would say.

> pdfio (legacy from the Oracle PDF extension, takes things through poppler)
> https://opengrok.libreoffice.org/xref/core/sdext/source/pdfimport/

Let's call this "poppler-based import", or pdfimport please. :-)

We also have "pdfio", but that's just a tokenizer, and currently it uses neither pdfium nor poppler. Its primary usage is digital signatures; some cppunit tests also use it to assert the content of a PDF file.
Comment 21 Eyal Rozenberg 2019-06-18 22:39:28 UTC
(In reply to Khaled Hosny from comment #18)
> PDF outputs glyphs in visual order, so the original (logical) order of the
> text is lost, so the reverse of the bidi algorithm needs to be applied to
> the text extracted from the PDF, but there is no reliable or documented
> algorithm to do this.

Oh, no no no!

We seem to have a huge misunderstanding with respect to this bug.

Stuart, Khaled - the bug is not about how the original text order reconstruction is failing for RTL. The bug is that it is _performed_ at all. Basically, nothing in the PDF should be touched when we open it in Draw, unless the user actively changes it. If I open a PDF file in Draw, then save it - I should get a PDF with essentially the same thing that came in. Only if I modify a specific frame/box/object within the PDF is Draw allowed to do any of this reconstruction stuff. If I touch something and the RTL text gets flipped or messed up due to my edit - that's sad, but it's not terrible. I can either not-touch it, or replace it (but just it) with newly-written text.

Now, I agree that proper reconstruction of RTL text runs from arbitrary PDFs is difficult and challenging; but that would be a request of an interesting future feature, not a bug report.

(PS - If LO Draw could write meta-data/hints regarding the correct logical order, it could at least do perfect reconstruction for those files. But that too is a feature request and doesn't belong in this bug.)
Comment 22 Eyal Rozenberg 2019-06-18 22:47:14 UTC
(In reply to Khaled Hosny from comment #19)
> My own recommendation is to not try to edit PDFs in LibreOffice, PDF is not
> an editable format (despite what some tools would lead you to believe), and
> what you get is just some complex hacks. Making PDF editable in LibreOffice
> was a misguided mistake to begin with, and if it were for me I'd just
> deprecate and eventually remove support for it.

With respect - this is an irrelevant recommendation. People edit PDFs exactly because they don't have access to the source files with which they were generated; or because they want to be certain they begin editing in the absolute final typeset form of a document.

PDFs are pretty editable. There are PDF editors, which work. Inkscape works. Adobe Acrobat (the full suite) works. They may work in the somewhat handicapped fashion I described above (not sure about the full Acrobat) - but they are quite useful. But I want LibreOffice Draw functionality for these PDFs! And when this bug is fixed I can have it. Mostly.
Comment 23 V Stuart Foote 2019-06-19 02:33:19 UTC
(In reply to Eyal Rozenberg from comment #21)
> Oh, no no no!
> 
> We seem to have a huge misunderstanding with respect to this bug.
> 
Eyal, *

I will state again and quite clearly--LibreOffice is _NOT_ a PDF editor!

We can read it as a source document, opening into Writer, Impress, Calc, or Draw. We can filter export to PDF from any document--but that would overwrite/replace any source PDF, and only as os/DE allows.

We do not edit the PDF stream

We do not edit any of the PDF objects

ALL we do is read and filter import the PDF stream. 

We do not write back to the original source document and must swap in a reconstructed PDF stream with any changes.

Either of our two import filters, pdfium-based or poppler-based, keeps a copy of the PDF source file, but always converts its contents for manipulation on the LibreOffice canvas. We do not work directly on the "original"; we do not "edit" it!

That said, in practice the Poppler based import filter parses the object streams from PDF and converts them into corresponding LibreOffice Draw objects--Text boxes, Shapes, meta images, etc. Fidelity between the original PDF objects and the import filter result varies depending on the object type and if corresponding Draw object supports an attribute--clipping masks for example (bug 86211).

The pdfium-based import filter is configured to render content of the PDF as a bitmap image with high fidelity to the document layout published in the PDF. Currently it only handles the first page of a PDF 'inserted as image', with the bitmap resolution set at just 96 dpi.

The issue here is that on filter import of the PDF--the object stream holding text runs is added to a Draw text box. Within the source PDF, some original text will be broken into multiple text runs in multiple text objects.

The text stream is sequenced as entered RTL, but as filter import is written out to the Text box the run is written LTR--with no handling of the text run of glyphs as RTL, or IIRC for more complex composite scripts.

LibreOffice makes extensive use of the ICU project (https://en.wikipedia.org/wiki/International_Components_for_Unicode) for script recognition and transliteration. But it would seem text runs for non-western scripts are not being supported--and we may not be using the ICU Unicode text handling that is needed.

You'll note the pdfium filter (bug 89727) correctly handles the Hebrew and Arabic text of the sample documents attached here. But lest you think that is the solution for better fidelity and potential for "editing" PDF: as with the poppler-based import filter, selecting the graphic object and 'breaking' out its PDF stream objects gives results that are not well rendered to the document canvas--either losing the Unicode glyph, or getting incorrect font fallback (or a mix).

As Khaled said--PDF is not a format intended to be edited. And, LibreOffice is not a PDF editor. But we are mishandling RTL text runs and that needs to be investigated.
Comment 24 Eyal Rozenberg 2019-06-19 07:40:06 UTC
(In reply to V Stuart Foote from comment #23)
> I will state again and quite clearly--LibreOffice is _NOT_ a PDF editor!
> 
> We can read it as a source document, opening into Writer, Impress, Calc, or
> Draw. We can filter export to PDF from any document--but that would
> overwrite/replace any source PDF, and only as os/DE allows.

If LO can filter in a PDF, perform editing actions on it, and filter it out to a file, then it's effectively a PDF editor. If people use LO to edit PDFs then it's effectively a PDF editor. IIANM, Inkscape probably does the same thing, i.e. it doesn't work on PDF object streams. I agree that it's not a proper editor, but you shouldn't deny an important use of LO Draw. I don't have statistics, but many people use it for this purpose. People on the RTL QA team rated this the most prominent outstanding LO RTL bug for a reason...

But actually, that is not relevant to this bug in the narrow sense. That is to say, even if people only ever save LO Draw files and just open PDFs, it's the same bug. About the rest of your comment - I understand. Of course the pdfium import filter is not the solution.
Comment 25 V Stuart Foote 2019-06-21 23:14:05 UTC
The MirrorString and MirrorMap of OOo bz#i90800 was stripped out early in LO era with https://cgit.freedesktop.org/libreoffice/core/commit/?id=ff140bb6b8b109f14c270ff059f0b8d71dab5d6c

While the MirrorString helper remains, it is not clear that it still has any function, or what should trigger it for a text run. And looking at processGlyphLine in pdfiprocessor.cxx, it does not look like we link up the FontAttributes with fontID to make use of the ICU libs BiDi Unicode block detection that Khaled notes in comment 18.

@Artem, you have any interest?
Comment 26 vvort 2019-06-22 08:03:42 UTC
Since my language is LTR, it is hard for me to understand what exactly should be done here. It would be better to find someone with knowledge of RTL specifics.
For example, I don't see why FontAttributes are important here. We have "visual" Unicode string and need to convert it to "logical" (later LO will do inverse conversion for rendering purposes and cancel it).
Something is done in DrawXmlEmitter::visit ( https://docs.libreoffice.org/sdext/html/drawtreevisiting_8cxx_source.html#l00091 ), look at "// Check for RTL" comment. If that is not enough, it can be replaced by UBiDi "magic".
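
The cancellation described above can be shown with a toy model, restricted to a run made only of RTL letters, where rendering amounts to reversing the logical string. This is only a sketch under that assumption: render_rtl is a hypothetical stand-in for the renderer's bidi reordering, not a LibreOffice function, and real text with digits, Latin words or punctuation needs the full bidi algorithm.

```python
def render_rtl(logical: str) -> str:
    """Toy renderer: for a pure RTL run, visual order is reversed logical order."""
    return logical[::-1]


def import_as_is(pdf_visual: str) -> str:
    """Store the PDF's visual-order string unchanged, as if it were logical."""
    return pdf_visual


def import_with_conversion(pdf_visual: str) -> str:
    """Convert visual order to logical order before storing (pure RTL run only)."""
    return pdf_visual[::-1]


# "shalom" as it appears left to right on the PDF page (visual order):
# final mem, vav, lamed, shin.
pdf_visual = "\u05dd\u05d5\u05dc\u05e9"

mirrored = render_rtl(import_as_is(pdf_visual))            # differs from the PDF
restored = render_rtl(import_with_conversion(pdf_visual))  # matches the PDF
```

In this model, storing the visual string as-is makes the renderer reverse it once, so the text appears mirrored; converting to logical order first makes the two reversals cancel and the on-screen result matches the PDF.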
Comment 27 Xisco Faulí 2019-06-25 09:36:56 UTC
Indeed, this is a regression from https://cgit.freedesktop.org/libreoffice/core/commit/?id=ff140bb6b8b109f14c270ff059f0b8d71dab5d6c

Closing as dupe of bug 104597, which is clearer with fewer comments...

*** This bug has been marked as a duplicate of bug 104597 ***