Created attachment 113500
Hebrew in PDF
This happens in Windows and Linux alike.
Using LibreOffice 18.104.22.168
Attached is an example PDF; we have tested multiple files with multiple fonts.
Reproducible with LO 22.214.171.124, Win 8.1.
Reproducible on Ubuntu 14.04, LibreOffice 126.96.36.199
Indeed, this seems to always be the behavior - both with PDFs created by LibreOffice (Export to PDF) and PDFs created elsewhere. If you want another PDF (not created in LO) with Hebrew, try this:
Reproduced with 188.8.131.52 release.
** Please read this message in its entirety before responding **
To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.
There have been thousands of bug fixes and commits since anyone checked on this bug report. During that time, it's possible that the bug has been fixed, or the details of the problem have changed. We'd really appreciate your help in getting confirmation that the bug is still present.
If you have time, please do the following:
Test to see if the bug is still present on a currently supported version of LibreOffice
(5.2.7 or 5.3.3): https://www.libreoffice.org/download/
If the bug is present, please leave a comment that includes the version of LibreOffice and
your operating system, and any changes you see in the bug behavior
If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave
a short comment that includes your version of LibreOffice and Operating System
Please DO NOT
Update the version field
Reply via email (please reply directly on the bug tracker)
Set the bug's Status field to RESOLVED - FIXED (this status has a particular meaning that is not
appropriate in this case)
If you want to do more to help you can test to see if your issue is a REGRESSION. To do so:
1. Download and install the oldest version of LibreOffice (usually 3.3 unless your bug pertains to a feature added after 3.3)
2. Test your bug
3. Leave a comment with your results.
4a. If the bug was present with 3.3 - set version to "inherited from OOo";
4b. If the bug was not present in 3.3 - add "regression" to keyword
Feel free to come ask questions or to say hello in our QA chat: http://webchat.freenode.net/?channels=libreoffice-qa
Thank you for helping us make LibreOffice even better for everyone!
Created attachment 140246
Non-LO-generated PDF with Hebrew text
The link to a document in my previous comment is dead, so instead here's an attachment with a PDF file not created with LO (originally it's InDesign CS4 and Adobe Distiller 10). Opening it in LO Draw results in the same issue.
Created attachment 140247
PDF with Arabic text triggering the bug
It seems this happens with Arabic text as well, so it's an RTL language issue rather than a Hebrew-specific one.
*** Bug 116318 has been marked as a duplicate of this bug. ***
*** Bug 97131 has been marked as a duplicate of this bug. ***
Bug still manifests with:
Build ID: 5d19a1bfa650b796764388cd8b33a5af1f5baa1b
CPU threads: 4; OS: Linux 4.15; UI render: default; VCL: gtk2;
Locale: en-GB (en_GB.UTF-8); Calc: group threaded
*** Bug 84797 has been marked as a duplicate of this bug. ***
Created attachment 150410
simple PDF with RTL text inside text box
Bug 84797 is both older and specifies RTL text inside text boxes.
As a member of the LO RTL QA "team", I would like to point out this is currently one of the most significant RTL bugs in LibreOffice, as we see it (there was an actual discussion rating prominence of RTL bugs on our Telegram group). This bug essentially means LO cannot be used to work with PDFs with RTL content.
In light of the latter sentence, I ask that the severity be raised to major; and in light of the former sentence, I ask its importance be raised as well.
@Mark, I thought you might be interested in this RTL issue...
*** Bug 125951 has been marked as a duplicate of this bug. ***
Hmm, clearly an issue with the pdfio filter import to Draw (or other modules); the ipdf (pdfium based) cleanly inserts all RTL text from PDF--but fails miserably on break (bad font fallback from looks of it).
Anyhow, adding the PDF Import filter meta.
@Miklos, can anything be done? Is the issue that Unicode for glyphs are not recorded to PDF, so no ICU bidi is applied when creating the Draw text boxes that hold the text?
(In reply to V Stuart Foote from comment #15)
> the ipdf (pdfium based) cleanly inserts all RTL text from PDF--but
> fails miserably on break (bad font fallback from looks of it).
What's the "ipdf" as opposed to "pdfio"?
> Unicode for glyphs are not recorded to PDF
How could this be possible? Could you elaborate?
(In reply to Eyal Rozenberg from comment #16)
> (In reply to V Stuart Foote from comment #15)
> > the ipdf (pdfium based) cleanly inserts all RTL text from PDF--but
> > fails miserably on break (bad font fallback from looks of it).
> What's the "ipdf" as opposed to "pdfio"?
ipdf using Google chrome project's pdfium
pdfio (legacy from the Oracle PDF extension, takes things through poppler)
> > Unicode for glyphs are not recorded to PDF
> How could this be possible? Could you elaborate?
Adobe defined Glyph lists (AGLFN) as opposed to OpenType tables--with results added to the /ToUnicode table structure of glyphs used in the PDF. So Unicode is always involved--should have thought a bit more about that. The glyphs are recorded into the PDF in the sequence they occur. What is mishandled in the pdfio filter is recognition that the text run is from a script entered RTL. IIUC the Hebrew and Arabic Unicode ranges carry Unicode bidirectional properties identifying them as RTL--that they are not recognized as such on import suggests the filter is not making use of the ICU bidi library.
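A minimal sketch of the detection step described above, in Python rather than the filter's C++. The standard library's unicodedata module exposes each character's Unicode bidirectional class; the classes "R" and "AL" mark strong RTL characters. The function names are hypothetical and this is not the import filter's code, only an illustration of what "recognizing a run as RTL" could mean.

```python
import unicodedata

# Strong right-to-left bidirectional classes in the Unicode Character Database:
# "R" is used by Hebrew letters (among others), "AL" by Arabic letters (among others).
STRONG_RTL = {"R", "AL"}


def is_rtl_char(ch: str) -> bool:
    """Return True if the character has a strong RTL bidi class."""
    return unicodedata.bidirectional(ch) in STRONG_RTL


def contains_rtl(text: str) -> bool:
    """Return True if any character in the text run is strongly RTL."""
    return any(is_rtl_char(ch) for ch in text)


if __name__ == "__main__":
    hebrew = "\u05e9\u05dc\u05d5\u05dd"  # Hebrew word "shalom"
    arabic = "\u0633\u0644\u0627\u0645"  # Arabic word "salam"
    for run in (hebrew, arabic, "hello", "PDF 1.4"):
        print(repr(run), contains_rtl(run))
```

Detection alone does not restore the order of the text; it only tells the importer that a run needs RTL handling at all.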
@Khaled, have you ever looked at the PDF import filters? Does it use the ICU libs? And if not, it probably should, right?
PDF outputs glyphs in visual order, so the original (logical) order of the text is lost, so the reverse of the bidi algorithm needs to be applied to the text extracted from the PDF, but there is no reliable or documented algorithm to do this.
ICU's bidi implementation has modes for applying the reverse of bidi algorithm (see the last three modes documented in http://icu-project.org/apiref/icu4c/ubidi_8h.html#afe123acc1196c4d7363f968ca6af6faa), I have never used them myself but if someone wants to work on this they may prove to be useful.
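To make the visual-versus-logical distinction concrete, here is a deliberately naive sketch in Python. It does not use ICU's inverse reordering modes and is not a reliable inverse of the bidi algorithm; it assumes a single line whose paragraph direction is RTL, reverses the whole line, and then restores the internal order of LTR runs (Latin letters and digits). The function name and the choice of bidi classes treated as LTR are assumptions made for illustration.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}
# Treated as "keeps left-to-right order" in this sketch: Latin-like letters and digits.
LTR_CLASSES = {"L", "EN", "AN"}


def _is_rtl(ch: str) -> bool:
    return unicodedata.bidirectional(ch) in RTL_CLASSES


def _is_ltr(ch: str) -> bool:
    return unicodedata.bidirectional(ch) in LTR_CLASSES


def visual_to_logical_naive(visual: str) -> str:
    """Approximate the logical order of one visually ordered RTL line."""
    if not any(_is_rtl(ch) for ch in visual):
        return visual  # no RTL content, so visual order is already logical

    chars = list(reversed(visual))  # step 1: undo the right-to-left layout
    i, n = 0, len(chars)
    while i < n:
        if _is_ltr(chars[i]):
            # step 2: find the LTR run (neutrals between LTR characters included)
            j, last = i, i
            while j < n and not _is_rtl(chars[j]):
                if _is_ltr(chars[j]):
                    last = j
                j += 1
            chars[i:last + 1] = reversed(chars[i:last + 1])  # restore its order
            i = last + 1
        else:
            i += 1
    return "".join(chars)


if __name__ == "__main__":
    logical = "\u05e9\u05dc\u05d5\u05dd"  # Hebrew word "shalom", logical order
    visual = logical[::-1]                # glyph order as laid out left to right
    print(visual_to_logical_naive(visual) == logical)
```

Real documents break these assumptions quickly (mixed paragraph directions, numbers with separators, nested embeddings), which is consistent with the point above that no reliable algorithm exists for the general case.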
My own recommendation is to not try to edit PDFs in LibreOffice, PDF is not an editable format (despite what some tools would lead you to believe), and what you get is just some complex hacks. Making PDF editable in LibreOffice was a misguided mistake to begin with, and if it were up to me I'd just deprecate and eventually remove support for it.
> ipdf using Google chrome project's pdfium
Correct, and the result is currently a bitmap. Better call it just "pdfium-based import", I would say.
> pdfio (legacy from the Oracle PDF extension, takes things through poppler)
Let's call this "poppler-based import", or pdfimport please. :-)
We also have "pdfio", but that's just a tokenizer, and currently it uses neither pdfium nor poppler. Its primary use is digital signatures; some cppunit tests also use it to assert the content of a PDF file.
(In reply to Khaled Hosny from comment #18)
> PDF outputs glyphs in visual order, so the original (logical) order of the
> text is lost, so the reverse of the bidi algorithm needs to be applied to
> the text extracted from the PDF, but there is no reliable or documented
> algorithm to do this.
Oh, no no no!
We seem to have a huge misunderstanding with respect to this bug.
Stuart, Khaled - the bug is not about how the original text order reconstruction is failing for RTL. The bug is that it is _performed_ at all. Basically, nothing in the PDF should be touched when we open it in Draw, unless the user actively changes it. If I open a PDF file in Draw, then save it - I should get a PDF with essentially the same thing that came in. Only if I modify a specific frame/box/object within the PDF is Draw allowed to do any of this reconstruction stuff. If I touch something and the RTL text gets flipped or messed up due to my edit - that's sad, but it's not terrible. I can either not touch it, or replace it (but just it) with newly-written text.
Now, I agree that proper reconstruction of RTL text runs from arbitrary PDFs is difficult and challenging; but that would be a request for an interesting future feature, not a bug report.
(PS - If LO Draw could write meta-data/hints regarding the correct logical order, it could at least do perfect reconstruction for those files. But that too is a feature request and doesn't belong in this bug.)
(In reply to Khaled Hosny from comment #19)
> My own recommendation is to not try to edit PDFs in LibreOffice, PDF is not
> an editable format (despite what some tools would lead you to believe), and
> what you get is just some complex hacks. Making PDF editable in LibreOffice
> was a misguided mistake to begin with, and if it were up to me I'd just
> deprecate and eventually remove support for it.
With respect - this is an irrelevant recommendation. People edit PDFs exactly because they don't have access to the source files with which they were generated; or because they want to be certain they begin editing in the absolute final typeset form of a document.
PDFs are pretty editable. There are PDF editors, which work. Inkscape works. Adobe Acrobat (the full suite) works. They may work in the somewhat handicapped fashion I described above (not sure about the full Acrobat) - but they are quite useful. But I want LibreOffice Draw functionality for these PDFs! And when this bug is fixed I can have it. Mostly.
(In reply to Eyal Rozenberg from comment #21)
> Oh, no no no!
> We seem to have a huge misunderstanding with respect to this bug.
I will state again and quite clearly--LibreOffice is _NOT_ a PDF editor!
We can read it as a source document, opening into Writer, Impress, Calc, or Draw. We can filter export to PDF from any document--but that would overwrite/replace any source PDF, and only as the OS/DE allows.
We do not edit the PDF stream
We do not edit any of the PDF objects
ALL we do is read and filter import the PDF stream.
We do not write back to the original source document and must swap in a reconstructed PDF stream with any changes.
Each of our two import filters, pdfium based or poppler based, keeps a copy of the PDF source file, but always converts its contents for manipulation on the LibreOffice canvas. We do not work directly on the "original"; we do not "edit" it!
That said, in practice the poppler based import filter parses the object streams from PDF and converts them into corresponding LibreOffice Draw objects--Text boxes, Shapes, meta images, etc. Fidelity between the original PDF objects and the import filter result varies depending on the object type and on whether the corresponding Draw object supports an attribute--clipping masks for example (bug 86211).
The pdfium based import filter is configured to render content of the PDF as a bitmap image with high fidelity to the document layout published in the PDF. Currently it only handles the first page of a PDF 'inserted as image', with the bitmap resolution set at just 96 dpi.
The issue here is that on filter import of the PDF--the object stream holding text runs is added to a Draw text box. Within the source PDF, some original text will be broken into multiple text runs in multiple text objects.
The text stream is sequenced as entered RTL, but as filter import is written out to the Text box the run is written LTR--with no handling of the text run of glyphs as RTL, or IIRC for more complex composite scripts.
LibreOffice makes extensive use of the ICU project (https://en.wikipedia.org/wiki/International_Components_for_Unicode) for script recognition and transliteration. But it would seem text runs for non-western scripts are not being supported--and we may not be using the ICU Unicode text handling that is needed.
You'll note the pdfium filter (bug 89727) correctly handles the Hebrew and Arabic text of the sample documents attached here. But lest you think that is the solution for better fidelity and potential for "editing" PDF: as with the poppler based import filter, selecting the graphic object and 'breaking' out its PDF stream objects gives results that are not well rendered to the document canvas--either losing the Unicode glyph, or getting incorrect font fallback (or a mix).
As Khaled said--PDF is not a format intended to be edited. And, LibreOffice is not a PDF editor. But we are mishandling RTL text runs and that needs to be investigated.
(In reply to V Stuart Foote from comment #23)
> I will state again and quite clearly--LibreOffice is _NOT_ a PDF editor!
> We can read it as a source document, opening into Writer, Impress, Calc, or
> Draw. We can filter export to PDF from any document--but that would
> overwrite/replace any source PDF, and only as os/DE allows.
If LO can filter in a PDF, perform editing actions on it, and filter it out to a file, then it's effectively a PDF editor. If people use LO to edit PDFs then it's effectively a PDF editor. IIANM, Inkscape probably does the same thing, i.e. it doesn't work on PDF object streams. I agree that it's not a proper editor, but you shouldn't deny an important use of LO Draw. I don't have statistics, but many people use it for this purpose. People on the RTL QA team rated this the most prominent outstanding LO RTL bug for a reason...
But actually, that is not relevant to this bug in the narrow sense. That is to say, even if people only ever save LO Draw files and just open PDFs, it's the same bug. About the rest of your comment - I understand. Of course the pdfium import filter is not the solution.
The MirrorString and MirrorMap of OOo bz#i90800 was stripped out early in LO era with https://cgit.freedesktop.org/libreoffice/core/commit/?id=ff140bb6b8b109f14c270ff059f0b8d71dab5d6c
While the MirrorString helper remains, it is not clear that it still has any function, or what should trigger it for a text run. And looking at processGlyphLine in pdfiprocessor.cxx, it does not look like we link up the FontAttributes with fontID to make use of the ICU libs BiDi Unicode block detection that Khaled notes in comment 18.
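For readers unfamiliar with why a mirror map would accompany a string-mirroring helper, here is a small Python sketch. It is not the OOo/LO MirrorString or MirrorMap code; the table and function name are hypothetical, and it assumes the text extracted from the PDF records paired punctuation by the shape that was drawn. Under that assumption, simply reversing a visual RTL run leaves brackets pointing the wrong way, so they are swapped while reversing.

```python
import unicodedata

# Hypothetical, deliberately tiny mirror table; a real one would cover
# every character with the Unicode Bidi_Mirrored property.
MIRROR_PAIRS = {
    "(": ")", ")": "(",
    "[": "]", "]": "[",
    "{": "}", "}": "{",
    "<": ">", ">": "<",
}


def mirror_string(visual_run: str) -> str:
    """Reverse a visually ordered RTL run and swap mirrored punctuation."""
    out = []
    for ch in reversed(visual_run):
        # unicodedata.mirrored() is nonzero for characters that are
        # mirrored in bidirectional text, such as brackets.
        if unicodedata.mirrored(ch):
            ch = MIRROR_PAIRS.get(ch, ch)
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    logical = "\u05e9\u05dc\u05d5\u05dd"        # Hebrew word "shalom"
    visual = "(" + logical[::-1] + ")"          # as drawn, left to right
    print(mirror_string(visual) == "(" + logical + ")")
```

Whether such a step is needed depends on how the producing application wrote the PDF, so this is best read as an illustration of the idea rather than a fix.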
@Artem, you have any interest?
Since my language is LTR, it is hard for me to understand what exactly should be done here. It would be better to find someone who knows the RTL specifics.
For example, I don't see why FontAttributes are important here. We have a "visual" Unicode string and need to convert it to a "logical" one (later LO will do the inverse conversion for rendering purposes and cancel it).
Something is done in DrawXmlEmitter::visit ( https://docs.libreoffice.org/sdext/html/drawtreevisiting_8cxx_source.html#l00091 ), look at "// Check for RTL" comment. If that is not enough, it can be replaced by UBiDi "magic".
Indeed, this is a regression from https://cgit.freedesktop.org/libreoffice/core/commit/?id=ff140bb6b8b109f14c270ff059f0b8d71dab5d6c
Closing as dupe of bug 104597, which is clearer with fewer comments...
*** This bug has been marked as a duplicate of bug 104597 ***