Bug 97131 - Libreoffice imports RTL text in reverse order in PDF importer
Status: RESOLVED DUPLICATE of bug 89471
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): Inherited From OOo
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Duplicates: 114189
Depends on:
Blocks: RTL-CTL PDF-Import-Draw PDF-Import-Writer
Reported: 2016-01-14 12:22 UTC by zahra
Modified: 2018-03-20 12:33 UTC (History)

See Also:
Crash report or crash signature:


Attachments
its a persian translation islamic doa in pdf format. (1.47 MB, application/pdf)
2016-01-14 12:22 UTC, zahra
sample odt with arabic, hebrew and persian (10.77 KB, application/vnd.oasis.opendocument.text)
2017-10-14 19:31 UTC, Yousuf Philips (jay) (retired)
sample pdf (15.42 KB, application/pdf)
2017-10-14 19:36 UTC, Yousuf Philips (jay) (retired)

Description zahra 2016-01-14 12:22:00 UTC
Created attachment 121931 [details]
its a persian translation islamic doa in pdf format.

hi everyone.
persian pdf files are not shown correctly in libreoffice and become completely unreadable.
steps to reproduce:
1/ open libreoffice writer.
2/ press control o or select open in the file menu.
3/ for the type of file that you want to open, select PDF - Portable Document Format (Writer) (*.pdf)
4/ choose utf-8 for encoding.
5/ don't change anything after that and press okay to open the file.
current result: libreoffice does not show persian pdf documents correctly, and after opening, the files are unreadable.
also open these files in adobe reader and sumatra pdf reader to see the difference.
expected behaviour: libreoffice shows persian pdf documents correctly, like word and html documents.
Comment 1 Pedro 2016-01-17 09:43:39 UTC Comment hidden (obsolete)
Comment 2 zahra 2016-01-17 14:04:05 UTC Comment hidden (obsolete)
Comment 3 Pedro 2016-01-17 23:54:08 UTC Comment hidden (obsolete)
Comment 4 QA Administrators 2017-03-06 14:23:21 UTC Comment hidden (obsolete)
Comment 5 Yousuf Philips (jay) (retired) 2017-10-14 19:20:07 UTC
So the issue is that LO should be able to detect RTL characters and order them correctly, but presently it is putting them in reverse, and it should set RTL in the textboxes they are added in as well.
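
A minimal sketch of the detection step described here, assuming only Python's standard `unicodedata` module. It is not LibreOffice code; the helper names `is_rtl_char` and `base_direction` are hypothetical. It guesses the direction a text box should get from the first strongly directional character:

```python
import unicodedata

# Strong right-to-left bidi classes: "R" (e.g. Hebrew letters) and
# "AL" (Arabic-script letters, which covers Arabic and Persian).
RTL_CLASSES = {"R", "AL"}


def is_rtl_char(ch: str) -> bool:
    """Return True if the character has a strong right-to-left bidi class."""
    return unicodedata.bidirectional(ch) in RTL_CLASSES


def base_direction(text: str) -> str:
    """Guess 'RTL' or 'LTR' from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in RTL_CLASSES:
            return "RTL"
        if bidi_class == "L":
            return "LTR"
    # No strong character found (digits, spaces, punctuation only):
    # falling back to LTR is an assumption of this sketch.
    return "LTR"
```

Digits and spaces are skipped because they carry no strong direction, so a text box that starts with a number but contains Persian text would still be reported as RTL.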

(In reply to zahra from comment #0)
> 4/ choose utf-8 for encoding. 

Where did you find this option?
Comment 6 Yousuf Philips (jay) (retired) 2017-10-14 19:31:18 UTC
Created attachment 136974 [details]
sample odt with arabic, hebrew and persian
Comment 7 Yousuf Philips (jay) (retired) 2017-10-14 19:36:03 UTC
Created attachment 136975 [details]
sample pdf

steps:
1. open pdf with 'pdf (writer)' filter in file open dialog
2. notice that text is in reverse order for each of the 3 languages
3. enter into any of the 3 textboxes and they are set to LTR

Version: 6.0.0.0.alpha0+
Build ID: 3672cdd35985201ea87463cf032fedd02c052f4d
CPU threads: 2; OS: Linux 4.4; UI render: default; VCL: gtk2; 
Locale: en-US (en_US.UTF-8); Calc: group
Comment 8 Yousuf Philips (jay) (retired) 2017-10-14 19:37:46 UTC
Same issue also happens when importing the pdf in Draw.
Comment 9 Urmas 2017-10-16 04:50:46 UTC
I don't see it is a bug.
RTL text layout depends on control characters and meta attributes to establish the fragments hierarchy and individual direction for each span.
PDF images contain only the pictures of the letters, and there is no simple, automatic way to put them into the correct reading order.

Maybe there should be a new RTL-specific editing operation implemented, "Visual <-> Logical conversion", which would allow fixing this manually.
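
A rough sketch of what such a "Visual -> Logical" operation could do for a single line, assuming the whole line is a right-to-left paragraph stored in left-to-right visual order. This is a naive illustration, not the Unicode Bidirectional Algorithm and not an existing LibreOffice feature; the function name `visual_to_logical` is hypothetical:

```python
import unicodedata

# Bidi classes treated as reading left to right inside an RTL line:
# Latin-style letters ("L") and European/Arabic digits ("EN", "AN").
LTR_LIKE = {"L", "EN", "AN"}


def visual_to_logical(line: str) -> str:
    """Naively convert one visually ordered RTL line to logical order."""
    out = []
    run = []  # current run of LTR-like characters, collected in reversed order
    # Step 1: reverse the whole line so RTL letters come out in logical order.
    for ch in line[::-1]:
        if unicodedata.bidirectional(ch) in LTR_LIKE:
            run.append(ch)
        else:
            # Step 2: a run of LTR letters or digits was reversed by step 1,
            # so flip it back before emitting it.
            out.extend(reversed(run))
            run.clear()
            out.append(ch)
    out.extend(reversed(run))
    return "".join(out)
```

Spaces end a run here, so a multi-word left-to-right phrase embedded in the line comes out with its words swapped. Handling that, and nested directions in general, needs the full bidi rules, which is part of why a manual operation is being suggested.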
Comment 10 Khaled Hosny (inactive) 2017-10-20 21:59:37 UTC
This can be partially worked around but can’t be 100% fixed, since PDF (usually) does not store the actual text but just the end result of text layout, with much of the information critical to reproducing the original text completely lost.
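
To illustrate the kind of partial workaround meant here, the following sketch assumes a hypothetical importer that yields positioned glyphs for one line and guesses the reading order from the x coordinates alone. The `Glyph` type and `line_text` function are made up for illustration and are not Poppler or LibreOffice APIs:

```python
from dataclasses import dataclass
from typing import List
import unicodedata


@dataclass
class Glyph:
    char: str
    x: float  # horizontal position on the page (hypothetical units)


def line_text(glyphs: List[Glyph]) -> str:
    """Join the glyphs of one line in a reading order guessed from position."""
    rtl = sum(unicodedata.bidirectional(g.char) in {"R", "AL"} for g in glyphs)
    ltr = sum(unicodedata.bidirectional(g.char) == "L" for g in glyphs)
    # If strong RTL letters dominate, read from the right edge to the left.
    ordered = sorted(glyphs, key=lambda g: g.x, reverse=rtl > ltr)
    return "".join(g.char for g in ordered)
```

A majority vote like this breaks on mixed-direction lines, and it cannot recover anything that the layout step discarded, which is the lossy part described above.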

Poppler has some support for this, the discussion and patches in https://bugs.freedesktop.org/show_bug.cgi?id=55977 might help someone trying to do the same in LibreOffice.

My 2¢, recreating text from PDF files is a lost cause, PDF is first and foremost a print file format, so it should be viewed as some glorified printed paper.
Comment 11 Yousuf Philips (jay) (retired) 2017-12-02 08:04:56 UTC
*** Bug 114189 has been marked as a duplicate of this bug. ***
Comment 12 Eyal Rozenberg 2017-12-02 09:29:30 UTC
I filed the recent dupe, so - something I noted there: This bug manifests in particular with PDFs created by LO itself (write something in Writer, export it to PDF, open it in Draw).

Replying to Comment 10:
> My 2¢, recreating text from PDF files is a lost cause, PDF is first and foremost a print file format, so it should be viewed as some glorified printed paper.

I disagree, and hold the exact opposite opinion:

* First, we have to distinguish between proper document recreation from PDF, which is more of a challenge, and recreation of text runs in frames, which is what Draw does.
* There's no good reason that what Draw does for LTR text should not succeed for RTL text - especially when most PDF readers succeed in this already, and even let you copy-and-paste the raw RTL text correctly (in most cases).
* People very often get PDF documents and need to alter them despite not having the original. Example: A form you need to fill for some official agency like the government or a bank etc. That's an important use case that needs to be catered to.
* People very often have use for the raw text in a PDF document when penning a reply - to quote some of the text back. So this is another important use case.
* Even if you have a piece of paper, you should be able to OCR it into a PDF and then get the text back... :-)
Comment 13 Khaled Hosny (inactive) 2017-12-03 15:21:30 UTC
If you want editable documents don’t use PDF; it is a final format for consumption by human readers. Treat PDF as a glorified printed paper and you will be happy.

Draw is not a PDF editor; if you need one, there are better options. If it were up to me, I’d drop the PDF importer altogether and not give people false hopes. Inserted PDFs should be treated as vector images, not documents to import text from.

Feel free to fix this bug though, and good luck with it (sincerely, I have worked on extracting RTL text from PDF documents elsewhere before and I know how messed up things are).
Comment 14 Eyal Rozenberg 2017-12-03 15:35:13 UTC
(In reply to Khaled Hosny from comment #13)
So, the basis for your argument is the position that Draw should not offer importing PDFs. I respectfully disagree, but, regardless - it does import PDFs. Also, I understand that LO Writer has a PDF import filter too...

Now, the thing is, I understand it may be difficult - but it can't be that difficult if most of the PDF readers get it right. Right?
Comment 15 Khaled Hosny (inactive) 2017-12-03 20:05:39 UTC
(In reply to Eyal Rozenberg from comment #14)
> Now, the thing is, I understand it may be difficult - but it can't be that
> difficult if most of the PDF readers get it right. Right?

If they did, but they don’t.
Comment 16 Buovjaga 2018-03-20 12:33:50 UTC

*** This bug has been marked as a duplicate of bug 89471 ***