Bug 149457 - Arabic Text Scrambled and Unreadable in PDF Files Opened by LibreOffice Draw
Summary: Arabic Text Scrambled and Unreadable in PDF Files Opened by LibreOffice Draw
Status: RESOLVED DUPLICATE of bug 104597
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Draw
Version (earliest affected): 7.3.3.2 release
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: PDF-Import-Draw RTL-Arabic-and-Farsi
 
Reported: 2022-06-04 22:04 UTC by Khaldoun
Modified: 2022-12-25 00:36 UTC

See Also:
Crash report or crash signature:


Attachments
PDF Sample File with Arabic text. (60.05 KB, application/pdf)
2022-06-04 22:06 UTC, Khaldoun
Lam-Alef and Lam-Hamza issue and Splitting singles words (162.38 KB, application/pdf)
2022-10-14 22:00 UTC, Khaldoun

Description Khaldoun 2022-06-04 22:04:56 UTC
Description:
PDF documents opened with LO Draw show Arabic text scrambled and completely unreadable.

The file attached is an example of what shows up in a document with Arabic text content.

I tried to open the same document with Adobe Acrobat and it opened perfectly fine.

I love LO and I want it to maintain its position as the best office suite out there.

The details of my system:

Version: 7.3.3.2 / LibreOffice Community
Build ID: 30(Build:2)
CPU threads: 8; OS: Linux 5.18; UI render: default; VCL: gtk3
Locale: en-AE (en_AE.UTF-8); UI: en-US
7.3.3-3
Calc: threaded

The issue is there for V 7.3.3.2 as well.

Steps to Reproduce:
1. Save the document attached to this bug report
2. Open the document with LO Draw

Actual Results:
The Arabic text is scrambled and not readable at all.

Expected Results:
Arabic text to look normal and be readable.


Reproducible: Always


User Profile Reset: Yes


OpenGL enabled: Yes

Additional Info:
- I checked the fonts: the font used is installed. I also tried changing the font, but the issue remains.

- LO on Windows and on Linux were both tried, with the same issue.

- Even documents created with LO and saved as PDF have the same issue.

- Not only does Draw show the text scrambled and unreadable, but if I save the document, it is then scrambled and unreadable in any other PDF reader, including Adobe Acrobat Reader.

- IMPORTANT: The attached document was created by another office suite, then opened as scrambled as described, then it was modified for privacy reasons and saved by LO Draw - like described in previous point.
Comment 1 Khaldoun 2022-06-04 22:06:14 UTC
Created attachment 180566 [details]
PDF Sample File with Arabic text.
Comment 2 V Stuart Foote 2022-06-07 02:00:51 UTC
Thanks for filing, but this is a known, long-running PDF import filter issue for RTL text runs.

*** This bug has been marked as a duplicate of bug 104597 ***
Comment 3 Eyal Rozenberg 2022-09-17 17:01:18 UTC
While this bug is about PDF import of RTL language text runs - it is not the same problem described in 104597. There, the problem is the reversal of order in text runs. Here we have additional problems, like character repetitions, shifting, excessive and insufficient (horizontal) spacing.

So, this is not clearly a dupe. Perhaps the fix for 104597 will resolve this one as well, but - perhaps not. I think the more careful relation between the bugs is dependence.
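
For illustration only, the run reversal described in bug 104597 can be sketched as below. This is a minimal sketch, not the LibreOffice or poppler implementation; the function names `is_rtl_run` and `visual_to_logical` are hypothetical, and it assumes a run made purely of RTL letters with no combining marks or embedded numbers. The other symptoms listed above (repetition, shifting, spacing) would not be touched by such a reversal.

```python
import unicodedata

# Bidi classes of strongly right-to-left characters (Hebrew: R, Arabic: AL).
RTL_CLASSES = {"R", "AL"}


def is_rtl_run(text):
    """Return True if every strongly directional character in text is RTL."""
    classes = [unicodedata.bidirectional(ch) for ch in text]
    strong = [c for c in classes if c in {"L", "R", "AL"}]
    return bool(strong) and all(c in RTL_CLASSES for c in strong)


def visual_to_logical(run):
    """Reverse a run stored in visual (left-to-right on the page) order.

    Assumption: the run is purely RTL; mixed-direction text would need
    the full Unicode Bidirectional Algorithm instead of a plain reversal.
    """
    return run[::-1] if is_rtl_run(run) else run


# Four Arabic letters stored in visual order (as glyphs sit on the page).
visual = "\u0645\u0627\u0644\u0633"
logical = visual_to_logical(visual)
```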
Comment 4 Khaldoun 2022-10-14 08:08:58 UTC
Hello, how can I check the 104597 fix and decide whether this bug is solved as well?

How will the new commit be delivered in a new LO version?

@Eyal Rozenberg
Comment 5 V Stuart Foote 2022-10-14 16:05:09 UTC
The 2022-10-14 nightly [1] imports the sample PDF to Draw pretty well. Some font glitches and obvious spots where combining glyphs get separated from their root glyph.

Overall greatly improved, but please consider that LibreOffice is *NOT* a PDF editor; the filter import to Draw produces an ODF holding sdraw text objects arranged on a document canvas.

Version: 7.5.0.0.alpha0+ (x64) / LibreOffice Community
Build ID: 8991cbb7986d3967bc6c3719d95254ff04428d1a
CPU threads: 8; OS: Windows 10.0 Build 19044; UI render: Skia/Vulkan; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: threaded

=-ref-=
[1] https://dev-builds.libreoffice.org/daily/master/
Comment 6 Khaldoun 2022-10-14 21:18:39 UTC
Hello @V Stuart Foote

Thanks a lot for the link to the build.

I am not asking for LO to be a PDF editor, only to properly display RTL text (Arabic in my case). I can assure you that many Arabic users are not using Arabic in LO because of such issues, which they do not face with other apps.

Anyways, I downloaded the 2022-10-14 build:

Version: 7.5.0.0.alpha0+ / LibreOffice Community
Build ID: a09c5c69e3b5fbf448cae1d6c476f39067e40023
CPU threads: 8; OS: Linux 6.0; UI render: default; VCL: gtk3
Locale: en-US (en_US.utf8); UI: en-US
Calc: threaded


The text rendering is much better, but the reverse-order handling still does not treat all the letters properly. Please see the added attachment, which describes an issue in handling specific two-letter combinations.

Also, there is an issue of splitting the same word over multiple blocks rather than keeping it in one block.
Comment 7 Khaldoun 2022-10-14 21:58:26 UTC
Hello,

I agree, Draw is not a PDF editor. But Draw should still handle the RTL/Arabic letters properly.

This is not yet 100% fixed by this fix.

In Arabic, when a "Lam" letter is followed by an "Alef" letter or a "Hamza" letter, both letters are combined into a new form/shape.

This does not appear to be handled properly yet in this fix.
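
As a small illustration of the ligature in question: Unicode encodes the combined Lam-Alef shapes as compatibility presentation forms, and Python's standard `unicodedata` module can map them back to the two underlying letters. This is only a sketch of the character-level relationship; whether the PDF import filter sees such code points or bare glyph IDs is not established in this report, and the helper name `decompose_presentation_forms` is hypothetical.

```python
import unicodedata

LAM = "\u0644"                      # ARABIC LETTER LAM
ALEF = "\u0627"                     # ARABIC LETTER ALEF
ALEF_HAMZA_ABOVE = "\u0623"         # ARABIC LETTER ALEF WITH HAMZA ABOVE
LAM_ALEF_LIGATURE = "\uFEFB"        # Lam-Alef ligature, isolated form
LAM_ALEF_HAMZA_LIGATURE = "\uFEF7"  # Lam-Alef-with-Hamza-above ligature, isolated form


def decompose_presentation_forms(text):
    """Map Arabic presentation-form ligatures back to their base letters.

    NFKC applies the compatibility decomposition of each ligature, so one
    ligature code point becomes the two letters a shaping engine expects.
    """
    return unicodedata.normalize("NFKC", text)


plain = decompose_presentation_forms(LAM_ALEF_LIGATURE)
with_hamza = decompose_presentation_forms(LAM_ALEF_HAMZA_LIGATURE)
```

Text that already consists of the base letters passes through unchanged, so the normalization is safe to apply to every extracted run.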

Another issue is that Draw sometimes splits the "same word" into multiple blocks.

NB: I call it a "block", but it could also be called a frame, box, etc.

I am attaching a new file that describes both issues.

IMPORTANT: This commit fixes a big portion of the issue. It deserves to go live.
Comment 8 Khaldoun 2022-10-14 22:00:48 UTC
Created attachment 183055 [details]
Lam-Alef and Lam-Hamza issue and Splitting singles words
Comment 9 Khaldoun 2022-10-14 22:01:21 UTC
Forgot to mention:

Version: 7.5.0.0.alpha0+ / LibreOffice Community
Build ID: a09c5c69e3b5fbf448cae1d6c476f39067e40023
CPU threads: 8; OS: Linux 6.0; UI render: default; VCL: gtk3
Locale: en-US (en_US.utf8); UI: en-US
Calc: threaded
Comment 10 V Stuart Foote 2022-10-15 00:46:50 UTC
@Khaldoun, thanks for the analysis. 

I did notice the 1st issue. I don't know if that is a font fallback, or just a manifestation of the way the glyphs are being extracted from the PDF--where the logic for handling the glyph transformations is probably not present.

For the second, best to think of them as partial text runs or snippets. Glyphs are encoded into the PDF with no sense of source script. We filter import them (using poppler libs) into LibreOffice as just a run of text, all lexical context is missing. Normal break iterators are not parsed even if present.  They end up recorded into the draw canvas as text box objects--disjointed by which glyphs get strung together.

So, given the coarseness of the filter import, just getting them into the correct RTL sequence (for bug 104597) is a great improvement.  Assembling them into lexically useful strings, sentences and paragraphs is work still to be done. The work done for bug 118370 is not doing well at assembling the RTL textboxes; I suspect it needs additional logic to do so.
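
To make the snippet-assembly problem concrete, here is a minimal sketch of merging adjacent text snippets on one line into larger runs. It is not the logic used for bug 118370 or anything taken from the LibreOffice source; the `Snippet` type, the `merge_rtl_snippets` function, and the `max_gap` threshold are all hypothetical, and it assumes each snippet already holds logical-order text and that all snippets share one baseline.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    text: str     # logical-order text of the snippet
    x: float      # left edge on the page
    width: float  # horizontal extent


def merge_rtl_snippets(snippets, max_gap=1.0):
    """Join snippets on one line whose horizontal gap is at most max_gap.

    For RTL text the reading order starts at the right, so snippets are
    visited from the largest right edge to the smallest.
    """
    ordered = sorted(snippets, key=lambda s: s.x + s.width, reverse=True)
    merged = []
    for snip in ordered:
        if merged:
            prev = merged[-1]
            gap = prev.x - (snip.x + snip.width)
            if gap <= max_gap:
                # Extend the previous run leftwards to cover this snippet.
                right_edge = prev.x + prev.width
                merged[-1] = Snippet(prev.text + snip.text, snip.x, right_edge - snip.x)
                continue
        merged.append(snip)
    return merged


parts = [
    Snippet("\u0627\u0645", x=40.0, width=10.0),  # second half of a word
    Snippet("\u0633\u0644", x=50.0, width=10.0),  # first half, further right
    Snippet("\u0645\u0646", x=10.0, width=10.0),  # separate word, far left
]
runs = merge_rtl_snippets(parts)
```

A fixed gap threshold is the simplest possible choice; a real filter would likely need to scale it by font size and to handle multiple lines and mixed-direction text.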

I'm interested in Khaled's take on things at this juncture.
Comment 11 Eyal Rozenberg 2022-10-15 11:09:03 UTC
The first attachment ("PDF sample file with Arabic text") is already kind of scrambled to begin with. Specifically, observe how, on line 2, the % sign overlaps the two alef characters.

Also, the text is not in the Arabic language, and I doubt it is properly in any language.

So, let's please start with a proper PDF document (with Arabic, or Farsi or whatever), then analyze any problems.
Comment 12 V Stuart Foote 2022-11-20 19:32:34 UTC
The primary issue of the reversed text runs is corrected for the 7.4.3 release, with additional work in master against a 7.5 release.

Any residual formatting or conversion problems with extracted RTL text runs should be opened as new issues against 7.4.3.

*** This bug has been marked as a duplicate of bug 104597 ***