Description:
PDF documents opened with LO Draw show Arabic text scrambled and completely unreadable. The attached file is an example of what shows up in a document with Arabic text content. I opened the same document with Adobe Acrobat and it displayed perfectly fine. I love LO and I want it to maintain its position as the best office suite out there.

The details of my system:
Version: 7.3.3.2 / LibreOffice Community
Build ID: 30(Build:2)
CPU threads: 8; OS: Linux 5.18; UI render: default; VCL: gtk3
Locale: en-AE (en_AE.UTF-8); UI: en-US
7.3.3-3
Calc: threaded

The issue is there for V 7.3.3.2 as well.

Steps to Reproduce:
1. Save the document attached to this bug report
2. Open the document with LO Draw

Actual Results:
The Arabic text is scrambled and not readable at all.

Expected Results:
Arabic text looks normal and is readable.

Reproducible: Always
User Profile Reset: Yes
OpenGL enabled: Yes

Additional Info:
- I checked the fonts and I have the font used. I also tried changing the font, but the issue remains.
- I tried LO on both Windows and Linux, with the same issue.
- Even documents created with LO and saved as PDF have the same issue.
- Not only is the text shown scrambled and unreadable; if I save the document, the saved file is also scrambled and unreadable in any other PDF reader, including Adobe Acrobat Reader.
- IMPORTANT: The attached document was created by another office suite, then opened scrambled as described, then modified for privacy reasons and saved by LO Draw, as described in the previous point.
Created attachment 180566 [details] PDF Sample File with Arabic text.
Thanks for filing, but this is a known and long-running PDF import filter issue with RTL text runs.

*** This bug has been marked as a duplicate of bug 104597 ***
While this bug is about PDF import of RTL language text runs, it is not the same problem described in 104597. There, the problem is the reversal of order in text runs. Here we have additional problems, like character repetitions, shifting, and excessive or insufficient (horizontal) spacing. So, this is not clearly a dupe. Perhaps the fix for 104597 will resolve this one as well, but perhaps not. I think the more accurate relation between the bugs is a dependency.
Hello @Eyal Rozenberg,
How can I check the 104597 fix and decide whether this issue is solved as well? How will the new commit be delivered in a new LO version?
The 2022-10-14 nightly [1] imports the sample PDF to Draw pretty well. There are some font glitches and obvious spots where combining glyphs get separated from their root glyph. Overall it is greatly improved, but please consider that LibreOffice is *NOT* a PDF editor: the filter import to Draw produces an ODF holding sdraw text objects arranged on a document canvas.

Version: 7.5.0.0.alpha0+ (x64) / LibreOffice Community
Build ID: 8991cbb7986d3967bc6c3719d95254ff04428d1a
CPU threads: 8; OS: Windows 10.0 Build 19044; UI render: Skia/Vulkan; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: threaded

=-ref-=
[1] https://dev-builds.libreoffice.org/daily/master/
Hello @V Stuart Foote,
Thanks a lot for the link to the build. I am not asking for LO to be a PDF editor, only for it to display RTL text (Arabic in my case) properly. I can assure you that many Arabic users are not using Arabic because of such issues, which they do not face with other apps. Anyway, I downloaded the 2022-10-14 build:

Version: 7.5.0.0.alpha0+ / LibreOffice Community
Build ID: a09c5c69e3b5fbf448cae1d6c476f39067e40023
CPU threads: 8; OS: Linux 6.0; UI render: default; VCL: gtk3
Locale: en-US (en_US.utf8); UI: en-US
Calc: threaded

The text rendering is much better, but the order reversal still does not handle all the letters properly. Please see the added attachment, which describes an issue in handling two specific letter combinations. There is also an issue with a single word being split over multiple blocks rather than kept in one block.
Hello,
I agree, Draw is not a PDF editor. But Draw should still handle RTL/Arabic letters properly, which is not yet 100% fixed by this fix. In Arabic, when a "Lam" letter is followed by an "Alef" letter or a "Hamza" letter, both letters are combined into a new form/shape. This does not appear to be handled properly yet in this fix. Another issue is that Draw sometimes splits the "same word" into multiple blocks. NB: I call it a "block", but it could also be called a frame, a box, etc. I am attaching a new file that describes both issues.

IMPORTANT: This commit fixes a big portion of the issue. It deserves to go live.
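To make the Lam-Alef point concrete: Unicode defines presentation-form code points for these ligatures, and a text extractor may return either the single ligature code point or the two base letters. The following is a minimal Python sketch, assuming the extracted text contains such presentation forms. That is an assumption about the input, not a description of how the LibreOffice import filter works, and the helper name `to_base_letters` is hypothetical.

```python
import unicodedata

LAM = "\u0644"               # ARABIC LETTER LAM
ALEF = "\u0627"              # ARABIC LETTER ALEF
ALEF_HAMZA_ABOVE = "\u0623"  # ARABIC LETTER ALEF WITH HAMZA ABOVE


def to_base_letters(text: str) -> str:
    """Replace presentation-form ligatures with their base letters.

    NFKC applies Unicode compatibility decomposition, which maps a
    Lam-Alef ligature code point back to the sequence Lam + Alef.
    """
    return unicodedata.normalize("NFKC", text)


lam_alef_ligature = "\uFEFB"        # LAM WITH ALEF, isolated form
lam_alef_hamza_ligature = "\uFEF7"  # LAM WITH ALEF WITH HAMZA ABOVE, isolated form

print(to_base_letters(lam_alef_ligature) == LAM + ALEF)
print(to_base_letters(lam_alef_hamza_ligature) == LAM + ALEF_HAMZA_ABOVE)
```

If the importer kept the base letters in logical order, the font's shaping would normally form the ligature again at display time; whether that is what goes wrong here is not established in this report.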
Created attachment 183055 [details] Lam-Alef and Lam-Hamza issue and Splitting singles words
Forgot to mention:

Version: 7.5.0.0.alpha0+ / LibreOffice Community
Build ID: a09c5c69e3b5fbf448cae1d6c476f39067e40023
CPU threads: 8; OS: Linux 6.0; UI render: default; VCL: gtk3
Locale: en-US (en_US.utf8); UI: en-US
Calc: threaded
@Khaldoun, thanks for the analysis.

I did notice the 1st issue. I don't know if that is a font fallback, or just a manifestation of the way the glyphs are being extracted from the PDF--where the logic for handling the glyph transformations is probably not present.

For the second, best to think of them as partial text runs or snippets. Glyphs are encoded into the PDF with no sense of source script. We filter import them (using poppler libs) into LibreOffice as just a run of text; all lexical context is missing. Normal break iterators are not parsed even if present. They end up recorded into the draw canvas as text box objects--disjointed by which glyphs get strung together.

So, given the coarseness of the filter import, just getting them into the correct RTL sequence (for bug 104597) is a great improvement. Assembling them into lexically useful strings, sentences and paragraphs is work still to be done. The work done for bug 118370 is not doing well with assembling the RTL textboxes; I suspect it needs additional logic to do so.

I'm interested in Khaled's take on things at this juncture.
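As a rough illustration of what "getting them into the correct RTL sequence" means for a single extracted run, here is a minimal Python sketch. It assumes the run arrives in visual (left-to-right on the page) order and contains only one direction of strong characters. The function names are hypothetical, and this is not the logic used in the LibreOffice filter or in poppler; it ignores digits, combining marks and mixed-direction runs, which a real implementation of the Unicode Bidirectional Algorithm has to handle.

```python
import unicodedata

RTL_CLASSES = ("R", "AL")          # strong right-to-left bidi classes
STRONG_CLASSES = ("L", "R", "AL")  # all strong bidi classes


def is_rtl_run(run: str) -> bool:
    """True if every strongly-directional character in the run is RTL."""
    strong = [
        cls
        for cls in (unicodedata.bidirectional(ch) for ch in run)
        if cls in STRONG_CLASSES
    ]
    return bool(strong) and all(cls in RTL_CLASSES for cls in strong)


def visual_to_logical(run: str) -> str:
    """Reverse a purely RTL run from visual order into logical order."""
    return run[::-1] if is_rtl_run(run) else run


# The word "salam" in logical order is SEEN, LAM, ALEF, MEEM.
logical = "\u0633\u0644\u0627\u0645"
visual = logical[::-1]  # as it might be extracted glyph by glyph

print(visual_to_logical(visual) == logical)
print(visual_to_logical("abc") == "abc")
```

Reordering each run this way does nothing to merge the separate text boxes that make up one word; that is the second, still open, part of the problem described above.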
The first attachment ("PDF sample file with Arabic text") is already kind of scrambled to begin with. Specifically, observe how, on line 2, the % sign overlaps the two aleef characters. Also, the text is not in the Arabic language, and I doubt it is properly in any language. So, let's please start with a proper PDF document (with Arabic, or Farsi or whatever), then analyze any problems.
The primary issue of the reversed text runs is corrected for the 7.4.3 release, with additional work in master against a 7.5 release. Any residual formatting or conversion problems with extracted RTL text runs should be opened as new issues against 7.4.3.

*** This bug has been marked as a duplicate of bug 104597 ***