Bug 119606 - PDF: Arabic text gets deformed when creating a PDF in LibreOffice Writer (Linux-only)
Status: RESOLVED NOTOURBUG
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Printing and PDF export
Version (earliest affected): 5.4.6.2 release
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: PDF-Export
Reported: 2018-08-30 12:25 UTC by vaaydayaasra
Modified: 2022-09-24 07:13 UTC (History)

See Also:
Crash report or crash signature:


Attachments
PDF created with LO 5.4.6.2 where textual content is garbled (10.56 KB, application/pdf)
2018-08-30 12:26 UTC, vaaydayaasra
Test PDF with various fonts (243.45 KB, application/pdf)
2022-09-15 06:30 UTC, ⁨خالد حسني⁩

Description vaaydayaasra 2018-08-30 12:25:49 UTC
Description:
Creating a PDF from a document written in the Arabic script deforms the textual content of the document, although it looks fine on the screen.

For example, see the attached PDF created with Writer 5.4.6.2 on Ubuntu 17.10, where the example sentence "اشترى بلال خمسة آلاف كتاب وَأَنَا اشْتَرَيْتُهَا مِنْهُ" looks as it should, but when you view it with any PDF reader, such as evince, copying the text deforms most of the words. Some characters are clearly visible but cannot be selected or searched (such as ى at the end of the first word اشترى). If I search for the second word بلال, evince tells me there are no matches in the document. The same happens when converting the file with pdftotext, which produces the following output:

اشتر

للا مسة لفا كتاب وَأَنَا ْ
ه
اشت َ َريْتُهَا ِ
من ْ ُ

Here only two of the eight words are intact; the rest are garbled in one way or another. If the text is in Latin script, both evince and pdftotext behave as expected, meaning that the textual content is transferred correctly from Writer to the PDF.

On LO 6.0.3.2 on Ubuntu 18.04, the textual content is preserved a little better but it is still quite garbled. This is the output from pdftotext:

ه
اشترى للا خمسة آفا كتاب وَأنَا اشْ ت َ َريْتُهَا ِ
من ْ ُ

Here four out of the eight words are intact, and for example the last word of the sentence is divided so that the last full character is found on the first line and the rest on the third line. Some diacritics are found where they are supposed to be, some others not.

MS Word 2007 handles this case better, although it's not perfect either. This is the output from pdftotext:

اشترى بالل خمسة آالف كتاب وأنا اشتريتها منه

Here all diacritics are dropped and every sequence of ل (U+0644) + ا (U+0627) is reversed, turning لا into ال. Otherwise the sentence is intact.
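
The relationship between the original text and this kind of output can be checked mechanically. The following is only a minimal Python sketch, assuming that the only differences are dropped diacritics and swapped ل + ا pairs; the helper names are hypothetical and not part of any tool mentioned in this report.

```python
import unicodedata

LAM, ALEF = "\u0644", "\u0627"

def strip_marks(text: str) -> str:
    """Drop combining marks, which include the Arabic short-vowel diacritics."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))

def swap_lam_alef(text: str) -> str:
    """Turn every lam + alef sequence into alef + lam."""
    return text.replace(LAM + ALEF, ALEF + LAM)

def matches_after_known_damage(original: str, extracted: str) -> bool:
    """True if 'extracted' is 'original' minus diacritics, with lam-alef swapped."""
    return swap_lam_alef(strip_marks(original)) == extracted

# The second word of the example sentence and the form pdftotext returned for it.
original_word = "\u0628\u0644\u0627\u0644"   # beh, lam, alef, lam
extracted_word = "\u0628\u0627\u0644\u0644"  # beh, alef, lam, lam
print(matches_after_known_damage(original_word, extracted_word))  # True
```

A check like this only covers the two kinds of damage described above; it says nothing about words that are split across lines.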

This bug was first reported on Launchpad for LO 5.4.6.2 on Ubuntu 17.10 at: https://bugs.launchpad.net/ubuntu/+source/libreoffice/+bug/1772439 . After my initial report, I have upgraded to LO 6.0.3.2 where the problem persists, although the actual output is different. Another user on Launchpad confirmed the bug on LO 6.0.3.2, as well.

Steps to Reproduce:
1. In a new Writer document, type some text in Arabic. My example sentence was: اشترى بلال خمسة آلاف كتاب وَأَنَا اشْتَرَيْتُهَا مِنْهُ
2. Create a PDF.
3. Open the created PDF with a PDF reader (such as evince) and type one of the words in the Search dialog, e.g. بلال. Alternatively select the word in the PDF reader and copy-paste it somewhere else. You can also convert the PDF to text using a utility like pdftotext.

Actual Results:
The PDF reader reports there are no matches for some of the words in the document, although they are all clearly visible. Selecting and copy-pasting a word garbles it. The output of pdftotext is garbled.

Expected Results:
All the words that are visible should also be searchable in a PDF reader, copy-pasting should preserve the text, and the output of pdftotext should match the original document.
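
The comparison between expected and actual results can be scripted. This is a rough Python sketch, assuming pdftotext (from poppler-utils) is on the PATH; the function names and the file name in the usage note are hypothetical. It reports which words of the original sentence cannot be found as substrings of the extracted text, which roughly mirrors a failed search in a PDF reader.

```python
import subprocess
from typing import List

def extract_text(pdf_path: str) -> str:
    """Run pdftotext on the file; '-' as output name sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")

def missing_words(expected: str, extracted: str) -> List[str]:
    """Return the expected words that do not occur anywhere in the extracted text."""
    return [word for word in expected.split() if word not in extracted]
```

For example, `missing_words(sentence, extract_text("test.pdf"))` would return an empty list if every word of the sentence survived the export intact.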


Reproducible: Always


User Profile Reset: No



Additional Info:
Comment 1 vaaydayaasra 2018-08-30 12:26:47 UTC
Created attachment 144554
PDF created with LO 5.4.6.2 where textual content is garbled
Comment 2 Buovjaga 2018-09-23 15:50:15 UTC
Reproduced. Searching the PDF only succeeds with individual glyphs.

Arch Linux 64-bit
Version: 6.2.0.0.alpha0+
Build ID: 8b1501d80dc9d3f42c351c6e026fa737e116cae5
CPU threads: 8; OS: Linux 4.18; UI render: default; VCL: gtk3_kde5; 
Locale: fi-FI (fi_FI.UTF-8); Calc: threaded
Built on 23 September 2018
Comment 3 QA Administrators 2019-09-24 03:09:47 UTC Comment hidden (obsolete)
Comment 4 vaaydayaasra 2019-10-07 15:55:37 UTC
Still reproducible on:

Version: 6.3.2.2
Build ID: libreoffice-6.3.2.2-snap1
CPU threads: 4; OS: Linux 4.15; UI render: default; VCL: gtk3; 
Locale: fi-FI (fi_FI.UTF-8); UI-Language: en-US
Calc: threaded

pdftotext's output is again different from my initial report but it's still garbled:

أ
ه
ن
ا م
ه
ت
ي
ر
ا اشْت
و ن
اشترى بالل خمسة آالف كتاب َ

This time the beginning of the sentence (found on the last line of the output) is already quite good, though ل and ا in the ligature لا are reversed. Thus on evince بالل matches بلال. The end of the sentence where there are diacritical vowel marks is worse than in my initial report.
Comment 5 QA Administrators 2021-10-07 03:53:30 UTC Comment hidden (obsolete)
Comment 6 vaaydayaasra 2022-03-01 18:37:20 UTC
The problem seems to have been resolved on LO 7.3.0.3 on Windows 10. To test PDF output this time, I used Adobe Acrobat DC 2021.011.20039 64-bit. I haven't tested on Linux, where the problem initially appeared.

Version: 7.3.0.3 (x64) / LibreOffice Community
Build ID: 0f246aa12d0eee4a0f7adcefbf7c878fc2238db3
CPU threads: 4; OS: Windows 10.0 Build 19044; UI render: Skia/Raster; VCL: win
Locale: fr-FR (fr_FR); UI: fr-FR
Calc: CL
Comment 7 Buovjaga 2022-03-02 13:25:39 UTC
Unfortunately still reproduced on Linux

Arch Linux 64-bit
Version: 7.4.0.0.alpha0+ / LibreOffice Community
Build ID: 8f2b1b1cb84e1ae3139eb90b8efdf61e608adbad
CPU threads: 8; OS: Linux 5.16; UI render: default; VCL: kf5 (cairo+xcb)
Locale: fi-FI (fi_FI.UTF-8); UI: en-US
Calc: threaded Jumbo
Built on 24 February 2022
Comment 8 ⁨خالد حسني⁩ 2022-09-15 06:26:15 UTC
This depends heavily on the font and the PDF viewer used, and on the limitations of the PDF format.

We are doing our best with what the PDF format gives us: we output a ToUnicode mapping when applicable and ActualText tagging when not. We try to limit the scope of ActualText spans so that individual characters and words can be selected and highlighted. Alternatively, we could tag full paragraphs with ActualText, which would preserve the textual content with the most fidelity, but then PDF viewers would treat the paragraph text as a black box and could no longer associate the text with the rendered glyphs (so search results can’t be highlighted, parts of the paragraph can’t be selected, and so on).
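
To illustrate the difference between the two mechanisms, here is a simplified Python model of how a text extractor might combine them. It is a sketch under stated assumptions, not LibreOffice’s export code or any viewer’s actual logic: the `Span` type, the `extract` function and the glyph names are all hypothetical, and glyphs are listed in logical order to keep the example short.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Span:
    glyphs: List[str]                  # hypothetical glyph names
    actual_text: Optional[str] = None  # replacement text for the whole span

def extract(spans: List[Span], to_unicode: Dict[str, str]) -> str:
    """Prefer a span's ActualText; otherwise map glyph by glyph via ToUnicode."""
    out = []
    for span in spans:
        if span.actual_text is not None:
            out.append(span.actual_text)
        else:
            # Unmapped glyphs become U+FFFD, the replacement character.
            out.extend(to_unicode.get(g, "\ufffd") for g in span.glyphs)
    return "".join(out)

# One glyph can map to several characters, so a lam-alef ligature is fine.
to_unicode = {
    "beh.init": "\u0628",
    "lam_alef.liga": "\u0644\u0627",
    "lam.fina": "\u0644",
}
word = Span(["beh.init", "lam_alef.liga", "lam.fina"])

# A font that draws one character with two glyphs has no sensible per-glyph
# mapping, so the span carries ActualText instead.
split_beh = Span(["beh.dotless", "dot.below"], actual_text="\u0628")

print(extract([word], to_unicode) == "\u0628\u0644\u0627\u0644")  # True
print(extract([split_beh], to_unicode) == "\u0628")               # True
```

The wider the span covered by `actual_text`, the less a viewer can tell which glyph corresponds to which character, which is the trade-off described above.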

PDF is not an archival format, no matter how hard Adobe wants to sell this idea; it is first and foremost a print format, glorified paper so to speak.

We are crippled by several issues here:

* Text in PDF is output in visual order (i.e. from left to right), while the text content is stored in logical order (the first character comes first in memory, regardless of the direction). This means any tool extracting text from PDF needs to convert the visual order back to logical order, and this process is lossy and not always reliable.

* PDF stores glyphs, not characters, so we need to handle all the complex glyph-to-character relationships; that is why the result depends on the font.

* Not all PDF viewers support ActualText tagging, and the ToUnicode mechanism can’t capture all the possible relations above.

* PDF viewers will often try to guess where the spaces are, since many PDF-producing tools don’t output space characters at all (they just position the glyphs so that they are separated visually by blank space), so sometimes kerning can be misrepresented as word spaces.
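
The first point in the list above can be illustrated with a small Python sketch. It is a deliberately simplified model, not the Unicode bidirectional algorithm and not what any particular viewer does; the function names are hypothetical. It shows why simply reversing a left-to-right glyph run recovers pure Arabic text but breaks as soon as the line contains digits.

```python
import re

def naive_logical(visual: str) -> str:
    """Reverse a left-to-right run, assuming it is purely right-to-left text."""
    return visual[::-1]

def digit_aware_logical(visual: str) -> str:
    """Reverse the run, then restore digit sequences, which read left to right."""
    return re.sub(r"[0-9]+", lambda m: m.group()[::-1], visual[::-1])

logical = "\u0643\u062A\u0627\u0628 123"  # the word for "book" followed by 123
visual = "123 \u0628\u0627\u062A\u0643"   # the same line as laid out, left to right

print(naive_logical(visual) == logical)        # False: the digits come out as 321
print(digit_aware_logical(visual) == logical)  # True
```

Every additional case (Latin words, brackets, nested directions) needs another guess of this kind, which is why the recovery is lossy.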

Overall I don’t think there is anything that can be done here, but if someone can attach a PDF that is doing better, I can try to have a look and see if we can learn some trick from it.

Lastly, none of this is platform-dependent; if you are getting different results on different platforms, it will be because of either the different fonts or the different PDF viewers used.
Comment 9 ⁨خالد حسني⁩ 2022-09-15 06:30:31 UTC
Created attachment 182459
Test PDF with various fonts

Here is a test PDF and here is the extracted text:

Adobe Acrobat Reader DC:
اش ترى بلال خمسة آلاف كتاب وَ أَنَا اشْ تَر يَْتُهَا مِنْهُ
اشترى بلال خمسة آلاف كتاب وَ أَ نَا اشْتَرَيْتُهَا مِنْهُ
اشترى بلال خمسة آلاف كتاب وَ أَنَا اشْتَرَيْتُهَا مِنْهُ
اشترى بلال خمسة آلاف كتاب وَأَنَا اشْتَرَيْتُهَا مِنْهُ
اشترى بلال خمسة آلاف كتاب و أََ ناَ اشْترَيَتْهُاَ مِنهُْ
اشترى بلال خمسة آلاف كتاب وَأَنَا اشْتَرَيْتُهَا مِنْهُ
اش ترى بلال خمسة آلاف كتاب وَأَ نَا اشْ تَر يَْ تُه اَ مِنْهُ
نْ هُ هَا مِ
تُْ رَ ي
شْ تَ ا لف ت ك ا ب أََ و نَ ا ا
آ
مسة
خ
ا ش ترى ب ا لل
اشترى بلال خمسة آلاف كتاب وَأَنَا اشْتَرَيْتُهَا مِنْهُ
اشترى بلال خمسة آلاف كتاب وَ أَنَا اشْتَرَيْتُهَا مِنْهُ
اشترى بلال خمسة آلاف كتاب وَأَنَا اشْتَرَيْتُهَا مِنْهُ
نْهُ ا مِ نَا اشْ تَر يَْتُهَ
اش ترى بلال خمسة آلاف كتاب وَ أَ
اشترى بلال خمسة آلاف كتاب وَ أَنَا اشْتَرَيْتُهَا مِنْهُ
اشترى بلال خمسة آلاف كتاب وَأَنَا اشْتَرَيْتُهَا مِنْهُ
اشترى بلال خمسة آلاف كتاب وَ أَنَا اشْترََيتُْهَا مِنْهُ
اشترى بلال خمسة آلاف كتاب وَأَنَا اشْتَرَيْتُهَا مِنْهُ
ا ش تر ى بلال خ مس ة آلا ف ك تا ب وَ أَ نَا ا شْ تَر يَْ تُهَا مِ نْهُ


Apple’s Preview:
ا ش تَر ى ب لا ل خ م س ة آ لا ف ك ت ا ب َو أَ َ ن َ ا ا ْش َتَر ْي ُت َه ا ِم ن ْ ُه َ
اشترىبلالخمسةآلافكتاب َوَأََناا ْشَتَرْيُت َها ِمْنُه
اشترى بلال خمسة آلاف كتاب َو َأَ َنا ا ْش َت َر ْي ُت َها ِم ْن ُه
اشترى بلال خمسة آلاف كتاب َوأَنَا ا ْشتَ َر ْي ُت َها ِم ْن ُه
اشترى بلال خمسة آلاف كتاب وَأَنَا ا ْشتَرَيْتُهَا مِنْهُ
اشترىبلالخمسةآلافكتاب َوأََنَاا ْشتَرَيْتُ َها ِمنْ ُه
اشتَرىبلالخمسة􏰀أَلافكتاب َوأََنَااْشََتَرْيُتُهَاَ ِمْنُه 􏰃􏰁 خ􏰁􏰆􏰀􏰁 َ􏰄􏰀َْ􏰃َ􏰁َُْ􏰁َ􏰀ُْ
شت شت ي ا رى􏰅ياللمسهاالفكنا􏰅بواياا ر􏰂تهاِمنه
اشترى بلال خمسة آلاف كتاب َوأَنَا ا ْش َت َريْ ُت َها ِم ْن ُه
اشترىبلالخمسةآلافكتاب َوَأََنااْشَتَرْيُتَها ِمْنُه
اشترىبلالخمسةآلافكتاب َوَأَنااْشَتََرْيُتَها ِمْنُه َََََُُْْْ
اشتَرى بلال خمسة آلاف كتاب وأَنا اشتَريتها ِمنه
اشترى بلال خمسة آلاف كتاب َو َأَ َنا ا ْش َت َر ْي ُت َها ِم ْن ُه
اشترى بلال خمسة آلاف كتاب َوَأََنا ا ْشَتَرْيُت َها ِمْن ُه
اشترى بلال خمسة آلاف كتاب وَأََنَا ا ْشتَ َريْتُهَا ِمنْ ُه
اشترى بلال خمسة آلاف كتاب َوَأَنَا اشْتَرَيْتُهَا مِنْهُ
اشتَرى يلال خمسه أَلا ف كنا ب و َأَيا ا ْشتَرينها ِمن ُه  ََ َََُْ ْ

Firefox PDF viewer:

َيْتُهَا مِنْهُ َ تَر ْ نَا اشَ أَ َ ى بلال خمسة آلاف كتاب وتَر اش
ُ
نَ ا اشْ تَ رَ يْ تُ هَ ا مِ نْ ه َأَ َ اشترى بلال خمسة آلاف كتاب و
ُ
نَا اشْ تَرَيْتُهَ ا مِنْهَأَ َ اشترى بلال خمسة آلاف كتاب و
ُ
اشترى بلال خمسة آلاف كتاب وَأَنَا اشْ تَرَيْتُهَا مِنْه
ُ
نَا اشْ تَرَيْتُهَا مِنْه  أَ َ اشترى بلال خمسة آلاف كتاب و
ُ
نَا اشْ تَرَيْتُهَ ا مِنْهَ أَ َ اشترى بلال خمسة آلاف كتاب و
ُ
َا مِنْه ُ تُه ْ َي َتَر ْ اش َ نَا  أَ َ لاف كتاب و أَ ى بلال خمسةتَر اش
ُ
ْ ه ن ُِ هَ ا مت ْ ي ََ رت ْ شَ ا ا  ي
َ ا َو  بانك  فال ا همس خ الل  ي رىت شا
ُ
اشترى بلال خمسة آلاف كتاب وَأَنَا اشْ تَرَيْتُهَا مِنْه
ُ
نَا اشْ تَرَيْتُهَا مِ نْهَ أَ َ اشترى بلال خمسة آلاف كتاب و
ُ
اشترى بلال خمسة آلاف كتاب وَ أَنَ ا اشْ تَرَ يْتُهَ ا مِنْه
ُ
َ يْتُهَ ا مِ نْه َ تَر ْ نَا اشَ أَ َ ى بلال خمسة آلاف كتاب وتَر اش
ُ
نَا اشْ تَرَيْتُهَا مِ نْهَ أَ َ اشترى بلال خمسة آلاف كتاب و
ُ
نَ ا اشْ تَ رَ يْ تُ هَ ا مِ نْ ه َ أَ َ اشترى بلال خمسة آلاف كتاب و
ُ
نَا اشْ تَرَيْتُهَا مِنْهَأَ َ اشترى بلال خمسة آلاف كتاب و
ُ
اشْتَرَيْتُهَا مِنْه َان َ أَ َ و كتاب آلاف خمسة بلال اشترى
ُ
ْهن ِ ُهَ ا منْ يَ َتَر ْ  شَا ايَ  أَ َ و  بانك  فلاأَ همس  خ لالي ىتَر  شا

Chrome PDF viewer:
ْيُت َه ِ ا منْ ُه
نَ ْ ا اشرَتَ
اشرت َ ى بالل خمسة آالف كتاب وَأ
َه ِ ا مْنُه
ُْت
ي
َ
َن ْ ا اشَتر
َ اشترى بالل خمسة آالف كتاب وَأ
ْ ُتَه ْ ا مِنُه
ي
َ
َ اشترى بالل خمسة آالف كتاب وَأَن ْ ا اشَتر
ه
تَه ْ ا مِنُ
ُْ
ي
َ
ْ نَا اشتَر
َ
َ اشترى بالل خمسة آالف كتاب وأ
اشترى بلال خمسة آلاف كتاب وََأنَا اشْت َرَيْتُه َا مِنْهُ
اشترى بالل خمسة آالف كتاب وََأنَا اشْ تَرَيْتُهَ ا مِ نْهُ
نُْه
ْهُتَا مِ
َأاَن ْ اشرَتَي
َ
اشرتى بالل مخسة الف كتاب و
ُ
ه
ْ
ن
َ ا ِم
ه
ُ
ْت
ي
َ
ر
َ
ت
ْ
ش
َ ا ا
ن
َأ
َ
اب و
كت
ف
لا

مس
خ
شت رى بلال
ا
ُْت َه ْ ا مِنُه
ي
َ
نَ ْ ا اش َتر
َ
أ
َ
اشترى بالل خمسة آالف كتاب و
ْ ُتَه ْ ا مِنُه
ي
َ
َن ْ ا اشَتر
َأ
َ اشترى بالل خمسة آالف كتاب و
َن ْ ا اشَتَرْيُت َه ْ ا مِ نُه
َ اشترى بالل خمسة آالف كتاب وأَ
ُ
ه
ْ
ن
ِ
َا م
ه
ُ
ت
ْ
ي
َ
ْ رَت
َا اش
ن
َأ
َ
اشرتى بالل خمسة آالف كتاب و
ُْت َه ْ ا مِ نُه
َري
َ اشترى بالل خمسة آالف كتاب وَأَن ْ ا اشَت
َ اشترى بالل خمسة آالف كتاب وَأ ْ نَ ا اش َتَ رْي َتُ ه ُ ا مِ نْ ه
ه
ُْ
هَ ِ ا من
ُ
ت
ْ
َري
نَ ْ ا اشتَ
َأ
اشترى بالل خمسة آالف كتاب وَ
ُ
ه
ْ
هَا مِن
ُ
ْت
تَرَي
َأنَا اشْ
َ اشترى بالل خمسة آالف كتاب و
ه
ُ
نْن
نَها مِ
ُ ْ
ي
نَنا اش ْ رَتَ
اشرَتى بالل خمسه الف كناب َ وَأ
Comment 10 ⁨خالد حسني⁩ 2022-09-15 06:32:15 UTC
As you can see in comment 9, results vary widely across fonts and PDF viewers. Adobe’s viewer gives the best results, but some fonts still give broken results.