Bug 151788 - Imported Arabic/Persian text from PDF should go through unshaping Algorithm
Summary: Imported Arabic/Persian text from PDF should go through unshaping Algorithm
Status: RESOLVED WORKSFORME
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: filters and storage
Version (earliest affected): 7.5.0.0 alpha0+
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: RTL-CTL PDF-Import-Draw PDF-Import-Writer
Reported: 2022-10-27 14:00 UTC by Hossein
Modified: 2024-01-05 13:19 UTC (History)

See Also:
Crash report or crash signature:


Description Hossein 2022-10-27 14:00:57 UTC
Description:
Recently, an old regression that prevented Arabic/Persian/Hebrew PDF files from opening correctly in Draw was fixed by patches from Kevin Suo (thanks!).
However, there are still problems that prevent the user from editing the imported PDF file.
The main problem is that Arabic-script characters are stored in the PDF as shaped forms, chosen according to each character's position in the word. For example, if you convert the text ضمیمه to PDF, you get ﺿﻤﯿﻤﻪ, which is ﺿ ﻤ ﯿ ﻤ ﻪ instead of ض م ی م ه.

Steps to Reproduce:
1. Open attachment 129523
2. Try to edit the text in the first line

Actual Results:
The characters are the shaped (presentation) forms instead of normal Arabic/Persian characters

Expected Results:
The characters should be unshaped so that the user can edit the text


Reproducible: Always


User Profile Reset: No

Additional Info:
Happens in the LO 7.5 dev master after the fix
Comment 1 V Stuart Foote 2022-10-27 19:07:11 UTC
This would probably need to happen before the sd Textbox objects are instantiated, so it would be a final step in the PDF import filter. Once the Draw text boxes are built, it is too late to rework the text stream.
Comment 2 ⁨خالد حسني⁩ 2022-12-25 00:42:43 UTC
For the purpose of this, I think applying NFKC normalization to the text should be enough:

>>> import unicodedata
>>> unicodedata.normalize("NFKC", "ﺿ ﻤ ﯿ ﻤ ﻪ")
'ض م ی م ه'

But NFKC normalization can change text meaning for other Unicode code points, so it should be applied to Arabic Presentation Forms characters exclusively.
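
As a rough illustration of that restriction, here is a minimal sketch that applies NFKC only to characters in the two Arabic Presentation Forms blocks (U+FB50 to U+FDFF and U+FE70 to U+FEFF) and leaves everything else untouched. The function names are hypothetical and this is not the import filter's actual code. It normalizes one character at a time, so it does not address character ordering or any other import issue.

```python
import unicodedata

# Inclusive code point ranges of the Unicode Arabic Presentation Forms blocks.
_PRESENTATION_FORM_RANGES = (
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
)


def is_arabic_presentation_form(ch):
    """Return True if ch lies in one of the Arabic Presentation Forms blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in _PRESENTATION_FORM_RANGES)


def unshape_arabic(text):
    """Apply NFKC to Arabic presentation forms only (hypothetical helper).

    Characters outside the two blocks are copied through unchanged, so
    compatibility characters such as U+FB01 (fi ligature) or U+00B2
    (superscript two) keep their original code points.
    """
    return "".join(
        unicodedata.normalize("NFKC", ch) if is_arabic_presentation_form(ch) else ch
        for ch in text
    )
```

For example, `unshape_arabic("ﺿﻤﯿﻤﻪ")` should give the same letters as the NFKC call above, while a string such as `"x² ﬁ"` comes back unchanged, which plain NFKC would not guarantee.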
Comment 3 ⁨خالد حسني⁩ 2023-06-06 13:50:41 UTC
I can’t reproduce this on master or 7.5 builds; the text is imported as regular Arabic characters, not presentation forms. The first line in the PDF is copied as “ضمیمه شماره ” after import.

Tested with:

Version: 7.5.3.2 (X86_64) / LibreOffice Community
Build ID: 9f56dff12ba03b9acd7730a5a481eea045e468f3
CPU threads: 6; OS: Mac OS X 13.4; UI render: default; VCL: osx
Locale: en-EG (en_EG.UTF-8); UI: en-US
Calc: threaded

and:

Version: 7.6.0.0.alpha1+ (X86_64) / LibreOffice Community
Build ID: 244f9cf66bc36f229ccb5712bc8d80166b92266d
CPU threads: 6; OS: Mac OS X 13.4; UI render: Skia/Metal; VCL: osx
Locale: en-EG (en_EG.UTF-8); UI: en-US
Calc: threaded
Comment 6 Eyal Rozenberg 2024-01-05 12:58:47 UTC
(In reply to ⁨خالد حسني⁩ from comment #3)

I also can't reproduce with:

Version: 24.2.0.0.alpha1+ (X86_64) / LibreOffice Community
Build ID: 516f800f84b533db0082b1f39c19d1af40ab29c8
CPU threads: 4; OS: Linux 6.5; UI render: default; VCL: gtk3
Locale: he-IL (en_IL); UI: en-US

when opening in Draw.

So, closing this as WORKSFORME; but - Khaled, you are more than welcome to reopen if this shows up again or if I've misunderstood something.
Comment 7 Hossein 2024-01-05 13:19:00 UTC
I still see multiple issues with the imported file in the latest LO 24.2 dev master:

The smiley is reversed, (: becomes ):

الله (ligature) becomes هللا

In the first line, one bracket is reversed but not the other.

Parentheses in the text are reversed.

Parentheses in the links are reversed.

Version: 24.8.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 5056da285da2f130d741add1f8432cd590116a96
CPU threads: 12; OS: Linux 6.2; UI render: default; VCL: gtk3
Locale: en-US (en_US.UTF-8); UI: en-US
Calc: CL threaded

I think this issue should remain open until a fix is provided.