Bug 104597 - Text runs of RTL scripts (e.g. Arabic, Hebrew, Persian) from imported PDF are reversed, PDFIProcessor::mirrorString not behaving
Summary: Text runs of RTL scripts (e.g. Arabic, Hebrew, Persian) from imported PDF are...
Status: ASSIGNED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: filters and storage (show other bugs)
Version:
(earliest affected)
4.1 all versions
Hardware: All All
: highest major
Assignee: Thorsten Behrens (CIB)
URL:
Whiteboard:
Keywords: bibisected, bisected, filter:pdf, regression
: 84797 89471 97131 115802 116318 125951 (view as bug list)
Depends on:
Blocks: RTL-CTL PDF-Import-Draw
  Show dependency treegraph
 
Reported: 2016-12-12 08:24 UTC by Pourrabi
Modified: 2019-12-05 15:47 UTC (History)
15 users (show)

See Also:
Crash report or crash signature:


Attachments
Screen copy showing the problem (208.30 KB, image/png)
2016-12-12 08:25 UTC, Pourrabi
Details
PDF File (184.13 KB, application/pdf)
2016-12-12 11:49 UTC, Pourrabi
Details
arabic RTL3.pdf: shows when mirroring changed from paragraphs to individual characters (11.60 KB, application/pdf)
2019-07-29 13:19 UTC, Justin L
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Pourrabi 2016-12-12 08:24:14 UTC
Hi;
When open a Persian PDF document, words document are reverse
Comment 1 Pourrabi 2016-12-12 08:25:43 UTC
Created attachment 129516 [details]
Screen copy showing the problem
Comment 2 Xisco Faulí 2016-12-12 10:59:39 UTC
Hello alireza,

Thank you for reporting the bug. Please attach a sample document, as this makes it easier for us to verify the bug. 
I have set the bug's status to 'NEEDINFO'. Please change it back to 'UNCONFIRMED' once the requested document is provided.
(Please note that the attachment will be public, remove any sensitive information before attaching it. 
See https://wiki.documentfoundation.org/QA/FAQ#How_can_I_eliminate_confidential_data_from_a_sample_document.3F for help on how to do so.)
Comment 3 Pourrabi 2016-12-12 11:49:32 UTC
Created attachment 129523 [details]
PDF File
Comment 4 Buovjaga 2016-12-14 13:55:04 UTC
Already broken in at least 4.4.5.
Works mostly correctly in 3.5. There are some issues here and there.

Win 7 Pro 64-bit Version: 5.4.0.0.alpha0+
Build ID: ba6b35fc68a01aff72b39eb7809bacb326068668
CPU Threads: 4; OS Version: Windows 6.1; UI Render: default; 
TinderBox: Win-x86@39, Branch:master, Time: 2016-12-13_06:07:39
Locale: fi-FI (fi_FI); Calc: group

Version: 4.4.5.2
Build ID: a22f674fd25a3b6f45bdebf25400ed2adff0ff99
Locale: fi_FI

LibreOffice 3.5.0rc3 
Build ID: 7e68ba2-a744ebf-1f241b7-c506db1-7d53735
Comment 5 Xisco Faulí 2016-12-14 14:14:41 UTC
Reproduced in

Version: 4.1.0.0.alpha1+
Build ID: a2c9d4f8bbde97f175bae4df771273a61251f40

but not in

Version 4.1.0.0.alpha0+ (Build ID: efca6f15609322f62a35619619a6d5fe5c9bd5a)
Comment 6 Xisco Faulí 2016-12-14 14:52:34 UTC
Regression introduced by:

author	Thorsten Behrens <tbehrens@suse.com>	2013-03-20 23:37:28 (GMT)
committer	Thorsten Behrens <tbehrens@suse.com>	2013-03-20 23:54:16 (GMT)
commit	ff140bb6b8b109f14c270ff059f0b8d71dab5d6c (patch)
tree	4fee2384e2937f167943c37233e6098b9d064a11
parent	dd0db92a174f6a4da1ada3de17cb869264be9342 (diff)

Remove StringMirror UNO service.
This was a kludge from back in the day when pdfimport was an
extension and could not link against office libs.

While at it, fix mirror method to handle unicode surrogates
correctly.

Adding Cc: to Thorsten Behrens
Comment 7 QA Administrators 2017-12-15 07:45:52 UTC Comment hidden (obsolete)
Comment 8 Xisco Faulí 2018-02-28 09:15:07 UTC
*** Bug 115802 has been marked as a duplicate of this bug. ***
Comment 9 QA Administrators 2019-03-01 03:49:58 UTC Comment hidden (obsolete, spam)
Comment 10 Justin L 2019-06-21 19:18:20 UTC
Probably a duplicate of bug 89471, but this one at least has a bisect commit to look at, so I don't want to lose it in the messy comments of another bug.
Comment 11 Xisco Faulí 2019-06-25 09:30:50 UTC
@Mark Hung, I thought you might be interested in this issue...
Comment 12 Xisco Faulí 2019-06-25 09:36:56 UTC
*** Bug 89471 has been marked as a duplicate of this bug. ***
Comment 13 V Stuart Foote 2019-06-25 12:37:55 UTC
Not clear if the (isRTL) is applied logically to do the mirror::String() [1]? I get confused walking through the code. Where we check for RTL [2] checking character by character with GetCharacterClassification () [3] using the ICU lib  getUnicodeDirection() by Unicode block [4] to classify the text run.

So the string should be mirrored/reversed when written out to document canvas. But something is misapplied.

=-refs-=

[1] https://opengrok.libreoffice.org/xref/core/sdext/source/pdfimport/tree/pdfiprocessor.cxx?r=58e31745#699

[2] https://opengrok.libreoffice.org/xref/core/sdext/source/pdfimport/tree/drawtreevisiting.cxx?r=8b663bc8&h=91#108

[3] https://opengrok.libreoffice.org/xref/core/i18nutil/source/utility/scripttypedetector.cxx?r=97844a11&h=42#50

[4] https://opengrok.libreoffice.org/xref/core/i18nutil/source/utility/unicode.cxx?r=c3c620c8&h=81#81
Comment 14 Justin L 2019-06-29 17:00:25 UTC
(In reply to V Stuart Foote from comment #13)
> Not clear if the (isRTL) is applied logically to do the mirror::String()?
> I get confused walking through the code. 

I agree that it doesn't seem to mirror the string.  Originally it walked backwards and now it walks forwards, but still uses appendUtf32 instead of insertUtf32.
-    for(int i = nLen - 1; i >= 0; i--)
+    sal_Int32 i = 0;
+    while(i < nLen)

However, making that change doesn't fix it, since almost nothing is even passed to the mirrorString function.
Comment 15 Justin L 2019-07-29 13:19:59 UTC
Created attachment 153035 [details]
arabic RTL3.pdf: shows when mirroring changed from paragraphs to individual characters

(In reply to Justin L from comment #14)
> However, making that change doesn't fix it, since almost nothing is even
> passed to the mirrorString function.
Back then, the entire textbox was passed to the mirrorString function. Using insertUtf does fix it when compiling the 4.1 codebase. (make dev-install with gcc 4.6 on Ubuntu 12.04, and running install/program/soffice.)

The paragraphElement switched to being composed of mostly individual character elements (which don't get mirrored) instead of large chunks of text in LO 5.0 commit ee21771db0292315ff3e1b87ff58294335106bd3
Author: Vort <vvort@yandex.ru>
Date:   Fri Jan 16 18:29:40 2015 +0200
    fdo#88465 PDF Import: fix scale and rotate image transformations
Comment 16 vvort 2019-07-29 13:48:02 UTC
Are you sure that image transformations are related to text processing?
Comment 17 Justin L 2019-07-29 13:56:39 UTC
(In reply to vvort from comment #16)
> Are you sure that image transformations are related to text processing?
You can bibisect with comment 15's document to verify.

I don't know the hows or the whys, but that commit changed the text processing call to mirrorString from a single call mirroring all 17 characters

warn:JCL:17625:1:sdext/source/pdfimport/tree/pdfiprocessor.cxx:773: ::mirrorString[\u0642\u0641\u0632}}}\u0644\u0627\u062b\u0639\u0644\u0628\u0644\u0627\u0628\u0646\u064a][17]
warn:JCL:17625:1:sdext/source/pdfimport/tree/pdfiprocessor.cxx:782: ---returning mirrored string[\u0642\u0641\u0632{{{\u0644\u0627\u062b\u0639\u0644\u0628\u0644\u0627\u0628\u0646\u064a]

into two calls mirroring just a couple of characters

warn:JCL:17340:1:sdext/source/pdfimport/tree/pdfiprocessor.cxx:731: ::mirrorString[\u0644\u0627][2]
warn:JCL:17340:1:sdext/source/pdfimport/tree/pdfiprocessor.cxx:740: ---returning mirrored string[\u0644\u0627]
warn:JCL:17340:1:sdext/source/pdfimport/tree/pdfiprocessor.cxx:731: ::mirrorString[\u0644\u0627][2]
warn:JCL:17340:1:sdext/source/pdfimport/tree/pdfiprocessor.cxx:740: ---returning mirrored string[\u0644\u0627]
Comment 18 vvort 2019-07-30 11:52:38 UTC
Characters are processed one by one because DrawXmlOptimizer::optimizeTextElements() fails to concatenate TextElements.
Inside of that function check for transformation equality exists: rCurGC.Transformation == rNextGC.Transformation.
If it is commented out, brackets inside test file becomes equal.
But result is still differs from PDF.
Comment 19 Justin L 2019-08-01 08:22:37 UTC
(In reply to vvort from comment #18)
> DrawXmlOptimizer::optimizeTextElements() fails to concatenate TextElements
because before notTransformed() previously returned true, and now it returns false. (same line in the code, but just a clarification since rCurGC.Transformation was never equal to rNextGC.Transformation in this context.)
Comment 20 vvort 2019-08-01 11:17:40 UTC
I was fixing scale in my commit.
So, maybe, scaling for text elements should be skipped.
While staying intact for images.
More investigation is needed.
Comment 21 vvort 2019-08-01 11:50:40 UTC
Another possibility is that transformation check may be not needed at all now since all characters inside paragraph have the same transformation (except for translation part).
Comment 22 vvort 2019-08-02 14:08:29 UTC
TextElements are created just in one place: inside processGlyphLine function. Also there they are placed inside new FrameElement.
Characters, which are processed inside processGlyphLine call are having the same transformation because of checks inside drawGlyphs call.
So I think that transformation check inside optimizeTextElements call should be removed. Alongside with notTransformed function.
But, sadly, this can still produce side effects, because lot of poorly tested semi-dead code of pdfimport may come to life.
Comment 23 Fahad Al-Saidi 2019-11-27 04:41:19 UTC
This bug affects my workflow a lot. I hope someone can fix it. Thanks in advance.
Comment 24 Xisco Faulí 2019-11-29 12:50:48 UTC
Changing priority to 'high' since the number of duplicates is 5 or higher
Comment 25 Xisco Faulí 2019-11-29 12:51:31 UTC
*** Bug 84797 has been marked as a duplicate of this bug. ***
Comment 26 Xisco Faulí 2019-11-29 12:51:37 UTC
*** Bug 97131 has been marked as a duplicate of this bug. ***
Comment 27 Xisco Faulí 2019-11-29 12:51:41 UTC
*** Bug 116318 has been marked as a duplicate of this bug. ***
Comment 28 Xisco Faulí 2019-11-29 12:51:46 UTC
*** Bug 125951 has been marked as a duplicate of this bug. ***
Comment 29 Eyal Rozenberg 2019-11-29 18:10:32 UTC
> Changing priority to 'high' since the number of duplicates is 5 or higher

Thank you; this is very significant to us.

Quoting myself from one of the dupes:
https://bugs.documentfoundation.org/show_bug.cgi?id=89471#c12

As a member of the LibreOffice RTL QA "team", I would like to point out this is currently one of the most significant RTL bugs in LibreOffice, as we see it (there was an actual discussion rating prominence of RTL bugs on our Telegram group). This bug essentially means LibreOffice cannot be used to work with PDFs with RTL content.
Comment 30 Jack 2019-11-30 10:16:48 UTC
Five years, seven bug reports, three major version number increments, and many replies later, finally this is being taken seriously.
Comment 31 Xisco Faulí 2019-12-02 13:40:57 UTC
Changing priority to 'highest' since this a regression and the number of duplicates is higher than 5 or the number of people in CC higher than 20