Download it now!
Bug 104597 - Text runs of RTL scripts (e.g. Arabic, Hebrew, Persian) from imported PDF are reversed, PDFIProcessor::mirrorString not behaving
Summary: Text runs of RTL scripts (e.g. Arabic, Hebrew, Persian) from imported PDF are...
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: filters and storage (show other bugs)
Version:
(earliest affected)
4.1 all versions
Hardware: All All
: highest major
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: bibisected, bisected, filter:pdf, regression
: 84797 89471 97131 115802 116318 125951 131222 134180 (view as bug list)
Depends on:
Blocks: RTL-CTL PDF-Import-Draw
  Show dependency treegraph
 
Reported: 2016-12-12 08:24 UTC by Pourrabi
Modified: 2021-07-18 17:25 UTC (History)
25 users (show)

See Also:
Crash report or crash signature:


Attachments
Screen copy showing the problem (208.30 KB, image/png)
2016-12-12 08:25 UTC, Pourrabi
Details
PDF File (184.13 KB, application/pdf)
2016-12-12 11:49 UTC, Pourrabi
Details
arabic RTL3.pdf: shows when mirroring changed from paragraphs to individual characters (11.60 KB, application/pdf)
2019-07-29 13:19 UTC, Justin L
Details
tdf104597-not-perfect-patch.diff (1.76 KB, patch)
2021-07-16 04:59 UTC, Kevin Suo
Details
screenshot showing the PDF open in draw, with the "not perfect" patch applied (279.54 KB, image/png)
2021-07-16 05:00 UTC, Kevin Suo
Details
rtl.pdf (a simplified RTL file which can be used as a unittest case) (6.18 KB, application/pdf)
2021-07-16 05:01 UTC, Kevin Suo
Details
Suggested solution (4.38 KB, text/plain)
2021-07-17 13:19 UTC, Armin Le Grand
Details
revvised solution with adapting width during concatenation (6.14 KB, text/plain)
2021-07-18 11:59 UTC, Armin Le Grand
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Pourrabi 2016-12-12 08:24:14 UTC
Hi;
When open a Persian PDF document, words document are reverse
Comment 1 Pourrabi 2016-12-12 08:25:43 UTC
Created attachment 129516 [details]
Screen copy showing the problem
Comment 2 Xisco Faulí 2016-12-12 10:59:39 UTC
Hello alireza,

Thank you for reporting the bug. Please attach a sample document, as this makes it easier for us to verify the bug. 
I have set the bug's status to 'NEEDINFO'. Please change it back to 'UNCONFIRMED' once the requested document is provided.
(Please note that the attachment will be public, remove any sensitive information before attaching it. 
See https://wiki.documentfoundation.org/QA/FAQ#How_can_I_eliminate_confidential_data_from_a_sample_document.3F for help on how to do so.)
Comment 3 Pourrabi 2016-12-12 11:49:32 UTC
Created attachment 129523 [details]
PDF File
Comment 4 Buovjaga 2016-12-14 13:55:04 UTC
Already broken in at least 4.4.5.
Works mostly correctly in 3.5. There are some issues here and there.

Win 7 Pro 64-bit Version: 5.4.0.0.alpha0+
Build ID: ba6b35fc68a01aff72b39eb7809bacb326068668
CPU Threads: 4; OS Version: Windows 6.1; UI Render: default; 
TinderBox: Win-x86@39, Branch:master, Time: 2016-12-13_06:07:39
Locale: fi-FI (fi_FI); Calc: group

Version: 4.4.5.2
Build ID: a22f674fd25a3b6f45bdebf25400ed2adff0ff99
Locale: fi_FI

LibreOffice 3.5.0rc3 
Build ID: 7e68ba2-a744ebf-1f241b7-c506db1-7d53735
Comment 5 Xisco Faulí 2016-12-14 14:14:41 UTC
Reproduced in

Version: 4.1.0.0.alpha1+
Build ID: a2c9d4f8bbde97f175bae4df771273a61251f40

but not in

Version 4.1.0.0.alpha0+ (Build ID: efca6f15609322f62a35619619a6d5fe5c9bd5a)
Comment 6 Xisco Faulí 2016-12-14 14:52:34 UTC
Regression introduced by:

author	Thorsten Behrens <tbehrens@suse.com>	2013-03-20 23:37:28 (GMT)
committer	Thorsten Behrens <tbehrens@suse.com>	2013-03-20 23:54:16 (GMT)
commit	ff140bb6b8b109f14c270ff059f0b8d71dab5d6c (patch)
tree	4fee2384e2937f167943c37233e6098b9d064a11
parent	dd0db92a174f6a4da1ada3de17cb869264be9342 (diff)

Remove StringMirror UNO service.
This was a kludge from back in the day when pdfimport was an
extension and could not link against office libs.

While at it, fix mirror method to handle unicode surrogates
correctly.

Adding Cc: to Thorsten Behrens
Comment 7 QA Administrators 2017-12-15 07:45:52 UTC Comment hidden (obsolete)
Comment 8 Xisco Faulí 2018-02-28 09:15:07 UTC
*** Bug 115802 has been marked as a duplicate of this bug. ***
Comment 9 QA Administrators 2019-03-01 03:49:58 UTC Comment hidden (obsolete, spam)
Comment 10 Justin L 2019-06-21 19:18:20 UTC
Probably a duplicate of bug 89471, but this one at least has a bisect commit to look at, so I don't want to lose it in the messy comments of another bug.
Comment 11 Xisco Faulí 2019-06-25 09:30:50 UTC
@Mark Hung, I thought you might be interested in this issue...
Comment 12 Xisco Faulí 2019-06-25 09:36:56 UTC
*** Bug 89471 has been marked as a duplicate of this bug. ***
Comment 13 V Stuart Foote 2019-06-25 12:37:55 UTC
Not clear if the (isRTL) is applied logically to do the mirror::String() [1]? I get confused walking through the code. Where we check for RTL [2] checking character by character with GetCharacterClassification () [3] using the ICU lib  getUnicodeDirection() by Unicode block [4] to classify the text run.

So the string should be mirrored/reversed when written out to document canvas. But something is misapplied.

=-refs-=

[1] https://opengrok.libreoffice.org/xref/core/sdext/source/pdfimport/tree/pdfiprocessor.cxx?r=58e31745#699

[2] https://opengrok.libreoffice.org/xref/core/sdext/source/pdfimport/tree/drawtreevisiting.cxx?r=8b663bc8&h=91#108

[3] https://opengrok.libreoffice.org/xref/core/i18nutil/source/utility/scripttypedetector.cxx?r=97844a11&h=42#50

[4] https://opengrok.libreoffice.org/xref/core/i18nutil/source/utility/unicode.cxx?r=c3c620c8&h=81#81
Comment 14 Justin L 2019-06-29 17:00:25 UTC
(In reply to V Stuart Foote from comment #13)
> Not clear if the (isRTL) is applied logically to do the mirror::String()?
> I get confused walking through the code. 

I agree that it doesn't seem to mirror the string.  Originally it walked backwards and now it walks forwards, but still uses appendUtf32 instead of insertUtf32.
-    for(int i = nLen - 1; i >= 0; i--)
+    sal_Int32 i = 0;
+    while(i < nLen)

However, making that change doesn't fix it, since almost nothing is even passed to the mirrorString function.
Comment 15 Justin L 2019-07-29 13:19:59 UTC
Created attachment 153035 [details]
arabic RTL3.pdf: shows when mirroring changed from paragraphs to individual characters

(In reply to Justin L from comment #14)
> However, making that change doesn't fix it, since almost nothing is even
> passed to the mirrorString function.
Back then, the entire textbox was passed to the mirrorString function. Using insertUtf does fix it when compiling the 4.1 codebase. (make dev-install with gcc 4.6 on Ubuntu 12.04, and running install/program/soffice.)

The paragraphElement switched to being composed of mostly individual character elements (which don't get mirrored) instead of large chunks of text in LO 5.0 commit ee21771db0292315ff3e1b87ff58294335106bd3
Author: Vort <vvort@yandex.ru>
Date:   Fri Jan 16 18:29:40 2015 +0200
    fdo#88465 PDF Import: fix scale and rotate image transformations
Comment 16 vvort 2019-07-29 13:48:02 UTC
Are you sure that image transformations are related to text processing?
Comment 17 Justin L 2019-07-29 13:56:39 UTC
(In reply to vvort from comment #16)
> Are you sure that image transformations are related to text processing?
You can bibisect with comment 15's document to verify.

I don't know the hows or the whys, but that commit changed the text processing call to mirrorString from a single call mirroring all 17 characters

warn:JCL:17625:1:sdext/source/pdfimport/tree/pdfiprocessor.cxx:773: ::mirrorString[\u0642\u0641\u0632}}}\u0644\u0627\u062b\u0639\u0644\u0628\u0644\u0627\u0628\u0646\u064a][17]
warn:JCL:17625:1:sdext/source/pdfimport/tree/pdfiprocessor.cxx:782: ---returning mirrored string[\u0642\u0641\u0632{{{\u0644\u0627\u062b\u0639\u0644\u0628\u0644\u0627\u0628\u0646\u064a]

into two calls mirroring just a couple of characters

warn:JCL:17340:1:sdext/source/pdfimport/tree/pdfiprocessor.cxx:731: ::mirrorString[\u0644\u0627][2]
warn:JCL:17340:1:sdext/source/pdfimport/tree/pdfiprocessor.cxx:740: ---returning mirrored string[\u0644\u0627]
warn:JCL:17340:1:sdext/source/pdfimport/tree/pdfiprocessor.cxx:731: ::mirrorString[\u0644\u0627][2]
warn:JCL:17340:1:sdext/source/pdfimport/tree/pdfiprocessor.cxx:740: ---returning mirrored string[\u0644\u0627]
Comment 18 vvort 2019-07-30 11:52:38 UTC
Characters are processed one by one because DrawXmlOptimizer::optimizeTextElements() fails to concatenate TextElements.
Inside of that function check for transformation equality exists: rCurGC.Transformation == rNextGC.Transformation.
If it is commented out, brackets inside test file becomes equal.
But result is still differs from PDF.
Comment 19 Justin L 2019-08-01 08:22:37 UTC
(In reply to vvort from comment #18)
> DrawXmlOptimizer::optimizeTextElements() fails to concatenate TextElements
because before notTransformed() previously returned true, and now it returns false. (same line in the code, but just a clarification since rCurGC.Transformation was never equal to rNextGC.Transformation in this context.)
Comment 20 vvort 2019-08-01 11:17:40 UTC
I was fixing scale in my commit.
So, maybe, scaling for text elements should be skipped.
While staying intact for images.
More investigation is needed.
Comment 21 vvort 2019-08-01 11:50:40 UTC
Another possibility is that transformation check may be not needed at all now since all characters inside paragraph have the same transformation (except for translation part).
Comment 22 vvort 2019-08-02 14:08:29 UTC
TextElements are created just in one place: inside processGlyphLine function. Also there they are placed inside new FrameElement.
Characters, which are processed inside processGlyphLine call are having the same transformation because of checks inside drawGlyphs call.
So I think that transformation check inside optimizeTextElements call should be removed. Alongside with notTransformed function.
But, sadly, this can still produce side effects, because lot of poorly tested semi-dead code of pdfimport may come to life.
Comment 23 Fahad Al-Saidi 2019-11-27 04:41:19 UTC
This bug affects my workflow a lot. I hope someone can fix it. Thanks in advance.
Comment 24 Xisco Faulí 2019-11-29 12:50:48 UTC
Changing priority to 'high' since the number of duplicates is 5 or higher
Comment 25 Xisco Faulí 2019-11-29 12:51:31 UTC
*** Bug 84797 has been marked as a duplicate of this bug. ***
Comment 26 Xisco Faulí 2019-11-29 12:51:37 UTC
*** Bug 97131 has been marked as a duplicate of this bug. ***
Comment 27 Xisco Faulí 2019-11-29 12:51:41 UTC
*** Bug 116318 has been marked as a duplicate of this bug. ***
Comment 28 Xisco Faulí 2019-11-29 12:51:46 UTC
*** Bug 125951 has been marked as a duplicate of this bug. ***
Comment 29 Eyal Rozenberg 2019-11-29 18:10:32 UTC
> Changing priority to 'high' since the number of duplicates is 5 or higher

Thank you; this is very significant to us.

Quoting myself from one of the dupes:
https://bugs.documentfoundation.org/show_bug.cgi?id=89471#c12

As a member of the LibreOffice RTL QA "team", I would like to point out this is currently one of the most significant RTL bugs in LibreOffice, as we see it (there was an actual discussion rating prominence of RTL bugs on our Telegram group). This bug essentially means LibreOffice cannot be used to work with PDFs with RTL content.
Comment 30 Jack 2019-11-30 10:16:48 UTC
Five years, seven bug reports, three major version number increments, and many replies later, finally this is being taken seriously.
Comment 31 Xisco Faulí 2019-12-02 13:40:57 UTC
Changing priority to 'highest' since this a regression and the number of duplicates is higher than 5 or the number of people in CC higher than 20
Comment 32 V Stuart Foote 2020-03-08 15:19:23 UTC
*** Bug 131222 has been marked as a duplicate of this bug. ***
Comment 33 Alex Cohn 2020-03-18 14:53:57 UTC Comment hidden (obsolete)
Comment 34 Alex Cohn 2020-03-18 14:54:10 UTC Comment hidden (obsolete)
Comment 35 V Stuart Foote 2020-06-24 12:28:18 UTC
*** Bug 134180 has been marked as a duplicate of this bug. ***
Comment 36 V Stuart Foote 2020-06-24 14:05:04 UTC
*** Bug 131222 has been marked as a duplicate of this bug. ***
Comment 37 V Stuart Foote 2020-10-04 15:11:54 UTC
@Thorsten, *

Artem & Justin had zeroed in on the mishandling--testing single chars rather than text run strings--and now rarely triggering the mirrorstring helper.

As an alternative method for establishing RTL logical order from the visual order of the processed PDF stream, Khaled had suggested in bug 89471 the ICU BiDi libs (ubidi)provides an 'inverse' option that could convert the RTL text runs back to correct logical order [1][2] for better fidelity to the PDF streams.

=-ref-=
described...
[1] https://unicode-org.github.io/icu/userguide/transforms/bidi.html

API...
https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ubidi_8h.html
Comment 38 Sergey 2020-11-26 07:06:31 UTC
Dear developers team,
Hope this is the right place for my massage.
I have a problem opening pdf file in Hebrew, the text is flipped. 
LibreOffice really good tool, I like it very much !!!
Is it possible to fix the issue with opening Hebrew text? It will be very helpful!!! Pleeeeeeease ...... :-)
Comment 39 Buovjaga 2020-11-26 08:18:25 UTC
(In reply to Sergey from comment #38)
> Dear developers team,
> Hope this is the right place for my massage.
> I have a problem opening pdf file in Hebrew, the text is flipped. 
> LibreOffice really good tool, I like it very much !!!
> Is it possible to fix the issue with opening Hebrew text? It will be very
> helpful!!! Pleeeeeeease ...... :-)

No, not the right place. Please open a new report for your issue.
Comment 40 Justin L 2020-11-26 08:32:28 UTC
(In reply to Buovjaga from comment #39)
> No, not the right place. Please open a new report for your issue.

Actually, it sounds like the identical issue to me. So a new report would just be flagged as a duplicate.

This bug report has been highlighted for over a year in the ESC minutes as one of the most important bugs. Removing Assigned status since there appears to have been no activity. Perhaps Thorsten is the "right person" to fix it, but obviously he doesn't have enough personal time to allocate to this.

This will either take a motivated RTL community developer, or more likely a corporate sponsor to get fixed.
Comment 41 Alex Cohn 2020-11-26 12:26:40 UTC
@Justin is there an established way to look for such sponsor?
Comment 42 Buovjaga 2020-11-26 13:32:54 UTC
(In reply to Alex Cohn from comment #41)
> @Justin is there an established way to look for such sponsor?

There is no established way. Ideally the sponsors would come looking instead of the other way around.
Comment 43 shmuel.globus@gmail.com 2021-02-21 19:46:19 UTC
I am just waiting for this issue to be resolved. It's top priority for many users. It really handicaps the use of LibreOffice in Israel and the whole Middle East.
Comment 44 V Stuart Foote 2021-07-15 02:22:56 UTC
@Kevin, given your interest and recent work on the pdfio filter the munged RTL text runs are a real annoyance when opening a PDF into Draw. Seems like an ICU libs bidirectional sense for RTL scripts would be a viable approach. Perhaps have a look... Stuart
Comment 45 Kevin Suo 2021-07-15 04:58:23 UTC
(In reply to V Stuart Foote from comment #44)

Well, below is my observation these days. I may be wrong, but I think these are helpful to those who are interested:

1. The "mirrorString" function is never hit because isRTL is always false.

2. The "isRTL" is always false because the following code never returned true:
'''
    if( xCC.is() )
    {
        for(int i=1; i< elem.Text.getLength(); i++)
        {
            css::i18n::DirectionProperty nType = static_cast<css::i18n::DirectionProperty>(xCC->getCharacterDirection( str, i ));
            if ( nType == css::i18n::DirectionProperty_RIGHT_TO_LEFT           ||
                 nType == css::i18n::DirectionProperty_RIGHT_TO_LEFT_ARABIC    ||
                 nType == css::i18n::DirectionProperty_RIGHT_TO_LEFT_EMBEDDING ||
                 nType == css::i18n::DirectionProperty_RIGHT_TO_LEFT_OVERRIDE
                )
                isRTL = true;
        }
    }
'''
https://opengrok.libreoffice.org/xref/core/sdext/source/pdfimport/tree/drawtreevisiting.cxx?#110

3. This "if" block never returned true because "getCharacterDirection" in XCharacterClassification requires an OUString with a length > 1, whereas only one Arabic character is passed to it. If the length is 1 which is the same as the nPos, then this function returns 0 directly without doing any detection of the direction (it is impossible to detect the direction if only one Arabic character is provided).
See:
https://opengrok.libreoffice.org/xref/core/i18npool/source/characterclassification/cclass_unicode.cxx?#128

4. The reason why only one Arabic character is passed to getCharacterDirection may be due to the sdext.pdfimport code failed to combine those single characters (as produced by xpdfimport process) into a string.

Below is the sample ODF XML stream generated by the sdext pdfimport code (using a pdf file with only one line content "لوحة المفاتيح العربية"):

'''
<draw:text-box>
    <text:p text:style-name="paragraph11">
        <text:span text:style-name="text13"> ة </text:span>
        <text:span text:style-name="text13"> ي </text:span>
        <text:span text:style-name="text13"> ب </text:span>
        <text:span text:style-name="text13"> ر </text:span>
        <text:span text:style-name="text13"> ع </text:span>
        <text:span text:style-name="text13"> ل </text:span>
        <text:span text:style-name="text13"> ا </text:span>
        <text:span text:style-name="text13">
            <text:s text:c="1" text:style-name="text13"> </text:s>
        </text:span>
        <text:span text:style-name="text13"> ح </text:span>
        <text:span text:style-name="text13"> ي </text:span>
        <text:span text:style-name="text13"> ت </text:span>
        <text:span text:style-name="text13"> ا </text:span>
        <text:span text:style-name="text13"> ف </text:span>
        <text:span text:style-name="text13"> م </text:span>
        <text:span text:style-name="text13"> ل </text:span>
        <text:span text:style-name="text13"> ا </text:span>
        <text:span text:style-name="text13">
            <text:s text:c="1" text:style-name="text13"> </text:s>
        </text:span>
        <text:span text:style-name="text13"> ة </text:span>
        <text:span text:style-name="text13"> ح </text:span>
        <text:span text:style-name="text13"> و </text:span>
        <text:span text:style-name="text13"> ل </text:span>
    </text:p>
</draw:text-box>
'''

As we can see above, this produced a lot of text.span with the same style name. This makes the ODF XML stream huge before it is imported into Draw (this may be the reason why pdfimport is very slow and memory/CPU consuming for large PDFs). 

The pdfimport code was intended to combine these characters in to a single string, see
https://opengrok.libreoffice.org/xref/core/sdext/source/pdfimport/tree/drawtreevisiting.cxx?#698
where it says it will "concatenate consecutive text elements unless there is a font or text color or matrix change, leave a new span in that case". 
However, it must have been failed to do so. It failed because, based on my observation, in the following code block:
'''
                if( (pCur->FontId == pNext->FontId || isSpaces(pNext)) &&
                    rCurGC.FillColor.Red == rNextGC.FillColor.Red &&
                    rCurGC.FillColor.Green == rNextGC.FillColor.Green &&
                    rCurGC.FillColor.Blue == rNextGC.FillColor.Blue &&
                    rCurGC.FillColor.Alpha == rNextGC.FillColor.Alpha &&
                    (rCurGC.Transformation == rNextGC.Transformation || notTransformed(rNextGC))
                    )
'''
all the other conditions are true, except the "rCurGC.Transformation == rNextGC.Transformation".

Until now I am still not sure why rCurGC.Transformation does not equal to rNextGC.Transformation while they should be the same.

In sum, the mirrorString code would be reached if the single characters are successfully combined into a string.
Comment 46 Kevin Suo 2021-07-15 10:54:14 UTC Comment hidden (obsolete)
Comment 47 Kevin Suo 2021-07-15 10:58:15 UTC
Further info:

If xpdf generated the following output:
drawChar 462.400000 770.989000 466.900000 770.989000 1.000000 0.000000 0.000000 1.000000 12.000000 ة

then sdext pdfimport will produce the following Transformation values, in the order in the metrics: (assume this is the rCurGC):

            (0,0)   (0,1)   (0,2)       (1,0)   (1,1)       (1,2)
---------------------------------------------------------------------
rCurGC:     1200    0       46240       0       1200        5674.08

If xpdf generated the following: 
drawChar 466.900000 770.989000 469.828000 770.989000 1.000000 0.000000 0.000000 1.000000 12.000000 ي

then in sdext pdfimport the Transformation values are: (assume this is the rNextGC):
            (0,0)   (0,1)   (0,2)       (1,0)   (1,1)       (1,2)
---------------------------------------------------------------------
rNextGC:    1200    0       46690       0       1200        5674.08

Apparently rCurGC.Transformation != rNextGC.Transformation. The different is in the (0,2): one is 46240, the other one is 46690. What are these two values? the position of the characters on the page??

Below is the full output of rCurGC.Transformation and rNextGC.Transformation:
(this is generated by adding a SAR_WARN above the if block in DrawXmlOptimizer::optimizeTextElements
in file drawtreevisiting.cxx:
                std::cout << "rCurGC: " << rCurGC.Transformation.get(0,0) << " " << rCurGC.Transformation.get(0,1) << " " << rCurGC.Transformation.get(0,2) << " ";
                std::cout << rCurGC.Transformation.get(1,0) << " " << rCurGC.Transformation.get(1,1) << " " << rCurGC.Transformation.get(1,2) << std::endl;

rCurGC: 1200 0 46240 0 1200 5674.08
rCurGC: 1200 0 46690 0 1200 5674.08
rCurGC: 1200 0 46980.4 0 1200 5674.08
rCurGC: 1200 0 47070.4 0 1200 5674.08
rCurGC: 1200 0 47659.6 0 1200 5674.08
rCurGC: 1200 0 48130 0 1200 5674.08
rCurGC: 1200 0 48370 0 1200 5674.08
rCurGC: 1200 0 48618.4 0 1200 5674.08
rCurGC: 1200 0 49007.2 0 1200 5674.08
rCurGC: 1200 0 49637.2 0 1200 5674.08
rCurGC: 1200 0 49927.6 0 1200 5674.08
rCurGC: 1200 0 50218 0 1200 5674.08
rCurGC: 1200 0 50496.4 0 1200 5674.08
rCurGC: 1200 0 50806 0 1200 5674.08
rCurGC: 1200 0 51276.4 0 1200 5674.08
rCurGC: 1200 0 51524.8 0 1200 5674.08
rCurGC: 1200 0 51773.2 0 1200 5674.08
rCurGC: 1200 0 52153.6 0 1200 5674.08
rCurGC: 1200 0 52603.6 0 1200 5674.08
rCurGC: 1200 0 53093.2 0 1200 5674.08

As you can see, all other values are the same, but the value in position (0,1) of the metrics is increasing one by one. I think this value should not be used to determine whether these characters should be combined into a string.

This is beyond my knowledge as it involves the basegfx::B2DHomMatrix staff which I know nothing, so it need an expert to investigate.
Comment 48 Thorsten Behrens (allotropia) 2021-07-15 11:07:03 UTC
(In reply to Kevin Suo from comment #47)
> Apparently rCurGC.Transformation != rNextGC.Transformation. The different is
> in the (0,2): one is 46240, the other one is 46690. What are these two
> values? the position of the characters on the page??
> 

The last column (0,2 and 1,2) in that matrix roughly corresponds to x,y translations (not precisely, since the full formula for it is e.g for the first coordinate component x'=(0,0)x + (0,1)y + (0,2))
Comment 49 Armin Le Grand 2021-07-15 15:59:12 UTC
It's a 3x3 homogen transformation matrix as known from linear algebra and used in CG systems. Usually only 3x2 is used since the last line may host perspective change stuff, but usually not used in 2D CG.
You may look in any book or google for it. It is a combination of Scale, Shear, Rotate and Translate. It can be decomposed to these (use decompose() at the matrix) to get the single parts.

To make it short,
    rCurGC: 1200 0 46240 0 1200 5674.08
means: scale(1200, 1200), then translate(46240, 5674.08), no rotation/shear used here (that would be multipied into the values, so the two zeros indicate this). This describes what to do when you imagine the character being at (0,0) and of size(1,1) (unit coordinates) to get it to where it is need to be shown. In case of no rotate/shear, it's like a rectangle (0,0,46240, 5674.08).

That these rectangles increase in X just describes that the next defined character is right of the one before - you may substract the X-Values to get the distances. So - that values are correct from the definition(s).

Objects are without shear/rotate, so the note line
    // line and space optimization; works only in strictly horizontal mode
is okay - that may be adapted to also be able to concatenate with back-transformations, just a hint, that limitation is theoretically not needed, but would make things more complicated...

When you substract the X-Values to get the character widths you will note these are not 1200 -> due to scale for fonts has special meaning. Y-Scale indeed is the char height, but X-Scale is defined 'relative' to Y-Scale, so X==Y here means no unregular font scaling (some systems also use a zero for X-Scale in that case).

Thus indeed the comparison
    rCurGC.Transformation == rNextGC.Transformation
is nonsense - to be true, the current character would have to be at the same position and of the same size (and rotate/shear if used) as the next one which would mean they are defined to cover each other.

What would need to be compared is

    X(curr) + currCharWitdh <-similar-> to X(next)

with the problem where to get currCharWitdh from (and to add some 'unsharpness', see same file line 586, the multiply with 0.75).

This can be done when the Font is already available, by setting that font at an temp OutDev, setting the FontWidth(1200) and getting the TextLength of a String with that single character.

Maybe I overlook something (nobody is perfect), but I would have to debug & see if that works with latin text - and why. From looking at the current code I am not sure this ever did something...?
Comment 50 Alex Cohn 2021-07-15 19:38:02 UTC
> X(curr) + currCharWitdh <-similar-> to X(next)

only note that for an RTL sequence, this will probably be 

  X(curr) - currCharWitdh <-similar-> to X(next)
Comment 51 V Stuart Foote 2021-07-15 20:43:11 UTC
(In reply to Alex Cohn from comment #50)
> > X(curr) + currCharWitdh <-similar-> to X(next)
> 
> only note that for an RTL sequence, this will probably be 
> 
>   X(curr) - currCharWitdh <-similar-> to X(next)

Unfortunately that would not work, for RTL to render correctly they have to be laid down in the XML in the reverse order from the lexical text run. 

That is the glitch, as the code that detected the RTL for each text run-- bounded by ICU iterators--is broken, and the XML is fed the glyphs in the wrong reversed order. Text runs get encoded into the Draw canvas in the reversed sequence. And, there seems to be issues with the correct ICU word and sentence iterators.
Comment 52 Kevin Suo 2021-07-16 00:50:05 UTC
(In reply to V Stuart Foote from comment #51)
As I said, if all the following evaluate true:
'''
                if( (pCur->FontId == pNext->FontId || isSpaces(pNext)) &&
                    rCurGC.FillColor.Red == rNextGC.FillColor.Red &&
                    rCurGC.FillColor.Green == rNextGC.FillColor.Green &&
                    rCurGC.FillColor.Blue == rNextGC.FillColor.Blue &&
                    rCurGC.FillColor.Alpha == rNextGC.FillColor.Alpha &&
                    (rCurGC.Transformation == rNextGC.Transformation || notTransformed(rNextGC))
                    )

then the chharacters will be combined into strings and then the mirror code would be triggered which will reverse the reversed char sequence, thus produce correct ones in the XML. 

So the issue now is how to modify the last condition "(rCurGC.Transformation == rNextGC.Transformation || notTransformed(rNextGC))" to correctly detect the current char is of the same transformation thus they should be combined into string.
Comment 53 Kevin Suo 2021-07-16 04:59:07 UTC
Created attachment 173616 [details]
tdf104597-not-perfect-patch.diff

Good news is that we are almost there.

With the attached modification to the code, the RTL text are now shown in correct order.

However, the modification has deleted the "(rCurGC.Transformation == rNextGC.Transformation || notTransformed(rNextGC))" condition directly. This may be wrong. Need to find a special case when some text within a line has transformation even though the characters in this line all have the same font id and the same fill color. This may be very rare. I tested many PDFs with this patch applied on current master, all of them are displaying perfect.

As a result, the attached may be committed to master in my opinion, because this makes Draw to be usable for RTL pdf files. In case someone find a special case when the removable of transformation may have caused a regression, then we investigate that matter separately.
Comment 54 Kevin Suo 2021-07-16 05:00:31 UTC
Created attachment 173617 [details]
screenshot showing the PDF open in draw, with the "not perfect" patch applied
Comment 55 Kevin Suo 2021-07-16 05:01:50 UTC
Created attachment 173618 [details]
rtl.pdf (a simplified RTL file which can be used as a unittest case)
Comment 56 Armin Le Grand 2021-07-16 08:08:19 UTC
To comment 53: I would not recommend to remove the transformation comparison. I have made bad experiences with guesses like 'will nearly never happen'. That often is true, but when you multiply that probablility with LO being used by hundreds of millions -> it WILL happen. Just my 2ct...

To the problem: This is an opportunity to do it correct and rotate/shear- independent. Use linear algebra as it is intended...

We need to compare if the start point of the next charbox (C) (lower left) is touching the vector starting from the start point of the current charbox (A) to the lower right of that box (B).
To get (A) we can multiply Point(0.0,0.0) with current matrix
To get (B) we can multiply Point(1.0,0.0) with current matrix
To get (C) we can multiply Point(0.0,0.0) with next matrix

Now we need to check if (C) touches Vector (A,B) and if it does e.g. inside the right third of it (need a hand-crafted unsharp-value here due to the mentioned special Scale-role of the FontScale given in the matrices, usually a decent 'guessed' CharWidth is 0.7 of CharHeight)

You can use basegfx::B2DVector/B2DPoint for all that. You may compare length of vector (A,B) with length of vector (A,C). For parallelism you may use dot product, or for using the tooling, e.g. use the method that calculates the angle between these two vectors.

That test should go to a subroutine, will work rotate/shear independent and even when no transformation is there (due to then having the unit matrix).
Comment 57 Armin Le Grand 2021-07-16 14:15:04 UTC
Ah - sorry, one note to that: Of course since internally the Y-Axis goes down, to get the lower-left edge you would need to multiply Point(0,1) with the matrices - but in the given example working with the Vectors at the top of the text rectangles would work just the same.
And if that was not clear - this is the clue with matrices - you can transform ANY coordinate pair in unit coordinates [0.0 .. 1.0] by multiplication with the matrix to it's 'target' coordinates, this 'magically' applies scale,shear,rotate and translate. So - e.g. to get the center of that object, use Point(0.5, 0.5), multiply -> voila
Comment 58 Kevin Suo 2021-07-17 08:50:44 UTC
(In reply to Armin Le Grand from comment #57)
Could you prepare the revised patch?
Comment 59 Armin Le Grand 2021-07-17 13:19:35 UTC
Created attachment 173646 [details]
Suggested solution

Okay, will do - checked for compile & run, also loaded the reduced example. Please check/experiment/adapt as needed, I can not say if the result is correct, sorry, my language knowledge is limited :-)
Comment 60 Kevin Suo 2021-07-18 02:35:28 UTC
(In reply to Armin Le Grand from comment #59)
Thank you for your suggested patch. I checked, but it does not work for some chars.

For instance:

On The current master there is a test file named "testFontFeatures.pdf" in sdext/source/pdfimport/test/testdocs. This file is used in the CppunitTest_sdext_pdfimport (test case file is in sdext/source/pdfimport/test/tests.cxx, the test name is testFontFeatures).

If you run 'make CppunitTest_sdext_pdfimport', you can see the output in workdir/CppunitTest/sdext_pdfimport.test.log.

From the log, you can see in line 384 that, for the line "Times New Roman Normal" as shown in the PDF, the chars "Tim" are concatenated, but "e" and "s" (as in "Times" are not; the chars "Ne" (as in "New) are concatenated, but "w" is not.
(please note that in the log file the concatenated "Tim" are shown in three different lines because there is another problem in the code which added new-line characters in the output, but this does not have impact on this RTL bug, thus can be addressed separately)

Also, from the log file you can also see from line 691 that all those Chinese characters are not concatenated.

Could you please take a look and revise your patch?
Comment 61 Kevin Suo 2021-07-18 04:48:52 UTC
The failing to concatenate "Tim" and "e" may due to the issue as discussed here:
https://gitlab.freedesktop.org/poppler/poppler/-/issues/418
Comment 62 Armin Le Grand 2021-07-18 10:21:08 UTC
Thanks for the clear description, made it very effective to start debugging.
I guess I found it -> I implied that rCurGC.Transformation gets *adapted* when conactenation happens in the true case, but it gets not. It always seems to stay in it's initial form, describing just the 1st character.
If this adaption is/was not needed for further processing it's pure luck, but wrong. Thinking about how to effectively adapt rCurGC.Transformation in case of concatenation...
Comment 63 Armin Le Grand 2021-07-18 11:12:53 UTC
Adapting rCurGC.Transformation as needed streches texts in result -> so not intended (?), width scale *has* to stay relative to fontScaling. So need to internally add a correction value used to adapt to already concatenated distances while concatenation continues. Trying that...
Comment 64 Armin Le Grand 2021-07-18 11:59:27 UTC
Created attachment 173657 [details]
revvised solution with adapting width during concatenation

Have to hand in a value that describes the already used expansion of the current char snippet, as explained could not change/adapt the involved transformation. Also adapted the unsharp decision/compare value to 0.77 now - saw cases where 0.75 was at the edge. It may even be that this needs adaption e.g. vertical decisions for chinese...? I do not know. Also possible that spaces may need extra caveat due to having eventually different FontSize?
Comment 65 V Stuart Foote 2021-07-18 14:09:31 UTC
@Armin, Kevin, *

(In reply to Armin Le Grand from comment #64)
> ... It may even be that this needs
> adaption e.g. vertical decisions for chinese...? I do not know. Also
> possible that spaces may need extra caveat due to having eventually
> different FontSize?

This is great work!  But ultimately doesn't this require more finesse than simply testing presence of a "gap" to end concatenation and signify the next span to be passed out to mirror? And guess the off baseline positioning of comment 61 might need to be handled.

When concatenating with color, font or transformation check as now, what happens at pucntuations, or kashida internal to the imported text run?  Or brackets/parenthesis that should bound the span--does the opening and closing match the script?  

Restoring the transformation tests in drawtreevisiting.cxx will close this issue. But would think testing against full range of script appropriate ICU word break iterators could be the trigger to end of the text run concatenation being passed to the mirror string action. 

Also maybe test for sentence break iterators?  Beside acting as a word bound, are they kept with the span being mirrored? And do they end up placed appropriately for the RTL scripts?

Maybe including logic to test enclosing parenthesis or bracketing--to get the beginning and ending in the correct position.

The ICU BiDi libs were not very robust when the import filter was laid down in OOo era for i90800 (see also).

Also, don't similar things need to be done for RTL the Writer PDF import filter (writertreevisiting.cxx/.hxx)?
Comment 66 Kevin Suo 2021-07-18 15:01:59 UTC
I am testing locally.But V Stuart Foote could you upload some a simple doc covering most of the cases you have mentioned?

I have already tweaked to combine the space chars (which will result in the using of the same font as its previous non-space char), and I also managed to remove the line-break (\n) char. The concatenating of Chinese chars is still an issue, but I am a Chinese, so I am thinking about this but I need some time to understand the staff suggested by Armin Le Grand first.
Comment 67 Armin Le Grand 2021-07-18 15:18:19 UTC
@Kevin, Stuart: It's not really a gap - a pic would say more than 1000 words, but...
- if next char starts at the same pos (test before) it covers current -> bad
- it's about 'guessing' the width of curr char
- if curr char is tall -> 20% start value may be good
- if curr is wide -> 80% may be good
- so, together, when between that, it may be a good guess
With my current approach we check for 23% to 100% and horizontally which should be good. Extending to check for less than 80% would be easy, too.

Much better but more complicated would be to use the width of the current char -> it's not in the matrix, so would need an OutDev, the font, setting it and asking for width of that char, then using that width to create better - but still unsharp - bounds to test against. Unsharp due to we do not know if e.g. using block text layout how big the gaps btw single chars which should be concatenated really are :-(
Comment 68 V Stuart Foote 2021-07-18 17:17:01 UTC
(In reply to Kevin Suo from comment #66)
> I am testing locally.But V Stuart Foote could you upload some a simple doc
> covering most of the cases you have mentioned?
> 

The 1945 UN Charter, Chapter IV, Article 13 provides a multi-lingual testing paragraph ususally with some bracketing.  Can find it in PDF form for many languages, or as web content that can be printed to PDF.

=-=-=

https://www.un.org/en/about-us/un-charter/full-text

https://unic.un.org/aroundworld/unics/en/whatWeDo/productsAndServices/publications/index.asp?callPage=products&category=1
Comment 69 Kevin Suo 2021-07-18 17:25:06 UTC
(In reply to Armin Le Grand from comment #67)

No, the concatenation will only be applied to TextElements, and only apply if the following conditions are met:

1. The current char and next char use the same font, or the next char is a space char. Same font means the same font family, *font size*, bold/italic/outline feature etc. AND
2. The current and next chars use the same font color. AND
3. The current char and next char has the same transformation, or the next char is a blank char.

See code below:

'''
    if( (isSpaces(pNext) || pCur->FontId == pNext->FontId) &&
        rCurGC.FillColor.Red == rNextGC.FillColor.Red &&
        rCurGC.FillColor.Green == rNextGC.FillColor.Green &&
        rCurGC.FillColor.Blue == rNextGC.FillColor.Blue &&
        rCurGC.FillColor.Alpha == rNextGC.FillColor.Alpha &&
        ( isSpaces(pNext) ||
          canConcatenate(rCurGC.Transformation, rNextGC.Transformation, fAlreadyExpandedDistance)
        )
'''

In this regard, there actually no need to detect if the height of the two chars are the same. There is also no need to detect if the next object is an image or something else. The only issue here is the transformation. The transformation of font in my opinion may be something like rotation.