104597 – RTL script text runs are reversed on PDF import, PDFIProcessor::mirrorString misbehaving

Bug 104597 - RTL script text runs are reversed on PDF import, PDFIProcessor::mirrorString misbehaving

Summary: RTL script text runs are reversed on PDF import, PDFIProcessor::mirrorString ...

Status:	RESOLVED FIXED

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	filters and storage (show other bugs)
Version: (earliest affected)	4.1 all versions
Hardware:	All All

Importance:	highest major
Assignee:	Not Assigned

URL:
Whiteboard:	target:7.5.0 target:7.4.3 target:7.4.4
Keywords:	bibisected, bisected, filter:pdf, regression

Duplicates (17):	84797 89471 97131 115802 116318 125951 131022 131222 134180 145265 145334 146524 149457 149516 151950 152258 152670 (view as bug list)
Depends on:	151546
Blocks:	PDF-Import-Draw RTL
	Show dependency tree / graph

Reported:	2016-12-12 08:24 UTC by Pourrabi
Modified:	2024-08-03 09:15 UTC (History)
CC List:	35 users (show)

See Also:	89471 https://bz.apache.org/ooo/show_bug.cgi?id=90800 151247 151473 149457 152083 152143 153967
Crash report or crash signature:

Attachments
Screen copy showing the problem (208.30 KB, image/png) 2016-12-12 08:25 UTC, Pourrabi	Details
PDF File (184.13 KB, application/pdf) 2016-12-12 11:49 UTC, Pourrabi	Details
arabic RTL3.pdf: shows when mirroring changed from paragraphs to individual characters (11.60 KB, application/pdf) 2019-07-29 13:19 UTC, Justin L	Details
tdf104597-not-perfect-patch.diff (1.76 KB, patch) 2021-07-16 04:59 UTC, Kevin Suo	Details
screenshot showing the PDF open in draw, with the "not perfect" patch applied (279.54 KB, image/png) 2021-07-16 05:00 UTC, Kevin Suo	Details
rtl.pdf (a simplified RTL file which can be used as a unittest case) (6.18 KB, application/pdf) 2021-07-16 05:01 UTC, Kevin Suo	Details
Suggested solution (4.38 KB, text/plain) 2021-07-17 13:19 UTC, Armin Le Grand (collabora)	Details
revvised solution with adapting width during concatenation (6.14 KB, text/plain) 2021-07-18 11:59 UTC, Armin Le Grand (collabora)	Details
tdf104597_textrun.odt (13.48 KB, application/vnd.oasis.opendocument.text) 2022-10-12 03:09 UTC, Kevin Suo	Details
output.xml (11.59 KB, text/xml) 2022-10-12 03:50 UTC, Kevin Suo	Details
Hebrew PDF file (285.58 KB, application/pdf) 2022-11-28 07:09 UTC, Hanan Sela	Details
screen shot of Hebrew pdf file (432.07 KB, image/png) 2022-11-28 07:11 UTC, Hanan Sela	Details
screen shot of Hebrew pdf file opened with evince (left) and draw (right; last build 28/11/22)) (797.77 KB, image/png) 2022-11-29 11:54 UTC, Hanan Sela	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Pourrabi 2016-12-12 08:24:14 UTC

Hi;
When open a Persian PDF document, words document are reverse

Comment 1 Pourrabi 2016-12-12 08:25:43 UTC

Created attachment 129516 [details]
Screen copy showing the problem

Comment 2 Xisco Faulí 2016-12-12 10:59:39 UTC

Hello alireza,

Thank you for reporting the bug. Please attach a sample document, as this makes it easier for us to verify the bug. 
I have set the bug's status to 'NEEDINFO'. Please change it back to 'UNCONFIRMED' once the requested document is provided.
(Please note that the attachment will be public, remove any sensitive information before attaching it. 
See https://wiki.documentfoundation.org/QA/FAQ#How_can_I_eliminate_confidential_data_from_a_sample_document.3F for help on how to do so.)

Comment 3 Pourrabi 2016-12-12 11:49:32 UTC

Created attachment 129523 [details]
PDF File

Comment 4 Buovjaga 2016-12-14 13:55:04 UTC

Already broken in at least 4.4.5.
Works mostly correctly in 3.5. There are some issues here and there.

Win 7 Pro 64-bit Version: 5.4.0.0.alpha0+
Build ID: ba6b35fc68a01aff72b39eb7809bacb326068668
CPU Threads: 4; OS Version: Windows 6.1; UI Render: default; 
TinderBox: Win-x86@39, Branch:master, Time: 2016-12-13_06:07:39
Locale: fi-FI (fi_FI); Calc: group

Version: 4.4.5.2
Build ID: a22f674fd25a3b6f45bdebf25400ed2adff0ff99
Locale: fi_FI

LibreOffice 3.5.0rc3 
Build ID: 7e68ba2-a744ebf-1f241b7-c506db1-7d53735

Comment 5 Xisco Faulí 2016-12-14 14:14:41 UTC

Reproduced in

Version: 4.1.0.0.alpha1+
Build ID: a2c9d4f8bbde97f175bae4df771273a61251f40

but not in

Version 4.1.0.0.alpha0+ (Build ID: efca6f15609322f62a35619619a6d5fe5c9bd5a)

Comment 6 Xisco Faulí 2016-12-14 14:52:34 UTC

Regression introduced by:

author	Thorsten Behrens <tbehrens@suse.com>	2013-03-20 23:37:28 (GMT)
committer	Thorsten Behrens <tbehrens@suse.com>	2013-03-20 23:54:16 (GMT)
commit ff140bb6b8b109f14c270ff059f0b8d71dab5d6c (patch)
tree 4fee2384e2937f167943c37233e6098b9d064a11
parent dd0db92a174f6a4da1ada3de17cb869264be9342 (diff)

Remove StringMirror UNO service.
This was a kludge from back in the day when pdfimport was an
extension and could not link against office libs.

While at it, fix mirror method to handle unicode surrogates
correctly.

Adding Cc: to Thorsten Behrens

Comment 7 QA Administrators 2017-12-15 07:45:52 UTC Comment hidden (noise)

** Please read this message in its entirety before responding **

To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.

There have been thousands of bug fixes and commits since anyone checked on this bug report. During that time, it's possible that the bug has been fixed, or the details of the problem have changed. We'd really appreciate your help in getting confirmation that the bug is still present.

If you have time, please do the following:

Test to see if the bug is still present with the latest version of LibreOffice from https://www.libreoffice.org/download/

If the bug is present, please leave a comment that includes the information from Help - About LibreOffice.

If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave a comment that includes the information from Help - About LibreOffice.

Please DO NOT

Update the version field
Reply via email (please reply directly on the bug tracker)
Set the bug's Status field to RESOLVED - FIXED (this status has a particular meaning that is not
appropriate in this case)

If you want to do more to help you can test to see if your issue is a REGRESSION. To do so:
1. Download and install oldest version of LibreOffice (usually 3.3 unless your bug pertains to a feature added after 3.3) from http://downloadarchive.documentfoundation.org/libreoffice/old/

2. Test your bug
3. Leave a comment with your results.
4a. If the bug was present with 3.3 - set version to 'inherited from OOo';
4b. If the bug was not present in 3.3 - add 'regression' to keyword

Feel free to come ask questions or to say hello in our QA chat: https://kiwiirc.com/nextclient/irc.freenode.net/#libreoffice-qa

Thank you for helping us make LibreOffice even better for everyone!

Warm Regards,
QA Team

MassPing-UntouchedBug

Comment 8 Xisco Faulí 2018-02-28 09:15:07 UTC

*** Bug 115802 has been marked as a duplicate of this bug. ***

Comment 9 QA Administrators 2019-03-01 03:49:58 UTC Comment hidden (noise)

** Please read this message in its entirety before responding **

To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.

If you have time, please do the following:

Test to see if the bug is still present with the latest version of LibreOffice from https://www.libreoffice.org/download/

If the bug is present, please leave a comment that includes the information from Help - About LibreOffice.

If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave a comment that includes the information from Help - About LibreOffice.

Please DO NOT

2. Test your bug
3. Leave a comment with your results.
4a. If the bug was present with 3.3 - set version to 'inherited from OOo';
4b. If the bug was not present in 3.3 - add 'regression' to keyword

Feel free to come ask questions or to say hello in our QA chat: https://kiwiirc.com/nextclient/irc.freenode.net/#libreoffice-qa

Thank you for helping us make LibreOffice even better for everyone!

Warm Regards,
QA Team

MassPing-UntouchedBug

Comment 10 Justin L 2019-06-21 19:18:20 UTC

Probably a duplicate of bug 89471, but this one at least has a bisect commit to look at, so I don't want to lose it in the messy comments of another bug.

Comment 11 Xisco Faulí 2019-06-25 09:30:50 UTC

@Mark Hung, I thought you might be interested in this issue...

Comment 12 Xisco Faulí 2019-06-25 09:36:56 UTC

*** Bug 89471 has been marked as a duplicate of this bug. ***

Comment 13 V Stuart Foote 2019-06-25 12:37:55 UTC

Not clear if the (isRTL) is applied logically to do the mirror::String() [1]? I get confused walking through the code. Where we check for RTL [2] checking character by character with GetCharacterClassification () [3] using the ICU lib  getUnicodeDirection() by Unicode block [4] to classify the text run.

So the string should be mirrored/reversed when written out to document canvas. But something is misapplied.

=-refs-=

[1] https://opengrok.libreoffice.org/xref/core/sdext/source/pdfimport/tree/pdfiprocessor.cxx?r=58e31745#699

[2] https://opengrok.libreoffice.org/xref/core/sdext/source/pdfimport/tree/drawtreevisiting.cxx?r=8b663bc8&h=91#108

[3] https://opengrok.libreoffice.org/xref/core/i18nutil/source/utility/scripttypedetector.cxx?r=97844a11&h=42#50

[4] https://opengrok.libreoffice.org/xref/core/i18nutil/source/utility/unicode.cxx?r=c3c620c8&h=81#81

Comment 14 Justin L 2019-06-29 17:00:25 UTC

(In reply to V Stuart Foote from comment #13)
> Not clear if the (isRTL) is applied logically to do the mirror::String()?
> I get confused walking through the code. 

I agree that it doesn't seem to mirror the string.  Originally it walked backwards and now it walks forwards, but still uses appendUtf32 instead of insertUtf32.
-    for(int i = nLen - 1; i >= 0; i--)
+    sal_Int32 i = 0;
+    while(i < nLen)

However, making that change doesn't fix it, since almost nothing is even passed to the mirrorString function.

Comment 15 Justin L 2019-07-29 13:19:59 UTC

Created attachment 153035 [details]
arabic RTL3.pdf: shows when mirroring changed from paragraphs to individual characters

(In reply to Justin L from comment #14)
> However, making that change doesn't fix it, since almost nothing is even
> passed to the mirrorString function.
Back then, the entire textbox was passed to the mirrorString function. Using insertUtf does fix it when compiling the 4.1 codebase. (make dev-install with gcc 4.6 on Ubuntu 12.04, and running install/program/soffice.)

The paragraphElement switched to being composed of mostly individual character elements (which don't get mirrored) instead of large chunks of text in LO 5.0 commit ee21771db0292315ff3e1b87ff58294335106bd3
Author: Vort <vvort@yandex.ru>
Date:   Fri Jan 16 18:29:40 2015 +0200
    fdo#88465 PDF Import: fix scale and rotate image transformations

Comment 16 vvort 2019-07-29 13:48:02 UTC

Are you sure that image transformations are related to text processing?

Comment 17 Justin L 2019-07-29 13:56:39 UTC

(In reply to vvort from comment #16)
> Are you sure that image transformations are related to text processing?
You can bibisect with comment 15's document to verify.

I don't know the hows or the whys, but that commit changed the text processing call to mirrorString from a single call mirroring all 17 characters

warn:JCL:17625:1:sdext/source/pdfimport/tree/pdfiprocessor.cxx:773: ::mirrorString[\u0642\u0641\u0632}}}\u0644\u0627\u062b\u0639\u0644\u0628\u0644\u0627\u0628\u0646\u064a][17]
warn:JCL:17625:1:sdext/source/pdfimport/tree/pdfiprocessor.cxx:782: ---returning mirrored string[\u0642\u0641\u0632{{{\u0644\u0627\u062b\u0639\u0644\u0628\u0644\u0627\u0628\u0646\u064a]

into two calls mirroring just a couple of characters

warn:JCL:17340:1:sdext/source/pdfimport/tree/pdfiprocessor.cxx:731: ::mirrorString[\u0644\u0627][2]
warn:JCL:17340:1:sdext/source/pdfimport/tree/pdfiprocessor.cxx:740: ---returning mirrored string[\u0644\u0627]
warn:JCL:17340:1:sdext/source/pdfimport/tree/pdfiprocessor.cxx:731: ::mirrorString[\u0644\u0627][2]
warn:JCL:17340:1:sdext/source/pdfimport/tree/pdfiprocessor.cxx:740: ---returning mirrored string[\u0644\u0627]

Comment 18 vvort 2019-07-30 11:52:38 UTC

Characters are processed one by one because DrawXmlOptimizer::optimizeTextElements() fails to concatenate TextElements.
Inside of that function check for transformation equality exists: rCurGC.Transformation == rNextGC.Transformation.
If it is commented out, brackets inside test file becomes equal.
But result is still differs from PDF.

Comment 19 Justin L 2019-08-01 08:22:37 UTC

(In reply to vvort from comment #18)
> DrawXmlOptimizer::optimizeTextElements() fails to concatenate TextElements
because before notTransformed() previously returned true, and now it returns false. (same line in the code, but just a clarification since rCurGC.Transformation was never equal to rNextGC.Transformation in this context.)

Comment 20 vvort 2019-08-01 11:17:40 UTC

I was fixing scale in my commit.
So, maybe, scaling for text elements should be skipped.
While staying intact for images.
More investigation is needed.

Comment 21 vvort 2019-08-01 11:50:40 UTC

Another possibility is that transformation check may be not needed at all now since all characters inside paragraph have the same transformation (except for translation part).

Comment 22 vvort 2019-08-02 14:08:29 UTC

TextElements are created just in one place: inside processGlyphLine function. Also there they are placed inside new FrameElement.
Characters, which are processed inside processGlyphLine call are having the same transformation because of checks inside drawGlyphs call.
So I think that transformation check inside optimizeTextElements call should be removed. Alongside with notTransformed function.
But, sadly, this can still produce side effects, because lot of poorly tested semi-dead code of pdfimport may come to life.

Comment 23 Fahad Al-Saidi 2019-11-27 04:41:19 UTC Comment hidden (me-too)

This bug affects my workflow a lot. I hope someone can fix it. Thanks in advance.

Comment 24 Xisco Faulí 2019-11-29 12:50:48 UTC

Changing priority to 'high' since the number of duplicates is 5 or higher

Comment 25 Xisco Faulí 2019-11-29 12:51:31 UTC

*** Bug 84797 has been marked as a duplicate of this bug. ***

Comment 26 Xisco Faulí 2019-11-29 12:51:37 UTC

*** Bug 97131 has been marked as a duplicate of this bug. ***

Comment 27 Xisco Faulí 2019-11-29 12:51:41 UTC

*** Bug 116318 has been marked as a duplicate of this bug. ***

Comment 28 Xisco Faulí 2019-11-29 12:51:46 UTC

*** Bug 125951 has been marked as a duplicate of this bug. ***

Comment 29 Eyal Rozenberg 2019-11-29 18:10:32 UTC

> Changing priority to 'high' since the number of duplicates is 5 or higher

Thank you; this is very significant to us.

Quoting myself from one of the dupes:
https://bugs.documentfoundation.org/show_bug.cgi?id=89471#c12

As a member of the LibreOffice RTL QA "team", I would like to point out this is currently one of the most significant RTL bugs in LibreOffice, as we see it (there was an actual discussion rating prominence of RTL bugs on our Telegram group). This bug essentially means LibreOffice cannot be used to work with PDFs with RTL content.

Comment 30 Jack 2019-11-30 10:16:48 UTC

Five years, seven bug reports, three major version number increments, and many replies later, finally this is being taken seriously.

Comment 31 Xisco Faulí 2019-12-02 13:40:57 UTC

Changing priority to 'highest' since this a regression and the number of duplicates is higher than 5 or the number of people in CC higher than 20

Comment 32 V Stuart Foote 2020-03-08 15:19:23 UTC

*** Bug 131222 has been marked as a duplicate of this bug. ***

Comment 33 Alex Cohn 2020-03-18 14:53:57 UTC Comment hidden (obsolete)

For all interested, I have a partial work-around: I saved the PDF and opened it again. Now the Hebrew text is flipped twice. For many (but not all) fields this works.

Comment 34 Alex Cohn 2020-03-18 14:54:10 UTC Comment hidden (obsolete)

For all interested, I have a partial work-around: I saved the PDF and opened it again. Now the Hebrew text is flipped twice. For many (but not all) fields this works.

Comment 35 V Stuart Foote 2020-06-24 12:28:18 UTC

*** Bug 134180 has been marked as a duplicate of this bug. ***

Comment 36 V Stuart Foote 2020-06-24 14:05:04 UTC

*** Bug 131222 has been marked as a duplicate of this bug. ***

Comment 37 V Stuart Foote 2020-10-04 15:11:54 UTC

@Thorsten, *

Artem & Justin had zeroed in on the mishandling--testing single chars rather than text run strings--and now rarely triggering the mirrorstring helper.

As an alternative method for establishing RTL logical order from the visual order of the processed PDF stream, Khaled had suggested in bug 89471 the ICU BiDi libs (ubidi)provides an 'inverse' option that could convert the RTL text runs back to correct logical order [1][2] for better fidelity to the PDF streams.

=-ref-=
described...
[1] https://unicode-org.github.io/icu/userguide/transforms/bidi.html

API...
https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ubidi_8h.html

Comment 38 Sergey 2020-11-26 07:06:31 UTC

Dear developers team,
Hope this is the right place for my massage.
I have a problem opening pdf file in Hebrew, the text is flipped. 
LibreOffice really good tool, I like it very much !!!
Is it possible to fix the issue with opening Hebrew text? It will be very helpful!!! Pleeeeeeease ...... :-)

Comment 39 Buovjaga 2020-11-26 08:18:25 UTC

(In reply to Sergey from comment #38)
> Dear developers team,
> Hope this is the right place for my massage.
> I have a problem opening pdf file in Hebrew, the text is flipped. 
> LibreOffice really good tool, I like it very much !!!
> Is it possible to fix the issue with opening Hebrew text? It will be very
> helpful!!! Pleeeeeeease ...... :-)

No, not the right place. Please open a new report for your issue.

Comment 40 Justin L 2020-11-26 08:32:28 UTC

(In reply to Buovjaga from comment #39)
> No, not the right place. Please open a new report for your issue.

Actually, it sounds like the identical issue to me. So a new report would just be flagged as a duplicate.

This bug report has been highlighted for over a year in the ESC minutes as one of the most important bugs. Removing Assigned status since there appears to have been no activity. Perhaps Thorsten is the "right person" to fix it, but obviously he doesn't have enough personal time to allocate to this.

This will either take a motivated RTL community developer, or more likely a corporate sponsor to get fixed.

Comment 41 Alex Cohn 2020-11-26 12:26:40 UTC

@Justin is there an established way to look for such sponsor?

Comment 42 Buovjaga 2020-11-26 13:32:54 UTC

(In reply to Alex Cohn from comment #41)
> @Justin is there an established way to look for such sponsor?

There is no established way. Ideally the sponsors would come looking instead of the other way around.

Comment 43 shmuel.globus@gmail.com 2021-02-21 19:46:19 UTC

I am just waiting for this issue to be resolved. It's top priority for many users. It really handicaps the use of LibreOffice in Israel and the whole Middle East.

Comment 44 V Stuart Foote 2021-07-15 02:22:56 UTC

@Kevin, given your interest and recent work on the pdfio filter the munged RTL text runs are a real annoyance when opening a PDF into Draw. Seems like an ICU libs bidirectional sense for RTL scripts would be a viable approach. Perhaps have a look... Stuart

Comment 45 Kevin Suo 2021-07-15 04:58:23 UTC

(In reply to V Stuart Foote from comment #44)

Well, below is my observation these days. I may be wrong, but I think these are helpful to those who are interested:

1. The "mirrorString" function is never hit because isRTL is always false.

2. The "isRTL" is always false because the following code never returned true:
'''
    if( xCC.is() )
    {
        for(int i=1; i< elem.Text.getLength(); i++)
        {
            css::i18n::DirectionProperty nType = static_cast<css::i18n::DirectionProperty>(xCC->getCharacterDirection( str, i ));
            if ( nType == css::i18n::DirectionProperty_RIGHT_TO_LEFT           ||
                 nType == css::i18n::DirectionProperty_RIGHT_TO_LEFT_ARABIC    ||
                 nType == css::i18n::DirectionProperty_RIGHT_TO_LEFT_EMBEDDING ||
                 nType == css::i18n::DirectionProperty_RIGHT_TO_LEFT_OVERRIDE
                )
                isRTL = true;
        }
    }
'''
https://opengrok.libreoffice.org/xref/core/sdext/source/pdfimport/tree/drawtreevisiting.cxx?#110

3. This "if" block never returned true because "getCharacterDirection" in XCharacterClassification requires an OUString with a length > 1, whereas only one Arabic character is passed to it. If the length is 1 which is the same as the nPos, then this function returns 0 directly without doing any detection of the direction (it is impossible to detect the direction if only one Arabic character is provided).
See:
https://opengrok.libreoffice.org/xref/core/i18npool/source/characterclassification/cclass_unicode.cxx?#128

4. The reason why only one Arabic character is passed to getCharacterDirection may be due to the sdext.pdfimport code failed to combine those single characters (as produced by xpdfimport process) into a string.

Below is the sample ODF XML stream generated by the sdext pdfimport code (using a pdf file with only one line content "لوحة المفاتيح العربية"):

'''
<draw:text-box>
    <text:p text:style-name="paragraph11">
        <text:span text:style-name="text13"> ة </text:span>
        <text:span text:style-name="text13"> ي </text:span>
        <text:span text:style-name="text13"> ب </text:span>
        <text:span text:style-name="text13"> ر </text:span>
        <text:span text:style-name="text13"> ع </text:span>
        <text:span text:style-name="text13"> ل </text:span>
        <text:span text:style-name="text13"> ا </text:span>
        <text:span text:style-name="text13">
            <text:s text:c="1" text:style-name="text13"> </text:s>
        </text:span>
        <text:span text:style-name="text13"> ح </text:span>
        <text:span text:style-name="text13"> ي </text:span>
        <text:span text:style-name="text13"> ت </text:span>
        <text:span text:style-name="text13"> ا </text:span>
        <text:span text:style-name="text13"> ف </text:span>
        <text:span text:style-name="text13"> م </text:span>
        <text:span text:style-name="text13"> ل </text:span>
        <text:span text:style-name="text13"> ا </text:span>
        <text:span text:style-name="text13">
            <text:s text:c="1" text:style-name="text13"> </text:s>
        </text:span>
        <text:span text:style-name="text13"> ة </text:span>
        <text:span text:style-name="text13"> ح </text:span>
        <text:span text:style-name="text13"> و </text:span>
        <text:span text:style-name="text13"> ل </text:span>
    </text:p>
</draw:text-box>
'''

As we can see above, this produced a lot of text.span with the same style name. This makes the ODF XML stream huge before it is imported into Draw (this may be the reason why pdfimport is very slow and memory/CPU consuming for large PDFs). 

The pdfimport code was intended to combine these characters in to a single string, see
https://opengrok.libreoffice.org/xref/core/sdext/source/pdfimport/tree/drawtreevisiting.cxx?#698
where it says it will "concatenate consecutive text elements unless there is a font or text color or matrix change, leave a new span in that case". 
However, it must have been failed to do so. It failed because, based on my observation, in the following code block:
'''
                if( (pCur->FontId == pNext->FontId || isSpaces(pNext)) &&
                    rCurGC.FillColor.Red == rNextGC.FillColor.Red &&
                    rCurGC.FillColor.Green == rNextGC.FillColor.Green &&
                    rCurGC.FillColor.Blue == rNextGC.FillColor.Blue &&
                    rCurGC.FillColor.Alpha == rNextGC.FillColor.Alpha &&
                    (rCurGC.Transformation == rNextGC.Transformation || notTransformed(rNextGC))
                    )
'''
all the other conditions are true, except the "rCurGC.Transformation == rNextGC.Transformation".

Until now I am still not sure why rCurGC.Transformation does not equal to rNextGC.Transformation while they should be the same.

In sum, the mirrorString code would be reached if the single characters are successfully combined into a string.

Comment 46 Kevin Suo 2021-07-15 10:54:14 UTC Comment hidden (obsolete)

Further info:

If xpdf generated the following output:
drawChar 462.400000 770.989000 466.900000 770.989000 1.000000 0.000000 0.000000 1.000000 12.000000 ة

then sdext pdfimport will produce the following Transformation values, in the order in the metrics: (assume this is the rCurGC):

            (0,0)   (0,1)   (0,2)       (1,0)   (1,1)       (1,2)
---------------------------------------------------------------------
rCurGC:     1200    0       46240       0       1200        5674.08

If xpdf generated the following: 
drawChar 466.900000 770.989000 469.828000 770.989000 1.000000 0.000000 0.000000 1.000000 12.000000 ي

then in sdext pdfimport the Transformation values are: (assume this is the rNextGC):
            (0,0)   (0,1)   (0,2)       (1,0)   (1,1)       (1,2)
---------------------------------------------------------------------
rNextGC:    1200    0       46690       0       1200        5674.08

Apparently rCurGC.Transformation != rNextGC.Transformation. The different is in the (0,2): one is 46240, the other one is 46690. What are these two values? the position of the characters on the page??

Below is the full output of rCurGC.Transformation and rNextGC.Transformation:
(this is generated by adding a SAR_WARN below line
aGlyph.getGC().Transformation = totalTextMatrix1;
in file pdfiprocessor.cxx:
    SAL_WARN("sdext.pdfimport", "drawGlyphs: " 
        << aGlyph.getGC().Transformation.get(0,0) << ", "
        << aGlyph.getGC().Transformation.get(0,1) << ", "
        << aGlyph.getGC().Transformation.get(0,2) << ", "
        << aGlyph.getGC().Transformation.get(1,0) << ", "
        << aGlyph.getGC().Transformation.get(1,1) << ", "
        << aGlyph.getGC().Transformation.get(1,2) << ", "
    );

rCurGC: 1200 0 46240 0 1200 5674.08
rCurGC: 1200 0 46690 0 1200 5674.08
rCurGC: 1200 0 46980.4 0 1200 5674.08
rCurGC: 1200 0 47070.4 0 1200 5674.08
rCurGC: 1200 0 47659.6 0 1200 5674.08
rCurGC: 1200 0 48130 0 1200 5674.08
rCurGC: 1200 0 48370 0 1200 5674.08
rCurGC: 1200 0 48618.4 0 1200 5674.08
rCurGC: 1200 0 49007.2 0 1200 5674.08
rCurGC: 1200 0 49637.2 0 1200 5674.08
rCurGC: 1200 0 49927.6 0 1200 5674.08
rCurGC: 1200 0 50218 0 1200 5674.08
rCurGC: 1200 0 50496.4 0 1200 5674.08
rCurGC: 1200 0 50806 0 1200 5674.08
rCurGC: 1200 0 51276.4 0 1200 5674.08
rCurGC: 1200 0 51524.8 0 1200 5674.08
rCurGC: 1200 0 51773.2 0 1200 5674.08
rCurGC: 1200 0 52153.6 0 1200 5674.08
rCurGC: 1200 0 52603.6 0 1200 5674.08
rCurGC: 1200 0 53093.2 0 1200 5674.08

As you can see, all other values are the same, but the value in position (0,1) of the metrics is increasing one by one. I think this value should not be used to determine whether these characters should be combined into a string.

This is beyond my knowledge as it involves the basegfx::B2DHomMatrix staff which I know nothing, so it need an expert to investigate.

Comment 47 Kevin Suo 2021-07-15 10:58:15 UTC

Further info:

If xpdf generated the following output:
drawChar 462.400000 770.989000 466.900000 770.989000 1.000000 0.000000 0.000000 1.000000 12.000000 ة

then sdext pdfimport will produce the following Transformation values, in the order in the metrics: (assume this is the rCurGC):

            (0,0)   (0,1)   (0,2)       (1,0)   (1,1)       (1,2)
---------------------------------------------------------------------
rCurGC:     1200    0       46240       0       1200        5674.08

If xpdf generated the following: 
drawChar 466.900000 770.989000 469.828000 770.989000 1.000000 0.000000 0.000000 1.000000 12.000000 ي

then in sdext pdfimport the Transformation values are: (assume this is the rNextGC):
            (0,0)   (0,1)   (0,2)       (1,0)   (1,1)       (1,2)
---------------------------------------------------------------------
rNextGC:    1200    0       46690       0       1200        5674.08

Apparently rCurGC.Transformation != rNextGC.Transformation. The different is in the (0,2): one is 46240, the other one is 46690. What are these two values? the position of the characters on the page??

Below is the full output of rCurGC.Transformation and rNextGC.Transformation:
(this is generated by adding a SAR_WARN above the if block in DrawXmlOptimizer::optimizeTextElements
in file drawtreevisiting.cxx:
                std::cout << "rCurGC: " << rCurGC.Transformation.get(0,0) << " " << rCurGC.Transformation.get(0,1) << " " << rCurGC.Transformation.get(0,2) << " ";
                std::cout << rCurGC.Transformation.get(1,0) << " " << rCurGC.Transformation.get(1,1) << " " << rCurGC.Transformation.get(1,2) << std::endl;

rCurGC: 1200 0 46240 0 1200 5674.08
rCurGC: 1200 0 46690 0 1200 5674.08
rCurGC: 1200 0 46980.4 0 1200 5674.08
rCurGC: 1200 0 47070.4 0 1200 5674.08
rCurGC: 1200 0 47659.6 0 1200 5674.08
rCurGC: 1200 0 48130 0 1200 5674.08
rCurGC: 1200 0 48370 0 1200 5674.08
rCurGC: 1200 0 48618.4 0 1200 5674.08
rCurGC: 1200 0 49007.2 0 1200 5674.08
rCurGC: 1200 0 49637.2 0 1200 5674.08
rCurGC: 1200 0 49927.6 0 1200 5674.08
rCurGC: 1200 0 50218 0 1200 5674.08
rCurGC: 1200 0 50496.4 0 1200 5674.08
rCurGC: 1200 0 50806 0 1200 5674.08
rCurGC: 1200 0 51276.4 0 1200 5674.08
rCurGC: 1200 0 51524.8 0 1200 5674.08
rCurGC: 1200 0 51773.2 0 1200 5674.08
rCurGC: 1200 0 52153.6 0 1200 5674.08
rCurGC: 1200 0 52603.6 0 1200 5674.08
rCurGC: 1200 0 53093.2 0 1200 5674.08

As you can see, all other values are the same, but the value in position (0,1) of the metrics is increasing one by one. I think this value should not be used to determine whether these characters should be combined into a string.

This is beyond my knowledge as it involves the basegfx::B2DHomMatrix staff which I know nothing, so it need an expert to investigate.

Comment 48 Thorsten Behrens (allotropia) 2021-07-15 11:07:03 UTC

(In reply to Kevin Suo from comment #47)
> Apparently rCurGC.Transformation != rNextGC.Transformation. The different is
> in the (0,2): one is 46240, the other one is 46690. What are these two
> values? the position of the characters on the page??
> 

The last column (0,2 and 1,2) in that matrix roughly corresponds to x,y translations (not precisely, since the full formula for it is e.g for the first coordinate component x'=(0,0)x + (0,1)y + (0,2))

Comment 49 Armin Le Grand (collabora) 2021-07-15 15:59:12 UTC

It's a 3x3 homogen transformation matrix as known from linear algebra and used in CG systems. Usually only 3x2 is used since the last line may host perspective change stuff, but usually not used in 2D CG.
You may look in any book or google for it. It is a combination of Scale, Shear, Rotate and Translate. It can be decomposed to these (use decompose() at the matrix) to get the single parts.

To make it short,
    rCurGC: 1200 0 46240 0 1200 5674.08
means: scale(1200, 1200), then translate(46240, 5674.08), no rotation/shear used here (that would be multipied into the values, so the two zeros indicate this). This describes what to do when you imagine the character being at (0,0) and of size(1,1) (unit coordinates) to get it to where it is need to be shown. In case of no rotate/shear, it's like a rectangle (0,0,46240, 5674.08).

That these rectangles increase in X just describes that the next defined character is right of the one before - you may substract the X-Values to get the distances. So - that values are correct from the definition(s).

Objects are without shear/rotate, so the note line
    // line and space optimization; works only in strictly horizontal mode
is okay - that may be adapted to also be able to concatenate with back-transformations, just a hint, that limitation is theoretically not needed, but would make things more complicated...

When you substract the X-Values to get the character widths you will note these are not 1200 -> due to scale for fonts has special meaning. Y-Scale indeed is the char height, but X-Scale is defined 'relative' to Y-Scale, so X==Y here means no unregular font scaling (some systems also use a zero for X-Scale in that case).

Thus indeed the comparison
    rCurGC.Transformation == rNextGC.Transformation
is nonsense - to be true, the current character would have to be at the same position and of the same size (and rotate/shear if used) as the next one which would mean they are defined to cover each other.

What would need to be compared is

    X(curr) + currCharWitdh <-similar-> to X(next)

with the problem where to get currCharWitdh from (and to add some 'unsharpness', see same file line 586, the multiply with 0.75).

This can be done when the Font is already available, by setting that font at an temp OutDev, setting the FontWidth(1200) and getting the TextLength of a String with that single character.

Maybe I overlook something (nobody is perfect), but I would have to debug & see if that works with latin text - and why. From looking at the current code I am not sure this ever did something...?

Comment 50 Alex Cohn 2021-07-15 19:38:02 UTC

> X(curr) + currCharWitdh <-similar-> to X(next)

only note that for an RTL sequence, this will probably be 

  X(curr) - currCharWitdh <-similar-> to X(next)

Comment 51 V Stuart Foote 2021-07-15 20:43:11 UTC

(In reply to Alex Cohn from comment #50)
> > X(curr) + currCharWitdh <-similar-> to X(next)
> 
> only note that for an RTL sequence, this will probably be 
> 
>   X(curr) - currCharWitdh <-similar-> to X(next)

Unfortunately that would not work, for RTL to render correctly they have to be laid down in the XML in the reverse order from the lexical text run. 

That is the glitch, as the code that detected the RTL for each text run-- bounded by ICU iterators--is broken, and the XML is fed the glyphs in the wrong reversed order. Text runs get encoded into the Draw canvas in the reversed sequence. And, there seems to be issues with the correct ICU word and sentence iterators.

Comment 52 Kevin Suo 2021-07-16 00:50:05 UTC

(In reply to V Stuart Foote from comment #51)
As I said, if all the following evaluate true:
'''
                if( (pCur->FontId == pNext->FontId || isSpaces(pNext)) &&
                    rCurGC.FillColor.Red == rNextGC.FillColor.Red &&
                    rCurGC.FillColor.Green == rNextGC.FillColor.Green &&
                    rCurGC.FillColor.Blue == rNextGC.FillColor.Blue &&
                    rCurGC.FillColor.Alpha == rNextGC.FillColor.Alpha &&
                    (rCurGC.Transformation == rNextGC.Transformation || notTransformed(rNextGC))
                    )

then the chharacters will be combined into strings and then the mirror code would be triggered which will reverse the reversed char sequence, thus produce correct ones in the XML. 

So the issue now is how to modify the last condition "(rCurGC.Transformation == rNextGC.Transformation || notTransformed(rNextGC))" to correctly detect the current char is of the same transformation thus they should be combined into string.

Comment 53 Kevin Suo 2021-07-16 04:59:07 UTC

Created attachment 173616 [details]
tdf104597-not-perfect-patch.diff

Good news is that we are almost there.

With the attached modification to the code, the RTL text are now shown in correct order.

However, the modification has deleted the "(rCurGC.Transformation == rNextGC.Transformation || notTransformed(rNextGC))" condition directly. This may be wrong. Need to find a special case when some text within a line has transformation even though the characters in this line all have the same font id and the same fill color. This may be very rare. I tested many PDFs with this patch applied on current master, all of them are displaying perfect.

As a result, the attached may be committed to master in my opinion, because this makes Draw to be usable for RTL pdf files. In case someone find a special case when the removable of transformation may have caused a regression, then we investigate that matter separately.

Comment 54 Kevin Suo 2021-07-16 05:00:31 UTC

Created attachment 173617 [details]
screenshot showing the PDF open in draw, with the "not perfect" patch applied

Comment 55 Kevin Suo 2021-07-16 05:01:50 UTC

Created attachment 173618 [details]
rtl.pdf (a simplified RTL file which can be used as a unittest case)

Comment 56 Armin Le Grand (collabora) 2021-07-16 08:08:19 UTC

To comment 53: I would not recommend to remove the transformation comparison. I have made bad experiences with guesses like 'will nearly never happen'. That often is true, but when you multiply that probablility with LO being used by hundreds of millions -> it WILL happen. Just my 2ct...

To the problem: This is an opportunity to do it correct and rotate/shear- independent. Use linear algebra as it is intended...

We need to compare if the start point of the next charbox (C) (lower left) is touching the vector starting from the start point of the current charbox (A) to the lower right of that box (B).
To get (A) we can multiply Point(0.0,0.0) with current matrix
To get (B) we can multiply Point(1.0,0.0) with current matrix
To get (C) we can multiply Point(0.0,0.0) with next matrix

Now we need to check if (C) touches Vector (A,B) and if it does e.g. inside the right third of it (need a hand-crafted unsharp-value here due to the mentioned special Scale-role of the FontScale given in the matrices, usually a decent 'guessed' CharWidth is 0.7 of CharHeight)

You can use basegfx::B2DVector/B2DPoint for all that. You may compare length of vector (A,B) with length of vector (A,C). For parallelism you may use dot product, or for using the tooling, e.g. use the method that calculates the angle between these two vectors.

That test should go to a subroutine, will work rotate/shear independent and even when no transformation is there (due to then having the unit matrix).

Comment 57 Armin Le Grand (collabora) 2021-07-16 14:15:04 UTC

Ah - sorry, one note to that: Of course since internally the Y-Axis goes down, to get the lower-left edge you would need to multiply Point(0,1) with the matrices - but in the given example working with the Vectors at the top of the text rectangles would work just the same.
And if that was not clear - this is the clue with matrices - you can transform ANY coordinate pair in unit coordinates [0.0 .. 1.0] by multiplication with the matrix to it's 'target' coordinates, this 'magically' applies scale,shear,rotate and translate. So - e.g. to get the center of that object, use Point(0.5, 0.5), multiply -> voila

Comment 58 Kevin Suo 2021-07-17 08:50:44 UTC

(In reply to Armin Le Grand from comment #57)
Could you prepare the revised patch?

Comment 59 Armin Le Grand (collabora) 2021-07-17 13:19:35 UTC

Created attachment 173646 [details]
Suggested solution

Okay, will do - checked for compile & run, also loaded the reduced example. Please check/experiment/adapt as needed, I can not say if the result is correct, sorry, my language knowledge is limited :-)

Comment 60 Kevin Suo 2021-07-18 02:35:28 UTC

(In reply to Armin Le Grand from comment #59)
Thank you for your suggested patch. I checked, but it does not work for some chars.

For instance:

On The current master there is a test file named "testFontFeatures.pdf" in sdext/source/pdfimport/test/testdocs. This file is used in the CppunitTest_sdext_pdfimport (test case file is in sdext/source/pdfimport/test/tests.cxx, the test name is testFontFeatures).

If you run 'make CppunitTest_sdext_pdfimport', you can see the output in workdir/CppunitTest/sdext_pdfimport.test.log.

From the log, you can see in line 384 that, for the line "Times New Roman Normal" as shown in the PDF, the chars "Tim" are concatenated, but "e" and "s" (as in "Times" are not; the chars "Ne" (as in "New) are concatenated, but "w" is not.
(please note that in the log file the concatenated "Tim" are shown in three different lines because there is another problem in the code which added new-line characters in the output, but this does not have impact on this RTL bug, thus can be addressed separately)

Also, from the log file you can also see from line 691 that all those Chinese characters are not concatenated.

Could you please take a look and revise your patch?

Comment 61 Kevin Suo 2021-07-18 04:48:52 UTC

The failing to concatenate "Tim" and "e" may due to the issue as discussed here:
https://gitlab.freedesktop.org/poppler/poppler/-/issues/418

Comment 62 Armin Le Grand (collabora) 2021-07-18 10:21:08 UTC

Thanks for the clear description, made it very effective to start debugging.
I guess I found it -> I implied that rCurGC.Transformation gets *adapted* when conactenation happens in the true case, but it gets not. It always seems to stay in it's initial form, describing just the 1st character.
If this adaption is/was not needed for further processing it's pure luck, but wrong. Thinking about how to effectively adapt rCurGC.Transformation in case of concatenation...

Comment 63 Armin Le Grand (collabora) 2021-07-18 11:12:53 UTC

Adapting rCurGC.Transformation as needed streches texts in result -> so not intended (?), width scale *has* to stay relative to fontScaling. So need to internally add a correction value used to adapt to already concatenated distances while concatenation continues. Trying that...

Comment 64 Armin Le Grand (collabora) 2021-07-18 11:59:27 UTC

Created attachment 173657 [details]
revvised solution with adapting width during concatenation

Have to hand in a value that describes the already used expansion of the current char snippet, as explained could not change/adapt the involved transformation. Also adapted the unsharp decision/compare value to 0.77 now - saw cases where 0.75 was at the edge. It may even be that this needs adaption e.g. vertical decisions for chinese...? I do not know. Also possible that spaces may need extra caveat due to having eventually different FontSize?

Comment 65 V Stuart Foote 2021-07-18 14:09:31 UTC

@Armin, Kevin, *

(In reply to Armin Le Grand from comment #64)
> ... It may even be that this needs
> adaption e.g. vertical decisions for chinese...? I do not know. Also
> possible that spaces may need extra caveat due to having eventually
> different FontSize?

This is great work!  But ultimately doesn't this require more finesse than simply testing presence of a "gap" to end concatenation and signify the next span to be passed out to mirror? And guess the off baseline positioning of comment 61 might need to be handled.

When concatenating with color, font or transformation check as now, what happens at pucntuations, or kashida internal to the imported text run?  Or brackets/parenthesis that should bound the span--does the opening and closing match the script?  

Restoring the transformation tests in drawtreevisiting.cxx will close this issue. But would think testing against full range of script appropriate ICU word break iterators could be the trigger to end of the text run concatenation being passed to the mirror string action. 

Also maybe test for sentence break iterators?  Beside acting as a word bound, are they kept with the span being mirrored? And do they end up placed appropriately for the RTL scripts?

Maybe including logic to test enclosing parenthesis or bracketing--to get the beginning and ending in the correct position.

The ICU BiDi libs were not very robust when the import filter was laid down in OOo era for i90800 (see also).

Also, don't similar things need to be done for RTL the Writer PDF import filter (writertreevisiting.cxx/.hxx)?

Comment 66 Kevin Suo 2021-07-18 15:01:59 UTC

I am testing locally.But V Stuart Foote could you upload some a simple doc covering most of the cases you have mentioned?

I have already tweaked to combine the space chars (which will result in the using of the same font as its previous non-space char), and I also managed to remove the line-break (\n) char. The concatenating of Chinese chars is still an issue, but I am a Chinese, so I am thinking about this but I need some time to understand the staff suggested by Armin Le Grand first.

Comment 67 Armin Le Grand (collabora) 2021-07-18 15:18:19 UTC

@Kevin, Stuart: It's not really a gap - a pic would say more than 1000 words, but...
- if next char starts at the same pos (test before) it covers current -> bad
- it's about 'guessing' the width of curr char
- if curr char is tall -> 20% start value may be good
- if curr is wide -> 80% may be good
- so, together, when between that, it may be a good guess
With my current approach we check for 23% to 100% and horizontally which should be good. Extending to check for less than 80% would be easy, too.

Much better but more complicated would be to use the width of the current char -> it's not in the matrix, so would need an OutDev, the font, setting it and asking for width of that char, then using that width to create better - but still unsharp - bounds to test against. Unsharp due to we do not know if e.g. using block text layout how big the gaps btw single chars which should be concatenated really are :-(

Comment 68 V Stuart Foote 2021-07-18 17:17:01 UTC

(In reply to Kevin Suo from comment #66)
> I am testing locally.But V Stuart Foote could you upload some a simple doc
> covering most of the cases you have mentioned?
> 

The 1945 UN Charter, Chapter IV, Article 13 provides a multi-lingual testing paragraph ususally with some bracketing.  Can find it in PDF form for many languages, or as web content that can be printed to PDF.

=-=-=

https://www.un.org/en/about-us/un-charter/full-text

https://unic.un.org/aroundworld/unics/en/whatWeDo/productsAndServices/publications/index.asp?callPage=products&category=1

Comment 69 Kevin Suo 2021-07-18 17:25:06 UTC

(In reply to Armin Le Grand from comment #67)

No, the concatenation will only be applied to TextElements, and only apply if the following conditions are met:

1. The current char and next char use the same font, or the next char is a space char. Same font means the same font family, *font size*, bold/italic/outline feature etc. AND
2. The current and next chars use the same font color. AND
3. The current char and next char has the same transformation, or the next char is a blank char.

See code below:

'''
    if( (isSpaces(pNext) || pCur->FontId == pNext->FontId) &&
        rCurGC.FillColor.Red == rNextGC.FillColor.Red &&
        rCurGC.FillColor.Green == rNextGC.FillColor.Green &&
        rCurGC.FillColor.Blue == rNextGC.FillColor.Blue &&
        rCurGC.FillColor.Alpha == rNextGC.FillColor.Alpha &&
        ( isSpaces(pNext) ||
          canConcatenate(rCurGC.Transformation, rNextGC.Transformation, fAlreadyExpandedDistance)
        )
'''

In this regard, there actually no need to detect if the height of the two chars are the same. There is also no need to detect if the next object is an image or something else. The only issue here is the transformation. The transformation of font in my opinion may be something like rotation.

Comment 70 V Stuart Foote 2021-10-22 13:15:55 UTC

*** Bug 145265 has been marked as a duplicate of this bug. ***

Comment 71 Kevin Suo 2021-10-27 06:00:01 UTC

*** Bug 145334 has been marked as a duplicate of this bug. ***

Comment 72 Eyal Rozenberg 2022-01-04 23:02:13 UTC

*** Bug 146524 has been marked as a duplicate of this bug. ***

Comment 73 Eyal Rozenberg 2022-02-18 15:35:04 UTC

So, any news about this? A discussion had started last year about a proposed patch, and has somehow died off.

Comment 74 V Stuart Foote 2022-06-07 02:00:51 UTC

*** Bug 149457 has been marked as a duplicate of this bug. ***

Comment 75 Eyal Rozenberg 2022-07-29 19:16:55 UTC

*** Bug 149516 has been marked as a duplicate of this bug. ***

Comment 76 Eyal Rozenberg 2022-07-29 19:17:21 UTC

(In reply to Eyal Rozenberg from comment #73)
> So, any news about this? A discussion had started last year about a proposed
> patch, and has somehow died off.

Ping.

Comment 77 V Stuart Foote 2022-07-29 21:13:53 UTC

(In reply to Eyal Rozenberg from comment #76)
> Ping.

.

@Thorsten, @Armin have you any cycles to revisit this. It is problematic that the PDF import filter mishandles the RTL text runs so badly. Haven't seen much of Kevin Suo of late hope he's OK.

Comment 78 Kevin Suo 2022-10-11 16:09:18 UTC

The proposed patch in:
https://gerrit.libreoffice.org/c/core/+/141231

should be a fix. Could someone review and test.

Comment 79 Kevin Suo 2022-10-12 03:09:51 UTC

Created attachment 182980 [details]
tdf104597_textrun.odt

I upload the odt file (used to generate the sdext/source/pdfimport/test/testdocs/tdf104597_textrun.pdf) here for the record, so that someone, when revising the unit test "testTdf104597_textrun", can regenerate the PDF file using this ODT.

Comment 80 Kevin Suo 2022-10-12 03:50:40 UTC

Created attachment 182981 [details]
output.xml

xml output of unit test testTdf104597_textrun.

Comment 81 Commit Notification 2022-10-13 19:38:50 UTC

Kevin Suo committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/69e9925ded584113e52f84ef0ed7c224079fa061

sdext.pdfimport: resolves tdf#104597: RTL script text runs are reversed

It will be available in 7.5.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.

Comment 82 Kevin Suo 2022-10-13 21:46:22 UTC

This is now fixed on master branch. There is another issue in bug 151473.

Comment 83 Eyal Rozenberg 2022-10-13 22:15:26 UTC

(In reply to Kevin Suo from comment #78)
> https://gerrit.libreoffice.org/c/core/+/141231
> 
> should be a fix. Could someone review and test.

I will try this out when it makes a nightly. However - could you explain why this was not considered to be a fix so far? Thorsten seemed to suggest that this was more complicated than one might think because of various implications I did not have the time to ask him about.

Comment 84 V Stuart Foote 2022-10-14 16:33:42 UTC

IMHO this is fixed now. The import to Draw of Persian and Arabic sample PDFs match pretty closely their layout and rendering in Adobe Acrobat, MS Edge, or FireFox

There are occasional font fallback issues, and a lot of spots where combining diacritics get separated from their root glyph garbling the text.

But much better than it was.

Unfortunately "Consolidate text" (bug 118370) of the runs will need some additional work to keep the runs RTL, and otherwise remains a challenge to select the draw text objects for the runs--but that is more bug 32249 related. 

The bug 151473 issues with bracketing punctuation is present, guess issues in the poppler libs could also be the reason for the orphaned combining diacritics.

Thanks Kevin!

=-testing-=
2022-10-14 nightly
Version: 7.5.0.0.alpha0+ (x64) / LibreOffice Community
Build ID: 8991cbb7986d3967bc6c3719d95254ff04428d1a
CPU threads: 8; OS: Windows 10.0 Build 19044; UI render: Skia/Vulkan; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: threaded

Comment 85 Kevin Suo 2022-10-14 16:59:29 UTC

(In reply to V Stuart Foote from comment #84)
OK, good to hear that. Maybe more people can test to see if there are regressions caused by this.

Comment 86 Eyal Rozenberg 2022-10-15 08:54:30 UTC

I'm sorry, but this is not fully fixed - at least with the latest nightly, from 2022-10-14 09:02.

While the attachment 129523 [details] (PDF file) does not exhibit reversed text runs, I get text runs reversed with a simple PDF file created in LO. Instructions:

1. Start Writer
2. Switch paragraph direction to RTL
3. Enter the phrase "שלום"
4. Export to PDF (using File | Export As | Export as PDF)
5. Open the exported PDF file using the PDF import filter targeting Writer

... and you get the reversed "םולש".

Note that if you open the file without specifying an import filter, it opens in LO Draw, and the text run is not reversed. So it may just be the case that the patch doesn't cover the relevant code paths.

Build ID:

Version: 7.5.0.0.alpha0+ / LibreOffice Community
Build ID: a09c5c69e3b5fbf448cae1d6c476f39067e40023
CPU threads: 4; OS: Linux 5.19; UI render: default; VCL: gtk3
Locale: en-IL (en_IL); UI: en-US

Comment 87 Kevin Suo 2022-10-15 10:21:43 UTC

Eyal Rozenberg, you are right. There are two pdf imprort filters in sdext, one is Draw, and another is Writer. My patch was targeted in Dwaw only, the fix was in code https://opengrok.libreoffice.org/xref/core/sdext/source/pdfimport/tree/drawtreevisiting.cxx?r=69e9925d. The Writer related code is in https://opengrok.libreoffice.org/xref/core/sdext/source/pdfimport/tree/writertreevisiting.cxx?r=3fe18ba1. While these two files are similar in many inspects, are are different somehow as it seems many changes for Draw side were not backported in the Writer side historically.

However I guess very fewer people use the Writer pdf import filter. Yes, this bug may stay open, and I may take a look of the Writer side, but I can not guarantee. Anyone else are welcome to do this, the codes share the similar logic on both sides.

Comment 88 Eyal Rozenberg 2022-10-15 11:15:57 UTC Comment hidden (off-topic)

(In reply to Kevin Suo from comment #87)
> However I guess very fewer people use the Writer pdf import filter.

That's probably because it's problematic and Writer's performance on the document it produces is kind of dismal. When I open the "PDF file" attachment in Writer - only 7 pages - it slows to a crawl with scroll delay times of ~ 10 seconds or higher. Is there a separate bug about that?


Also, the Writer PDF import filter is hidden away; and when you try to open a PDF from Writer it defaults to opening in Draw for some reason; I'm sure I opened a bug about this at some point.

If those two issues were resolved, I believe opening a PDF in Writer would become a common use case as well.

Regardless of all this - I'm not capable of helping with the code, but everyone is encouraged to go onto the LO RTL channel and ask for help with verification.

Comment 89 V Stuart Foote 2022-10-15 17:05:32 UTC Comment hidden (off-topic)

(In reply to Eyal Rozenberg from comment #88)
> 
> That's probably because it's problematic and Writer's performance on the
> document it produces is kind of dismal. When I open the "PDF file"
> attachment in Writer - only 7 pages - it slows to a crawl with scroll delay
> times of ~ 10 seconds or higher. Is there a separate bug about that?
> 
> 
> Also, the Writer PDF import filter is hidden away; and when you try to open
> a PDF from Writer it defaults to opening in Draw for some reason; I'm sure I
> opened a bug about this at some point.
> 

Wrong tool for the job!  The PDF Import filter was designed to extract all the presentation elements of a PDF as *Draw shape objects*--onto the Draw canvas. Text runs recorded into the PDF (as either char sequences from the PDFs ToUnicode chart, or as PDF /ActualText tags) are imported to draw textboxes. The reconstructed textboxes sometimes contain just one glyph with *NO* lexical context--just the sequence of characters as published into the PDF.

You should not expect directly "edit" the PDF text runs as laid down as draw Shape textboxes as if they were text strings. They are a facsimile of the original PDF publishing--not intended to be editable.

And of course using the alternate, and neglected, PDF Import to Writer filter and placing thousand of draw shapes onto a Writer document canvas, it is going to bog things down. Shapes including textboxes on Writer canvas are expensive and nonperformant.


> If those two issues were resolved, I believe opening a PDF in Writer would
> become a common use case as well

To what end? The result of PDF import is not editable text! Even in Draw where using the rudimentary "Consolidate text" utility (.uno:TextCombine, for bug 118370) takes the lexically broken text runs of multiple textboxes and merges them, they are still in a textbox--no paragraph controls, minimal formatting, etc.  To understand what you get to work with (Draw, Impress or Writer), open the SB Navigator deck. Every shape object would need to be 

As an alternative, perhaps explore the very functional PDF "Insert as image" filter. Completely different source based on pdfium rather than poppler but very high fidelity to the original PDF, with its own set of enhancements, e.g. bug 114234 bug 115811 and need to manually set the 'PDFIMPORT_RESOLUTION_DPI' for more than the 96dpi default. 

Going to close this Resolved Fixed again. PDF Filter import for Draw and Impress are addressed, opening a new issue for the refactoring of the Writer import filter 'PDF - Portable Document Format (Writer) (*.pdf)'

Comment 90 Kevin Suo 2022-10-15 17:19:54 UTC

> Open the exported PDF file using the PDF import filter targeting Writer
I submitted another patch related to the Writer part, please review and test:
https://gerrit.libreoffice.org/c/core/+/141420

Comment 91 V Stuart Foote 2022-10-15 17:25:49 UTC

Additional work needed for RTL scripts using the poppler based PDF import filter for Writer is open as bug 151546

Comment 92 Eyal Rozenberg 2022-10-15 18:37:08 UTC

(In reply to V Stuart Foote from comment #89)
> Wrong tool for the job!

Right too for the job.

>  The PDF Import filter was designed to extract all
> the presentation elements of a PDF as *Draw shape objects*--onto the Draw
> canvas. 

There are different import filters for Draw and for Writer. The project - correctly-  decided to support opening PDFs in Writer - which means editing the opened PDFs in Writer, as Writer documents.

If you believe that should be dropped, please open a separate bug and we can argue about it there.

> You should not expect directly "edit" the PDF text runs as laid down as draw
> Shape textboxes as if they were text strings. They are a facsimile of the
> original PDF publishing--not intended to be editable.

I should, and I do. But regardless - I expect the PDF import filters not to reverse text runs, which is what this bug is about.

> And of course using the ... neglected,
> PDF Import to Writer
> filter and placing thousand of draw shapes onto a Writer document canvas, it
> is going to bog things down.

The neglect is indeed a separate issue, as is the excessive use of draw shapes.

> To what end? The result of PDF import is not editable text! 

Completely disagree, but again - you're sneaking in an argument here about wanting to drop the Writer PDF import filter. That's inappropriate.


> As an alternative, perhaps explore the very functional PDF "Insert as image"
> filter.

Can we find that in the list of import filters right now?
 
> Going to close this Resolved Fixed again.

No, it's not fixed: RTL text runs are still reversed on PDF import. You can't reduce the scope of this bug retroactively because you believe one of the PDF import filters is not important.

Comment 93 V Stuart Foote 2022-10-15 19:07:35 UTC Comment hidden (off-topic)

(In reply to Eyal Rozenberg from comment #92)
> 
> No, it's not fixed: RTL text runs are still reversed on PDF import. You
> can't reduce the scope of this bug retroactively because you believe one of
> the PDF import filters is not important.

Don't believe I said that. And, as a routine QA matter we edit the BZ issues to keep them crisp.

This TLDR; bug is now 92 comments long and the original issue(s) (see dupes) with the DRAW import filter is fixed.

At comment 86 you highjacked the issue expanding scope to Writer. We knew the writer import filter needed the same rework, but this bug was already unwieldy and there will probably be substantive issues other than RTL handling (masking, font handling, transforms) for the Writer import filter.

Accordingly opened a new clean issue for import filter work on the residual WRITER issues, including RTL, and that needs a clean and concise BZ issue to track.

CLOSING THIS AGAIN  => FIXED

Please don't reopen.

@Xisco?

Comment 94 Eyal Rozenberg 2022-10-15 19:41:08 UTC

I see the dupes just fine. I will remind you that I'm one of the most active member of the RTL languages discussion channel over the past several years. And in this capacity I can tell you the reason Draw is mentioned in the dupes is not because people wanted to separate the bug for Draw and for Writer, but because of bug 141732, i.e. the fact that opening a PDF file from Writer (or from Impress for that matter) opens it in Draw.

Nobody ever suggested to separate these two bugs and make this one Draw-only. The title does not limit it to draw; and the opening comment doesn't limit it to Draw. It is only with the potential prize of closing this long-standing and widely-followed bug that this suggestion was first made.

I have no issue with tracking the Word part in a blocker bug, but RTL script text runs are still reversed on PDF import. When that stops happening, then the bug can be marked as fixed.

Comment 95 Commit Notification 2022-11-01 10:18:45 UTC

Kevin Suo committed a patch related to this issue.
It has been pushed to "libreoffice-7-4":

https://git.libreoffice.org/core/commit/bcb4fef4e321c6c0f1f9edd05b2f18ddb34792ed

sdext.pdfimport: resolves tdf#104597: RTL script text runs are reversed

It will be available in 7.4.3.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.

Comment 96 Commit Notification 2022-11-18 18:51:05 UTC

Stephan Bergmann committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/50d73574b6c3d71f9a539c895a15d6fcda22390b

Related tdf#104597, tdf#151546: Introduce comphelper::string::reverseCodePoints

It will be available in 7.5.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.

Comment 97 V Stuart Foote 2022-11-20 19:27:53 UTC

*** Bug 151950 has been marked as a duplicate of this bug. ***

Comment 98 V Stuart Foote 2022-11-20 19:32:34 UTC

*** Bug 149457 has been marked as a duplicate of this bug. ***

Comment 99 Commit Notification 2022-11-25 16:49:11 UTC

Stephan Bergmann committed a patch related to this issue.
It has been pushed to "libreoffice-7-4":

https://git.libreoffice.org/core/commit/f1db364f294d2d9a40d77004aeeb36729ae1c4ca

Related tdf#104597, tdf#151546: Introduce comphelper::string::reverseCodePoints

It will be available in 7.4.4.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.

Comment 100 V Stuart Foote 2022-11-28 04:11:52 UTC

*** Bug 152258 has been marked as a duplicate of this bug. ***

Comment 101 Hanan Sela 2022-11-28 07:09:54 UTC

Created attachment 183843 [details]
Hebrew PDF file

Comment 102 Hanan Sela 2022-11-28 07:11:04 UTC

Created attachment 183844 [details]
screen shot of Hebrew pdf file

Comment 103 Kevin Suo 2022-11-28 07:38:40 UTC

(In reply to Hanan Sela from comment #102)
Could you explain what is the problem in this Hebrew pdf file when open with Draw? I tested and most of the text are shown correctly with current master build as of Sat Nov 26 00:27:23 2022 +0300. Also, when testing please make sure you have the correct fonts installed on your linux system (e.g. this pdf file uses Arial font in the text مكتب on the header section so you should at least have Arial font instaled).

Comment 104 Hanan Sela 2022-11-29 11:49:44 UTC

Th(In reply to Kevin Suo from comment #103)
> (In reply to Hanan Sela from comment #102)
> Could you explain what is the problem in this Hebrew pdf file when open with
> Draw? I tested and most of the text are shown correctly with current master
> build as of Sat Nov 26 00:27:23 2022 +0300. Also, when testing please make
> sure you have the correct fonts installed on your linux system (e.g. this
> pdf file uses Arial font in the text مكتب on the header section so you
> should at least have Arial font instaled).

The are two problems: 1) In lines where there  is mixed text of Hebrew and  Arabic numbers, the numbers and the  Hebrew text overlap. In some cases white space are missing. It is more obvious in  the file I uploaded on 28/11/22. Please see comparison screenshot.

Comment 105 Hanan Sela 2022-11-29 11:54:19 UTC

Created attachment 183883 [details]
screen shot of Hebrew pdf file  opened with evince (left) and draw  (right; last build 28/11/22))

Screen shot of Hebrew pdf file  opened with evince (left) and draw  (right; last build 28/11/22))

Comment 106 Kevin Suo 2022-11-30 08:07:49 UTC

(In reply to Hanan Sela from comment #105)

As I requested earlier:
"Also, when testing please make sure you have the correct fonts installed on your linux system (e.g. this pdf file uses Arial font in the text مكتب on the header section so you should at least have Arial font instaled)."

The screenshot you have provided indicates that fallback font rather than Arial is used.

Comment 107 Hanan Sela 2022-11-30 10:30:38 UTC

(In reply to Kevin Suo from comment #106)
> (In reply to Hanan Sela from comment #105)
> 
> As I requested earlier:
> "Also, when testing please make sure you have the correct fonts installed on
> your linux system (e.g. this pdf file uses Arial font in the text مكتب on
> the header section so you should at least have Arial font instaled)."
> 
> The screenshot you have provided indicates that fallback font rather than
> Arial is used.

I installed all MS fonts. Ariel is Installed. However, The Hebrew fonts does not seem to be Ariel but David,  but Draw selects some thing similar to Ariel called Liberation Serif.

Comment 108 Hanan Sela 2022-11-30 10:37:27 UTC

(In reply to Hanan Sela from comment #107)
> (In reply to Kevin Suo from comment #106)
> > (In reply to Hanan Sela from comment #105)
> > 
> > As I requested earlier:
> > "Also, when testing please make sure you have the correct fonts installed on
> > your linux system (e.g. this pdf file uses Arial font in the text مكتب on
> > the header section so you should at least have Arial font instaled)."
> > 
> > The screenshot you have provided indicates that fallback font rather than
> > Arial is used.
> 
> I installed all MS fonts. Ariel is Installed. However, The Hebrew fonts does
> not seem to be Ariel but David,  but Draw selects some thing similar to
> Ariel called Liberation Serif.

The same document seems OK in Windows installation.

Comment 109 Kevin Suo 2022-11-30 12:12:36 UTC

(In reply to Hanan Sela from comment #108)
So I guess the doc will look ok when you have the David font installed on your linux system.

Comment 110 Commit Notification 2022-11-30 12:41:30 UTC

Kevin Suo committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/e788317bb58f4e4ad3bf57ee7b0e995cb4407ede

tdf#104597 related: restore the mirroring of Bidi_Mirrored characters

It will be available in 7.5.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.

Comment 111 Commit Notification 2022-11-30 21:53:36 UTC

Kevin Suo committed a patch related to this issue.
It has been pushed to "libreoffice-7-4":

https://git.libreoffice.org/core/commit/0a19375b73b12885f9022d82cb51e9c268cc0d6a

tdf#104597 related: restore the mirroring of Bidi_Mirrored characters

It will be available in 7.4.4.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.

Comment 112 Khaled Hosny 2022-12-17 16:14:49 UTC

*** Bug 131022 has been marked as a duplicate of this bug. ***

Comment 113 Khaled Hosny 2022-12-25 19:39:19 UTC

*** Bug 152670 has been marked as a duplicate of this bug. ***

Armin.Le.Grand
aron.budea
ATIV-Q
bar.hofesh
Baruch.Bernard
eyalroz1
freelikeabird2510
hans
helpmepro1
hossein
ilan.tal
ilmari.lauhakangas
jluth
kelemeng
knimer
marklh9
nasrinkhaksar3
omarbahloul2005
quikee
raanraan
rainplugin
saad.shamsaee
sasha.cohn
sergey737
sergstesh
shivisi
suokunlong
techyx1210
thb
vgtvgy1
vmiklos
vsfoote
vvort
xiscofauli
yousifjkadom