Bug 151473 - Draw PDF import: ASCII brackets (e.g. (), [] etc) are reversed when the paragraph is RTL
Summary: Draw PDF import: ASCII brackets (e.g. (), [] etc) are reversed when the parag...
Status: RESOLVED NOTABUG
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Draw (show other bugs)
Version:
(earliest affected)
7.3.6.2 release
Hardware: All All
: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: RTL-CTL PDF-Import-Draw
  Show dependency treegraph
 
Reported: 2022-10-12 03:21 UTC by Kevin Suo
Modified: 2022-12-25 00:37 UTC (History)
5 users (show)

See Also:
Crash report or crash signature:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Kevin Suo 2022-10-12 03:21:41 UTC
Steps to Reproduce:

1. Export attachment 182980 [details] to PDF
(which is identical to the unit test file in sdext/source/pdfimport/test/testdocs/tdf104597_textrun.pdf as added in https://gerrit.libreoffice.org/c/core/+/141231)

2. Open the PDF in Draw.

Current Result:
The 2nd paragraph shows as:
")ABC) DEF ]..."

Expected Result:
"(ABC) DEF [..."

Further information:
When open the pdf with pdf reader, if you copy paste the text the brackets would also be reversed. But anyway it shows fine on the UI of the PDF reader.

Reproduced in 7.6.2, also in 7.4.1.2, also on master with the change in https://gerrit.libreoffice.org/c/core/+/141231 (which is for the fix of bug 104597). As a result, this is not a regression caused by that change.
Comment 1 V Stuart Foote 2022-10-14 01:05:44 UTC
But isn't this to be expected. We're using ICU Libs BiDi for detection of RTL to trigger collection of the run.  [1]

But then use no other logic to parse the weakly typed Open punctuation and Close punctuation for the rest of the run to keep it ordered when other scripts are encountered.

In the example looks like we're finishing a run of RTL with closing bracket and the entering the western text, and finally entering a closing parenthesis when passing back into RTL.

IIRC similar issues Open & Close punctuation in CJK runs passing between runs of Chinese and parenthetically embedded English. 

=-ref-=
[1] https://opengrok.libreoffice.org/xref/core/sdext/source/pdfimport/tree/drawtreevisiting.cxx?r=69e9925d#108
Comment 2 Kevin Suo 2022-10-14 02:20:12 UTC
Copy pasting my comment in https://gerrit.libreoffice.org/c/core/+/141231:

'''
The issue in tdf151473 is another issue, that was in the xpdfimport part (i.e. the GPL code in sdext/source/pdfimport/xpdfwrapper, which uses the third-party library "poppler" to process the pdf and yield xpdf tokens, then the tokens are passed to the sdext/source/pdfimport/wrapper and then further to sdext/source/pdfimport/tree to generate an ODF XML tree before it appears in Draw). That bug was not caused by the reverseString code in drawtreevisiting.cxx. This can be observed when you run:

$ ./instdir/program/xpdfimport sdext/source/pdfimport/test/testdocs/tdf104597_textrun.pdf -f /dev/null

and you get:

drawChar 145.050000 709.189000 163.822000 709.189000 1.000000 0.000000 0.000000 1.000000 26.000000 A
drawChar 163.822000 709.189000 181.138000 709.189000 1.000000 0.000000 0.000000 1.000000 26.000000 B
drawChar 181.112000 709.189000 198.428000 709.189000 1.000000 0.000000 0.000000 1.000000 26.000000 C
drawChar 198.428000 709.189000 207.086000 709.189000 1.000000 0.000000 0.000000 1.000000 26.000000 )
drawChar 207.086000 709.189000 213.586000 709.189000 1.000000 0.000000 0.000000 1.000000 26.000000  
drawChar 213.586000 709.189000 232.358000 709.189000 1.000000 0.000000 0.000000 1.000000 26.000000 D
drawChar 232.358000 709.189000 248.218000 709.189000 1.000000 0.000000 0.000000 1.000000 26.000000 E
drawChar 248.244000 709.189000 262.700000 709.189000 1.000000 0.000000 0.000000 1.000000 26.000000 F
drawChar 136.100000 709.189000 145.044000 709.189000 1.000000 0.000000 0.000000 1.000000 26.000000 )

when sorted by the 2nd column (which is the horizontal position of the character), we get:
drawChar 136.100000 709.189000 145.044000 709.189000 1.000000 0.000000 0.000000 1.000000 26.000000 )
drawChar 145.050000 709.189000 163.822000 709.189000 1.000000 0.000000 0.000000 1.000000 26.000000 A
drawChar 163.822000 709.189000 181.138000 709.189000 1.000000 0.000000 0.000000 1.000000 26.000000 B
drawChar 181.112000 709.189000 198.428000 709.189000 1.000000 0.000000 0.000000 1.000000 26.000000 C
drawChar 198.428000 709.189000 207.086000 709.189000 1.000000 0.000000 0.000000 1.000000 26.000000 )
drawChar 207.086000 709.189000 213.586000 709.189000 1.000000 0.000000 0.000000 1.000000 26.000000  
drawChar 213.586000 709.189000 232.358000 709.189000 1.000000 0.000000 0.000000 1.000000 26.000000 D
drawChar 232.358000 709.189000 248.218000 709.189000 1.000000 0.000000 0.000000 1.000000 26.000000 E
drawChar 248.244000 709.189000 262.700000 709.189000 1.000000 0.000000 0.000000 1.000000 26.000000 F

which when combined, is: ")ABC) DEF", while we expect it to be "(ABC) DEF". In other words, the tokens returned by xpdfimport is already wrong. It is also wrong even if a paragraph is set to RTL direction with pure ASCII characters with brackets. This may due to a bug in poppler, or be a bug in the sdext/source/pdfimport/xpdfwrapper wrapper codes. Thus this should be fixed separately.
'''

> We're using ICU Libs BiDi for detection of RTL to trigger collection of the run.
As far as I know, we are not using ICU's BiDi in sdext/pdfimport.

> But isn't this to be expected.
If we have "(ABC) DEF" shown in the PDF, then we would expect "(ABC) DEF" if open it in Draw.
Comment 3 Kevin Suo 2022-10-15 12:39:45 UTC
I confirm that this is an upstream bug in the poppler code.

Simple way to verify this is, on linux:
$ pdftotext sdext/source/pdfimport/test/testdocs/tdf104597_textrun.pdf /tmp/output.txt
$ cat /tmp/output.txt
ٱلَّس َل اُم َع َلْي َك
)ABC) DEF ]ل
ل وسه ا
ٱلَّس َل اُم َع َلْي َك [أه ا
LibreOffice RTL
LibreOffice LTR (test)
中文测试,中文

pdftotext is a utility provided by poppler and is shipped by most linux distributions (within the poppler-utils package).
Comment 4 Kevin Suo 2022-10-15 13:27:15 UTC
Upstream bug reported at:
https://gitlab.freedesktop.org/poppler/poppler/-/issues/1301
Comment 5 Kevin Suo 2022-11-18 04:04:43 UTC
There seems to already be a fix on the Apache OpenOffice, see the m_aMirrorMap in PDFIProcessor::prepareMirrorMap():
http://opengrok.openoffice.org/xref/trunk/main/sdext/source/pdfimport/tree/pdfiprocessor.cxx?r=c142477c

Not sure whether it is OK to "borrow" some of its code. Is there any license issue? At what extent shall we use it?
Comment 6 ⁨خالد حسني⁩ 2022-12-25 00:30:43 UTC
I don’t think there is a bug here. The Latin-script text in the ODT is:

ABC) DEF)

It looks differently because it is part of a RTL paragraph. When imported, parts of the text ends in different text boxes because they use different fonts, but if you select all the boxes in the line and right click -> Consolidate Text, you get the same rendering as in Writer since they are now in one text box.
Comment 7 ⁨خالد حسني⁩ 2022-12-25 00:34:49 UTC
(In reply to Kevin Suo from comment #5)
> There seems to already be a fix on the Apache OpenOffice, see the
> m_aMirrorMap in PDFIProcessor::prepareMirrorMap():
> http://opengrok.openoffice.org/xref/trunk/main/sdext/source/pdfimport/tree/
> pdfiprocessor.cxx?r=c142477c
> 
> Not sure whether it is OK to "borrow" some of its code. Is there any license
> issue? At what extent shall we use it?

That is already what u_charMirror() does.