163082 – Difference vis-a-vis MSO in setting direction of neutrals following strong RTL char in LTR paragraph

Status:	NEW

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	LibreOffice
Version: (earliest affected)	unspecified
Hardware:	All All

Importance:	medium normal
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:	text:rtl

Depends on:
Blocks:	DOCX-RTL RTL Script-Assignment

Reported:	2024-09-21 14:35 UTC by Mike Kaganski
Modified:	2025-04-02 20:09 UTC (History)
CC List:	4 users (show)

See Also:	148257 119143
Crash report or crash signature:

Attachments
A DOCX with "e-=5" text (12.17 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document) 2024-09-21 14:35 UTC, Mike Kaganski	Details
Description Mike Kaganski 2024-09-21 14:35:47 UTC

Created attachment 196590 [details]
A DOCX with "e-=5" text

The attachment has a single paragraph with the text "e- = 5". At least that's what Word, which created the file, shows. In Writer, this shows as "e5 = -", because the "-" is U+05BE "HEBREW PUNCTUATION MAQAF", and Writer treats all "neutral" characters after the RTL character as RTL.

In the document markup, it (alone) is marked as Hebrew language, while the two other runs ("e" and " = 5") have the default document language (here: en-US, i.e., LTR). Writer ignores that.

The same problem exists in other modules, e.g. Impress. But let's mark this as "Writer", and potentially handle similar problems in other modules separately.
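
For reference, the bidi classes of the characters involved can be listed with Python's standard unicodedata module. This is only an illustrative sketch and has nothing to do with Writer's actual code; the helper name bidi_classes is made up for the example.

```python
import unicodedata

# The paragraph text from the attachment: "e", MAQAF, " = 5"
text = "e\u05be = 5"

def bidi_classes(s):
    """Return (code point, bidi class, Unicode name) for each character."""
    return [
        (f"U+{ord(c):04X}", unicodedata.bidirectional(c), unicodedata.name(c))
        for c in s
    ]

for cp, cls, name in bidi_classes(text):
    print(cp, cls, name)
```

The MAQAF is the only strong RTL character (class R); the spaces are WS, the equals sign is ON, and the digit is EN, so everything after the MAQAF is either neutral or weak.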

Comment 1 Eyal Rozenberg 2024-09-21 15:05:33 UTC Comment hidden (noise)

How

Comment 2 Eyal Rozenberg 2024-09-21 15:15:06 UTC

Here's what the paragraph looks like in word/document.xml:

		
<w:p w14:paraId="11A29BB0" w14:textId="1FC7FE17" w:rsidR="00BA5193" w:rsidRDefault="00BA5193" w:rsidP="00BA5193">
	<w:r w:rsidRPr="00BA5193">
		<w:t>e</w:t>
	</w:r>
	<w:r w:rsidRPr="00BA5193">
		<w:rPr>
			<w:rtl/>
			<w:lang w:bidi="he-IL"/>
		</w:rPr>
		<w:t>־</w:t>
	</w:r>
	<w:r w:rsidRPr="00BA5193">
		<w:t xml:space="preserve"> = 5</w:t>
	</w:r>
</w:p>

I don't speak OOXML'ese, but assuming the w:rPr tag is only supposed to apply within its enclosing w:r, then indeed we should see 

e,MAQAF,space,=,space,5

in Word and in LibreOffice. But Mike says, on IRC, that if he clears formatting in MS Word, the placement changes. So, weird.
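
To make the per-run properties explicit, here is a small sketch that pulls them out of the markup with Python's xml.etree.ElementTree. The paragraph is a simplified copy of the one above (rsid and w14 attributes dropped), and run_properties is a hypothetical helper, not anything from the import filter. It assumes a bare <w:rtl/> (no w:val) means "on".

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Simplified copy of the paragraph above (rsid/w14 attributes dropped).
PARAGRAPH = f"""<w:p xmlns:w="{W}">
  <w:r><w:t>e</w:t></w:r>
  <w:r><w:rPr><w:rtl/><w:lang w:bidi="he-IL"/></w:rPr><w:t>\u05be</w:t></w:r>
  <w:r><w:t xml:space="preserve"> = 5</w:t></w:r>
</w:p>"""

def run_properties(paragraph_xml):
    """Return (text, has_rtl, bidi_lang) for each w:r in the paragraph."""
    ns = {"w": W}
    result = []
    for run in ET.fromstring(paragraph_xml).findall("w:r", ns):
        text = "".join(t.text or "" for t in run.findall("w:t", ns))
        rpr = run.find("w:rPr", ns)
        # Properties apply only to the run that encloses them.
        has_rtl = rpr is not None and rpr.find("w:rtl", ns) is not None
        lang = rpr.find("w:lang", ns) if rpr is not None else None
        bidi_lang = lang.get(f"{{{W}}}bidi") if lang is not None else None
        result.append((text, has_rtl, bidi_lang))
    return result

for item in run_properties(PARAGRAPH):
    print(item)
```

Only the middle run carries the rtl flag and the he-IL bidi language; the runs on either side have no run properties at all.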

Comment 3 Eyal Rozenberg 2024-09-21 15:50:45 UTC

> in Word and in LibreOffice. But Mike says, on IRC, that if he clears
> formatting in MS Word, the placement changes.

I misspoke, that doesn't actually happen.

Anyway, forget about the DOCX. To "reproduce" the issue, do the following in LO and in MS Word:

1. Create a new document
2. Make your paragraph LTR
3. Be in an English keyboard layout
4. Type a
5. Be in a Hebrew keyboard layout and type ALEPH, or alternatively paste ALEPH  (but don't insert it using the SCD)
6. Be in an English keyboard layout
7. Type: equals, 5

In LO, the direction of ALEPH,equals,5 will flip when you type the 5.
In MS Word - it will not, provided that you were using an English keyboard layout.

I believe this is due to MS Word infusing the characters with a language - and thus not treating them as purely neutral.

You can verify that by selecting individual characters after pasting each additional character. In LO, the equals, for example, will start out as English, but will then become Hebrew, supposedly, after typing the 5. In MS Word, the equals remains English - Word remembers that when you typed it, you were typing English. 

Mike notes that if you're in MS Notepad - you get the same behavior as in LO, because Notepad is a plain text editor, and does not save the language a character is supposed to be in, just the characters themselves.
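
The flip can be reproduced with a heavily reduced model of the bidi algorithm. The sketch below handles only an LTR paragraph containing classes L, R, EN and neutrals (rules W7, N1, N2, I1, L2); it is not ICU and not what LO runs. The forced parameter is a hypothetical stand-in for "the character remembers it was typed as English", which is only my guess at what Word does.

```python
import unicodedata

NEUTRAL = {"WS", "ON", "B", "S"}

def visual_order(text, forced=None):
    """Reduced bidi reordering for an LTR paragraph.

    Supports only L, R, EN and neutral classes; no explicit controls,
    no Arabic, no separators. `forced` is a hypothetical {index: class}
    map standing in for per-character language information.
    """
    types = [unicodedata.bidirectional(c) for c in text]
    for i, cls in (forced or {}).items():
        types[i] = cls

    # W7: a European number whose nearest preceding strong type is L becomes L.
    strong = "L"  # start of text counts as L in an LTR paragraph
    for i, t in enumerate(types):
        if t in ("L", "R"):
            strong = t
        elif t == "EN" and strong == "L":
            types[i] = "L"

    def side(t):
        # For N1, numbers count as R.
        return "R" if t in ("R", "EN") else "L"

    # N1/N2: resolve each run of neutrals from its neighbours.
    i = 0
    while i < len(types):
        if types[i] in NEUTRAL:
            j = i
            while j < len(types) and types[j] in NEUTRAL:
                j += 1
            before = side(types[i - 1]) if i > 0 else "L"
            after = side(types[j]) if j < len(types) else "L"
            resolved = before if before == after else "L"
            types[i:j] = [resolved] * (j - i)
            i = j
        else:
            i += 1

    # I1: levels on top of the even base level 0.
    levels = [{"L": 0, "R": 1, "EN": 2}[t] for t in types]

    # L2: reverse contiguous runs at each level, highest level first.
    chars = list(text)
    for level in (2, 1):
        i = 0
        while i < len(chars):
            if levels[i] >= level:
                j = i
                while j < len(chars) and levels[j] >= level:
                    j += 1
                chars[i:j] = chars[i:j][::-1]
                levels[i:j] = levels[i:j][::-1]
                i = j
            else:
                i += 1
    return "".join(chars)

ALEPH = "\u05d0"
print([hex(ord(c)) for c in visual_order("a" + ALEPH + "=")])
print([hex(ord(c)) for c in visual_order("a" + ALEPH + "=5")])
print([hex(ord(c)) for c in visual_order("a" + ALEPH + "=5", forced={2: "L", 3: "L"})])
```

In this model the equals sign stays in place while it is the last character, moves to the other side of the ALEPH as soon as the 5 follows it, and stays put again if the equals and the 5 are forced to L.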

Comment 4 V Stuart Foote 2024-09-21 16:24:31 UTC

Huh, no idea what secret sauce MS might provide for "...infusing the characters with language".

We clearly implement Unicode bidi support, testing individual characters with the ICU lib as "Strong", "Weak", "Neutral" or "Explicit", and building against ICU lib tested word bounds.

The failure in our bidi implementation lies in parsing the logical-order text spans composed of some mix of the four categories.

In Mike's example the "U+05BE HEBREW PUNCTUATION MAQAF" is strongly typed. So our handling should treat the whole expression to its bounds as a Hebrew language text run.

We'd have to recognize it and probably apply Explicit bidi bracketing (i.e. LRE/RLE or LRO/RLO) with or without termination (PDF). In any case requiring additional i18n logic in the editengine(s).

Khaled had done some work on the sm Formula editor to implement Arabic formula entry, so imagine some of that work could be applicable to broader bidi handling.

=-refs-=
https://en.wikipedia.org/wiki/Bidirectional_text

https://unicode-org.github.io/icu-docs/apidoc/dev/icu4j/com/ibm/icu/text/Bidi.html

Comment 5 Eyal Rozenberg 2024-09-21 20:22:39 UTC

(In reply to V Stuart Foote from comment #4)
> In Mike's example the "U+05BE HEBREW PUNCTUATION MAQAF" is strongly typed.
> So our handling should treat the whole expression to its bounds as a Hebrew
> language text run.

Are you 100% sure that implication is mandated by the UBA? It's not what MS Notepad does. It's not what GTK text widgets do. It's not what Kate does (so, Qt widgets).

> We'd have to recognize it and probably apply Explicit bidi bracketing (i.e.
> LRE/RLE or LRO/RLO) with or without termination (PDF). In any case requiring
> additional i18n logic in the editengine(s).

Not 100% sure what "it" is. And not following the logic for applying bidi embed or override characters... do you mean you believe that's how MS Word is implementing its behavior?

Anyway, the way I see it, once bug 148257 is implemented (whenever that is), we'll need to:

1. Reconsider which chars have what kind of directionality, given that we know their language.
2. Feed the UBA characters with amended directionality (which may mean a modified UBA implementation, or a preprocessing introducing marks, although that doesn't sound fun).
3. Make sure we extract the language info from the OOXML and apply it.

Comment 6 Mike Kaganski 2024-09-22 10:53:57 UTC

(In reply to V Stuart Foote from comment #4)
> We'd have to recognize it and probably apply Explicit bidi bracketing (i.e.
> LRE/RLE or LRO/RLO) with or without termination (PDF). In any case requiring
> additional i18n logic in the editengine(s).

It is important that the OOXML markup gives clear borders where this automatic direction change happens. As Eyal mentioned in comment 2, the markup in the document is

<w:r><w:t>LTR text</w:t></w:r>
<w:r><w:rPr><w:rtl/><w:lang w:bidi="he-IL"/></w:rPr><w:t>RTL text</w:t></w:r>
<w:r><w:t>LTR text</w:t></w:r>

And editing the XML, dropping either one of <w:rtl/> or <w:lang w:bidi="he-IL"/> (but keeping the other) keeps the look in Word as expected (e- = 5); while dropping both makes Word show the same thing as Writer and Notepad: e5 = -

So this may be seen as an import filter problem: then we need to check whether the next run changes direction explicitly, and put the respective marks (and then, likely, these marks become document content, which we will export). Or we could consider this a more fundamental lack of functionality, where we can't explicitly mark text runs as having a wanted language/direction, e.g. when the run consists of neutral characters (bug 148257). However, the interoperability problem would still need fixing if the new feature takes long; the explicit direction change marks would then still be a valid temporary fix.

Comment 7 Mike Kaganski 2024-09-22 11:00:17 UTC

And likely, we would need to use "isolate" marks (from Unicode 6.3): LRI, RLI, PDI - in this case - exactly because those characters are designed for the problem that we see here, to mark some run in such a way, that it has no effect on direction outside of the run.

Comment 8 Mike Kaganski 2024-09-22 11:12:22 UTC

Compare the behavior of these two strings:

U+0065U+05beU+0020U+003dU+0020U+0035

(the original problematic string), and

U+0065U+2067U+05beU+2069U+0020U+003dU+0020U+0035

(the one where U+05be is surrounded by U+2067/U+2069 pair, i.e. RLI/PDI): this latter one renders as expected.
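
A minimal sketch of how the second string can be produced from the run list, assuming a run is represented as a (text, is_rtl) pair. The names wrap_rtl_runs and code_points are made up for illustration, and this is not the code of any actual patch.

```python
RLI, PDI = "\u2067", "\u2069"  # RIGHT-TO-LEFT ISOLATE, POP DIRECTIONAL ISOLATE

def wrap_rtl_runs(runs):
    """Join (text, is_rtl) runs, isolating RTL-marked runs with RLI/PDI."""
    return "".join(
        (RLI + text + PDI) if is_rtl else text for text, is_rtl in runs
    )

def code_points(s):
    """Spell a string out in the U+xxxx notation used above."""
    return "".join(f"U+{ord(c):04x}" for c in s)

runs = [("e", False), ("\u05be", True), (" = 5", False)]
plain = "".join(text for text, _ in runs)
isolated = wrap_rtl_runs(runs)

print(code_points(plain))
print(code_points(isolated))
```

The two printed lines match the two strings given above.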

Comment 9 V Stuart Foote 2024-09-22 12:49:23 UTC

Our UBA handling aside, the original document source was probably misformatted: the formula was authored with U+05BE HEBREW PUNCTUATION MAQAF when it probably should have used the appropriate U+207B SUPERSCRIPT MINUS.

Just saying...

Comment 10 Mike Kaganski 2024-09-22 12:57:40 UTC

(In reply to V Stuart Foote from comment #9)
> Just saying...

Yes, and this is irrelevant. The original document was not even a DOCX, it was a binary PPT; and it was a real-world presentation on a chemistry topic (structure of the atom). My strong guess is that the teacher (who didn't know much about formatting and Unicode) used this sequence in PowerPoint: typed "e = 5", then put the cursor after "e", then used Insert->Symbol, and in the dialog (similar to our Special Symbol one) looked for something that would look roughly correct, which happened to be that RTL symbol. No matter how "incorrect" this procedure might be: a child using LibreOffice was unable to see the slide as intended.

Comment 11 Eyal Rozenberg 2024-09-22 19:35:11 UTC

(In reply to Mike Kaganski from comment #7)
> And likely, we would need to use "isolate" marks (from Unicode 6.3): LRI,
> RLI, PDI - in this case - exactly because those characters are designed for
> the problem that we see here, to mark some run in such a way, that it has no
> effect on direction outside of the run.

But if you stick isolation marks in the imported document - that would make it hellish to edit... if the cursor "recognizes" them, then you have a huge amount of invisible characters you would be wading through all the time; and if the cursor skips them, then the typical user, who does not assume they exist, will just be faced with a large number of "magical" RTL runs which behave differently from the "non-magical" RTL runs which the user types in. And they will always be stressed out about whether their edits or copy-pastes will preserve the "magic" or not.

(In reply to Mike Kaganski from comment #6)
> the markup in the document is
> 
> <w:r><w:t>LTR text</w:t></w:r>
> <w:r><w:rPr><w:rtl/><w:lang w:bidi="he-IL"/></w:rPr><w:t>RTL text</w:t></w:r>
> <w:r><w:t>LTR text</w:t></w:r>

but I assume that there is an implicit higher-level setting of LTR direction and English (or other similar) language, which should actually be considered together with the explicit 'differential' markup.

Comment 12 Mike Kaganski 2024-09-22 20:26:15 UTC

(In reply to Eyal Rozenberg from comment #11)
> But if you stick isolation marks in the imported document - that would make
> it hellish to edit... if the cursor "recognizes" them, then you have a huge
> amount of invisible characters you would be wading through all the time;

It does recognize them.
And in the meantime, while we don't have a property for this, it's better than nothing.

Comment 13 Mike Kaganski 2024-09-22 21:06:29 UTC

https://gerrit.libreoffice.org/c/core/+/173782

Comment 14 Eyal Rozenberg 2024-09-22 21:21:15 UTC

(In reply to Mike Kaganski from comment #13)
> https://gerrit.libreoffice.org/c/core/+/173782

I am not so sure it's better than nothing. The downsides may outweigh the upsides. Remember that if users are at all willing to mess with special Unicode direction control chars, they might just as well use RLM and LRM to indicate what they want. I would not make that call myself without talking about it to more people.

Also consider what happens when you open the DOCX, then save it back and send it to your friend who uses MSO.

Comment 15 Mike Kaganski 2024-09-23 04:39:01 UTC

(In reply to Eyal Rozenberg from comment #14)
> I am not so sure it's better than nothing. The downsides may outweigh the
> upsides. Remember that if users are at all willing to mess with special
> Unicode direction control chars, they might just as well use RLM and LRM to
> indicate what they want.

You seem to forget what this bug is about. It is not about creating content. It is about opening existing files created in MSOffice. What you wrote above is irrelevant in this bug's case.

> Also consider what happens when you open the DOCX then save it back and send
> it to your friend who uses MSO.

They will see it OK. The round-trip should convert the characters back to the run properties.
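
The export direction could look roughly like this: split the text at the isolate characters and turn each isolated span back into a run flagged as RTL. This is a hypothetical sketch (split_isolates is an invented name, only non-nested RLI/PDI pairs are handled), not the code of the gerrit change.

```python
RLI, PDI = "\u2067", "\u2069"  # RIGHT-TO-LEFT ISOLATE, POP DIRECTIONAL ISOLATE

def split_isolates(text):
    """Split text into (text, is_rtl) runs; RLI...PDI spans become RTL runs.

    Non-nested RLI/PDI pairs only; the control characters themselves are
    dropped, since the run property carries the direction instead.
    """
    runs, buf, in_rtl = [], [], False
    for ch in text:
        if (ch == RLI and not in_rtl) or (ch == PDI and in_rtl):
            # Close the current run at the boundary and switch state.
            if buf:
                runs.append(("".join(buf), in_rtl))
                buf = []
            in_rtl = ch == RLI
        else:
            buf.append(ch)
    if buf:
        runs.append(("".join(buf), in_rtl))
    return runs

print(split_isolates("e\u2067\u05be\u2069 = 5"))
```

Each run flagged True would then be written with <w:rtl/> in its w:rPr, giving back the three-run structure of the original document.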

Comment 16 Khaled Hosny 2024-09-30 18:26:39 UTC

I’m pretty sure there are a few other open bugs affected by the lack of support of inline direction control when importing DOCX documents.

It is generally a desirable feature, and even HTML/CSS has it (<bdi> element and unicode-bidi CSS property), so long term ODF should have it, and maybe model it after the HTML/CSS ones (internally, they can be implemented using Unicode bidi control characters, so no changes are needed to the bidi algorithm or ICU libraries).