Bug 107204 - Writer treats Hungarian Rovas (aka Old Hungarian) text as left-to-right script instead of right-to-left
Status: ASSIGNED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 5.3.2.1 rc
Hardware: All All
Importance: medium normal
Assignee: Khaled Hosny
Blocks: Font-Rendering
Reported: 2017-04-16 12:11 UTC by Kovács Viktor
Modified: 2017-04-20 08:37 UTC

Attachments
Unicode 8.0 old hungarian conform font and screenshots (860.64 KB, application/zip)
2017-04-18 12:58 UTC, Kovács Viktor
Screenshot, looks OK (166.36 KB, image/png)
2017-04-18 14:30 UTC, Khaled Hosny
sample ODT document using Unicode 10c80:10cff (12.57 KB, application/vnd.oasis.opendocument.text)
2017-04-19 18:17 UTC, V Stuart Foote
screen clip of sample ODT (99.10 KB, image/png)
2017-04-19 18:22 UTC, V Stuart Foote
Comparison before/after removing the wrong optimization (151.67 KB, image/png)
2017-04-19 19:48 UTC, Khaled Hosny

Description Kovács Viktor 2017-04-16 12:11:07 UTC
Description:
Old Hungarian is a right-to-left script (Unicode range: U+10C80-U+10CFF), but LibreOffice does not treat it as such.
It works perfectly in GNOME's gedit, but when I copy the text from gedit to Writer, it again appears left to right.

Actual Results:  
Old Hungarian script appears left to right.

Expected Results:
Old Hungarian script must appear right to left; it works in my browser:
Helyes = 𐲏𐳉𐳗𐳉𐳤 



Reproducible: Always

User Profile Reset: No

Additional Info:
[Information automatically included from LibreOffice]
Locale: hu
Module: StartModule
[Information guessed from browser]
OS: Linux (All)
OS is 64bit: yes
Builds ID: LibreOffice 5.3.2.1


User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
Comment 1 V Stuart Foote 2017-04-16 17:43:48 UTC
Confirmed on Windows 10 Pro 64-bit en-US (1703) with
Version: 5.3.2.2 (x64)
Build ID: 6cd4f1ef626f15116896b1d8e1398b56da0d0ee1
CPU Threads: 8; OS Version: Windows 6.19; UI Render: GL; Layout Engine: new; 
Locale: en-US (en_US); Calc: group

The Old Hungarian block of text will still need a font with glyphs defined for those Unicode points. What font is providing you support for that range?  I had none, suspect most users would not either.

The GNU Unifont Glyphs project provides low-quality bitmap font coverage of the SMP, including this Unicode range [1]. When I install the SMP supplement, I confirm that character entry for the range is RTL rather than LTR.

With the same system and font, BabelStone BabelPad handles it correctly as RTL, so I assume there is hinting in the font.

=-ref-=
[1] http://unifoundry.com/unifont.html
Comment 2 Kovács Viktor 2017-04-17 02:13:38 UTC
I am sorry, Linux has the required font for that Unicode range.
I will attach a font of that kind later, I promise.
Comment 3 Kovács Viktor 2017-04-18 12:58:21 UTC
Created attachment 132651
Unicode 8.0 old hungarian conform font and screenshots

I have attached my own font with an Old Hungarian implementation, test layout source code for Linux and Windows operating systems, and screenshots of the LibreOffice Writer output (wrong) and GNOME's gedit output (right).
Comment 4 V Stuart Foote 2017-04-18 13:30:40 UTC
Thanks, that is a useful font.

Guess we need to determine if support for other RTL scripts from the Unicode Supplementary Multilingual Plane (SMP) is handled correctly or if it is a specific issue with HarfBuzz.

In hb is LTR/RTL read as a font attribute, or is it defined by Unicode block?

Khaled?
Comment 5 Kovács Viktor 2017-04-18 13:44:21 UTC
The Unicode standard defines this block as RTL! I do not understand the question!
Comment 6 Kovács Viktor 2017-04-18 13:56:13 UTC
I tested the Windows layout, too. It works correctly only in Notepad; the "formatter-editors" fail under Windows, too. I would like there to be at least one "formatter-editor" that works correctly!
Comment 7 Khaled Hosny 2017-04-18 14:28:49 UTC
(In reply to V Stuart Foote from comment #4)
> Thanks, that is a useful font.
> 
> Guess we need to determine if support for other RTL scripts from the Unicode
> Supplementary Multilingual Plane (SMP) is handled correctly or if it is a
> specific issue with HarfBuzz.
> 
> In hb is LTR/RTL read as a font attribute, or is it defined by Unicode block?
> 
> Khaled?

Directionality is a property of the text, not of the font, and is not handled by HarfBuzz; we determine the text direction based on the Unicode bidirectional algorithm (we use ICU for that) before calling HarfBuzz.

The attachment does not contain any text documents, but the text in the bug description is shown RTL in Writer already as expected.
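
As a rough illustration of "direction is a property of the text", the sketch below lists the Unicode Bidi_Class of each character in the sample word from the description. It uses Python's standard `unicodedata` module as a stand-in for ICU, which is an assumption made only for this example; it is not how LibreOffice queries the property, and the helper name `bidi_class_report` is hypothetical.

```python
import unicodedata

def bidi_class_report(text):
    """Return (code point, Bidi_Class) pairs for every character in text."""
    return [("U+%04X" % ord(ch), unicodedata.bidirectional(ch)) for ch in text]

# The sample word from the description, written with escapes so the
# example does not depend on an Old Hungarian font being installed.
word = "\U00010C8F\U00010CC9\U00010CD7\U00010CC9\U00010CE4"

for code_point, bidi_class in bidi_class_report(word):
    # "R" means strong right-to-left, "L" means strong left-to-right.
    print(code_point, bidi_class)
```

No font is consulted anywhere in this lookup; the class comes from the character database alone, provided that database is recent enough to know the block.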
Comment 8 Khaled Hosny 2017-04-18 14:30:20 UTC
Created attachment 132654
Screenshot, looks OK
Comment 9 V Stuart Foote 2017-04-18 16:13:57 UTC
@Khaled, OK thanks! 

And on master I can also force a paragraph to RTL (from my en-US locale's default LTR) and then paste special with the sample from comment 0 to match your clip.

Guess that with no Language defined for "Old Hungarian" script, and ICU UAX#31 suggests it would not be likely, our only choice is to toggle to RTL and author text without language tagging.

Should this be closed NOT A BUG?

=-ref-=
http://unicode.org/reports/tr31/#Table_Candidate_Characters_for_Exclusion_from_Identifiers
Comment 10 Khaled Hosny 2017-04-18 16:27:02 UTC
(In reply to V Stuart Foote from comment #9)
> @Khaled, OK thanks! 
> 
> And on master I also can force a paragraph to RTL (from my en-US local's
> default LTR) and then paste special with the sample from comment 0 to match
> your clip.

Paragraph direction and text direction are different (but related) things. The paragraph direction is set manually (Writer will try to be smart and use an appropriate default), but the text direction is automatic. The OP's screenshot shows LTR text direction and I think that is the issue being reported here, but I can’t reproduce it.

Question to the OP: are you using TDF or distro builds of LibreOffice? If the latter, what version of ICU do you have, and can you try with TDF builds?

> Guess that with no Language defined for "Old Hungarian" script, and ICU
> UAX#31 suggests it would not be likely, our only choice is to toggle to RTL
> and author text without language tagging.

Not sure what you are saying here. Text direction is language independent, it is controlled by fixed character properties provided by Unicode.
Comment 11 V Stuart Foote 2017-04-18 17:43:24 UTC
(In reply to Khaled Hosny from comment #10)
> (In reply to V Stuart Foote from comment #9)
> > @Khaled, OK thanks! 
> > 
> > And on master I also can force a paragraph to RTL (from my en-US local's
> > default LTR) and then paste special with the sample from comment 0 to match
> > your clip.
> 
> Paragraph direction and text direction are different (but related) things.
> The paragraph direction is set manually (Writer will try to be smart and use
> appropriate default), but the text direction is automatic. OP screenshot
> shows LTR text direction and I think that is the issue being reported here,
> but I can’t reproduce it.

If I set the text language to [none] and set the font for the style to OP's sample "Unicode_Maros_ext" font should it be detected as RTL for glyphs from the 10c80-10cff block? It is not.

And I can use the Tools -> Options -> Language Settings and "Ignore system input language", then in Paragraph Style dialog define the CTL Font and set "Hungarian (Szekely-Hungarian Rovas)" as the language.

Entering the sample SMP "Rovas" glyphs using the Special Character dialog does not toggle the direction of the text from LTR to RTL. It can be toggled with the Formatting toolbar buttons, but it is not automatic for the language. Seems like it should be.

> > Guess that with no Language defined for "Old Hungarian" script, and ICU
> > UAX#31 suggests it would not be likely, our only choice is to toggle to RTL
> > and author text without language tagging.
> 
> Not sure what you are saying here. Text direction is language independent,
> it is controlled by fixed character properties provided by Unicode.

My thought was that if ICU recommends these scripts not be processed for identification, we would not. Guess that is not the case, as I then found bug 97406 and that Eike has provided some support for the script.  But it does not appear in the drop list of Default Languages for Documents for the Complex text layout languages, only in the Character style dialogs.

@Eike, beyond setting it up for i18n/l10n Pootle support, was there more to be done in source for bug 97406 to accommodate LANGUAGE_USER_HUNGARIAN_ROVAS and the "Old Hungarian" Unicode SMP block for honoring its RTL script direction?
Comment 12 Khaled Hosny 2017-04-18 18:03:40 UTC
(In reply to V Stuart Foote from comment #11)
> (In reply to Khaled Hosny from comment #10)
> > (In reply to V Stuart Foote from comment #9)
> > > @Khaled, OK thanks! 
> > > 
> > > And on master I also can force a paragraph to RTL (from my en-US local's
> > > default LTR) and then paste special with the sample from comment 0 to match
> > > your clip.
> > 
> > Paragraph direction and text direction are different (but related) things.
> > The paragraph direction is set manually (Writer will try to be smart and use
> > appropriate default), but the text direction is automatic. OP screenshot
> > shows LTR text direction and I think that is the issue being reported here,
> > but I can’t reproduce it.
> 
> If I set the text language to [none] and set the font for the style to OP's
> sample "Unicode_Maros_ext" font should it be detected as RTL for glyphs from
> the 10c80-10cff block? It is not.

The language setting should not have any effect on text direction, and indeed setting it to [none] makes no difference whatsoever here.

> And I can use the Tools -> Options -> Language Settings and "Ignore system
> input language", then in Paragraph Style dialog define the CTL Font and set
> "Hungarian (Szekely-Hungarian Rovas)" as the language.
> 
> Entering the sample SMP "Rovas" glyps using the Special Character dialog
> does not toggle the direction of the text from LTR to RTL. It can be toggled
> with the Formatting toolbar buttons--but is not automatic for the language.
> Seems like it should be.

I think you are still talking about paragraph direction (since that is the one you can change from the toolbar). LibreOffice has no way to manually change text direction, it is always automatic.

> > > Guess that with no Language defined for "Old Hungarian" script, and ICU
> > > UAX#31 suggests it would not be likely, our only choice is to toggle to RTL
> > > and author text without language tagging.
> > 
> > Not sure what you are saying here. Text direction is language independent,
> > it is controlled by fixed character properties provided by Unicode.
> 
> My thought was that if ICU recommends these scripts not be processed for
> identification, we would not.

First, ICU is a software library (http://site.icu-project.org) and Unicode is a standards body; you are confusing the two. Second, UAX #31 has nothing to do with text direction and I’m not sure why you are referring to it. It is a specification for identifiers like programming language variables or hashtags (http://unicode.org/reports/tr31/#Introduction) and has no relevance to the issue being discussed here.

> Guess that is not the case as I then I found
> bug 97406 and that Eike has provided some support for the script.  But it
> does not appear in the drop list of Default Languages for Documents for the
> Complex text layout languages, only in the Character style dialogs.

This also has no relevance to the issue of text direction.
Comment 13 V Stuart Foote 2017-04-18 19:14:08 UTC
(In reply to Khaled Hosny from comment #12)

Sorry, don't mean to be thick. 

At 5.3 we are claiming support for Hungarian (magyar) using the Rovás script, wherein the Hungarian is encoded RTL departing from "modern" Hungarian which is encoded LTR with Latin derived glyphs.

Seems the challenge is getting the assigned 10c80-10cff Unicode to render RTL when the language is set to Hungarian. Where is that breaking down?

If we are depending on the ICU libraries to identify the script from its Unicode code point range and pass it for rendering as RTL rather than LTR, that is not happening.

The UAX#31 suggested to me that it would not be, as the ICU project recommended against it. And that seemed to be the case.

Hungarian (default, with Latin glyphs) is handled as a Western language and Hungarian (Szekely-Hungarian Rovas) as CTL, and I'm having trouble toggling between them; I keep getting dumped to Western and its LTR direction.

I don't see how I can force the UI to adjust, and the "automatic" detection seems to not work (if it was actually implemented for this SMP Unicode block).
Comment 14 Eike Rathke 2017-04-19 07:41:30 UTC
(In reply to V Stuart Foote from comment #11)
> But it
> does not appear in the drop list of Default Languages for Documents for the
> Complex text layout languages, only in the Character style dialogs.
Unrelated. Whether it's available as default language depends on whether locale data is available. Language tags for which no locale data is available are listed only for character attribution.

> @Eike, beyond setting it up for i18n/l10n Pootle support, was there more to
> be done in source for bug 97406 to accommodate LANGUAGE_USER_HUNGARIAN_ROVAS
> and the "Old Hungarian" Unicode SMP block for honoring its RTL script
> direction?
It's flagged as CTL (so it shows up in the CTL language list) and RTL (for whatever queries for it). However, for text rendering this is irrelevant. Text rendering uses the code points' Bidi Class property assigned by the Unicode Standard, which for the Unicode block "Old Hungarian" 10C80:10CFF *is* RTL. Given that the block was introduced with Unicode version 8.0, if it doesn't work it could be that LibreOffice was built against / is used with an ICU library that doesn't support Unicode 8 yet. For Unicode 8 support at least ICU 56 is needed. Any lower version will not do. The LibreOffice 5.3 internal ICU is 58, but that is used only in builds provided by TDF; Linux distributions usually build against the ICU version available in their release.

Given that for Khaled the issue does not occur and assuming he uses the LibreOffice internal ICU, my guess is that all boils down to the ICU version used. Maybe the original poster could answer that question? Setting NEEDINFO.

If it's due to the ICU version there's nothing we can do and we should close this as NOTABUG.
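
The version dependency described above can be sketched as a simple check: a character database older than Unicode 8.0 does not know the Old Hungarian block. The snippet uses Python's own `unicodedata` as an analogue for ICU, which is an assumption made purely for illustration; the helper names are hypothetical and nothing here reflects how LibreOffice or ICU perform the check.

```python
import unicodedata

# Unicode version that introduced the Old Hungarian block, per the comment above.
REQUIRED_UNICODE = (8, 0)

def parse_version(version):
    """Turn a dotted version string such as '9.0.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))

def has_old_hungarian_data(unicode_version):
    """True if a database of this Unicode version includes Old Hungarian."""
    return parse_version(unicode_version)[:2] >= REQUIRED_UNICODE

print("Unicode data version:", unicodedata.unidata_version)
print("Old Hungarian covered:", has_old_hungarian_data(unicodedata.unidata_version))
# An empty string from bidirectional() means this database has no entry.
print("Bidi class of U+10C80:", unicodedata.bidirectional("\U00010C80") or "(no data)")
```

On a distribution build, the equivalent question is which ICU version the package was linked against.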
Comment 15 Tamas Rumi 2017-04-19 08:36:24 UTC
Please provide me with some background information or a link about the "LibreOffice internal ICU version" so that I can understand the problem that has arisen here. Thanks in advance.
Comment 16 V Stuart Foote 2017-04-19 11:49:36 UTC
(In reply to Eike Rathke from comment #14)
> It's flagged as CTL (so it shows up in the CTL language list) and RTL (for
> whatever queries for it). However, for text rendering this is irrelevant.
> Text rendering uses the code points' Bidi Class property assigned by the
> Unicode Standard, which for the Unicode block "Old Hungarian" 10C80:10CFF
> *is* RTL. 

OK can't argue with that.

> Given that was introduced with Unicode version 8.0, if it doesn't
> work it could be that LibreOffice was build against / is used with an ICU
> library that doesn't support Unicode 8 yet. For Unicode 8 support at least
> ICU 56 is needed. Any lower version will not do. The LibreOffice 5.3
> internal ICU is 58 but that is used only in builds provided by TDF, Linux
> distributions usually build against the ICU version available in their
> release.
> 

Unfortunately, for Windows builds I am using Cloph's TDF-configured TB62 Tinderbox for nightlies and his release builds. So on Windows at least, the correct ICU libraries do not help; the Unicode block is still not rendered to canvas RTL. So something is not right.
Comment 17 Khaled Hosny 2017-04-19 12:15:23 UTC
(In reply to V Stuart Foote from comment #16)

> Unfortunately, for Windows builds I am using Clop'hs TDF configured TB62
> Tinderbox for nightlies and his release builds. So on Windows at least the
> correct ICU libraries do not help, yet the Unicode block is not rendered to
> canvas RTL. So something is not right.

Please attach a screenshot.
Comment 18 V Stuart Foote 2017-04-19 18:17:14 UTC
Created attachment 132692
sample ODT document using Unicode 10c80:10cff

Attached is a Writer document prepared on Windows 8.1 Ent with
Version: 5.3.2.2 (x64)
Build ID: 6cd4f1ef626f15116896b1d8e1398b56da0d0ee1
CPU Threads: 8; OS Version: Windows 6.29; UI Render: GL; Layout Engine: new; 
Locale: en-US (en_US); Calc: group 

Cleared the language defaults.

The default style Western font is set to the sample "Unicode_Maros_ext" at 12 pt with no language. The default style CTL font is also set to "Unicode_Maros_ext" but at 40 pt, and the language set to Hungarian (Szekely-Hungarian Rovas).

With glyphs for the codepoints used in a "Western" paragraph, the layout/shaping is not RTL.  But if forced to be CTL with an RTL paragraph, the codepoints are laid out/shaped as RTL.

If it were correct, shouldn't strings coded with the glyphs from 10c80:10cff always be RTL direction even when mixed into a Western class text? Presumably for example Hungarian (with Latin derived fonts) mixed with Hungarian strings using Rovás font.  Doesn't seem we can do that.
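
To make the expectation concrete, here is a deliberately simplified sketch of splitting a mixed string into directional runs based only on each character's strong Bidi_Class. It is far cruder than the real Unicode bidirectional algorithm (no embedding levels, no number handling; neutral characters simply inherit the preceding direction), and the function names are hypothetical.

```python
import unicodedata

def strong_direction(ch):
    """Return 'RTL' or 'LTR' for strong characters, None for everything else."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in ("R", "AL"):
        return "RTL"
    if bidi_class == "L":
        return "LTR"
    return None

def directional_runs(text, base="LTR"):
    """Group text into (direction, substring) runs.

    Simplification: neutrals (spaces, punctuation) join the run before them.
    """
    runs = []
    current = base
    for ch in text:
        direction = strong_direction(ch) or current
        if runs and runs[-1][0] == direction:
            runs[-1][1].append(ch)
        else:
            runs.append((direction, [ch]))
        current = direction
    return [(direction, "".join(chars)) for direction, chars in runs]

rovas = "\U00010C8F\U00010CC9\U00010CD7\U00010CC9\U00010CE4"
for direction, chunk in directional_runs("Helyes = " + rovas):
    print(direction, [("U+%04X" % ord(ch)) for ch in chunk])
```

Under this model the Rovás word forms its own RTL run even though the paragraph base direction is LTR, which is the behaviour being asked for here.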
Comment 19 V Stuart Foote 2017-04-19 18:22:29 UTC
Created attachment 132693
screen clip of sample ODT

Note that with a paragraph with Western layout, the 10c80:10cff is laid out LTR. And when forced into a CTL (by setting paragraph RTL) the string for the SMP codepoints are laid out RTL.
Comment 20 Khaled Hosny 2017-04-19 19:18:40 UTC
I can reproduce the issue on Windows using 5.3.2.2 release builds as well as self built master. Though now I think the difference is not Windows/Linux or ICU versions, but rather that I’m testing on an RTL locale on Linux and LTR on Windows.

I suspect there is some simplistic optimization somewhere that skips doing bidi under some conditions.
Comment 21 Khaled Hosny 2017-04-19 19:48:00 UTC
Created attachment 132697
Comparison before/after removing the wrong optimization

There are two places in Writer that check for the default direction (whatever this is) and whether the text has any scripts we classify as complex before applying the bidi algorithm. Removing these checks seems to fix the issue. The left window in the screenshot is the fixed code.
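
The kind of shortcut described above can be sketched roughly as follows. This is not the Writer code: the function names and the short list of "complex" ranges are hypothetical, chosen only to show how such a check could skip the bidi algorithm for a script it does not recognize.

```python
# Hypothetical stand-in for a list of scripts treated as "complex".
# Hebrew and Arabic are listed; Old Hungarian (U+10C80-U+10CFF) is not.
KNOWN_COMPLEX_RANGES = [(0x0590, 0x05FF), (0x0600, 0x06FF)]

def has_known_complex_script(text):
    """True if any character falls in one of the listed ranges."""
    return any(lo <= ord(ch) <= hi
               for ch in text
               for lo, hi in KNOWN_COMPLEX_RANGES)

def runs_bidi_with_shortcut(text, default_is_rtl):
    """Shortcut: apply bidi only for an RTL default or a listed script."""
    return default_is_rtl or has_known_complex_script(text)

def runs_bidi_without_shortcut(text, default_is_rtl):
    """With the checks removed, the bidi algorithm is always applied."""
    return True

rovas = "\U00010C8F\U00010CC9\U00010CD7\U00010CC9\U00010CE4"
print("LTR default, shortcut:", runs_bidi_with_shortcut(rovas, default_is_rtl=False))
print("RTL default, shortcut:", runs_bidi_with_shortcut(rovas, default_is_rtl=True))
print("LTR default, no shortcut:", runs_bidi_without_shortcut(rovas, default_is_rtl=False))
```

In this toy model the shortcut only misfires when the default direction is LTR, which would be consistent with the issue reproducing on an LTR locale but not on an RTL one.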
Comment 22 V Stuart Foote 2017-04-19 20:21:24 UTC
(In reply to Khaled Hosny from comment #21)
> Created attachment 132697 [details]
> Comparison before/after removing the wrong optimization
> 

That is more like what I'd expect. Does it also work with a new document without a CTL language default format defined?
Comment 23 Khaled Hosny 2017-04-19 22:17:55 UTC
(In reply to V Stuart Foote from comment #22)
> (In reply to Khaled Hosny from comment #21)
> > Created attachment 132697 [details]
> > Comparison before/after removing the wrong optimization
> > 
> 
> That is more like what I'd expect. Does it also work with a new document
> without a CTL language default format defined?

Yes.
Comment 24 Kovács Viktor 2017-04-20 07:44:26 UTC
I have several stupid questions: what can I do if I would like to use the new feature? Is it solved? Which version has or will have this feature?
Comment 25 V Stuart Foote 2017-04-20 08:37:48 UTC
(In reply to Kovács Viktor from comment #24)
> I have several stupid question: what can I do, if I would like use new
> feature? It solved? Which version has or will have this feature?

Khaled has a patch up [1] for code review that should clear this up and allow direct input of the 10c80:10cff glyphs RTL in line with other LTR Hungarian text. Text input would be using the Special Character dialog, or our toggle method for Unicode codepoints, e.g. enter U+10c8f and toggle to the 𐲏 glyph using <Alt>+X.
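
For anyone preparing test strings outside of Writer, the conversion that the <Alt>+X toggle performs can be approximated with a few lines of Python. This is only a sketch of the notation-to-character mapping with hypothetical helper names; it says nothing about how the toggle is implemented in LibreOffice.

```python
def codepoint_to_char(notation):
    """Convert 'U+10c8f' (or a bare hex string) to the character it names."""
    text = notation.strip()
    if text[:2].upper() == "U+":
        text = text[2:]
    return chr(int(text, 16))

def char_to_codepoint(ch):
    """Convert a single character back to 'U+XXXX' notation."""
    return "U+%04X" % ord(ch)

glyph = codepoint_to_char("U+10c8f")
print(char_to_codepoint(glyph))
```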

If it all works out, it will first be available in nightly builds of master (testing appreciated) and be released in 5.4.0.

It may then also be "back ported" to the 5.3 branch.

More complete support for Hungarian (Szekely-Hungarian Rovas) as a supported locale in the User Interface awaits translation/transliteration work noted in bug 97406 c#23 [2] and bug 103405.

=-refs-=
[1] https://gerrit.libreoffice.org/#/c/36704/

[2] https://bugs.documentfoundation.org/show_bug.cgi?id=97406#c28