Bug 98879 - Some CJK characters are wrongly converted in vertical layout (Noto CJK font)
Summary: Some CJK characters are wrongly converted in vertical layout (Noto CJK font)
Status: RESOLVED WORKSFORME
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: graphics stack (show other bugs)
Version:
(earliest affected)
5.1.1.2 rc
Hardware: x86-64 (AMD64) Linux (All)
: medium normal
Assignee: Not Assigned
QA Contact:
URL:
Whiteboard: target:5.3.0
Keywords:
Depends on: HarfBuzz
Blocks: CJK Vertical-Text
  Show dependency treegraph
 
Reported: 2016-03-25 10:16 UTC by Hiunn-hué
Modified: 2017-10-04 14:18 UTC (History)
4 users (show)

See Also:
Crash report or crash signature:


Attachments
This file demonstrate the bug on two versions of LibO (194.88 KB, image/png)
2016-03-25 10:16 UTC, Hiunn-hué
Details
Example (8.36 KB, application/vnd.oasis.opendocument.text)
2016-03-25 10:20 UTC, Hiunn-hué
Details
Screenshot from LibO 5.2 (20.83 KB, image/jpeg)
2016-04-07 09:04 UTC, Buovjaga
Details
Comparison after and before setting SAL_USE_COMMON_LAYOUT=1 in Windows10 (90.89 KB, image/png)
2016-10-22 15:49 UTC, Mark Hung
Details
With SAL_USE_COMMON_LAYOUT=1 in Ubuntu 16.03 (290.71 KB, image/png)
2016-10-23 02:18 UTC, Mark Hung
Details
Screenshot with SAL_USE_COMMON_LAYOUT (246.97 KB, image/png)
2016-10-24 21:26 UTC, Khaled Hosny
Details
missing 0x300c in Windows10. (35.22 KB, image/png)
2016-11-04 16:30 UTC, Mark Hung
Details
screenshot from 5.4.0. (218.75 KB, image/png)
2017-01-09 16:45 UTC, Hiunn-hué
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Hiunn-hué 2016-03-25 10:16:49 UTC
Created attachment 123829 [details]
This file demonstrate the bug on two versions of LibO

[Description]

In vertical layout, some characters in Noto Sans CJK are wrongly converted to other characters. Please see attachment.


[Environment]

 OS: elementary OS 0.3.2 Freya (based on Ubuntu 14.04)
 LibreOffice 5.1.1.2 (1:5.1.1~rc2-0ubuntu1~trusty0)
 LibreOffice 5.2.0.0.alpha0+ (70dad291a6baf1eff59b61f9d344c10018585dd4)
Comment 1 Hiunn-hué 2016-03-25 10:20:06 UTC
Created attachment 123830 [details]
Example
Comment 2 Hiunn-hué 2016-03-26 13:38:54 UTC
[Additional Description]

This bug report contains two issues:

 (1) Some characters, ellipsis (…) for example, should rotate 90 degrees in vervical layout, but they don't. This bug depends on the fonts (in my case it's Noto Sans CJK), and it seems to be related with #95836.

Note that not only punctuations are affected. The Japanese Kana "ー" doesn't rotate 90 degrees either.

 (2) Many characters are rendered as different one. For example, the Chinese Bopomofo "ㄡ" becomes Japanese Kana "ア"; the punctuation "」" becomes "〆"; and Kana "ス" becomes Kana "ん". If you copy those and paste to gedit, you'll find they remain the original one.
Comment 3 Buovjaga 2016-04-07 09:04:54 UTC
Created attachment 124153 [details]
Screenshot from LibO 5.2

My result does not match yours. I guess I would have to test on Linux.

The font: https://github.com/googlei18n/noto-cjk/blob/master/NotoSansCJKtc-Regular.otf

Win 7 Pro 64-bit Version: 5.2.0.0.alpha0+
Build ID: 7a7be32e5265f897174f3880adc061dac0203f1f
CPU Threads: 4; OS Version: Windows 6.1; UI Render: default; 
TinderBox: Win-x86@62-merge-TDF, Branch:MASTER, Time: 2016-03-31_07:39:11
Locale: fi-FI (fi_FI)
Comment 4 Buovjaga 2016-04-07 14:59:06 UTC
Confirmed, Linux-only.

Arch Linux 64-bit, KDE Plasma 5
Version: 5.2.0.0.alpha0+
Build ID: 9b0069c2833313ba9fab91f45edd354ba334f94b
CPU Threads: 8; OS Version: Linux 4.4; UI Render: default; 
Locale: fi-FI (fi_FI.UTF-8)
Built on April 3rd 2016
Comment 5 Mark Hung 2016-10-22 15:49:02 UTC
Created attachment 128163 [details]
Comparison after and before setting SAL_USE_COMMON_LAYOUT=1 in Windows10


This issue no more a Linux only issue after 5.3.0-alpha1. :-(
Worse yet, punctuation marks from MS built-in traditional Chinese fonts PMing-Liu and ZhengHei are incorrect now (Highlight in red box.
Comment 6 Mark Hung 2016-10-22 16:25:07 UTC
For reference
 http://www.unicode.org/reports/tr50/
Comment 7 Mark Hung 2016-10-23 02:18:32 UTC
Created attachment 128179 [details]
With SAL_USE_COMMON_LAYOUT=1 in Ubuntu 16.03

It get better in Linux with  SAL_USE_COMMON_LAYOUT=1
Comment 8 Mark Hung 2016-10-23 23:51:24 UTC
Hi Khaled,

Here are some findings and my thoughts about processing vertical writing -

For common layout, GetVerticalFlagsForScript() returns GF_NONE for USCRIPT_COMMON, so the CJK punctuation marks in the sample do not have chance to be processed processed as its vertical writing form in CommonSalLayout::LayoutText().

OTOH, I'm thinking maybe GetNextRun should merge runs with script type USCRIPT_COMMON with previous run or set it to script type of the whole text if the run contains CJK punctuation marks as mentioned in 5.1 Handling Characters with the Common Script Property in UAX#24.

The last thing is about how LayoutText set vertical flags for each glyph.
Define rotation purely based on codepoint seems to be error prone. It also depend on availability of vertical writing form in different font.
Comment 9 Khaled Hosny 2016-10-24 13:42:15 UTC
(In reply to Mark Hung from comment #8)
> Hi Khaled,
> 
> Here are some findings and my thoughts about processing vertical writing -
> 
> For common layout, GetVerticalFlagsForScript() returns GF_NONE for
> USCRIPT_COMMON, so the CJK punctuation marks in the sample do not have
> chance to be processed processed as its vertical writing form in
> CommonSalLayout::LayoutText().

We shouldn’t get any USCRIPT_COMMON by the time this function is called, since it should have been resolved into a specific script by script itemization.

> OTOH, I'm thinking maybe GetNextRun should merge runs with script type
> USCRIPT_COMMON with previous run or set it to script type of the whole text
> if the run contains CJK punctuation marks as mentioned in 5.1 Handling
> Characters with the Common Script Property in UAX#24.

This should be the case already.

> The last thing is about how LayoutText set vertical flags for each glyph.
> Define rotation purely based on codepoint seems to be error prone.

I caouldn’t come up with a better way to do it, any suggestions are highly appreciated.

What I have been thinking is that we should itemize the runs by character orientation for vertical text, in addition to the bidi and script itemization we currently do. But I couldn’t find any resources to know how to that, may be you will have a better luck than me.

> It also depend on availability of vertical writing form in different font.

I think we will have to depend on the fonts doing the right thing for now, after making sure we handle correct fonts correctly we can look into some fallback behavior for defective fonts.
Comment 10 Khaled Hosny 2016-10-24 13:50:46 UTC
What about this patch:
diff --git a/vcl/source/gdi/CommonSalLayout.cxx b/vcl/source/gdi/CommonSalLayout.cxx
index 91d7762..42e33ae 100644
--- a/vcl/source/gdi/CommonSalLayout.cxx
+++ b/vcl/source/gdi/CommonSalLayout.cxx
@@ -547,9 +547,8 @@ bool CommonSalLayout::LayoutText(ImplLayoutArgs& rArgs)
                 if (bVertical)
                 {
                     int nVertFlag;
-#if 0               /* XXX: does not work as expected for Common script */
-                    UErrorCode error = U_ZERO_ERROR;
-                    nVertFlag = GetVerticalFlagsForScript(uscript_getScript(aChar, &error));
+#if 1               /* XXX: does not work as expected for Common script */
+                    nVertFlag = GetVerticalFlagsForScript(aScriptRun.maScript);
 #else
                     nVertFlag = GetVerticalFlags(aChar);
                     if (nVertFlag == GF_ROTR)

It seems to improve this test file, but breaks parentheses :(
Comment 11 Mark Hung 2016-10-24 14:01:50 UTC
(In reply to Khaled Hosny from comment #9)

> We shouldn’t get any USCRIPT_COMMON by the time this function is called,
> since it should have been resolved into a specific script by script
> itemization.
> 
> > OTOH, I'm thinking maybe GetNextRun should merge runs with script type
> > USCRIPT_COMMON with previous run or set it to script type of the whole text
> > if the run contains CJK punctuation marks as mentioned in 5.1 Handling
> > Characters with the Common Script Property in UAX#24.
> 
> This should be the case already.

While dumping the script type, I found there are a few are UCSCRIPT_COMMON:
Char=《 Code=300a Script=0 GR_NONE
Char=… Code=2026 Script=0 GR_NONE
Char=… Code=2026 Script=0 GR_NONE
Char=》 Code=300b Script=0 GR_NONE
Char=「 Code=300c Script=0 GR_NONE
Char=… Code=2026 Script=25    GR_NONE
Char=」 Code=300d Script=25    GR_NONE

So I mean there is something wrong in GetNextRun


> 
> > The last thing is about how LayoutText set vertical flags for each glyph.
> > Define rotation purely based on codepoint seems to be error prone.
> 
> I caouldn’t come up with a better way to do it, any suggestions are highly
> appreciated.
> 
> What I have been thinking is that we should itemize the runs by character
> orientation for vertical text, in addition to the bidi and script
> itemization we currently do. But I couldn’t find any resources to know how
> to that, may be you will have a better luck than me.
> 
> > It also depend on availability of vertical writing form in different font.
> 
> I think we will have to depend on the fonts doing the right thing for now,
> after making sure we handle correct fonts correctly we can look into some
> fallback behavior for defective fonts.

I would trust Microsoft PMingLiu more for Chinese font, but I'm not a font expert.
Comment 12 Khaled Hosny 2016-10-24 14:17:37 UTC
(In reply to Mark Hung from comment #11)
> (In reply to Khaled Hosny from comment #9)
> 
> > We shouldn’t get any USCRIPT_COMMON by the time this function is called,
> > since it should have been resolved into a specific script by script
> > itemization.
> > 
> > > OTOH, I'm thinking maybe GetNextRun should merge runs with script type
> > > USCRIPT_COMMON with previous run or set it to script type of the whole text
> > > if the run contains CJK punctuation marks as mentioned in 5.1 Handling
> > > Characters with the Common Script Property in UAX#24.
> > 
> > This should be the case already.
> 
> While dumping the script type, I found there are a few are UCSCRIPT_COMMON:
> Char=《 Code=300a Script=0 GR_NONE
> Char=… Code=2026 Script=0 GR_NONE
> Char=… Code=2026 Script=0 GR_NONE
> Char=》 Code=300b Script=0 GR_NONE
> Char=「 Code=300c Script=0 GR_NONE
> Char=… Code=2026 Script=25    GR_NONE
> Char=」 Code=300d Script=25    GR_NONE

This shouldn’t happen, can you debug it?
Comment 13 Khaled Hosny 2016-10-24 21:26:04 UTC
Created attachment 128227 [details]
Screenshot with SAL_USE_COMMON_LAYOUT

This is how the attached document looks like on my system.
Comment 14 Mark Hung 2016-10-25 14:20:04 UTC
(In reply to Khaled Hosny from comment #13)
> Created attachment 128227 [details]
> Screenshot with SAL_USE_COMMON_LAYOUT
> 
> This is how the attached document looks like on my system.

Do you mean that's what it looks like without any code change?

That looks similar to mine on Linux ( some ellipsis are exception, not centered ). But we have regression on Windows.
Comment 15 Khaled Hosny 2016-10-25 22:56:18 UTC
(In reply to Mark Hung from comment #14)
> (In reply to Khaled Hosny from comment #13)
> > Created attachment 128227 [details]
> > Screenshot with SAL_USE_COMMON_LAYOUT
> > 
> > This is how the attached document looks like on my system.
> 
> Do you mean that's what it looks like without any code change?

Yes.

> That looks similar to mine on Linux ( some ellipsis are exception, not
> centered ). But we have regression on Windows.

I see, then it is more likely a bug in the Windows rendering code not the cross-platform layout code. I have some local changes to improve Windows rendering code for local text, but not finished yet.
Comment 16 Commit Notification 2016-10-28 19:24:54 UTC
Khaled Hosny committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=3d456dfa6637c6c3ebe7a21f1f1a5b05039cee2a

tdf#98879: Fix vertical text on Windows for the new layout

It will be available in 5.3.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 17 Khaled Hosny 2016-11-04 12:28:33 UTC
@Mark: Is this fixed now, or there are still pending issues?
Comment 18 Mark Hung 2016-11-04 16:30:35 UTC
Created attachment 128492 [details]
missing 0x300c in Windows10.
Comment 19 Khaled Hosny 2016-11-04 17:38:08 UTC
Hmm, that is weird. I’m pretty sure they were there when I worked on https://gerrit.libreoffice.org/gitweb?p=core.git;a=commitdiff;h=3d456dfa6637c6c3ebe7a21f1f1a5b05039cee2a.

Looks like some fonts are affected and some are not, no idea what is going on.
Comment 20 Mark Hung 2017-01-07 17:18:19 UTC
With latest build from master, brackets are all fine now. But, ellipsis isn't rotated 90 degree clockwise on Windows. It's fine on Linux.

Version: 5.4.0.0.alpha0+ (x64)
Build ID: 4badf4000837488987b08d67d75b97b442ff0242
CPU Threads: 8; OS Version: Windows 6.19; UI Render: default; 
Locale: zh-TW (zh_TW); Calc: group
Comment 21 Buovjaga 2017-01-07 17:36:47 UTC
(In reply to Mark Hung from comment #20)
> With latest build from master, brackets are all fine now. But, ellipsis
> isn't rotated 90 degree clockwise on Windows. It's fine on Linux.
> 
> Version: 5.4.0.0.alpha0+ (x64)
> Build ID: 4badf4000837488987b08d67d75b97b442ff0242
> CPU Threads: 8; OS Version: Windows 6.19; UI Render: default; 
> Locale: zh-TW (zh_TW); Calc: group

Confirmed with Noto CJK in attachment 123830 [details]

Version: 5.4.0.0.alpha0+
Build ID: 3afe82bd63fde41d2a88418fb64e4ff587b05436
CPU Threads: 4; OS Version: Windows 6.19; UI Render: default; 
TinderBox: Win-x86@42, Branch:master, Time: 2017-01-04_23:42:57
Locale: fi-FI (fi_FI); Calc: group
Comment 22 Hiunn-hué 2017-01-09 16:45:06 UTC
Created attachment 130279 [details]
screenshot from 5.4.0.

I have the following issues on 5.4.0.0.alpha0+ ...

 (1) Some brackets are wrongly converted when using the Droid Sans Fallback. They were correct on LibreOffice 5.2.3.2.  

 (2) Fullwidth Colon (U+FF1A) and Fullwidth Semicolon (U+FF1B) are not rotated 90 degreed clockwise with fonts such as Droid Sans Fallback or AR PL UMing. They were correct on LibreOffice 5.2.3.2.

 (3) Althogh the brackets like 《》 or 「」 are now correct, but they are not centered and too close to the left border.


Please see attachment.


Version: 5.4.0.0.alpha0+
Build ID: 63ddc8dc4ae1f3c3ee2f860c34886688b0ed2d57
CPU Threads: 4; OS Version: Linux 4.4; UI Render: default; VCL: gtk2; 
TinderBox: Linux-rpm_deb-x86_64@70-TDF, Branch:master, Time: 2017-01-08_00:28:48
Locale: zh-TW (zh_TW.UTF-8); Calc: group
Comment 23 Volga 2017-01-24 04:06:44 UTC
(In reply to Hiunn-hué from comment #22)
>  (3) Althogh the brackets like 《》 or 「」 are now correct, but they are not
> centered and too close to the left border.

This problem looks similar to what I have seen in attachment 128631 [details], when I disable vert feature for 《》(Mongolian Baiti:vert=0), they looks close to the top border.

Version: 5.3.0.2 (x64)
Build ID: 5ad7b2889021c491af62f7930a4b1cb631392f16
CPU Threads: 4; OS Version: Windows 6.19; UI Render: default; 布局引擎:新; 
Locale: zh-CN (zh_CN); Calc: group
Comment 24 Volga 2017-01-24 04:13:15 UTC
(In reply to Hiunn-hué from comment #22)
>  (1) Some brackets are wrongly converted when using the Droid Sans Fallback.
> They were correct on LibreOffice 5.2.3.2.  

This is font issue, I found this problem also appearing in Chrome, Firefox, Opera on Android, maybe this font has bad vert feature.
Comment 25 Volga 2017-01-24 04:28:55 UTC
(In reply to Hiunn-hué from comment #22)
>  (2) Fullwidth Colon (U+FF1A) and Fullwidth Semicolon (U+FF1B) are not
> rotated 90 degreed clockwise with fonts such as Droid Sans Fallback or AR PL
> UMing. They were correct on LibreOffice 5.2.3.2.

According to Unicode data, U+FF1A/FF1B has Tr property, which means they are fallback to Rotated if they does not have proper feature for vertical layout in a font, but consider these characters are different from bracket characters, I suggest LibO should giving an exception.
Comment 26 Volga 2017-01-24 04:30:41 UTC
(In reply to Volga from comment #25)
> According to Unicode data, U+FF1A/FF1B has Tr property, which means they are
> fallback to Rotated if they does not have proper feature for vertical layout
> in a font, but consider these characters are different from bracket
> characters, I suggest LibO should giving an exception.
See: http://www.unicode.org/reports/tr50/#data
Comment 27 Mark Hung 2017-02-28 13:55:08 UTC
Hi,
(In reply to Hiunn-hué from comment #22)
> Created attachment 130279 [details]
> screenshot from 5.4.0.

According to the attachment, characters that use Noto Sans CJK Regular TC are displayed with correct orientation in Linux. Although there are still residual issues, let's create different issues for them if necessary.

With knowledge we gain about vertical writing, I'd like to suggest to create different for the combination of 
1. Whether OpenGL is enabled
2. the characters under investigation
3. the Font under investigation, and the expectation about the result
- Whether the font support vertical writing (i.e. have vhea and vmtx table) make differences.
- Each Font have it's own Glyph substitution table (GSUB) for vertical writing so that we can not expect same character have the same result among different fonts in vertical writing. It should be verified against font design first. Intuition about the result or past expectation isn't always correct.
Comment 28 Hiunn-hué 2017-03-03 13:11:32 UTC
>  (2) Fullwidth Colon (U+FF1A) and Fullwidth Semicolon (U+FF1B) are not
> rotated 90 degreed clockwise with fonts such as Droid Sans Fallback or AR PL
> UMing. They were correct on LibreOffice 5.2.3.2.

I opened a new issue for this: 
https://bugs.documentfoundation.org/show_bug.cgi?id=106295