Bug 128249 - Calc does not correctly show some Chinese characters
Summary: Calc does not correctly show some Chinese characters
Status: RESOLVED NOTABUG
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version: 6.3.2.2 release (earliest affected)
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-10-19 10:50 UTC by Owen Parry
Modified: 2019-10-25 08:29 UTC
CC List: 2 users

See Also:
Crash report or crash signature:


Attachments

Description Owen Parry 2019-10-19 10:50:53 UTC
Description:
Certain valid Chinese characters are rendered as boxes when using the Droid Sans font.

Steps to Reproduce:
Working with the UTF-8 byte sequence "\xE5\x90\xB4\xE7\x8F\xAE\xE9\xA3\x8F"

Enter the above 3 Chinese characters into a cell and select the font "Droid Sans". The text is displayed as 3 boxes (invalid codepoint?), or with only the first character as a box, or with only the last character as a box. This behaviour changes depending on where you click in the editor. I've validated in other text editors that the "Droid Sans" font works correctly.

The input box, directly above the sheet, will display all characters correctly when selecting the cell. But when clicking the input box itself, all the characters within the input change to boxes.

Now enter the string "\x50\x61\x6D\xEF\xBC\x88\xE5\x90\xB4\xE7\x8F\xAE\xE9\xA3\x8F\xEF\xBC\x89" into a cell. This is the letters "Pam" followed by an opening parenthesis, the 3 Chinese characters above, and a closing parenthesis. The third Chinese character is always displayed as a box. Furthermore, trying to change the font will only change the font of the first 3 ASCII characters. It is impossible to change the font of the non-ASCII characters. (Perhaps the sheet needs to be configured to use Droid Sans first to see this bug.)

Earlier versions of Calc (I cannot downgrade on my OS) did not have this issue, but I'm unable to specify in exactly which version this appeared.

Actual Results:
Invalid codepoint character displayed

Expected Results:
Actual Character displayed


Reproducible: Always


User Profile Reset: Yes



Additional Info:
Using Manjaro Linux with the 5.3.6 kernel. LibreOffice-fresh 6.3.2 from the arch repos. Tested on both Gnome and KDE.

Version: 6.3.2.2
Build ID: 6.3.2-2
CPU threads: 16; OS: Linux 5.3; UI render: default; VCL: kde5; 
Locale: en-US (en_US.UTF-8); UI-Language: en-US
Calc: threaded
Comment 1 V Stuart Foote 2019-10-19 23:45:10 UTC
2-, 3- and 4-byte UTF-8 encodings will not work for direct Unicode input. Calc's =UNICHAR() function can be used with the _decimal_ value of the Unicode code point, but it is cumbersome compared to the global <Alt>+x toggle implemented for bug 73691.

Convert the input string to UTF-16. LibreOffice handles the conversion to Unicode cleanly if each value is prefixed with "U+", e.g. "U+0050", and the <Alt>+x toggle is then applied from the end of the string.

So, in utf-16 your sample string for Unicode toggle would be:

u+0050u+0061u+006du+ff08u+5434u+73eeu+98cfu+ff09

Try that and see if you get better results.
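
For anyone who wants to reproduce the conversion described above, here is a minimal Python sketch. The helper names `to_u_plus` and `to_decimal` are made up for illustration; the byte strings are the two samples from the description. For characters in the Basic Multilingual Plane, as all of these are, the UTF-16 value equals the Unicode code point, so a plain decode is enough.

```python
# UTF-8 byte strings copied from the bug description.
SAMPLES = [
    b"\xE5\x90\xB4\xE7\x8F\xAE\xE9\xA3\x8F",
    b"\x50\x61\x6D\xEF\xBC\x88\xE5\x90\xB4\xE7\x8F\xAE\xE9\xA3\x8F\xEF\xBC\x89",
]


def to_u_plus(raw: bytes) -> str:
    """Decode UTF-8 bytes and write each code point as u+XXXX (hypothetical helper)."""
    return "".join(f"u+{ord(ch):04x}" for ch in raw.decode("utf-8"))


def to_decimal(raw: bytes) -> list[int]:
    """Decode UTF-8 bytes and return decimal code points, e.g. for =UNICHAR()."""
    return [ord(ch) for ch in raw.decode("utf-8")]


if __name__ == "__main__":
    for raw in SAMPLES:
        print(to_u_plus(raw))   # string to paste before pressing <Alt>+x
        print(to_decimal(raw))  # one decimal value per =UNICHAR() call
```

The second sample should come out as the same u+ string quoted above.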

Please note that Droid CJK coverage requires an Ascender Pro purchase, and I am not sure the open-source builds included the CJK glyphs.

The Google Noto Sans CJK successor to Droid is probably a more functional font and is readily available in open-source builds.

So, if you are actually using Droid without a CJK locale, you will get font fallback to some other font on the system that has coverage.
Comment 2 Owen Parry 2019-10-20 01:17:25 UTC
Please note, I provided the strings as UTF-8 byte arrays for YOUR convenience. If I just pasted Chinese characters into the bug report, I would have no way of knowing whether you could copy them. On your end, convert the UTF-8 bytes to readable text and enter them into a cell.

Here are the two strings as readable text:
吴珮飏
Pam(吴珮飏)

If the font itself were at fault, I would expect to see the box character 100% of the time, but that is not the case. As mentioned, either the first, the last or all characters are sometimes replaced by the box, depending on what is currently selected in the editor. The input bar displays the cell's contents correctly until it is clicked, at which point all characters are converted to boxes.

I have just tried with 'Noto Sans CJK'. If the document is saved using Noto, then reopened, the exact same problem exists.

All other Chinese characters, as far as I can tell, are displayed correctly. The above combination of 3 characters is causing the problem.
Comment 3 V Stuart Foote 2019-10-20 02:12:27 UTC
Cannot confirm on Windows 10 Home 64-bit en-US (1903) with
Version: 6.3.2.2 (x64)
Build ID: 98b30e735bda24bc04ab42594c85f7fd8be07b9c
CPU threads: 4; OS: Windows 10.0; UI render: GL; VCL: win; 
Locale: en-US (en_US); UI-Language: en-US
Calc: threaded

The Formula input bar uses system font Segoe UI with fall back - it drops the 飏 (yang) fallback and shows an undefined codepoint when it is the last entry on the line. 

Forcing replacement of the system font Segoe UI with Noto Sans CJK TC (from Tools -> Options -> Fonts: Replacement table) fixed the missing fallback glyph and removed the annoying fallback resize.

Otherwise the cells on the sheet pick up their assigned Noto Sans CJK TC and are always fully formed, last glyph or not.

Please save a sample .ODS spreadsheet to 'Flat XML ODF Spreadsheet (.fods)' format and attach.

But I think the missing glyphs are expected, depending on the system locale, and can be controlled by forcing a font substitution.
Comment 4 Owen Parry 2019-10-24 23:52:11 UTC
Changing the system-wide default font to 'Noto Sans CJK SC', changing each cell in the document to use the same font, AND selecting the cell, clicking the formula input bar and changing the font again has resolved the issue.

It appears as though two fonts are associated with each cell. When I selected 'Droid Sans', it fell back to whatever font the cell was previously saved with. When I clicked the input bar, the font name would change.

I'm putting this down to a mix of system upgrades, LibreOffice updates and old documents.
Comment 5 Ming Hua 2019-10-25 08:29:32 UTC
I'm glad that this problem has been solved for Owen. I'd just like to add that there are indeed two fonts associated with each cell - and with each style, for that matter - the western font and the Asian font (sometimes even a third font, the CTL font, is involved, but I have no experience with that).
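
To illustrate how one string can end up split between the two fonts, here is a rough Python sketch. It is NOT LibreOffice's actual classification logic. It only assumes, for illustration, that characters with a wide or fullwidth East Asian Width property go to the Asian font and everything else goes to the western font. The names `script_class` and `split_runs` are made up.

```python
import unicodedata
from itertools import groupby

# Second sample string from the bug description, given as UTF-8 bytes.
SAMPLE = (
    b"\x50\x61\x6D\xEF\xBC\x88\xE5\x90\xB4"
    b"\xE7\x8F\xAE\xE9\xA3\x8F\xEF\xBC\x89"
).decode("utf-8")


def script_class(ch: str) -> str:
    """Assumed heuristic: wide ("W") or fullwidth ("F") means Asian font."""
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return "asian"
    return "western"


def split_runs(text: str) -> list[tuple[str, str]]:
    """Group consecutive characters that share the same assumed font class."""
    return [(cls, "".join(chars)) for cls, chars in groupby(text, key=script_class)]


if __name__ == "__main__":
    for cls, run in split_runs(SAMPLE):
        print(cls, [f"U+{ord(c):04X}" for c in run])
```

Under this assumption only "Pam" falls into the western run, which would be consistent with a font change affecting only the first 3 characters when it is applied to the western font alone.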

In this case, the character 飏 (U+98CF) is causing problems mostly because it is in neither the GB2312/CP936 nor the Big5 legacy encodings (I didn't check other than looking at its Unicode codepoint data), and is therefore not included in a lot of international fonts.  Only fonts that are dedicated to CJK coverage include it.  Of course this doesn't explain why Owen sometimes sees only the first character 吴 as a box.

Something definitely changed in LO 6.3 in how it decides whether to use the western or the Asian font, and I am worried that bugs like this will keep popping up.
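
The legacy-encoding question can be checked with a short Python sketch like the one below. It assumes that Python's built-in `gb2312`, `gbk` (Python's name for CP936) and `big5` codecs are a fair stand-in for those encodings. Their tables say nothing about which glyphs a particular font ships. The names `can_encode` and `coverage` are made up for illustration.

```python
# Codec names as used by Python's standard library.
LEGACY_CODECS = ("gb2312", "gbk", "big5")


def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def coverage(text: str) -> dict[str, dict[str, bool]]:
    """Map each character to a per-codec flag saying whether it is encodable."""
    return {
        ch: {codec: can_encode(ch, codec) for codec in LEGACY_CODECS}
        for ch in text
    }


if __name__ == "__main__":
    # The three characters from the bug description, given as UTF-8 bytes.
    name = b"\xE5\x90\xB4\xE7\x8F\xAE\xE9\xA3\x8F".decode("utf-8")
    for ch, result in coverage(name).items():
        print(f"U+{ord(ch):04X}", result)
```

GB2312 and GBK are listed separately because they are different character sets, so the results for the two may differ.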