Bug 134350 - ICU locale assignment at word bounds for mixed CJK and Western text, wrong assignment for opening and closing the text run
Status: CLOSED DUPLICATE of bug 66791
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): Inherited From OOo
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: CJK
Reported: 2020-06-28 01:33 UTC by Kevin Suo
Modified: 2023-07-01 10:19 UTC (History)
CC List: 11 users

See Also:
Crash report or crash signature:


Attachments
Chinese and English mixed.odt (10.84 KB, application/vnd.oasis.opendocument.text)
2020-06-28 01:33 UTC, Kevin Suo
Description Kevin Suo 2020-06-28 01:33:14 UTC
Created attachment 162464
Chinese and English mixed.odt

Steps to Reproduce:
1. Copy and paste (or type manually) the following into Writer, as unformatted text:
中文 abc 中文
中文“abc”中文
中文‘abc’中文

You may set the Default Style's Asian font to "Noto Serif CJK SC" and increase the font size, for better identification.
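
For reference, the punctuation in the three sample lines can be inspected with a short script. This is only a sketch for checking code points; `describe_non_letters` is a hypothetical helper name, not part of LibreOffice.

```python
import unicodedata

# The three sample lines from the steps above.
samples = ["中文 abc 中文", "中文“abc”中文", "中文‘abc’中文"]


def describe_non_letters(text):
    """Return (code point, Unicode name) for every non-letter character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch))
        for ch in text
        if not ch.isalpha()
    ]


for line in samples:
    print(line, describe_non_letters(line))
```

The quotes are U+2018/U+2019 and U+201C/U+201D, code points that Chinese and Western text share, so the character alone does not say which font should be used.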

Current Behaviour:
The space or quote on the left side of each paragraph (i.e., following the Chinese characters) uses the Asian font, while the quote on the right side (i.e., following the English characters) uses the Western font.

Expected:
The same font should be used for both the left and right quotes. In this case it should be the Chinese font, because the quotes typed in are actually full-width Chinese quotation marks.

Version: 6.3.5.2
Build ID: dd0751754f11728f69b42ee2af66670068624673
CPU threads: 4; OS: Linux 5.6; UI render: default; VCL: gtk3; 
Locale: zh-CN (zh_CN.UTF-8); UI-Language: zh-CN
Calc: threaded

Fedora 32 x64.

Also reproducible in the current master.


This issue was originally reported in the Chinese LibreOffice discussion forum:
https://bbs.libreofficechina.org/thread-2301-1-1.html

A test ODT document is attached to illustrate this issue.
Comment 1 Xisco Faulí 2020-07-13 14:37:25 UTC
@Regina, I thought you might be interested in this issue
Comment 2 Aron Budea 2020-11-25 18:15:56 UTC
Already the same in 3.3.0.
Comment 3 V Stuart Foote 2020-11-25 19:39:21 UTC
Not sure this is incorrect.

We use ICU libs to change locale for text run at word bounds. Not sure we can then look back and change the locale for the opening word bound--here U+0020, U+2018 or U+201c--to match the locale assigned to the text run.  Or conversely change the closing word bound back when passing into the next run.
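
The effect described here can be reproduced with a simplified model. The sketch below is not LibreOffice's or ICU's actual code: `classify` and `assign_forward` are hypothetical helpers, the CJK test covers only U+4E00..U+9FFF, and the rule "a weak character inherits the class of the preceding strong character" is an assumption chosen because it matches the reported behaviour.

```python
def classify(ch):
    """Very rough script class: 'asian', 'western', or 'weak' (assumption)."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:  # CJK Unified Ideographs block only
        return "asian"
    if ch.isascii() and ch.isalpha():
        return "western"
    return "weak"  # spaces, quotes, and any other character


def assign_forward(text, default="western"):
    """Weak characters inherit the class of the preceding strong character."""
    result, current = [], default
    for ch in text:
        cls = classify(ch)
        if cls != "weak":
            current = cls
        result.append((ch, current))
    return result


for sample in ("中文 abc 中文", "中文“abc”中文", "中文‘abc’中文"):
    print([f"{ch}:{cls}" for ch, cls in assign_forward(sample)])
```

Under this forward-only rule the opening bound lands in the Asian run and the closing bound in the Western run, which is the asymmetry the reporter sees.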
Comment 4 Volga 2021-05-02 16:27:51 UTC
(In reply to V Stuart Foote from comment #3)
> Not sure this is incorrect.
> 
> We use ICU libs to change locale for text run at word bounds. Not sure we
> can then look back and change the locale for the opening word bound--here
> U+0020, U+2018 or U+201c--to match the locale assigned to the text run.  Or
> conversely change the closing word bound back when passing into the next run.
There is a solution posted at bug 66791, WDYT?
Comment 5 V Stuart Foote 2021-05-02 16:59:00 UTC
(In reply to Volga from comment #4)
> There is a solution posted at bug 66791, WDYT?

That does not look like a solution, to hardcode typical punctuation to paragraph language or document default language, and bypass ICU handling. But it might be more performant than capturing closing and opening punctuation around embedded different language text runs.
Comment 6 Volga 2021-05-26 15:01:22 UTC
(In reply to Aron Budea from comment #2)
> Already the same in 3.3.0.
There are also some long-standing complaints about this:
https://yongweiwu.wordpress.com/2014/12/18/a-complaint-of-odfs-asian-language-support/
https://ask.libreoffice.org/en/question/19750/problem-with-full-width-asian-punctuation/

(In reply to V Stuart Foote from comment #5)
> That does not look like a solution, to hardcode typical punctuation to
> paragraph language or document default language, and bypass ICU handling.
> But it might be more performant than capturing closing and opening
> punctuation around embedded different language text runs.
If the ODF specification had guidance on this, it would be easy for every developer to decide which locale and font face to assign to such characters. From my investigation, CJK fonts usually render these punctuation marks at the same width as CJK ideographs (i.e. full-width), so more of them need to be considered:
U+2013 (Chinese only), U+2014-16, U+2018-19, U+201C-1D, U+2022, U+2024-27, U+2032-33, U+2035, U+203B.
In any case, we also need to investigate which characters MS Office assigns to CJK fonts in documents with mixed CJK and Western text, especially in the CJK versions of MS Office.
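
As a rough illustration of the list above, the ranges can be expanded into a lookup set and checked against Python's Unicode database. This is only a sketch: `CANDIDATES` and `follows_paragraph_script` are hypothetical names, and the rule that listed punctuation follows the paragraph's script is the idea under discussion, not existing LibreOffice behaviour.

```python
import unicodedata

# Code points from the list above, as inclusive (first, last) ranges.
# U+2013 is listed there as applying to Chinese only.
RANGES = [
    (0x2013, 0x2013), (0x2014, 0x2016), (0x2018, 0x2019),
    (0x201C, 0x201D), (0x2022, 0x2022), (0x2024, 0x2027),
    (0x2032, 0x2033), (0x2035, 0x2035), (0x203B, 0x203B),
]

CANDIDATES = {cp for first, last in RANGES for cp in range(first, last + 1)}


def follows_paragraph_script(ch):
    """Hypothetical rule: listed punctuation takes the paragraph's script."""
    return ord(ch) in CANDIDATES


# Print each candidate with its East Asian Width class and Unicode name.
for cp in sorted(CANDIDATES):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.east_asian_width(ch)} {unicodedata.name(ch)}")
```

The East Asian Width column is printed only to help compare the candidates; the sketch does not depend on its value.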
Comment 11 Volga 2021-07-28 07:55:31 UTC
From my investigation, the Simplified Chinese version of MS Office usually changes the locale language and font face of such characters to match the locale assigned to the text run. However, MS Office usually breaks a sentence into individual words and wraps each one in its own markup (https://www.youtube.com/watch?v=OyxAvhrA-6Y), and LibreOffice may face special cases such as bidirectional text (bug 106306). I therefore suggest fixing this by changing the closing word bound back. Before making the assignment, LibreOffice should also have logic to determine which script/locale is used most in the document.
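
A minimal sketch of this suggestion, under stated assumptions: the script classes are reduced to 'asian', 'western' and 'weak', the CJK test covers only U+4E00..U+9FFF, and all function names are hypothetical. It counts the strong characters to find the dominant script, then gives weak characters that sit between runs of different scripts the dominant one.

```python
from collections import Counter


def classify(ch):
    """Very rough script class: 'asian', 'western', or 'weak' (assumption)."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:  # CJK Unified Ideographs block only
        return "asian"
    if ch.isascii() and ch.isalpha():
        return "western"
    return "weak"


def dominant_script(text, default="western"):
    """Return the most frequent strong class in the text."""
    counts = Counter(cls for cls in map(classify, text) if cls != "weak")
    return counts.most_common(1)[0][0] if counts else default


def assign_with_dominant(text):
    """Weak characters between differing runs take the dominant script."""
    classes = [classify(ch) for ch in text]
    dominant = dominant_script(text)
    # Nearest strong class to the left of each position.
    prev, left = None, []
    for cls in classes:
        left.append(prev)
        if cls != "weak":
            prev = cls
    # Nearest strong class to the right of each position.
    nxt, right = None, []
    for cls in reversed(classes):
        right.append(nxt)
        if cls != "weak":
            nxt = cls
    right.reverse()
    result = []
    for ch, cls, before, after in zip(text, classes, left, right):
        if cls == "weak":
            # Inside a single-script run: keep that script; otherwise dominant.
            cls = before if before == after and before is not None else dominant
        result.append((ch, cls))
    return result


print(assign_with_dominant("中文“abc”中文"))
```

With the sample line from the description, both quotes end up in the Asian class because the line contains more CJK characters than Latin letters. A real fix would also have to handle the bidirectional cases mentioned above, which this sketch ignores.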
Comment 12 Volga 2021-12-23 03:21:02 UTC
If ICU has a parameter to configure this, we can try to make use of it; otherwise we will have to write a patch to reassign the word bounds.
Comment 16 ⁨خالد حسني⁩ 2023-06-23 17:09:56 UTC

*** This bug has been marked as a duplicate of bug 66791 ***
Comment 18 himajin100000 2023-07-01 08:42:12 UTC

*** This bug has been marked as a duplicate of bug 66791 ***