134350 – ICU locale assignment at word bounds for mixed CJK and Western text, wrong assignment for opening and closing the text run

Status:	CLOSED DUPLICATE of bug 66791

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Writer
Version: (earliest affected)	Inherited From OOo
Hardware:	All All

Importance:	medium normal
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	CJK

Reported:	2020-06-28 01:33 UTC by Kevin Suo
Modified:	2023-07-01 10:19 UTC
CC List:	11 users

See Also:	66791
Crash report or crash signature:

Attachments
Chinese and English mixed.odt (10.84 KB, application/vnd.oasis.opendocument.text) 2020-06-28 01:33 UTC, Kevin Suo

Description Kevin Suo 2020-06-28 01:33:14 UTC

Created attachment 162464
Chinese and English mixed.odt

Steps to Reproduce:
1. Copy and paste (or type manually) the following into Writer, as unformatted text:
中文 abc 中文
中文“abc”中文
中文‘abc’中文

You may set the Default Style's Asian font to "Noto Serif CJK SC" and increase the font size, for better identification.

Current Behaviour:
The space or quote on the left side of the English text in each paragraph (i.e., following the Chinese characters) uses the Asian font, while the space or quote on the right side (i.e., following the English characters) uses the Western font.

Expected:
The font used for the left quote and the right quote should be the same. In this case it should be the Chinese font, because the quotes typed in are actually full-width Chinese quotation marks.
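
For reference, the characters involved can be inspected with Python's standard `unicodedata` module. This is only an illustrative sketch and not part of LibreOffice; the helper name `describe` is made up for this example. It shows that the curly quotes have the East Asian Width class "A" (ambiguous), which is consistent with the same code point being drawn by either the Asian or the Western font depending on context.

```python
import unicodedata

# Characters from the reproduction text: a CJK ideograph, a Latin letter,
# and the boundary characters discussed in this report.
samples = ["中", "a", " ", "\u201c", "\u201d", "\u2018", "\u2019"]

def describe(ch):
    """Return (code point, Unicode name, East Asian Width class) for ch."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.east_asian_width(ch),
    )

for ch in samples:
    print(describe(ch))
```

The ideograph reports "W" (wide) and the Latin letter "Na" (narrow), while all four quotation marks report "A".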

Version: 6.3.5.2
Build ID: dd0751754f11728f69b42ee2af66670068624673
CPU threads: 4; OS: Linux 5.6; UI render: default; VCL: gtk3; 
Locale: zh-CN (zh_CN.UTF-8); UI language: zh-CN
Calc: threaded

Fedora 32 x64.

Also reproducible in the current master.


This issue was originally reported in the Chinese LibreOffice discussion forum:
https://bbs.libreofficechina.org/thread-2301-1-1.html

A test ODT document is attached to illustrate this issue.

Comment 1 Xisco Faulí 2020-07-13 14:37:25 UTC

@Regina, I thought you might be interested in this issue

Comment 2 Aron Budea 2020-11-25 18:15:56 UTC

Already the same in 3.3.0.

Comment 3 V Stuart Foote 2020-11-25 19:39:21 UTC

Not sure this is incorrect.

We use ICU libs to change locale for text run at word bounds. Not sure we can then look back and change the locale for the opening word bound--here U+0020, U+2018 or U+201c--to match the locale assigned to the text run.  Or conversely change the closing word bound back when passing into the next run.
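
The behaviour described here can be modelled with a small sketch. This is a deliberately simplified model and an assumption on my part, not LibreOffice's or ICU's actual code: the names `script_class` and `assign_runs` are hypothetical, only the basic CJK Unified Ideographs block and ASCII letters are classified, and every other character is treated as "weak" and simply inherits the script of the preceding character.

```python
def script_class(ch):
    """Rough classification: 'asian', 'western', or 'weak' (no script of its own)."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:          # CJK Unified Ideographs (basic block only)
        return "asian"
    if ch.isascii() and ch.isalpha():   # ASCII letters
        return "western"
    return "weak"                       # spaces, quotes, other punctuation

def assign_runs(text, default="western"):
    """Give each character a script; weak characters inherit the preceding one."""
    result = []
    current = default
    for ch in text:
        cls = script_class(ch)
        if cls != "weak":
            current = cls
        result.append((ch, current))
    return result

for ch, script in assign_runs("中文\u201cabc\u201d中文"):
    print(repr(ch), script)
```

With this forward-only rule the opening quote ends up in the Asian run and the closing quote in the Western run, which matches what the reporter sees. Treating both quotes alike would need either a look-back at the opening bound or a correction at the closing bound.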

Comment 4 Volga 2021-05-02 16:27:51 UTC

(In reply to V Stuart Foote from comment #3)
> Not sure this is incorrect.
> 
> We use ICU libs to change locale for text run at word bounds. Not sure we
> can then look back and change the locale for the opening word bound--here
> U+0020, U+2018 or U+201c--to match the locale assigned to the text run.  Or
> conversely change the closing word bound back when passing into the next run.
There is a solution posted at bug 66791, WDYT?

Comment 5 V Stuart Foote 2021-05-02 16:59:00 UTC

(In reply to Volga from comment #4)
> (In reply to V Stuart Foote from comment #3)
> > Not sure this is incorrect.
> > 
> > We use ICU libs to change locale for text run at word bounds. Not sure we
> > can then look back and change the locale for the opening word bound--here
> > U+0020, U+2018 or U+201c--to match the locale assigned to the text run.  Or
> > conversely change the closing word bound back when passing into the next run.
> There is a solution posted at bug 66791, WDYT?

That does not look like a solution, to hardcode typical punctuation to paragraph language or document default language, and bypass ICU handling. But it might be more performant than capturing closing and opening punctuation around embedded different language text runs.

Comment 6 Volga 2021-05-26 15:01:22 UTC

(In reply to Aron Budea from comment #2)
> Already the same in 3.3.0.
There are also some long-standing complaints about this:
https://yongweiwu.wordpress.com/2014/12/18/a-complaint-of-odfs-asian-language-support/
https://ask.libreoffice.org/en/question/19750/problem-with-full-width-asian-punctuation/

(In reply to V Stuart Foote from comment #5)
> (In reply to Volga from comment #4)
> > (In reply to V Stuart Foote from comment #3)
> > > Not sure this is incorrect.
> > > 
> > > We use ICU libs to change locale for text run at word bounds. Not sure we
> > > can then look back and change the locale for the opening word bound--here
> > > U+0020, U+2018 or U+201c--to match the locale assigned to the text run.  Or
> > > conversely change the closing word bound back when passing into the next run.
> > There is a solution posted at bug 66791, WDYT?
> 
> That does not look like a solution, to hardcode typical punctuation to
> paragraph language or document default language, and bypass ICU handling.
> But it might be more performant than capturing closing and opening
> punctuation around embedded different language text runs.
If the ODF specification gave instructions for this, it would be easy for every developer to decide which locale font face to use for such characters. From my investigation, CJK fonts usually give these punctuation marks the same width as CJK ideographs (i.e. full-width), so more of them need to be considered:
U+2013 (Chinese only), U+2014-16, U+2018-19, U+201C-1D, U+2022, U+2024-27, U+2032-33, U+2035, U+203B.
In any case, we also need to investigate which characters MS Office assigns to CJK fonts in documents with mixed CJK and Western text, especially in the CJK versions of MS Office.
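
To make the list above concrete, the abbreviated ranges can be expanded into an explicit set of code points. This is just a sketch for checking the list; `SPEC`, `expand` and `CJK_PUNCTUATION` are names invented for this example, and the "Chinese only" note on U+2013 is not modelled.

```python
import re

# The list from this comment, with abbreviated range ends such as "U+2014-16".
SPEC = ("U+2013, U+2014-16, U+2018-19, U+201C-1D, U+2022, "
        "U+2024-27, U+2032-33, U+2035, U+203B")

def expand(spec):
    """Expand entries like 'U+2014-16' into a set of integer code points."""
    points = set()
    pattern = r"U\+([0-9A-Fa-f]+)(?:-([0-9A-Fa-f]+))?"
    for start_hex, end_suffix in re.findall(pattern, spec):
        start = int(start_hex, 16)
        if end_suffix:
            # The range end replaces the last len(end_suffix) hex digits of the start.
            end = int(start_hex[:-len(end_suffix)] + end_suffix, 16)
        else:
            end = start
        points.update(range(start, end + 1))
    return points

CJK_PUNCTUATION = expand(SPEC)

for cp in sorted(CJK_PUNCTUATION):
    print(f"U+{cp:04X}", chr(cp))
```

The expansion yields 17 code points in total, including both pairs of curly quotes from the reproduction steps.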

Comment 11 Volga 2021-07-28 07:55:31 UTC

From my investigation, the Simplified Chinese version of MS Office usually changes the locale language and font face of such characters to match the locale assigned to the text run. However, MS Office usually breaks a sentence into individual words and puts each of them into its own markup (https://www.youtube.com/watch?v=OyxAvhrA-6Y), and LibreOffice may face special cases such as bi-directional text (bug 106306). So I suggest fixing this by changing the closing word bound back, and, before the assignment, LibreOffice should also have logic to judge which script/locale is used most in a document.
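
The "which script is used most" check could look roughly like the following. This is a minimal sketch under my own assumptions, not existing LibreOffice code: the function name `dominant_script` is hypothetical, only basic CJK ideographs and ASCII letters are counted, and spaces, digits and punctuation are ignored.

```python
from collections import Counter

def dominant_script(text):
    """Return 'asian' or 'western', whichever has more letters, or None if neither occurs."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        if 0x4E00 <= cp <= 0x9FFF:            # CJK Unified Ideographs (basic block only)
            counts["asian"] += 1
        elif ch.isascii() and ch.isalpha():   # ASCII letters
            counts["western"] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]

print(dominant_script("中文 abc 中文"))
```

For the first line of the reproduction text there are four ideographs against three Latin letters, so the ambiguous punctuation would follow the Asian font under this rule. A real implementation would also have to decide how to handle ties and other scripts.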

Comment 12 Volga 2021-12-23 03:21:02 UTC

If ICU has a parameter to configure this, we can try to make use of it; otherwise we will have to write a patch to reassign the word bounds.

Comment 13 Volga 2022-01-22 04:42:20 UTC Comment hidden (no-value)

My suggestion: any word surrounded by "" «» () [] {}（）［］｛｝“”‘’〈〉《》〔〕【】〘〙「」『』︵︶︷︸︹︺︻︼︽︾︿﹀﹁﹂ should have its word boundaries inside these pairs of characters, to better avoid this problem.
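
One possible reading of this suggestion is to locate matching pairs first, so that the opening and closing characters of a pair can be given the same treatment. The sketch below is an assumption of mine and not an existing API: `PAIRS` and `matched_pairs` are hypothetical names, only a few of the pairs listed above are included, and straight quotes are left out because their opening and closing characters are identical.

```python
# A small subset of the pairs listed above, mapping opener to closer.
PAIRS = {
    "\u201c": "\u201d",   # “ ”
    "\u2018": "\u2019",   # ‘ ’
    "(": ")",
    "[": "]",
    "{": "}",
    "\u300c": "\u300d",   # 「 」
    "\u300a": "\u300b",   # 《 》
}
CLOSERS = set(PAIRS.values())

def matched_pairs(text):
    """Return (open_index, close_index) for each properly nested pair in text."""
    stack = []      # pending openers as (index, expected closer)
    result = []
    for i, ch in enumerate(text):
        if ch in PAIRS:
            stack.append((i, PAIRS[ch]))
        elif ch in CLOSERS and stack and stack[-1][1] == ch:
            open_index, _ = stack.pop()
            result.append((open_index, i))
    return result

print(matched_pairs("中文\u201cabc\u201d中文"))
```

Once the two indices of a pair are known, both characters could be assigned to the same run, whichever side is chosen.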

Comment 16 Khaled Hosny 2023-06-23 17:09:56 UTC


*** This bug has been marked as a duplicate of bug 66791 ***

Comment 18 himajin100000 2023-07-01 08:42:12 UTC


*** This bug has been marked as a duplicate of bug 66791 ***