Bug 66791 - FORMATTING: Incorrect application of "Asian text font" for quotation marks when the paragraph contains a mixture of western and asian characters
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): Inherited From OOo
Hardware: All All
Importance: medium normal
Assignee: Volga
URL:
Whiteboard: target:25.8.0 target:25.2.0.2 inRelea...
Keywords:
Duplicates: 101751 124657 126387 134350
Depends on:
Blocks: CJK Language-Detection
 
Reported: 2013-07-10 19:16 UTC by simonjwiles
Modified: 2025-01-14 07:57 UTC
CC List: 16 users

See Also:
Crash report or crash signature:


Attachments
Screenshot (6.99 KB, image/png)
2013-07-10 19:16 UTC, simonjwiles
test cases of English and Chinese quotes (58.94 KB, application/vnd.oasis.opendocument.text)
2013-09-03 03:14 UTC, Kevin Suo
screenshot_including_complex_text_layout (95.42 KB, image/png)
2017-10-29 15:03 UTC, Hiunn-hué
Screenshot on WordPad (30.80 KB, image/png)
2023-06-29 20:53 UTC, Volga
The same file opened with LibreOffice Writer (17.65 KB, image/png)
2023-06-29 20:59 UTC, Volga
Change illustration (304.23 KB, image/png)
2024-12-17 10:17 UTC, Jonathan Clark
Screenshot after the last commit (51.64 KB, image/png)
2025-01-01 13:13 UTC, Volga
Another screenshot after the last commit (53.07 KB, image/png)
2025-01-11 19:51 UTC, Volga
Test cases for variation selectors (12.92 KB, application/vnd.oasis.opendocument.text)
2025-01-14 07:21 UTC, Volga

Description simonjwiles 2013-07-10 19:16:15 UTC
Created attachment 82294 [details]
Screenshot

If I have an East-Asian character in my (predominantly English) document, followed by a quotation mark (opening or closing), the quotation mark takes the font settings from the "Asian text font" section of the style definition.  This results in very ugly copy.


Steps to reproduce:
1. Type some western text into LO Writer, surrounded by quotation marks (e.g. "sun").
2. Move the cursor to before the opening quotation mark, and type (or paste -- the IME is not relevant) an East-Asian character (e.g. 日).


Current behaviour:
The initial quotation mark takes the settings from "Asian text font" instead of "Western text font".  The behaviour is the same if a (normal-width, western) space comes between the East-Asian character and the opening quotation mark.


Expected behaviour:
The opening quotation mark, being surrounded by a normal-width space on one side, and a Latin letter ("s" in this case) on the other, should take the "Western text font" settings.


The only way to "work-around" this problem is to select the characters that have been rendered incorrectly and manually force the application of the "Western text font" settings.  Of course, this breaks if "Clear Direct Formatting" is used.

It's not clear to me why typing an opening quotation mark immediately after an East-Asian character results in the insertion of Asian punctuation (e.g. 「 or 『).  If I wanted Asian punctuation, I would, of course, type Asian punctuation.  I don't know if this is connected.


ask.libreoffice.org link: http://ask.libreoffice.org/en/question/19750/problem-with-full-width-asian-punctuation/

May perhaps be linked to this bug: https://bugs.freedesktop.org/show_bug.cgi?id=60106


I'm currently using LO Version 4.0.4.2 (Build ID: 400m0(Build:2)) on Linux Mint 14 amd64, but the problem has been around as long as I can remember and on every platform I've tried.
Comment 1 Kevin Suo 2013-09-03 03:14:09 UTC
Created attachment 85096 [details]
test cases of English and Chinese quotes

I confirm this bug in LibreOffice 4.0.5.2 and 4.1.1.2.

I ran some tests in the attached file; see the highlighted parts. Quotes are rendered incorrectly when they are on the first line or follow text in a different language.

When "double quotes replacement" is disabled in the autocorrection options, everything is OK, so it's a replacement problem.
Comment 2 Kevin Suo 2014-06-25 06:24:35 UTC
Today I tested attachment 85096 [details] in 4.3.0.1, and it seems to be getting worse.
All opening quotes at the beginning of a paragraph are now always shown as "half-width", regardless of whether the following characters are Western or Asian.
Comment 3 QA Administrators 2015-07-18 17:43:52 UTC Comment hidden (obsolete)
Comment 4 simonjwiles 2015-07-18 17:57:48 UTC
I can confirm this bug is still present:

Version: 4.4.4.3
Build ID: 40m0(Build:3)
Locale: en_GB.UTF-8

(LO from "LibreOffice Fresh" PPA, on Linux Mint 17.2 (package base == Trusty).)
Comment 5 QA Administrators 2016-09-20 10:18:03 UTC Comment hidden (obsolete)
Comment 6 Volga 2016-12-12 02:55:27 UTC
I think using “East Asian text font” is more suitable.
Comment 7 tommy27 2017-02-18 15:08:08 UTC
(In reply to Volga from comment #6)
> I think using “East Asian text font” is more suitable.

@simon
does this help?
Comment 8 Eric Ding 2017-06-23 03:07:33 UTC
Four years after the initial report, this bug still exists in LibreOffice 5.3.4 (running on Windows) with a mix of East Asian (CJK) and non-CJK fonts and text.
Comment 9 Hiunn-hué 2017-10-29 15:03:36 UTC
Created attachment 137355 [details]
screenshot_including_complex_text_layout

This also happens to languages like Thai (Complex text layout), please see attached PNG file.

It's actually quite annoying ...

--
Version: 6.0.0.0.alpha1+
Build ID: 81d50fd137fdf712a0f37988217c43278cf24c26
CPU threads: 4; OS: Linux 4.4; UI render: default; VCL: gtk2; 
TinderBox: Linux-rpm_deb-x86_64@70-TDF, Branch:master, Time: 2017-10-28_00:31:27
Locale: zh-TW (zh_TW.UTF-8); Calc: group
--
Comment 10 QA Administrators 2018-11-01 03:52:22 UTC Comment hidden (obsolete)
Comment 11 Eric Ding 2018-11-07 06:12:28 UTC
I confirm that this bug is still present in:

Version: 6.1.3.2
Build ID: 86daf60bf00efa86ad547e59e09d6bb77c699acb
CPU threads: 4; OS: Linux 4.15; UI render: default; VCL: gtk2; 
Locale: en-US (en_US.UTF-8); Calc: group threaded
Comment 12 Volga 2019-05-12 03:12:03 UTC
(In reply to tommy27 from comment #7)
> (In reply to Volga from comment #6)
> > I think using “East Asian text font” is more suitable.
> 
> @simon
> does this help?
Oh, that was a misunderstanding on my part; I only meant that it is a more appropriate name.
Comment 13 Volga 2019-05-12 03:23:53 UTC Comment hidden (obsolete)
Comment 14 Volga 2019-05-13 03:46:10 UTC Comment hidden (obsolete)
Comment 15 Volga 2019-05-13 07:11:32 UTC
*** Bug 124657 has been marked as a duplicate of this bug. ***
Comment 16 Volga 2019-07-15 11:21:03 UTC
*** Bug 126387 has been marked as a duplicate of this bug. ***
Comment 17 Volga 2019-07-16 13:35:22 UTC
Does anyone have an idea for this?
Comment 18 Liaison to zh-CN User Community 2019-07-28 07:18:40 UTC
The core issue of this bug, IMHO, is that curly double quotation marks (U+201C and U+201D) are widely used in both English and (simplified) Chinese, so LO has no way to know which style (western or Asian) it should apply to these quotation marks, and has to rely on context.

There are potentially more characters that cause such problems, the most obvious being single quotation marks.  But I've also seen the middle dot (U+00B7) and em dash (U+2014) with similar problems.

The quotation marks are especially visible because the current bug makes them asymmetrical, which causes considerable visual discomfort.  So the obvious brute-force solution is that instead of determining their style according to context, LO can just make sure the quotation marks consistently use the same style, either through some language/locale setting as comment 14 mentioned, or as a special setting that can be changed by the user.  In other words, treat quotation marks differently from the other characters.
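
As a rough illustration of why these characters are hard to classify, the sketch below uses Python's standard `unicodedata` module to look up the East Asian Width property of the characters named in this comment. This is only a related Unicode property; whether Writer consults it is not stated in this report, and the helper name `is_width_ambiguous` is hypothetical.

```python
import unicodedata

# Characters mentioned in this comment: single and double curly quotes,
# middle dot, and em dash.
CANDIDATES = "\u2018\u2019\u201c\u201d\u00b7\u2014"


def is_width_ambiguous(ch):
    """Return True if Unicode marks the character's East Asian Width as Ambiguous."""
    return unicodedata.east_asian_width(ch) == "A"


for ch in CANDIDATES:
    # Print the code point, its Unicode name, and its East Asian Width class.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {unicodedata.east_asian_width(ch)}")
```

Characters in the Ambiguous class are drawn narrow in Western contexts and wide in East Asian ones, so the character alone does not say which font it should take.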
Comment 19 Kevin Suo 2022-11-30 14:58:18 UTC
*** Bug 101751 has been marked as a duplicate of this bug. ***
Comment 20 Volga 2023-06-22 11:40:18 UTC
Mr. Khaled, what do you think of this?
Comment 21 ⁨خالد حسني⁩ 2023-06-23 17:09:37 UTC
(In reply to Volga from comment #20)
> Mr. Khaled, what do you think of this?

I checked MS Word, and it seems to treat the quotation marks as western text unless their language is set to Chinese, then it treats them as Asian text regardless of the context.

This seems simpler and more reliable than what we currently do. I wonder if it does this for all punctuation characters?

It feels less smart, though. The smart, and more Unicode-compliant, way is to try to resolve common characters based on context like we do now, except that our implementation is buggy.

I’m not sure which is the better way, to be honest, as either option has compatibility considerations (either with older LO versions if we go MS way, or both if we fix our current way).

I’m not sure who should decide this.
Comment 22 ⁨خالد حسني⁩ 2023-06-23 17:09:56 UTC
*** Bug 134350 has been marked as a duplicate of this bug. ***
Comment 23 Volga 2023-06-24 07:01:31 UTC
Someone complained about this a long time ago:
https://yongweiwu.wordpress.com/2014/12/18/a-complaint-of-odfs-asian-language-support/
Although MS Word sets a good example here, I believe implementing smart rules for font assignment would be the better choice. That way LibreOffice could assign a font face to such punctuation that matches the predominant language/locale, without breaking the text style or file structure.
Comment 24 Volga 2023-06-29 20:53:26 UTC
Created attachment 188133 [details]
Screenshot on WordPad

Following the last comment, I found this test file by the blog author:
https://yongweiwu.files.wordpress.com/2014/12/odf_test.odt
Then I remembered WordPad, the word processor built into Windows, so let's see what happens in WordPad.
Comment 25 Volga 2023-06-29 20:59:49 UTC
Created attachment 188134 [details]
The same file opened with LibreOffice Writer

This screenshot was taken after opening the same file in LibreOffice Writer. Note that both apps were running in the zh-CN locale. So Khaled, what happens if you open this ODT in WordPad or MS Word?
Comment 26 himajin100000 2023-07-01 08:42:12 UTC
*** Bug 134350 has been marked as a duplicate of this bug. ***
Comment 27 Volga 2023-07-21 16:13:25 UTC
Have you checked Windows WordPad so far?
Comment 28 Volga 2023-07-28 17:43:45 UTC
Judging from commit d6efe8c302b81886706e18640148c51cf7883bbf, I think this bug could be fixed by assigning the font face of such punctuation depending on the surrounding text.

For characters that could be affected by this bug, see:
https://www.w3.org/International/clreq/#tables_of_chinese_punctuation_marks
https://www.w3.org/International/jlreq/#cl-01
https://www.w3.org/International/klreq/#chars-grouping
Comment 29 Kevin Suo 2023-12-08 01:24:16 UTC
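See these related articles:

中西文混合排版中标点符号的渲染 (Rendering of punctuation marks in mixed Chinese and Western typesetting) https://blog.1a23.com/2020/06/28/zhong-xi-wen-hunhe-paiban-zhong-biaodian-fuhao-de-xuanran/

中英混排中的标点符号问题 (The punctuation problem in mixed Chinese and English typesetting) https://www.hutrua.com/blog/2018/07/22/punctuation.html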
Comment 30 Volga 2024-09-27 17:48:51 UTC Comment hidden (no-value)
Comment 31 Volga 2024-09-27 17:52:29 UTC Comment hidden (obsolete)
Comment 32 Jonathan Clark 2024-11-25 18:52:53 UTC
This bug is due to the greedy algorithm we use to assign script types to weakly-associated characters. It does not properly handle punctuation.

The current algorithm works something like this:

- First, any weak characters at the start of a paragraph are assigned to the same script as the first strong character in the paragraph.
- Then, the paragraph is scanned in reading order. Weak characters are assigned to the previously-seen script, with a few hard-coded exceptions (e.g. bug 112594).
- Finally, we run the Unicode bidi algorithm, and reassign all right-to-left text to the complex script type.

The last step hides the depth of the problem. The Unicode bidi algorithm accounts for nested punctuation, so the output seems correct-but-buggy for RTL languages (while not working at all for other language pairs).


In my opinion, we should replace the current algorithm with one that extends the RTL behavior to all languages. Existing RTL documents depend on the current behavior, and impacted CJK documents likely already include manual formatting to achieve the same effect, so this seems like the least-disruptive option.
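
The first two steps described in this comment can be sketched roughly as follows. This is a simplified illustration, not LibreOffice's code: the bidi step and the hard-coded exceptions are left out, and the `classify` helper with its code point ranges is a crude assumption that lumps every non-CJ letter together as "western".

```python
import unicodedata

ASIAN, WESTERN, WEAK = "asian", "western", "weak"


def classify(ch):
    """Crude script guess (assumption): kana and CJK ideographs are Asian,
    other letters are Western, everything else is weak."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF:
        return ASIAN
    if unicodedata.category(ch).startswith("L"):
        return WESTERN
    return WEAK


def greedy_assign(paragraph):
    """Assign a script to every character using the greedy rules above."""
    kinds = [classify(ch) for ch in paragraph]
    # Step 1: leading weak characters take the first strong script.
    current = next((k for k in kinds if k != WEAK), WESTERN)
    assigned = []
    # Step 2: scan in reading order; weak characters keep the previous script.
    for kind in kinds:
        if kind != WEAK:
            current = kind
        assigned.append(current)
    return assigned


text = "日 \u201csun\u201d"
for ch, script in zip(text, greedy_assign(text)):
    print(repr(ch), script)
```

With this input the opening quotation mark inherits the Asian script from 日 while the closing one follows the Latin letters, which is the mismatched pair described in the original report.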
Comment 33 Commit Notification 2024-12-16 21:46:59 UTC
Jonathan Clark committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/537645c0834eab2d277113f1e3fcf039c994832d

tdf#66791 sw: Treat weak punctuation as Asian in Asian paragraphs

It will be available in 25.8.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 34 Jonathan Clark 2024-12-17 10:17:24 UTC
Created attachment 198154 [details]
Change illustration

Screenshots comparing Word to LO, both with and without this patch. Blue-highlighted quotation marks have the Asian script type. Orange-highlighted quotation marks have the Complex script type.
Comment 35 Jonathan Clark 2024-12-17 12:00:53 UTC
While investigating this bug, I found that our script assignment implementation is broadly similar to (and shares many problems with) the algorithm used by Microsoft Word. The main difference is that Word treats certain punctuation characters as Asian script group when used in paragraphs containing CJ characters. Rather than risk compatibility, I applied a similar heuristic to our implementation. I also restructured the code so it will be easier to make changes in the future.

This fix is narrow and sub-optimal. It's not possible to write an algorithm that perfectly assigns characters to script groups. The ideal solution is to let users specify language manually, and this is tracked by bug 151290.
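
A minimal sketch of the kind of heuristic described here, under stated assumptions: the set of affected punctuation, the code point ranges used to detect CJ characters, and the function names are all hypothetical, since the thread does not list what the patch or Word actually uses.

```python
# Assumption: only the four curly quotation marks are treated specially.
QUOTES = frozenset("\u2018\u2019\u201c\u201d")


def is_cj(ch):
    """Rough test for Chinese/Japanese characters (kana and CJK ideographs only)."""
    cp = ord(ch)
    return 0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF


def quote_scripts(paragraph):
    """Return (quote, script) pairs: Asian if the paragraph contains any CJ
    character, Western otherwise."""
    script = "asian" if any(is_cj(ch) for ch in paragraph) else "western"
    return [(ch, script) for ch in paragraph if ch in QUOTES]


print(quote_scripts("日 \u201csun\u201d"))
print(quote_scripts("\u201csun\u201d"))
```

Because the decision is made once per paragraph, both marks of a pair always receive the same script, at the cost of ignoring the local context of each mark.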
Comment 36 Commit Notification 2024-12-24 10:58:11 UTC
Jonathan Clark committed a patch related to this issue.
It has been pushed to "libreoffice-25-2":

https://git.libreoffice.org/core/commit/73a96633d672f344f6415f050405b19174031f37

tdf#66791 sw: Treat weak punctuation as Asian in Asian paragraphs

It will be available in 25.2.0.2.

Comment 37 Volga 2025-01-01 13:13:37 UTC
Created attachment 198343 [details]
Screenshot after the last commit

The problem still occurs if a line starts with “ (U+201C, LEFT DOUBLE QUOTATION MARK).

Version: 25.2.0.1.0+ (X86_64) / LibreOffice Community
Build ID: 16b35a9ea05c9a1a566baf502236b45cfd628d11
CPU threads: 4; OS: Windows 10 X86_64 (10.0 build 19045); UI render: default; VCL: win
Locale: zh-CN (zh_CN); UI: zh-CN
Calc: threaded
Test file: comment 23
Comment 38 Commit Notification 2025-01-06 12:45:36 UTC
Jonathan Clark committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/9b505f583954c88ce7b72a07c9bfd65d78d863ef

tdf#66791 sw: Apply first-seen script type to leading weak characters

It will be available in 25.8.0.

Comment 39 Jonathan Clark 2025-01-06 12:49:01 UTC
(In reply to Volga from comment #37)
> The problem still occurs if a line starts with “ (U+201C, LEFT
> DOUBLE QUOTATION MARK).

This was intentional, to avoid regressing an earlier fix (#94331#). However, while looking up that bug number for this post, I found a code comment deleted 24 years ago explaining that the current behavior was meant to be temporary.
Comment 40 Commit Notification 2025-01-08 08:40:24 UTC
Jonathan Clark committed a patch related to this issue.
It has been pushed to "libreoffice-25-2":

https://git.libreoffice.org/core/commit/e4b74e8bc282d0fd396265ec893491b0bebe576d

tdf#66791 sw: Apply first-seen script type to leading weak characters

It will be available in 25.2.0.2.

Comment 41 Volga 2025-01-11 19:51:00 UTC
Created attachment 198496 [details]
Another screenshot after the last commit

I found another problem after the last commit. In this case, Chinese quotation marks are used inside English brackets.

Version: 25.2.0.1.0+ (X86_64) / LibreOffice Community
Build ID: 5acb7648c3eff7371385df442a627768762a7aa6
CPU threads: 4; OS: Windows 10 X86_64 (10.0 build 19045); UI render: default; VCL: win
Locale: zh-CN (zh_CN); UI: zh-CN
Calc: threaded
Test file: https://bz.apache.org/ooo/attachment.cgi?id=81108 (also from comment 23)
Comment 42 Volga 2025-01-11 19:53:01 UTC
I think additional rules are needed for text within brackets.
Comment 43 Jonathan Clark 2025-01-13 11:20:48 UTC
Resetting to fixed.

(In reply to Volga from comment #41)
> I found another problem after the last commit. In this case, Chinese
> quotation marks are used inside English brackets.

Correctness in script assignment is subjective. No algorithm, or even human editor, can perfectly reconstruct authorial intent from raw text.

The current algorithm guarantees matching pairs of quotation marks have the same font, which was the most distracting part of this bug. It also makes LibreOffice behave more like other office suites (which don't handle the parenthesized text case shown in this screenshot, either).

In my opinion, the current state is good enough to consider this bug fixed.

Instead of adding more complex language processing, we should fix bug 151290. This would give users/documents more control. We could also use that feature to handle cases like this example at proofing time, and not risk changing the behavior of existing documents from version-to-version.
Comment 44 Volga 2025-01-14 02:31:48 UTC
I think there should be special exceptions for variation selectors: when these quotation marks are followed by VS01, they should always be rendered with the Western text font, and when they are followed by VS02, they should always be rendered with the Asian text font.

See: https://www.unicode.org/charts/PDF/Unicode-16.0/U160-2000.pdf
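
The proposed rule could look something like the sketch below. It is only an illustration of the suggestion in this comment: the function name `script_override` and the set of affected characters are hypothetical, and nothing here reflects existing LibreOffice behavior.

```python
VS1 = "\ufe00"  # VARIATION SELECTOR-1 (VS01)
VS2 = "\ufe01"  # VARIATION SELECTOR-2 (VS02)

# Assumption: the rule applies to the four curly quotation marks.
QUOTES = frozenset("\u2018\u2019\u201c\u201d")


def script_override(text, index):
    """Return 'western' or 'asian' if the quotation mark at `index` is followed
    by VS01 or VS02 respectively; return None if no override applies."""
    if text[index] not in QUOTES or index + 1 >= len(text):
        return None
    following = text[index + 1]
    if following == VS1:
        return "western"
    if following == VS2:
        return "asian"
    return None


print(script_override("\u201c\ufe00sun\u201d", 0))
print(script_override("\u201c\ufe01日\u201d", 0))
```

A `None` result would leave the decision to whatever context-based assignment is already in place.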
Comment 45 Volga 2025-01-14 07:21:17 UTC
Created attachment 198523 [details]
Test cases for variation selectors

These test cases accompany the comment above.
Comment 46 Ming Hua 2025-01-14 07:26:10 UTC
(In reply to Volga from comment #44)
> I think there should be special exceptions for variation selectors
That would be better discussed in a new, separate bug report.

I feel it is very impolite to repeatedly reopen a bug that you didn't report yourself, when the developer considers it fixed and has marked it so.
Comment 47 Volga 2025-01-14 07:42:07 UTC
OK, I'm sorry, so let's go to bug 164700.