Bug 163814 - when transforming the simplified chinese word '艺术' to traditional one, the result is wrong.
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): Inherited From OOo
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: CJK
Reported: 2024-11-08 04:50 UTC by brandos
Modified: 2024-11-11 16:11 UTC (History)

See Also:
Crash report or crash signature:


Description brandos 2024-11-08 04:50:24 UTC
Description:
When converting the Simplified Chinese word '艺术' to Traditional Chinese in LO Writer, the result is '藝朮', in which 朮 is wrong. The correct result is 藝術.

Steps to Reproduce:
1. Type 艺术.
2. Apply the Chinese traditional/simplified conversion tool (maybe not the exact wording; I'm using the Chinese version, where it is called "中文简繁转换").
3. You get '藝朮'.

Actual Results:
You get '藝朮'.

Expected Results:
should be '藝術'


Reproducible: Always


User Profile Reset: No

Additional Info:
Version: 24.8.2.1 (X86_64) / LibreOffice Community
Build ID: 0f794b6e29741098670a3b95d60478a65d05ef13
CPU threads: 16; OS: Windows 11 X86_64 (10.0 build 26100); UI render: Skia/Vulkan; VCL: win
Locale: zh-CN (zh_CN); UI: zh-CN
Calc: threaded
Comment 1 Ming Hua 2024-11-08 05:04:57 UTC
Reproduced with 24.2.7:
Version: 24.2.7.2 (X86_64) / LibreOffice Community
Build ID: ee3885777aa7032db5a9b65deec9457448a91162
CPU threads: 12; OS: Windows 10.0 Build 22631; UI render: Skia/Raster; VCL: win
Locale: zh-CN (zh_CN); UI: en-US
Calc: CL threaded

As a native speaker (from mainland China, so zh-CN), I agree with the reporter that the conversion of "艺术" to "藝朮" is wrong; it should be "藝術".

However I don't think this will be fixed anytime soon.  It's better to use a dedicated tool for simplified-to-traditional conversion.  I personally recommend OpenCC: https://github.com/BYVoid/OpenCC
Comment 2 Mike Kaganski 2024-11-08 05:33:31 UTC
As explained in the help [1]:

> Common Terms
> ...
> Translate common terms
> Converts words with two or more characters that are in the list of common terms.
> After the list is scanned, the remaining text is converted character by character.

The "Translate common terms" option is unchecked by default, and its list is empty (I have all the dictionaries installed on my system, if that matters). The conversion may consult the terms list, but its main mode is "character by character".
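
The two-stage behaviour described in the help text can be pictured with a minimal Python sketch. The tables and the `convert` function are hypothetical illustrations, not LibreOffice's data or code, and the greedy longest-match lookup is only an assumption about how a terms list might be applied before the per-character fallback.

```python
# Hypothetical tables for illustration only; not LibreOffice's actual data.
CHAR_TABLE = {"艺": "藝", "术": "朮"}  # per-character fallback (reproduces the reported result)
TERM_TABLE = {"艺术": "藝術"}          # multi-character "common terms"


def convert(text, terms=None, chars=CHAR_TABLE):
    """Convert text using a terms list first, then character by character."""
    terms = terms or {}
    max_len = max((len(t) for t in terms), default=0)
    out = []
    i = 0
    while i < len(text):
        # Try the longest term (two or more characters) starting at position i.
        for n in range(min(max_len, len(text) - i), 1, -1):
            piece = text[i:i + n]
            if piece in terms:
                out.append(terms[piece])
                i += n
                break
        else:
            # No term matched: fall back to the single-character table.
            out.append(chars.get(text[i], text[i]))
            i += 1
    return "".join(out)
```

With an empty terms list, `convert("艺术")` falls through to the character table for both characters; passing `TERM_TABLE` lets the two-character entry take precedence.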

The fix (which *may* happen anytime soon, *iff* interested people who know the language engage) would be to create a default "common terms" list (similar to the default "autocorrect" list), and then perhaps enable the "Translate common terms" checkbox by default.

The code pointer for the "Translate common terms" enabled/disabled setting is IsTranslateCommonTerms in https://opengrok.libreoffice.org/xref/core/officecfg/registry/schema/org/openoffice/Office/Linguistic.xcs?r=0264999b&mo=11382&fi=274#274.

The dictionaries seem to be initialized by https://opengrok.libreoffice.org/xref/core/linguistic/source/convdiclist.cxx?r=6f1508f4#351 (using names "ChineseS2T" and "ChineseT2S").

[1] https://help.libreoffice.org/24.8/en-US/text/shared/01/06010600.html?DbPAR=WRITER
Comment 3 Ming Hua 2024-11-08 06:18:46 UTC
Hi Mike,

(In reply to Mike Kaganski from comment #2)
> (that *may* be fixed anytime soon, *iff* the interested people
> knowing the language would engage)
I am actually interested in engaging, but my programming skills are a bit lacking (I have never programmed in C++), and I don't have much time for Open Source right now (for me it's a pure hobby).

So bear with me while I try to explain the details and the difficulty of fixing this bug.  I hope more non-CJK developers can understand this issue better and share their insights.

The simplified-to-traditional Chinese conversion, even at the character level and not taking terms (it's actually closer to words and phrases, instead of terms in special areas, but I digress) into consideration, is not a simple matter.

When the mainland Chinese government carried out the simplification in the 1950s, there were many cases in which multiple traditional characters were simplified to a single character.  In the example reported here, both "術" (U+8853) [1] and "朮" (U+672E) [2] are simplified/standardized as "术" (U+672F) [3], and U+8853 is actually much more commonly used than U+672E, the word "艺术/藝術" (meaning art/artwork) being an example.

For some reason (my guess is the similarity of glyph shape), LibreOffice (or the conversion table LO uses) chose U+672E instead of U+8853 when doing the reverse one-to-many conversion, and so ends up with the wrong character most of the time.
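
To make the one-to-many problem concrete, here is a small Python sketch built only from the three code points mentioned above. The `invert` helper and the table names are hypothetical, and LO's real conversion data is not modeled here.

```python
# Traditional -> simplified pairs taken from the explanation above.
T2S = {
    "\u8853": "\u672f",  # 術 -> 术
    "\u672e": "\u672f",  # 朮 -> 术
}


def invert(t2s):
    """Build a simplified -> list-of-traditional-candidates table."""
    s2t = {}
    for trad, simp in t2s.items():
        s2t.setdefault(simp, []).append(trad)
    return s2t


S2T_CANDIDATES = invert(T2S)

# A plain character table has to pick exactly one candidate per simplified
# character, so any word that needs the other candidate comes out wrong.
for simp, candidates in S2T_CANDIDATES.items():
    print(f"U+{ord(simp):04X} <- " + ", ".join(f"U+{ord(c):04X}" for c in candidates))
```

The traditional-to-simplified direction is unambiguous, but inverting it leaves two candidates for U+672F, which is why a per-character table alone cannot be right for every word.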

With Mike's pointer, I can write a patch fixing the specific mis-conversion reported here, but there are probably dozens, if not hundreds, of similar ones still in LO.  Even Microsoft Word doesn't do a very good job at simplified-to-traditional conversion, which is why I recommended a dedicated tool in my earlier reply.

(The following links are all in Chinese)
1. https://zi.tools/zi/%E8%A1%93
2. https://zi.tools/zi/%E6%9C%AE
3. https://zi.tools/zi/%E6%9C%AF
Comment 4 Ming Hua 2024-11-08 07:12:05 UTC
(In reply to Mike Kaganski from comment #2)
> The "Translate common terms" is unchecked by default; and its list is empty
> (I have all the dictionaries installed on my system, if that matters). The
> conversion may consider the terms list; but its main mode is "character by
> character".
>
> [...]
>
> The dictionaries seem to be initialized by
> https://opengrok.libreoffice.org/xref/core/linguistic/source/convdiclist.cxx?r=6f1508f4#351
> (using names "ChineseS2T" and "ChineseT2S").
After digging deeper, I believe I've found that the Chinese conversion term list is not stored with the other dictionaries, but at https://opengrok.libreoffice.org/xref/core/i18npool/source/textconversion/data/stc_word.dic , and it is not empty.

The character-to-character conversion should be based on the neighboring stc_char.dic file, and I'll have a look at it and see if I can provide a patch for this specific "艺术 -> 藝術" issue.