Bug 114760 - Word Count problem with symbols in Chinese mixed with English text
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version: 3.6.0.4 release (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
Blocks: CJK
Reported: 2017-12-30 02:26 UTC by Cheng-Chia Tseng
Modified: 2019-07-04 05:42 UTC (History)

Description Cheng-Chia Tseng 2017-12-30 02:26:03 UTC
Description:
The Word Count dialog has a "Words" count.
For English text it counts words and ignores symbols, but for Chinese text it counts characters AND symbols.

There are two ways to count Chinese text: one counts Chinese characters and symbols, and the other counts only Chinese characters (no symbols). The former, which includes Chinese symbols, is much more popular in the press.

So when we count a document that mixes Chinese and English text, we add the English word count (not counting symbols) to the Chinese count (with or without symbols).

The "Words" count in LibreOffice currently uses the first method above: it counts English words plus Chinese characters and Chinese symbols. I think that is confusing because we regard "phonogram words" as equivalent to "Chinese characters."

The "Words" count should be split into:
1. Words => corrected to count only words and Chinese characters.
2. Words and Chinese symbols => the method currently used for the "Words" count.

Steps to Reproduce:
1. Open Writer
2. Copy and paste "Hello, world! 世界,你好!"
3. Select Tools > Word Count to see the stats

Actual Results:  
1. Words: 8
2. Characters including spaces: 20
3. Characters excluding spaces: 18
4. Asian characters and Korean syllables: 6

Expected Results:
The sentence "Hello, world! 世界,你好!" contains 2 English words (Hello world), 4 Chinese characters (世界你好), 4 symbols (,!,!), 2 of which are Chinese symbols (the comma and exclamation mark in the Chinese part), and 2 spaces.

1. Words: 6 => "Words" should be corrected to exclude symbols
2. Words and Chinese symbols: 8 => what the "Words" count reports now
3. Words and symbols: 10
4. Characters including spaces: 20
5. Characters excluding spaces: 18
6. Asian characters and Korean syllables: 6
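
The expected figures above can be checked with a short Python sketch. This is only an illustration of the proposed counting rules, not LibreOffice's implementation. It assumes that the comma and exclamation mark in the Chinese part of the sample are full-width (U+FF0C and U+FF01), that a "Chinese character" is a code point in the CJK Unified Ideographs block or Extension A, and that a "Chinese symbol" is any punctuation or symbol with East Asian width W or F. The function names are hypothetical.

```python
import unicodedata


def is_cjk_ideograph(ch):
    # Assumption: only the basic CJK block and Extension A are checked.
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF


def is_symbol(ch):
    # Unicode general categories P* (punctuation) and S* (symbols).
    return unicodedata.category(ch)[0] in ("P", "S")


def is_chinese_symbol(ch):
    # Assumption: a wide or full-width punctuation mark counts as "Chinese".
    return is_symbol(ch) and unicodedata.east_asian_width(ch) in ("W", "F")


def word_counts(text):
    words = 0            # space-delimited words plus one per ideograph
    chinese_symbols = 0
    other_symbols = 0
    in_word = False
    for ch in text:
        if is_cjk_ideograph(ch):
            words += 1
            in_word = False
        elif is_symbol(ch):
            if is_chinese_symbol(ch):
                chinese_symbols += 1
            else:
                other_symbols += 1
            in_word = False
        elif ch.isspace():
            in_word = False
        elif not in_word:
            # First character of a new non-CJK word.
            words += 1
            in_word = True
    return {
        "words": words,
        "words_and_chinese_symbols": words + chinese_symbols,
        "words_and_symbols": words + chinese_symbols + other_symbols,
        "chars_including_spaces": len(text),
        "chars_excluding_spaces": sum(1 for c in text if not c.isspace()),
    }


# Sample from the report, with the assumed full-width marks written as escapes.
sample = "Hello, world! 世界\uff0c你好\uff01"
print(word_counts(sample))
```

Because any symbol ends the current word, a word with an internal apostrophe or hyphen would be counted twice; a real implementation would need proper word-boundary rules.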


Reproducible: Always

User Profile Reset: No

Additional Info:
User-Agent: Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0
Comment 1 Buovjaga 2018-01-27 18:37:43 UTC
Confirmed.

Arch Linux 64-bit
Version: 6.1.0.0.alpha0+
Build ID: 2d8f17565ebe867210f5769851d91b2e7b612a8f
CPU threads: 8; OS: Linux 4.14; UI render: default; VCL: kde4; 
Locale: fi-FI (fi_FI.UTF-8); Calc: group threaded
Built on January 27th 2018
Comment 2 QA Administrators 2019-01-28 03:42:17 UTC Comment hidden (obsolete)
Comment 3 Ming Hua 2019-05-27 08:25:11 UTC
(In reply to QA Administrators from comment #2)
Still reproducible in 6.2.4.

Version: 6.2.4.2 (x64)
Build ID: 2412653d852ce75f65fbfa83fb7e7b669a126d64
CPU threads: 2; OS: Windows 10.0; UI render: GL; VCL: win; 
Locale: zh-CN (zh_CN); UI-Language: en-US
Calc: threaded
Comment 4 Naruhiko Ogasawara 2019-06-23 04:05:40 UTC
Is it enough to just exclude symbols from the word count?  Or do we need an extra count (the current "Words" count)?

I'm digging into this issue now, so I would like to confirm the real problem we should fix.
Comment 5 Cheng-Chia Tseng 2019-06-23 16:51:05 UTC
In my opinion, "words" basically does not include symbols.

The method LibreOffice uses now takes Chinese symbols into account, which helps users in Taiwan or China get the number the press asks for.

Note: The press/media in Taiwan or China count Chinese symbols as well when calculating writers' pay.

I suggest adding an extra count that gives the "real word" count, regardless of any form of symbols.
Comment 6 Ming Hua 2019-07-04 05:42:03 UTC
In my opinion, there are multiple issues here, some illustrated by the example from the bug submitter, some not.  Maybe I should file separate bugs.

1. Exclude Chinese punctuation and symbols from the "Words" count.  Or alternatively, exclude all Chinese characters and symbols from the "Words" count, as "words" (词/詞) is a rather vague concept in Chinese anyway, and counting each Chinese character as a word would never be correct.

2. Recognize full-width space (U+3000) in the "Characters excluding spaces" count.

3. Provide an Asian character count excluding punctuation and symbols, as that number is sometimes preferred.
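
Items 2 and 3 can be illustrated with a small Python sketch. It is a rough approximation under stated assumptions, not LibreOffice code: an "Asian character" is taken to be any non-space code point whose East Asian width is W or F (which also covers Hangul syllables and some other wide characters), and the function names are hypothetical.

```python
import unicodedata


def chars_excluding_spaces(text):
    # str.isspace() is True for U+3000 (IDEOGRAPHIC SPACE),
    # so the full-width space is excluded like an ASCII space.
    return sum(1 for ch in text if not ch.isspace())


def asian_chars(text, include_punctuation=True):
    count = 0
    for ch in text:
        # Assumption: wide (W) or full-width (F) code points are "Asian".
        if unicodedata.east_asian_width(ch) not in ("W", "F"):
            continue
        if ch.isspace():
            continue  # U+3000 is full-width but is a space, not a character
        is_punct = unicodedata.category(ch)[0] in ("P", "S")
        if is_punct and not include_punctuation:
            continue
        count += 1
    return count


# Hypothetical sample: two ideographs, a full-width space,
# two ideographs, and a full-width exclamation mark.
sample = "世界\u3000你好\uff01"
print(chars_excluding_spaces(sample))
print(asian_chars(sample, include_punctuation=True))
print(asian_chars(sample, include_punctuation=False))
```

The `include_punctuation` flag keeps both numbers available, since the comments above describe users who want each of them.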