57776 – Bad strings cause word count to fail in Japanese

Summary: Bad strings cause word count to fail in Japanese

Status:	RESOLVED WORKSFORME

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Writer
Version: (earliest affected)	3.5.2 release
Hardware:	All All

Importance:	high major
Assignee:	Not Assigned

Blocks:	Word-Count

Reported:	2012-12-01 15:06 UTC by Matt Rosin
Modified:	2016-10-24 21:20 UTC
CC List:	1 user


Attachments
1 - 980 chars word doc.doc (21.50 KB, application/msword) 2012-12-01 15:06 UTC, Matt Rosin
2 - screenshot 980 chars LO incorrect says 990.png (455.02 KB, image/png) 2012-12-01 15:07 UTC, Matt Rosin
3 - screenshot 980 chars MSWORD (391.04 KB, image/png) 2012-12-01 15:08 UTC, Matt Rosin
4 - short document showing bad sequence 1.doc (9.50 KB, application/msword) 2012-12-01 15:08 UTC, Matt Rosin
5 - screenshot - forced to 6 char offset (233.08 KB, image/png) 2012-12-01 15:11 UTC, Matt Rosin
6 - screenshot - reverted to 1 char offset (230.82 KB, image/png) 2012-12-01 15:11 UTC, Matt Rosin
7 - short document showing bad sequence 2.odt (9.62 KB, application/vnd.oasis.opendocument.text) 2012-12-01 15:12 UTC, Matt Rosin

Description Matt Rosin 2012-12-01 15:06:34 UTC

Created attachment 70876
1 - 980 chars word doc.doc

This bug may or may not be related to bugs 56975, 54918, 54483, 55359, and 55707.

Bad strings cause word count to fail in Japanese.

LibreOffice’s Tools > Word Count function displays the number of characters (NOC) and the number of characters excluding spaces (NOCES).

I found cases where NOCES > NOC, caused by certain character sequences in a Japanese document. This should be impossible, since excluding spaces can never increase the count.
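To make the expected invariant concrete, here is a minimal Python sketch. It is not LibreOffice's actual counting code; it assumes both counts are taken over Unicode code points and that "space" means any character Python's `str.isspace()` accepts (including the ideographic space U+3000). The function name `count_chars` is hypothetical.

```python
def count_chars(text: str) -> tuple[int, int]:
    """Return (NOC, NOCES) for text.

    Assumption: one Unicode code point counts as one character.
    A real word processor may count differently (for example by
    grapheme cluster), so treat this only as an illustration.
    """
    noc = len(text)
    # NOCES counts a subset of the characters counted by NOC,
    # so it can never be larger than NOC.
    noces = sum(1 for ch in text if not ch.isspace())
    return noc, noces


sample = "日本語の 文章\u3000です。"  # contains one ASCII space and one ideographic space
noc, noces = count_chars(sample)
assert noces <= noc
print(noc, noces)  # prints: 11 9
```

Under these assumptions, any result with NOCES greater than NOC points to a counting error rather than to something unusual in the text.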

For example, the word count for the attached Word document (file 1) was 980 NOC and 990 NOCES in LibreOffice (screenshot in file 2).

NOC was found to be correct and matched the value calculated by Microsoft Word (screenshot in file 3).

By taking a short 100-character portion, I was able to isolate a bad character sequence (Word document in file 4) that can be used to arbitrarily increase the difference NOCES minus NOC (screenshot in file 5).

By deleting a character from that string in each instance, I was able to reduce the difference NOCES-NOC from 6 to 1 (screenshot in file 6). The 100-character portion also contains 1 instance, so the difference did not go to 0.

Note that the original 980 character document (file 1) has NOCES-NOC = 10. This offset means there are 10 bad sequences in 1 page of text, a very high error rate.

I picked out a second bad string for your comparison (file 7).

In addition, I think the function should be checked to make sure NOC is correct. I thought I found a case where it was wrong, but it seems okay now.

There may be similar problems with English / Unicode. This text was produced on Mac OS X and is probably Japanese Unicode... not sure about that.

I would also like to mention an enhancement request:

In general, it would be useful if the user could specify characters to be ignored when counting. In particular, some customers set a project's monetary value based on the number of characters, not counting any English letters, numerals, or Japanese punctuation.
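As a rough illustration of the requested behaviour, the following Python sketch counts characters while skipping user-selected classes. It is only a sketch under stated assumptions: the function `count_billable` and its parameters are hypothetical, "English letters" is approximated as any letter whose Unicode name contains "LATIN" (which also covers full-width forms), and the punctuation switch drops all Unicode punctuation, not only Japanese punctuation.

```python
import unicodedata


def count_billable(
    text: str,
    ignore_latin: bool = True,
    ignore_digits: bool = True,
    ignore_punctuation: bool = True,
    extra_ignored: str = "",
) -> int:
    """Count characters, skipping whitespace and the selected classes."""
    count = 0
    for ch in text:
        if ch.isspace() or ch in extra_ignored:
            continue
        category = unicodedata.category(ch)
        if ignore_digits and category == "Nd":  # decimal digits, incl. full-width
            continue
        if ignore_punctuation and category.startswith("P"):  # all punctuation
            continue
        if (
            ignore_latin
            and category.startswith("L")
            and "LATIN" in unicodedata.name(ch, "")
        ):
            continue
        count += 1
    return count


print(count_billable("LibreOffice 4.1は、速い。"))  # prints: 3
```

The `extra_ignored` string corresponds to the "user could specify characters" part of the request; the three boolean switches cover the customer rule described above.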

In the following, I would also like to mention a few points about how this function is used in the real world (I am also a professional translator). This is provided for closure. Incidentally, the same Word Count function in Microsoft Word has been a source of mystery for MS Word users for decades, so it pays to think it through. It would be easy to make a function for Japanese that is superior to the one in Word.

The word count function also provides a count of the “number of words” (NOW). It is very hard to count words in Japanese, although code does exist (academic morphological analyzers such as, IIRC, Kakashi) that produces a set of English character strings from Japanese text. It would be useful to tell the user how NOW is calculated in LO for Japanese text, as this can be used as a basis for communication with a customer.

In general, people do not count Japanese words, although I have seen a client count them, and it was impossible to refute the number without counting them myself.

An easier way is to divide the number of characters by an average number of characters per Japanese “word” (where counting works the same as in English: a single-letter word counts as one word, and a compound that translates to two English words counts as two). LO gave a number close to, but different from, the number I got, so one wonders what the algorithm is. This is not a critical matter, but it is part of the mystery of this dialog.

The uses of Word Count that I have myself seen are:
- For billing purposes. The customer is sometimes billed based on the number of target English words, and sometimes based on the number of source Japanese characters.
- For estimating the amount of time it will take to do a job, or one’s efficiency.
- For calculating the ratio of Japanese characters per English word, which differs according to the subject matter. A normal ratio is 1.7, whereas it can go up to nearly 3 for biochemistry, so this is important in order to ensure the billing rate reflects the amount of effort involved.
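The ratio-based estimate in the last item can be sketched in a few lines of Python. This is only an illustration: the function name `estimate_english_words` is hypothetical, the ratios 1.7 and 3 are the figures quoted above, and the 980-character input is simply the size of the attached sample document.

```python
def estimate_english_words(japanese_chars: int, chars_per_word: float = 1.7) -> float:
    """Estimate target English words from a source Japanese character count.

    chars_per_word is the Japanese-characters-per-English-word ratio,
    which depends on the subject matter.
    """
    if chars_per_word <= 0:
        raise ValueError("chars_per_word must be positive")
    return japanese_chars / chars_per_word


print(round(estimate_english_words(980)))       # prints: 576
print(round(estimate_english_words(980, 3.0)))  # prints: 327
```

The same source text yields a noticeably smaller English word estimate at the higher ratio, which is why the ratio matters when converting between per-character and per-word billing.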

Comment 1 Matt Rosin 2012-12-01 15:07:36 UTC

Created attachment 70877
2 - screenshot 980 chars LO incorrect says 990.png

Comment 2 Matt Rosin 2012-12-01 15:08:07 UTC

Created attachment 70878
3 - screenshot 980 chars MSWORD

Comment 3 Matt Rosin 2012-12-01 15:08:38 UTC

Created attachment 70879
4 - short document showing bad sequence 1.doc

Comment 4 Matt Rosin 2012-12-01 15:11:08 UTC

Created attachment 70880
5 - screenshot - forced to 6 char offset

Comment 5 Matt Rosin 2012-12-01 15:11:35 UTC

Created attachment 70881
6 - screenshot - reverted to 1 char offset

Comment 6 Matt Rosin 2012-12-01 15:12:07 UTC

Created attachment 70882
7 - short document showing bad sequence 2.odt

Comment 7 Matt Rosin 2012-12-01 15:14:50 UTC

Comment on attachment 70876
1 - 980 chars word doc.doc

Large document showing disparity of 10 between NOC and NOCES

Comment 8 tommy27 2013-09-01 10:18:29 UTC

I can reproduce the bug with 3.5.7 under Win7 64bit (chars count is 980/990)
I cannot reproduce it with LibO 4.1.1 (chars count is 980/980)
marking as RESOLVED WORKSFORME

do you still see this bug in recent 4.0.5 or 4.1.1 releases?
if the bug is still there change it to REOPENED