Bug 55707 - Word count incorrect if language is set to Finnish
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 3.5.4 release
Hardware: Other All
Importance: medium normal
Assignee: Caolán McNamara
URL:
Whiteboard: target:4.3.0
Keywords:
Depends on:
Blocks: Word-Count
Reported: 2012-10-06 22:29 UTC by Simo Kaupinmäki
Modified: 2016-10-24 21:20 UTC (History)

See Also:
Crash report or crash signature:


Attachments
Different word counts for corresponding Finnish and English texts. (10.31 KB, application/vnd.oasis.opendocument.text)
2012-10-06 22:29 UTC, Simo Kaupinmäki
Different word counts for corresponding Finnish and English texts. (10.31 KB, application/vnd.oasis.opendocument.text)
2012-10-06 22:56 UTC, Simo Kaupinmäki

Description Simo Kaupinmäki 2012-10-06 22:29:55 UTC
Created attachment 68181 [details]
Different word counts for corresponding Finnish and English texts.

Word count does not function correctly if the language of the text is set to Finnish: any punctuation or combination of punctuation that is not immediately preceded by a letter is counted as the end of a previous word (even if the punctuation occurs at the beginning of a paragraph, where there is no previous word). Furthermore, some special symbols are counted together with the previous word (typically but not necessarily written with numerical digits, e.g. "10 %") as if they were a single word, despite being separated from it by a space.

If the language is set to another language, the word count seems to function correctly.

Found in 3.5.4 (backported for Debian Squeeze); also present in 3.6.2 (Windows).

The issue may be related to bug 33774 (it seems that the fix does not affect Finnish text for some reason).

Steps to reproduce:

1) Open the attached test file. The first column has some samples set in Finnish, whereas the second column has corresponding samples set in English.
2) Open the word-count dialog box (if using 3.5.4).
3) Select the first sample line (including the punctuation) in the Finnish column. According to the word count, the current selection contains two words, the opening quotation mark being counted as a separate word. The corresponding English sample is correctly reported to contain only one word.
4) On the second line, select one by one each of the Finnish sample words. The string "USA:n" is correctly counted as a single word (since the colon is preceded by a letter), but both "90:n" and "%:n" are counted as two separate words (and by further experimentation, one can see that the string "n, %:" is counted as a single word, mixing the ending of one word, the intervening punctuation, and the stem of the following word). In the corresponding English column, "USA's", "90's" and "%'s" are each counted as a single word.
5) Select the entire third line ("10 %, 10 €") in the Finnish column. The current selection is counted as two words, whereas the identical string in the English column is counted as four words.
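
The counts in the steps above can be checked against a rough baseline. The sketch below is only an illustration, not how Writer counts words: it assumes a plain whitespace-delimited count, which happens to agree with every English-column figure quoted in this report. The names `count_words` and `mismatches` are made up for the example, and the two tables simply restate the counts given in steps 4 and 5.

```python
def count_words(text: str) -> int:
    """Count whitespace-delimited tokens (a rough baseline, not Writer's algorithm)."""
    return len(text.split())


# Counts this report gives for the Finnish column (steps 4 and 5).
REPORTED_FINNISH = {
    "USA:n": 1,
    "90:n": 2,
    "%:n": 2,
    "10 %, 10 €": 2,
}

# Counts this report gives for the English column (steps 4 and 5).
REPORTED_ENGLISH = {
    "USA's": 1,
    "90's": 1,
    "%'s": 1,
    "10 %, 10 €": 4,
}


def mismatches(reported: dict) -> list:
    """Return the samples whose reported count differs from the baseline."""
    return [sample for sample, n in reported.items() if count_words(sample) != n]


if __name__ == "__main__":
    print("English mismatches:", mismatches(REPORTED_ENGLISH))
    print("Finnish mismatches:", mismatches(REPORTED_FINNISH))
```

Under this assumption the English column shows no mismatches, while the Finnish column differs on "90:n", "%:n" and "10 %, 10 €", which are exactly the samples flagged in the steps.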

If the language setting of the Finnish column is changed to English, the word count works correctly, and vice versa. The word count also works correctly if the language setting is changed to French, German, or Swedish. The issue only seems to affect Finnish. (Could this somehow be connected to the fact that the spell checker for Finnish is not Hunspell but Voikko?)
Comment 1 Simo Kaupinmäki 2012-10-06 22:56:35 UTC
Created attachment 68182 [details]
Different word counts for corresponding Finnish and English texts.
Comment 2 Juan Canham 2012-12-24 01:00:30 UTC
Confirmed on Version 3.6.2.2 (Build ID: 360m1(Build:2)) on Kubuntu 12.10 (Linux)
Comment 3 Julien Nabet 2014-05-02 12:48:32 UTC
Any update with a recent LO version? (4.1.5 or 4.2.3)
I'm quite sure there have been some fixes to word counting since 3.6, though I can't say whether they will solve your problem.
Comment 4 Simo Kaupinmäki 2014-05-02 14:35:08 UTC
The bug is still present in 4.2.3.3.
Comment 5 Julien Nabet 2014-05-02 15:19:38 UTC
Simo: thank you for your feedback

Caolán: I thought you might be interested in this one (seeing http://cgit.freedesktop.org/libreoffice/core/commit/?id=eae2e87ba4de1ae59779cbfc56109aa6c27fbc17 for example)
Comment 6 Julien Nabet 2014-05-02 15:33:47 UTC
Simo: I just realized that this commit isn't in the 4.2 branch, only on master (the future 4.3.0), but it could help (take a look at fdo#51818, added to See Also).
Since this is a development version, could you test with a daily build from master sources (see http://dev-builds.libreoffice.org/daily/master/)?
Comment 7 Simo Kaupinmäki 2014-05-02 16:40:31 UTC
OK, thanks for the tip. Unfortunately, the bug still seems to be present in the development version too. Tested on Windows:

Version: 4.3.0.0.alpha1+
Build ID: 0b03f7ed575838f90e6b1ebec3538a3a214f81fb
TinderBox: Win-x86@42, Branch:master, Time: 2014-04-30_02:30:23
Comment 8 Caolán McNamara 2014-05-12 14:57:43 UTC
The difference is (I'm guessing) probably because we have (for some reason) customized rules in http://cgit.freedesktop.org/libreoffice/core/tree/i18npool/source/breakiterator/data for Finnish (the _fi files) and presumably they are wrong or out of date.

Ideally we would have no such custom rules and prefer just the in-built icu rules. It's likely that the custom rules were created long ago when icu had some poor Finnish rules, and it's possible that the normal rules are no better than our custom ones.

The customizations are due to these two old reports.
https://issues.apache.org/ooo/show_bug.cgi?id=58513
https://issues.apache.org/ooo/show_bug.cgi?id=85411

The other possibility is that the problem lies in icu and the custom rules here are a red herring.
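
As a rough illustration of the arrangement described above, a per-language override with a fallback to the built-in rules can be sketched as below. Everything here is hypothetical: the names `rules_for`, `CUSTOM_RULES` and the string labels are invented for the example and do not correspond to actual LibreOffice or icu identifiers.

```python
# Label standing in for the in-built rule set (hypothetical name).
DEFAULT_RULES = "icu-builtin"

# Hypothetical registry of languages that ship their own rule files.
CUSTOM_RULES = {"fi": "custom-fi"}


def rules_for(locale: str, custom: dict = CUSTOM_RULES) -> str:
    """Pick the rule set for a locale: a custom override if one exists, else the default."""
    # Reduce "fi-FI" or "fi_FI" to the bare language code "fi".
    language = locale.replace("_", "-").split("-")[0].lower()
    return custom.get(language, DEFAULT_RULES)


if __name__ == "__main__":
    for loc in ("fi-FI", "en-US", "sv-SE"):
        print(loc, "->", rules_for(loc))
    # With no override registered, Finnish falls back to the default rules too.
    print("fi-FI without override ->", rules_for("fi-FI", custom={}))
```

In such a scheme only the language with an override can diverge from the others, which would fit the observation that French, German, and Swedish count correctly while Finnish does not.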
Comment 9 Commit Notification 2014-05-12 16:09:53 UTC
Caolan McNamara committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=6e225b41f1ab3e6cac395b0c0c6db73414658625

Resolves: fdo#55707 Word count incorrect if language is set to Finnish



The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 10 Caolán McNamara 2014-05-12 16:13:51 UTC
With the above changes I get 10 words for the Finnish text in LibreOffice and MSOffice and ctrl+right/left gives equal boundaries.

Both apps believe "10 %" and "10 €" consist of 2 words each. In English they definitely do form two different words as the practice is 10% and 10€ in that language. Though I know the practice is "10 %" in other languages, I don't know if it counts as a single word or not. Any issues around that would have to be raised in icu itself.

Give the dailies a go tomorrow or so and see if there are side-effects of the change.
Comment 11 Simo Kaupinmäki 2014-05-16 14:42:21 UTC
Thank you for the fix! The word count now works as expected for Finnish too. Indeed, I'd expect "10 %" etc. to count as 2 words, since basically it stands for "ten percent" (or "kymmenen prosenttia" in Finnish).