Created attachment 68181
Different word counts for corresponding Finnish and English texts.

Word count does not function correctly if the language of the text is set to Finnish: any punctuation or combination of punctuation that is not immediately preceded by a letter is counted as the end of a previous word (even if the punctuation occurs in the beginning of a paragraph, so there is no previous word). Furthermore, some special symbols are counted together with the previous word (typically but not necessarily written with numerical digits, e.g. "10 %") as if they were a single word, despite their being separated by a space. If the language is set to another language, the word count seems to function correctly.

Found in 3.5.4 (backported for Debian Squeeze); also present in 3.6.2 (Windows). The issue may be related to bug 33774 (it seems that the fix does not affect Finnish text for some reason).

Steps to reproduce:

1) Open the attached test file. The first column has some samples set in Finnish, whereas the second column has corresponding samples set in English.
2) Open the word-count dialog box (if using 3.5.4).
3) Select the first sample line (including the punctuation) in the Finnish column. According to the word count, the current selection contains two words, the opening quotation mark being counted as a separate word. The corresponding English sample is correctly reported to contain only one word.
4) On the second line, select one by one each of the Finnish sample words. The string "USA:n" is correctly counted as a single word (since the colon is preceded by a letter), but both "90:n" and "%:n" are counted as two separate words (and by further experimentation, one can see that the string "n, %:" is counted as a single word, mixing the ending of one word, the intervening punctuation, and the stem of the following word). In the corresponding English column, "USA's", "90's" and "%'s" are each counted as a single word.
5) Select the entire third line ("10 %, 10 €") in the Finnish column. The current selection is counted as two words, whereas the identical string in the English column is counted as four words.

If the language setting of the Finnish column is changed into English, the word count works correctly, and vice versa. Also, if the language setting is changed into French, German, or Swedish, the word count works correctly. The issue only seems to affect Finnish. (Could this somehow be connected to the fact that the spell checker for Finnish is not Hunspell but Voikko?)
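For anyone wanting to script the comparison, the counts from the steps above can be written down as a small check. This is only a sketch: the expected values are the ones the report treats as correct, the "reported" values are the Finnish-column counts described above, and whitespace splitting is an assumed stand-in for a reference counter, not LibreOffice's actual algorithm. The names `EXPECTED_COUNTS`, `REPORTED_FINNISH_COUNTS`, `baseline_word_count` and `find_mismatches` are hypothetical.

```python
# Counts the report treats as correct (they match the English column).
EXPECTED_COUNTS = {
    "USA:n": 1,
    "90:n": 1,
    "%:n": 1,
    "10 %, 10 €": 4,
}

# Counts the report observed when the text language is set to Finnish.
REPORTED_FINNISH_COUNTS = {
    "USA:n": 1,
    "90:n": 2,
    "%:n": 2,
    "10 %, 10 €": 2,
}


def baseline_word_count(text):
    """Count whitespace-delimited tokens (assumed reference, not LibreOffice's rules)."""
    return len(text.split())


def find_mismatches(counter, expected=EXPECTED_COUNTS):
    """Return {sample: (expected, actual)} for every sample the counter gets wrong."""
    mismatches = {}
    for sample, want in expected.items():
        got = counter(sample)
        if got != want:
            mismatches[sample] = (want, got)
    return mismatches


if __name__ == "__main__":
    # The whitespace baseline agrees with the expected counts.
    print(find_mismatches(baseline_word_count))
    # The counts reported for Finnish disagree on three of the four samples.
    print(find_mismatches(REPORTED_FINNISH_COUNTS.get))
```

The same `find_mismatches` helper could be pointed at any other counter, for instance one driven by the word-count dialog, to see which samples still differ.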
Created attachment 68182
Different word counts for corresponding Finnish and English texts.
Confirmed on Version 3.6.2.2 (Build ID: 360m1(Build:2)) on Kubuntu 12.10 (Linux)
Any update with a recent LO version (4.1.5 or 4.2.3)? I'm quite sure there have been some fixes to word counting since 3.6, though I can't say whether they'll solve your problem.
The bug is still present in 4.2.3.3.
Simo: thank you for your feedback.
Caolán: I thought you might be interested in this one (see http://cgit.freedesktop.org/libreoffice/core/commit/?id=eae2e87ba4de1ae59779cbfc56109aa6c27fbc17 for example).
Simo: I just realized that this commit isn't in the 4.2 branch, only on master (the future 4.3.0), but it could help (take a look at fdo#51818, which I put in See Also). For testing, since it's a development version, could you give a try to a daily build from master sources (see http://dev-builds.libreoffice.org/daily/master/)?
OK, thanks for the tip. Unfortunately, the bug still seems to be present in the development version too. Tested on Windows: Version: 4.3.0.0.alpha1+ Build ID: 0b03f7ed575838f90e6b1ebec3538a3a214f81fb TinderBox: Win-x86@42, Branch:master, Time: 2014-04-30_02:30:23
The difference is (I'm guessing) probably because we have (for some reason) customized rules in http://cgit.freedesktop.org/libreoffice/core/tree/i18npool/source/breakiterator/data for Finnish (the _fi files), and presumably they are wrong or out of date. Ideally we would have no such custom rules and would just use the built-in icu rules. It's likely that the custom rules were created long ago when icu had some poor Finnish rules, and it's possible that the normal rules are no better than our custom ones. The customizations are due to these two old reports:
https://issues.apache.org/ooo/show_bug.cgi?id=58513
https://issues.apache.org/ooo/show_bug.cgi?id=85411
The other possibility is that the problem lies in icu and the custom rules here are a red herring.
Caolan McNamara committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=6e225b41f1ab3e6cac395b0c0c6db73414658625 Resolves: fdo#55707 Word count incorrect if language is set to Finnish The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
With the above changes I get 10 words for the Finnish text in both LibreOffice and MSOffice, and ctrl+right/left gives equal boundaries. Both apps believe "10 %" and "10 €" consist of 2 words each. In English they definitely do form two different words as the practice is 10% and 10€ in that language. Though I know the practice is "10 %" in other languages, I don't know if it counts as a single word or not. Any issues around that would have to be raised in icu itself. Give the dailies a go tomorrow or so and see if there are side-effects of the change.
Thank you for the fix! The word count now works as expected for Finnish too. Indeed, I'd expect "10 %" etc. to count as 2 words, since basically it stands for "ten percent" (or "kymmenen prosenttia" in Finnish).