Created attachment 64144
Screenshot of "misspelled" Khmer words that should be treated as two words

Problem description:
While ICU automatic line-breaking now works for Khmer in LibreOffice 3.6, Hunspell does not seem to be using the same word-breaking data and only sees one long line of text (like Thai, Khmer does not use traditional "spaces" between words).

Steps to reproduce:
1. Type ឲ្យគេ (should be automatically broken by ICU into ឲ្យ|គេ)
2. If you have the SBBIC spelling checker installed ( http://extensions.libreoffice.org/extension-center/khmer-spelling-checker-sbbic-version ) and CTL enabled, you will see that ឲ្យគេ is treated as one word, rather than two, and is therefore marked as misspelled.
3. You might need a font to correctly display Khmer (download one here: http://www.sbbic.org/2011/01/19/khmer-sbbic-unicode-system-font/ )

Current behavior:
No Khmer words are automatically broken for Hunspell, so we have to continue manually putting zero-width spaces between words to spell check (even though line-breaking is now automatic).

Expected behavior:
Khmer words should be automatically broken for Hunspell to check.

Platform (if different from the browser):
Browser: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.47 Safari/536.11
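For illustration only, here is a minimal Python sketch of the behavior described in the report above. It assumes the spell checker receives tokens split only on whitespace and zero-width spaces (U+200B). The function name `split_for_spellcheck` is hypothetical; this is not LibreOffice or Hunspell code.

```python
import re

ZWSP = "\u200b"  # zero-width space, the manual word separator mentioned in the report

# The two Khmer words from the report's example.
FIRST = "ឲ្យ"
SECOND = "គេ"


def split_for_spellcheck(text):
    """Split on whitespace and zero-width spaces only (assumed behavior, no dictionary-based breaking)."""
    return [tok for tok in re.split(r"[\s\u200b]+", text) if tok]


unbroken = FIRST + SECOND        # typed without any separator
manual = FIRST + ZWSP + SECOND   # typed with a manual zero-width space

print(len(split_for_spellcheck(unbroken)))  # the whole run is a single token
print(len(split_for_spellcheck(manual)))    # the manual separator yields two tokens
```

With splitting like this, the unbroken input reaches the dictionary lookup as one long token, which is why it would be flagged unless the user inserts zero-width spaces by hand.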
It would be great to see this feature included in LibreOffice for Cambodians. Thanks
I don't like typing with spaces between words in Khmer. It slows down my typing.
Caolan McNamara committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=8ad1d4443e67784a8c0d3c1a3a72f089cb0cd3ec Resolves: fdo#52020 ICU breakiterator not used for Khmer
Caolan McNamara committed a patch related to this issue. It has been pushed to "libreoffice-3-6": http://cgit.freedesktop.org/libreoffice/core/commit/?id=5e3c37c8a3b567cf3d8c9a47b37155e3c2ffefb9&g=libreoffice-3-6 Resolves: fdo#52020 ICU breakiterator not used for Khmer It will be available in LibreOffice 3.6.
Wonderful news! Thank you for your time on this!
After re-evaluating this solution, I want to ask that this patch be reversed for the time being in relation to these two bugs: https://bugs.freedesktop.org/show_bug.cgi?id=59448 and https://bugs.freedesktop.org/show_bug.cgi?id=59447 Currently this patch makes it so that the user cannot be sure that all the words are correctly spelled in Khmer because the ICU word-breaker is not 100% accurate (so if it splits a word wrong it might be shown as being spelled correctly when in fact it is not). I originally thought this patch would be a good thing, but it is now apparent that until the two other bugs/feature requests are solved, the ICU breaker should not be used for Hunspell/spell checking for Khmer. Thanks, and sorry for causing this mess!
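The concern above can be sketched with a toy example. This uses Latin letters and a hypothetical four-word dictionary as a stand-in; the greedy longest-match segmenter below is an assumption chosen for illustration and is not ICU's actual algorithm.

```python
# Hypothetical toy word list standing in for a spelling dictionary.
DICTIONARY = {"nowhere", "now", "her", "here"}


def greedy_segment(text, words):
    """Greedy longest-match segmentation; unmatched characters become one-character tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in words:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit one character and move on.
            tokens.append(text[i])
            i += 1
    return tokens


def misspelled(tokens, words):
    """Return the tokens that are not in the dictionary."""
    return [t for t in tokens if t not in words]


typo = "nowher"  # intended word: "nowhere"

# Checked as one token, the typo is caught.
print(misspelled([typo], DICTIONARY))

# Segmented first, the typo splits into two valid words and is missed.
print(greedy_segment(typo, DICTIONARY))
print(misspelled(greedy_segment(typo, DICTIONARY), DICTIONARY))
```

In this sketch the segmenter turns a misspelling into a sequence of valid words, so the checker reports nothing. That is the kind of false "correct" result described above when the word breaker splits a word in the wrong place.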
@Nathan would you please give an update on the bug's current status with the latest 4.4.5.2 or 5.0.0.5 releases? Is it still a bug, or is it fixed?
@tommy27 Yes, the bug still exists. We've tried to push a fix through at the ICU level, but as these things go, it is very slow. It would still be great to add the ability for the user to "turn off" the ICU breakiterator in LibreOffice (not only for Khmer, but also for some minority languages that were affected by this change).
Should I create a new bug? This one is confusing, since I first requested that the breakiterator be enabled and am now asking for it to be disabled again.
I think it's better to continue the discussion here. Just add a brief summary explaining why it's better to revert the previous fix.
Nathan Wells committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=10199478b841a87e6436996bde221e424d1df708 Related: tdf#52020 Disable ICU Breakiterator for Khmer It will be available in 5.1.0. The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
I'll mark this as "resolved" "not our bug", in the sense that at the moment, with ICU 54, the word boundary detection is still considered insufficient. If ICU ever gets to that point, the above commit can be used as a reference to re-enable it.