52020 – ICU breakiterator not working with Khmer and Hunspell

Summary: ICU breakiterator not working with Khmer and Hunspell

Status:	RESOLVED NOTOURBUG

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	LibreOffice
Version: (earliest affected)	3.6.0.0.beta2
Hardware:	Other All

Importance:	medium normal
Assignee:	Not Assigned

URL:
Whiteboard:	BSA target:3.7.0 target:3.6.0.2 targe...
Keywords:

Depends on:
Blocks:

Reported:	2012-07-12 16:39 UTC by Nathan Wells
Modified:	2016-10-25 19:24 UTC
CC List:	2 users

See Also:	59448 59447
Crash report or crash signature:

Attachments
Screenshot of "misspelled" Khmer words that should be treated as two words (9.42 KB, image/jpeg) 2012-07-12 16:39 UTC, Nathan Wells

Description Nathan Wells 2012-07-12 16:39:53 UTC

Created attachment 64144
Screenshot of "misspelled" Khmer words that should be treated as two words

Problem description: While ICU automatic line-breaking now works for Khmer in LibreOffice 3.6, Hunspell does not seem to be using the same word-breaking data and only sees one long run of text (like Thai, Khmer does not put spaces between words).

Steps to reproduce:
1. Type ឲ្យគេ (should be automatically broken by ICU into ឲ្យ|គេ)
2. If you have the SBBIC spelling checker installed http://extensions.libreoffice.org/extension-center/khmer-spelling-checker-sbbic-version and CTL enabled, you will see that ឲ្យគេ is treated as one word, rather than two, and is therefore misspelled.
3. You might need a font to correctly display Khmer (download one here: http://www.sbbic.org/2011/01/19/khmer-sbbic-unicode-system-font/ )

Current behavior: Khmer text is not automatically broken into words for Hunspell, so we still have to insert zero-width spaces between words manually in order to spell check (even though line-breaking is now automatic).
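
The manual workaround can be sketched as follows. This is a minimal illustration, not Hunspell or LibreOffice code: the helper names and the two-entry dictionary are hypothetical, and the only assumption is that the checker treats U+200B (zero-width space) as a word separator.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE, typed manually between Khmer words

# The two words from the steps to reproduce.
FIRST_WORD = "ឲ្យ"
SECOND_WORD = "គេ"

# Hypothetical stand-in for a real Khmer dictionary.
TOY_DICTIONARY = {FIRST_WORD, SECOND_WORD}


def split_words(text):
    """Split text into words at zero-width spaces, dropping empty pieces."""
    return [word for word in text.split(ZWSP) if word]


def find_misspelled(text, dictionary):
    """Return every word in text that is not in the dictionary."""
    return [word for word in split_words(text) if word not in dictionary]


# Typed without a separator: the checker sees one unknown "word".
unbroken = FIRST_WORD + SECOND_WORD
# Typed with a manual zero-width space: each word is checked on its own.
broken = FIRST_WORD + ZWSP + SECOND_WORD

print(find_misspelled(unbroken, TOY_DICTIONARY))
print(find_misspelled(broken, TOY_DICTIONARY))
```

The first call reports the whole string as misspelled, the second reports nothing, which is the behaviour shown in the attached screenshot.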

Expected behavior: Khmer words should be automatically broken for Hunspell to check.

Platform (if different from the browser): 
              
Browser: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.47 Safari/536.11

Comment 1 Tia Seng 2012-07-13 03:01:56 UTC

It would be great to see this feature included in LibreOffice for Cambodians. Thanks.

Comment 2 chomneau 2012-07-13 04:41:41 UTC

I don't like typing spaces between words in Khmer. It slows down my typing.

Comment 3 Not Assigned 2012-07-13 08:55:41 UTC

Caolan McNamara committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=8ad1d4443e67784a8c0d3c1a3a72f089cb0cd3ec

Resolves: fdo#52020 ICU breakiterator not used for Khmer

Comment 4 Not Assigned 2012-07-13 09:19:24 UTC

Caolan McNamara committed a patch related to this issue.
It has been pushed to "libreoffice-3-6":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=5e3c37c8a3b567cf3d8c9a47b37155e3c2ffefb9&g=libreoffice-3-6

Resolves: fdo#52020 ICU breakiterator not used for Khmer


It will be available in LibreOffice 3.6.

Comment 5 Nathan Wells 2012-07-13 09:42:15 UTC

Wonderful news! Thank you for your time on this!

Comment 6 Nathan Wells 2013-07-19 03:06:19 UTC

After re-evaluating this solution, I want to ask that this patch be reversed for the time being in relation to these two bugs: https://bugs.freedesktop.org/show_bug.cgi?id=59448

and

https://bugs.freedesktop.org/show_bug.cgi?id=59447

With this patch, users cannot be sure that all Khmer words are spelled correctly, because the ICU word-breaker is not 100% accurate: if it splits a word in the wrong place, the pieces may be shown as correctly spelled when the word is in fact misspelled.
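
To illustrate this failure mode, here is a minimal sketch using a toy greedy longest-match segmenter and an English stand-in dictionary. Both are assumptions made for illustration only: this is not ICU's actual algorithm or data, and the function names are hypothetical.

```python
# Hypothetical toy dictionary; Latin letters stand in for Khmer text.
TOY_DICTIONARY = {"to", "gather", "together"}


def segment(text, dictionary):
    """Greedy longest-match segmentation.

    At each position take the longest dictionary word that matches;
    if nothing matches, emit a single character and move on.
    """
    max_len = max(len(word) for word in dictionary)
    words = []
    i = 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: fall back to one character.
            words.append(text[i])
            i += 1
    return words


def find_misspelled(words, dictionary):
    """Return the words that are not in the dictionary."""
    return [word for word in words if word not in dictionary]


typo = "togather"  # misspelling of "together"

# Checked as one manually delimited word, the typo is caught.
print(find_misspelled([typo], TOY_DICTIONARY))

# Segmented automatically first, it becomes two valid words and slips through.
print(segment(typo, TOY_DICTIONARY))
print(find_misspelled(segment(typo, TOY_DICTIONARY), TOY_DICTIONARY))
```

Checked as a single word, "togather" is flagged. After automatic segmentation it becomes "to" and "gather", both valid, so the checker reports no error.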

I originally thought this patch would be a good thing, but it is now apparent that until the two other bugs/feature requests are solved, the ICU breaker should not be used for Hunspell/spell checking for Khmer.

Thanks, and sorry for causing this mess!

Comment 7 tommy27 2015-08-14 18:42:13 UTC

@Nathan
Would you please give an update on the bug's current status with the latest 4.4.5.2 or 5.0.0.5 releases?

Is it still a bug, or is it fixed?

Comment 8 Nathan Wells 2015-08-17 02:56:36 UTC

@tommy27
Yes, the bug still exists. We've tried to push a fix through at the ICU level, but as these things go, it is very slow. It would still be great to add the ability for the user to "turn off" the ICU breakiterator in LibreOffice (not only for Khmer, but also for some minority languages that were affected by this change).

Comment 9 Nathan Wells 2015-08-26 12:55:35 UTC

Should I create a new bug? This one is confusing, because I first requested that the breakiterator be enabled, and now I am asking for it to be disabled again.

Comment 10 tommy27 2015-08-26 14:09:57 UTC

I think it's better to continue discussion here.

Just drop a brief summary explaining why it's better to revert the previous fix.

Comment 11 Commit Notification 2015-09-03 11:29:09 UTC

Nathan Wells committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=10199478b841a87e6436996bde221e424d1df708

Related: tdf#52020 Disable ICU Breakiterator for Khmer

It will be available in 5.1.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.

Comment 12 Caolán McNamara 2015-09-03 11:34:13 UTC

I'll mark this as "resolved" "not our bug" in the sense that at the moment, with ICU 54, the word boundary detection is still considered insufficient. If ICU ever gets to that point, the above commit can be used as a reference to re-enable it.