Problem description: Currently, for languages written without spaces between words (such as Thai and Khmer), the ICU BreakIterator is always on and cannot be turned off without changing code within LibreOffice. Since the break iterators for both languages are not 100% accurate, this causes problems for users who want precise control of word breaks for spelling and line breaks and would rather input zero-width spaces manually.
Desired behavior: Add a check box to turn the ICU BreakIterator on or off in Options->Language Settings->Complex Text Layout, shown if the current CTL language supports the ICU BreakIterator.
Default should be enabled.
Operating System: All
Version: 22.214.171.124 release
Yes, this would be helpful for the Khmer language. Please help!
The ICU BreakIterator creates chaos for minority languages written in Khmer script, since it inserts line breaks at all the wrong places, even when the text has been typed with ZWSP between words.
This simple option would fix the problem if LibreOffice reverted to the pre-3.6.0 behavior of breaking lines only at ZWSP, space, and punctuation.
Hi Khaled, do you know if there is any reason to keep this RFE in unconfirmed status?
Thank you very much for your help.
Best regards. JBF
I know very little about this area of LibreOffice, but as an enhancement request there is no need to mark it as unconfirmed.
It would be great to see this feature added in LibreOffice. The bug still exists and this feature is still necessary for many Khmer and minority language users (who have stopped using LibreOffice because ICU is enabled for line breaking and spell checking).
Here is a high level description of a proposed change for 5.2 to handle Khmer linebreaking.
The new algorithm is as follows:
Inserting a ZWSP (U+200B) introduces a line break opportunity directly following the ZWSP, but it also inhibits any dictionary-based breaks up to 3 clusters before and after the ZWSP. A cluster is defined as a base + medials. A medial is any general category M character, or a coeng+base sequence. Likewise, inserting a WJ (U+2060) inhibits a line break opportunity at that point and up to 3 clusters before and after.
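To make the cluster definition concrete, here is a minimal Python sketch. It is not the code from the patch (which lives in ICU and is C++); the function name split_clusters is made up for illustration, and treating whatever character follows a coeng as its subscript base is an assumption.

```python
import unicodedata

COENG = "\u17d2"  # KHMER SIGN COENG


def split_clusters(text):
    """Split text into clusters: one base followed by any medials.

    A medial is any character whose general category starts with "M",
    or a coeng followed by one more character (assumed to be its base).
    """
    clusters = []
    i = 0
    n = len(text)
    while i < n:
        j = i + 1  # the base character
        while j < n:
            if text[j] == COENG and j + 1 < n:
                j += 2  # coeng + base counts as a single medial
            elif unicodedata.category(text[j]).startswith("M"):
                j += 1  # ordinary combining mark
            else:
                break
        clusters.append(text[i:j])
        i = j
    return clusters


# KHA, COENG, MO, SIGN AE, RO: the first four code points form one cluster.
word = "\u1781\u17d2\u1798\u17c2\u179a"
print(len(split_clusters(word)))  # 2
```

ZWSP and WJ are general category Cf, so in this sketch each comes out as a cluster of its own, which makes them easy to spot when counting 3 clusters on either side.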
For normal string boundaries (change of script, spaces, etc.) there is a potential 3 cluster inhibition before and after the boundary. But the inhibiting behaviour is only sustained if there is another boundary before or after the boundary such that its potential or actual inhibition overlaps the 3 cluster potential inhibition. So, for example, if there is a space followed by 5 clusters and then another space, the two potential inhibition ranges overlap, the inhibition becomes actual, and there will be no dictionary breaks in that run. But if the string is extended to 6 clusters, then the ranges don't overlap, they collapse, and dictionary breaking can occur anywhere in the string.
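The overlap rule can be sketched for a single run of clusters between two boundaries. This is an illustration only, under stated assumptions: the name inhibited_clusters is hypothetical, a "hard" edge stands for ZWSP or WJ (always inhibits), a "soft" edge stands for a space or script change (inhibits only on overlap), and the result is the set of cluster indices covered by an actual inhibition. How those clusters map onto exact break positions is left to the patch.

```python
WINDOW = 3  # clusters of inhibition on each side of a boundary


def inhibited_clusters(n, left_hard, right_hard):
    """Return indices of clusters (0..n-1) under actual inhibition.

    n          -- number of clusters in the run between two boundaries
    left_hard  -- True if the left boundary is ZWSP/WJ, False if soft
    right_hard -- True if the right boundary is ZWSP/WJ, False if soft
    """
    left = set(range(0, min(WINDOW, n)))
    right = set(range(max(0, n - WINDOW), n))
    overlap = bool(left & right)

    inhibited = set()
    # A hard boundary always inhibits; a soft one only when the two
    # ranges overlap, otherwise its potential inhibition collapses.
    if left_hard or overlap:
        inhibited |= left
    if right_hard or overlap:
        inhibited |= right
    return inhibited


print(sorted(inhibited_clusters(5, False, False)))  # [0, 1, 2, 3, 4]
print(sorted(inhibited_clusters(6, False, False)))  # []
print(sorted(inhibited_clusters(6, True, False)))   # [0, 1, 2]
```

The first two calls reproduce the space + 5 clusters + space and space + 6 clusters + space cases; the third shows a ZWSP on the left keeping its 3 cluster inhibition while the space on the right collapses.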
A further change is that there is a class of characters (etc., repeat) which should never break following a word. Such breaks are inhibited.
This change is being implemented as a patch against the ICU library in libo, where we can test it and play with it with real data and real projects. If there is a user consensus that this works well, we will propose it as a change to ICU.
The change has been written to be generic so that it could, potentially, be used for other scripts should that be found to be beneficial.
While we are at it, there are plans to review the Khmer breaking dictionary.
Watch this space for a patch/gerrit commit so that you can go and play.
The patch is available at https://gerrit.libreoffice.org/20655 with a Change-Id of I36a97e0d6dffd536ab53255d53b9e7babbd0bc84
Some other things to notice about it.
It has changed the handling of unknown text. This may not be ideal, but if the general breaking of Khmer has problems, perhaps an algorithm closer to the CJK breaker is needed.
Scanning for ZWSP or appropriate boundaries happens even through open and closing punctuation.
WJ may occur in the middle of a word where it will be ignored when searching the dictionary.
Adding more options is never the right way to solve a problem. But luckily the suggested change does not do that;)
The patch has moved to 20748: Change-Id: Iec71ad4918cd333f0a44d372017ecee300e3aca9.
In addition to providing a solution to handle ZWSP and WJ, the code has been refactored to use a frequency-based dictionary as per the CJK breaker. The dictionary breaker also handles breaks at run edges appropriately for enclosing punctuation, depending on whether it is word breaking (exclude enclosing punctuation) or line breaking (include enclosing punctuation).
The change is a rather large patch set in external/icu. Hopefully we can agree on things and push this up to ICU in due course.
The changes to fix this bug have been applied to master (5.2). The precise patch is: fbb00383d82da5ce375f1b034d3fb9ebdd9a8f0e
The changes are available in daily builds, etc.
In summary, here are the changes to the Khmer line breaking algorithm:
1. ZWSP and WJ introduce a 3 cluster break exclusion around them
2. Spaces and other boundaries only introduce a 3 cluster break exclusion if that exclusion overlaps another exclusion.
3. Open and closing punctuation are ignored for the purposes of exclusion identification.
4. Coengs are handled correctly.
5. The dictionary is frequency based and a maximal match algorithm is used to identify word breaks.
6. The dictionary has automatic common misspelling confusions added to give sensible behaviours for common misspellings.
The final result is a much smaller and more accurate dictionary as well as a much improved algorithm.
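For readers unfamiliar with frequency-based maximal matching, here is a rough Python sketch of one common reading of it: choose the segmentation with the fewest words, and break ties by preferring more frequent words. This is not the algorithm from the patch. The function name segment, the UNKNOWN_COST penalty, and the toy Latin dictionary are all made up for illustration, and a real Khmer breaker would step through clusters rather than single code points.

```python
import math

UNKNOWN_COST = 20.0  # hypothetical penalty for a unit not in the dictionary


def segment(text, freq):
    """Split text into the fewest dictionary words, ties broken by frequency.

    freq maps each known word to a positive count. Units that match no
    dictionary word are emitted one at a time with a fixed penalty.
    """
    total = sum(freq.values())
    max_len = max(map(len, freq), default=1)
    n = len(text)
    # best[i] = (word count, total cost, start of last word) for text[:i]
    best = [(0, 0.0, 0)] + [None] * n
    for end in range(1, n + 1):
        candidates = []
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in freq:
                cost = -math.log(freq[piece] / total)
            elif end - start == 1:
                cost = UNKNOWN_COST
            else:
                continue
            words, prev_cost, _ = best[start]
            candidates.append((words + 1, prev_cost + cost, start))
        best[end] = min(candidates)

    # Walk back through the stored start positions to recover the words.
    pieces = []
    end = n
    while end > 0:
        start = best[end][2]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


toy = {"the": 50, "them": 10, "men": 20, "en": 5}
print(segment("themen", toy))  # ['the', 'men']
```

Both "the men" and "them en" use two words here, so the frequencies decide the tie in favour of the more common pair.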
Adding a link for completeness =) http://cgit.freedesktop.org/libreoffice/core/commit/?id=fbb00383d82da5ce375f1b034d3fb9ebdd9a8f0e