59448 – Allow users to turn off automatic ICU breakiterator

Status:	RESOLVED FIXED

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Writer
Version: (earliest affected)	4.1.2.3 release
Hardware:	Other All

Importance:	medium enhancement
Assignee:	Not Assigned

URL:
Whiteboard:	target:5.2.0
Keywords:

Depends on:
Blocks:	ICU

Reported:	2013-01-16 04:24 UTC by Nathan Wells
Modified:	2022-05-15 00:27 UTC
CC List:	5 users

See Also:	52020 59447
Crash report or crash signature:

Description Nathan Wells 2013-01-16 04:24:57 UTC

Problem description: Currently for languages that have no spaces (like Thai and Khmer), the ICU breakiterator is turned on and cannot be turned off without changing code within LibreOffice. Since the breakiterators for both languages are not 100% accurate, this can cause problems for users that want precise control of word-breaks for spelling and line-breaks and would rather input zero-width spaces manually.



Desired behavior: Add a check-box option to turn the ICU breakiterator on or off in Options->Language Settings->Complex Text Layout if the current CTL language supports the ICU breakiterator.
The default should be enabled.
              
Operating System: All
Version: 3.6.4.3 release

Comment 1 bfoman (inactive) 2013-06-21 08:48:21 UTC

Enhancement request.

Comment 2 Chomneau 2013-06-21 09:07:01 UTC

Yes, this would be helpful for the Khmer language. Please help!

-Men Chomneau

Comment 3 EricP 2013-10-16 06:37:51 UTC

The ICU BreakIterator creates chaos for minority languages written in Khmer script, since it inserts line breaks at all the wrong places, even when the text has been typed with ZWSP between words.

This simple option would fix the problem if LibreOffice reverted to the pre-3.6.0 behavior of breaking lines only at ZWSP, space, and punctuation.

Comment 4 Jean-Baptiste Faure 2013-11-11 11:12:23 UTC

Hi Khaled, do you know if there is any reason to keep this RFE in unconfirmed status?
Thank you very much for your help.

Best regards. JBF

Comment 5 Khaled Hosny 2013-11-11 18:17:58 UTC

I know very little about this area of LibreOffice, but as an enhancement request there is no need to mark it as unconfirmed.

Comment 6 Nathan Wells 2015-08-17 03:01:16 UTC

It would be great to see this feature added to LibreOffice. The bug still exists and this feature is still necessary for many Khmer and minority language users (who have stopped using LibreOffice because ICU is enabled for line-breaking and spell checking).
Thanks!

Comment 7 martin_hosken 2015-12-12 02:35:56 UTC

Here is a high level description of a proposed change for 5.2 to handle Khmer linebreaking.

The new algorithm is as follows:

The insertion of a ZWSP (U+200B) introduces a line break opportunity directly following the ZWSP, but it also inhibits any dictionary-based breaks up to 3 clusters before and after the ZWSP. A cluster is defined as a base + medials. A medial is any general category M character and also a coeng+base sequence. Likewise, the insertion of a WJ (U+2060) inhibits a line break opportunity at that point and up to 3 clusters before and after.
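To make the cluster and inhibition rules above concrete, here is a minimal Python sketch. It is not the ICU patch: the function names split_clusters and inhibited_clusters are hypothetical, the coeng handling assumes that whatever follows a coeng is a base, and reading "up to 3 clusters" as the 3 clusters on each side of the marker is an assumption.

```python
import unicodedata

COENG = "\u17D2"  # KHMER SIGN COENG
ZWSP = "\u200B"   # ZERO WIDTH SPACE
WJ = "\u2060"     # WORD JOINER


def split_clusters(text):
    """Split text into clusters: one base followed by its medials.

    A medial is any general category M character, or a coeng plus the
    character after it (assumed here to be a base; a simplification).
    """
    clusters = []
    i = 0
    while i < len(text):
        j = i + 1  # text[i] is taken as the base
        while j < len(text):
            if text[j] == COENG and j + 1 < len(text):
                j += 2  # coeng + base stays inside the cluster
            elif unicodedata.category(text[j]).startswith("M"):
                j += 1
            else:
                break
        clusters.append(text[i:j])
        i = j
    return clusters


def inhibited_clusters(clusters, radius=3):
    """Return indices of clusters within `radius` clusters of a ZWSP or WJ.

    Dictionary breaks would be suppressed around these clusters
    (assumed interpretation of the 3 cluster rule).
    """
    inhibited = set()
    for k, cluster in enumerate(clusters):
        if cluster in (ZWSP, WJ):
            lo = max(0, k - radius)
            hi = min(len(clusters), k + radius + 1)
            inhibited.update(i for i in range(lo, hi) if i != k)
    return inhibited
```

For example, split_clusters("\u1780\u17D2\u179A\u17BB\u1798") keeps the coeng sequence and the vowel sign with the first base and returns two clusters.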

For normal string boundaries (change of script, spaces, etc.) there is a potential 3 cluster inhibition before and after the boundary. But the inhibiting behaviour is only sustained if there is another boundary before or after the boundary such that its potential or actual inhibition overlaps the 3 cluster potential boundary. So, for example, if there is a space followed by 5 clusters then another space, the two potential inhibition ranges overlap and the inhibition becomes actual and there will be no dictionary breaks in that run. But if the string extends to be 6 clusters, then the ranges don't overlap and they collapse and dictionary breaking can occur anywhere in the string.
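The 5 versus 6 cluster example can be checked with a small Python sketch. It only models a run that sits between two ordinary boundaries (for instance two spaces), and the function name run_is_inhibited is hypothetical, not something from the patch.

```python
def run_is_inhibited(n_clusters, radius=3):
    """Return True if the potential inhibition ranges of the two
    boundaries enclosing a run of n_clusters clusters overlap."""
    # Clusters covered by the leading boundary's potential range.
    leading = set(range(min(radius, n_clusters)))
    # Clusters covered by the trailing boundary's potential range.
    trailing = set(range(max(0, n_clusters - radius), n_clusters))
    # Inhibition becomes actual only when the two ranges share a cluster.
    return bool(leading & trailing)
```

With the default radius of 3, a run of 5 clusters is inhibited (both ranges cover the middle cluster), while a run of 6 clusters is not, so dictionary breaking applies to the whole run.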

A further change is that there is a class of characters (etc., repeat) which should never break following a word. Such breaks are inhibited.

This change is being implemented as a patch against the ICU library in libo, where we can test it and play with it with real data and real projects. If there is a user consensus that this works well, we will propose it as a change to ICU.

The change has been written to be generic so that it could, potentially, be used for other scripts should that be found to be beneficial.

While we are at it, there are plans to review the Khmer breaking dictionary.

Watch this space for a patch/gerrit commit so that you can go and play.

Comment 8 martin_hosken 2015-12-12 04:46:39 UTC

The patch is available at https://gerrit.libreoffice.org/20655 with a SHA of I36a97e0d6dffd536ab53255d53b9e7babbd0bc84

Some other things to notice about it.

It has changed the handling of unknown text. This may not be ideal, but if the general breaking of Khmer has problems, perhaps an algorithm closer to the CJK breaker is needed.

Scanning for ZWSP or appropriate boundaries happens even through open and closing punctuation.

WJ may occur in the middle of a word where it will be ignored when searching the dictionary.

Comment 9 How can I remove my account? 2015-12-17 06:42:41 UTC

Adding more options is never the right way to solve a problem. But luckily the suggested change does not do that;)

Comment 10 martin_hosken 2015-12-17 09:32:08 UTC

The patch has moved to 20748: Change-Id: Iec71ad4918cd333f0a44d372017ecee300e3aca9.

In addition to providing a solution to handle ZWSP and WJ, the code has been refactored to use a frequency-based dictionary as per the CJK breaker. The dictionary breaker also handles run edge breaks appropriately for enclosing punctuation, depending on whether word breaking (exclude enclosing punctuation) or line breaking (include enclosing punctuation) is being done.

The change is a rather large patch set in external/icu. Hopefully we can agree on things and push this up to ICU in due course.

Comment 11 martin_hosken 2016-01-06 04:21:05 UTC

The changes to fix this bug have been applied to master (5.2). The precise patch is: fbb00383d82da5ce375f1b034d3fb9ebdd9a8f0e

The changes are available in daily builds, etc.

In summary here are the changes to the Khmer line breaking algorithm:

1. ZWSP and WJ introduce a 3 cluster break exclusion around them
2. Spaces and other boundaries only introduce a 3 cluster break exclusion if that exclusion overlaps another exclusion.
3. Open and closing punctuation are ignored for the purposes of exclusion identification.
4. Coengs are handled correctly.
5. The dictionary is frequency based and a maximal match algorithm is used to identify word breaks.
6. The dictionary has automatic common misspelling confusions added to give sensible behaviours for common misspellings.

The final result is a much smaller and more accurate dictionary as well as a much improved algorithm.
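As a rough illustration of point 5, the following Python sketch shows one common reading of "maximal match" over a frequency dictionary: pick the segmentation with the fewest words, and break ties by the highest summed frequency. This is an assumption for illustration only; the actual scoring in the patch may differ, and the segment function and the toy dictionary are hypothetical.

```python
def segment(text, freq):
    """Segment text into dictionary words.

    Prefers the fewest words; ties go to the higher total frequency.
    Returns None if the text cannot be covered by dictionary words.
    """
    # best[i] holds (word_count, -total_frequency, words) for text[:i].
    best = [None] * (len(text) + 1)
    best[0] = (0, 0, [])
    max_len = max(map(len, freq))
    for i in range(len(text)):
        if best[i] is None:
            continue  # text[:i] cannot be segmented
        count, neg_freq, words = best[i]
        for j in range(i + 1, min(len(text), i + max_len) + 1):
            word = text[i:j]
            if word in freq:
                cand = (count + 1, neg_freq - freq[word], words + [word])
                # Compare on (word_count, -total_frequency) only.
                if best[j] is None or cand[:2] < best[j][:2]:
                    best[j] = cand
    return best[-1][2] if best[-1] else None


# Toy Latin-script dictionary, purely for illustration.
toy_freq = {"the": 100, "them": 10, "theme": 5, "me": 50, "men": 20, "n": 1}
result = segment("themen", toy_freq)
```

Here both "the" + "men" and "theme" + "n" use two words, so the frequency tie-break selects "the" + "men".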

Comment 12 Adolfo Jayme Barrientos 2016-01-06 09:00:56 UTC

Adding a link for completeness =) http://cgit.freedesktop.org/libreoffice/core/commit/?id=fbb00383d82da5ce375f1b034d3fb9ebdd9a8f0e