Bug 49885 - sync custom breakiterator rules with icu originals
Status: ASSIGNED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version (earliest affected): Master old -3.6
Hardware: Other All
Importance: medium normal
Assignee: Jonathan Clark
URL:
Whiteboard: target:24.8.0
Keywords:
Depends on:
Blocks: ICU
Reported: 2012-05-13 14:54 UTC by Caolán McNamara
Modified: 2024-04-17 11:29 UTC

Description Caolán McNamara 2012-05-13 14:54:17 UTC
http://cgit.freedesktop.org/libreoffice/core/tree/i18npool/source/breakiterator/data/README

We have a bunch of breakiterator rules that are used to find the right place to break a line or word etc.

They are all derived from originals bundled into icu. The "master" versions can be found via
svn checkout
http://source.icu-project.org/repos/icu/icu/trunk/source/data/brkitr 
(They no longer appear in the icu tarballs, but are in icu's svn)

At various stages these copies have been customized and are now horribly out of sync. It's unclear which diffs from the base versions are deliberate and which are now accidental :-(

What's needed is a review of the various issues referenced in the commits to our breakiterator rules that caused customizations, to see if those are still relevant or have been overtaken by changes in later unicode specifications. Ideally we would then write regression tests for them (see i18npool/qa/cppunit/test_breakiterator.cxx). If any are still relevant, apply those changes back on top of the latest versions from icu; otherwise simply drop the rules entirely and fall directly back to the built-in icu ones.
Comment 1 Björn Michaelsen 2013-10-04 18:47:15 UTC
adding LibreOffice developer list as CC to unresolved EasyHacks for better visibility.

see e.g. http://nabble.documentfoundation.org/minutes-of-ESC-call-td4076214.html for details
Comment 2 Robinson Tryon (qubit) 2015-12-14 07:02:12 UTC Comment hidden (obsolete)
Comment 3 Robinson Tryon (qubit) 2016-02-18 14:52:24 UTC Comment hidden (obsolete)
Comment 4 Jonah Janzen 2024-03-20 01:53:17 UTC
The ICU upstream has now migrated to GitHub: https://github.com/unicode-org/icu/tree/main/icu4c/source/data/brkitr

Since this ticket was opened, ICU's brkitr implementation has changed drastically, and I believe it would require a nearly complete rewrite of the LibreOffice BreakIterator system to fully sync to it. The rules are structured in a totally new format with less coupling between languages and rules, and the API of the base classes is quite different as well.

This is, unfortunately, outside of the scope of what I thought this easy hack would be, so I am unassigning myself.
Comment 5 Hossein 2024-04-10 13:19:55 UTC
Removing EasyHack tag, as per above discussion in comment 4.

@Jonathan:
I thought this might be interesting for you. Please take a look.
Comment 6 Jonathan Clark 2024-04-11 23:31:38 UTC
I've reviewed the outstanding customizations, and added characteristic tests for those that look pertinent. The test and documentation changes are here:

https://gerrit.libreoffice.org/c/core/+/166017

This changeset doesn't include any rule changes, which will be investigated as a separate changeset.
Comment 7 Jonathan Clark 2024-04-15 20:39:21 UTC
As part of this task, I have evaluated the state of the CJK BreakIterator.

Currently, LibreOffice uses a bespoke CJK BreakIterator with custom dictionaries for Chinese and Japanese. The code dates back to 2002, and the dictionary data was added in 2004. I am unsure what motivated this custom implementation; the original documentation for this effort was on the internal StarOffice bug tracker, and I believe it has been lost.

Since then, ICU has moved on to a more sophisticated approach based on frequency analysis, using a unified Chinese-Japanese frequency dictionary originally created for the Chromium project. It might sound linguistically dubious to combine so many languages into one dictionary, but this approach is used practically everywhere today, and is what most users will expect for e.g. double-click word selection. Doing it this way also results in a smaller dictionary.

The main benefit of shipping our own dictionary is the ability to customize it. However,

- The Chinese dictionary has not been modified since it was originally added in 2004.
- The Japanese dictionary has only been modified twice:
  - 'shutdown' was added, a common katakana loanword (already in ICU dictionary)
  - 'reiwa', the current regnal epoch (also already in the ICU dictionary)

Besides being rarely customized, these dictionaries also haven't been regularly maintained. The unified ICU dictionary includes 155,848 words that are not present in either of the custom LibreOffice dictionaries.

The LibreOffice dictionaries do, however, include 195,769 words that are not present in the ICU dictionary. In order to assess potential user impact from removing these words, I spot-checked the differences.
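
As a rough illustration of how counts like these can be obtained, here is a minimal Python sketch that compares two word lists as sets. The function name `dictionary_diff` and the tiny sample lists are hypothetical; the real dictionaries are stored in their own file formats and would need to be parsed into plain word lists first.

```python
def dictionary_diff(icu_words, lo_words):
    """Return (only_in_icu, only_in_lo) as sets of words."""
    icu = set(icu_words)
    lo = set(lo_words)
    return icu - lo, lo - icu


# Tiny made-up samples for illustration, not real dictionary contents.
icu_sample = ["北海道", "工業", "令和"]
lo_sample = ["令和", "ねずみおとし", "AT&T"]

only_icu, only_lo = dictionary_diff(icu_sample, lo_sample)
print(len(only_icu), "entries only in the ICU sample")
print(len(only_lo), "entries only in the LO sample")
```

Sorting or sampling the two resulting sets is then enough to spot-check what kinds of entries would be lost or gained.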

Some example classes of entries:

- The LO dictionaries include a large number of hiragana entries for words that are usually written with kanji or a combination of scripts. For example, 'rat trap' ねずみおとし (ネズミ落し). In theory, doing this makes it easier to edit text consisting only of hiragana without spaces, which is technically a valid way to write Japanese. This approach is error-prone, though, and in practice people avoid writing this way because it's too difficult to read, even for humans.

- The LO dictionaries include a number of other entries of questionable value, like "AT&T" or "NTTソフトウェア". I think normal rules-based boundary analysis is sufficient for words like these (even if all it's doing is breaking on a script change or on punctuation).

- The LO dictionaries also include many compound entries like "通産省工業技術院北海道工業開発試験所". This is not a word, this is a noun phrase. If I double-click on 北海道 or 工業 in this phrase, I would expect to select only those words. Currently, in LO, it will select the entire passage. (This is an unusually extravagant example; there are plenty of subtler ones that arguably should be treated as multiple words too, like 植松町.)

Unless someone else has comments to the contrary, I think it would be best to delete these custom dictionaries and move to the upstream implementation for CJ word boundary analysis. Note that the current line breaking behavior must be preserved, as it implements hanging punctuation and forbidden characters. Only the word boundary behavior should be changed.
Comment 8 Hossein 2024-04-15 21:55:46 UTC
@Jonathan: Thank you for the progress report for this issue.

I think you may find some useful information if you search for BreakIterator in the AOO Wiki. You may have seen it, but just to make sure:

Implementing a New Locale
https://wiki.openoffice.org/wiki/Documentation/DevGuide/OfficeDev/Implementing_a_New_Locale#XBreakIterator

It is a part of the DevGuide, which is now imported into TDF Wiki:
https://wiki.documentfoundation.org/Documentation/DevGuide/Office_Development#XBreakIterator_2

Also, this has some relevant information:
https://wiki.openoffice.org/wiki/LoadICUBreakIterator

You may find some of the old globalization-related specifications here:
https://www.openoffice.org/specs/g11n/index.html

I see at least one document related to CJK word breaking there.

The main spec page is here:
https://www.openoffice.org/specs/

You may also find some related information here:

Universal I18n Framework for Office Applications, Technical Overview
https://svn.apache.org/repos/asf/openoffice/ooo-site/trunk/content/l10n/archive/Universal_i18n_framework.pdf
Comment 9 Jonathan Clark 2024-04-15 23:19:25 UTC
(In reply to Jonathan Clark from comment #7)
> Unless someone else has comments to the contrary, I think it would be best
> to delete these custom dictionaries and move to the upstream implementation
> for CJ word boundary analysis. Note that the current line breaking behavior
> must be preserved, as it implements hanging punctuation and forbidden
> characters. Only the word boundary behavior should be changed.

I've posted a candidate patch to do this here:

https://gerrit.libreoffice.org/c/core/+/166136
Comment 10 Jonathan Clark 2024-04-16 14:49:06 UTC
(In reply to Hossein from comment #8)
> Implementing a New Locale
> https://wiki.openoffice.org/wiki/Documentation/DevGuide/OfficeDev/
> Implementing_a_New_Locale#XBreakIterator

@Hossein: Thank you for the doc pointers.

This one in particular touches on a question I wanted to ask: Are these language-specific BreakIterators part of a user-facing API?

Currently, there is a custom BreakIterator implementation for Thai. All it does is implement grapheme clusters. ICU now supports this well, and their implementation has been used for other languages all along (Korean, Tamil, etc.). I haven't fully validated this, but I believe this custom iterator is redundant and I would like to delete it. However, that could be an issue if these UNO objects are user-visible.
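
To make "grapheme clusters" concrete, here is a minimal Python sketch that groups each base character with the nonspacing marks that follow it. This is a deliberate simplification, and it is neither the ICU algorithm nor the LibreOffice Thai iterator: real grapheme segmentation follows UAX #29 and covers more cases (spacing marks, for example). The function name `simple_clusters` is hypothetical.

```python
import unicodedata


def simple_clusters(text):
    """Group each character with the nonspacing/enclosing marks after it.

    Simplified approximation only; the full rules are in UAX #29.
    """
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Me")
        if is_mark and clusters:
            clusters[-1] += ch  # attach the mark to the preceding base
        else:
            clusters.append(ch)
    return clusters


# Thai "here": each consonant carries a vowel sign and a tone mark.
word = "\u0e17\u0e35\u0e48\u0e19\u0e35\u0e48"
print(len(word), "code points,", len(simple_clusters(word)), "clusters")
```

A caret moving through this word should stop between clusters, not between individual code points, which is the behavior a grapheme-level iterator provides.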
Comment 11 Jonathan Clark 2024-04-16 17:13:25 UTC
After some more testing (manual and automated), I'm satisfied that the custom grapheme boundary analysis for Thai is superfluous. I've posted a patch to remove the Thai BreakIterator here:

https://gerrit.libreoffice.org/c/core/+/166156

This change completely removes BreakIterator_th from the codebase. As far as I can tell, the sole intention behind using the service registry stuff was so BreakIteratorImpl could dynamically look up these iterators by name at run-time. I don't think these were meant to be user-visible, and they don't show up in either `udkapi.idl` or `offapi.idl`. I suspect it is safe to do this, but confirmation is appreciated.
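
The run-time lookup by name described above can be pictured with a minimal Python sketch: a registry keyed by name, consulted per language, with a fallback to a generic iterator. Everything here is a hypothetical stand-in (`REGISTRY`, `lookup_break_iterator`, the class names, and the assumed "BreakIterator_" plus language naming scheme); it does not reproduce the actual UNO service registration code.

```python
class GenericBreakIterator:
    """Stand-in for a default, ICU-backed iterator (hypothetical)."""
    name = "BreakIterator_Unicode"


class JapaneseBreakIterator(GenericBreakIterator):
    """Stand-in for a language-specific iterator (hypothetical)."""
    name = "BreakIterator_ja"


# Hypothetical registry: service name -> implementation class.
REGISTRY = {cls.name: cls for cls in (GenericBreakIterator, JapaneseBreakIterator)}


def lookup_break_iterator(language):
    """Find an iterator by name at run time, falling back to the generic one."""
    cls = REGISTRY.get("BreakIterator_" + language, GenericBreakIterator)
    return cls()


print(type(lookup_break_iterator("ja")).__name__)
print(type(lookup_break_iterator("th")).__name__)
```

In this sketch, a language with no registered entry simply falls through to the generic iterator, so removing an entry changes which implementation is returned without changing the caller.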
Comment 12 Caolán McNamara 2024-04-17 10:17:38 UTC
The motivation for at least the first few rounds of customization was probably compatibility with competitor office suites.
Comment 13 Commit Notification 2024-04-17 11:29:00 UTC
Jonathan Clark committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/fb94cc0d1348140d03c2826771c57255ff74a94a

tdf#49885 Reviewed BreakIterator customizations

It will be available in 24.8.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.