http://cgit.freedesktop.org/libreoffice/core/tree/i18npool/source/breakiterator/data/README

We have a bunch of breakiterator rules that are used to find the right place to break a line, word, etc. They are all derived from originals bundled into ICU; the "master" versions can be found via

svn checkout http://source.icu-project.org/repos/icu/icu/trunk/source/data/brkitr

(They no longer appear in the ICU tarballs, but are in ICU's svn.)

At various stages these copies have been customized and are now horribly out of sync. It's unclear which diffs from the base versions are deliberate and which are now accidental :-(

What's needed is a review of the various issues referenced in the commits to our breakiterator rules that caused customizations, to see whether those are still relevant or have been overtaken by changes in later Unicode specifications. Ideally, regression tests would then be written for them (see i18npool/qa/cppunit/test_breakiterator.cxx). If any are still relevant, apply those changes back on top of the latest versions from ICU; otherwise simply drop the rules entirely and fall back directly to the built-in ICU ones.
adding LibreOffice developer list as CC to unresolved EasyHacks for better visibility. see e.g. http://nabble.documentfoundation.org/minutes-of-ESC-call-td4076214.html for details
Migrating Whiteboard tags to Keywords: (EasyHack SkillCpp DifficultyInteresting TopicCleanup) [NinjaEdit]
JanI is default CC for Easy Hacks (Add Jan; remove LibreOffice Dev List from CC) [NinjaEdit]
The ICU upstream has now migrated to GitHub:
https://github.com/unicode-org/icu/tree/main/icu4c/source/data/brkitr

Since this ticket was opened, ICU's brkitr implementation has changed drastically, and I believe it would require a nearly complete rewrite of the LibreOffice BreakIterator system to fully sync to it. The rules are structured in a totally new format with less coupling between languages and rules, and the API of the base classes is quite different as well. This is, unfortunately, outside the scope of what I thought this easy hack would be, so I am unassigning myself.
Removing EasyHack tag, as per above discussion in comment 4. @Jonathan: I thought this might be interesting for you. Please take a look.
I've reviewed the outstanding customizations and added characterization tests for those that look pertinent. The test and documentation changes are here:
https://gerrit.libreoffice.org/c/core/+/166017

This changeset doesn't include any rule changes, which will be investigated in a separate changeset.
As part of this task, I have evaluated the state of the CJK BreakIterator.

Currently, LibreOffice uses a bespoke CJK BreakIterator with custom dictionaries for Chinese and Japanese. The code dates back to 2002, and the dictionary data was added in 2004. I am unsure what motivated this custom implementation; the original documentation for this effort was on the internal StarOffice bug tracker, and I believe it has been lost.

Since then, ICU has moved on to a more sophisticated approach based on frequency analysis, using a unified Chinese-Japanese frequency dictionary originally created for the Chromium project. It might sound linguistically dubious to combine so many languages into one dictionary, but this approach is used practically everywhere today, and is what most users will expect for e.g. double-click word selection. Doing it this way also results in a smaller dictionary.

The main benefit of shipping our own dictionary is the ability to customize it. However,

- The Chinese dictionary has not been modified since it was originally added in 2004.
- The Japanese dictionary has only been modified twice:
  - 'shutdown' was added, a common katakana loanword (already in the ICU dictionary)
  - 'reiwa', the current regnal epoch (also already in the ICU dictionary)

Besides rarely being customized, these dictionaries also haven't been regularly maintained. The unified ICU dictionary includes 155,848 words that are not present in either of the custom LibreOffice dictionaries.

The LibreOffice dictionaries do, however, include 195,769 words that are not present in the ICU dictionary. In order to assess potential user impact from removing these words, I spot-checked the differences. Some example classes of entries:

- The LO dictionaries include a large number of hiragana entries for words that are usually written with kanji or a combination of scripts. For example, 'rat trap' ねずみおとし (ネズミ落し). In theory, doing this makes it easier to edit text consisting only of hiragana without spaces, which is technically a valid way to write Japanese. This approach is error-prone, though, and in practice people avoid writing this way because it's too difficult to read, even for humans.
- The LO dictionaries include a number of other entries of questionable value, like "AT&T" or "NTTソフトウェア". I think normal rules-based boundary analysis is sufficient for words like these (even if all it's doing is breaking on a script change or on punctuation).
- The LO dictionaries also include many compound entries like "通産省工業技術院北海道工業開発試験所". This is not a word; it is a noun phrase. If I double-click on 北海道 or 工業 in this phrase, I would expect to select only those words. Currently, in LO, it will select the entire passage. (This is an unusually extravagant example; there are plenty of subtler ones that arguably should be treated as multiple words too, like 植松町.)

Unless someone else has comments to the contrary, I think it would be best to delete these custom dictionaries and move to the upstream implementation for CJ word boundary analysis. Note that the current line breaking behavior must be preserved, as it implements hanging punctuation and forbidden characters. Only the word boundary behavior should be changed.
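To make the compound-entry point concrete, here is a toy sketch in Python. It assumes a plain greedy longest-match lookup, which is only a stand-in for illustration: it is not the LibreOffice implementation, and it is not ICU's frequency-based approach either. The function names and the small word list are hypothetical.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation (illustrative assumption only).

    Characters not covered by any dictionary entry become one-character words.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at i first, then shorter ones.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                break
        else:
            j = i + 1  # no entry matched: emit a single character
        words.append(text[i:j])
        i = j
    return words


def word_at(text, index, dictionary):
    """Return the segmented word covering the character at `index`,
    roughly what a double-click at that position would select."""
    pos = 0
    for word in segment(text, dictionary):
        if pos <= index < pos + len(word):
            return word
        pos += len(word)
    raise IndexError(index)


phrase = "通産省工業技術院北海道工業開発試験所"

# Hypothetical mini-dictionary containing only the component words.
components = {"通産省", "工業", "技術院", "北海道", "開発", "試験所"}

# The same dictionary plus the whole noun phrase as a single entry.
with_compound = components | {phrase}

click = phrase.index("海")  # a click inside 北海道

print(word_at(phrase, click, components))     # selects 北海道
print(word_at(phrase, click, with_compound))  # selects the whole phrase
```

Under this assumption, a single over-long entry is enough to swallow every shorter word it contains, which matches the selection behavior described above.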
@Jonathan: Thank you for the progress report for this issue.

I think you may find some useful information in the AOO Wiki if you search for BreakIterator there. You may have seen it, but just to make sure:

Implementing a New Locale
https://wiki.openoffice.org/wiki/Documentation/DevGuide/OfficeDev/Implementing_a_New_Locale#XBreakIterator

It is a part of the DevGuide, which is now imported into TDF Wiki:
https://wiki.documentfoundation.org/Documentation/DevGuide/Office_Development#XBreakIterator_2

Also, this has some relevant information:
https://wiki.openoffice.org/wiki/LoadICUBreakIterator

You may find some of the old globalization-related specifications here:
https://www.openoffice.org/specs/g11n/index.html

I see at least one document related to CJK word breaking there. The main spec page is here:
https://www.openoffice.org/specs/

You may also find some related information here:

Universal I18n Framework for Office Applications, Technical Overview
https://svn.apache.org/repos/asf/openoffice/ooo-site/trunk/content/l10n/archive/Universal_i18n_framework.pdf
(In reply to Jonathan Clark from comment #7)
> Unless someone else has comments to the contrary, I think it would be best
> to delete these custom dictionaries and move to the upstream implementation
> for CJ word boundary analysis. Note that the current line breaking behavior
> must be preserved, as it implements hanging punctuation and forbidden
> characters. Only the word boundary behavior should be changed.

I've posted a candidate patch to do this here:
https://gerrit.libreoffice.org/c/core/+/166136
(In reply to Hossein from comment #8)
> Implementing a New Locale
> https://wiki.openoffice.org/wiki/Documentation/DevGuide/OfficeDev/Implementing_a_New_Locale#XBreakIterator

@Hossein: Thank you for the doc pointers. This one in particular touches on a question I wanted to ask: Are these language-specific BreakIterators part of a user-facing API?

Currently, there is a custom BreakIterator implementation for Thai. All it does is implement grapheme clusters. ICU now supports this well, and their implementation has been used for other languages all along (Korean, Tamil, etc.). I haven't fully validated this, but I believe this custom iterator is redundant and I would like to delete it. However, that could be an issue if these UNO objects are user-visible.
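For readers unfamiliar with the term: grapheme cluster analysis groups a base character with the combining marks attached to it, so that cursor movement and deletion treat them as one unit. A rough sketch in Python follows, assuming a simplified rule (attach every character whose general category is a mark to the preceding character). This is only an approximation of the Unicode text segmentation rules that ICU implements and ignores several cases they cover; the function name is hypothetical, and this is not the code of the Thai iterator.

```python
import unicodedata

# General categories treated as "attaches to the previous character".
# Simplifying assumption; the full Unicode rules are more detailed.
MARK_CATEGORIES = ("Mn", "Mc", "Me")


def simple_clusters(text):
    """Split text into approximate grapheme clusters."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in MARK_CATEGORIES:
            clusters[-1] += ch  # combining mark: extend the current cluster
        else:
            clusters.append(ch)  # base character: start a new cluster
    return clusters


# Thai consonant KO KAI + vowel sign SARA I + consonant NO NU.
thai = "\u0e01\u0e34\u0e19"
for cluster in simple_clusters(thai):
    print(cluster, [f"U+{ord(c):04X}" for c in cluster])
```

Here the vowel sign U+0E34 stays attached to the consonant before it, so the three code points form two clusters.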
After some more testing (manual and automated), I'm satisfied that the custom grapheme boundary analysis for Thai is superfluous. I've posted a patch to remove the Thai BreakIterator here: https://gerrit.libreoffice.org/c/core/+/166156 This change completely removes BreakIterator_th from the codebase. As far as I can tell, the sole intention behind using the service registry stuff was so BreakIteratorImpl could dynamically look up these iterators by name at run-time. I don't think these were meant to be user-visible, and they don't show up in either `udkapi.idl` or `offapi.idl`. I suspect it is safe to do this, but confirmation is appreciated.
The motivation for at least the first few rounds of customization was probably compatibility with competitor office suites.
Jonathan Clark committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/core/commit/fb94cc0d1348140d03c2826771c57255ff74a94a tdf#49885 Reviewed BreakIterator customizations It will be available in 24.8.0. The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
I've posted a patch for the outstanding rule file upgrades here: https://gerrit.libreoffice.org/c/core/+/166273 Barring any issues, and combined with the previous patches in flight, this should complete the upgrade.
Jonathan Clark committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/core/commit/64743ee6bc9567015f164333ed9b508542017337 tdf#49885 Removed custom Thai BreakIterator It will be available in 24.8.0.
Jonathan Clark committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/core/commit/14c6cde779d64596eab0f4d3f32f181ce2243929 tdf#49885 Updated CJK BreakIterator to use ICU It will be available in 24.8.0.
Jonathan Clark committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/core/commit/44699b3de37f07090ac6fee1cd97aa76036e9700 tdf#49885 BreakIterator rule upgrades It will be available in 24.8.0.
@Jonathan: If the issue is resolved, please mark it as RESOLVED/FIXED here.
The last of the patches have been committed, so I'm marking this resolved.