Bug 95159 - Thai word wrapping of กลับ and การกระทำ
Status: RESOLVED NOTOURBUG
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 5.0.2.2 release
Hardware: Other All
Importance: medium trivial
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: Hyphenation
Reported: 2015-10-18 16:47 UTC by Brian Wilson
Modified: 2020-12-04 13:17 UTC (History)

See Also:
Crash report or crash signature:


Attachments
example text of the problem (23.32 KB, application/vnd.oasis.opendocument.text)
2015-10-18 21:48 UTC, Brian Wilson

Description Brian Wilson 2015-10-18 16:47:44 UTC
The Thai words กลับ and การกระทำ are incorrectly wrapped as ก|ลับ and การก|ระทำ.

This problem is especially noticeable as these are very common words in Thai.
Comment 1 Alex Thurgood 2015-10-18 19:07:26 UTC
@Brian: could you please provide a sample document where this problem is apparent?
Comment 2 Julien Nabet 2015-10-18 21:03:16 UTC
Samphan: thought you might be interested in this one.
Comment 3 Brian Wilson 2015-10-18 21:48:09 UTC
Created attachment 119724 [details]
example text of the problem

Here is a sample document of nonsense words illustrating the problem. The words are real Thai words, but they are randomly strung together.
Comment 4 Buovjaga 2015-10-21 05:43:41 UTC
Confirmed.

Win 7 Pro 64-bit, Version: 5.0.2.2 (x64)
Build ID: 37b43f919e4de5eeaca9b9755ed688758a8251fe
Locale: fi-FI (fi_FI)
Comment 5 Brian Wilson 2015-10-23 23:40:34 UTC
Another word containing "ก" that wraps incorrectly. This one actually broke across a page boundary.

ลูกา (Luke) wrapped as ลูก|า. Even if the name of the third Gospel is unknown, an "า" can never begin a syllable under any circumstances, ever.
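
The rule stated above (a following vowel such as "า" cannot start a syllable) can be sketched as a simple sanity check on a proposed break position. This is only an illustration: the function name `is_plausible_break` and the `NON_INITIAL` set are hypothetical, the set is assumed from the Unicode Thai block (following vowels, lakkhangyao, tone marks and other combining signs), and nothing here is taken from LibreOffice or ICU.

```python
# Assumption: these Thai code points always attach to a preceding consonant,
# so a line or syllable can never start with one of them.
NON_INITIAL = (
    {chr(cp) for cp in range(0x0E30, 0x0E3B)}    # ะ ั า ำ ิ ี ึ ื ุ ู ฺ
    | {"\u0E45"}                                 # ๅ (lakkhangyao)
    | {chr(cp) for cp in range(0x0E47, 0x0E4F)}  # ็, tone marks, ์ ํ ๎
)


def is_plausible_break(text, i):
    """Return False if a break before text[i] would start with a non-initial sign."""
    return 0 < i < len(text) and text[i] not in NON_INITIAL


print(is_plausible_break("ลูกา", 3))  # False: the break ลูก|า would start with า
print(is_plausible_break("ลูกา", 2))  # True: ก is a consonant
```

A check like this would only reject obviously invalid breaks such as ลูก|า; it would not by itself catch ก|ลับ or การก|ระทำ, where the text after the break starts with a consonant.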
Comment 6 QA Administrators 2016-11-08 11:33:49 UTC Comment hidden (obsolete)
Comment 7 QA Administrators 2020-10-24 05:12:47 UTC Comment hidden (obsolete)
Comment 8 Samphan Raruenrom 2020-12-04 12:12:06 UTC
LibreOffice relies on ICU to break Thai words. ICU uses a greedy, dictionary-based longest-matching algorithm for Thai word segmentation: it stops at the first segmentation it finds, here "การก|ระ|ทำ" (การก happens to be a valid Thai word).

To fix this issue, one would need to implement a slightly better maximal-matching algorithm in ICU. Even that would still fail in some rarer cases.
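
As a rough illustration of the difference, here is a toy sketch in Python. It is not ICU's code: the dictionary `TOY_DICT` and both function names are hypothetical, the dictionary holds only the handful of entries needed for this example, and "maximal matching" is taken here to mean "choose the segmentation with the fewest words", which is an assumption about what is meant.

```python
from functools import lru_cache

# Hypothetical toy dictionary; a real Thai dictionary is far larger.
TOY_DICT = {"การ", "การก", "กระ", "กระทำ", "ระ", "ทำ"}


def greedy_longest_match(text, words=TOY_DICT):
    """At each position take the longest dictionary word and never backtrack."""
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        match = next(
            (text[i:j] for j in range(len(text), i, -1) if text[i:j] in words),
            text[i],
        )
        out.append(match)
        i += len(match)
    return out


def fewest_words(text, words=TOY_DICT):
    """Among all full dictionary segmentations, return one with the fewest words."""

    @lru_cache(maxsize=None)
    def best(i):
        if i == len(text):
            return ()
        candidates = []
        for j in range(i + 1, len(text) + 1):
            if text[i:j] in words:
                rest = best(j)
                if rest is not None:
                    candidates.append((text[i:j],) + rest)
        return min(candidates, key=len) if candidates else None

    result = best(0)
    return list(result) if result is not None else None


print(greedy_longest_match("การกระทำ"))  # ['การก', 'ระ', 'ทำ']
print(fewest_words("การกระทำ"))          # ['การ', 'กระทำ']
```

With this toy dictionary the greedy pass commits to การก because it is longer than การ, which forces the bad break การก|ระ|ทำ; the fewest-words search considers the whole string and prefers การ|กระทำ.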

BTW, this kind of problem occurs very rarely in real-world Thai, because the ambiguous sequences are rare and must fall at the end of a line. I have never heard a Thai speaker complain about it.
Comment 9 Buovjaga 2020-12-04 13:17:27 UTC
(In reply to Samphan Raruenrom from comment #8)

Thanks a lot. Let's close this as notourbug.