Bug 54843 - Bad righthyphenmin for 3-byte or more UTF-8 multibyte characters
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Linguistic
Version: (earliest affected) Master
Hardware: Other All
Importance: medium normal
Assignee: Not Assigned
Whiteboard: target:3.7.0
Depends on:
Reported: 2012-09-13 07:47 UTC by László Németh
Modified: 2012-09-14 08:43 UTC (History)

Telugu test example (12.36 KB, application/vnd.oasis.opendocument.text)
2012-09-13 07:47 UTC, László Németh

Description László Németh 2012-09-13 07:47:34 UTC
Created attachment 67077
Telugu test example

(From the bug report by Steven Dickson:)

There appears to be a logic error in the hnj_hyphen_rhmin function in the file hyphen.c. The function is supposed to remove hyphens from the right-hand side of a word based on the value of RIGHTHYPHENMIN defined in the hyphenation pattern file for the language. It works properly for words containing only single-byte characters, but can fail if the word contains multi-byte characters.
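To illustrate why byte positions and character positions diverge in UTF-8, here is a minimal sketch. The helper name utf8_char_count is hypothetical and is not part of hyphen.c; it simply counts every byte that is not a continuation byte (10xxxxxx).

```c
#include <stddef.h>

/* Hypothetical helper (not from hyphen.c): count the UTF-8 characters
 * in the first n bytes of s by counting every byte that is NOT a
 * continuation byte (continuation bytes have the bit pattern 10xxxxxx). */
static size_t utf8_char_count(const char *s, size_t n)
{
    size_t count = 0;
    for (size_t k = 0; k < n; k++) {
        if ((s[k] & 0xc0) != 0x80)
            count++;
    }
    return count;
}
```

For a word made of three Telugu letters, each encoded in 3 bytes, the byte length is 9 while the character count is 3, so a right-hand minimum expressed in characters cannot be applied directly to byte offsets.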

The code erroneously assumes that the last character of the word is a single-byte character and starts scanning the word at the next-to-last byte of the word. This can be corrected by initializing the character count variable, i, to 0 rather than 1 and starting the for loop with j = word_size - 1 rather than j = word_size - 2.
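The starting-position problem can be seen with a small sketch. The helper last_char_offset is hypothetical (it does not exist in hyphen.c) and assumes word_size is at least 1; it walks back from the final byte until it reaches a byte that is not a continuation byte.

```c
/* Hypothetical helper (not from hyphen.c): return the byte offset at
 * which the last UTF-8 character of word begins. Assumes word_size >= 1.
 * Continuation bytes (10xxxxxx) are skipped while walking backwards. */
static int last_char_offset(const char *word, int word_size)
{
    int j = word_size - 1;
    while (j > 0 && (word[j] & 0xc0) == 0x80)
        j--;
    return j;
}
```

For an ASCII word the last character starts at word_size - 1, so beginning the scan at word_size - 2 with the count already at 1 is harmless. For a word ending in a 3-byte character, the last character starts at word_size - 3, and offset word_size - 2 lies in the middle of that character.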

The code also erroneously increments the character count variable, i, while still inside a multi-byte character. This can be corrected by incrementing i only at the first byte of a multi-byte character ((word[j] & 0xc0) == 0xc0) or at a single-byte character ((word[j] & 0x80) != 0x80).
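The two corrections described above can be sketched together as a standalone function. This is a simplified illustration, not the actual hnj_hyphen_rhmin: the name clear_right_hyphens and its reduced parameter list are hypothetical, and any other processing done by the real function is left out. It also assumes that an odd digit in hyphens[j] marks a permitted break after byte j of the word.

```c
/* Simplified sketch (hypothetical name and signature, not the real
 * hnj_hyphen_rhmin). Assumption: an odd digit in hyphens[j] means a
 * break is allowed after byte j of word.
 *
 * Clears break points until rhmin characters have been passed,
 * scanning from the last byte of the word toward the front. */
static void clear_right_hyphens(int utf8, const char *word, int word_size,
                                char *hyphens, int rhmin)
{
    int i = 0; /* characters counted so far, starting from 0 */
    int j;
    for (j = word_size - 1; i < rhmin && j > 0; j--) {
        hyphens[j] = '0';
        /* Count a character only at its first byte: either a UTF-8
         * lead byte (11xxxxxx) or a single-byte character (0xxxxxxx). */
        if (!utf8 || (word[j] & 0xc0) == 0xc0 || (word[j] & 0x80) != 0x80)
            i++;
    }
}
```

With a word of three 3-byte characters, all break points initially set to '1', and rhmin = 2, this sketch clears the break after the second character (hyphens[5]) and keeps the break after the first character (hyphens[2]), so at least two characters remain to the right of any hyphen.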

A diff of hyphen.c with the corrections follows.


<     int i = 1;
>     int i = 0;

<     for (j = word_size - 2; i < rhmin && j > 0; j--) {
>     for (j = word_size - 1; i < rhmin && j > 0; j--) {

<        if (!utf8 || (word[j] & 0xc0) != 0xc0) i++;
>        if (!utf8 || (word[j] & 0xc0) == 0xc0 || (word[j] & 0x80) != 0x80) i++;
Comment 1 László Németh 2012-09-13 08:00:10 UTC
Also fixed in the Hyphen CVS: http://hunspell.cvs.sourceforge.net/viewvc/hunspell/hyphen/
Comment 2 Not Assigned 2012-09-14 08:43:58 UTC
Laszlo Nemeth committed a patch related to this issue.
It has been pushed to "master":


fdo#54843 righthyphenmin fix (patch by Steven Dickson)

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
Affected users are encouraged to test the fix and report feedback.