Bug 54843 - Bad righthyphenmin for 3-byte or more UTF-8 multibyte characters
Summary: Bad righthyphenmin for 3-byte or more UTF-8 multibyte characters
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Linguistic
Version (earliest affected): 4.0.0.0.alpha0+ Master
Hardware: Other All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard: target:3.7.0
Keywords:
Depends on:
Blocks:
 
Reported: 2012-09-13 07:47 UTC by László Németh
Modified: 2012-09-14 08:43 UTC (History)
0 users

See Also:
Crash report or crash signature:


Attachments
Telugu test example (12.36 KB, application/vnd.oasis.opendocument.text)
2012-09-13 07:47 UTC, László Németh
Description László Németh 2012-09-13 07:47:34 UTC
Created attachment 67077
Telugu test example

(From the bug report by Steven Dickson:)

There appears to be a logic error in the hnj_hyphen_rhmin function in the file hyphen.c. The function is supposed to remove hyphens from the right-hand side of a word based on the value of RIGHTHYPHENMIN defined in the hyphenation pattern file for the language. It works properly for words containing only single-byte characters, but can fail if the word contains multi-byte characters.

 
The code erroneously assumes that the last character of the word is a single-byte character and starts scanning the word at the next-to-last byte of the word. This can be corrected by initializing the character count variable, i, to 0 rather than 1 and starting the for loop with j = word_size - 1 rather than j = word_size - 2.

 
The code also erroneously increments the character count variable, i, while still inside a multi-byte character. This can be corrected by only incrementing i when at the first byte of a multi-byte character ((word[j] & 0xc0) == 0xc0) or when at a single-byte character ((word[j] & 0x80) != 0x80).

A diff of hyphen.c with the corrections follows.

737c737
<     int i = 1;
---
>     int i = 0;
743c743
<     for (j = word_size - 2; i < rhmin && j > 0; j--) {
---
>     for (j = word_size - 1; i < rhmin && j > 0; j--) {
756c756
<        if (!utf8 || (word[j] & 0xc0) != 0xc0) i++;
---
>        if (!utf8 || (word[j] & 0xc0) == 0xc0 || (word[j] & 0x80) != 0x80) i++;
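
To illustrate the counting error, here is a minimal C sketch that applies the original and the corrected byte tests from the diff above to whole strings. The helper names (old_counts, new_counts, count_chars, demo) are hypothetical and are not part of hyphen.c. The sketch counts UTF-8 code points only and leaves out the hyphens array and the rest of the real function.

```c
#include <assert.h>
#include <stddef.h>

/* Original test: true for every byte that is NOT a UTF-8 lead byte,
   so continuation bytes (10xxxxxx) are counted as well. */
static int old_counts(unsigned char b) {
    return (b & 0xc0) != 0xc0;
}

/* Corrected test: true only for a lead byte (11xxxxxx)
   or a single-byte character (0xxxxxxx). */
static int new_counts(unsigned char b) {
    return (b & 0xc0) == 0xc0 || (b & 0x80) != 0x80;
}

/* Hypothetical helper: scan the word from its last byte to its first
   and count the bytes accepted by the given test. */
static int count_chars(const char *word, size_t word_size,
                       int (*counts)(unsigned char)) {
    int i = 0;
    for (size_t j = word_size; j > 0; j--) {
        if (counts((unsigned char)word[j - 1])) i++;
    }
    return i;
}

void demo(void) {
    const char telugu[] = "\xe0\xb0\xa4";  /* U+0C24, one 3-byte character */
    const char latin[]  = "\xc3\xa9";      /* U+00E9, one 2-byte character */

    assert(count_chars(telugu, 3, old_counts) == 2);  /* overcounted */
    assert(count_chars(telugu, 3, new_counts) == 1);
    assert(count_chars(latin, 2, old_counts) == 1);   /* right by accident */
    assert(count_chars(latin, 2, new_counts) == 1);
}
```

Under these assumptions a 2-byte character still produces exactly one increment with the original test, because its single continuation byte is counted in place of its lead byte. This may be why the problem only shows up with characters of 3 bytes or more.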
Comment 1 László Németh 2012-09-13 08:00:10 UTC
Also fixed in the Hyphen CVS: http://hunspell.cvs.sourceforge.net/viewvc/hunspell/hyphen/
Comment 2 Not Assigned 2012-09-14 08:43:58 UTC
Laszlo Nemeth committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=3d654071413bc107e0730dd31261c252f71572bf

fdo#54843 righthyphenmin fix (patch by Steven Dickson)



The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.