Bug 46950 - Hebrew: Spell-checking breaks Hebrew words at intra-word single and double quotes
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Linguistic
Version (earliest affected): unspecified
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: Spell-Checking RTL-Hebrew
Reported: 2012-03-04 01:59 UTC by Nadav Har'El
Modified: 2018-12-27 15:30 UTC (History)

See Also:
Crash report or crash signature:

Description Nadav Har'El 2012-03-04 01:59:46 UTC
It appears that before LibreOffice passes text to the spell-checker, it breaks the text into separate words. The problem is that it (apparently) does this using general, language-agnostic rules, whereas different languages may have different rules about which characters can be part of a word and which characters break words.

My problem is specifically with Hebrew spell-checking. In Hebrew, the quote characters ' and " are used not just for quoting; they also have an unrelated use as in-word characters:
1. The single-quote is used to mark foreign sounds. E.g., the word ג'ירפה has a single-quote character after the gimmel, which means it should be pronounced "j", not "g".
2. The double-quote is used inside acronyms, to mark them as such. For example מנכ"ל is the acronym for CEO. מנכ"לים is its plural. Both have quotes in the middle of the word - and these words, together with this quote, are in the dictionary.

Because of this, the Hebrew hunspell dictionary includes the following lines in he_IL.aff:

   BREAK 3
   BREAK ^"
   BREAK "$
   BREAK ^'

This means that " only breaks words when it appears at the beginning or end of a word (and ' only at the beginning). In the middle of a word, these characters never mark a word break in Hebrew. With this setting, hunspell correctly word-breaks and spell-checks Hebrew text.
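
To illustrate the effect of these three rules, here is a minimal Python sketch. It is a simplified model and not hunspell's actual implementation: the function name `apply_break_rules`, the constants and the rule handling are assumptions made for illustration, and only the `^` and `$` anchored forms quoted above are covered.

```python
# Simplified model of the three BREAK rules from he_IL.aff quoted above.
# This is NOT hunspell's real algorithm, only an illustration of the effect.
BREAK_RULES = ['^"', '"$', "^'"]

# Example words from this report.
ACRONYM = 'מנכ"ל'
GIRAFFE = "ג'ירפה"


def apply_break_rules(token, rules=BREAK_RULES):
    """Strip break characters anchored at the start (^) or end ($) of a token.

    Quote characters in the middle of the token are never touched.
    """
    changed = True
    while changed and token:
        changed = False
        for rule in rules:
            if rule.startswith("^") and token.startswith(rule[1:]):
                # Rule like ^" : drop the break character at the start.
                token = token[len(rule) - 1:]
                changed = True
            elif rule.endswith("$") and token.endswith(rule[:-1]):
                # Rule like "$ : drop the break character at the end.
                token = token[:-(len(rule) - 1)]
                changed = True
    return token


if __name__ == "__main__":
    # A quoted acronym loses only its outer quotes.
    print(apply_break_rules('"' + ACRONYM + '"') == ACRONYM)
    # A leading single quote is stripped; the inner one stays.
    print(apply_break_rules("'" + GIRAFFE) == GIRAFFE)
```

Under this model the inner quote always stays part of the word, which is what lets the dictionary entries with quotes match.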

Unfortunately, LibreOffice doesn't respect these instructions. It appears that it incorrectly breaks up the words before sending them to hunspell. The end result is that all Hebrew words which are acronyms or have foreign sounds in them are incorrectly marked as being errors, which is very annoying.
Comment 1 Urmas 2012-03-05 20:42:44 UTC
On second thought, why do you use ' and " instead of geresh/gershayim?

But the problem is that the spell-checker breaks words on them too, even if that is explicitly prohibited.
Comment 2 Nadav Har'El 2012-03-05 22:33:06 UTC
Well, it's just that despite the existence of the separate "geresh" and "gershayim" characters in Unicode, I've never seen anyone actually using them. Everyone I've seen uses the normal ASCII single-quote and double-quote characters respectively, and expects those to look fine in Hebrew fonts - and they do.

The reason people don't use the special Unicode characters is probably that there is usually no convenient way to enter them from the keyboard.

You're right that we should also check what happens with these special characters: the spell-checker shouldn't break such words, and it should accept them even though the dictionary contains words with the ASCII single-quote/double-quote, not with geresh/gershayim. If this doesn't work properly, it should perhaps become a separate bug.
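
One way to accept both spellings, sketched here in Python, is to map the dedicated Unicode characters (U+05F3 HEBREW PUNCTUATION GERESH and U+05F4 HEBREW PUNCTUATION GERSHAYIM) to their ASCII counterparts before the dictionary lookup. The helper name `normalize_for_lookup` is an assumption for illustration; this report does not establish where such a mapping should live.

```python
# Map the dedicated Hebrew punctuation to the ASCII quotes used in the dictionary.
GERESH_TO_ASCII = str.maketrans({
    "\u05F3": "'",   # HEBREW PUNCTUATION GERESH
    "\u05F4": '"',   # HEBREW PUNCTUATION GERSHAYIM
})


def normalize_for_lookup(word):
    """Return the word with geresh/gershayim replaced by ASCII quotes."""
    return word.translate(GERESH_TO_ASCII)


if __name__ == "__main__":
    # Both spellings of the acronym end up as the same dictionary key.
    print(normalize_for_lookup("מנכ\u05F4ל") == 'מנכ"ל')
```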
Comment 3 Lior Kaplan 2012-03-16 05:45:06 UTC
Reported in the past with OO.org at 
https://issues.apache.org/ooo/show_bug.cgi?id=51772
https://issues.apache.org/ooo/show_bug.cgi?id=99796

The first also has patches which might still be relevant.
Comment 4 QA Administrators 2014-10-24 03:18:27 UTC Comment hidden (obsolete)
Comment 5 Amir Adar 2015-01-13 13:53:01 UTC
Though the bug is quite old, it is still present in versions 4.3.5.2 and 4.5.0.0 (master build). To reproduce:

1. Open LibreOffice Writer.
2. Type in a Hebrew acronym, like פלמ"ח.

Even if the language is set to Hebrew, the acronym is underlined in red, indicating a spelling error. It also splits the letters before and after the gershayim, treating them as two words instead of one.

I tried to reproduce this in other programs with spell-checking capabilities, such as Gedit, and it seems the problem is there as well. Perhaps the underlying engine is at fault, and not LibreOffice itself.

I am using Linux Mint 17.1, 32-bit.
Comment 6 Nadav Har'El 2015-01-13 14:34:09 UTC
Indeed, this bug still exists, and is still very annoying to Hebrew users!

As I explained in detail in the original bug report, I believe this is *not* a problem of the underlying engine (hunspell, based on data from the hspell project) but rather of LibreOffice's own word-splitting algorithm, which apparently doesn't respect hunspell's declaration of in-word characters, nor does it support the correct word-split rules for Hebrew (where certain seemingly-punctuation characters may be parts of words).

I'm not familiar with the code involved, but https://www.libreoffice.org/bugzilla/show_bug.cgi?id=62360 points to the place in the libreoffice code which might need to be fixed.
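
The two splitting behaviours contrasted in this comment can be sketched in Python. This is only an illustration: the regular expressions, the `misspelled` helper and the three-word `DICTIONARY` are hypothetical stand-ins, not LibreOffice's or hunspell's code.

```python
import re

# Hypothetical stand-in for the Hebrew dictionary (words taken from this report).
DICTIONARY = {"ג'ירפה", 'מנכ"ל', 'מנכ"לים'}

# Language-agnostic splitting: any run of letters is a word, quotes always break.
GENERIC_WORD = re.compile(r"[^\W\d_]+")

# Hebrew-aware splitting: ' or " may appear between two Hebrew letters
# (U+05D0 alef through U+05EA tav), but not at the start or end of a word.
HEBREW_WORD = re.compile(r"[\u05D0-\u05EA]+(?:['\"][\u05D0-\u05EA]+)*")


def misspelled(text, word_pattern):
    """Split text with the given pattern and return words not in DICTIONARY."""
    return [w for w in word_pattern.findall(text) if w not in DICTIONARY]


if __name__ == "__main__":
    sample = 'מנכ"ל'
    # Generic splitting yields two fragments, both flagged as errors.
    print(misspelled(sample, GENERIC_WORD))
    # Hebrew-aware splitting keeps the acronym whole, so nothing is flagged.
    print(misspelled(sample, HEBREW_WORD))
```

With the generic pattern, the spell-checker never even sees the whole word, so the quote-containing dictionary entries cannot match.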
Comment 7 QA Administrators 2016-01-17 20:05:15 UTC Comment hidden (obsolete)
Comment 8 Lior Kaplan 2016-01-18 20:15:23 UTC
Still happens with LibO 5.0.x. 

To test: use the word ג'ירפה and see that the quote makes the spell checker think it's two words.
Comment 9 Nadav Har'El 2016-02-21 08:23:17 UTC
Indeed, this bug still exists, and is still very annoying. Non-Hebrew speakers might not appreciate the significance of this bug, but a certain percentage of Hebrew words (unfortunately I can't quote a good estimate) simply contain the single-quote or double-quote characters. I gave examples above: certain words with foreign-language sounds, and all acronyms.

LibreOffice will mark all these words as wrong, which not only prevents spell-checking such words, but also lowers the user's overall confidence in the spell-checker, because he or she will so often see correctly written words marked in red.
Comment 10 QA Administrators 2017-09-01 11:20:33 UTC Comment hidden (obsolete)
Comment 11 Lior Kaplan 2017-09-01 11:41:40 UTC
Still reproducible.

Version: 5.4.0.3
Build ID: 1:5.4.0-1
Comment 12 Omer Zak 2017-11-15 22:07:00 UTC
Still happens in:

Version: 6.0.0.0.alpha1+
Build ID: 9050854c35c389466923f0224a36572d36cd471a
CPU threads: 8; OS: Linux 4.9; UI render: default; VCL: gtk3; 
Locale: en-US (en_US.utf8); Calc: group

OS: Debian 64bit Stretch (Debian 9.2, with some backported packages)


But with some changes.
1. The word פלמ"ח is still not handled correctly.
2. The word ג'ירפה is now handled correctly. It seems that Writer now converts the single quote into geresh.
Comment 13 QA Administrators 2018-11-16 03:42:08 UTC Comment hidden (obsolete)
Comment 14 Nadav Har'El 2018-11-16 20:37:42 UTC
The bug still exists in LibreOffice 6.1.2.1.

As Omer Zak noted above, the bug was *fixed* for the single quote, e.g., ג'ירפה or סח'נין are now correctly recognized as correctly spelled. This is a welcome improvement. However, the bug still exists for double-quotes, e.g., מנכ"לים or פלמ"ח are still split into two words which are spell-checked individually.
Comment 15 Eyal Rozenberg 2018-12-27 15:28:57 UTC
(In reply to Nadav Har'El from comment #14)
> As Omer Zak noted above, the bug was *fixed* for the single quote, e.g.,
> ג'ירפה or סח'נין are now correctly recognized as correctly spelled. 

If you can bisect this fix with daily builds, you can probably figure out who exactly fixed it and where. If you do that, perhaps we'd be able to either:

* Formulate a patch to handle the double-quote case as well; or
* Contact the developer who introduced that patch to ask for their help more specifically.