Bug 46950 - Hebrew: Spell-checking breaks Hebrew words at intra-word single and double quotes
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Linguistic
Version: Inherited From OOo (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: Spell-Checking RTL-Hebrew
Reported: 2012-03-04 01:59 UTC by Nadav Har'El
Modified: 2023-02-16 14:47 UTC (History)

See Also:
Crash report or crash signature:


Attachments
Document exhibiting the different manifestations of the bug (13.73 KB, application/vnd.oasis.opendocument.text)
2021-02-12 22:08 UTC, Eyal Rozenberg
Test document rendered in LO Writer 7.1 (218.97 KB, image/png)
2021-02-12 22:09 UTC, Eyal Rozenberg

Description Nadav Har'El 2012-03-04 01:59:46 UTC
It appears that before LibreOffice passes text to the spell-checker, it breaks the text into separate words. The problem is that (apparently) it does this using general language-agnostic rules, while different languages may have different rules as to which characters may be part of a word and which characters break words.

My problem is specifically with the Hebrew spell-checking: In Hebrew, the quote characters - ' and ", are used not just for quoting, but have an additional unrelated use as in-word characters:
1. The single-quote is used to mark foreign sounds. E.g., the word ג'ירפה has a single-quote character after the gimmel, which means it should be pronounced "j", not "g".
2. The double-quote is used inside acronyms, to mark them as such. For example מנכ"ל is the acronym for CEO. מנכ"לים is its plural. Both have quotes in the middle of the word - and these words, together with this quote, are in the dictionary.

Because of this, the Hebrew hunspell dictionary includes the following lines in he_IL.aff:

   BREAK 3
   BREAK ^"
   BREAK "$
   BREAK ^'

This means that " only breaks words when it's at the beginning or end (and ' only at the beginning) - these characters in the middle of a word never mean a word break in Hebrew. With this setting, hunspell correctly word-breaks and spell-checks Hebrew text.
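
As a rough illustration of what these three rules mean, here is a minimal Python sketch. It is a simplified model and not hunspell's actual algorithm: the function name `strip_break_chars` is hypothetical, and it only handles anchored rules of the form `^x` and `x$`, like the ones above.

```python
# Simplified model of the anchored BREAK rules from he_IL.aff.
# Not hunspell's real implementation; all names here are illustrative.
BREAK_RULES = ['^"', '"$', "^'"]


def strip_break_chars(token, rules=BREAK_RULES):
    """Strip characters matched by anchored rules at the token's edges.

    Characters in the middle of the token are never touched, so an
    intra-word quote stays part of the word.
    """
    changed = True
    while changed and token:
        changed = False
        for rule in rules:
            if rule.startswith("^") and token.startswith(rule[1:]):
                token = token[len(rule) - 1:]      # drop the leading match
                changed = True
            elif rule.endswith("$") and token.endswith(rule[:-1]):
                token = token[:-(len(rule) - 1)]   # drop the trailing match
                changed = True
    return token
```

Under this model, a quoted acronym loses only its surrounding quotes, while the double quote inside the acronym is left alone.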

Unfortunately, LibreOffice doesn't respect these instructions. It appears that it incorrectly breaks up the words before sending them to hunspell. The end result is that all Hebrew words which are acronyms or have foreign sounds in them are incorrectly marked as being errors, which is very annoying.
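
To make the suspected failure mode concrete, here is a small Python sketch comparing a language-agnostic splitter with a Hebrew-aware one. Both functions are assumptions made for illustration; neither is LibreOffice's actual word-breaking code, and the Hebrew-aware rule (keep a single ' or " between two Hebrew letters) is just one possible formulation.

```python
import re


def naive_words(text):
    # Language-agnostic splitting: every run of "word characters" is a
    # word, so a quote between two Hebrew letters ends the word.
    return re.findall(r"\w+", text)


# Hebrew letters alef..tav (assumed range, for illustration only).
HEBREW = "\u05d0-\u05ea"
# A single ' or " is kept only when it sits between two Hebrew letters.
HEBREW_WORD = re.compile(rf"[{HEBREW}]+(?:['\"][{HEBREW}]+)*")


def hebrew_words(text):
    return HEBREW_WORD.findall(text)
```

With the acronym מנכ"ל, `naive_words` yields two fragments (the part before the quote and the part after it), which matches the behavior described above, while `hebrew_words` keeps it as one word and still drops quotes that surround a word.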
Comment 1 Urmas 2012-03-05 20:42:44 UTC
On second thought, why do you use ' and " instead of geresh/gershaim?

But the problem is that spellchecker breaks words on them too, even if that is explicitly prohibited.
Comment 2 Nadav Har'El 2012-03-05 22:33:06 UTC
Well, it's just that despite the existence of the separate "geresh" and "gershayim" characters in Unicode, I've never seen anyone actually using them. Everyone I've seen uses the normal ASCII single-quote and double-quote characters respectively, and expects those to look fine in Hebrew fonts - and they do.

The reason people don't use the special Unicode characters is probably that there is usually no convenient method to enter them with the keyboard.

You're right that it should also be checked what happens with these special characters - the spell-checker shouldn't break such words, and it should accept them even though the dictionary contains words with the ASCII quotes/double-quote, not with geresh/gershayim. This should perhaps become a separate bug, if it doesn't work properly.
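
One way to get that behavior would be to normalize a word before looking it up. The Python sketch below is purely hypothetical (the name `normalize_for_lookup` and the idea of a pre-lookup step are assumptions, not anything LibreOffice or hunspell is known to do): it maps the dedicated Hebrew punctuation marks to the ASCII characters that the dictionary entries use.

```python
# Hypothetical pre-lookup normalization: map the dedicated Hebrew
# punctuation marks to the ASCII characters used in the dictionary.
TO_ASCII = str.maketrans({
    "\u05f3": "'",   # HEBREW PUNCTUATION GERESH
    "\u05f4": '"',   # HEBREW PUNCTUATION GERSHAYIM
})


def normalize_for_lookup(word):
    # Only the two marks above are rewritten; everything else is kept.
    return word.translate(TO_ASCII)
```

A word typed with a real geresh would then be looked up in the same form as one typed with an ASCII single quote.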
Comment 3 Lior Kaplan 2012-03-16 05:45:06 UTC
Reported in the past with OO.org at 
https://issues.apache.org/ooo/show_bug.cgi?id=51772
https://issues.apache.org/ooo/show_bug.cgi?id=99796

The first also has patches which might still be relevant.
Comment 4 QA Administrators 2014-10-24 03:18:27 UTC Comment hidden (obsolete)
Comment 5 Amir Adar 2015-01-13 13:53:01 UTC
Though the bug is quite old, it is still present in versions 4.3.5.2 and 4.5.0.0 (master build). To reproduce:

1. Open LibreOffice Writer.
2. Type in a Hebrew acronym, like פלמ"ח.

Even if the language is set to Hebrew, the acronym is underlined in red, indicating a spelling error. It also splits the letters before and after the "Gershayim", marking them as two words instead of one.

I tried to reproduce on other programs with spell-checking capabilities, such as Gedit, and it seems the problem is there as well. Perhaps the underlying engine is at fault, and not LibreOffice itself.

I am using Linux Mint 17.1, 32-bit.
Comment 6 Nadav Har'El 2015-01-13 14:34:09 UTC
Indeed, this bug still exists, and is still very much annoying to Hebrew users!

As I explained in detail in the original bug report, I believe this is *not* a problem of the underlying engine (hunspell, based on data from the hspell project) but rather of LibreOffice's own word-splitting algorithm, which apparently doesn't respect hunspell's declaration of in-word characters, nor does it support the correct word-split rules for Hebrew (where certain seemingly-punctuation characters may be parts of words).

I'm not familiar with the code involved, but https://www.libreoffice.org/bugzilla/show_bug.cgi?id=62360 points to the place in the libreoffice code which might need to be fixed.
Comment 7 QA Administrators 2016-01-17 20:05:15 UTC Comment hidden (obsolete)
Comment 8 Lior Kaplan 2016-01-18 20:15:23 UTC
Still happens with LibO 5.0.x. 

To test: use the word ג'ירפה and see that the quote makes the spell checker think it's two words.
Comment 9 Nadav Har'El 2016-02-21 08:23:17 UTC
Indeed, this bug still exists, and is still very annoying. Non-Hebrew speakers might not appreciate the meaning of this bug, but a certain percentage of Hebrew words (unfortunately I can't quote a good estimate) simply contain the single-quote or double-quote characters in them. I gave examples above - certain words with foreign-language sounds and all acronyms.

LibreOffice will mark all these words as wrong, which not only prevents spell-checking such words, it also lowers the user's overall confidence in the spellchecker because he or she will so often see correctly-written words red-marked.
Comment 10 QA Administrators 2017-09-01 11:20:33 UTC Comment hidden (obsolete)
Comment 11 Lior Kaplan 2017-09-01 11:41:40 UTC
Still reproducible.

Version: 5.4.0.3
Build ID: 1:5.4.0-1
Comment 12 Omer Zak 2017-11-15 22:07:00 UTC
Still happens in:

Version: 6.0.0.0.alpha1+
Build ID: 9050854c35c389466923f0224a36572d36cd471a
CPU threads: 8; OS: Linux 4.9; UI render: default; VCL: gtk3; 
Locale: en-US (en_US.utf8); Calc: group

OS: Debian 64bit Stretch (Debian 9.2, with some backported packages)


But with some changes.
1. The word פלמ"ח is still not handled correctly.
2. The word ג'ירפה is now handled correctly. Seems that Writer now converts the single quote into geresh.
Comment 13 QA Administrators 2018-11-16 03:42:08 UTC Comment hidden (obsolete)
Comment 14 Nadav Har'El 2018-11-16 20:37:42 UTC
The bug still exists in LibreOffice 6.1.2.1.

As Omer Zak noted above, the bug was *fixed* for the single quote, e.g., ג'ירפה or סח'נין are now correctly recognized as correctly spelled. This is a welcome improvement. However, the bug still exists for double-quotes, e.g., מנכ"לים or פלמ"ח are still split into two words which are spell-checked individually.
Comment 15 Eyal Rozenberg 2018-12-27 15:28:57 UTC
(In reply to Nadav Har'El from comment #14)
> As Omer Zak noted above, the bug was *fixed* for the single quote, e.g.,
> ג'ירפה or סח'נין are now correctly recognized as correctly spelled. 

If you can bisect this fix with daily builds, you can probably figure out who exactly fixed it and where. If you do that, perhaps we'd be able to either:

* Formulate a patch to handle the double-quote case as well; or
* Contact the developer who introduced that patch to ask for their help more specifically.
Comment 16 QA Administrators 2020-12-27 03:37:42 UTC Comment hidden (obsolete)
Comment 17 Eyal Rozenberg 2021-02-12 22:06:57 UTC
So, Nadav has not replied to my last comment, so let me summarize the state of affairs, in LO 7.1:

* There are (at least) three ways to signify a Geresh within a word: APOSTROPHE (U+0027), RIGHT SINGLE QUOTATION MARK (U+2019), and HEBREW PUNCTUATION GERESH (U+05F3).
* Similarly, there are (at least) three ways to signify a Gershaim within a word: DOUBLE QUOTATION MARK (U+0022), RIGHT DOUBLE QUOTATION MARK (U+201D), and HEBREW PUNCTUATION GERSHAIM (U+05F4).
* LibreOffice Writer _is_ breaking up words using APOSTROPHE or DOUBLE QUOTATION MARK. In an ideal world, these would not be used for Geresh or Gershaim, but since these are commonly used in practice - it is a bug.
* LibreOffice Writer _is_ breaking up words using RIGHT DOUBLE QUOTATION MARK - this is a bug. Due to this bug, the two parts of the words are spell-checked separately.
* LibreOffice Writer is _not_ breaking up words using RIGHT SINGLE QUOTATION MARK, and spell-checking succeeds on them (at least in my anecdotal checking; consider ג’ירפה for example).
* LibreOffice Writer is _not_ breaking up words using HEBREW PUNCTUATION GERESH and HEBREW PUNCTUATION GERSHAIM - but spell-checking still _fails_ on them: ג׳ירפה , דו״ח. This is a different phenomenon than what Nadav Har'El first identified. It may be worth splitting off into a separate bug.
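
For anyone who wants to check which of these characters a given document actually contains, the following Python sketch prints the code point and Unicode name of each of the six candidates using the standard `unicodedata` module. The helper name `describe` is just an illustration. Note that the Unicode database names U+0022 simply QUOTATION MARK and spells U+05F4 as HEBREW PUNCTUATION GERSHAYIM.

```python
import unicodedata

# The six characters discussed above, in the same order.
CANDIDATES = ["\u0027", "\u2019", "\u05f3", "\u0022", "\u201d", "\u05f4"]


def describe(ch):
    # Format a character as "U+XXXX NAME" using the Unicode database.
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in CANDIDATES:
    print(describe(ch))
```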

Version info:
Version: 7.1.0.3 / LibreOffice Community
Build ID: f6099ecf3d29644b5008cc8f48f42f4a40986e4c
CPU threads: 4; OS: Linux 5.9; UI render: default; VCL: gtk3
Locale: he-IL (en_IL); UI: en-US
Comment 18 Eyal Rozenberg 2021-02-12 22:08:24 UTC
Created attachment 169707 [details]
Document exhibiting the different manifestations of the bug

This covers all 3 ways to signify both symbols.
Comment 19 Eyal Rozenberg 2021-02-12 22:09:47 UTC
Created attachment 169708 [details]
Test document rendered in LO Writer 7.1

You will note the red squiggly line where the automatic spelling check fails.

Note in particular the cases of only a single character or two characters getting the squiggly line rather than the full word.
Comment 20 Eyal Rozenberg 2021-02-12 22:24:01 UTC
I should mention that when you open the ODT, you need to enable editing and spelling auto-check, and also type something in for the spell-check to kick in. Otherwise nothing will show up as misspelled. (That's not a bug.)
Comment 22 Eyal Rozenberg 2021-02-12 22:35:46 UTC
Have opened bug 140382 about the failure of the spell-checking to accept words with proper HEBREW PUNCTUATION GERESH and HEBREW PUNCTUATION GERSHAIM.
Comment 23 eladhen2 2021-02-15 15:41:30 UTC
I see this behavior on 6.3.6.2
Comment 24 QA Administrators 2023-02-16 03:25:58 UTC Comment hidden (obsolete)
Comment 25 Eyal Rozenberg 2023-02-16 14:47:06 UTC
The situation described in comment 17 persists with:

Version: 7.6.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: ad387d5b984c6666906505d25685065f710ed55d
CPU threads: 4; OS: Linux 6.1; UI render: default; VCL: gtk3
Locale: he-IL (en_IL); UI: en-US

To summarize this more succinctly:

intra-word character         break-up?    Example word
-----------------------------------------------------------
APOSTROPHE                   Yes          ג'ירפה
RIGHT SINGLE QUOTATION MARK  No           ג’ירפה
HEBREW PUNCTUATION GERESH    No           ג׳ירפה
DOUBLE QUOTATION MARK        Yes          דו"ח
RIGHT DOUBLE QUOTATION MARK  Yes          דו”ח
HEBREW PUNCTUATION GERSHAIM  No           דו״ח

All the "Yes" entries are buggy behavior - there should be no break-up of the word into two parts.

Spelling failure despite non-breakup:

HEBREW PUNCTUATION GERESH
HEBREW PUNCTUATION GERSHAIM