Bug 46950 - Hebrew: Spell-checking breaks Hebrew words at intra-word single and double quotes
Status: VERIFIED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Linguistic
Version (earliest affected): Inherited From OOo
Hardware: All All
Importance: medium normal
Assignee: Jonathan Clark
URL:
Whiteboard: target:25.2.0 target:24.8.0.2
Keywords:
Depends on:
Blocks: Spell-Checking Hebrew
Reported: 2012-03-04 01:59 UTC by Nadav Har'El
Modified: 2024-10-24 08:23 UTC (History)

See Also:
Crash report or crash signature:


Attachments
Document exhibiting the different manifestations of the bug (13.73 KB, application/vnd.oasis.opendocument.text)
2021-02-12 22:08 UTC, Eyal Rozenberg
Test document rendered in LO Writer 7.1 (218.97 KB, image/png)
2021-02-12 22:09 UTC, Eyal Rozenberg
Spell checking of the table from comment 32 (58.08 KB, image/png)
2024-10-23 16:02 UTC, Eyal Rozenberg

Description Nadav Har'El 2012-03-04 01:59:46 UTC
It appears that before LibreOffice passes text to the spell-checker, it breaks the text into separate words. The problem is that (apparently) it does this using general language-agnostic rules, while different languages may have different rules as to which characters can be part of a word and which break words.

My problem is specifically with the Hebrew spell-checking: In Hebrew, the quote characters - ' and ", are used not just for quoting, but have an additional unrelated use as in-word characters:
1. The single-quote is used to mark foreign sounds. E.g., the word ג'ירפה has a single-quote character after the gimmel, which means it should be pronounced "j", not "g".
2. The double-quote is used inside acronyms, to mark them as such. For example מנכ"ל is the acronym for CEO. מנכ"לים is its plural. Both have quotes in the middle of the word - and these words, together with this quote, are in the dictionary.

Because of this, the Hebrew hunspell dictionary includes the following lines in he_IL.aff:

   BREAK 3
   BREAK ^"
   BREAK "$
   BREAK ^'

This means that " only breaks words when it's in the beginning and end (and ' only in the beginning) - these characters in the middle of a word never mean a word break in Hebrew. With this setting, hunspell correctly word-breaks and spell-checks Hebrew text.
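
To make the intended effect of these three rules concrete, here is a minimal Python sketch. It is only an illustration of the behaviour described above, not Hunspell's actual algorithm; the function names `strip_breaks` and `words` are made up for this example, and splitting on whitespace is a simplifying assumption.

```python
# Illustration only: mimics the effect of the three BREAK rules quoted above.
LEADING = ('"', "'")   # BREAK ^"  and  BREAK ^'
TRAILING = ('"',)      # BREAK "$

def strip_breaks(token: str) -> str:
    """Remove break characters at the edges of a token, never in the middle."""
    changed = True
    while changed and token:
        changed = False
        if token.startswith(LEADING):
            token = token[1:]
            changed = True
        elif token.endswith(TRAILING):
            token = token[:-1]
            changed = True
    return token

def words(text: str) -> list[str]:
    """Split on whitespace (an assumption), then strip edge quotes only."""
    return [w for w in (strip_breaks(t) for t in text.split()) if w]

# A quoted acronym followed by a word with an in-word single quote.
for w in words('"מנכ"ל" ג\'ירפה'):
    print(w)
```

The surrounding quotation marks are dropped, while the quote inside the acronym and the single quote after the gimmel stay part of their words.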

Unfortunately, LibreOffice doesn't respect these instructions. It appears that it incorrectly breaks up the words before sending them to hunspell. The end result is that all Hebrew words which are acronyms or have foreign sounds in them are incorrectly marked as being errors, which is very annoying.
Comment 1 Urmas 2012-03-05 20:42:44 UTC
On second thought, why do you use ' and " instead of geresh/gershaim?

But the problem is that the spellchecker breaks words on them too, even if that is explicitly prohibited.
Comment 2 Nadav Har'El 2012-03-05 22:33:06 UTC
Well, it's just that despite the existence of the separate "geresh" and "gershayim" characters in Unicode, I've never seen anyone actually using them. Everyone I've seen uses the normal ASCII single-quote and double-quote characters respectively, and expects those to look fine in Hebrew fonts - and they do.

The reason people don't use the special Unicode characters is probably that there is usually no convenient method to enter them with the keyboard.

You're right that it should also be checked what happens with these special characters - the spell-checker shouldn't break such words, and it should accept them even though the dictionary contains words with the ASCII quotes/double-quote, not with geresh/gershayim. This should perhaps become a separate bug, if it doesn't work properly.
Comment 3 Lior Kaplan 2012-03-16 05:45:06 UTC
Reported in the past with OO.org at 
https://issues.apache.org/ooo/show_bug.cgi?id=51772
https://issues.apache.org/ooo/show_bug.cgi?id=99796

The first also has patches which might still be relevant.
Comment 4 QA Administrators 2014-10-24 03:18:27 UTC Comment hidden (obsolete)
Comment 5 Amir Adar 2015-01-13 13:53:01 UTC
Though the bug is quite old, it is still present in versions 4.3.5.2 and 4.5.0.0 (master build). To reproduce:

1. Open LibreOffice Writer.
2. Type in a Hebrew acronym, like פלמ"ח.

Even if the language is set to Hebrew, the acronym is underlined with red, indicating a spelling error. It also separates between the letters before and after the "Gershayim", marking them as two words instead of one.

I tried to reproduce on other programs with spell-checking capabilities, such as Gedit, and it seems the problem is there as well. Perhaps the underlying engine is at fault, and not LibreOffice itself.

I am using Linux Mint 17.1, 32-bit.
Comment 6 Nadav Har'El 2015-01-13 14:34:09 UTC
Indeed, this bug still exists, and is still very much annoying to Hebrew users!

As I explained in detail in the original bug report, I believe this is *not* a problem of the underlying engine (aspell, based on data from the hspell project) but rather of LibreOffice's own word-splitting algorithm, which apparently doesn't respect Aspell's declaration of in-word characters, nor does it support the correct word-split rules for Hebrew (where certain seemingly-punctuation characters may be parts of words).

I'm not familiar with the code involved, but https://www.libreoffice.org/bugzilla/show_bug.cgi?id=62360 points to the place in the libreoffice code which might need to be fixed.
Comment 7 QA Administrators 2016-01-17 20:05:15 UTC Comment hidden (obsolete)
Comment 8 Lior Kaplan 2016-01-18 20:15:23 UTC
Still happens with LibO 5.0.x. 

To test: use the word ג'ירפה and see that the quote makes the spell checker think it's two words.
Comment 9 Nadav Har'El 2016-02-21 08:23:17 UTC
Indeed, this bug still exists, and is still very annoying. Non-Hebrew speakers might not appreciate the significance of this bug, but a certain percentage of Hebrew words (unfortunately I can't quote a good estimate) simply contain the single-quote or double-quote characters. I gave examples above: certain words with foreign-language sounds, and all acronyms.

LibreOffice will mark all these words as wrong, which not only prevents spell-checking such words but also lowers the user's overall confidence in the spell-checker, because he or she will so often see correctly written words marked in red.
Comment 10 QA Administrators 2017-09-01 11:20:33 UTC Comment hidden (obsolete)
Comment 11 Lior Kaplan 2017-09-01 11:41:40 UTC
Still reproducible.

Version: 5.4.0.3
Build ID: 1:5.4.0-1
Comment 12 Omer Zak 2017-11-15 22:07:00 UTC
Still happens in:

Version: 6.0.0.0.alpha1+
Build ID: 9050854c35c389466923f0224a36572d36cd471a
CPU threads: 8; OS: Linux 4.9; UI render: default; VCL: gtk3; 
Locale: en-US (en_US.utf8); Calc: group

OS: Debian 64bit Stretch (Debian 9.2, with some backported packages)


But with some changes.
1. The word פלמ"ח is still not handled correctly.
2. The word ג'ירפה is now handled correctly. Seems that Writer now converts the single quote into geresh.
Comment 13 QA Administrators 2018-11-16 03:42:08 UTC Comment hidden (obsolete)
Comment 14 Nadav Har'El 2018-11-16 20:37:42 UTC
‎The bug still exists in LibreOffice 6.1.2.1.

As Omer Zak noted above, the bug was *fixed* for the single quote, e.g., ג'ירפה or סח'נין are now correctly recognized as correctly spelled. This is a welcome improvement. However, the bug still exists for double-quotes, e.g., מנכ"לים or פלמ"ח are still split into two words which are spell-checked individually.
Comment 15 Eyal Rozenberg 2018-12-27 15:28:57 UTC
(In reply to Nadav Har'El from comment #14)
> As Omer Zak noted above, the bug was *fixed* for the single quote, e.g.,
> ג'ירפה or סח'נין are now correctly recognized as correctly spelled. 

If you can bisect this fix with daily builds, you can probably figure out who exactly fixed it and where. If you do that, perhaps we'd be able to either:

* Formulate a patch to handle the double-quote case as well; or
* Contact the developer who introduced that patch to ask for their help more specifically.
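
The bisection suggested here is a binary search over an ordered list of builds. A minimal sketch, assuming that once the fix landed every later build also has it; `first_fixed`, the build labels and the `is_fixed` check are hypothetical placeholders, not real build names or tooling.

```python
def first_fixed(builds, is_fixed):
    """Return the earliest build for which is_fixed(build) is True.

    Assumes builds are ordered oldest to newest and that every build
    after the fix is also fixed. Returns None if no build is fixed.
    """
    lo, hi = 0, len(builds)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_fixed(builds[mid]):
            hi = mid          # the fix is at mid or earlier
        else:
            lo = mid + 1      # the fix is after mid
    return builds[lo] if lo < len(builds) else None

# Hypothetical labels; in practice is_fixed would mean "install the build
# and check whether the single-quote word is still underlined".
builds = ["build-1", "build-2", "build-3", "build-4", "build-5"]
fixed = {"build-4", "build-5"}
print(first_fixed(builds, lambda b: b in fixed))
```

With N daily builds this needs roughly log2(N) manual checks rather than N.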
Comment 16 QA Administrators 2020-12-27 03:37:42 UTC Comment hidden (obsolete)
Comment 17 Eyal Rozenberg 2021-02-12 22:06:57 UTC
So, Nadav has not replied to my last comment, so let me summarize the state of affairs, in LO 7.1:

* There are (at least) three ways to signify a Geresh within a word: APOSTROPHE (U+0027), RIGHT SINGLE QUOTATION MARK (U+2019), and HEBREW PUNCTUATION GERESH (U+05F3).
* Similarly, there are (at least) three ways to signify a Gershaim within a word: DOUBLE QUOTATION MARK (U+0022), RIGHT DOUBLE QUOTATION MARK (U+201D), and HEBREW PUNCTUATION GERSHAIM (U+05F4).
* LibreOffice Writer _is_ breaking up words at APOSTROPHE or DOUBLE QUOTATION MARK. In an ideal world, these would not be used for Geresh or Gershaim, but since they are commonly used in practice, it is a bug.
* LibreOffice Writer _is_ breaking up words at RIGHT DOUBLE QUOTATION MARK - this is a bug. Due to this bug, the two parts of the word are spell-checked separately.
* LibreOffice Writer is _not_ breaking up words at RIGHT SINGLE QUOTATION MARK, and spell-checking succeeds on them (at least in my anecdotal checking; consider ג’ירפה for example).
* LibreOffice Writer is _not_ breaking up words at HEBREW PUNCTUATION GERESH and HEBREW PUNCTUATION GERSHAIM - but spell-checking still _fails_ on them: ג׳ירפה, דו״ח. This is a different phenomenon from the one Nadav Har'El first identified. It may be worth splitting off into a separate bug.
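
Since several of these characters look nearly identical on screen, it helps to inspect code points directly. The small sketch below uses only Python's standard `unicodedata` module; the `CANDIDATES` grouping and the `describe` helper are made up for this example. Note that the formal Unicode names differ slightly from the spellings used in this comment: U+0022 is named QUOTATION MARK and U+05F4 is HEBREW PUNCTUATION GERSHAYIM.

```python
import unicodedata

# The six candidate characters listed above, grouped by the role they play.
CANDIDATES = {
    "geresh": ["\u0027", "\u2019", "\u05F3"],
    "gershayim": ["\u0022", "\u201D", "\u05F4"],
}

def describe(ch: str) -> str:
    """Return the code point and formal Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

for role, chars in CANDIDATES.items():
    for ch in chars:
        print(f"{role:10} {describe(ch)}")
```

Running `describe` over each character of a test word shows which of the three variants a document actually contains.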

Version info:
Version: 7.1.0.3 / LibreOffice Community
Build ID: f6099ecf3d29644b5008cc8f48f42f4a40986e4c
CPU threads: 4; OS: Linux 5.9; UI render: default; VCL: gtk3
Locale: he-IL (en_IL); UI: en-US
Comment 18 Eyal Rozenberg 2021-02-12 22:08:24 UTC
Created attachment 169707 [details]
Document exhibiting the different manifestations of the bug

This covers all 3 ways to signify both symbols.
Comment 19 Eyal Rozenberg 2021-02-12 22:09:47 UTC
Created attachment 169708 [details]
Test document rendered in LO Writer 7.1

You will note the red squiggly line where the automatic spelling check fails.

Note in particular the cases of only a single character or two characters getting the squiggly line rather than the full word.
Comment 20 Eyal Rozenberg 2021-02-12 22:24:01 UTC
I should mention that when you open the ODT, you need to enable editing and spelling auto-check, and also type something in for the spell-check to kick in. Otherwise nothing will show up as misspelled. (That's not a bug.)
Comment 21 Eyal Rozenberg 2021-02-12 22:31:37 UTC
I should mention that when you open the ODT, you need to enable editing and spelling auto-check, and also type something in for the spell-check to kick in. Otherwise nothing will show up as misspelled
Comment 22 Eyal Rozenberg 2021-02-12 22:35:46 UTC
Have opened bug 140382 about the failure of the spell-checking to accept words with proper HEBREW PUNCTUATION GERESH and HEBREW PUNCTUATION GERSHAIM.
Comment 23 eladhen2 2021-02-15 15:41:30 UTC
I see this behavior on 6.3.6.2
Comment 24 QA Administrators 2023-02-16 03:25:58 UTC Comment hidden (obsolete)
Comment 25 Eyal Rozenberg 2023-02-16 14:47:06 UTC
The situation described in comment 17 persists with:

Version: 7.6.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: ad387d5b984c6666906505d25685065f710ed55d
CPU threads: 4; OS: Linux 6.1; UI render: default; VCL: gtk3
Locale: he-IL (en_IL); UI: en-US

To summarize this more succinctly:

intra-word character         break-up?    Example word
-----------------------------------------------------------
APOSTROPHE                   Yes          ג'ירפה
RIGHT SINGLE QUOTATION MARK  No           ג’ירפה
HEBREW PUNCTUATION GERESH    No           ג׳ירפה
DOUBLE QUOTATION MARK        Yes          דו"ח
RIGHT DOUBLE QUOTATION MARK  Yes          דו”ח
HEBREW PUNCTUATION GERSHAIM  No           דו״ח

All the "Yes" entries are buggy behavior - there should be no break-up of the word into two parts.

Spelling failure despite non-breakup:

HEBREW PUNCTUATION GERESH
HEBREW PUNCTUATION GERSHAIM
Comment 26 Commit Notification 2024-07-16 00:18:02 UTC
Jonathan Clark committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/174aa6e980f973cea9b1c402d03bd6dba951f5ae

tdf#46950 Allow intra-word right double quotation mark

It will be available in 25.2.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 27 Jonathan Clark 2024-07-16 00:31:00 UTC
The above patch adds right double quotation marks as an alternative for gershaim. The rest of the cases were already handled correctly, but I added more regression tests to ensure these changes aren't accidentally reverted in the future.

With this change, spell checking will still break Hebrew words at geresh, gershaim, and right double quotation marks. Support for these characters needs to be added to the Hebrew dictionary data. This is tracked by bug 140382, mentioned above.
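
One way to picture what dictionary-side support would achieve is a lookup that treats the typographic variants as equivalent to the ASCII quotes used in the dictionary entries. The sketch below is purely illustrative and is not the change tracked in bug 140382: `TO_ASCII`, `DICTIONARY` and `is_known` are hypothetical names, and the two-word set stands in for real dictionary data.

```python
# Hypothetical normalization: map the variants onto the ASCII quotes
# that (per the original report) the dictionary entries use.
TO_ASCII = str.maketrans({
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u05F3": "'",   # HEBREW PUNCTUATION GERESH
    "\u201D": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u05F4": '"',   # HEBREW PUNCTUATION GERSHAYIM
})

# Stand-in for real dictionary data: two words from this report.
DICTIONARY = {"ג'ירפה", 'דו"ח'}

def is_known(word: str) -> bool:
    """Look the word up after folding quote variants to ASCII."""
    return word.translate(TO_ASCII) in DICTIONARY

# The same word typed with a proper geresh is accepted after folding.
print(is_known("ג'ירפה".replace("'", "\u05F3")))
```

Whether such folding belongs in the dictionary data, in the spell-checking engine, or in the application is exactly the kind of question left to bug 140382.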
Comment 28 Commit Notification 2024-07-17 03:06:30 UTC
Jonathan Clark committed a patch related to this issue.
It has been pushed to "libreoffice-24-8":

https://git.libreoffice.org/core/commit/9c9a7fa814c276dcd6ba1c18023d17c3e5a0745b

tdf#46950 Allow intra-word right double quotation mark

It will be available in 24.8.0.2.

Comment 29 Eyal Rozenberg 2024-10-22 22:07:25 UTC
(In reply to Jonathan Clark from comment #27)
> The above patch adds right double quotation marks as an alternative for
> gershaim. The rest of the cases were already handled correctly

Were they though?

The correct handling is a matter of level-of-strictness. If I were strict, I could say that only HEBREW PUNCTUATION GERESH and HEBREW PUNCTUATION GERSHAIM can keep a word together, and that with the other characters it's two words and everything is a pig's breakfast anyway, indicating that we don't like their use by also redlining the spelling. But if I were lax, all six characters would keep a word together.

The behavior with a recent nightly:

Version: 25.2.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: c8371b5f1a84191d38185820915f0d93741df1fe
CPU threads: 4; OS: Linux 6.6; UI render: default; VCL: gtk3
Locale: en-US (en_IL); UI: en-US
Calc: threaded

is:

Word    Character                    Broken?       Fails spelling?
------------------------------------------------------------------------
ג'ירפה  APOSTROPHE                   Yes           N/A
ג’ירפה  RIGHT SINGLE QUOTATION MARK  No [1]        No
ג’ירפה  HEBREW PUNCTUATION GERESH    No            Yes
דו"ח    DOUBLE QUOTATION MARK        Yes           N/A
דו”ח    RIGHT DOUBLE QUOTATION MARK  No [1]        No
דו״ח    HEBREW PUNCTUATION GERSHAIM  No            Yes

[1] - deduced from the spelling success.


> With this change, spell checking will still break Hebrew words at geresh,
> gershaim, and right double quotation marks.

Doesn't seem like that's what's happening.
Comment 30 Jonathan Clark 2024-10-23 14:24:33 UTC
(In reply to Eyal Rozenberg from comment #29)
> Word    Character                    Broken?       Fails spelling?
> ------------------------------------------------------------------------
> ג'ירפה  APOSTROPHE                   Yes           N/A
> ג’ירפה  RIGHT SINGLE QUOTATION MARK  No [1]        No
> ג’ירפה  HEBREW PUNCTUATION GERESH    No            Yes
> דו"ח    DOUBLE QUOTATION MARK        Yes           N/A
> דו”ח    RIGHT DOUBLE QUOTATION MARK  No [1]        No
> דו״ח    HEBREW PUNCTUATION GERSHAIM  No            Yes

I re-checked this by copying and pasting the table into a local build of Writer. Writer doesn't break any of the Hebrew words. This can be verified by double-click selecting the words, or advancing the cursor by words (ctrl+arrow keys). The corresponding unit tests are also still present and passing.

Version: 25.2.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: d6b6419b7b937aea4639b7f4f81b7f24cdccc6e0
CPU threads: 32; OS: Linux 6.8; UI render: default; VCL: gtk3
Locale: en-US (en_US.UTF-8); UI: en-US
Calc: threaded
Comment 31 Eyal Rozenberg 2024-10-23 15:54:02 UTC Comment hidden (obsolete)
Comment 32 Eyal Rozenberg 2024-10-23 16:00:08 UTC
(In reply to Jonathan Clark from comment #30)

Ah, so, first I think I have messed up GERESH and maybe the GERSHAIM in the table. Here it is again:

ג'ירפה  APOSTROPHE                   Yes           N/A
ג’ירפה  RIGHT SINGLE QUOTATION MARK  No [1]        No
ג׳ירפה  HEBREW PUNCTUATION GERESH    No            Yes
דו"ח    DOUBLE QUOTATION MARK        Yes           N/A
דו”ח    RIGHT DOUBLE QUOTATION MARK  No [1]        No
דו״ח    HEBREW PUNCTUATION GERSHAIM  No            Yes

> Writer doesn't break any of the Hebrew words. This can be verified
> by double-click selecting the words, or advancing the cursor by words
> (ctrl+arrow keys). The corresponding unit tests are also still present and
> passing.

Ok, yes, Writer doesn't break any of the words with those actions. But it does break the words with apostrophe and double quotation mark when sending them over for spell-checking, so that we get the ג and the ירפה checked separately and the דו and ח.
Comment 33 Eyal Rozenberg 2024-10-23 16:02:06 UTC
Created attachment 197207 [details]
Spell checking of the table from comment 32

Screenshot showing: 

* ג'ירפה and דו"ח broken, 
* the other four combinations not broken
* Spell check passes with RIGHT SINGLE QUOTATION MARK, failing with GERESH
* Spell check passes with RIGHT DOUBLE QUOTATION MARK, failing with GERSHAIM

Checked with a nightly from 2024-10-22:

Version: 25.2.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: c8371b5f1a84191d38185820915f0d93741df1fe
CPU threads: 4; OS: Linux 6.6; UI render: default; VCL: gtk3
Locale: en-US (en_IL); UI: en-US
Calc: threaded
Comment 34 Jonathan Clark 2024-10-23 17:37:14 UTC
(In reply to Eyal Rozenberg from comment #32)
> (In reply to Jonathan Clark from comment #30)
> 
> Ah, so, first I think I have messed up GERESH and maybe the GERSHAIM in the
> table. Here it is again:
> 
> ג'ירפה  APOSTROPHE                   Yes           N/A
> ג’ירפה  RIGHT SINGLE QUOTATION MARK  No [1]        No
> ג׳ירפה  HEBREW PUNCTUATION GERESH    No            Yes
> דו"ח    DOUBLE QUOTATION MARK        Yes           N/A
> דו”ח    RIGHT DOUBLE QUOTATION MARK  No [1]        No
> דו״ח    HEBREW PUNCTUATION GERSHAIM  No            Yes
> 
> > Writer doesn't break any of the Hebrew words. This can be verified
> > by double-click selecting the words, or advancing the cursor by words
> > (ctrl+arrow keys). The corresponding unit tests are also still present and
> > passing.
> 
> Ok, yes, Writer doesn't break any of the words with those actions. But it
> does break the words with apostrophe and double quotation mark when sending
> them over for spell-checking, so that we get the ג and the ירפה checked
> separately and the דו and ח.

This is bug 140382. We are sending the words to Hunspell intact. Hunspell performs its own tokenization based on dictionary data. Hebrew dictionary data does not include these characters as word characters, so Hunspell splits them prior to dictionary lookup (see bug 140382 comment 3).
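
A toy model of the two stages described here may help: the application hands over a whole word, and the checker then splits it wherever a character is not in its own word-character set, reporting only the parts it cannot find. This is a simplified sketch, not Hunspell's implementation; `misspelled_spans`, `HEBREW_LETTERS` and the one-word dictionary are assumptions made for the example.

```python
def misspelled_spans(word, word_chars, dictionary):
    """Return (start, end) ranges of the parts of `word` not in `dictionary`.

    `word` arrives intact; it is split only at characters that are not
    in `word_chars`, which models the checker's own tokenization.
    """
    spans = []
    start = None
    for i, ch in enumerate(word + "\0"):   # sentinel flushes the last part
        if ch in word_chars:
            if start is None:
                start = i
        elif start is not None:
            if word[start:i] not in dictionary:
                spans.append((start, i))
            start = None
    return spans

HEBREW_LETTERS = {chr(c) for c in range(0x05D0, 0x05EB)}  # alef through tav
WORD = '\u05D3\u05D5"\u05D7'   # dalet, vav, ASCII double quote, het
DICTIONARY = {WORD}            # stand-in for real dictionary data

# Quote not treated as a word character: both halves are flagged.
print(misspelled_spans(WORD, HEBREW_LETTERS, DICTIONARY))          # [(0, 2), (3, 4)]
# Quote treated as a word character: the whole word is found.
print(misspelled_spans(WORD, HEBREW_LETTERS | {'"'}, DICTIONARY))  # []
```

The first call reproduces the symptom reported in this thread: the word is delivered whole, yet two separate fragments end up underlined.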
Comment 35 Eyal Rozenberg 2024-10-23 20:11:46 UTC
(In reply to Jonathan Clark from comment #34)
> This is bug 140382. We are sending the words to Hunspell intact. Hunspell
> performs its own tokenization based on dictionary data.

But why do we respect its further tokenization? Shouldn't we just mark the entire word as mis-spelled if hunspell rejected it?
Comment 36 Jonathan Clark 2024-10-23 20:45:17 UTC
(In reply to Eyal Rozenberg from comment #35)
> (In reply to Jonathan Clark from comment #34)
> > This is bug 140382. We are sending the words to Hunspell intact. Hunspell
> > performs its own tokenization based on dictionary data.
> 
> But why do we respect its further tokenization? Shouldn't we just mark the
> entire word as mis-spelled if hunspell rejected it?

I don't have a prepared example, but for certain highly synthetic languages we want to handle whole words for editing purposes, but the spell checker needs to work at a morpheme level and can report spelling mistakes for parts of words. For polysynthetic languages you could have 50 character words composed of 10 morphemes, and if only one morpheme is spelled incorrectly, it would be annoying to see the entire word redlined.
Comment 37 Eyal Rozenberg 2024-10-24 08:23:58 UTC
(In reply to Jonathan Clark from comment #36)
> I don't have a prepared example, but for certain highly synthetic languages
> we want to handle whole words for editing purposes, but the spell checker
> needs to work at a morpheme level and can report spelling mistakes for parts
> of words. 

Ok, I'll buy that.