Bug 140382 - Hebrew spelling check rejects words with proper Hebrew Geresh and Gershaim
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Linguistic
Version: 7.1.0.3 release (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: Dictionaries RTL-Hebrew
Reported: 2021-02-12 22:33 UTC by Eyal Rozenberg
Modified: 2024-07-16 22:22 UTC (History)

See Also:
Crash report or crash signature:


Description Eyal Rozenberg 2021-02-12 22:33:57 UTC
Description:
In Hebrew, a Geresh mark may be used to signify an altered sound of a consonant, typically for a word borrowed from another language. A Gershayim mark may be used to indicate an acronym.

In practice, people typing Hebrew on a keyboard often represent these marks with APOSTROPHE (U+0027) and QUOTATION MARK (U+0022) respectively; or with RIGHT SINGLE QUOTATION MARK (U+2019) and RIGHT DOUBLE QUOTATION MARK (U+201D); or finally with the proper HEBREW PUNCTUATION GERESH (U+05F3) and HEBREW PUNCTUATION GERSHAYIM (U+05F4).
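
For reference, the three ways of typing each mark can be listed with their Unicode code points and official character names. This is a small Python sketch; the `CANDIDATES` table and the `describe` helper are illustrative names, and the character names come from Python's standard `unicodedata` module.

```python
import unicodedata

# Characters commonly typed for each mark, as described in this report.
CANDIDATES = {
    "geresh": ["\u0027", "\u2019", "\u05F3"],
    "gershayim": ["\u0022", "\u201D", "\u05F4"],
}

def describe(ch):
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

for mark, chars in CANDIDATES.items():
    for ch in chars:
        print(f"{mark}: {describe(ch)}")
```

Only the last entry in each list is a dedicated Hebrew punctuation character; the others are general-purpose quotation characters.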

However, when the last pair (the dedicated Hebrew punctuation characters) is used, words fail the spelling check even though they are spelled correctly.



Steps to Reproduce:
Consider the words:

ג׳ירפה
דו״ח

Put them in an LO Writer document, run the spelling check, and observe the result.



Actual Results:
Both words fail the spell check.

Expected Results:
Both words pass the spell check.


Reproducible: Always


User Profile Reset: No



Additional Info:
You can compare this against 

ג'ירפה

with the APOSTROPHE, which does pass spell-checking. Unfortunately, however, if you try

דו"ח

you will hit bug 46950: the word is broken at the QUOTATION MARK and the two parts are spell-checked separately, so I cannot tell whether this variant passes the spelling check or not.
Comment 1 eladhen2 2021-02-15 15:44:26 UTC
Confirmed on 6.3.6.2.
Comment 2 Stéphane Guillou (stragu) 2023-12-15 18:03:03 UTC
Reproduced with:

Version: 7.6.4.1 (X86_64) / LibreOffice Community
Build ID: e19e193f88cd6c0525a17fb7a176ed8e6a3e2aa1
CPU threads: 8; OS: Linux 5.15; UI render: default; VCL: gtk3
Locale: en-AU (en_AU.UTF-8); UI: en-US
Calc: threaded

... with the 2017.09.03 version of the Hebrew spelling dictionary provided by the hebrew langpack,

... and the provided strings using Characters > Font > Language > Hebrew.

The same strings are not flagged as misspelled by MS Office 365 (online).
Comment 3 Jonathan Clark 2024-07-15 23:51:26 UTC
I investigated this bug as part of my work on bug 46950. Specifically, I wanted to determine if this was an LO-specific issue, or if it originated in an upstream project.

The root cause for this bug is incomplete upstream Hebrew dictionary data. Currently, the dictionary doesn't list geresh, gershayim, or the right double quotation mark as word characters.

To demonstrate this, I ran the following test directly against hunspell.

 $ hunspell -d he_IL 
 Hunspell 1.7.2
 ג'ירפה
 *
 ג’ירפה
 *
 ג׳ירפה
 & ג 15 0: ה, גו, גא, גע, גח, חג, גש, גס, גז, זג, גד, דג, גג, גב, גר
 *
 דו"ח
 *
 דו”ח
 *
 & ח 15 3: כ, חי, אח, קח, חש, שח, חס, חד, חג, גח, חב, נח, חט, טח, צח
 דו״ח
 *
 & ח 15 3: כ, חי, אח, קח, חש, שח, חס, חד, חג, גח, חב, נח, חט, טח, צח

This output shows the words containing apostrophe, right single quotation mark, and quotation mark were all interpreted correctly as a single word. However, words containing geresh, right double quotation mark, and gershayim were each incorrectly interpreted as two words.
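
The effect of a missing word character can be illustrated with a toy tokenizer. This is a simplified Python model, not Hunspell's actual tokenizer; `split_words` and `extra_word_chars` are hypothetical names, and the model assumes only that a character outside the word-character set ends the current word.

```python
def split_words(text, extra_word_chars=""):
    """Split text into runs of letters plus any extra word characters.

    Simplified model for illustration; not Hunspell's real algorithm.
    """
    words, current = [], []
    for ch in text:
        if ch.isalpha() or ch in extra_word_chars:
            current.append(ch)
        elif current:
            # A non-word character closes the word collected so far.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

# The word from this report, written with HEBREW PUNCTUATION GERESH (U+05F3).
giraffe = "\u05D2\u05F3\u05D9\u05E8\u05E4\u05D4"

print(split_words(giraffe))                    # geresh acts as a separator
print(split_words(giraffe, "\u05F3\u05F4"))    # geresh kept inside the word
```

In this model the first call yields two fragments and the second yields the whole word, which mirrors the one-word versus two-word behavior seen in the output above.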

I then edited my local he_IL.aff file to add geresh, right double quotation mark, and gershayim to the WORDCHARS line, and re-ran the above command:

 $ hunspell -d he_IL <sample.txt 
 Hunspell 1.7.2
 ג'ירפה
 *
 ג’ירפה
 *
 ג׳ירפה
 & ג׳ירפה 2 0: גירפה, ג'ירפה
 דו"ח
 *
 דו”ח
 & דו”ח 3 0: דוח, דווח, דו"ח
 דו״ח
 & דו״ח 3 0: דוח, דווח, דו"ח

With my modified he_IL.aff file, hunspell now correctly sees all cases as a single word (although it says they're spelled incorrectly).

Our Hebrew dictionary data comes from an upstream project, Hspell. In order to support these characters properly, I think it would be best to approach the Hspell maintainers with this request.
Comment 4 Stéphane Guillou (stragu) 2024-07-16 01:27:56 UTC
Thanks Jonathan.

Lior, is this something you can help with?
http://hspell.ivrix.org.il/ lists the address nyh@math.technion.ac.il for reporting issues, no idea how up to date that is.
Comment 5 Eyal Rozenberg 2024-07-16 20:15:55 UTC
(In reply to Jonathan Clark from comment #3)
> I then edited my local he_IL.aff file to add geresh, right double quotation
> mark, and gershayim to the WORDCHARS line

I already have APOSTROPHE and QUOTATION MARK in the WORDCHARS line, and that doesn't help with דו"ח and ג'ירפה. But adding the HEBREW PUNCTUATION GERESH and HEBREW PUNCTUATION GERSHAYIM does help when I use those.

> With my modified he_IL.aff file, hunspell now correctly sees all cases as a
> single word (although it says they're spelled incorrectly).

So, I don't... Perhaps there's something else in he_IL.aff that messes up the ' and " behavior?

Also, ג'ירפה and דו"ח, with ' and " rather than ׳ and ״ respectively, do exist in my he_IL.dic file as valid words. So there is also the matter of "canonicalizing" the character used for geresh or gershayim for dictionary lookup. I suppose that should be hunspell's job?
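
To make the "canonicalizing" idea concrete, here is a minimal Python sketch that folds the typographic and Hebrew-specific marks onto the ASCII forms before a lookup. The `CANONICAL` table and the `canonicalize` function are hypothetical names, and the choice of ASCII as the target is an assumption based only on the statement above that the ASCII spellings are present in he_IL.dic. It does not describe what hunspell or LibreOffice actually do.

```python
# Fold the alternative marks onto APOSTROPHE and QUOTATION MARK.
CANONICAL = str.maketrans({
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u05F3": "'",   # HEBREW PUNCTUATION GERESH
    "\u201D": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u05F4": '"',   # HEBREW PUNCTUATION GERSHAYIM
})

def canonicalize(word):
    """Return the word with geresh/gershayim variants mapped to ASCII."""
    return word.translate(CANONICAL)

# Word with a proper gershayim (U+05F4) becomes the ASCII-quote spelling.
report = "\u05D3\u05D5\u05F4\u05D7"
assert canonicalize(report) == "\u05D3\u05D5\"\u05D7"
```

A lookup would then be done on `canonicalize(word)` rather than on the word as typed, so all three spellings of each mark resolve to the same dictionary entry.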


> Our Hebrew dictionary data comes from an upstream project, Hspell. In order
> to support these characters properly, I think it would be best to approach
> the Hspell maintainers with this request.

I remember I've gotten confused by hspell-vs-hunspell in the past. On my system, I have hunspell installed. Are these alternatives? Subprojects of each other? Complementary projects?
Comment 6 Eyal Rozenberg 2024-07-16 22:22:30 UTC
Some testing with hspell (not hunspell):

$ cat a.txt | recode ISO-8859-8..utf-8
דו"ח
ג'ירפה
$ hspell -l a.txt  | recode ISO-8859-8..utf-8
מילה חוקית: דו"ח
	דו"ח(ע,ז,יחיד)
	דו"ח(ע,ז,יחיד,סמיכות)
מילה חוקית: ג'ירפה
	ג'ירפה(ע,נ,יחיד)

which tells us that both words are "legal words" (מילה חוקית) with some morphological information.

Now, it's interesting to note that hspell only seems to accept ISO-8859-8 encoding for input, which does _not_ have HEBREW PUNCTUATION GERESH or HEBREW PUNCTUATION GERSHAYIM. Maybe this has something to do with that aspect of the bug.
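
The encoding limitation can be checked independently of hspell. The Python sketch below uses the standard library's `iso-8859-8` codec; `encodable` is an illustrative helper name. It says nothing about hspell's internals, only about which characters the encoding can represent.

```python
def encodable(ch, encoding="iso-8859-8"):
    """Return True if the character can be represented in the encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

for label, ch in [
    ("APOSTROPHE", "\u0027"),
    ("QUOTATION MARK", "\u0022"),
    ("HEBREW LETTER GIMEL", "\u05D2"),
    ("HEBREW PUNCTUATION GERESH", "\u05F3"),
    ("HEBREW PUNCTUATION GERSHAYIM", "\u05F4"),
]:
    print(f"{label}: {'yes' if encodable(ch) else 'no'}")
```

The ASCII marks and the Hebrew letters are representable, while the two dedicated Hebrew punctuation characters are not, so text containing them cannot be passed to a tool that only reads ISO-8859-8 without first substituting another character.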