Description:
In Hebrew, a Geresh mark may be used to signify an altered sound of a consonant, typically in a word borrowed from another language. A Gershayim mark may be used to indicate an acronym.

In practice, people typing Hebrew on a keyboard often represent these marks with APOSTROPHE (U+0027) and QUOTATION MARK (U+0022) respectively; or with RIGHT SINGLE QUOTATION MARK (U+2019) and RIGHT DOUBLE QUOTATION MARK (U+201D); or with the proper HEBREW PUNCTUATION GERESH (U+05F3) and HEBREW PUNCTUATION GERSHAYIM (U+05F4).

When the latter two characters are used, words fail the spelling check even when they shouldn't.

Steps to Reproduce:
Consider the words:

ג׳ירפה
דו״ח

Put them in an LO Writer document and apply spell-checking.

Actual Results:
Both words fail the spell check.

Expected Results:
Both words pass the spell check.

Reproducible: Always

User Profile Reset: No

Additional Info:
You can compare this against ג'ירפה with the APOSTROPHE, which does pass spell-checking. Unfortunately, if you try דו"ח you will hit bug 46950: the word is broken at the QUOTATION MARK and the two parts are spell-checked separately, so I can't tell whether this variant passes the spelling check or not.
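For reference, the six characters named in the description can be listed with their code points straight from Python's `unicodedata` module. This is only a lookup aid; the helper name `describe` is made up for illustration.

```python
import unicodedata

# The three geresh-like / gershayim-like pairs mentioned in the description.
MARKS = ["\u0027", "\u0022", "\u2019", "\u201D", "\u05F3", "\u05F4"]

def describe(ch):
    """Return the code point, Unicode name and general category of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} (category {unicodedata.category(ch)})"

for ch in MARKS:
    print(describe(ch))
```

None of the six is classified as a letter, which is consistent with a spell checker needing to be told explicitly that they may appear inside a word.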
Confirmed on 6.3.6.2.
Reproduced with:

Version: 7.6.4.1 (X86_64) / LibreOffice Community
Build ID: e19e193f88cd6c0525a17fb7a176ed8e6a3e2aa1
CPU threads: 8; OS: Linux 5.15; UI render: default; VCL: gtk3
Locale: en-AU (en_AU.UTF-8); UI: en-US
Calc: threaded

... with the 2017.09.03 version of the Hebrew spelling dictionary provided by the Hebrew langpack,
... and the provided strings set to Hebrew using Characters > Font > Language > Hebrew.

The same strings are not flagged as misspelled by MS Office 365 (online).
I investigated this bug as part of my work on bug 46950. Specifically, I wanted to determine whether this was an LO-specific issue or whether it originated in an upstream project.

The root cause of this bug is incomplete upstream Hebrew dictionary data. Currently, the dictionary doesn't list geresh, gershayim, or the right double quotation mark as word characters.

To demonstrate this, I ran the following test directly against hunspell.

$ hunspell -d he_IL
Hunspell 1.7.2
ג'ירפה
*
ג’ירפה
*
ג׳ירפה
& ג 15 0: ה, גו, גא, גע, גח, חג, גש, גס, גז, זג, גד, דג, גג, גב, גר
*
דו"ח
*
דו”ח
*
& ח 15 3: כ, חי, אח, קח, חש, שח, חס, חד, חג, גח, חב, נח, חט, טח, צח
דו״ח
*
& ח 15 3: כ, חי, אח, קח, חש, שח, חס, חד, חג, גח, חב, נח, חט, טח, צח

This output shows that the words containing apostrophe, right single quotation mark, and quotation mark were all interpreted correctly as a single word. However, the words containing geresh, right double quotation mark, and gershayim were each incorrectly interpreted as two words.

I then edited my local he_IL.aff file to add geresh, right double quotation mark, and gershayim to the WORDCHARS line, and re-ran the above command:

$ hunspell -d he_IL <sample.txt
Hunspell 1.7.2
ג'ירפה
*
ג’ירפה
*
ג׳ירפה
& ג׳ירפה 2 0: גירפה, ג'ירפה
דו"ח
*
דו”ח
& דו”ח 3 0: דוח, דווח, דו"ח
דו״ח
& דו״ח 3 0: דוח, דווח, דו"ח

With my modified he_IL.aff file, hunspell now correctly sees all cases as a single word (although it says they're spelled incorrectly).

Our Hebrew dictionary data comes from an upstream project, Hspell. In order to support these characters properly, I think it would be best to approach the Hspell maintainers with this request.
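The effect of the WORDCHARS change can be illustrated with a toy word splitter. This is a simplified model written for this thread, not hunspell's actual tokenizer: it assumes a word is a run of letters plus whatever extra characters are declared, and the names `split_words`, `BEFORE` and `AFTER` are hypothetical.

```python
def split_words(text, wordchars=""):
    """Split text into runs of letters plus any extra characters in `wordchars`."""
    words, current = [], []
    for ch in text:
        if ch.isalpha() or ch in wordchars:
            current.append(ch)
        elif current:
            # A non-word character ends the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

# Assumed starting point: apostrophe, quotation mark and U+2019 already count as word characters.
BEFORE = "'\"\u2019"
# After the local edit: geresh, gershayim and right double quotation mark are added.
AFTER = BEFORE + "\u05F3\u05F4\u201D"

giraffe = "\u05D2\u05F3\u05D9\u05E8\u05E4\u05D4"  # the test word spelled with U+05F3 GERESH

print(split_words(giraffe, BEFORE))  # splits at the geresh into two pieces
print(split_words(giraffe, AFTER))   # stays a single word
```

In this model the word only reaches the dictionary lookup intact once the mark is declared as a word character, which mirrors the difference between the two hunspell runs above. Whether the intact word is then accepted is a separate question of dictionary content.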
Thanks Jonathan. Lior, is this something you can help with? http://hspell.ivrix.org.il/ lists the address nyh@math.technion.ac.il for reporting issues, no idea how up to date that is.
(In reply to Jonathan Clark from comment #3)
> I then edited my local he_IL.aff file to add geresh, right double quotation
> mark, and gershayim to the WORDCHARS line

I already have APOSTROPHE and QUOTATION MARK in the WORDCHARS line, and that doesn't help with דו"ח and ג'ירפה. But adding HEBREW PUNCTUATION GERESH and HEBREW PUNCTUATION GERSHAYIM does help when I use those.

> With my modified he_IL.aff file, hunspell now correctly sees all cases as a
> single word (although it says they're spelled incorrectly).

So, I don't... perhaps there's something else in he_IL.aff that messes up the ' and " behavior?

Also, ג'ירפה and דו"ח, with ' and " rather than ׳ and ״ respectively, do exist in my he_IL.dic file as valid words. So there is also the matter of "canonicalizing" the character used for geresh or gershayim before dictionary lookup. I suppose that's supposed to be hunspell's job?

> Our Hebrew dictionary data comes from an upstream project, Hspell. In order
> to support these characters properly, I think it would be best to approach
> the Hspell maintainers with this request.

I remember getting confused by hspell vs. hunspell in the past. On my system, I have hunspell installed. Are these alternatives? Subprojects of each other? Complementary projects?
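To make the "canonicalizing" idea concrete, here is a minimal sketch of mapping the typographic and Hebrew-specific marks onto the ASCII characters that the .dic entries use. It only illustrates the mapping and assumes the dictionary stores the ASCII forms, as described above. The names `CANONICAL` and `canonicalize` are hypothetical, and nothing here is how LibreOffice or hunspell currently behaves.

```python
# Map every geresh-like mark to APOSTROPHE and every gershayim-like mark to QUOTATION MARK.
CANONICAL = str.maketrans({
    "\u05F3": "'",   # HEBREW PUNCTUATION GERESH
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u05F4": '"',   # HEBREW PUNCTUATION GERSHAYIM
    "\u201D": '"',   # RIGHT DOUBLE QUOTATION MARK
})

def canonicalize(word):
    """Return `word` with the marks replaced by their ASCII stand-ins."""
    return word.translate(CANONICAL)

# The acronym from the report, spelled with U+05F4 GERSHAYIM.
print(canonicalize("\u05D3\u05D5\u05F4\u05D7"))
```

hunspell's affix format does have an ICONV input-conversion table, which might be one place for such a mapping, though nothing in this thread confirms that the Hebrew dictionary uses it.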
Some testing with hspell (not hunspell):

$ cat a.txt | recode ISO-8859-8..utf-8
דו"ח
ג'ירפה
$ hspell -l a.txt | recode ISO-8859-8..utf-8
מילה חוקית: דו"ח
    דו"ח(ע,ז,יחיד)
    דו"ח(ע,ז,יחיד,סמיכות)
מילה חוקית: ג'ירפה
    ג'ירפה(ע,נ,יחיד)

which tells us that both words are "legal words" (מילה חוקית), along with some morphological information.

Now, it's interesting to note that hspell only seems to accept ISO-8859-8 encoded input, and that encoding has _neither_ HEBREW PUNCTUATION GERESH _nor_ HEBREW PUNCTUATION GERSHAYIM. Maybe this has something to do with that aspect of the bug.
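The encoding point is easy to check with Python's built-in `iso8859_8` codec. This is a quick sanity check rather than a statement about hspell internals, and the helper name `encodable` is made up.

```python
def encodable(ch, encoding="iso8859_8"):
    """Return True if `ch` can be represented in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

for label, ch in [
    ("APOSTROPHE", "\u0027"),
    ("QUOTATION MARK", "\u0022"),
    ("HEBREW PUNCTUATION GERESH", "\u05F3"),
    ("HEBREW PUNCTUATION GERSHAYIM", "\u05F4"),
]:
    print(label, encodable(ch))
```

The ASCII marks encode fine while the two Hebrew punctuation characters do not, so text that uses them cannot be handed to an ISO-8859-8-only tool without first substituting something else.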
Ping.