Bug 40504 - Spell checker marks Hebrew and Arabic words with vowel marks as errors
Status: RESOLVED NOTABUG
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Linguistic
Version (earliest affected): 3.5.0 Beta2
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Not Assigned
Reported: 2011-08-30 20:29 UTC by Yotam Benshalom
Modified: 2012-05-08 12:54 UTC

Attachments
A patch by tkos (2.88 KB, patch)
2012-01-11 13:37 UTC, Lior Kaplan

Description Yotam Benshalom 2011-08-30 20:29:31 UTC
In Hebrew, Arabic and other Semitic languages the letters represent the consonants. Vowel marks (nikkud in Hebrew) are optional, and are usually used in special cases, like in poetry or texts intended for children.

The spell checker database contains words without vowel marks, which is consistent with other spell checkers. However, it marks correctly spelled words as errors once vowel marks are added to them.

For example, the word שָׁלוֹם is marked as an error but the word שלום is not.

The spell checker should ignore all ranges of vowel marks while looking for errors. I heard a rumour that this was corrected in an ancient version of OpenOffice, but I have never seen it.
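
To make the requested behaviour concrete, here is a minimal Python sketch of a lookup that ignores vowel marks. It is not LibreOffice or Hspell code. The names `strip_vowel_marks`, `is_known` and `DICTIONARY` are hypothetical, and the exact set of code points treated as vowel marks is an assumption (Hebrew points and Arabic harakat only, no cantillation marks).

```python
import re

# Assumed set of "vowel marks": Hebrew points and Arabic harakat.
# Punctuation in the same block (maqaf, paseq, sof pasuq) is left alone.
VOWEL_MARKS = re.compile(
    "[\u05B0-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7"  # Hebrew niqqud
    "\u064B-\u0652]"                                       # Arabic harakat
)


def strip_vowel_marks(word: str) -> str:
    """Return the word with all assumed vowel marks removed."""
    return VOWEL_MARKS.sub("", word)


def is_known(word: str, dictionary: set) -> bool:
    """Look the word up after removing its vowel marks."""
    return strip_vowel_marks(word) in dictionary


# Toy dictionary holding only the unvowelled word from the report.
DICTIONARY = {"\u05E9\u05DC\u05D5\u05DD"}                 # shalom, no niqqud
VOWELLED = "\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD"   # shalom with niqqud

if __name__ == "__main__":
    print(is_known(VOWELLED, DICTIONARY))
```

With this lookup, the vowelled and unvowelled forms of the example word are treated identically, which is exactly what this report asks for.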
Comment 1 ⁨خالد حسني⁩ 2011-12-12 07:47:13 UTC
I think this is more of a hunspell/spelling dictionary issue than a LibreOffice one. For example, the Arabic dictionary ignores vowel marks except in a few cases that are guaranteed to always be wrong (e.g. كتابً is marked as wrong but كتابٌ is not), so it is up to the spelling dictionary author whether to ignore combining marks or not.
Comment 2 Björn Michaelsen 2011-12-23 12:39:49 UTC
[This is an automated message.]
This bug was filed before the changes to Bugzilla on 2011-10-16. Thus it
started right out as NEW without ever being explicitly confirmed. The bug is
changed to state NEEDINFO for this reason. To move this bug from NEEDINFO back
to NEW please check if the bug still persists with the 3.5.0 beta1 or beta2 prereleases.
Details on how to test the 3.5.0 beta1 can be found at:
http://wiki.documentfoundation.org/QA/BugHunting_Session_3.5.0.-1

more detail on this bulk operation: http://nabble.documentfoundation.org/RFC-Operation-Spamzilla-tp3607474p3607474.html
Comment 3 Yotam Benshalom 2011-12-23 12:43:08 UTC
This bug still exists in LibreOffice 3.5.0, build 7362ca8-b5a8e65-af86909-d471f98-61464c4.
Comment 4 Yotam Benshalom 2011-12-28 19:03:08 UTC
Still exists on Beta 2.
Khaled, do you happen to know where exactly I should post this problem? If this is not a LO problem, I am not sure which bug tracking system I need.
Comment 5 ⁨خالد حسني⁩ 2011-12-29 02:46:46 UTC
(In reply to comment #4)
> Khaled, do you happen to know where exactly I should post this problem? If this
> is not a LO problem, I am not sure now where is the bug tracking system I need.

The Hebrew spell dictionary in LibreOffice points to the Hspell project, http://hspell.ivrix.org.il/, so I think this is where this issue should be reported. (Their FAQ talks about being based on 'niqqud-less spelling rules', so this might not be a bug but a conscious decision, but I know very little about Hebrew spelling rules.)
Comment 6 Lior Kaplan 2012-01-11 13:37:30 UTC
Created attachment 55467
A patch by tkos

This patch was used as part of the Hebrew oo.org done by tkos. I think it's a good start to solve the issue (even if it's Hebrew oriented). A license clearance can be provided as needed.
Comment 7 Nadav Har'El 2012-03-04 01:40:22 UTC
Hi, I'm Nadav Har'El, the co-author of Hspell (hspell.ivrix.org.il), on which the spell-checking dictionary you're using is based.

What to do with words with niqqud (or partial niqqud) is an issue we've long been wondering about. What is the best thing to do - mark all of them as errors? Mark all of them as correct? Anything else?
As I'll explain now, we decided that marking all of them as *errors* is actually the best thing to do, so we don't consider this a bug.

Clearly, the best option would have been for Hspell to also know the right niqqud, and accept שלום with a qamats on the shin, but *not* with a patach. However, this will require Hspell to know the correct niqqud, which it currently doesn't - and adding it will require a huge effort, which isn't likely in the near future.

A second option could have been to simply ignore any niqqud marks, and spell-check only the letters as if the niqqud isn't there. I believe this is what you want. There are several serious problems with this suggestion: First, it will lead the spell-checker to "accept" any niqqud, including wrong niqqud (e.g., שלום with patach). A user who is not aware of this will believe the computer is confirming that he wrote the right niqqud... The second problem is that words with full niqqud (as opposed to partial niqqud) also have different underlying spelling (fewer imot qri'a), so if you try to spell-check fully vowelled text, you'll get many of your *correct* words marked as spelling errors, when in fact they are not. For these two reasons, I don't like this option: It implies that we can spell-check partial niqqud, while we can't, and it implies that we can spell-check completely vowelled text, while in fact we can't.

The currently-used third option is to mark any vowelled word as an error. The user sees this "error", just like any unknown proper name and such are marked as "errors", and has the option to verify this spelling on his own and add it to his private dictionary. Indeed, long vowelled text will be entirely marked as errors, but this is good because Hspell indeed can't verify its spelling! If the user wants to exempt such paragraphs from being spell-checked, he can easily do that - a text style can choose the "language" which it uses, and the user can define a style which says its text is not in Hebrew (or in Hebrew, without spellchecking).

A fourth option could have been to consider *any* word with niqqud as a correct word. This eliminates the second problem of the 2nd option (you don't get random errors on correct fully-vowelled text) and I actually prefer it over the 2nd option, but you still have the first problem - that the user *thinks* his text is being spell-checked, when in fact it isn't.

So to summarize, I do believe that the current behaviour is reasonable, and not a bug.
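
The trade-offs between options 2, 3 and 4 above can be illustrated with a small Python sketch. This is not Hspell code. `Policy`, `is_accepted` and the one-word `DICTIONARY` are hypothetical names, and the niqqud code-point set is an assumption. Option 1 is left out because it needs niqqud data that Hspell does not have.

```python
import re
from enum import Enum

# Assumed set of Hebrew niqqud code points (points only, no cantillation).
NIQQUD = re.compile("[\u05B0-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7]")


class Policy(Enum):
    IGNORE_MARKS = 2      # option 2: check the letters only
    REJECT_VOWELLED = 3   # option 3: any vowelled word is flagged (current)
    ACCEPT_VOWELLED = 4   # option 4: any vowelled word is accepted


def is_accepted(word: str, dictionary: set, policy: Policy) -> bool:
    """Return True if the word would NOT be marked as an error."""
    if not NIQQUD.search(word):
        return word in dictionary          # unvowelled: plain lookup
    if policy is Policy.IGNORE_MARKS:
        return NIQQUD.sub("", word) in dictionary
    if policy is Policy.REJECT_VOWELLED:
        return False
    return True                            # Policy.ACCEPT_VOWELLED


DICTIONARY = {"\u05E9\u05DC\u05D5\u05DD"}                    # shalom, no niqqud
WITH_QAMATS = "\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD"   # correct niqqud
WITH_PATACH = "\u05E9\u05B7\u05C1\u05DC\u05D5\u05B9\u05DD"   # wrong niqqud

if __name__ == "__main__":
    for policy in Policy:
        print(policy.name,
              is_accepted(WITH_QAMATS, DICTIONARY, policy),
              is_accepted(WITH_PATACH, DICTIONARY, policy))
```

Under options 2 and 4 the form with the wrong vowel (patach) is accepted just like the correct one, which is the false reassurance described above. Under option 3 both vowelled forms are flagged and left for the user to verify.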
Comment 8 Michael Meeks 2012-05-08 12:54:34 UTC
Could we not write a set of simple lightproof rules for Hebrew to capture this? [If the rendering is context sensitive, it seems that might be a better approach.] Based on Nadav's input, I'm closing this NOTABUG for now :-) That is, until someone pokes at lightproof to add their rule.