It appears that before LibreOffice passes text to the spell-checker, it breaks the text into separate words. The problem is that (apparently) it does this using general language-agnostic rules, while different languages may have different rules about which characters can be part of a word and which characters break words.

My problem is specifically with Hebrew spell-checking. In Hebrew, the quote characters ' and " are used not just for quoting, but also have an unrelated use as in-word characters:

1. The single quote is used to mark foreign sounds. E.g., the word ג'ירפה has a single-quote character after the gimmel, which means it should be pronounced "j", not "g".
2. The double quote is used inside acronyms, to mark them as such. For example, מנכ"ל is the acronym for CEO, and מנכ"לים is its plural. Both have a quote in the middle of the word, and these words, including the quote, are in the dictionary.

Because of this, the Hebrew hunspell dictionary includes the following lines in he_IL.aff:

    BREAK 3
    BREAK ^"
    BREAK "$
    BREAK ^'

This means that " only breaks words when it is at the beginning or end of a word (and ' only at the beginning). In the middle of a word, these characters never mean a word break in Hebrew. With this setting, hunspell correctly word-breaks and spell-checks Hebrew text.

Unfortunately, LibreOffice doesn't respect these instructions. It appears that it incorrectly breaks up the words before sending them to hunspell. The end result is that all Hebrew words which are acronyms or contain foreign sounds are incorrectly marked as errors, which is very annoying.
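As a rough illustration of the BREAK lines quoted above, the following Python sketch models anchored patterns (`^"`, `"$`, `^'`) as stripping a character only at the start or end of a word, while an unanchored pattern splits the word wherever it occurs. This is a simplified model for discussion, not Hunspell's actual implementation; the function name `apply_break_rules` and the `GENERIC_RULES` list are made up for this example.

```python
def apply_break_rules(word, rules):
    """Return the parts that would be looked up, under a simplified BREAK model.

    A rule starting with '^' strips its text from the start of the word,
    a rule ending with '$' strips its text from the end, and any other
    rule splits the word wherever its text occurs.
    """
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule.startswith("^") and len(rule) > 1:
                text = rule[1:]
                if word.startswith(text) and len(word) > len(text):
                    word = word[len(text):]
                    changed = True
            elif rule.endswith("$") and len(rule) > 1:
                text = rule[:-1]
                if word.endswith(text) and len(word) > len(text):
                    word = word[:-len(text)]
                    changed = True

    parts = [word]
    for rule in rules:
        if rule.startswith("^") or rule.endswith("$"):
            continue  # anchored rules were already handled above
        parts = [p for part in parts for p in part.split(rule) if p]
    return parts


HEBREW_RULES = ['^"', '"$', "^'"]  # the he_IL.aff rules quoted in the report
GENERIC_RULES = ['"', "'"]         # hypothetical language-agnostic splitting

print(apply_break_rules('"מנכ"ל"', HEBREW_RULES))   # quoted acronym stays one word
print(apply_break_rules('מנכ"ל', GENERIC_RULES))    # acronym is cut in two
```

Under the Hebrew rules the surrounding quotes are stripped and the acronym reaches the dictionary lookup whole; under the generic rules it is cut at the inner quote, which matches the symptom described in this report.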
On second thought, why do you use ' and " instead of geresh/gershaim? But the problem is that the spellchecker breaks words on them too, even if that is explicitly prohibited.
Well, it's just that despite the existence of the separate "geresh" and "gershayim" characters in Unicode, I've never seen anyone actually using them. Everyone I've seen uses the normal ASCII single-quote and double-quote characters respectively, and expects those to look fine in Hebrew fonts - and they do. The reason people don't use the special Unicode characters is probably that there is usually no convenient method to enter them with the keyboard. You're right that it should also be checked what happens with these special characters - the spell-checker shouldn't break such words, and it should accept them even though the dictionary contains words with the ASCII single-quote/double-quote, not with geresh/gershayim. This should perhaps become a separate bug, if it doesn't work properly.
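One way to picture the suggestion above (accepting geresh/gershayim even though the dictionary stores the ASCII characters) is to fold the dedicated Hebrew marks onto their ASCII counterparts before lookup. A minimal Python sketch, assuming a toy in-memory word set rather than the real Hebrew dictionary; `TOY_DICTIONARY` and `is_known` are hypothetical names, and this is not a description of how LibreOffice or Hunspell handle it:

```python
# Map the dedicated Hebrew punctuation marks onto the ASCII characters
# that people usually type (and that the dictionary is said to contain).
FOLD = str.maketrans({
    "\u05f3": "'",   # HEBREW PUNCTUATION GERESH
    "\u05f4": '"',   # HEBREW PUNCTUATION GERSHAYIM
})

# Toy stand-in for the dictionary; the real word list is far larger.
TOY_DICTIONARY = {"ג'ירפה", 'מנכ"ל'}


def is_known(word):
    """Look the word up after folding geresh/gershayim to ASCII quotes."""
    return word.translate(FOLD) in TOY_DICTIONARY


print(is_known("ג\u05f3ירפה"))   # word written with a real geresh
print(is_known("מנכ\u05f4ל"))    # acronym written with a real gershayim
```

Folding at lookup time keeps the dictionary unchanged; the alternative is to add the extra spellings to the dictionary data itself.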
Reported in the past with OO.org at https://issues.apache.org/ooo/show_bug.cgi?id=51772 and https://issues.apache.org/ooo/show_bug.cgi?id=99796. The first also has patches which might still be relevant.
Please read this message in its entirety before responding.

Your bug was confirmed at least 1 year ago and has not had any activity on it for over a year. Your bug is still set to NEW which means that it is open and confirmed. It would be nice to have the bug confirmed on a newer version than the version reported in the original report to know that the bug is still present -- sometimes a bug is inadvertently fixed over time and just never closed.

If you have time please do the following:
1) Test to see if the bug is still present on a currently supported version of LibreOffice (preferably 4.2 or newer).
2) If it is present please leave a comment telling us what version of LibreOffice and your operating system.
3) If it is NOT present please set the bug to RESOLVED-WORKSFORME and leave a short comment telling us your version and Operating System

Please DO NOT
1) Update the version field
2) Reply via email (please reply directly on the bug tracker)
3) Set the bug to RESOLVED - FIXED (this status has a particular meaning that is not appropriate in this case)

LibreOffice is powered by a team of volunteers, every bug is confirmed (triaged) by human beings who mostly give their time for free. We invite you to join our triaging by checking out this link: https://wiki.documentfoundation.org/QA/BugTriage There are also other ways to get involved including with marketing, UX, documentation, and of course developing - http://www.libreoffice.org/get-help/mailing-lists/. Lastly, good bug reports help tremendously in making the process go smoother, please always provide reproducible steps (even if it seems easy) and attach any and all relevant material
Though the bug is quite old, it is still present in versions 4.3.5.2 and 4.5.0.0 (master build). To reproduce: 1. Open LibreOffice Writer. 2. Type in a Hebrew acronym, like פלמ"ח. Even if the language is set to Hebrew, the acronym is underlined with red, indicating a spelling error. It also separates between the letters before and after the "Gershayim", marking them as two words instead of one. I tried to reproduce on other programs with spell-checking capabilities, such as Gedit, and it seems the problem is there as well. Perhaps the underlying engine is at fault, and not LibreOffice itself. I am using Linux Mint 17.1, 32-bit.
Indeed, this bug still exists, and is still very annoying to Hebrew users! As I explained in detail in the original bug report, I believe this is *not* a problem of the underlying engine (aspell, based on data from the hspell project) but rather of libreoffice's own word split algorithm, which apparently doesn't respect Aspell's declaration of in-word characters, nor does it support the correct word-split rules for Hebrew (where certain seemingly-punctuation characters may be parts of words). I'm not familiar with the code involved, but https://www.libreoffice.org/bugzilla/show_bug.cgi?id=62360 points to the place in the libreoffice code which might need to be fixed.
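To make the "correct word-split rules for Hebrew" idea concrete, here is a small Python sketch of a tokenizer that treats a quote-like mark as part of a word only when it sits between Hebrew letters. It is an illustration of the rule being asked for, not LibreOffice's word-break code; the names `WORD_RE` and `hebrew_words` and the exact set of accepted marks are assumptions made for this example.

```python
import re

HEBREW = "\u05d0-\u05ea"                 # Hebrew letters alef..tav
INTRA = "'\"\u2019\u201d\u05f3\u05f4"    # marks that may sit inside a word

# A word is a run of Hebrew letters, optionally continued by a single
# intra-word mark that is immediately followed by more Hebrew letters.
WORD_RE = re.compile(f"[{HEBREW}]+(?:[{INTRA}][{HEBREW}]+)*")


def hebrew_words(text):
    """Return the Hebrew words in text, keeping in-word quote marks."""
    return WORD_RE.findall(text)


# The outer quotes are dropped, the inner ones stay inside the word.
print(hebrew_words('"מנכ"ל" ראה ג\'ירפה'))
```

A quote at the start or end of a word is not matched, so ordinary quoting still works, while the acronym and the foreign-sound word each come out as a single token. A word that ends in a geresh (as some abbreviations do) is not covered by this simple pattern.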
** Please read this message in its entirety before responding ** To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year. There have been thousands of bug fixes and commits since anyone checked on this bug report. During that time, it's possible that the bug has been fixed, or the details of the problem have changed. We'd really appreciate your help in getting confirmation that the bug is still present. If you have time, please do the following: Test to see if the bug is still present on a currently supported version of LibreOffice (5.0.4 or later) https://www.libreoffice.org/download/ If the bug is present, please leave a comment that includes the version of LibreOffice and your operating system, and any changes you see in the bug behavior If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave a short comment that includes your version of LibreOffice and Operating System Please DO NOT: - Update the version field - Reply via email (please reply directly on the bug tracker) - Set the bug's Status field to RESOLVED - FIXED (this status has a particular meaning that is not appropriate in this case) If you want to do more to help you can test to see if your issue is a REGRESSION. To do so: 1. Download and install oldest version of LibreOffice (usually 3.3 unless your bug pertains to a feature added after 3.3) http://downloadarchive.documentfoundation.org/libreoffice/old/ 2. Test your bug 3. Leave a comment with your results. 4a. If the bug was present with 3.3 - set version to "inherited from OOo"; 4b. If the bug was not present in 3.3 - add "regression" to keyword Feel free to come ask questions or to say hello in our QA chat: http://webchat.freenode.net/?channels=libreoffice-qa Thank you for your help! -- The LibreOffice QA Team This NEW Message was generated on: 2016-01-17
Still happens with LibO 5.0.x. To test: use the word ג'ירפה and see that the quote makes the spell checker think it's two words.
Indeed, this bug still exists, and is still very annoying. Non-Hebrew speakers might not appreciate the significance of this bug, but a certain percentage of Hebrew words (unfortunately I can't quote a good estimate) simply contain the single-quote or double-quote characters. I gave examples above - certain words with foreign-language sounds, and all acronyms. LibreOffice will mark all these words as wrong, which not only prevents spell-checking such words, it also lowers the user's overall confidence in the spellchecker because he or she will so often see correctly-written words red-marked.
Still reproducible. Version: 5.4.0.3 Build ID: 1:5.4.0-1
Still happens in: Version: 6.0.0.0.alpha1+ Build ID: 9050854c35c389466923f0224a36572d36cd471a CPU threads: 8; OS: Linux 4.9; UI render: default; VCL: gtk3; Locale: en-US (en_US.utf8); Calc: group OS: Debian 64bit Stretch (Debian 9.2, with some backported packages) But with some changes. 1. The word פלמ"ח is still not handled correctly. 2. The word ג'ירפה is now handled correctly. Seems that Writer now converts the single quote into geresh.
** Please read this message in its entirety before responding ** To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year. There have been thousands of bug fixes and commits since anyone checked on this bug report. During that time, it's possible that the bug has been fixed, or the details of the problem have changed. We'd really appreciate your help in getting confirmation that the bug is still present. If you have time, please do the following: Test to see if the bug is still present with the latest version of LibreOffice from https://www.libreoffice.org/download/ If the bug is present, please leave a comment that includes the information from Help - About LibreOffice. If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave a comment that includes the information from Help - About LibreOffice. Please DO NOT Update the version field Reply via email (please reply directly on the bug tracker) Set the bug's Status field to RESOLVED - FIXED (this status has a particular meaning that is not appropriate in this case) If you want to do more to help you can test to see if your issue is a REGRESSION. To do so: 1. Download and install oldest version of LibreOffice (usually 3.3 unless your bug pertains to a feature added after 3.3) from http://downloadarchive.documentfoundation.org/libreoffice/old/ 2. Test your bug 3. Leave a comment with your results. 4a. If the bug was present with 3.3 - set version to 'inherited from OOo'; 4b. If the bug was not present in 3.3 - add 'regression' to keyword Feel free to come ask questions or to say hello in our QA chat: https://kiwiirc.com/nextclient/irc.freenode.net/#libreoffice-qa Thank you for helping us make LibreOffice even better for everyone! Warm Regards, QA Team MassPing-UntouchedBug
The bug still exists in LibreOffice 6.1.2.1. As Omer Zak noted above, the bug was *fixed* for the single quote, e.g., ג'ירפה or סח'נין are now correctly recognized as correctly spelled. This is a welcome improvement. However, the bug still exists for double-quotes, e.g., מנכ"לים or פלמ"ח are still split into two words which are spell-checked individually.
(In reply to Nadav Har'El from comment #14) > As Omer Zak noted above, the bug was *fixed* for the single quote, e.g., > ג'ירפה or סח'נין are now correctly recognized as correctly spelled. If you can bisect this fix with daily builds, you can probably figure out who exactly fixed it and where. If you do that, perhaps we'd be able to either: * Formulate a patch to handle the double-quote case as well; or * Contact the developer who introduced that patch to ask for their help more specifically.
So, Nadav has not replied to my last comment, so let me summarize the state of affairs, in LO 7.1:

* There are (at least) three ways to signify a Geresh within a word: APOSTROPHE (U+0027), RIGHT SINGLE QUOTATION MARK (U+2019), and HEBREW PUNCTUATION GERESH (U+05F3).
* Similarly, there are (at least) three ways to signify a Gershaim within a word: DOUBLE QUOTATION MARK (U+0022), RIGHT DOUBLE QUOTATION MARK (U+201D), and HEBREW PUNCTUATION GERSHAIM (U+05F4).
* LibreOffice writer _is_ breaking up words using APOSTROPHE or DOUBLE QUOTATION MARK. In an ideal world, these would not be used for Geresh or Gershaim, but since these are commonly used in practice - it is a bug.
* LibreOffice writer _is_ breaking up words using RIGHT DOUBLE QUOTATION MARK - this is a bug. Due to this bug, the two parts of the words are spell-checked separately.
* LibreOffice writer is _not_ breaking up words using RIGHT SINGLE QUOTATION MARK, and spell-checking succeeds on them (at least in my anecdotal checking; consider ג’ירפה for example).
* LibreOffice writer is _not_ breaking up words using HEBREW PUNCTUATION GERESH and HEBREW PUNCTUATION GERSHAIM - but spell-checking still _fails_ on them: ג׳ירפה , דו״ח

The last item is a different phenomenon from the one Nadav Har'El first identified. It may be worth splitting off into a separate bug.

Version info:
Version: 7.1.0.3 / LibreOffice Community
Build ID: f6099ecf3d29644b5008cc8f48f42f4a40986e4c
CPU threads: 4; OS: Linux 5.9; UI render: default; VCL: gtk3
Locale: he-IL (en_IL); UI: en-US
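For reference, the six characters discussed above can be listed with their code points using Python's standard `unicodedata` module. The names it prints are the formal Unicode character names, which may be spelled slightly differently from the informal names used in this comment.

```python
import unicodedata

# The three geresh-like and three gershayim-like characters, by code point.
CANDIDATES = ["\u0027", "\u2019", "\u05f3", "\u0022", "\u201d", "\u05f4"]


def describe(ch):
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in CANDIDATES:
    print(describe(ch))
```

Running this on a test document's text is also a quick way to check which of the look-alike characters a given word actually contains, since they are hard to tell apart by eye.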
Created attachment 169707 [details] Document exhibiting the different manifestations of the bug This covers all 3 ways to signify both symbols.
Created attachment 169708 [details] Test document rendered in LO Writer 7.1 You will note the red squiggly line where the automatic spelling check fails. Note in particular the cases of only a single character or two characters getting the squiggly line rather than the full word.
I should mention that when you open the ODT, you need to enable editing and spelling auto-check, and also type something in for the spell-check to kick in. Otherwise nothing will show up as misspelled. (That's not a bug.)
Have opened bug 140382 about the failure of the spell-checking to accept words with proper HEBREW PUNCTUATION GERESH and HEBREW PUNCTUATION GERSHAIM.
I see this behavior on 6.3.6.2
The situation described in comment 17 persists with:

Version: 7.6.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: ad387d5b984c6666906505d25685065f710ed55d
CPU threads: 4; OS: Linux 6.1; UI render: default; VCL: gtk3
Locale: he-IL (en_IL); UI: en-US

To summarize this more succinctly:

Intra-word character           Break-up?   Example word
-----------------------------------------------------------
APOSTROPHE                     Yes         ג'ירפה
RIGHT SINGLE QUOTATION MARK    No          ג’ירפה
HEBREW PUNCTUATION GERESH      No          ג׳ירפה
DOUBLE QUOTATION MARK          Yes         דו"ח
RIGHT DOUBLE QUOTATION MARK    Yes         דו”ח
HEBREW PUNCTUATION GERSHAIM    No          דו״ח

All the "Yes" entries are buggy behavior - there should be no break-up of the word into two parts.

Spelling failure despite non-breakup:
* HEBREW PUNCTUATION GERESH
* HEBREW PUNCTUATION GERSHAIM
Jonathan Clark committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/core/commit/174aa6e980f973cea9b1c402d03bd6dba951f5ae tdf#46950 Allow intra-word right double quotation mark It will be available in 25.2.0. The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
The above patch adds right double quotation marks as an alternative for gershaim. The rest of the cases were already handled correctly, but I added more regression tests to ensure these changes aren't accidentally reverted in the future. With this change, spell checking will still break Hebrew words at geresh, gershaim, and right double quotation marks. Support for these characters needs to be added to the Hebrew dictionary data. This is tracked by bug 140382, mentioned above.
Jonathan Clark committed a patch related to this issue. It has been pushed to "libreoffice-24-8": https://git.libreoffice.org/core/commit/9c9a7fa814c276dcd6ba1c18023d17c3e5a0745b tdf#46950 Allow intra-word right double quotation mark It will be available in 24.8.0.2. The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
(In reply to Jonathan Clark from comment #27)
> The above patch adds right double quotation marks as an alternative for
> gershaim. The rest of the cases were already handled correctly

Were they though? The correct handling is a matter of level-of-strictness. If I were strict, I could say that only HEBREW PUNCTUATION GERESH and HEBREW PUNCTUATION GERSHAIM can keep a word together, and with other characters it's two words and everything is a pig's breakfast anyways - indicating that we don't like their use by also redlining the spelling. But if I were lax - all six characters would keep a word together.

The behavior with a recent nightly:

Version: 25.2.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: c8371b5f1a84191d38185820915f0d93741df1fe
CPU threads: 4; OS: Linux 6.6; UI render: default; VCL: gtk3
Locale: en-US (en_IL); UI: en-US
Calc: threaded

is:

Word     Character                      Broken?   Fails spelling?
------------------------------------------------------------------------
ג'ירפה   APOSTROPHE                     Yes       N/A
ג’ירפה   RIGHT SINGLE QUOTATION MARK    No [1]    No
ג’ירפה   HEBREW PUNCTUATION GERESH      No        Yes
דו"ח     DOUBLE QUOTATION MARK          Yes       N/A
דו”ח     RIGHT DOUBLE QUOTATION MARK    No [1]    No
דו״ח     HEBREW PUNCTUATION GERSHAIM    No        Yes

[1] - deduced from the spelling success.

> With this change, spell checking will still break Hebrew words at geresh,
> gershaim, and right double quotation marks.

Doesn't seem like that's what's happening.
(In reply to Eyal Rozenberg from comment #29)
> Word     Character                      Broken?   Fails spelling?
> ------------------------------------------------------------------------
> ג'ירפה   APOSTROPHE                     Yes       N/A
> ג’ירפה   RIGHT SINGLE QUOTATION MARK    No [1]    No
> ג’ירפה   HEBREW PUNCTUATION GERESH      No        Yes
> דו"ח     DOUBLE QUOTATION MARK          Yes       N/A
> דו”ח     RIGHT DOUBLE QUOTATION MARK    No [1]    No
> דו״ח     HEBREW PUNCTUATION GERSHAIM    No        Yes

I re-checked this by copying and pasting the table into a local build of Writer. Writer doesn't break any of the Hebrew words. This can be verified by double-click selecting the words, or advancing the cursor by words (ctrl+arrow keys). The corresponding unit tests are also still present and passing.

Version: 25.2.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: d6b6419b7b937aea4639b7f4f81b7f24cdccc6e0
CPU threads: 32; OS: Linux 6.8; UI render: default; VCL: gtk3
Locale: en-US (en_US.UTF-8); UI: en-US
Calc: threaded
(In reply to Jonathan Clark from comment #30)
> I re-checked this by copying and pasting the table into a local build of
> Writer.

Ah, so, I might have messed up GERESH in the table. Here it is again:

ג'ירפה   APOSTROPHE                     Yes       N/A
ג’ירפה   RIGHT SINGLE QUOTATION MARK    No [1]    No
ג׳ירפה   HEBREW PUNCTUATION GERESH      No        Yes
דו"ח     DOUBLE QUOTATION MARK          Yes       N/A
דו”ח     RIGHT DOUBLE QUOTATION MARK    No [1]    No
דו״ח     HEBREW PUNCTUATION GERSHAIM    No        Yes

> Writer doesn't break any of the Hebrew words. This can be verified
> by double-click selecting the words, or advancing the cursor by words
> (ctrl+arrow keys). The corresponding unit tests are also still present and
> passing.

While the double-click behavior and ctrl+arrow behavior agree with not breaking the word, the spelling behavior does not. Will attach a screenshot.
(In reply to Jonathan Clark from comment #30)

Ah, so, first I think I have messed up GERESH and maybe the GERSHAIM in the table. Here it is again:

ג'ירפה   APOSTROPHE                     Yes       N/A
ג’ירפה   RIGHT SINGLE QUOTATION MARK    No [1]    No
ג׳ירפה   HEBREW PUNCTUATION GERESH      No        Yes
דו"ח     DOUBLE QUOTATION MARK          Yes       N/A
דו”ח     RIGHT DOUBLE QUOTATION MARK    No [1]    No
דו״ח     HEBREW PUNCTUATION GERSHAIM    No        Yes

> Writer doesn't break any of the Hebrew words. This can be verified
> by double-click selecting the words, or advancing the cursor by words
> (ctrl+arrow keys). The corresponding unit tests are also still present and
> passing.

Ok, yes, Writer doesn't break any of the words with those actions. But it does break the words with apostrophe and double quotation mark when sending them over for spell-checking, so that we get the ג and the ירפה checked separately and the דו and ח.
Created attachment 197207 [details] Spell checking of the table from comment 32

Screenshot showing:
* ג'ירפה and דו"ח broken,
* the other four combinations not broken
* Spell check passes with RIGHT SINGLE QUOTATION MARK, failing with GERESH
* Spell check passes with RIGHT DOUBLE QUOTATION MARK, failing with GERSHAIM

Checked with a nightly from 2024-10-22:

Version: 25.2.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: c8371b5f1a84191d38185820915f0d93741df1fe
CPU threads: 4; OS: Linux 6.6; UI render: default; VCL: gtk3
Locale: en-US (en_IL); UI: en-US
Calc: threaded
(In reply to Eyal Rozenberg from comment #32)
> (In reply to Jonathan Clark from comment #30)
>
> Ah, so, first I think I have messed up GERESH and maybe the GERSHAIM in the
> table. Here it is again:
>
> ג'ירפה   APOSTROPHE                     Yes       N/A
> ג’ירפה   RIGHT SINGLE QUOTATION MARK    No [1]    No
> ג׳ירפה   HEBREW PUNCTUATION GERESH      No        Yes
> דו"ח     DOUBLE QUOTATION MARK          Yes       N/A
> דו”ח     RIGHT DOUBLE QUOTATION MARK    No [1]    No
> דו״ח     HEBREW PUNCTUATION GERSHAIM    No        Yes
>
> > Writer doesn't break any of the Hebrew words. This can be verified
> > by double-click selecting the words, or advancing the cursor by words
> > (ctrl+arrow keys). The corresponding unit tests are also still present and
> > passing.
>
> Ok, yes, Writer doesn't break any of the words with those actions. But it
> does break the words with apostrophe and double quotation mark when sending
> them over for spell-checking, so that we get the ג and the ירפה checked
> separately and the דו and ח.

This is bug 140382. We are sending the words to Hunspell intact. Hunspell performs its own tokenization based on dictionary data. Hebrew dictionary data does not include these characters as word characters, so Hunspell splits them prior to dictionary lookup (see bug 140382 comment 3).
(In reply to Jonathan Clark from comment #34) > This is bug 140382. We are sending the words to Hunspell intact. Hunspell > performs its own tokenization based on dictionary data. But why do we respect its further tokenization? Shouldn't we just mark the entire word as mis-spelled if hunspell rejected it?
(In reply to Eyal Rozenberg from comment #35) > (In reply to Jonathan Clark from comment #34) > > This is bug 140382. We are sending the words to Hunspell intact. Hunspell > > performs its own tokenization based on dictionary data. > > But why do we respect its further tokenization? Shouldn't we just mark the > entire word as mis-spelled if hunspell rejected it? I don't have a prepared example, but for certain highly synthetic languages we want to handle whole words for editing purposes, but the spell checker needs to work at a morpheme level and can report spelling mistakes for parts of words. For polysynthetic languages you could have 50 character words composed of 10 morphemes, and if only one morpheme is spelled incorrectly, it would be annoying to see the entire word redlined.
(In reply to Jonathan Clark from comment #36) > I don't have a prepared example, but for certain highly synthetic languages > we want to handle whole words for editing purposes, but the spell checker > needs to work at a morpheme level and can report spelling mistakes for parts > of words. Ok, I'll buy that.