46950 – Hebrew: Spell-checking breaks Hebrew words at intra-word single and double quotes

Status:	VERIFIED FIXED

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Linguistic
Version: (earliest affected)	Inherited From OOo
Hardware:	All All

Importance:	medium normal
Assignee:	Jonathan Clark

URL:
Whiteboard:	target:25.2.0 target:24.8.0.2
Keywords:

Depends on:
Blocks:	Spell-Checking Hebrew

Reported:	2012-03-04 01:59 UTC by Nadav Har'El
Modified:	2024-10-24 08:23 UTC
CC List:	3 users

See Also:	https://bz.apache.org/ooo/show_bug.cgi?id=51772 https://bz.apache.org/ooo/show_bug.cgi?id=99796 140382
Crash report or crash signature:

Attachments
Document exhibiting the different manifestations of the bug (13.73 KB, application/vnd.oasis.opendocument.text) 2021-02-12 22:08 UTC, Eyal Rozenberg
Test document rendered in LO Writer 7.1 (218.97 KB, image/png) 2021-02-12 22:09 UTC, Eyal Rozenberg
Spell checking of the table from comment 32 (58.08 KB, image/png) 2024-10-23 16:02 UTC, Eyal Rozenberg

Description Nadav Har'El 2012-03-04 01:59:46 UTC

It appears that before LibreOffice passes text to the spell-checker, it breaks the text into separate words. The problem is that it apparently does this using general, language-agnostic rules, while different languages may have different rules about which characters can be part of a word and which characters break words.

My problem is specifically with the Hebrew spell-checking: In Hebrew, the quote characters - ' and ", are used not just for quoting, but have an additional unrelated use as in-word characters:
1. The single-quote is used to mark foreign sounds. E.g., the word ג'ירפה has a single-quote character after the gimmel, which means it should be pronounced "j", not "g".
2. The double-quote is used inside acronyms, to mark them as such. For example מנכ"ל is the acronym for CEO. מנכ"לים is its plural. Both have quotes in the middle of the word - and these words, together with this quote, are in the dictionary.

Because of this, the Hebrew hunspell dictionary includes the following lines in he_IL.aff:

   BREAK 3
   BREAK ^"
   BREAK "$
   BREAK ^'

This means that " only breaks words when it's in the beginning and end (and ' only in the beginning) - these characters in the middle of a word never mean a word break in Hebrew. With this setting, hunspell correctly word-breaks and spell-checks Hebrew text.
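
To illustrate the rule described above, here is a minimal Python sketch of how these three BREAK patterns could behave. It is a simplified model, not Hunspell's actual implementation: the function name strip_break_chars is hypothetical, and it only handles the ^ (start of word) and $ (end of word) anchored patterns quoted above.

```python
# Simplified model of the three BREAK rules quoted from he_IL.aff.
# Assumption: a pattern starting with ^ removes that character from the start
# of a word, and a pattern ending with $ removes it from the end. Characters
# in the middle of the word are never touched.
BREAK_RULES = ['^"', '"$', "^'"]


def strip_break_chars(word, rules=BREAK_RULES):
    """Repeatedly strip leading/trailing break characters; keep inner ones."""
    changed = True
    while changed and word:
        changed = False
        for rule in rules:
            if rule.startswith("^") and word.startswith(rule[1:]):
                word = word[len(rule) - 1:]      # drop the leading character
                changed = True
            elif rule.endswith("$") and word.endswith(rule[:-1]):
                word = word[: -(len(rule) - 1)]  # drop the trailing character
                changed = True
    return word


# A quoted acronym loses only its outer quotes; the inner one survives.
print(strip_break_chars('"מנכ"ל"'))
```

Under this model, an acronym written inside a quotation keeps its in-word double quote, and a word such as ג'ירפה is passed to the dictionary lookup unchanged.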

Unfortunately, LibreOffice doesn't respect these instructions. It appears that it incorrectly breaks up the words before sending them to hunspell. The end result is that all Hebrew words which are acronyms or have foreign sounds in them are incorrectly marked as being errors, which is very annoying.

Comment 1 Urmas 2012-03-05 20:42:44 UTC

On second thought, why do you use ' and " instead of geresh/gershaim?

But the problem is that spellchecker breaks words on them too, even if that is explicitly prohibited.

Comment 2 Nadav Har'El 2012-03-05 22:33:06 UTC

Well, it's just that despite the existence of the separate "geresh" and "gershayim" characters in Unicode, I've never seen anyone actually using them. Everyone I've seen uses the normal ASCII single-quote and double-quote characters respectively, and expect those to look fine in Hebrew fonts - and they do.

The reason people don't use the special Unicode characters is probably that there is usually no convenient method to enter them with the keyboard.

You're right that it should also be checked what happens with these special characters - the spell-checker shouldn't break such words, and it should accept them even though the dictionary contains words with the ASCII quotes/double-quote, not with geresh/gershayim. This should perhaps become a separate bug, if it doesn't work properly.
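
One way to picture the second requirement (accepting geresh/gershayim even though the dictionary stores ASCII quotes) is a normalization step before dictionary lookup. The following Python sketch is purely illustrative: the function name normalize_for_lookup is hypothetical, and nothing in this report says that LibreOffice or Hunspell works this way.

```python
# Hypothetical pre-lookup normalization: map the Hebrew-specific punctuation
# marks to the ASCII characters that the dictionary entries are written with.
TO_ASCII = str.maketrans({
    "\u05F3": "'",   # HEBREW PUNCTUATION GERESH    -> ASCII single quote
    "\u05F4": '"',   # HEBREW PUNCTUATION GERSHAYIM -> ASCII double quote
})


def normalize_for_lookup(word):
    """Return the word as it would be spelled in an ASCII-quote dictionary."""
    return word.translate(TO_ASCII)


# The geresh variant of the word maps to the ASCII-quote spelling.
print(normalize_for_lookup("ג\u05F3ירפה") == "ג'ירפה")
```

With such a mapping, both spellings of a word would resolve to the same dictionary entry, while the text in the document itself stays untouched.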

Comment 3 Lior Kaplan 2012-03-16 05:45:06 UTC

Reported in the past with OO.org at 
https://issues.apache.org/ooo/show_bug.cgi?id=51772
https://issues.apache.org/ooo/show_bug.cgi?id=99796

The first also has patches which might still be relevant.

Comment 4 QA Administrators 2014-10-24 03:18:27 UTC Comment hidden (obsolete)

Comment 5 Amir Adar 2015-01-13 13:53:01 UTC

Though the bug is quite old, it is still present in versions 4.3.5.2 and 4.5.0.0 (master build). To reproduce:

1. Open LibreOffice Writer.
2. Type in a Hebrew acronym, like פלמ"ח.

Even if the language is set to Hebrew, the acronym is underlined in red, indicating a spelling error. The spell-checker also separates the letters before and after the "Gershayim", marking them as two words instead of one.

I tried to reproduce on other programs with spell-checking capabilities, such as Gedit, and it seems the problem is there as well. Perhaps the underlying engine is at fault, and not LibreOffice itself.

I am using Linux Mint 17.1, 32-bit.

Comment 6 Nadav Har'El 2015-01-13 14:34:09 UTC

Indeed, this bug still exists, and is still very much annoying to Hebrew users!

As I explained in detail in the original bug report, I believe this is *not* a problem of the underlying engine (Hunspell, using dictionary data from the hspell project) but rather of LibreOffice's own word-splitting algorithm, which apparently doesn't respect the dictionary's declaration of in-word characters, nor does it support the correct word-split rules for Hebrew (where certain seemingly-punctuation characters may be parts of words).

I'm not familiar with the code involved, but https://www.libreoffice.org/bugzilla/show_bug.cgi?id=62360 points to the place in the libreoffice code which might need to be fixed.

Comment 7 QA Administrators 2016-01-17 20:05:15 UTC Comment hidden (obsolete)

Comment 8 Lior Kaplan 2016-01-18 20:15:23 UTC

Still happens with LibO 5.0.x. 

To test: use the word ג'ירפה and see that the quote makes the spell checker think it's two words.

Comment 9 Nadav Har'El 2016-02-21 08:23:17 UTC

Indeed, this bug still exists, and it is still very annoying. Non-Hebrew speakers might not appreciate the significance of this bug, but a certain percentage of Hebrew words (unfortunately I can't quote a good estimate) simply contain the single-quote or double-quote character. I gave examples above: certain words with foreign-language sounds, and all acronyms.

LibreOffice will mark all these words as wrong, which not only prevents spell-checking such words, it also lowers the user's overall confidence in the spell-checker, because he or she will so often see correctly written words marked in red.

Comment 10 QA Administrators 2017-09-01 11:20:33 UTC Comment hidden (obsolete)

Comment 11 Lior Kaplan 2017-09-01 11:41:40 UTC

Still reproducible.

Version: 5.4.0.3
Build ID: 1:5.4.0-1

Comment 12 Omer Zak 2017-11-15 22:07:00 UTC

Still happens in:

Version: 6.0.0.0.alpha1+
Build ID: 9050854c35c389466923f0224a36572d36cd471a
CPU threads: 8; OS: Linux 4.9; UI render: default; VCL: gtk3; 
Locale: en-US (en_US.utf8); Calc: group

OS: Debian 64bit Stretch (Debian 9.2, with some backported packages)


But with some changes.
1. The word פלמ"ח is still not handled correctly.
2. The word ג'ירפה is now handled correctly. Seems that Writer now converts the single quote into geresh.

Comment 13 QA Administrators 2018-11-16 03:42:08 UTC Comment hidden (obsolete)

Comment 14 Nadav Har'El 2018-11-16 20:37:42 UTC

‎The bug still exists in LibreOffice 6.1.2.1.

As Omer Zak noted above, the bug was *fixed* for the single quote, e.g., ג'ירפה or סח'נין are now correctly recognized as correctly spelled. This is a welcome improvement. However, the bug still exists for double-quotes, e.g., מנכ"לים or פלמ"ח are still split to two words which are spell-checked individually.

Comment 15 Eyal Rozenberg 2018-12-27 15:28:57 UTC

(In reply to Nadav Har'El from comment #14)
> As Omer Zak noted above, the bug was *fixed* for the single quote, e.g.,
> ג'ירפה or סח'נין are now correctly recognized as correctly spelled. 

If you can bisect this fix with daily builds, you can probably figure out who exactly fixed it and where. If you do that, perhaps we'd be able to either:

* Formulate a patch to handle the double-quote case as well; or
* Contact the developer who introduced that patch to ask for their help more specifically.

Comment 16 QA Administrators 2020-12-27 03:37:42 UTC Comment hidden (obsolete)

Comment 17 Eyal Rozenberg 2021-02-12 22:06:57 UTC

So, Nadav has not replied to my last comment, so let me summarize the state of affairs, in LO 7.1:

* There are (at least) three ways to signify a Geresh within a word: APOSTROPHE (U+0027), RIGHT SINGLE QUOTATION MARK (U+2019), and HEBREW PUNCTUATION GERESH (U+05F3).
* Similarly, there are (at least) three ways to signify a Gershaim within a word: DOUBLE QUOTATION MARK (U+0022), RIGHT DOUBLE QUOTATION MARK (U+201D), and HEBREW PUNCTUATION GERSHAIM (U+05F4).
* LibreOffice Writer _is_ breaking up words at APOSTROPHE or DOUBLE QUOTATION MARK. In an ideal world, these would not be used for Geresh or Gershaim, but since they are commonly used in practice, it is a bug.
* LibreOffice Writer _is_ breaking up words at RIGHT DOUBLE QUOTATION MARK - this is a bug. Due to this bug, the two parts of the word are spell-checked separately.
* LibreOffice Writer is _not_ breaking up words at RIGHT SINGLE QUOTATION MARK, and spell-checking succeeds on them (at least in my anecdotal checking; consider ג’ירפה for example).
* LibreOffice Writer is _not_ breaking up words at HEBREW PUNCTUATION GERESH and HEBREW PUNCTUATION GERSHAIM - but spell-checking still _fails_ on them: ג׳ירפה , דו״ח  This is a different phenomenon from the one Nadav Har'El first identified. It may be worth splitting off into a separate bug.
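
Since the six characters listed above are easy to confuse visually, a short Python snippet using the standard unicodedata module can confirm which code point a given character really is. Note that the formal Unicode names of U+0022 and U+05F4 are QUOTATION MARK and HEBREW PUNCTUATION GERSHAYIM, which differ slightly from the labels used in this thread. The helper name describe is only illustrative.

```python
import unicodedata

# The six candidate intra-word characters, written as escapes so that
# look-alike glyphs cannot be mixed up when copying and pasting.
CANDIDATES = ["\u0027", "\u2019", "\u05F3", "\u0022", "\u201D", "\u05F4"]


def describe(ch):
    """Return the code point and formal Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in CANDIDATES:
    print(describe(ch))
```

Running describe on the character between the letters of a test word is a quick way to check that a test document really contains the variant it claims to contain.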

Version info:
Version: 7.1.0.3 / LibreOffice Community
Build ID: f6099ecf3d29644b5008cc8f48f42f4a40986e4c
CPU threads: 4; OS: Linux 5.9; UI render: default; VCL: gtk3
Locale: he-IL (en_IL); UI: en-US

Comment 18 Eyal Rozenberg 2021-02-12 22:08:24 UTC

Created attachment 169707 [details]
Document exhibiting the different manifestations of the bug

This covers all 3 ways to signify both symbols.

Comment 19 Eyal Rozenberg 2021-02-12 22:09:47 UTC

Created attachment 169708 [details]
Test document rendered in LO Writer 7.1

You will note the red squiggly line where the automatic spelling check fails.

Note in particular the cases of only a single character or two characters getting the squiggly line rather than the full word.

Comment 20 Eyal Rozenberg 2021-02-12 22:24:01 UTC

I should mention that when you open the ODT, you need to enable editing and spelling auto-check, and also type something in for the spell-check to kick in. Otherwise nothing will show up as misspelled. (That's not a bug.)

Comment 22 Eyal Rozenberg 2021-02-12 22:35:46 UTC

Have opened bug 140382 about the failure of the spell-checking to accept words with proper HEBREW PUNCTUATION GERESH and HEBREW PUNCTUATION GERSHAIM.

Comment 23 eladhen2 2021-02-15 15:41:30 UTC

I see this behavior on 6.3.6.2

Comment 24 QA Administrators 2023-02-16 03:25:58 UTC Comment hidden (obsolete)

Comment 25 Eyal Rozenberg 2023-02-16 14:47:06 UTC

The situation described in comment 17 - persists with:

Version: 7.6.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: ad387d5b984c6666906505d25685065f710ed55d
CPU threads: 4; OS: Linux 6.1; UI render: default; VCL: gtk3
Locale: he-IL (en_IL); UI: en-US

To summarize this more succinctly:

intra-word character         break-up?    Example word
-----------------------------------------------------------
APOSTROPHE                   Yes          ג'ירפה
RIGHT SINGLE QUOTATION MARK  No           ג’ירפה
HEBREW PUNCTUATION GERESH    No           ג׳ירפה
DOUBLE QUOTATION MARK        Yes          דו"ח
RIGHT DOUBLE QUOTATION MARK  Yes          דו”ח
HEBREW PUNCTUATION GERSHAIM  No           דו״ח

All the "Yes" entries are buggy behavior - there should be no break-up of the word into two parts.

Spelling failure despite non-breakup:

HEBREW PUNCTUATION GERESH
HEBREW PUNCTUATION GERSHAIM

Comment 26 Commit Notification 2024-07-16 00:18:02 UTC

Jonathan Clark committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/174aa6e980f973cea9b1c402d03bd6dba951f5ae

tdf#46950 Allow intra-word right double quotation mark

It will be available in 25.2.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.

Comment 27 Jonathan Clark 2024-07-16 00:31:00 UTC

The above patch adds right double quotation marks as an alternative for gershaim. The rest of the cases were already handled correctly, but I added more regression tests to ensure these changes aren't accidentally reverted in the future.

With this change, spell checking will still break Hebrew words at geresh, gershaim, and right double quotation marks. Support for these characters needs to be added to the Hebrew dictionary data. This is tracked by bug 140382, mentioned above.

Comment 28 Commit Notification 2024-07-17 03:06:30 UTC

Jonathan Clark committed a patch related to this issue.
It has been pushed to "libreoffice-24-8":

https://git.libreoffice.org/core/commit/9c9a7fa814c276dcd6ba1c18023d17c3e5a0745b

tdf#46950 Allow intra-word right double quotation mark

It will be available in 24.8.0.2.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.

Comment 29 Eyal Rozenberg 2024-10-22 22:07:25 UTC

(In reply to Jonathan Clark from comment #27)
> The above patch adds right double quotation marks as an alternative for
> gershaim. The rest of the cases were already handled correctly

Were they though?

The correct handling is a matter of level of strictness. If I were strict, I could say that only HEBREW PUNCTUATION GERESH and HEBREW PUNCTUATION GERSHAIM can keep a word together, and that with the other characters it's two words and everything is a pig's breakfast anyway - indicating that we don't like their use by also redlining the spelling. But if I were lax, all six characters would keep a word together.

The behavior with a recent nightly:

Version: 25.2.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: c8371b5f1a84191d38185820915f0d93741df1fe
CPU threads: 4; OS: Linux 6.6; UI render: default; VCL: gtk3
Locale: en-US (en_IL); UI: en-US
Calc: threaded

is:

Word    Character                    Broken?       Fails spelling?
------------------------------------------------------------------------
ג'ירפה  APOSTROPHE                   Yes           N/A
ג’ירפה  RIGHT SINGLE QUOTATION MARK  No [1]        No
ג’ירפה  HEBREW PUNCTUATION GERESH    No            Yes
דו"ח    DOUBLE QUOTATION MARK        Yes           N/A
דו”ח    RIGHT DOUBLE QUOTATION MARK  No [1]        No
דו״ח    HEBREW PUNCTUATION GERSHAIM  No            Yes

[1] - deduced from the spelling success.


> With this change, spell checking will still break Hebrew words at geresh,
> gershaim, and right double quotation marks.

Doesn't seem like that's what's happening.

Comment 30 Jonathan Clark 2024-10-23 14:24:33 UTC

(In reply to Eyal Rozenberg from comment #29)
> Word    Character                    Broken?       Fails spelling?
> ------------------------------------------------------------------------
> ג'ירפה  APOSTROPHE                   Yes           N/A
> ג’ירפה  RIGHT SINGLE QUOTATION MARK  No [1]        No
> ג’ירפה  HEBREW PUNCTUATION GERESH    No            Yes
> דו"ח    DOUBLE QUOTATION MARK        Yes           N/A
> דו”ח    RIGHT DOUBLE QUOTATION MARK  No [1]        No
> דו״ח    HEBREW PUNCTUATION GERSHAIM  No            Yes

I re-checked this by copying and pasting the table into a local build of Writer. Writer doesn't break any of the Hebrew words. This can be verified by double-click selecting the words, or advancing the cursor by words (ctrl+arrow keys). The corresponding unit tests are also still present and passing.

Version: 25.2.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: d6b6419b7b937aea4639b7f4f81b7f24cdccc6e0
CPU threads: 32; OS: Linux 6.8; UI render: default; VCL: gtk3
Locale: en-US (en_US.UTF-8); UI: en-US
Calc: threaded

Comment 31 Eyal Rozenberg 2024-10-23 15:54:02 UTC Comment hidden (obsolete)

(In reply to Jonathan Clark from comment #30)
> I re-checked this by copying and pasting the table into a local build of
> Writer.

Ah, so, I might have messed up GERESH in the table. Here it is again:

ג'ירפה  APOSTROPHE                   Yes           N/A
ג’ירפה  RIGHT SINGLE QUOTATION MARK  No [1]        No
ג׳ירפה  HEBREW PUNCTUATION GERESH    No            Yes
דו"ח    DOUBLE QUOTATION MARK        Yes           N/A
דו”ח    RIGHT DOUBLE QUOTATION MARK  No [1]        No
דו״ח    HEBREW PUNCTUATION GERSHAIM  No            Yes


> Writer doesn't break any of the Hebrew words. This can be verified
> by double-click selecting the words, or advancing the cursor by words
> (ctrl+arrow keys). The corresponding unit tests are also still present and
> passing.

While the double-click behavior and the ctrl+arrow behavior agree with not breaking the word, the spelling behavior does not. Will attach a screenshot.

Comment 32 Eyal Rozenberg 2024-10-23 16:00:08 UTC

(In reply to Jonathan Clark from comment #30)

Ah, so, first I think I have messed up GERESH and maybe the GERSHAIM in the table. Here it is again:

ג'ירפה  APOSTROPHE                   Yes           N/A
ג’ירפה  RIGHT SINGLE QUOTATION MARK  No [1]        No
ג׳ירפה  HEBREW PUNCTUATION GERESH    No            Yes
דו"ח    DOUBLE QUOTATION MARK        Yes           N/A
דו”ח    RIGHT DOUBLE QUOTATION MARK  No [1]        No
דו״ח    HEBREW PUNCTUATION GERSHAIM  No            Yes

> Writer doesn't break any of the Hebrew words. This can be verified
> by double-click selecting the words, or advancing the cursor by words
> (ctrl+arrow keys). The corresponding unit tests are also still present and
> passing.

Ok, yes, Writer doesn't break any of the words with those actions. But it does break the words with apostrophe and double quotation mark when sending them over for spell-checking, so that we get the ג and the ירפה checked separately and the דו and ח.

Comment 33 Eyal Rozenberg 2024-10-23 16:02:06 UTC

Created attachment 197207 [details]
Spell checking of the table from comment 32

Screenshot showing: 

* ג'ירפה and דו"ח broken, 
* the other four combinations not broken
* Spell check passes with RIGHT SINGLE QUOTATION MARK, failing with GERESH
* Spell check passes with RIGHT DOUBLE QUOTATION MARK, failing with GERSHAIM

Checked with a nightly from 2024-10-22:

Version: 25.2.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: c8371b5f1a84191d38185820915f0d93741df1fe
CPU threads: 4; OS: Linux 6.6; UI render: default; VCL: gtk3
Locale: en-US (en_IL); UI: en-US
Calc: threaded

Comment 34 Jonathan Clark 2024-10-23 17:37:14 UTC

(In reply to Eyal Rozenberg from comment #32)
> (In reply to Jonathan Clark from comment #30)
> 
> Ah, so, first I think I have messed up GERESH and maybe the GERSHAIM in the
> table. Here it is again:
> 
> ג'ירפה  APOSTROPHE                   Yes           N/A
> ג’ירפה  RIGHT SINGLE QUOTATION MARK  No [1]        No
> ג׳ירפה  HEBREW PUNCTUATION GERESH    No            Yes
> דו"ח    DOUBLE QUOTATION MARK        Yes           N/A
> דו”ח    RIGHT DOUBLE QUOTATION MARK  No [1]        No
> דו״ח    HEBREW PUNCTUATION GERSHAIM  No            Yes
> 
> > Writer doesn't break any of the Hebrew words. This can be verified
> > by double-click selecting the words, or advancing the cursor by words
> > (ctrl+arrow keys). The corresponding unit tests are also still present and
> > passing.
> 
> Ok, yes, Writer doesn't break any of the words with those actions. But it
> does break the words with apostrophe and double quotation mark when sending
> them over for spell-checking, so that we get the ג and the ירפה checked
> separately and the דו and ח.

This is bug 140382. We are sending the words to Hunspell intact. Hunspell performs its own tokenization based on dictionary data. Hebrew dictionary data does not include these characters as word characters, so Hunspell splits them prior to dictionary lookup (see bug 140382 comment 3).
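
The effect described here can be sketched with a toy tokenizer in Python. This is only a simplified model under stated assumptions, not Hunspell's real tokenization: the function tokenize and its extra_word_chars parameter are hypothetical, and "word characters" are reduced to the basic Hebrew letters plus whatever extra characters the caller supplies.

```python
import re

# Assumption: only the basic Hebrew letters alef..tav count as word characters
# unless the (hypothetical) dictionary data adds more.
HEBREW_LETTERS = "\u05D0-\u05EA"


def tokenize(text, extra_word_chars=""):
    """Split text into runs of Hebrew letters plus any extra word characters."""
    pattern = f"[{HEBREW_LETTERS}{re.escape(extra_word_chars)}]+"
    return re.findall(pattern, text)


word = "דו\u05F4ח"  # written with HEBREW PUNCTUATION GERSHAYIM

# Without the mark declared as a word character, the word falls apart
# into two tokens, each of which is looked up on its own.
print(tokenize(word))

# With the mark declared as a word character, the word stays whole.
print(tokenize(word, extra_word_chars="\u05F4"))
```

In this model the application hands over one intact word either way; whether it is looked up as one token or two depends only on the set of word characters supplied with the dictionary.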

Comment 35 Eyal Rozenberg 2024-10-23 20:11:46 UTC

(In reply to Jonathan Clark from comment #34)
> This is bug 140382. We are sending the words to Hunspell intact. Hunspell
> performs its own tokenization based on dictionary data.

But why do we respect its further tokenization? Shouldn't we just mark the entire word as mis-spelled if hunspell rejected it?

Comment 36 Jonathan Clark 2024-10-23 20:45:17 UTC

(In reply to Eyal Rozenberg from comment #35)
> (In reply to Jonathan Clark from comment #34)
> > This is bug 140382. We are sending the words to Hunspell intact. Hunspell
> > performs its own tokenization based on dictionary data.
> 
> But why do we respect its further tokenization? Shouldn't we just mark the
> entire word as mis-spelled if hunspell rejected it?

I don't have a prepared example, but for certain highly synthetic languages we want to handle whole words for editing purposes, but the spell checker needs to work at a morpheme level and can report spelling mistakes for parts of words. For polysynthetic languages you could have 50 character words composed of 10 morphemes, and if only one morpheme is spelled incorrectly, it would be annoying to see the entire word redlined.
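
As a rough illustration of that idea (not taken from LibreOffice's code), the Python sketch below marks only the failing parts of a word. Everything in it is invented for the example: the function misspelled_spans, the pre-split parts, and the toy English "morphemes" standing in for a real polysynthetic language.

```python
def misspelled_spans(parts, is_correct):
    """Return (start, end) offsets, within the whole word, of failing parts.

    parts: substrings whose concatenation is the whole word.
    is_correct: callable that says whether a single part is spelled correctly.
    """
    spans = []
    offset = 0
    for part in parts:
        end = offset + len(part)
        if not is_correct(part):
            spans.append((offset, end))  # redline just this part
        offset = end
    return spans


# Toy data: a three-part word in which only the middle part is misspelled.
KNOWN_PARTS = {"un", "believ", "able"}
spans = misspelled_spans(["un", "beleiv", "able"], KNOWN_PARTS.__contains__)
print(spans)  # [(2, 8)]
```

Only the offsets of the bad part are returned, so an editor could underline characters 2 to 8 while still treating the whole string as a single word for selection and cursor movement.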

Comment 37 Eyal Rozenberg 2024-10-24 08:23:58 UTC

(In reply to Jonathan Clark from comment #36)
> I don't have a prepared example, but for certain highly synthetic languages
> we want to handle whole words for editing purposes, but the spell checker
> needs to work at a morpheme level and can report spelling mistakes for parts
> of words. 

Ok, I'll buy that.