Bug 131487 - Words whose characters span multiple languages are spellchecked as a whole using the first language
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Linguistic
Version: Inherited From OOo (earliest affected)
Hardware: All All
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: needsDevAdvice
Duplicates: 152108
Depends on:
Blocks: Spell-Checking
Reported: 2020-03-22 22:51 UTC by Callegar
Modified: 2022-11-22 08:23 UTC (History)

See Also:
Crash report or crash signature:


Description Callegar 2020-03-22 22:51:03 UTC
Description:
When writing a document using multiple languages, it may happen to find pieces of text that look like a single word, but where a part of the chars belong to a language and the other part to another language.

For instance, in Italian, it may happen to write things like «Questo è un rapporto sull'International Conference on...» (This is a report on the International Conference on...).

Here "sull'" is in Italian and "International" is in English. However, because of the apostrophe, LibO sees "sull'International" as a single word and tries to spellcheck it. Apparently, the spellchecking is practiced based on the language of the last character (i.e., English in this case). Obviously, it fails, because "sull'International" is neither an English word nor an Italian one.

IMHO, when LibO sees a word with mixed language, it should consider it as a word whose language is None.

Steps to Reproduce:
See description

Actual Results:
See description

Expected Results:
See description


Reproducible: Always


User Profile Reset: No



Additional Info:
[Information automatically included from LibreOffice]
Locale: en-US
Module: TextDocument
[Information guessed from browser]
OS: Linux (All)
OS is 64bit: yes
Comment 1 Heiko Tietze 2020-06-22 05:49:24 UTC
Are there other examples of separators? In German it's wrong to combine a native word and a foreign word with a hyphen. I mean: can we make the request something like "Use apostrophe as word separator for spellchecking". Input from l10n would be great.
Comment 2 Ming Hua 2020-06-25 06:14:32 UTC
(In reply to Heiko Tietze from comment #1)
> Are there other examples of separators?
Good question, what about "nothing"? :-)

In Chinese we don't use spaces to separate words (we don't even have a clear definition of "words", as opposed to characters and phrases), so when English words are used in Chinese text, they are usually just written in between Chinese characters, with no separators whatsoever. Of course, if more than one English word is used, the spaces between the English words are kept.

This is not a use case that should be given much consideration, though, as Chinese users typically turn off spellchecking anyway.

> Input from l10n would be great.
I'll send a message to l10n mailing list later if no one beats me to it.
Comment 3 Heiko Tietze 2020-06-25 08:31:40 UTC
The Chinese example might be easier to solve since English characters are written with a different font. Anyway, not really a UX topic. Let's see what devs think.
Comment 4 sophie 2020-06-25 09:09:15 UTC
Does it follow the Italian typographic conventions? For example in French it's not allowed to combine the words and foreign words have to be written in italic.
Comment 5 Callegar 2020-06-25 10:01:58 UTC
@Heiko Tietze Rather than introducing new /letters/ to be used as word separators, I really think that it would be better to introduce a filter that prevents words from undergoing spellchecking, based on inconsistencies in character/font properties. This would make both "quell'albero" (all Italian) and "quell'internship" (Italian + English) be treated as a single word, but the latter would not undergo spellcheck, because "quell'" is marked as Italian and "internship" as English.

In fact, making the apostrophe a separator would mean getting "quell'albero" as the word "quell" followed by the word "albero", but "quell" is certainly not an Italian word (and neither is "quell'" alone).

I think that this would also help other languages, where one may have local and foreign words written with characters running together with no separators at all.

@Sophie Yes, when using foreign words in Italian it is frequent to have combinations that require the apostrophe between them. I am no linguistics expert, but I think that this has at least a couple of reasons: one is that the  apostrophe indicates an elision which is a phonetic phenomenon, so it is a case where orthography follows phonetics; secondly Italian belongs to that set of languages where there is no official body in charge of establishing rules, so the grammar is determined by how the language is itself actually used. Practically, Italian has accepted this kind of language intermixing since the times when it was frequent to find Latin terms in technical or polished talk and is currently very flexible when it comes to the intermixing of English terms.
Comment 6 Mihkel Tõnnov 2020-06-25 14:36:40 UTC Comment hidden (obsolete)
Comment 7 Mihkel Tõnnov 2020-06-25 14:42:45 UTC
(In reply to Mihkel Tõnnov from comment #6)

Ugh, I messed up examples in my first paragraph while moving things around there. It should read like this:

In Estonian, foreign words should be written in italics and if there's a case ending, then it has to be separated by an apostrophe, e.g. "<i>status quo</i>'ni". Case endings by themselves are not valid words, so apostrophe as word separator definitely wouldn't help; and apostrophe is also used for other purposes, like to indicate omission of character(s) from a word (e.g. to imitate everyday speech).

Somewhat similarly, hyphens can be used to separate foreign part and native part in compound words, e.g. "<i>flamenco</i>-tantsija", "tele-<i>show</i>". (Note that hyphen separates complete words, while apostrophe is used with case endings.)

I'm not sure it would be beneficial to completely ignore such words in spellcheck, as any misspellings in them would then pass unflagged (and at least some foreign words are rather prone to misspellings - take the Italian coffee terms that the rest of the world often struggles to write correctly :)

Could the request here maybe be re-purposed and implemented as follows?

1) If a word (as currently detected by LibO) contains characters in multiple languages, check if there is some punctuation mark separating the different language parts.

2a) If there's an apostrophe (' or ’ - but probably not ‘): check which language the apostrophe belongs to, and as far as spellcheck is concerned, separate the "word" at the border of languages, keeping the apostrophe together with the preceding or following characters, as appropriate.

2b) If there's a hyphen: ignore the hyphen and spellcheck the word parts as if separated by space.

2c) If there's no separating character, underline the whole "word" as misspelled.
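
Steps 1 to 2c above can be sketched roughly as follows. This is only an illustration, not LibreOffice code: the representation of a word as (text, language) runs, the function names, and the assumption that the separator always carries the language of one of its neighbouring parts are all hypothetical.

```python
APOSTROPHES = {"'", "\u2019"}  # ' and ’ (step 2a leaves out ‘)
HYPHENS = {"-"}


def merge_runs(runs):
    """Join adjacent (text, lang) runs that share a language."""
    merged = []
    for text, lang in runs:
        if merged and merged[-1][1] == lang:
            merged[-1][0] += text
        else:
            merged.append([text, lang])
    return merged


def split_mixed_word(runs):
    """Return the (text, lang) pieces to spellcheck separately,
    or None when the whole word should be underlined (step 2c)."""
    pieces = merge_runs(runs)
    # Step 1: look at every border between two languages.
    for left, right in zip(pieces, pieces[1:]):
        last, first = left[0][-1:], right[0][:1]
        if last in APOSTROPHES or first in APOSTROPHES:
            # Step 2a: the apostrophe stays with the part whose
            # language it already has, so nothing needs to move.
            continue
        if last in HYPHENS:
            left[0] = left[0][:-1]      # step 2b: drop the hyphen
        elif first in HYPHENS:
            right[0] = right[0][1:]     # step 2b: drop the hyphen
        else:
            return None                 # step 2c: no separator
    return [(text, lang) for text, lang in pieces if text]
```

With this sketch, [("sull'", "it"), ("International", "en")] would be handed to the spellchecker as the two pieces "sull'" (Italian) and "International" (English), which is exactly where the question about "quell'" below comes in.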

@Sergio: you said that "quell'" alone is not an Italian word - how is it currently handled by spellcheck, if used before an Italian word? Would 2a as described above work for Italian?

Also, I'm not sure if 2c would be OK for all languages, though - does anyone have counterexamples for this?
Comment 8 Callegar 2020-06-26 08:48:32 UTC
@Mihkel Tõnnov

> you said that "quell'" alone is not an Italian word - how is it currently handled by spellcheck, if used before an Italian word? Would 2a as described above work for Italian?

In Italian there are a lot of cases where there is an "elision" between two similar sounds. For instance we have the article "lo" that loses the final o when preceding nouns that start with a vowel. For instance, rather than writing "lo ombrello" you write "l’ombrello". Incidentally, this is the same thing that happens with "quello" and "albero" that become "quell’albero" in my previous example. In this latter case "quello" is not an article, but the rule is the same.

To the best of my understanding these cases are treated by considering the two words that come to be pronounced as a single one because of the elision as a single word for spell checking.

Hence in the spell checking dictionary you have "lo" "ombrello" but also "l’ombrello", "quello", "albero", but also "quell'albero". I do not know the details, but I think that this is handled efficiently in the spell checker by combining a base dictionary with an affix file setting some rules to extend the base dictionary. In any case, this saves you from having to introduce the elided forms like "quell" in the dictionary, since these are not correct words by themselves.

This is why I think that it would be incorrect to consider the "’" as a word separator, at least in Italian and why I think that 2a would not be OK.

To me, the simplest thing to do would be keeping the word separator exactly as it is. Then before passing a word to the spell checker, if you have a word where different characters belong to different languages, pretend that the language for the whole word is "none", rather than pretending it is the language of the first character.
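
A minimal sketch of this rule, assuming each character of a word carries a language tag and that dictionaries are plain sets keyed by language (both are simplifications made up for illustration, not how LibreOffice stores them):

```python
def effective_language(char_langs):
    """Language to use for a whole word, given one tag per character.

    Returns None ("no language") as soon as the tags disagree.
    """
    distinct = set(char_langs)
    return distinct.pop() if len(distinct) == 1 else None


def should_flag(word, char_langs, dictionaries):
    """True when the word would be underlined as misspelled."""
    lang = effective_language(char_langs)
    if lang is None:
        return False  # mixed-language word: not spellchecked at all
    return word not in dictionaries.get(lang, set())
```

Word detection is left untouched here; only the choice of language changes, so single-language words such as "quell'albero" are checked exactly as before.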
Comment 9 Mihail Balabanov 2021-04-10 19:31:22 UTC
Marking the multi-language words as ‘do not check’ would eliminate any false positives but also conceal typos when they do exist.

In Bulgarian, we use a hyphen when adding a plural and/or definiteness suffix to a foreign word or abbreviation in a different alphabet – like ‘DVD-то’ (the DVD), ‘Oscar-ите’ (the Oscars). Like in the languages mentioned above, the suffixes are not correct words by themselves. Considering that Estonian uses the apostrophe for such suffixes, Italian has apostrophe-separated prefixes, and other languages may have other separators, it makes sense to have a generalized mechanism to configure the allowed affixes for foreign words – together with any separators – per language. For example:

aff file:
FOREIGNWORDSFX S
FOREIGNWORDSFXSEP '-' # Would be ’ for Estonian

dic file:
ите/S # valid only immediately after a word in a different language and separated by ‘-’
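
To make the intent of these proposed directives concrete, here is a rough sketch of the check they would enable. FOREIGNWORDSFX and FOREIGNWORDSFXSEP are only proposed in this comment and are not existing Hunspell options; the constants and the function below are hypothetical stand-ins for them.

```python
FOREIGN_SEPARATOR = "-"           # stands in for the proposed FOREIGNWORDSFXSEP
FOREIGN_SUFFIXES = {"ите", "то"}  # stands in for dic entries flagged /S


def check_foreign_compound(foreign, separator, suffix, foreign_ok):
    """Accept <foreign word><separator><native suffix>.

    foreign_ok is a callable that spellchecks the foreign part
    against the dictionary of its own language.
    """
    return (
        foreign_ok(foreign)
        and separator == FOREIGN_SEPARATOR
        and suffix in FOREIGN_SUFFIXES
    )
```

Under these assumptions ‘Oscar-ите’ passes while ‘Oscra-ите’ is still flagged, because the foreign part keeps being checked in its own language.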
Comment 10 Callegar 2021-04-10 21:32:12 UTC
@Mihail Balabanov, can you please expand a little?

It is unclear to me how the proposed change about not spellchecking mixed language words could conceal typos that are now caught, or do you mean that there is a need to add an additional mechanism so that also multi-language words can get spellchecked to catch typos that cannot be caught now?

I have a feeling that the mechanism that you propose requires a big overhaul of the whole spellchecking mechanism and entails a lot of corner cases. For instance, consider the Italian "sull'International Conference". You would need a mechanism to say: "quell" can be considered a valid Italian word only if it is followed by a "’" followed by a non Italian word whose initial sound is vowel-like. Seems a bit impractical to me.

Possibly, what you need is to differentiate just by language: when you have a multi-language token if one piece is in Italian, I think that it really should not undergo spellchecking that would lead to wrong results in 99% of the cases anyway. Furthermore, an author would already put some greater care on it due to its particular form. Maybe if you have a multi-language token where one piece is in Bulgarian there should be another approach or the current one is already OK.

As a side note, given the ongoing discussion I wonder if the issue could be marked as confirmed.
Comment 11 Jean-Baptiste Faure 2021-08-12 18:07:59 UTC
In that case I would add the pseudo-word to my personal dictionary to silence the spell-checker.

Best regards. JBF
Comment 12 Callegar 2021-08-17 09:02:20 UTC
That may work, but I believe that it is rather sub-optimal to have to add (non existing) words to a dictionary for cases where there is no chance that spellchecking can be properly done.
Comment 13 Jean-Baptiste Faure 2021-08-23 20:30:16 UTC
(In reply to sergio.callegari from comment #12)
> That may work, but I believe that it is rather sub-optimal to have to add
> (non existing) words to a dictionary for cases where there is no chance that
> spellchecking can be properly done.

Perhaps but that solution works and is easy to use. Without changing the code. For your solution, you have to select a part of the "word" and define its language, then select the other part and define a different language.

My solution has another advantage: if you misspell this private word, LO warns you. With your solution, if you wrote "sull'Internationnal" you will not be warned.

Best regards. JBF
Comment 14 Callegar 2021-08-25 02:16:06 UTC
I beg to disagree, for two reasons:

1. The first one is more of a technical one, and its practical importance is relative. In which dictionary should such a "bilingual" composite word go? Dictionaries in LibO are classified by language, so that a word marked as English will be sought in the English dictionaries, a word marked as Italian in the Italian ones, etc. At least in principle a bilingual composite word does not belong to any single-language dictionary. It is just by chance that currently LibO seeks words whose characters span multiple languages in the dictionary corresponding to the language of the first character (at least, I believe that this is the case). Hence, the choice of the specific custom dictionary where to put the word is non obvious.

2. The second reason has strong practical significance, IMHO. If you have to write something like "dell'Industrial Conference blah blah blah" the first "dell'Industrial" ends up as a bilingual word with an initial part written in Italian and a second part written in English. The reasons why you may want to mark its initial characters as Italian and the last ones as English are twofold. First of all, you want correct hyphenation rules to be applied to either part (I do not even know if this will work in LibO as of today, but looking forward it should). Secondly, it is desirable that if you end up changing the text so that the initial "dell'" goes away and you remain with the "Industrial", that bit is correctly interpreted as an English word.

Now, if you take the pain to check the details above, I am pretty sure that you will be careful on this word, so that the spell checking will be more or less superfluous for the rare occurrences where it is encountered. In fact, the only thing you may want is for this word not to be marked as a mistake because false positives are distracting.

The other way round, you certainly do not want to add something to your Italian dictionary so that any misspelled occurrence of "dell'Industriale" (a perfectly correct Italian construction) is passed as correct if you miss the last "e" in Industriale.
Comment 15 Jean-Baptiste Faure 2021-08-25 16:09:49 UTC
A user dictionary can be assigned to all languages instead of to one in particular.
It is very easy to define a user dictionary for multi-language words and assigned to all languages. Following the same idea I have an user dictionary for proper names and another for acronyms, both being assigned to all languages.

Another way is to use the special user dictionary "List of Ignored Words" in which you can add manually whatever word you want : Tools > Options > Language Settings  > Writing Aids. Then select "User Defined Dictionaries" (the last one) and press "Edit" button. This dictionary is assigned to all languages and is used when you click "Ignore All" in the spellcheck dialog, but these changes are not saved when you close LibreOffice. But the changes are saved if you manually add a word to it.

Best regards. JBF
Comment 16 Callegar 2021-08-26 17:39:03 UTC
Again this seems to me like abusing a tool that is made for something else.

IMHO, dictionaries assigned to *all* languages are meant for stuff that is *invariant* across *all* languages, not stuff that is half in a language and half in another one.  For instance, a brand, a town name, a person name is likely to stay the same whatever language you are writing in regardless of the fact that it may not be a proper word in many (or any) of them.

Putting a construction like "dell'Industrial" in an "all languages" dictionary or in a global ignore list would prevent LibO from correctly marking "dell'Industrial" as an error when all the characters are marked for the Italian language (for instance because you meant to write "dell'Industriale"). In some senses it would pollute the set of words recognized as correct in many cases for the sake of avoiding a rare false positive. 

The other way round, the one and only one case where something like "dell'Industrial" should not be marked as an error is when you explicitly mark a part of this construction as Italian and the other part as English.

Avoiding the spellcheck for words whose characters span 2 languages would avoid the false positive when "dell'Industrial" is wanted (because of the explicit action of marking "dell'" as Italian and "Industrial" as English) and produce a false negative only in the extremely rare case when, trying to write a form like "dell'Industrial", you do not pay enough attention and e.g. write "dell'Intustrial" (a mistake), notwithstanding the extra effort you make on this form to mark its characters as part in one language and part in another.

Most important it would have *zero* impact on words made of characters using only one language. Conversely, the solution of manually adding forms to dictionaries not only requires an effort in doing so, but changes the way in which spellchecking works for all documents.

There is also another aspect. Spellcheck is not only about right/wrong. It is also about suggestions.  If I add "dell'Industrial" to some dictionary, I suspect I will get this suggestion whenever I write "dell'Intustriale" rather than "dell'Industriale" as words whose characters are all Italian, and I definitely do not want that.
Comment 17 Ross Johnson 2021-10-21 05:02:30 UTC
Have you looked at using the AutoText feature for this?

Advantages over dictionaries:

1) AutoText can be formatted as you require, eg, italicized.

2) Saves time and effort and typos (through auto-completion)

3) Extends to text of any length and complexity, eg, words, phrases, paragraphs etc.

As with dictionaries:

1) AutoText is sharable

2) Avoids spellchecking, ie, allows setting problem text to language "None".
Comment 18 Mike Kaganski 2021-12-14 06:09:28 UTC
(In reply to Callegar from comment #0)
> Here "sull'" is in Italian and "International" is in English. However,
> because of the apostrophe, LibO sees "sull'International" as a single word
> and tries to spellcheck it. Apparently, the spellchecking is practiced based
> on the language of the last character (i.e., English in this case).
> Obviously, it fails, because "sull'International" is neither an English word
> nor an Italian one.
> 
> IMHO, when LibO sees a word with mixed language, it should consider it as a
> word whose language is None.

Do I understand it right, that the request is *not* about separators, but about handling of *any* sequence of characters that are detected as "word" for purposes of spell checking, and that contain a mix of characters having *different* language property set? E.g., I might type a single "International", but assign Italian to its "Inter" part, and English to "national". Is it about this general case?

If so, it is a perfectly reasonable request IMO, regardless of possible improvements in word separation procedure.
Comment 19 Mike Kaganski 2021-12-14 06:12:06 UTC
(In reply to Mike Kaganski from comment #18)
> If so, it is a perfectly reasonable request IMO

OTOH, skipping spell check for such cases, which do not explicitly mark words as excluded from spell check, would introduce a danger of unnoticed spelling errors, exactly the thing that spell check should assist in preventing.
Comment 20 Callegar 2021-12-14 17:36:59 UTC
> OTOH, skipping spell check for such cases, which do not explicitly mark words as excluded
> from spell check, would introduce a danger of unnoticed spelling errors, exactly the thing 
> that spell check should assist in preventing.

No, really it wouldn't. We are talking of words that are "mixed-language". Because of how the spell checking is implemented there are obviously no "mixed-language" dictionaries against which such words should be checked. This means that such words end up being checked against dictionaries that are not really made for them and that are likely not to catch errors for them correctly.

In the current implementation, the spellchecker decides in a totally arbitrary way that for mixed-language words the language ending up being used for the spellcheck is the language of the first letter. This means that there are currently two ways of preventing false positive spell-check errors on these words and both are hackish and causing more problem than they solve really introducing a danger of unnoticed spelling errors.

Let me show this to you with an example.

Suppose that I am writing in Italian about something that happened at the "International conference of this and that". So I need to write "nell'International". Here, "Nell'" is in Italian, being a version of "nello" (in the) where the last vowel is elided according to Italian elision rules. "International" is obviously English. For spellcheck, strings of chars with an apostrophe inside need to be considered as a single word.

- If this word is not in the Italian dictionary, current LibO marks it as an error

however

- If you put this word in the Italian dictionary, whenever you need to write "Nell'Internazionale" (all-Italian) and you write "Nell'International" by a spelling mistake the error goes unnoticed.

- If you silence the error by marking "Nell'International" as being in language "None", and then you edit it into "Nel Congresso ..." any error in "Nel" would go unnoticed because Nel will likely remain in language "None".
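
The three cases above can be replayed with a toy checker. This is a sketch only: the lookup function, the tiny Italian word set, and the lower-casing are assumptions made for illustration, and the mixed word is looked up under its first language as described in this report.

```python
def flagged(word, lang, dictionaries):
    """True when the word would get a red underline."""
    if lang is None:  # language "None": never checked
        return False
    return word.lower() not in dictionaries[lang]


dictionaries = {"it": {"nell'internazionale", "nel", "congresso"}}

# Case 1: the mixed word is checked as Italian and flagged.
before = flagged("Nell'International", "it", dictionaries)

# Case 2: after adding it to the Italian dictionary, the very same
# lookup also accepts it in all-Italian text where
# "Nell'Internazionale" was meant, so that mistake is hidden.
dictionaries["it"].add("nell'international")
after_adding = flagged("Nell'International", "it", dictionaries)

# Case 3: text left in language "None" is never flagged, so a typo
# made while editing it into "Nel Congresso" goes unnoticed.
after_none = flagged("Nle", None, dictionaries)
```

Here `before` is True, while `after_adding` and `after_none` are both False.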
Comment 21 Mike Kaganski 2021-12-14 20:28:04 UTC
(In reply to Callegar from comment #20)
> No, really it wouldn't.

No, really it will. :D

> We are talking of words that are "mixed-language".

We are talking about *sequences of characters* that *may* represent some "valid" uses of words, but at the same time, may be simply a user error: a user could type "Inter" using Italian system input language (and the text will get Italian language on Windows), and then switch to English and finish "rnational". The end result will be "Interrnational", with two "r" in the middle. If we disable spell checking on these words, it may go unnoticed (false negative). In current state, it will be marked - and even if such signal will be false-positive, generally false-positives are less harmful than false-negatives (the latter may result in released material being of inappropriate quality; the former are just unnecessarily drawing attention of the author).

And no matter how hard you try to create smart heuristics, you will always have a potential to write two words with a dot, or dash, or slash, or apostrophe, ... without spaces, and your suggestion will disable spell check on them with possible false negatives.

> This means that
> there are currently two ways of preventing false positive spell-check errors
> on these words and both are hackish and causing more problem than they solve
> really introducing a danger of unnoticed spelling errors.

The real way to fix such problem is making what you don't need highlighted as "None" language, explicitly disabling spell check.
Comment 22 Callegar 2021-12-15 08:15:11 UTC
@Mike Kaganski  What I am trying to communicate, with a sadly negative outcome, is that the number of false negatives with the proposed approach should be way less than the number of false negatives without it.

And this is somehow funny, because the reason why this bug got originally opened was precisely *false negatives*, not false positives. The reason why the bug was opened is not the burden of avoiding the red underline on "mixed-language words" (setting the language to "[None]" for the whole word or changing the language of half of it is equally expensive) but the fact that to avoid the red underlining users are currently pushed to adopt "[None]" that typically ends up in a debacle of false negatives.

Once you start writing things like "dell'International" either you live happy with the distracting red underlining indicating a potential misspell or you don't. If you don't, then the problem of false negatives explodes! If you add the word to the "dictionary" corresponding to the initial character, you pollute that dictionary with something that does not belong to it and that will cause false negatives. And if you set the language to "[None]" that is really the worst that you could do. Because "[None]" sticks. As soon as you edit that piece of text (e.g. erasing International and changing it into another piece of text, or even just writing something after the last 'l' of International) that text will continue to be in language "[None]". And because there is no visual indication of language "[None]" (it is not like when you leave on bold, or italics, or a color, that immediately gives you a visual feedback), whole chunks of text will remain without any spellchecking at all.

Really, I don't think that we should encourage users to switch words to language "[None]"  lightly, unless the software is also made to provide a way (a button?) to temporarily highlight the regions of text set to "[None]" and as such deprived of spellcheck, because once you start with "[None]" it is way too easy to end up with whole paragraphs like that through editing.

On the other hand, my proposal is likely to cause less harm. Setting the language to be "half Italian and half English" on a "word" like "dell'International" is semantically plausible (the word is indeed half Italian and half English) and to consider a word like that Italian is quite arbitrary (why not English?). Most important, when you edit, the language selection will remain either Italian or English, depending on where you edit.  The other way round, the possibility that you end up switching language mid-word "by accident" cannot be expected to be very frequent (at least in my experience).

In any case, I accept the fact that being Italian I may be biased towards observing what happens in Italian texts with English words and that there may be other languages with rules that I do not know for which the current behavior is better. So why not consider adding a flag in the "Options->Language Settings->Writing Aids" settings (possibly document-specific) such as "Use writing aids on mixed language words", "On" by default (the current behavior) but switchable to "Off"?
Comment 23 Mike Kaganski 2021-12-15 08:28:21 UTC
(In reply to Callegar from comment #22)
> @Mike Kaganski  What I am trying to communicate, with a sadly negative
> outcome, is that the number of false negatives with the proposed approach
> should be way less than the number of false negatives without it.

I understand that. But you confuse "the number of false negatives for myself" vs "the number of false negatives universally". The overall number of users who may use cases that you describe is way less than the number of users who do *not* intentionally use such things, but may create such combinations accidentally (e.g., fast typers who switch keyboard layout too fast, and start a word starting from letter "c" - which curiously in English layout co-incides with Russian "с" having exactly the same shape, and thus indistinguishable in shape; or even more combinations in Roman languages).

So if 10 million users make such mistakes once every 10 years, and you and other 10 users experience your described problem 10 times a day, the failure rate would be 2740:100. I took very low rate of failures for 10 million users, and very high rate for problem that you describe; but the user number at both sides are arbitrary, since I don't have numbers at hand (but I do know how often such typos happen e.g. for myself - being Russian - and for my colleagues at previous job - that's definitely not once in 10 years).
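
The arithmetic behind the 2740:100 figure can be reproduced as below. The inputs are the arbitrary numbers from the paragraph above; the 365-day year and counting the second group as 10 users (which is what a value of 100 implies) are assumptions.

```python
DAYS_PER_YEAR = 365

# 10 million users, one accidental mixed-language typo each per 10 years.
accidental_per_day = 10_000_000 * (1 / 10) / DAYS_PER_YEAR

# 10 users hitting the intentional mixed-word case 10 times a day.
intentional_per_day = 10 * 10

print(round(accidental_per_day), intentional_per_day)  # prints: 2740 100
```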
Comment 24 Mike Kaganski 2021-12-15 08:54:48 UTC
(In reply to Mike Kaganski from comment #23)

To clarify: I do not oppose having some rules allowing spell checkers to *continue* checking *parts* of such words (e.g., if they have a separator in the middle, as in your example) - which would allow to check your "sull" and "International" individually; I only mean that doing what the title wants ("should not undergo spell checking") is not a proper approach.
Comment 25 Callegar 2021-12-15 12:11:59 UTC
I do not have any statistical data either ;-)

But from what you say, I read this (correct me if I am wrong):

- For the usage case that you expect to be most frequent, the best behavior would not be "spellcheck mixed-language words against the dictionaries corresponding to the first character of the word" (the current behavior), but *always mark as error* words with mixed-language.

- For the usage case that I am often encountering (but, as said, I accept that this can be mostly relevant to Italians since Italian is a language that quite friendly accepts foreign words and then applies Italian rules such as elisions with them) the best behavior would be to *never mark as errors* words with mixed-language.

In neither case, to spellcheck words with mixed language (the current behavior) is sensible. For your case it happens to work, but it is just a waste of CPU. For my case, it is plain nasty.

So, to me, the best possibility would be a per-document option: always mark as error mixed-language words or never mark them as error.
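
Such an option could look roughly like the sketch below, which also keeps the current behaviour as a third setting for comparison. The enum, its names, and the set-based dictionaries are hypothetical and only meant to show how the choices differ.

```python
from enum import Enum


class MixedWordPolicy(Enum):
    CHECK_FIRST_LANGUAGE = "check"  # behaviour described in this report
    ALWAYS_FLAG = "always"          # every mixed-language word is an error
    NEVER_FLAG = "never"            # mixed-language words are never errors


def is_flagged(word, char_langs, dictionaries, policy):
    """True when the word would be underlined under the given policy."""
    if not char_langs:
        return False
    if len(set(char_langs)) > 1:  # characters span several languages
        if policy is MixedWordPolicy.ALWAYS_FLAG:
            return True
        if policy is MixedWordPolicy.NEVER_FLAG:
            return False
    # Single-language word, or mixed word under the current behaviour:
    # look it up in the dictionary of the first character's language.
    return word not in dictionaries.get(char_langs[0], set())
```

Words whose characters all share one language take the last line under every setting, so the option would only affect mixed-language words.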
Comment 26 Stéphane Guillou (stragu) 2022-11-22 08:23:16 UTC
We have a similar issue reported in bug 152108, for German. A hyphenated word is split between two languages, but the spellchecker takes the whole thing as one word and uses the first language in the string.

Bug reporter suggests spellchecking the string parts separately according to the languages used: https://bugs.documentfoundation.org/show_bug.cgi?id=152108#c3

Summary updated to describe the issue.

Also in Windows:

Version: 7.5.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: deb7bc82de19ea8e20c767fdf21f9ba4feb5e9f0
CPU threads: 4; OS: Windows 10.0 Build 19045; UI render: Skia/Raster; VCL: win
Locale: en-GB (en_GB); UI: en-GB
Calc: threaded

Also in OOo:

OpenOffice.org 3.3.0
OOO330m20(Build:9567)
Comment 27 Stéphane Guillou (stragu) 2022-11-22 08:23:42 UTC
*** Bug 152108 has been marked as a duplicate of this bug. ***