Bug 131487 - Words whose characters span multiple languages should not undergo spell checking
Status: UNCONFIRMED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Linguistic
Version (earliest affected): 6.4.2.2 release
Hardware: All Linux (All)
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: needsDevAdvice
Depends on:
Blocks: Spell-Checking
Reported: 2020-03-22 22:51 UTC by sergio.callegari
Modified: 2021-10-21 05:02 UTC (History)

See Also:
Crash report or crash signature:


Description sergio.callegari 2020-03-22 22:51:03 UTC
Description:
When writing a document that uses multiple languages, one may encounter pieces of text that look like a single word, but where some of the characters belong to one language and the rest to another.

For instance, in Italian, one may write things like «Questo è un rapporto sull'International Conference on...» (This is a report on the International Conference on...).

Here "sull'" is in Italian and "International" is in English. However, because of the apostrophe, LibO sees "sull'International" as a single word and tries to spellcheck it. Apparently, the spellcheck is performed using the language of the last character (i.e., English in this case). Obviously, it fails, because "sull'International" is neither an English word nor an Italian one.

IMHO, when LibO sees a word with mixed languages, it should treat it as a word whose language is None.

Steps to Reproduce:
See description

Actual Results:
See description

Expected Results:
See description


Reproducible: Always


User Profile Reset: No



Additional Info:
[Information automatically included from LibreOffice]
Locale: en-US
Module: TextDocument
[Information guessed from browser]
OS: Linux (All)
OS is 64bit: yes
Comment 1 Heiko Tietze 2020-06-22 05:49:24 UTC
Are there other examples of separators? In German it's wrong to combine a native word and a foreign word with a hyphen. I mean: can we make the request something like "Use apostrophe as word separator for spellchecking". Input from l10n would be great.
Comment 2 Ming Hua 2020-06-25 06:14:32 UTC
(In reply to Heiko Tietze from comment #1)
> Are there other examples of separators?
Good question, what about "nothing"? :-)

In Chinese we don't use spaces to separate words (we don't even have a clear definition of "words", compared to characters and phrases), so when English words are used in Chinese text, they are usually just written in between Chinese characters, with no separators whatsoever. Of course, if more than one English word is used, the spaces between the English words are kept.

This is not a use case that should be given much consideration, though, as Chinese users typically turn off spellchecking anyway.

> Input from l10n would be great.
I'll send a message to l10n mailing list later if no one beats me to it.
Comment 3 Heiko Tietze 2020-06-25 08:31:40 UTC
The Chinese example might be easier to solve since English characters are written with a different font. Anyway, not really a UX topic. Let's see what devs think.
Comment 4 sophie 2020-06-25 09:09:15 UTC
Does it follow the Italian typographic conventions? For example in French it's not allowed to combine the words and foreign words have to be written in italic.
Comment 5 sergio.callegari 2020-06-25 10:01:58 UTC
@Heiko Tietze Rather than introducing new /letters/ to be used as word separators, I really think that it would be better to introduce a filter that prevents words from undergoing spellchecking, based on inconsistencies in character/font properties. Both "quell'albero" (all Italian) and "quell'internship" (Italian + English) would then be treated as a single word, but the latter would not undergo spellcheck, because "quell'" is marked as Italian and "internship" as English.

In fact, making the apostrophe a separator would mean getting "quell'albero" as the word "quell" followed by the word "albero", but "quell" is certainly not an Italian word (and neither is "quell'" alone).

I think that this would also help other languages, where one may have local and foreign words written with characters running together with no separators at all.

@Sophie Yes, when using foreign words in Italian it is frequent to have combinations that require the apostrophe between them. I am no linguistics expert, but I think that this has at least a couple of reasons: one is that the  apostrophe indicates an elision which is a phonetic phenomenon, so it is a case where orthography follows phonetics; secondly Italian belongs to that set of languages where there is no official body in charge of establishing rules, so the grammar is determined by how the language is itself actually used. Practically, Italian has accepted this kind of language intermixing since the times when it was frequent to find Latin terms in technical or polished talk and is currently very flexible when it comes to the intermixing of English terms.
Comment 6 Mihkel Tõnnov 2020-06-25 14:36:40 UTC Comment hidden (obsolete)
Comment 7 Mihkel Tõnnov 2020-06-25 14:42:45 UTC
(In reply to Mihkel Tõnnov from comment #6)

Ugh, I messed up examples in my first paragraph while moving things around there. It should read like this:

In Estonian, foreign words should be written in italics and if there's a case ending, then it has to be separated by an apostrophe, e.g. "<i>status quo</i>'ni". Case endings by themselves are not valid words, so apostrophe as word separator definitely wouldn't help; and apostrophe is also used for other purposes, like to indicate omission of character(s) from a word (e.g. to imitate everyday speech).

Somewhat similarly, hyphens can be used to separate foreign part and native part in compound words, e.g. "<i>flamenco</i>-tantsija", "tele-<i>show</i>". (Note that hyphen separates complete words, while apostrophe is used with case endings.)

I'm not sure it would be beneficial to completely ignore such words in spellcheck, as any misspellings in them would then pass unflagged (and at least some foreign words are rather prone to misspellings - take the Italian coffee terms that the rest of the world often struggles to write correctly :)

Could the request here maybe be re-purposed and implemented as follows?

1) If a word (as currently detected by LibO) contains characters in multiple languages, check if there is some punctuation mark separating the different language parts.

2a) If there's an apostrophe (' or ’ - but probably not ‘): check which language the apostrophe belongs to, and as far as spellcheck is concerned, separate the "word" at the border of languages, keeping the apostrophe together with the preceding or following characters, as appropriate.

2b) If there's a hyphen: ignore the hyphen and spellcheck the word parts as if separated by space.

2c) If there's no separating character, underline the whole "word" as misspelled.
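
The steps above can be sketched roughly as follows. This is only an illustration, not LibreOffice code: the `Run` model (a token given as consecutive runs of text with a language tag, adjacent runs differing in language), the function name and the language tags are all assumptions made for the example. The language that "owns" an apostrophe is taken to be the language of the run that contains it.

```python
from typing import List, Optional, Tuple

# Hypothetical model: one token as consecutive (text, language tag) runs,
# e.g. [("sull'", "it"), ("International", "en")].
Run = Tuple[str, str]

APOSTROPHES = ("'", "\u2019")  # ' and ’ (step 2a leaves out ‘)
HYPHEN = "-"


def split_for_spellcheck(runs: List[Run]) -> Optional[List[Run]]:
    """Return the pieces to spellcheck, or None to flag the whole token."""
    if not runs:
        return []
    if len({lang for _, lang in runs}) == 1:
        # Single-language token: step 1 does not apply, check it as a whole.
        return [("".join(text for text, _ in runs), runs[0][1])]

    pieces = [[text, lang] for text, lang in runs]  # mutable copies
    for left, right in zip(pieces, pieces[1:]):
        # 2a) apostrophe at the language border: split there and keep the
        # apostrophe with the run (language) it already belongs to.
        if left[0].endswith(APOSTROPHES) or right[0].startswith(APOSTROPHES):
            continue
        # 2b) hyphen at the border: drop it, as if the parts were
        # separated by a space.
        if left[0].endswith(HYPHEN):
            left[0] = left[0][:-1]
        elif right[0].startswith(HYPHEN):
            right[0] = right[0][1:]
        else:
            # 2c) no separator between the languages: flag the whole token.
            return None
    return [(text, lang) for text, lang in pieces if text]
```

With this sketch, "sull'" + "International" is split into an Italian piece "sull'" and an English piece "International", so whether 2a helps depends on whether the Italian dictionary accepts "sull'" on its own.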

@Sergio: you said that "quell'" alone is not an Italian word - how is it currently handled by spellcheck, if used before an Italian word? Would 2a as described above work for Italian?

Also, I'm not sure if 2c would be OK for all languages, though - does anyone have counterexamples for this?
Comment 8 sergio.callegari 2020-06-26 08:48:32 UTC
@Mihkel Tõnnov

> you said that "quell'" alone is not an Italian word - how is it currently handled by spellcheck, if used before an Italian word? Would 2a as described above work for Italian?

In Italian there are a lot of cases where there is an "elision" between two similar sounds. For instance we have the article "lo" that loses the final o when preceding nouns that start with a vowel. For instance, rather than writing "lo ombrello" you write "l’ombrello". Incidentally, this is the same thing that happens with "quello" and "albero" that become "quell’albero" in my previous example. In this latter case "quello" is not an article, but the rule is the same.

To the best of my understanding these cases are treated by considering the two words that come to be pronounced as a single one because of the elision as a single word for spell checking.

Hence in the spell checking dictionary you have "lo" "ombrello" but also "l’ombrello", "quello", "albero", but also "quell'albero". I do not know the details, but I think that this is handled efficiently in the spell checker by combining a base dictionary with an affix file setting some rules to extend the base dictionary. In any case, this saves you from having to introduce the elided forms like "quell" in the dictionary, since these are not correct words by themselves.

This is why I think that it would be incorrect to consider the "’" as a word separator, at least in Italian and why I think that 2a would not be OK.

To me, the simplest thing to do would be keeping the word separator exactly as it is. Then before passing a word to the spell checker, if you have a word where different characters belong to different languages, pretend that the language for the whole word is "none", rather than pretending it is the language of the first character.
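
As a minimal sketch of that idea (not actual LibO code): the token is modelled as a list of (text, language) runs, and the dictionaries are plain Python sets. Both are assumptions made only for this illustration, as are the function names.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical model: one token as consecutive (text, language tag) runs.
Run = Tuple[str, str]


def spellcheck_language(runs: List[Run]) -> Optional[str]:
    """Language to check the token against, or None to skip the check."""
    languages = {lang for _, lang in runs}
    if len(languages) == 1:
        return next(iter(languages))
    # Characters span several languages (or the token is empty):
    # behave as if the whole token had language "none".
    return None


def is_flagged(runs: List[Run], dictionaries: Dict[str, Set[str]]) -> bool:
    """True if the token would be underlined as misspelled."""
    lang = spellcheck_language(runs)
    if lang is None:
        return False  # mixed-language token: never checked, never flagged
    word = "".join(text for text, _ in runs)
    # In this toy model a language without a dictionary flags every word.
    return word not in dictionaries.get(lang, set())
```

Word boundaries stay exactly as they are today; only the language lookup changes. A token such as "dell'Industrial" whose characters are all marked as Italian is still checked against the Italian dictionary and flagged.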
Comment 9 Mihail Balabanov 2021-04-10 19:31:22 UTC
Marking the multi-language words as ‘do not check’ would eliminate any false positives but also conceal typos when they do exist.

In Bulgarian, we use a hyphen when adding a plural and/or definiteness suffix to a foreign word or abbreviation in a different alphabet – like ‘DVD-то’ (the DVD), ‘Oscar-ите’ (the Oscars). Like in the languages mentioned above, the suffixes are not correct words by themselves. Considering that Estonian uses the apostrophe for such suffixes, Italian has apostrophe-separated prefixes, and other languages may have other separators, it makes sense to have a generalized mechanism to configure the allowed affixes for foreign words – together with any separators – per language. For example:

aff file:
FOREIGNWORDSFX S
FOREIGNWORDSFXSEP '-' # Would be ’ for Estonian

dic file:
ите/S # valid only immediately after a word in a different language and separated by ‘-’
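
To make the intent of those (proposed, not existing) directives concrete, here is a small sketch of the check they would enable. Everything in it is hypothetical: the rule table, the function name and the assumption that the separator is stored in the same language run as the suffix. The suffixes are just the two from the examples above.

```python
from typing import Dict, Set, TypedDict


class ForeignSuffixRule(TypedDict):
    separator: str      # what FOREIGNWORDSFXSEP would declare
    suffixes: Set[str]  # dic entries carrying the FOREIGNWORDSFX flag


# Hypothetical per-language rules, keyed by the language of the suffix.
RULES: Dict[str, ForeignSuffixRule] = {
    "bg": {"separator": "-", "suffixes": {"то", "ите"}},
}


def foreign_suffix_ok(stem_lang: str, suffix_run: str, suffix_lang: str) -> bool:
    """Accept a suffix only right after a word in a different language."""
    rule = RULES.get(suffix_lang)
    if rule is None or stem_lang == suffix_lang:
        return False
    sep = rule["separator"]
    if not suffix_run.startswith(sep):
        return False
    return suffix_run[len(sep):] in rule["suffixes"]
```

The foreign stem itself ("DVD", "Oscar") would still be checked against its own language, so typos in either part could be caught.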
Comment 10 sergio.callegari 2021-04-10 21:32:12 UTC
@Mihail Balabanov, can you please expand a little?

It is unclear to me how the proposed change about not spellchecking mixed language words could conceal typos that are now caught, or do you mean that there is a need to add an additional mechanism so that also multi-language words can get spellchecked to catch typos that cannot be caught now?

I have a feeling that the mechanism that you propose requires a big overhaul of the whole spellchecking mechanism and entails a lot of corner cases. For instance, consider the Italian "sull'International Conference". You would need a mechanism to say: "sull" can be considered a valid Italian word only if it is followed by a "’" followed by a non-Italian word whose initial sound is vowel-like. Seems a bit impractical to me.

Possibly, what you need is to differentiate just by language: when you have a multi-language token if one piece is in Italian, I think that it really should not undergo spellchecking that would lead to wrong results in 99% of the cases anyway. Furthermore, an author would already put some greater care on it due to its particular form. Maybe if you have a multi-language token where one piece is in Bulgarian there should be another approach or the current one is already OK.

As a side note, given the ongoing discussion I wonder if the issue could be marked as confirmed.
Comment 11 Jean-Baptiste Faure 2021-08-12 18:07:59 UTC
In that case I would add the pseudo-word to my personal dictionary to silence the spell-checker.

Best regards. JBF
Comment 12 sergio.callegari 2021-08-17 09:02:20 UTC
That may work, but I believe that it is rather sub-optimal to have to add (non existing) words to a dictionary for cases where there is no chance that spellchecking can be properly done.
Comment 13 Jean-Baptiste Faure 2021-08-23 20:30:16 UTC
(In reply to sergio.callegari from comment #12)
> That may work, but I believe that it is rather sub-optimal to have to add
> (non existing) words to a dictionary for cases where there is no chance that
> spellchecking can be properly done.

Perhaps, but that solution works and is easy to use, without changing the code. For your solution, you have to select a part of the "word" and define its language, then select the other part and define a different language.

My solution has another advantage: if you misspell this private word, LO warns you. With your solution, if you write "sull'Internationnal" you will not be warned.

Best regards. JBF
Comment 14 sergio.callegari 2021-08-25 02:16:06 UTC
I beg to disagree, for two reasons:

1. The first one is more of a technical one, and its practical importance is relative. In which dictionary should such a "bilingual" composite word go? Dictionaries in LibO are classified by language, so that a word marked as English will be sought in the English dictionaries, a word marked as Italian in the Italian ones, etc. At least in principle a bilingual composite word does not belong to any single-language dictionary. It is just by chance that currently LibO seeks words whose characters span multiple languages in the dictionary corresponding to the language of the first character (at least, I believe that this is the case). Hence, the choice of the specific custom dictionary where to put the word is non obvious.

2. The second reason has strong practical significance, IMHO. If you have to write something like "dell'Industrial Conference blah blah blah" the first "dell'Industrial" ends up as a bilingual word with an initial part written in Italian and a second part written in English. The reasons why you may want to mark its initial characters as Italian and the last ones as English are twofold. First of all, you want correct hyphenation rules to be applied to either part (I do not even know if this will work in LibO as of today, but looking forward it should). Secondly, it is desirable that if you end up changing the text so that the initial "dell'" goes away and you are left with "Industrial", that bit is correctly interpreted as an English word.

Now, if you take the pain to check the details above, I am pretty sure that you will be careful on this word, so that the spell checking will be more or less superfluous for the rare occurrences where it is encountered. In fact, the only thing you may want is for this word not to be marked as a mistake because false positives are distracting.

The other way round, you certainly do not want to add something to your Italian dictionary so that any misspelled occurrence of "dell'Industriale" (a perfectly correct Italian construction) is passed as correct if you miss the last "e" in Industriale.
Comment 15 Jean-Baptiste Faure 2021-08-25 16:09:49 UTC
A user dictionary can be assigned to all languages instead of to one in particular.
It is very easy to define a user dictionary for multi-language words and assign it to all languages. Following the same idea I have a user dictionary for proper names and another for acronyms, both being assigned to all languages.

Another way is to use the special user dictionary "List of Ignored Words" in which you can add manually whatever word you want : Tools > Options > Language Settings  > Writing Aids. Then select "User Defined Dictionaries" (the last one) and press "Edit" button. This dictionary is assigned to all languages and is used when you click "Ignore All" in the spellcheck dialog, but these changes are not saved when you close LibreOffice. But the changes are saved if you manually add a word to it.

Best regards. JBF
Comment 16 sergio.callegari 2021-08-26 17:39:03 UTC
Again this seems to me like abusing a tool that is made for something else.

IMHO, dictionaries assigned to *all* languages are meant for stuff that is *invariant* across *all* languages, not stuff that is half in a language and half in another one.  For instance, a brand, a town name, a person name is likely to stay the same whatever language you are writing in regardless of the fact that it may not be a proper word in many (or any) of them.

Putting a construction like "dell'Industrial" in an "all languages" dictionary or in a global ignore list would prevent LibO from correctly marking "dell'Industrial" as an error when all the characters are marked for the Italian language (for instance because you meant to write "dell'Industriale"). In some senses it would pollute the set of words recognized as correct in many cases for the sake of avoiding a rare false positive. 

The other way round, the one and only one case where something like "dell'Industrial" should not be marked as an error is when you explicitly mark a part of this construction as Italian and the other part as English.

Avoiding the spellcheck for words whose characters span 2 languages would avoid the false positive when "dell'Industrial" is wanted (because of the explicit action of marking "dell'" as Italian and "Industrial" as English). It would produce a false negative only in the extremely rare case where, while trying to write a form like "dell'Industrial", you do not pay enough attention and write, e.g., "dell'Intustrial" (a mistake), notwithstanding the extra effort you make on this form to mark its characters as partly in one language and partly in another.

Most important it would have *zero* impact on words made of characters using only one language. Conversely, the solution of manually adding forms to dictionaries not only requires an effort in doing so, but changes the way in which spellchecking works for all documents.

There is also another aspect. Spellcheck is not only about right/wrong. It is also about suggestions.  If I add "dell'Industrial" to some dictionary, I suspect I will get this suggestion whenever I write "dell'Intustriale" rather than "dell'Industriale" as words whose characters are all Italian, and I definitely do not want that.
Comment 17 Ross Johnson 2021-10-21 05:02:30 UTC
Have you looked at using the AutoText feature for this?

Advantages over dictionaries:

1) AutoText can be formatted as you require, e.g., italicized.

2) Saves time and effort and typos (through auto-completion)

3) Extends to text of any length and complexity, eg, words, phrases, paragraphs etc.

As with dictionaries:

1) AutoText is sharable

2) Avoids spellchecking, ie, allows setting problem text to language "None".