Bug 131487 - Words whose characters span multiple languages should not undergo spell checking
Summary: Words whose characters span multiple languages should not undergo spell checking
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Linguistic
(earliest affected) release
Hardware: All Linux (All)
Importance: medium enhancement
Assignee: Not Assigned
Keywords: needsDevAdvice
Depends on:
Blocks: Spell-Checking
Reported: 2020-03-22 22:51 UTC by sergio.callegari
Modified: 2021-04-10 21:32 UTC (History)

See Also:
Crash report or crash signature:


Description sergio.callegari 2020-03-22 22:51:03 UTC
When writing a document in multiple languages, one can end up with pieces of text that look like a single word but in which some of the characters belong to one language and the rest to another.

For instance, in Italian, it may happen to write things like «Questo è un rapporto sull'International Conference on...» (This is a report on the International Conference on...).

Here "sull'" is in Italian and "International" is in English. However, because of the apostrophe, LibO sees "sull'International" as a single word and tries to spellcheck it. Apparently, the spellchecking is performed based on the language of the last character (i.e., English in this case). Obviously, it fails, because "sull'International" is neither an English word nor an Italian one.

IMHO, when LibO sees a word with mixed languages, it should treat it as a word whose language is None.

Steps to Reproduce:
See description

Actual Results:
See description

Expected Results:
See description

Reproducible: Always

User Profile Reset: No

Additional Info:
[Information automatically included from LibreOffice]
Locale: en-US
Module: TextDocument
[Information guessed from browser]
OS: Linux (All)
OS is 64bit: yes
Comment 1 Heiko Tietze 2020-06-22 05:49:24 UTC
Are there other examples of separators? In German it's wrong to combine a native word and a foreign word with a hyphen. I mean: can we turn the request into something like "Use apostrophe as word separator for spellchecking"? Input from l10n would be great.
Comment 2 Ming Hua 2020-06-25 06:14:32 UTC
(In reply to Heiko Tietze from comment #1)
> Are there other examples of separators?
Good question, what about "nothing"? :-)

In Chinese we don't use spaces to separate words (we don't even have a clear definition of "words", compared to characters and phrases), so when English words are used in Chinese text, they are usually just written in between Chinese characters, with no separators whatsoever. Of course, if more than one English word is used, the spaces between the English words are kept.

This is not a use case that should be given much consideration, though, as Chinese users typically turn off spellchecking anyway.

> Input from l10n would be great.
I'll send a message to l10n mailing list later if no one beats me to it.
Comment 3 Heiko Tietze 2020-06-25 08:31:40 UTC
The Chinese example might be easier to solve since English characters are written with a different font. Anyway, not really a UX topic. Let's see what devs think.
Comment 4 sophie 2020-06-25 09:09:15 UTC
Does it follow the Italian typographic conventions? For example, in French it's not allowed to combine the words, and foreign words have to be written in italics.
Comment 5 sergio.callegari 2020-06-25 10:01:58 UTC
@Heiko Tietze Rather than introducing new /letters/ to be used as word separators, I really think that it would be better to introduce a filter that prevents words from undergoing spellchecking, based on inconsistencies in character/font properties. This would make both "quell'albero" (all Italian) and "quell'internship" (Italian + English) be treated as single words, but the latter would not undergo spellchecking, because "quell'" is marked as Italian and "internship" as English.

In fact, making the apostrophe a separator would mean getting "quell'albero" as the word "quell" followed by the word "albero", but "quell" is certainly not an Italian word (and neither is "quell'" alone).

I think that this would also help other languages, where one may have local and foreign words written with characters running together with no separators at all.

@Sophie Yes, when using foreign words in Italian it is frequent to have combinations that require the apostrophe between them. I am no linguistics expert, but I think that this has at least a couple of reasons: one is that the apostrophe indicates an elision, which is a phonetic phenomenon, so it is a case where orthography follows phonetics; the other is that Italian belongs to that set of languages where there is no official body in charge of establishing rules, so the grammar is determined by how the language is actually used. Practically, Italian has accepted this kind of language intermixing since the times when it was frequent to find Latin terms in technical or polished talk, and it is currently very flexible when it comes to the intermixing of English terms.
Comment 6 Mihkel Tõnnov 2020-06-25 14:36:40 UTC Comment hidden (obsolete)
Comment 7 Mihkel Tõnnov 2020-06-25 14:42:45 UTC
(In reply to Mihkel Tõnnov from comment #6)

Ugh, I messed up examples in my first paragraph while moving things around there. It should read like this:

In Estonian, foreign words should be written in italics and if there's a case ending, then it has to be separated by an apostrophe, e.g. "<i>status quo</i>'ni". Case endings by themselves are not valid words, so apostrophe as word separator definitely wouldn't help; and apostrophe is also used for other purposes, like to indicate omission of character(s) from a word (e.g. to imitate everyday speech).

Somewhat similarly, hyphens can be used to separate foreign part and native part in compound words, e.g. "<i>flamenco</i>-tantsija", "tele-<i>show</i>". (Note that hyphen separates complete words, while apostrophe is used with case endings.)

I'm not sure it would be beneficial to completely ignore such words in spellcheck, as any misspellings in them would then pass unflagged (and at least some foreign words are rather prone to misspellings - take the Italian coffee terms that the rest of the world often struggles to write correctly :)

Could the request here maybe be re-purposed and implemented as follows?

1) If a word (as currently detected by LibO) contains characters in multiple languages, check if there is some punctuation mark separating the different language parts.

2a) If there's an apostrophe (' or ’ - but probably not ‘): check which language the apostrophe belongs to, and as far as spellcheck is concerned, separate the "word" at the border of languages, keeping the apostrophe together with the preceding or following characters, as appropriate.

2b) If there's a hyphen: ignore the hyphen and spellcheck the word parts as if separated by space.

2c) If there's no separating character, underline the whole "word" as misspelled.

@Sergio: you said that "quell'" alone is not an Italian word - how is it currently handled by spellcheck, if used before an Italian word? Would 2a as described above work for Italian?

Also, I'm not sure if 2c would be OK for all languages, though - does anyone have counterexamples for this?
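
A minimal sketch of steps 1 to 2c above, to make the proposal concrete. It assumes a "word" arrives as a list of (text, language tag) runs; the names `plan_spellcheck`, `Run`, `APOSTROPHES` and `HYPHEN` are hypothetical and are not LibreOffice APIs. A result of `None` stands for "underline the whole word" (2c).

```python
from typing import List, Optional, Tuple

# Hypothetical representation: a run is a piece of text plus the language
# tag attached to its characters, e.g. ("sull'", "it").
Run = Tuple[str, str]

APOSTROPHES = ("'", "\u2019")  # ' and ’ (deliberately not ‘)
HYPHEN = "-"


def plan_spellcheck(runs: List[Run]) -> Optional[List[Run]]:
    """Return the (text, language) pieces to check, or None for case 2c."""
    # Merge adjacent runs that share a language.
    merged: List[Run] = []
    for text, lang in runs:
        if merged and merged[-1][1] == lang:
            merged[-1] = (merged[-1][0] + text, lang)
        else:
            merged.append((text, lang))

    # Step 1: a single-language word is checked as a whole, as today.
    if len(merged) <= 1:
        return merged

    pieces: List[Run] = []
    for i, (text, lang) in enumerate(merged):
        if i + 1 < len(merged):
            # Characters on either side of the language border.
            border = (text[-1:], merged[i + 1][0][:1])
            if not any(ch in APOSTROPHES or ch == HYPHEN for ch in border):
                return None  # 2c: no separator, flag the whole word
        # 2a: an apostrophe stays with the run it is tagged in.
        # 2b: hyphens at the edges of a piece are dropped before checking.
        pieces.append((text.strip(HYPHEN), lang))
    return [piece for piece in pieces if piece[0]]
```

With this sketch, `[("sull'", "it"), ("International", "en")]` is split into two pieces checked in their own languages, and `[("flamenco", "es"), ("-tantsija", "et")]` yields "flamenco" and "tantsija". Whether a piece such as "sull'" then passes the Italian dictionary is a separate question.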
Comment 8 sergio.callegari 2020-06-26 08:48:32 UTC
@Mihkel Tõnnov

> you said that "quell'" alone is not an Italian word - how is it currently handled by spellcheck, if used before an Italian word? Would 2a as described above work for Italian?

In Italian there are a lot of cases where there is an "elision" between two similar sounds. For instance, the article "lo" loses its final "o" when preceding nouns that start with a vowel: rather than writing "lo ombrello" you write "l’ombrello". Incidentally, this is the same thing that happens with "quello" and "albero", which become "quell’albero" in my previous example. In this latter case "quello" is not an article, but the rule is the same.

To the best of my understanding, these cases are handled by treating the two words that the elision fuses in pronunciation as a single word for spell checking.

Hence in the spell checking dictionary you have "lo" "ombrello" but also "l’ombrello", "quello", "albero", but also "quell'albero". I do not know the details, but I think that this is handled efficiently in the spell checker by combining a base dictionary with an affix file setting some rules to extend the base dictionary. In any case, this saves you from having to introduce the elided forms like "quell" in the dictionary, since these are not correct words by themselves.
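
As a rough illustration of the "base dictionary plus affix rules" idea, here is a small sketch. It is not how the real Italian dictionary is built: `BASE_WORDS`, `ELISION_RULES` and `is_known` are hypothetical names, and the two rules are simplified assumptions.

```python
import re

# Hypothetical base dictionary: only full, stand-alone words.
BASE_WORDS = {"lo", "ombrello", "quello", "albero"}

# Hypothetical affix rules: an elided prefix is accepted only when the
# remaining stem is a known word that starts with a vowel.
ELISION_RULES = [
    ("l'", re.compile(r"[aeiou]")),
    ("quell'", re.compile(r"[aeiou]")),
]


def is_known(word: str) -> bool:
    """Check a word against the base dictionary extended by the rules."""
    # Treat the typographic apostrophe like the plain one.
    normalized = word.replace("\u2019", "'").lower()
    if normalized in BASE_WORDS:
        return True
    for prefix, condition in ELISION_RULES:
        if normalized.startswith(prefix):
            stem = normalized[len(prefix):]
            if stem in BASE_WORDS and condition.match(stem):
                return True
    return False
```

In this sketch "quell'albero" and "l’ombrello" are accepted without "quell" or "l" ever being listed as words, while "quell" alone and "sull'International" are rejected.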

This is why I think that it would be incorrect to consider the "’" as a word separator, at least in Italian and why I think that 2a would not be OK.

To me, the simplest thing to do would be keeping the word separator exactly as it is. Then before passing a word to the spell checker, if you have a word where different characters belong to different languages, pretend that the language for the whole word is "none", rather than pretending it is the language of the first character.
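
A minimal sketch of this filter, assuming each word is available as a list of (text, language tag) runs. The names `effective_language` and `should_spellcheck` are hypothetical, not existing LibreOffice functions.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: a run is a piece of text plus the language
# tag attached to its characters, e.g. ("International", "en").
Run = Tuple[str, str]


def effective_language(runs: List[Run]) -> Optional[str]:
    """Return the word's single language, or None if its runs disagree."""
    languages = {lang for text, lang in runs if text}
    if len(languages) == 1:
        return next(iter(languages))
    return None  # mixed (or empty) word: language "none"


def should_spellcheck(runs: List[Run]) -> bool:
    """Only words with one consistent language go to the spell checker."""
    return effective_language(runs) is not None
```

Here `should_spellcheck([("quell'albero", "it")])` is true, while `should_spellcheck([("sull'", "it"), ("International", "en")])` is false, so the mixed word is skipped and word separation itself is left untouched.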
Comment 9 Mihail Balabanov 2021-04-10 19:31:22 UTC
Marking the multi-language words as ‘do not check’ would eliminate any false positives but also conceal typos when they do exist.

In Bulgarian, we use a hyphen when adding a plural and/or definiteness suffix to a foreign word or abbreviation in a different alphabet – like ‘DVD-то’ (the DVD), ‘Oscar-ите’ (the Oscars). Like in the languages mentioned above, the suffixes are not correct words by themselves. Considering that Estonian uses the apostrophe for such suffixes, Italian has apostrophe-separated prefixes, and other languages may have other separators, it makes sense to have a generalized mechanism to configure the allowed affixes for foreign words – together with any separators – per language. For example:

aff file:
FOREIGNWORDSFXSEP '-' # Would be ’ for Estonian

dic file:
ите/S # valid only immediately after a word in a different language and separated by ‘-’
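
To show how such a configuration might be applied, here is a small sketch. FOREIGNWORDSFXSEP is the hypothetical directive proposed above, not an existing Hunspell option; `FOREIGN_SUFFIX_SEPARATOR`, `FOREIGN_SUFFIXES` and `suffix_is_valid` are likewise made-up names, and the suffix set is just the two Bulgarian examples from this comment.

```python
from typing import List, Set, Tuple

Run = Tuple[str, str]  # (text, language tag), a hypothetical representation

# Values that would come from the proposed aff/dic entries (assumptions).
FOREIGN_SUFFIX_SEPARATOR = "-"
FOREIGN_SUFFIXES: Set[str] = {"то", "ите"}


def suffix_is_valid(runs: List[Run], native_lang: str) -> bool:
    """Accept <foreign word><separator><native suffix>, e.g. DVD-то."""
    if len(runs) != 2:
        return False
    (head, head_lang), (tail, tail_lang) = runs
    if head_lang == native_lang or tail_lang != native_lang:
        return False
    sep = FOREIGN_SUFFIX_SEPARATOR
    # The separator may be tagged with either language, so accept both.
    if head.endswith(sep):
        head = head[: -len(sep)]
    elif tail.startswith(sep):
        tail = tail[len(sep):]
    else:
        return False
    return bool(head) and tail in FOREIGN_SUFFIXES
```

The foreign part ("DVD", "Oscar") would still be checked separately in its own language; this sketch only validates the suffix and its separator.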
Comment 10 sergio.callegari 2021-04-10 21:32:12 UTC
@Mihail Balabanov, can you please expand a little?

It is unclear to me how the proposed change about not spellchecking mixed language words could conceal typos that are now caught, or do you mean that there is a need to add an additional mechanism so that also multi-language words can get spellchecked to catch typos that cannot be caught now?

I have a feeling that the mechanism that you propose requires a big overhaul of the whole spellchecking mechanism and entails a lot of corner cases. For instance, consider the Italian "sull'International Conference". You would need a mechanism to say: "sull" can be considered a valid Italian word only if it is followed by a "’" followed by a non-Italian word whose initial sound is vowel-like. Seems a bit impractical to me.

Possibly, what you need is to differentiate just by language: when you have a multi-language token where one piece is in Italian, I think that it really should not undergo spellchecking, which would lead to wrong results in 99% of the cases anyway. Furthermore, an author would already take greater care with it due to its particular form. Maybe if you have a multi-language token where one piece is in Bulgarian there should be another approach, or the current one is already OK.

As a side note, given the ongoing discussion I wonder if the issue could be marked as confirmed.