Bug 138502 - Spellchecker problems with multiple languages and custom languages
Summary: Spellchecker problems with multiple languages and custom languages
Status: RESOLVED WONTFIX
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Linguistic
Version: unspecified (earliest affected)
Hardware: All All
Importance: medium enhancement
Assignee: Not Assigned
Duplicates: 148571
 
Reported: 2020-11-26 04:04 UTC by ariel18
Modified: 2022-08-11 11:01 UTC
CC List: 5 users



Attachments
Word on macOS (37.23 KB, image/png)
2022-08-11 07:58 UTC, Heiko Tietze

Description ariel18 2020-11-26 04:04:47 UTC
Description:
Suppose I am writing an English text containing many German words, or a German text containing many English ones. There is no obvious way to use a spellchecker without manually labeling each language switch.

Suppose I write my own .aff file for a language. It seems I can set a user-defined dictionary, but I cannot find any way to set a user-defined .aff file. This makes it difficult to develop and test new dictionaries in LibreOffice.

Steps to Reproduce:
1. Write a document containing words in two or more languages (where one language may or may not be supported)
2. Try to use spellcheck


Actual Results:
I may want the spellchecker to accept any words in the de_DE.dic file, inflected according to the de_DE.aff file, AND any words in the en_GB.dic file, inflected according to the en_GB.aff file. However, currently, I cannot, unless I explicitly tell the document which bits are in which language (Tools>Language>For selection), which is more tedious than manual spellchecking in each of the two language modes.

I realize that using multiple languages would increase my false-negative rate, since misspellings that happened to be words in the other language would not be picked up. That's acceptable; it'd be much better than the huge false-positive rate you get when spellchecking German as English or vice-versa.

To avoid this error-rate issue, I could add the minor-language words to a user-defined dictionary; this often works well.

However, in user-defined dictionaries, I can only give inflection rules by analogy to existing words *in the default document language*. Many words in German inflect in ways that English words do not, and vice-versa. I therefore have to add every possible inflection of each second-language word as a separate user-defined dictionary entry, or the spellchecker won't work. This is very tedious.

Expected Results:
Potential solutions:
Potential solution 1: I could manually amalgamate the de_DE and en_GB files, but that would be tedious (inflection categories have what are essentially one-capital-letter variable names!). Also, while there's a system for adding user-defined dictionaries, there is no way I can see to add a user-defined .aff file. So it seems I'd have to pretend my hybrid file was an existing language! And I get the error-rate issue. This solution seems poor.

Potential solution 2: Since many extensions supply new dictionary+aff-file pairs, an extension/function allowing the user to add custom pairs seems like it should be possible, but I don't think it exists. https://wiki.documentfoundation.org/Development/Dictionaries
has no instructions, beyond asking the developers.

Potential solution 2.5: An option to do a semi-automated merge of existing dictionary+aff pairs to create a custom merged dic+aff pair for use as in PS2 above (while leaving the original languages intact). The error-rate issue occurs, unless I pare down the auto-generated file. 

Potential solution 3: User-defined dictionaries currently only allow users to define inflections by analogy to words in ONE dictionary+aff pair: "inflect this word like the word 'troggle' in the .dic file". There is no way to say "inflect the word "triggle" like the word 'troggle' in (xy_XY.dic and xy_XY.aff), and inflect the word "boing" like the word 'sproing' in (wz_WZ.dic and wz_WZ.aff)". 
	3a. I'd like to have the option to define inflections with variable names (like in the non-user-defined dictionaries, e.g.: "Adam/SM", where "S" and "M" are classes of inflections "Adam" takes, namely "Adams" and "Adam's"). 
	3b. I'd also like to use variable names that refer to a specific .aff file. Example: if the word is "widget", defining inflections with "widget/$en_GB_X" instead of "widget/X". It should also be possible to say "widget/$en_GB_X+$de_DE_Y" or "widget/$en_GB_X$de_DE_X". But maybe multiletter names would run into format-definition problems. This solution would greatly reduce the error-rate issue, especially combined with PS2. It would save manually copying inflections into a PS2 custom .aff file.
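
As a rough illustration of proposal 3b (this is not an existing Hunspell or LibreOffice feature), the sketch below parses entries of the hypothetical form "word/$<locale>_<flag>" and expands them against per-locale rule tables. The `TOY_AFFIXES` tables, the flag letters, and the function names are all invented for the example; real .aff rules are far richer.

```python
import re

# Toy suffix tables standing in for .aff files. The flag letters and suffixes
# are invented for illustration; they are NOT the real en_GB/de_DE rules.
TOY_AFFIXES = {
    "en_GB": {"S": ["s"], "M": ["'s"]},
    "de_DE": {"P": ["e", "en"]},
}

# Hypothetical syntax from proposal 3b: "$<locale>_<flag>", with or without
# a "+" separator between flags.
FLAG_RE = re.compile(r"\$([a-z]{2,3}_[A-Z]{2})_([A-Za-z0-9])")


def parse_entry(entry):
    """Split 'word/$en_GB_S+$de_DE_P' into the word and (locale, flag) pairs."""
    word, _, flags = entry.partition("/")
    return word, FLAG_RE.findall(flags)


def expand(entry):
    """Return the word plus every form licensed by its namespaced flags."""
    word, pairs = parse_entry(entry)
    forms = {word}
    for locale, flag in pairs:
        for suffix in TOY_AFFIXES[locale][flag]:
            forms.add(word + suffix)
    return forms


print(sorted(expand("Bundestag/$en_GB_M")))
```

With these toy tables, a German noun such as "Bundestag" picks up the English possessive without the user having to list "Bundestag's" as a separate entry.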

Potential solution 4: Add a LibreOffice setting to tell the spellchecker to use multiple pairs of words+inflections, and only flag words not found in any selected language. For correct-spelling-guessing algorithms, I'd be happy to set a preference for the rules in one specified .aff file over another, or set an order of priorities. Error-rate issue occurs, but that may be acceptable to many users.
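
A minimal sketch of the behaviour PS4 asks for, and of the error-rate trade-off it brings. The word lists `EN` and `DE` are tiny invented stand-ins for real dictionaries, and `flag_unknown` is a hypothetical helper, not a LibreOffice API.

```python
# Tiny invented word lists standing in for the en_GB and de_DE dictionaries.
EN = {"the", "gift", "is", "a", "present", "also"}
DE = {"das", "gift", "ist", "ein", "also", "wort"}


def flag_unknown(text, *wordlists):
    """Flag only the words that appear in none of the selected word lists."""
    known = set().union(*wordlists)
    return [w for w in text.lower().split() if w not in known]


# German text checked as English only: every word is a false positive.
print(flag_unknown("das ist ein wort", EN))
# Same text checked against both lists: nothing is flagged.
print(flag_unknown("das ist ein wort", EN, DE))
# The cost: the typo "ist" for "is" in an English sentence goes unnoticed,
# because "ist" is a valid word in the other list (a false negative).
print(flag_unknown("the gift ist a present", EN, DE))
```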



Reproducible: Always


User Profile Reset: No



Additional Info:
PS4 would probably be simplest for the majority of users, and useful for language teaching and people writing about A-language texts in language B. PS2 would be the most flexible, and useful for people using rare languages. It would encourage users to develop language tools for LibreOffice. Conlangers would love it, too.

PS3 has an additional use case. I may want to accept words from the de_DE.dic which have been inflected according to rules in the en_GB.aff file, or vice-versa; for example, "The Bundestag's procedural rules forbid it" or "Ich habe den Computer gecrasht" ("crash" is not really a German word, and "Computer", as a German noun, is capitalized). With PS3 I can easily add these cases to my user-defined dictionary.

It seems I am not the only one who would appreciate this sort of functionality, which makes me fear that at least PS4 is a difficult feature to add: https://ask.libreoffice.org/en/question/71151/simultaneously-use-two-languages-in-one-document/

PS2 would really be useful (even without adding PS2.5 and PS3, which would make it more useful), and it looks, to my ignorant eyes, easier to implement.

Questions:
Do any of these potential solutions already exist, and if so, where can I learn about them? If not, how feasible are PS2-4 as feature requests?
Comment 1 Mike Kaganski 2020-11-26 06:13:15 UTC
In LibreOffice, there are two ways this is expected to work:

1. Use of styles.
   This method implies that you create styles (character and/or maybe paragraph) which have the required languages defined explicitly. The styles may be assigned custom keyboard shortcuts for ease of use.
   This method works in any environment. However, until there is a UI for nesting character styles (tdf#115311), it will not be useful, since there's no way to give a text range a "language" character style *and* some other character style (say, strong emphasis).

2. Taking system input language into account.
   This method uses the input language as reported by the OS/environment, and automatically applies the respective direct formatting to entered text at each key press. This feature depends on the "Ignore system input language" setting in Options->Language Settings->Languages (when the setting is unchecked, the feature is active); it is active by default.
   This requires that the user *changes* the system input language correctly, though. In locales where it's normal to switch keyboard layouts when switching languages (e.g., when one uses a Cyrillic script and a Latin script: say, Russian + English, which have different keyboard layouts, with no Latin characters on the Russian layout and vice versa), one *always* presses e.g. Shift+Alt when switching the keyboard layout, and that additionally switches the input language. That is in the muscle memory of those users, and does not pose problems. However, for users who write in two languages that both use Latin script, this is an unknown/unusual workflow; these users don't think about switching something when they start writing in a different language; they usually don't know that they can have several system input modes (e.g., English input language using the en-US intl keyboard layout, and German input language also using the en-US intl keyboard layout), which would let them make use of that system feature.
   This feature also depends on OS/environment support for this feature in LO. It has always been supported on Windows. Until recently, it was not supported on Linux (tdf#108151); it was implemented for Qt5 by Jan-Marek, which will be in LO 7.2 (summer 2021). It is not (yet?) available for GTK and other backends.
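
Both methods end up attaching a single language to each run of text, and the spellchecker then consults only the dictionary for that language. The sketch below models that idea only; the `Run` type, the `DICTIONARIES` word lists, and `misspelled` are invented for illustration and do not reflect Writer's internals.

```python
from dataclasses import dataclass

# Toy word lists standing in for installed dictionaries (invented).
DICTIONARIES = {
    "en-US": {"the", "acid", "world"},
    "de-DE": {"die", "säure", "welt"},
    "es-ES": {"la", "ácida"},
}


@dataclass
class Run:
    """A span of text carrying exactly one language attribute."""
    text: str
    lang: str


def misspelled(runs):
    """Check every word only against the dictionary of its own run."""
    flagged = []
    for run in runs:
        words = DICTIONARIES.get(run.lang, set())
        for token in run.text.split():
            if token.lower() not in words:
                flagged.append((token, run.lang))
    return flagged


# Untagged German inside an English run gets flagged...
print(misspelled([Run("the acid Säure", "en-US")]))
# ...while the same word in a run tagged as German does not.
print(misspelled([Run("the acid", "en-US"), Run("Säure", "de-DE")]))
```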

I disagree that we should allow marking text as multi-language. Instead, we should fix the existing problems in the use of the two methods described above, and create an extensive help how to use them properly.
Comment 2 ariel18 2020-11-26 19:26:43 UTC
Thank you, Mike Kaganski. I am on Linux with LO6 and did not know about that system input language toggle. It sounds a much better suggestion than mine if one is typing the document and not modifying an existing text. I assume that if one uses >2 languages regularly one can cycle through them?

Conflation of styles and language seems odd, but I suppose I might be able to use it in some contexts. Good to know, especially if style-nesting gets fixed.

To be honest, one of the biggest problems I have for non-English languages in Latin script is diacritics. Sadly, Linux also doesn't seem to effectively implement the ["compose" key](https://en.wikipedia.org/wiki/Compose_key) (the settings exist, in many places, but don't seem to work). If I have to write a text about three people called Xaudaró, Yī and Zaćiragić, I can add them to the user-defined dictionary and autocorrect, but when the diacritics distinguish meanings (say, I am writing about 尤 Yóu, 右 Yòu, and 幽 Yōu, and also using the English word "you"), this does not work. And if I want to inflect my diacritic-laden words with a variety of non-English inflections I'm out of luck. Even toggling through languages might get a bit tedious here, assuming the languages are in LibreOffice.

Is it possible to set a user-defined .aff file, for inflections on user-defined dictionary words where neither word nor inflection is in the existing languages supported by LibreOffice? I'm only really wanting to inflect a few terms, but if I keep adding them as I go I might eventually have a full dictionary.
Comment 3 Heiko Tietze 2020-12-03 12:25:53 UTC
(In reply to ariel18 from comment #2)
> Is it possible to set a user-defined...

Tools > Options > Language Settings > Writing Aids (or Options... in the spell checking dialog) lets you add, edit, and delete dictionaries. Should be easy to check Yòu against MyGreek and Yóu against MyFrench.
Admittedly, this task is not easy and perhaps you just ignore the words eventually ;-)

Guess the original request has been answered. Resolving the ticket as WF, feel free to reopen.
Comment 4 ariel18 2020-12-03 19:17:14 UTC
(In reply to Heiko Tietze from comment #3)
> (In reply to ariel18 from comment #2)
> > Is it possible to set a user-defined...
> 
> Tools > Options > Language Settings > Writing Aids (or Options... in the
> spell checking dialog) lets you add, edit, and delete dictionaries.

Thank you, Heiko. I did find that setting for the .dic files, but I also wanted to know about custom .aff files. 

In many Linux distros these will be under /usr/share/hunspell

They are the files that give permissible inflection types. They allow you to say that if "dog" is a correctly-spelled word, so are "dogs" and "dog's" and "dogs'". And if "Wort" is a word, so is "Wörter", and if "gehen" is a word, so are "gehe", "gehst", "gegangen", and so on. The man page for hunspell describes this in some detail (https://www.systutorials.com/docs/linux/man/4-hunspell/).
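
To make the mechanism concrete, here is a rough sketch of how a .dic/.aff pair licenses inflected forms. The `AFF` and `DIC` strings are a tiny invented pair written in Hunspell's SFX syntax (a header line, then rule lines of the form "SFX flag strip add condition"), not the real en_GB files, and the parser handles only this small subset with single-letter flags.

```python
import re

# Invented mini .aff: flag S adds "s", flag M adds "'s" ("0" = strip nothing,
# "." = applies to any word).
AFF = """\
SFX S Y 1
SFX S 0 s .
SFX M Y 1
SFX M 0 's .
"""

# Invented mini .dic: first line is the approximate entry count.
DIC = """\
2
Adam/SM
dog/SM
"""


def parse_aff(text):
    """Collect suffix rules per flag; header lines (four fields) are skipped."""
    rules = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 5 and parts[0] == "SFX":
            _, flag, strip, add, cond = parts
            rules.setdefault(flag, []).append((strip, add, cond))
    return rules


def expand(entry, rules):
    """Return the stem plus every suffixed form its flags allow."""
    word, _, flags = entry.partition("/")
    forms = {word}
    for flag in flags:
        for strip, add, cond in rules.get(flag, []):
            if not re.search(cond + "$", word):
                continue
            stem = word if strip == "0" else word[: len(word) - len(strip)]
            forms.add(stem + add)
    return forms


def load(dic_text, aff_text):
    """Build the full set of accepted words from a .dic/.aff pair."""
    rules = parse_aff(aff_text)
    accepted = set()
    for entry in dic_text.splitlines()[1:]:
        accepted |= expand(entry, rules)
    return accepted


print(sorted(load(DIC, AFF)))
```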

The custom dicts seem to use a different way of indicating inflections from the default hunspell dicts (by analogy to other words, not by categories). This means that if my custom words do not inflect just like an existing English word, I have to add every inflected form individually to the custom dictionaries.

So my question is, how do I add custom inflection patterns?
Comment 5 Heiko Tietze 2020-12-04 08:11:36 UTC
(In reply to ariel18 from comment #4)
> So my question is, how do I add custom inflection patterns?

Isn't it a question to the hunspell community rather than a bug report or enhancement request for LibreOffice? If not, please use ask.libreoffice.org for questions. Would have to dig the web myself to find an answer.
Comment 6 Tristan Miller 2022-08-10 11:57:43 UTC
*** Bug 148571 has been marked as a duplicate of this bug. ***
Comment 7 Tristan Miller 2022-08-10 12:12:05 UTC
(In reply to Mike Kaganski from comment #1)
> I disagree that we should allow marking text as multi-language. Instead, we
> should fix the existing problems in the use of the two methods described
> above, and create an extensive help how to use them properly.

If I understand you correctly, both your proposed methods still require users to manually signal the change of language, an inconvenience that this feature request is explicitly meant to avoid.  (And as you yourself point out, such manual signalling is not typically part of the writing workflow when writing in two languages that share a common script.)  There are at least a couple good reasons to allow LibreOffice users to tag the same document or span of text with multiple languages:

1. The feature seems to be frequently requested, as in Bug 148571 and various Ask LibreOffice threads such as <https://ask.libreoffice.org/t/simultaneously-use-two-languages-in-one-document/19210/2>.

2. Other word processors and text editors, such as Microsoft Word and Emacs, are already able to do this, and according to the aforementioned reports, people do seem to be using and appreciating this functionality.
Comment 8 Mike Kaganski 2022-08-10 12:17:47 UTC
(In reply to Tristan Miller from comment #7)
> If I understand you correctly, both your proposed methods still require
> users to manually signal the change of language, an inconvenience that this
> feature request is explicitly meant to avoid.  (And as you yourself point
> out, such manual signalling is not typically part of the writing workflow
> when writing in two languages that share a common script.)

Yes.

> 1. The feature seems to be frequently requested, as in Bug 148571 and
> various Ask LibreOffice threads such as
> <https://ask.libreoffice.org/t/simultaneously-use-two-languages-in-one-
> document/19210/2>.

Indeed. Any user would love software that has a magic button "do everything I think about". That doesn't make it possible to create reasonably working mind-reading software.

> 2. Other word processors and text editors, such as Microsoft Word ...
> are already able to do this

Wrong. MS Word does not do that AFAICT - is there any evidence (I mean, documentation) that it does?
Comment 9 Tristan Miller 2022-08-10 12:51:59 UTC
(In reply to Mike Kaganski from comment #8)
> > 2. Other word processors and text editors, such as Microsoft Word ...
> > are already able to do this
> 
> Wrong. MS Word does not do that AFAICT - is there any evidence (I mean,
> documentation) that it does?


I don't use that software myself; my source is the Ask LibreOffice thread I linked to.  Two users there (RJD and Hilbert) claim that Word does this.
Comment 10 Mike Kaganski 2022-08-10 13:26:46 UTC
(In reply to Tristan Miller from comment #9)
> Two users there (RJD and Hilbert) claim that Word does this.

They are wrong. I have Word; it will only use one dictionary for any given text - the one that matches the language set for it, either manually (e.g. using the status bar language control over selected text) or implicitly, by honoring the system input language (the same way we do in Writer on Windows).
Comment 11 Mike Kaganski 2022-08-10 13:37:55 UTC
(In reply to Tristan Miller from comment #9)

I read their posts.

*Possibly* RJD claims that. It would be interesting to learn if MS Word actually does that on macOS, or if there's some OS-specific way of recognizing text language. Note: I use Windows, so my knowledge of how Word behaves is incomplete (limited to one OS).

But Hilbert never claimed Word did that. They only claim that in LibreOffice, spell check may simply *not* detect misspelled words - exactly the opposite to what is the problem in this issue, where words get flagged when a wrong dictionary is used on them. So the post of Hilbert is absolutely unrelated to this issue, and only discusses some problems with configuring spell checkers.

Just to clarify a bit.
Comment 12 Heiko Tietze 2022-08-11 07:58:39 UTC
Created attachment 181718
Word on macOS

All text set to English; no warnings on the pseudo-Latin, German, and Spanish text, only on the Russian. Probably not a good example.
Comment 13 Mike Kaganski 2022-08-11 08:12:30 UTC
(In reply to Heiko Tietze from comment #12)

It indeed is not that clear.
1. Are the respective dictionaries installed/available (Latin, German, Spanish)?
2. Would a word in a Roman script, that definitely has a spelling error (not present in any language) be underlined?
Comment 14 Mike Kaganski 2022-08-11 08:17:45 UTC
(In reply to Heiko Tietze from comment #12)

Also could you please try with words that are less likely to appear in English dictionaries?

I tried locally with Ácida Säure (acid), and they were underlined when marked en-US (unlike Welt), while not underlined when marked Spanish (Spain) and German (Germany) respectively.
Comment 15 Heiko Tietze 2022-08-11 10:12:12 UTC
Ácida is not underlined when the text is defined as Spanish, Säure not in case of German. Otherwise, i.e. also for English, it's a spelling mistake.

Don't know where to check the installed dictionaries, don't remember having any installed. The Preferences > Spelling & Grammar page lists only Custom Dictionary.

Locale is English (German), default language is None (applied when I paste Ácida Säure).
Comment 16 Mike Kaganski 2022-08-11 11:01:22 UTC
(In reply to Heiko Tietze from comment #15)
> Ácida is not underlined when the text is defined as Spanish, Säure not in
> case of German. Otherwise, i.e. also for English, it's a spelling mistake.

So - no, Word does not behave as claimed in the Ask question. It works the same way as Writer. And that is the correct way: using multiple dictionaries in an attempt to match a word in any of them is likely to create false negatives. The "manual signalling is not typically part of the writing workflow when writing in two languages that share a common script" is not a compelling argument; just introduce that into the workflow. Writer is a tool to create texts; correct spelling is crucial for good texts; the behavior of tools where correct spelling is not important (like when you tweet, where anything is tolerable) is not something we should consider as guidance.

> Don't know where to check the installed dictionaries

When you mark a word as some language in Word (you likely select the word, then click on the status bar's language indicator, then select the language in the dialog), the dialog has languages with dictionaries marked with a (blue) checkbox and "abc" - very much like in Writer's language lists.