LibreOffice correctly detects CTL text and sets its text language to whatever is selected in the complex text language drop-down list box in Tools > Options > Language Settings > Languages (by default it is Hindi). The problem with this is that I could be writing in multiple CTL languages in one sentence, and falling back on a single configured CTL language isn't useful. I believe it is possible to detect the user's keyboard layout; if so, why not use that to set the text language accurately?
I assume this same principle could be used for other languages as well.
IIRC we support this under Windows because the IM there has a property to indicate the language the IM is for, while under Linux we don't cause it doesn't.
e.g. WinSalFrame::GetInputLanguage for the Windows one, which has the feature, vs. GtkSalFrame::GetInputLanguage, which can only return LANGUAGE_DONTKNOW
I think this is a duplicate of bug 108151
(In reply to Caolán McNamara from comment #1)
> IIRC we support this under Windows because the IM there has a property to
> indicate the language the IM is for, while under Linux we don't cause it
So with this mechanism not available on Linux, would it be possible to use your libexttextcat library to detect the language and change it accordingly? Or alternatively, extend the current CTL detection to detect CTL languages based on the Unicode character range being typed?
I imagine using libexttextcat would just introduce a pile of "my language was guessed wrong" bugs. Especially for short sequences of text which won't be long enough for the statistical efforts of libexttextcat to guess it right.
Unicode char range folds this bunch of languages https://en.wikipedia.org/wiki/Arabic_script#Languages_currently_written_with_the_Arabic_alphabet to Arabic, while Hebrew script munges Yiddish and Hebrew together, which is maybe acceptable loss and probably happens on Windows already.
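To illustrate the folding described above, here is a minimal Python sketch that classifies characters by Unicode block range. The block ranges are the ones defined by the Unicode standard; the names `SCRIPT_RANGES`, `script_of` and `scripts_in` are made up for this example and are not LibreOffice code.

```python
# Sketch only: helper names are hypothetical, block ranges are from Unicode.
SCRIPT_RANGES = [
    (0x0590, 0x05FF, "Hebrew"),      # Hebrew block
    (0x0600, 0x06FF, "Arabic"),      # Arabic block
    (0x0750, 0x077F, "Arabic"),      # Arabic Supplement block
    (0x0900, 0x097F, "Devanagari"),  # Devanagari block
]


def script_of(ch):
    """Return the script name for one character, or None if it is not covered."""
    cp = ord(ch)
    for lo, hi, script in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return script
    return None


def scripts_in(text):
    """Return the set of covered scripts that occur in text."""
    return {s for s in map(script_of, text) if s is not None}


# An Arabic word and a Persian word both come out as just "Arabic" script,
# and a Hebrew word and a Yiddish word both come out as "Hebrew".
print(scripts_in("كتاب"), scripts_in("پدر"))
print(scripts_in("שלום"), scripts_in("גוט"))
```

A range check like this can tell the script apart, but not which language within that script the user is typing.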
There are some hints in bug 108151 about some available fields in the gtk integration with the IBUS IM that might be of some use to pick an acceptable value to set for the language.
(In reply to Caolán McNamara from comment #3)
> I imagine using libexttextcat would just introduce a pile of "my language
> was guessed wrong" bugs. Especially for short sequences of text which won't
> be long enough for the statistical efforts of libexttextcat to guess it
Have you seen this library? https://github.com/CLD2Owners/cld2
> Unicode char range folds this bunch of languages
> Arabic_script#Languages_currently_written_with_the_Arabic_alphabet to
> Arabic, while Hebrew script munges Yiddish and Hebrew together, which is
> maybe acceptable loss and probably happens on Windows already.
For Arabic-alphabet languages, LO only lists Persian, Uyghur, Punjabi and Urdu under CTL, and there are Unicode characters that are unique to most of these languages.
@Lior: what is your take on Hebrew detection?
> There are some hints in bug 108151 about some available fields in the gtk
> integration with the IBUS IM that might be of some use to pick an acceptable
> value to set for the language.
Guessing based on locale is definitely helpful to some degree if a user lives in a country that a particular language is highly used in.
Created attachment 145279 [details]
Document with text in English, Hebrew and Arabic for reproducing this issue
Reproduction instructions for this issue using the attached document:
1. Set your LO CTL language to Hebrew
2. Open the document (tri-lingual.odt)
3. Walk the cursor along the single line of text. You should see the status bar indicate the language as English (or "(en)"), then Hebrew, then Arabic (or "Arabic (Saudi Arabia)").
4. Copy the full line of text
5. Close the document
6. Open a new Writer document
7. Paste-Special the text you've copied, as unformatted text
8. Walk the cursor through the line again
Expected result: Language will again change from English, to Hebrew, to Arabic.
Actual result: Language will change from English to Hebrew, and be reported as Hebrew for the Arabic text as well.
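The difference between the expected and actual results can be modelled with a small Python sketch. It is only a model of the behavior reported here, not LibreOffice's implementation: the sample string, the `LANGUAGE_FOR_SCRIPT` table and all function names are assumptions made for illustration. The script of each character is taken from the prefix of its Unicode character name.

```python
import unicodedata

# Hypothetical mapping for this example only.
LANGUAGE_FOR_SCRIPT = {"Latin": "English", "Hebrew": "Hebrew", "Arabic": "Arabic"}


def script_of(ch):
    """Return 'Hebrew', 'Arabic' or 'Latin' from the character name, else None."""
    name = unicodedata.name(ch, "")
    for script in ("HEBREW", "ARABIC", "LATIN"):
        if name.startswith(script):
            return script.capitalize()
    return None  # spaces, digits, punctuation


def script_runs(text):
    """Split text into (script, substring) runs; neutral chars join the current run."""
    runs = []
    for ch in text:
        script = script_of(ch)
        if runs and script in (None, runs[-1][0]):
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, chunk) for script, chunk in runs]


def languages(text, ctl_fallback=None):
    """Per-script languages if ctl_fallback is None, else one language for all CTL runs."""
    result = []
    for script, _ in script_runs(text):
        if ctl_fallback is not None and script in ("Hebrew", "Arabic"):
            result.append(ctl_fallback)
        else:
            result.append(LANGUAGE_FOR_SCRIPT.get(script))
    return result


sample = "Hello שלום مرحبا"  # made-up stand-in for the attached document's text
print(languages(sample))                         # per-script assignment (expected)
print(languages(sample, ctl_fallback="Hebrew"))  # single CTL fallback (as reported)
```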
Oh, I should mention I tested with:
Build ID: ad6adb1bfadf49af3187a0bb3ceffbf355e9eed1
CPU threads: 4; OS: Linux 4.9; UI render: default; VCL: gtk2;
TinderBox: Linux-rpm_deb-x86_64@70-TDF, Branch:master, Time: 2018-09-29_02:45:20
Locale: en-US (en_IL); Calc: threaded
Still the same behavior with:
Version: 220.127.116.11.alpha0+ / LibreOffice Community
Build ID: 5c68399e6bea3aa18477487400f8bb143d6ed84e
CPU threads: 4; OS: Linux 5.18; UI render: default; VCL: gtk3
Locale: en-IL (en_IL); UI: en-US
See also: bug 139185 (and its See Also list) about language guessing problems of libexttextcat; see bug 139185 comment 4.