Bug 113298 - RTL: Automatic language detection based on keyboard layout
Summary: RTL: Automatic language detection based on keyboard layout
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version (earliest affected): 6.0.0.0.alpha0+
Hardware: All All
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: RTL-CTL Language-Detection
Reported: 2017-10-20 16:30 UTC by Yousuf Philips (jay) (retired)
Modified: 2018-09-30 17:58 UTC
CC List: 5 users

See Also:
Crash report or crash signature:


Attachments
Document with text in English, Hebrew and Arabic for reproducing this issue (8.66 KB, application/vnd.oasis.opendocument.text)
2018-09-30 17:57 UTC, Eyal Rozenberg

Description Yousuf Philips (jay) (retired) 2017-10-20 16:30:50 UTC
LibreOffice correctly detects CTL text and sets its text language to whatever is selected in the complex text language drop-down list in Tools > Options > Language Settings > Languages (Hindi by default). The problem is that I could be writing in several CTL languages within a single sentence, so falling back on one configured CTL language isn't useful. I believe it is possible to detect the user's keyboard layout; if so, why not use that to set the text language accurately?

I assume this same principle could be used for other languages as well.
Comment 1 Caolán McNamara 2017-10-23 11:55:32 UTC
IIRC we support this under Windows because the IM (input method) there has a property indicating which language the IM is for, while under Linux we don't because it doesn't.

e.g. WinSalFrame::GetInputLanguage for Windows, which has the feature, vs. GtkSalFrame::GetInputLanguage, which can only return LANGUAGE_DONTKNOW.

I think this is a duplicate of bug 108151
Comment 2 Yousuf Philips (jay) (retired) 2017-10-23 12:56:37 UTC
(In reply to Caolán McNamara from comment #1)
> IIRC we support this under Windows because the IM there has a property to
> indicate the language the IM is for, while under Linux we don't cause it
> doesn't.

So with this mechanism not available on Linux, would it be possible to use your libexttextcat library to detect the language and change it accordingly? Or alternatively, extend the current CTL detection to detect CTL languages based on the Unicode character range being typed?
Comment 3 Caolán McNamara 2017-10-23 15:58:34 UTC
I imagine using libexttextcat would just introduce a pile of "my language was guessed wrong" bugs, especially for short sequences of text, which won't be long enough for libexttextcat's statistical approach to guess right.

Unicode character ranges fold this bunch of languages (https://en.wikipedia.org/wiki/Arabic_script#Languages_currently_written_with_the_Arabic_alphabet) into Arabic, while the Hebrew script lumps Yiddish and Hebrew together, which is maybe an acceptable loss and probably happens on Windows already.

There are some hints in bug 108151 about some available fields in the gtk integration with the IBUS IM that might be of some use to pick an acceptable value to set for the language.
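
A minimal sketch of the Unicode-range approach discussed above, showing why it folds several languages into one. The table and the function name `guess_language_from_char` are hypothetical and not part of LibreOffice; only the block boundaries come from the Unicode standard.

```python
# Hypothetical fallback table: one language tag per Unicode block.
# The single tag per block is the lossy part: every language written
# in that script gets folded into it.
SCRIPT_RANGES = [
    (0x0590, 0x05FF, "he"),  # Hebrew block (Hebrew and Yiddish alike)
    (0x0600, 0x06FF, "ar"),  # Arabic block (Arabic, Persian, Urdu, ...)
    (0x0750, 0x077F, "ar"),  # Arabic Supplement block
]


def guess_language_from_char(ch):
    """Return a language tag based only on the code point, or None."""
    cp = ord(ch)
    for start, end, lang in SCRIPT_RANGES:
        if start <= cp <= end:
            return lang
    return None  # not a character this table knows about
```

With this table a Yiddish word and a Hebrew word are indistinguishable, since both consist of characters from the same block.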
Comment 4 Yousuf Philips (jay) (retired) 2017-10-23 18:37:33 UTC
(In reply to Caolán McNamara from comment #3)
> I imagine using libexttextcat would just introduce a pile of "my language
> was guessed wrong" bugs. Especially for short sequences of text which won't
> be long enough for the statistical efforts of libexttextcat to guess it
> right.

Have you seen this library? https://github.com/CLD2Owners/cld2

> Unicode char range folds this bunch of languages
> https://en.wikipedia.org/wiki/Arabic_script#Languages_currently_written_with_the_Arabic_alphabet
> to Arabic, while Hebrew script munges Yiddish and Hebrew together, which is
> maybe acceptable loss and probably happens on Windows already.

For Arabic-alphabet languages, LO only lists Persian, Uyghur, Punjabi and Urdu under CTL, and there are Unicode characters that are unique to most of these languages.

https://en.wikipedia.org/wiki/Persian_language#Additions
https://en.wikipedia.org/wiki/Urdu_alphabet#Differences_from_Persian_alphabet
https://en.wikipedia.org/wiki/Shahmukhi_alphabet

@Lior: what is your take on Hebrew detection?

> There are some hints in bug 108151 about some available fields in the gtk
> integration with the IBUS IM that might be of some use to pick an acceptable
> value to set for the language.

Guessing based on locale is definitely helpful to some degree if a user lives in a country where a particular language is widely used.
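
A rough sketch of how language-specific letters could refine an Arabic-script guess. The letter sets are a small illustrative selection, and the function name `refine_arabic_script_language` is hypothetical. Several of these letters are shared between languages (Shahmukhi Punjabi, for instance, also uses the retroflex letters listed as Urdu hints), so the result is a hint rather than a reliable detection.

```python
# Letters that hint at a language beyond plain Arabic (illustrative only).
URDU_HINTS = set("\u0679\u0688\u0691\u06ba\u06d2")  # tteh, ddal, rreh, noon ghunna, yeh barree
PERSIAN_HINTS = set("\u067e\u0686\u0698\u06af")     # peh, tcheh, jeh, gaf


def refine_arabic_script_language(text, default="ar"):
    """Guess a language tag for Arabic-script text from marker letters."""
    chars = set(text)
    # Check the Urdu hints first: Urdu also uses the Persian additions,
    # so the reverse order would misreport Urdu text as Persian.
    if chars & URDU_HINTS:
        return "ur"
    if chars & PERSIAN_HINTS:
        return "fa"
    return default  # no marker letter seen; keep the fallback language
```

Short inputs often contain none of the marker letters, in which case the fallback is returned unchanged.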
Comment 5 Eyal Rozenberg 2018-09-30 17:57:20 UTC
Created attachment 145279 [details]
Document with text in English, Hebrew and Arabic for reproducing this issue

Reproduction instructions for this issue using the attached document:

1. Set your LO CTL language to Hebrew
2. Open the document (tri-lingual.odt)
3. Walk the cursor along the single line of text. You should see the status bar indicate the language as English (or "(en)"), then Hebrew, then Arabic (or "Arabic (Saudi Arabia)").
4. Copy the full line of text
5. Close the document
6. Open a new Writer document
7. Paste-Special the text you've copied, as unformatted text
8. Walk the cursor through the line again

Expected result: Language will again change from English, to Hebrew, to Arabic.

Actual result: Language will change from English to Hebrew, and be reported as Hebrew for the Arabic text as well.
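
For illustration, the expected result could be approximated by splitting pasted text into runs by script. This is only a sketch: `script_of` and `language_runs` are hypothetical names, not LibreOffice code, and every letter outside the Hebrew and Arabic blocks is simply labelled "Western".

```python
def script_of(ch):
    """Classify one character by Unicode block; None for neutral characters."""
    cp = ord(ch)
    if 0x0590 <= cp <= 0x05FF:
        return "Hebrew"
    if 0x0600 <= cp <= 0x06FF:
        return "Arabic"
    if ch.isalpha():
        return "Western"  # simplification: any other letter
    return None  # spaces, digits, punctuation


def language_runs(text):
    """Split text into (script, substring) runs.

    Neutral characters join the preceding run; neutral characters before
    the first letter join the first run. Text without letters gives [].
    """
    runs = []     # list of [script, substring] pairs
    pending = ""  # neutral characters seen before the first letter
    for ch in text:
        script = script_of(ch)
        if script is None:
            if runs:
                runs[-1][1] += ch
            else:
                pending += ch
        elif runs and runs[-1][0] == script:
            runs[-1][1] += ch
        else:
            runs.append([script, pending + ch])
            pending = ""
    return [(script, part) for script, part in runs]
```

Applied to a line with an English, a Hebrew and an Arabic word, this yields three runs, where the reported behaviour assigns the configured CTL language (Hebrew) to the Arabic run as well.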
Comment 6 Eyal Rozenberg 2018-09-30 17:58:02 UTC
Oh, I should mention I tested with:

Version: 6.2.0.0.alpha0+
Build ID: ad6adb1bfadf49af3187a0bb3ceffbf355e9eed1
CPU threads: 4; OS: Linux 4.9; UI render: default; VCL: gtk2; 
TinderBox: Linux-rpm_deb-x86_64@70-TDF, Branch:master, Time: 2018-09-29_02:45:20
Locale: en-US (en_IL); Calc: threaded