113298 – RTL: Automatic language detection based on keyboard layout

Status:	RESOLVED DUPLICATE of bug 108151

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	LibreOffice
Version: (earliest affected)	6.0.0.0.alpha0+
Hardware:	All All

Importance:	medium enhancement
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	Language-Detection

Reported:	2017-10-20 16:30 UTC by Yousuf Philips (jay) (retired)
Modified:	2024-08-02 23:24 UTC
CC List:	6 users

See Also:	129038 151215 139185
Crash report or crash signature:

Attachments
Document with text in English, Hebrew and Arabic for reproducing this issue (8.66 KB, application/vnd.oasis.opendocument.text) 2018-09-30 17:57 UTC, Eyal Rozenberg

Description Yousuf Philips (jay) (retired) 2017-10-20 16:30:50 UTC

LibreOffice correctly detects CTL text and sets its text language to whatever is selected in the complex text layout (CTL) language drop-down list in Tools > Options > Language Settings > Languages (Hindi by default). The problem is that I could be writing in multiple CTL languages in a single sentence, so falling back on one preset CTL language isn't useful. I believe it is possible to detect the user's keyboard layout; if so, why not use that to set the text language accurately?

I assume this same principle could be used for other languages as well.

Comment 1 Caolán McNamara 2017-10-23 11:55:32 UTC

IIRC we support this under Windows because the IM there has a property to indicate the language the IM is for, while under Linux we don't cause it doesn't.

E.g. WinSalFrame::GetInputLanguage on Windows, which has the feature, vs. GtkSalFrame::GetInputLanguage, which can only return LANGUAGE_DONTKNOW.

I think this is a duplicate of bug 108151

Comment 2 Yousuf Philips (jay) (retired) 2017-10-23 12:56:37 UTC

(In reply to Caolán McNamara from comment #1)
> IIRC we support this under Windows because the IM there has a property to
> indicate the language the IM is for, while under Linux we don't cause it
> doesn't.

So with this mechanism not available on Linux, would it be possible to use your libexttextcat library to detect the language and change it accordingly? Or alternatively, extend the current CTL detection to detect CTL languages based on the Unicode character range being typed?

Comment 3 Caolán McNamara 2017-10-23 15:58:34 UTC

I imagine using libexttextcat would just introduce a pile of "my language was guessed wrong" bugs. Especially for short sequences of text which won't be long enough for the statistical efforts of libexttextcat to guess it right.

Unicode char range folds this bunch of languages https://en.wikipedia.org/wiki/Arabic_script#Languages_currently_written_with_the_Arabic_alphabet to Arabic, while Hebrew script munges Yiddish and Hebrew together, which is maybe acceptable loss and probably happens on Windows already.
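To illustrate the folding described above, here is a minimal Python sketch that classifies characters by script using only the prefix of their Unicode character name. The function names `script_of` and `scripts_in` are made up for this example, and the name-prefix test is a simplification chosen for brevity, not how LibreOffice detects CTL text.

```python
import unicodedata

def script_of(ch):
    """Return a coarse script label for one character, based on its Unicode name."""
    name = unicodedata.name(ch, "")
    if name.startswith("ARABIC"):
        return "Arabic"
    if name.startswith("HEBREW"):
        return "Hebrew"
    if name.startswith("LATIN"):
        return "Latin"
    return None  # spaces, digits, punctuation, other scripts

def scripts_in(text):
    """Collect the set of scripts that occur in text."""
    return {s for s in map(script_of, text) if s is not None}

# Arabic, Persian and Urdu words are indistinguishable at the script level.
arabic = "\u0643\u062a\u0627\u0628"         # Arabic word
persian = "\u0641\u0627\u0631\u0633\u06cc"  # Persian word
urdu = "\u0627\u0631\u062f\u0648"           # Urdu word
# The same happens for Hebrew and Yiddish.
hebrew = "\u05e9\u05dc\u05d5\u05dd"
yiddish = "\u05d9\u05d9\u05b4\u05d3\u05d9\u05e9"
```

With this approach `scripts_in` gives the same answer for the Arabic, Persian and Urdu samples, and the same answer for the Hebrew and Yiddish samples, which is exactly the loss of information being discussed.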

There are some hints in bug 108151 about some available fields in the gtk integration with the IBUS IM that might be of some use to pick an acceptable value to set for the language.

Comment 4 Yousuf Philips (jay) (retired) 2017-10-23 18:37:33 UTC

(In reply to Caolán McNamara from comment #3)
> I imagine using libexttextcat would just introduce a pile of "my language
> was guessed wrong" bugs. Especially for short sequences of text which won't
> be long enough for the statistical efforts of libexttextcat to guess it
> right.

Have you seen this library? https://github.com/CLD2Owners/cld2

> Unicode char range folds this bunch of languages
> https://en.wikipedia.org/wiki/
> Arabic_script#Languages_currently_written_with_the_Arabic_alphabet to
> Arabic, while Hebrew script munges Yiddish and Hebrew together, which is
> maybe acceptable loss and probably happens on Windows already.

For Arabic-alphabet languages, LO only lists Persian, Uyghur, Punjabi and Urdu under CTL, and there are Unicode characters that are unique to most of these languages.

https://en.wikipedia.org/wiki/Persian_language#Additions
https://en.wikipedia.org/wiki/Urdu_alphabet#Differences_from_Persian_alphabet
https://en.wikipedia.org/wiki/Shahmukhi_alphabet
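
A rough Python sketch of what such a character-based heuristic could look like. The marker sets below are assumptions picked for illustration from general knowledge of these alphabets, not taken from LibreOffice or verified against the pages above; the function name and the returned tags are hypothetical too. Note that Urdu and Punjabi (Shahmukhi) share their extra letters, so the sketch cannot tell them apart, and the Persian additions also occur in Urdu and Uyghur, which is why they are checked last.

```python
# Illustrative marker sets (assumptions, not an authoritative list).
UYGHUR_MARKERS = set("\u06ad\u06c6\u06c8\u06cb\u06d0")     # NG, OE, YU, VE, E
URDU_OR_SHAHMUKHI = set("\u0679\u0688\u0691\u06ba\u06d2")  # TTEH, DDAL, RREH, NOON GHUNNA, YEH BARREE
PERSIAN_ADDITIONS = set("\u067e\u0686\u0698\u06af")        # PEH, TCHEH, JEH, GAF

def guess_arabic_script_language(text, fallback=None):
    """Guess a language for Arabic-script text from distinguishing letters.

    Returns fallback (e.g. the configured CTL language) when no marker
    letter is present, which is common for short text.
    """
    chars = set(text)
    if chars & UYGHUR_MARKERS:
        return "ug"
    if chars & URDU_OR_SHAHMUKHI:
        return "ur-or-pa"  # Urdu and Shahmukhi Punjabi are not separable this way
    if chars & PERSIAN_ADDITIONS:
        return "fa"  # weak hint only: these letters are not exclusive to Persian
    return fallback
```

Text that uses none of the marker letters stays ambiguous, so a fallback such as the configured CTL language would still be needed.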

@Lior: what is your take on Hebrew detection?

> There are some hints in bug 108151 about some available fields in the gtk
> integration with the IBUS IM that might be of some use to pick an acceptable
> value to set for the language.

Guessing based on locale is definitely helpful to some degree if a user lives in a country where a particular language is widely used.

Comment 5 Eyal Rozenberg 2018-09-30 17:57:20 UTC

Created attachment 145279
Document with text in English, Hebrew and Arabic for reproducing this issue

Reproduction instructions for this issue using the attached document:

1. Set your LO CTL language to Hebrew
2. Open the document (tri-lingual.odt)
3. Walk the cursor along the single line of text. You should see the status bar indicate the language as English (or "(en)"), then Hebrew, then Arabic (or "Arabic (Saudi Arabia)").
4. Copy the full line of text
5. Close the document
6. Open a new Writer document
7. Paste-Special the text you've copied, as unformatted text
8. Walk the cursor through the line again

Expected result: Language will again change from English, to Hebrew, to Arabic.

Actual result: Language will change from English to Hebrew, and be reported as Hebrew for the Arabic text as well.
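
The expected result amounts to splitting the pasted line into runs by script and giving each run its own language. A minimal Python sketch of that segmentation follows; `script_of`, `script_runs` and the sample string are hypothetical, the script test is based on Unicode character names for brevity, and none of this reflects LibreOffice's actual code.

```python
import unicodedata

def script_of(ch):
    """Coarse script label from the Unicode character name, or None if neutral."""
    name = unicodedata.name(ch, "")
    for prefix in ("ARABIC", "HEBREW", "LATIN"):
        if name.startswith(prefix):
            return prefix.capitalize()
    return None

def script_runs(text):
    """Split text into (script, substring) runs.

    Neutral characters (spaces, punctuation) extend the current run;
    leading neutral characters adopt the script of the first real letter.
    """
    runs = []  # list of [script, substring]
    for ch in text:
        script = script_of(ch)
        if runs and script in (None, runs[-1][0]):
            runs[-1][1] += ch
        elif runs and runs[-1][0] is None:
            runs[-1][0] = script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, part) for script, part in runs]

# English, then Hebrew, then Arabic, similar in spirit to the attached document.
sample = "Hello \u05e9\u05dc\u05d5\u05dd \u0645\u0631\u062d\u0628\u0627"
```

For the sample line this yields three runs (Latin, Hebrew, Arabic), whereas the actual result above corresponds to treating every non-Latin run as the single configured CTL language.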

Comment 6 Eyal Rozenberg 2018-09-30 17:58:02 UTC

Oh, I should mention I tested with:

Version: 6.2.0.0.alpha0+
Build ID: ad6adb1bfadf49af3187a0bb3ceffbf355e9eed1
CPU threads: 4; OS: Linux 4.9; UI render: default; VCL: gtk2; 
TinderBox: Linux-rpm_deb-x86_64@70-TDF, Branch:master, Time: 2018-09-29_02:45:20
Locale: en-US (en_IL); Calc: threaded

Comment 7 Eyal Rozenberg 2022-08-14 22:26:49 UTC

Still the same behavior with:

Version: 7.5.0.0.alpha0+ / LibreOffice Community
Build ID: 5c68399e6bea3aa18477487400f8bb143d6ed84e
CPU threads: 4; OS: Linux 5.18; UI render: default; VCL: gtk3
Locale: en-IL (en_IL); UI: en-US

Comment 8 Mike Kaganski 2022-10-20 04:44:02 UTC

See also: bug 139185 (and its See Also list) about language guessing problems of libexttextcat; see bug 139185 comment 4.

Comment 9 Buovjaga 2023-04-12 07:38:36 UTC

*** Bug 154495 has been marked as a duplicate of this bug. ***

Comment 10 Noel Grandin 2023-04-12 07:50:46 UTC

Accessing the current keyboard layout appears to be distro-specific, so we'd need to (a) get the current distro and then (b) implement a bunch of hacks to get the current keyboard layout.

FWIW localectl seems to be the most widely available command to extract this info.

Comment 11 Mike Kaganski 2023-04-12 07:53:28 UTC

Basically, this is a dupe of bug 108151, which also has some discussion about API availability, and an implementation of this for Qt5 from Jan-Marek.

Comment 12 Buovjaga 2023-04-12 08:25:10 UTC


*** This bug has been marked as a duplicate of bug 108151 ***

Comment 13 Eyal Rozenberg 2023-04-12 08:31:57 UTC

(In reply to Noel Grandin from comment #10)
> Accessing the current keyboard layout appears to be distro-specific

Isn't there some X-related mechanism/protocol for doing this?

> FWIW localectl seems to be the most widely available command to extract this
> info.

That's a systemd-based abomination, you definitely don't want LibreOffice to depend on systemd.

Comment 14 Buovjaga 2023-04-12 08:44:00 UTC

(In reply to Eyal Rozenberg from comment #13)
> (In reply to Noel Grandin from comment #10)
> > Accessing the current keyboard layout appears to be distro-specific
> 
> Isn't there some X-related mechanism/protocol for doing this?

Looks like it is also used on Wayland:
https://www.bestov.io/blog/all-you-need-to-know-about-kbd-keyboard-files

"while Wayland doesn't have any official way to handle keyboards and keymaps, XKB is what they suggest to use, and in particular, all of the Wayland implementations I've seen tend to use xkbcommon, a quite modern implementation which is reasonably compatible: it uses the same keyboard data distribution that comes with X11."

A verbose layout query would be:

setxkbmap -query -verbose 10
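
For illustration, a small Python sketch that parses the `key: value` lines such a query prints and maps the layout names to language tags. The sample output is typical of what `setxkbmap -query` prints but is written by hand here, and the `XKB_TO_LANG` table is a hypothetical mapping made up for this example. As far as I know, the query reports the configured layouts, not which one is currently active, so it would not be sufficient on its own.

```python
# Hypothetical mapping from XKB layout names to language tags.
XKB_TO_LANG = {"us": "en-US", "il": "he-IL", "ara": "ar"}

def parse_xkb_query(output):
    """Extract the 'key: value' fields from setxkbmap-style query output."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            fields[key.strip()] = value.strip()
    return fields

def configured_languages(output):
    """Map the comma-separated 'layout' field to language tags."""
    layouts = parse_xkb_query(output).get("layout", "")
    return [XKB_TO_LANG.get(name, "unknown") for name in layouts.split(",") if name]

# Hand-written sample output, used instead of running the command.
SAMPLE = """\
rules:      evdev
model:      pc105
layout:     us,il,ara
variant:    ,,
options:    grp:alt_shift_toggle
"""
```

Calling `configured_languages(SAMPLE)` returns one tag per configured layout; in real use the text would come from running the command, for example via `subprocess.run`.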

Comment 15 Mike Kaganski 2023-04-12 09:00:45 UTC

Please note that if you agree that this is a dupe, discussions that provide valuable info should be held in the main bug, to keep all the relevant bits together :)