Bug 106306 - RTL: Wrong text language detection for punctuation at the beginning of sentence
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 5.3.0.3 release
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
Blocks: RTL-CTL Language-Detection
Reported: 2017-03-03 20:43 UTC by Hossein
Modified: 2017-11-14 11:03 UTC (History)

Attachments
An example of wrong output with double and single quotation mark. (22.84 KB, image/png)
2017-03-03 20:45 UTC, Hossein
sample (7.99 KB, application/vnd.oasis.opendocument.text)
2017-10-30 19:19 UTC, Yousuf Philips (jay) (retired)

Description Hossein 2017-03-03 20:43:35 UTC
Description:
I am a Persian/Farsi user and usually work with Persian/Farsi documents, so I use the "Persian" locale in LibreOffice.
When I want to create an English document, I switch the keyboard to English, change the paragraph to left-to-right, and start typing. Looking closely at the status bar, I see Persian in the language section, and I have to type at least one character before English appears there. This may not seem like a problem at first, but it is: when I type a quotation mark, it uses the Persian/Farsi quotation mark and not the correct English one.

Steps to Reproduce:
1. Set locale to Persian in Options > Language Settings > Languages > Locale setting
2. Change the paragraph to LTR
3. Start typing something with single or double quotation mark like: "Test" or 'Test'

Actual Results:  
You will see that the first quotation mark is shown as « which is wrong.

Expected Results:
You should see the correct English quotation marks, double quotation mark " or single quotation mark '.


Reproducible: Always

User Profile Reset: No

Additional Info:
This seems to have been introduced in LibreOffice 5.3, which fixed a lot of text rendering issues. The locale problems are not confined to this. Choosing numeral shapes according to the context is also wrong in LibreOffice 5.3: if you set the locale to Persian, all the numerals are Hindi in a completely English, LTR Impress document.


User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0
Comment 1 Hossein 2017-03-03 20:45:05 UTC
Created attachment 131615
An example of wrong output with double and single quotation mark.
Comment 2 m.a.riosv 2017-03-04 01:36:32 UTC
What happens if you select 'English' as the font language for the character, or double-click the language in the status bar and select English?
Comment 3 QA Administrators 2017-09-29 08:57:42 UTC Comment hidden (obsolete)
Comment 4 Xisco Faulí 2017-10-30 10:51:21 UTC Comment hidden (obsolete)
Comment 5 Yousuf Philips (jay) (retired) 2017-10-30 19:19:09 UTC
Can repro it with the Arabic locale.

It treats the first single or double quote as if it is in the CTL language, and the second quote, typed after an English word, as the Latin language.

Version: 6.0.0.0.alpha1+
Build ID: 43d6b11a5c1dda0cc2c1e06c768eece25051a56c
CPU threads: 2; OS: Linux 4.4; UI render: default; VCL: gtk2; 
Locale: ar-AE (en_US.UTF-8); Calc: group
Comment 6 Yousuf Philips (jay) (retired) 2017-10-30 19:19:48 UTC
Created attachment 137380
sample
Comment 7 V Stuart Foote 2017-10-30 20:00:39 UTC
Isn't this a Unicode implementation issue?

Don't these transitions between scripts depend on our ICU library handling? Even so, they need additional boundary logic. Otherwise, as here, characters that Unicode does not assign to a specific script (punctuation, symbols, numbers) cause this type of issue at script transitions.

Is there a better way to detect/toggle word boundaries?

=-ref-=
[1] http://unicode.org/reports/tr29/#Word_Boundaries
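
To make the point about script-less characters concrete, here is a minimal Python sketch. It is not LibreOffice or ICU code. Python's standard library does not expose the Unicode Script property, so the sketch uses the bidirectional class as a stand-in (an assumption of this example), and `has_no_strong_direction` is a hypothetical helper name. It only shows that quotation marks and digits carry no strong class of their own, so something else has to decide which run they join.

```python
import unicodedata

# Bidi classes that identify a character as strongly LTR or RTL.
STRONG_BIDI_CLASSES = {"L", "R", "AL"}


def has_no_strong_direction(ch):
    """Return True if the character is weak or neutral (hypothetical helper)."""
    return unicodedata.bidirectional(ch) not in STRONG_BIDI_CLASSES


# Quotation marks, a guillemet, a digit, a Latin letter, and an Arabic letter.
for ch in ['"', "'", "\u00ab", "1", "T", "\u062a"]:
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
        f"bidi={unicodedata.bidirectional(ch)} "
        f"no_strong_direction={has_no_strong_direction(ch)}"
    )
```

Only the two letters report a strong class; the punctuation and the digit have to inherit their run from context.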
Comment 8 Hiunn-hué 2017-10-31 10:05:33 UTC Comment hidden (no-value)
Comment 9 Khaled Hosny 2017-10-31 14:02:53 UTC
(In reply to V Stuart Foote from comment #7)
> Isn't this a Unicode implementation issue?

AFAIK, no. The itemization of text into Western/CTL/Asian (that is, only three categories) is done by Writer and/or other LibreOffice internal code.

My guess is that it just uses the default language for common characters and then does not look back when it sees the first script-specific character.
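
The guess above can be illustrated with a small Python sketch. This is not the actual Writer itemization code; the function names, the two-way "western"/"ctl" split, and the use of the bidi class in place of a real script lookup are all assumptions made for illustration. `itemize_naive` models the suspected behaviour (a leading common character takes the default language), while `itemize_lookahead` models one possible alternative that resolves leading common characters from the first script-specific character.

```python
import unicodedata


def strong_class(ch):
    """Return 'western', 'ctl', or None for characters with no strong class.

    Simplification: the bidi class stands in for a real script lookup.
    """
    bidi = unicodedata.bidirectional(ch)
    if bidi == "L":
        return "western"
    if bidi in ("R", "AL"):
        return "ctl"
    return None


def itemize_naive(text, default):
    """Common characters inherit the previous class, or the default at the start."""
    result = []
    current = default
    for ch in text:
        cls = strong_class(ch)
        if cls is not None:
            current = cls
        result.append(current)
    return result


def itemize_lookahead(text, default):
    """Leading common characters take the class of the first strong character."""
    first_strong = next(
        (cls for cls in map(strong_class, text) if cls is not None), default
    )
    return itemize_naive(text, first_strong)


# With a CTL default, the naive pass puts the opening quote in the CTL run.
print(itemize_naive('"Test"', "ctl"))
print(itemize_lookahead('"Test"', "ctl"))
```

Under these assumptions the naive pass reproduces what comment 5 describes: the opening quote lands in the CTL run and the closing quote in the Western run. Note that a look-ahead cannot help while the quote is the only character typed so far, which matches the report that the status bar switches to English only after the first letter.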
Comment 10 Omer Zak 2017-11-14 11:03:27 UTC
Still happens in:

Version: 6.0.0.0.alpha1+
Build ID: 9050854c35c389466923f0224a36572d36cd471a
CPU threads: 8; OS: Linux 4.9; UI render: default; VCL: gtk3; 
Locale: en-US (en_US.utf8); Calc: group

OS: Debian 64bit Stretch (Debian 9.2, with some backported packages)