Bug 40292 - Tamil vowelless consonants should be treated as independent grapheme clusters
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version (earliest affected): 3.4.2 release
Hardware: All All
Importance: medium normal
Assignee: Caolán McNamara
URL:
Whiteboard: target:3.6.0
Keywords: regression
Depends on:
Blocks:
 
Reported: 2011-08-22 10:15 UTC by Shriramana Sharma
Modified: 2012-06-18 07:20 UTC
Description Shriramana Sharma 2011-08-22 10:15:11 UTC
I am using LibreOffice 3.3.3 on Kubuntu Natty and 3.4.2 on Windows XP. On both installations I observe the following bug:

What I did:

I typed the Tamil word சித்திரை (name of a month) into an LO application. I used the left- and right-cursor keys to navigate the word.

Expected behaviour:

The "correct" grapheme cluster split, as perceived by native users, is: சி|த்|தி|ரை (ci|t|ti|rai), because in Tamil vowelless consonants are considered independent grapheme clusters. So it should be possible to place the cursor at any of the positions indicated by |, especially between த் and தி.

Actual behaviour:

LO analyses the word as சி|த்தி|ரை (ci|t.ti|rai) and does not allow cursor placement between த் and தி. When my cursor is to the left of த் and I press the right-cursor key, the cursor moves to the right of தி, and vice versa with the left-cursor key.

Background:

Generally in Indic scripts like Devanagari, a vowelless consonant is taken into the same grapheme cluster as a following consonant, since ligatures or conjoining forms between the consonants usually occur in such cases.

For example, the same word presented in Devanagari would be analysed as चि|त्ति|रै (ci|t.ti|rai) since the vowelless "t" ligates (or in the absence of ligature in a font, takes a conjoining form) with the following consonant. 

However in Tamil, vowelless consonants never ligate or form conjoining forms with following consonants. (The only exception is க் ligating with ஷ to form க்ஷ.) 

Vowelless consonants in Tamil are therefore always perceived by native users as grapheme clusters on their own, and a native user expects to be able to place a cursor immediately before or after a vowelless consonant, so: சி|த்|தி|ரை
 
Other applications (for example Firefox, which I am using now to report this bug) correctly treat த் and தி as separate grapheme clusters in the word சித்திரை.

Note:

This faulty behaviour is seen for all CONSONANT + VIRAMA + CONSONANT sequences in Tamil in LibreOffice. The one obvious exception is க்ஷ i.e. KA + VIRAMA + SSA where the ligature is formed and so the native user does *not* expect to place the cursor in the middle of it.
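
The expected split described above can be sketched in a few lines of Python. This is only an illustration of the behaviour requested in this report, not LibreOffice's break iterator: the function name `tamil_clusters` and the simplified character ranges are assumptions made for the example, and anything outside the basic letter + sign pattern falls through as a single character.

```python
import re

# Simplified model (an assumption for illustration, not the real rule file):
#   letters      U+0B85..U+0BB9  (independent vowels and consonants)
#   signs        U+0BBE..U+0BCD  (vowel signs, plus the virama U+0BCD)
CLUSTER = re.compile(
    "\u0B95\u0BCD\u0BB7[\u0BBE-\u0BCD]?"   # KA + VIRAMA + SSA ligature (க்ஷ) stays whole
    "|[\u0B85-\u0BB9][\u0BBE-\u0BCD]?"     # a letter with an optional vowel sign or virama
    "|.",                                   # anything else: one character per cluster
    re.DOTALL,
)


def tamil_clusters(text: str) -> list[str]:
    """Split text into the clusters a native user expects, per this report."""
    return CLUSTER.findall(text)


word = "\u0B9A\u0BBF\u0BA4\u0BCD\u0BA4\u0BBF\u0BB0\u0BC8"  # சித்திரை
print("|".join(tamil_clusters(word)))                       # சி|த்|தி|ரை
print(tamil_clusters("\u0B95\u0BCD\u0BB7"))                 # க்ஷ is kept as one cluster
```

The ligature alternative is listed first so that KA + VIRAMA + SSA is consumed as a unit before the generic "consonant + virama" alternative can split it.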
Comment 1 sasha.libreoffice 2012-04-11 06:02:02 UTC
Thanks for the bug report.
Reproduced with text copy-pasted from here in 3.5.2 on Fedora 64-bit,
but not reproducible in 3.3.4, so this is a regression.
Comment 2 sasha.libreoffice 2012-04-11 06:04:46 UTC
@ Caolan
What do you think about this bug?
Comment 3 Caolán McNamara 2012-04-11 06:35:52 UTC
i18npool/source/breakiterator/data/char_in.txt might need some sort of adjustment (or it's just wrong), and i18npool/qa/cppunit/test_breakiterator.cxx should be updated with these examples.

The rule we're apparently following is...

$TamilLetter = [\u0B85-\u0BB9];
$TamilSignVirama = \u0BCD;
$TamilLetter ($TamilSignVirama $TamilLetter?)+;

probably needs a bit more head-scratching to get that rule right
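
To see why that rule glues the clusters together, it can be mimicked with a Python regular expression. This is a rough sketch under stated assumptions: Python's `re` module is not the rule engine that reads char_in.txt, only this single rule is transcribed, and the other rules in that file (for example whatever attaches vowel signs) are ignored. The constant names are made up for the example.

```python
import re

# Transcription of the quoted rule into Python regex syntax (illustrative only).
TAMIL_LETTER = "[\u0B85-\u0BB9]"
TAMIL_SIGN_VIRAMA = "\u0BCD"
OLD_RULE = re.compile(f"{TAMIL_LETTER}(?:{TAMIL_SIGN_VIRAMA}{TAMIL_LETTER}?)+")

word = "\u0B9A\u0BBF\u0BA4\u0BCD\u0BA4\u0BBF\u0BB0\u0BC8"  # சித்திரை
matches = OLD_RULE.findall(word)

# The rule swallows TA + VIRAMA + TA as one unit, so no boundary
# can fall between த் and the following த.
for match in matches:
    print([f"U+{ord(ch):04X}" for ch in match])
```

Because the optional `$TamilLetter?` after the virama is greedy, the consonant that follows a vowelless consonant is always pulled into the same match, which matches the behaviour reported in the description.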
Comment 4 Not Assigned 2012-04-12 01:49:49 UTC
Caolan McNamara committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=16cd97480d0681d37f86e89366e1f9964ec16ef8

Resolves: fdo#40292 Tamil grapheme cluster rules
Comment 5 Caolán McNamara 2012-04-12 01:54:57 UTC
I'll freely admit I'm no expert here, but the provided examples now apparently do the expected thing. 

You will be able to find daily builds at e.g. http://dev-builds.libreoffice.org/daily/Win-x86@6-fast/master/current/ (for windows) and http://dev-builds.libreoffice.org/daily/Linux-x86_10-Release_Configuration/master/current/ (for linux) tomorrow (or the day after) which should include this fix for testing
Comment 6 sasha.libreoffice 2012-04-12 03:22:19 UTC
Thanks!
Comment 7 Shriramana Sharma 2012-06-17 20:19:00 UTC
Now confirmed fixed on latest trunk viz http://dev-builds.libreoffice.org/daily/Win-x86@6-fast/master/current/master~2012-06-14_22.09.53_LibO-Dev_3.7.0alpha0_Win_x86_install_en-US.msi

BTW it doesn't seem to be fixed in the LO 3.5 series. I'm using LO 3.5.3 on Kubuntu Precise where this problem still exists. 

BTW don't you think that the rule in your commit http://cgit.freedesktop.org/libreoffice/core/commit/?id=16cd97480d0681d37f86e89366e1f9964ec16ef8:

+$TamilSsa $TamilSignVirama $TamilKa;

should be:

+$TamilKa $TamilSignVirama $TamilSsa;

See the other rules in your commit.
Comment 8 Caolán McNamara 2012-06-18 07:20:24 UTC
Re the versions it's in: see the "Whiteboard" field above, which is supposed to keep track of which versions a fix was committed to. I only committed to "master" and didn't backport to 3.5.X.

re the rules:
+$TamilKa $TamilSignVirama $TamilSsa; <-this one is for going forwards
... 
+$TamilSsa $TamilSignVirama $TamilKa; <-this one is for going backwards
so I think this is what we want here
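
A small Python illustration of why the two rules mirror each other, assuming (as stated above) that the backward rule is matched while walking the text from right to left. The helper `names` is hypothetical and exists only for this example.

```python
import unicodedata

KA, VIRAMA, SSA = "\u0B95", "\u0BCD", "\u0BB7"

forward_sequence = KA + VIRAMA + SSA        # க்ஷ as stored in the document
backward_sequence = forward_sequence[::-1]  # same code points, visited right to left


def names(text: str) -> list[str]:
    """Return the Unicode character names of each code point in text."""
    return [unicodedata.name(ch) for ch in text]


print(names(forward_sequence))   # order seen when moving forwards
print(names(backward_sequence))  # order seen when moving backwards
```

Moving backwards, the iterator meets SSA first, then the virama, then KA, which is the order written in the second rule.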