I am using LibreOffice 3.3.3 on Kubuntu Natty and 3.4.2 on Windows XP. On both installations I observe the following bug:
What I did:
I typed the Tamil word சித்திரை (name of a month) into an LO application. I used the left- and right-cursor keys to navigate the word.
The "correct" grapheme cluster split, as perceived by native users, is சி|த்|தி|ரை (ci|t|ti|rai), because in Tamil vowelless consonants are considered independent grapheme clusters. So it should be possible to place the cursor at any of the positions marked by |, in particular between த் and தி.
LO analyses the word as சி|த்தி|ரை (ci|t.ti|rai) and does not allow cursor placement between the த் and தி. When my cursor is to the left of த் and I press the right-cursor key, the cursor moves to the right of தி, and vice versa with left-cursor key.
Generally in Indic scripts like Devanagari, a vowelless consonant would be taken into the same grapheme cluster as a following consonant, as mostly ligatures or conjoining forms between the consonants will occur in such cases.
For example, the same word presented in Devanagari would be analysed as चि|त्ति|रै (ci|t.ti|rai) since the vowelless "t" ligates (or in the absence of ligature in a font, takes a conjoining form) with the following consonant.
However in Tamil, vowelless consonants never ligate or form conjoining forms with following consonants. (The only exception is க் ligating with ஷ to form க்ஷ.)
Vowelless consonants in Tamil are therefore always perceived by native users as grapheme clusters of their own, and a native user expects to be able to place the cursor immediately before or after a vowelless consonant: சி|த்|தி|ரை
Other applications (for example Firefox, which I am using now to report this bug) correctly treat த் and தி as separate grapheme clusters in the word சித்திரை.
This faulty behaviour is seen for all CONSONANT + VIRAMA + CONSONANT sequences in Tamil in LibreOffice. The one obvious exception is க்ஷ i.e. KA + VIRAMA + SSA where the ligature is formed and so the native user does *not* expect to place the cursor in the middle of it.
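To make the expected behaviour concrete, here is a minimal Python sketch of the segmentation described above. It is not LibreOffice's code or rule set: the pattern and the name `expected_clusters` are hypothetical, and the handling of dependent signs is simplified (every sign in U+0BBE-U+0BCD plus U+0BD7 is assumed to attach to the preceding letter).

```python
import re

# Hypothetical pattern illustrating the reporter's expectation:
#  1. KA + VIRAMA + SSA (plus any following signs) stays whole (the ligature case)
#  2. otherwise a cluster is one base letter plus its dependent signs,
#     so a vowelless consonant (letter + virama) is a cluster of its own
#  3. any other character stands alone
EXPECTED = re.compile(
    "\u0B95\u0BCD\u0BB7[\u0BBE-\u0BCD\u0BD7]*"
    "|[\u0B85-\u0BB9][\u0BBE-\u0BCD\u0BD7]*"
    "|.",
    re.DOTALL,
)

def expected_clusters(text):
    """Split text into the clusters a native Tamil user would expect."""
    return EXPECTED.findall(text)

# The word from the report, written with escapes to avoid copy-paste damage:
# CA, I, TA, VIRAMA, TA, I, RA, AI
WORD = "\u0B9A\u0BBF\u0BA4\u0BCD\u0BA4\u0BBF\u0BB0\u0BC8"
print("|".join(expected_clusters(WORD)))  # prints சி|த்|தி|ரை
```

With this pattern, a sequence such as PA + KA + VIRAMA + SSA + I comes out as two clusters, with the ligature and its vowel sign kept together.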
Thanks for the bug report.
Reproduced with text copy-pasted from here in 3.5.2 on Fedora (64-bit),
but not reproducible in 3.3.4, so this is a regression.
What do you think about this bug?
i18npool/source/breakiterator/data/char_in.txt might need some sort of adjustment (or it's just wrong), and i18npool/qa/cppunit/test_breakiterator.cxx should be updated with these examples.
The rule we're apparently following is...
$TamilLetter = [\u0B85-\u0BB9];
$TamilSignVirama = \u0BCD;
$TamilLetter ($TamilSignVirama $TamilLetter?)+;
It probably needs a bit more head-scratching to get that rule right.
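As a rough illustration of why that rule produces the reported behaviour, here is a Python regex that imitates it. This is only a sketch: the real rule lives in an ICU break-iterator rule file, not in a regex, and the trailing-marks part and the fallback alternatives are my assumptions about how the rest of the rule set attaches vowel signs.

```python
import re

LETTER = "[\u0B85-\u0BB9]"          # $TamilLetter
VIRAMA = "\u0BCD"                   # $TamilSignVirama
MARKS = "[\u0BBE-\u0BCC\u0BD7]*"    # assumption: trailing vowel signs join the cluster

QUOTED_RULE = re.compile(
    # $TamilLetter ($TamilSignVirama $TamilLetter?)+ followed by vowel signs
    f"{LETTER}(?:{VIRAMA}{LETTER}?)+{MARKS}"
    # assumed fallback: a letter with its own dependent signs
    f"|{LETTER}[\u0BBE-\u0BCD\u0BD7]*"
    # anything else stands alone
    "|.",
    re.DOTALL,
)

def clusters_per_quoted_rule(text):
    """Split text the way the quoted rule (as imitated here) would."""
    return QUOTED_RULE.findall(text)

# CA, I, TA, VIRAMA, TA, I, RA, AI
WORD = "\u0B9A\u0BBF\u0BA4\u0BCD\u0BA4\u0BBF\u0BB0\u0BC8"
print("|".join(clusters_per_quoted_rule(WORD)))  # prints சி|த்தி|ரை
```

The `(VIRAMA LETTER?)+` part is what swallows the consonant after the virama, which matches the சி|த்தி|ரை split described in the report.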
Caolan McNamara committed a patch related to this issue.
It has been pushed to "master":
Resolves: fdo#40292 Tamil grapheme cluster rules
I'll freely admit I'm no expert here, but the provided examples now apparently do the expected thing.
You will be able to find daily builds at e.g. http://dev-builds.libreoffice.org/daily/Win-x86@6-fast/master/current/ (for Windows) and http://dev-builds.libreoffice.org/daily/Linux-x86_10-Release_Configuration/master/current/ (for Linux) tomorrow (or the day after), which should include this fix for testing.
Now confirmed fixed on latest trunk viz http://dev-builds.libreoffice.org/daily/Win-x86@6-fast/master/current/master~2012-06-14_22.09.53_LibO-Dev_3.7.0alpha0_Win_x86_install_en-US.msi
BTW it doesn't seem to be fixed in the LO 3.5 series. I'm using LO 3.5.3 on Kubuntu Precise where this problem still exists.
BTW don't you think that the rule in your commit http://cgit.freedesktop.org/libreoffice/core/commit/?id=16cd97480d0681d37f86e89366e1f9964ec16ef8:
+$TamilSsa $TamilSignVirama $TamilKa;
+$TamilKa $TamilSignVirama $TamilSsa;
See the other rules in your commit.
Re the versions it's in: see the "whiteboard" section above, which is supposed to keep track of which versions a fix was committed to. I only committed to "master" and didn't backport to 3.5.X.
Re the rules:
+$TamilKa $TamilSignVirama $TamilSsa; <-this one is for going forwards
+$TamilSsa $TamilSignVirama $TamilKa; <-this one is for going backwards
so I think this is what we want here
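A small Python sketch of the forwards/backwards point, as I read the comment above: when the text is scanned right to left, the characters of the ligature are met in the order SSA, VIRAMA, KA, so the backward rule lists them reversed. The helper names are hypothetical and this is not how the break iterator is actually implemented.

```python
KA, VIRAMA, SSA = "\u0B95", "\u0BCD", "\u0BB7"

FORWARD_RULE = (KA, VIRAMA, SSA)    # order met when scanning left to right
BACKWARD_RULE = (SSA, VIRAMA, KA)   # order met when scanning right to left

def matches_forward(text, start):
    """True if the ligature sequence begins at index `start`."""
    return tuple(text[start:start + 3]) == FORWARD_RULE

def matches_backward(text, end):
    """True if the ligature sequence ends just before index `end`.

    Characters are collected in the order a right-to-left scan meets them.
    """
    seen = tuple(text[i] for i in range(end - 1, max(end - 4, -1), -1))
    return seen == BACKWARD_RULE

# PA, KA, VIRAMA, SSA, I
text = "\u0BAA\u0B95\u0BCD\u0BB7\u0BBF"
print(matches_forward(text, 1), matches_backward(text, 4))  # prints True True
```

Both checks identify the same three characters; only the order in which they are visited differs.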