Bug 104195 - Hunspell can't handle specific character in Guarani 'g̃'
Status: RESOLVED NOTOURBUG
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Linguistic
Version: 5.3.0.0.beta1 (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
Reported: 2016-11-27 08:57 UTC by Olivier Hallot
Modified: 2016-11-28 08:46 UTC (History)

Attachments
Linux Libertine G has got better combined diacritics support, than Times New Roman (31.32 KB, image/png)
2016-11-28 08:44 UTC, László Németh

Description Olivier Hallot 2016-11-27 08:57:51 UTC
Hunspell can't locate words with letter g̃

hunspell -d gug
Hunspell 1.3.3
aga
& aga 1 0: anga

Should find ág̃a

We have always had issues with g̃, even when we created the Guarani keyboard.

Seems similar to bug#39275
Comment 1 Giovanni Caligaris 2016-11-27 12:14:54 UTC
LibreOffice Writer/Calc/Impress have the same issue. g̃ shows up as g~
Comment 2 Adolfo Jayme 2016-11-27 17:35:28 UTC
Note that Guarani’s nasal g is not a Unicode precomposed character — it’s a combination of a “g” plus U+0303 (combining tilde). Could that be the problem here?
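
The point about precomposed characters can be checked with a short Python sketch using the standard `unicodedata` module. The helper name `describe` is made up for this illustration. It shows that NFC normalization folds "a" plus a combining acute into a single code point, while "g" plus U+0303 stays as two code points because Unicode has no precomposed form for it:

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


nasal_g = "g\u0303"  # g + COMBINING TILDE (Guarani nasal g)
a_acute = "a\u0301"  # a + COMBINING ACUTE ACCENT

# NFC composes a sequence only when a precomposed code point exists.
nfc_g = unicodedata.normalize("NFC", nasal_g)
nfc_a = unicodedata.normalize("NFC", a_acute)

print(len(nfc_g), describe(nfc_g))  # still two code points
print(len(nfc_a), describe(nfc_a))  # one code point, U+00E1
```

Any tool that treats U+0303 as a non-word character will therefore split a word at g̃, whatever normalization is applied first.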

BTW, Hunspell’s bug tracker is https://github.com/hunspell/hunspell/issues
Comment 3 László Németh 2016-11-28 08:40:15 UTC
Command line Hunspell word tokenization differs from the LibreOffice break iterator. Hunspell in LibreOffice can handle such combined Unicode characters well, you only need to use UTF-8 encoded aff and dic files:

------ gug.aff ------
SET UTF-8 
.....

# for suggestions with correct combined diacritics:

MAP 2
MAP aá
MAP g(g̃)


-------  gug.dic -----
100000
ág̃a

(If both precomposed and combined diacritics are common for the given language, you need the canonical form 


See also the hunspell(4) manual page, for example:

       Use parenthesized groups for character sequences (eg. for composed
       Unicode characters):

              MAP 3
              MAP ß(ss)  (character sequence)
              MAP ﬁ(fi)  ("fi" compatibility characters for Unicode fi ligature)
              MAP (ọ́)o   (composed Unicode character: ó with bottom dot)
Comment 4 László Németh 2016-11-28 08:44:10 UTC
Created attachment 129061 [details]
Linux Libertine G has got better combined diacritics support, than Times New Roman
Comment 5 László Németh 2016-11-28 08:46:05 UTC
(Sorry, the end of the previous sentence:)

If both precomposed and combined diacritics are common in the given language, you can store the precomposed (canonical?) form in the dictionary and use the ICONV command to convert combined input to the precomposed form. If needed, the OCONV command can convert the suggestions back to combined characters.
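
The idea of converting on input and on output can be sketched in Python. This is only an analogy, not how Hunspell works internally: ICONV and OCONV are explicit replacement tables in the aff file, whereas the sketch uses Unicode normalization (NFC/NFD) as a stand-in. The one-word dictionary and all function names are hypothetical.

```python
import unicodedata

# Hypothetical one-word dictionary, stored in precomposed (NFC) form.
# Note that g + U+0303 has no precomposed form, so it stays as two code points.
DICTIONARY = {unicodedata.normalize("NFC", "\u00e1g\u0303a")}


def to_dictionary_form(word):
    """Input conversion (the role ICONV plays): combined -> precomposed."""
    return unicodedata.normalize("NFC", word)


def to_output_form(word):
    """Output conversion (the role OCONV plays): precomposed -> combined."""
    return unicodedata.normalize("NFD", word)


def is_known(word):
    """Look the word up after converting it to the dictionary's form."""
    return to_dictionary_form(word) in DICTIONARY


typed = "a\u0301g\u0303a"  # user typed every diacritic as a combining mark
print(is_known(typed))     # found, because input is folded to NFC first
print(is_known("aga"))     # not in the dictionary
print(to_output_form("\u00e1g\u0303a") == typed)
```

Storing a single form in the dictionary and converting at the edges keeps the word list free of duplicate spellings that differ only in encoding.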


Note: LibreOffice's layout has good support for combining diacritics with a few fonts, for example Linux Libertine G; see the attached screenshot.