104195 – Hunspell can't handle specific character in Guarani 'g̃'

Status:	RESOLVED NOTOURBUG

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Linguistic
Version: (earliest affected)	5.3.0.0.beta1
Hardware:	All All

Importance:	medium normal
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2016-11-27 08:57 UTC by Olivier Hallot
Modified:	2016-11-28 08:46 UTC
CC List:	3 users

See Also:	39275
Crash report or crash signature:

Attachments
Linux Libertine G has got better combined diacritics support, than Times New Roman (31.32 KB, image/png) 2016-11-28 08:44 UTC, László Németh

Description Olivier Hallot 2016-11-27 08:57:51 UTC

Hunspell can't locate words containing the letter g̃:

hunspell -d gug
Hunspell 1.3.3
aga
& aga 1 0: anga

Should find ág̃a

We have always had issues with g̃, even when we created the Guarani keyboard.

Seems similar to bug#39275

Comment 1 Giovanni Caligaris 2016-11-27 12:14:54 UTC

LibreOffice Writer/Calc/Impress have the same issue. g̃ shows up as g~

Comment 2 Adolfo Jayme Barrientos 2016-11-27 17:35:28 UTC

Note that Guarani’s nasal g is not a Unicode precomposed character — it’s a combination of a “g” plus U+0303 (combining tilde). Could that be the problem here?

BTW, Hunspell’s bug tracker is https://github.com/hunspell/hunspell/issues
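
The point about g̃ not being precomposed can be checked with a minimal Python sketch. It uses only the standard `unicodedata` module; the helper name `describe` is made up for this illustration. It compares g̃ with á, which does have a precomposed code point (U+00E1):

```python
import unicodedata

nasal_g = "g\u0303"  # "g" followed by U+0303 COMBINING TILDE
a_acute = "a\u0301"  # "a" followed by U+0301 COMBINING ACUTE ACCENT


def describe(text):
    """Return (code points as given, code points after NFC normalization)."""
    nfc = unicodedata.normalize("NFC", text)
    return len(text), len(nfc)


print(describe(nasal_g))  # g̃ has no precomposed form, so NFC leaves two code points
print(describe(a_acute))  # á composes into the single code point U+00E1
```

Any tool that treats U+0303 as a separate character, or as a word boundary, will therefore see "g" and "~" instead of one letter.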

Comment 3 László Németh 2016-11-28 08:40:15 UTC

The command-line Hunspell tokenizes words differently from the LibreOffice break iterator. Hunspell in LibreOffice handles such combined Unicode characters well; you only need to use UTF-8 encoded aff and dic files:

------ gug.aff ------
SET UTF-8 
.....

# for suggestions with correct combined diacritics:

MAP 2
MAP aá
MAP g(g̃)


-------  gug.dic -----
100000
ág̃a

(If both precomposed and combined diacritics are common for the given language, you need the canonical form 


See also the hunspell(4) manual page, for example:

       Use parenthesized groups for character sequences (eg. for composed Unicode
       characters):

              MAP 3
              MAP ß(ss)  (character sequence)
              MAP ﬁ(fi)  ("fi" compatibility characters for Unicode fi ligature)
              MAP (ọ́)o   (composed Unicode character: ó with bottom dot)
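
To illustrate why the parentheses matter for g̃, here is a rough Python sketch of how a MAP entry breaks into units. It is not Hunspell's own parser: `split_map_entry` is a hypothetical helper that only mirrors the rule quoted above (a parenthesized group is one unit, any other character is a unit by itself).

```python
import re


def split_map_entry(entry):
    """Split the argument of a MAP line into its units (illustrative only)."""
    units = []
    # Either a parenthesized group or one single character outside parentheses.
    for group, single in re.findall(r"\(([^)]*)\)|(.)", entry):
        units.append(group or single)
    return units


# With parentheses, "g" + U+0303 is kept together as one related unit.
print(split_map_entry("g(g\u0303)"))
# Without them, the combining tilde would be treated as a unit of its own.
print(split_map_entry("gg\u0303"))
```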

Comment 4 László Németh 2016-11-28 08:44:10 UTC

Created attachment 129061
Linux Libertine G has got better combined diacritics support, than Times New Roman

Comment 5 László Németh 2016-11-28 08:46:05 UTC

(Sorry, the end of the previous sentence:)

If both precomposed and combined diacritics are common in the given language, you can store the precomposed (canonical?) form in the dictionary, use the ICONV command to convert combined input to the precomposed form and, if needed, the OCONV command to convert the suggestions back to combined characters.
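
As a sketch of what such conversion tables might look like, the Python below prints ICONV and OCONV lines for a few precomposed letters. The helper name `conv_tables` and the letter list (á, ã, ñ) are assumptions made for illustration, not the contents of the real gug.aff. The output follows the aff layout of a count line followed by one "pattern replacement" line per entry. Since g̃ has no precomposed form, it needs no entry and stays combined in the dictionary.

```python
import unicodedata


def conv_tables(precomposed_letters):
    """Build ICONV/OCONV lines for letters that have a combined spelling."""
    pairs = []
    for ch in precomposed_letters:
        decomposed = unicodedata.normalize("NFD", ch)
        if decomposed != ch:  # skip letters that do not decompose
            pairs.append((decomposed, ch))
    # ICONV: combined input -> precomposed form used in the dictionary.
    iconv = [f"ICONV {len(pairs)}"] + [f"ICONV {d} {p}" for d, p in pairs]
    # OCONV: precomposed suggestions -> combined form shown to the user.
    oconv = [f"OCONV {len(pairs)}"] + [f"OCONV {p} {d}" for d, p in pairs]
    return iconv, oconv


# Assumed sample letters: á (U+00E1), ã (U+00E3), ñ (U+00F1).
iconv, oconv = conv_tables("\u00e1\u00e3\u00f1")
print("\n".join(iconv + oconv))
```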


Note: LibreOffice layout has good combining diacritics support with a few fonts, for example Linux Libertine G; see the attached screenshot.