Bug 107769 - spell checking should normalize data first
Summary: spell checking should normalize data first
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Linguistic (show other bugs)
Version: 5.4.0.0.alpha1+ (earliest affected)
Hardware: All
OS: All
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: Spell-Checking
Reported: 2017-05-11 11:05 UTC by martin_hosken
Modified: 2017-05-12 17:49 UTC

See Also:
Crash report or crash signature:


Attachments

Description martin_hosken 2017-05-11 11:05:45 UTC
Words to be spell checked should be converted to NFKC first so that spell checking dictionaries don't need to hold all forms (NFD, NFC, mixed) of a word.
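
For illustration only (not LibreOffice code), here is a minimal standalone ICU sketch of why this helps: the precomposed (NFC) and decomposed (NFD) spellings of the same word collapse to a single NFKC form, so the dictionary only needs to store that one form. The sample word is made up for the example.

// demo_nfkc.cxx -- standalone illustration, not part of the proposed patch.
#include <unicode/normlzr.h>
#include <unicode/unorm.h>
#include <cassert>

int main()
{
    // "café" precomposed (U+00E9) vs. decomposed ("e" + combining U+0301),
    // given here as UTF-8 byte escapes.
    icu::UnicodeString nfc = icu::UnicodeString::fromUTF8("caf\xC3\xA9");
    icu::UnicodeString nfd = icu::UnicodeString::fromUTF8("cafe\xCC\x81");

    UErrorCode status = U_ZERO_ERROR;
    icu::UnicodeString outNfc, outNfd;
    icu::Normalizer::normalize(nfc, UNORM_NFKC, 0, outNfc, status);
    icu::Normalizer::normalize(nfd, UNORM_NFKC, 0, outNfd, status);

    // Both inputs normalize to the identical string.
    assert(U_SUCCESS(status) && outNfc == outNfd);
    return 0;
}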

I'm going to sketch my thoughts on how to do it here in case I can't get back to the bug for a while. Anyone want to take it further?

In SpellChecker::GetSpellFailure in lingucomponent/source/spellcheck/spell/sspellimp.cxx, rather than building nWord with the current hand-rolled, poor man's normalization, start with an nWord created something like this:

// Needs the ICU headers <unicode/normlzr.h> and <unicode/unorm.h>.
icu::UnicodeString rIn(reinterpret_cast<const UChar *>(rWord.getStr()), rWord.getLength());
icu::UnicodeString normal;
UErrorCode rCode = U_ZERO_ERROR;
icu::Normalizer::normalize(rIn, UNORM_NFKC, 0, normal, rCode);
// Fall back to the unnormalized word if normalization fails.
OUString nWord(U_SUCCESS(rCode)
    ? OUString(reinterpret_cast<const sal_Unicode *>(normal.getBuffer()), normal.length())
    : rWord);

then use nWord instead of rWord for the rest of the function.

Need to find a test for this.
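
A rough idea of what such a test could assert, assuming the normalization is factored out into a helper. The helper name lcl_NormalizeToNFKC and the test method are invented purely for illustration, and the CppUnit fixture wiring is omitted.

// Sketch only: lcl_NormalizeToNFKC() is a hypothetical helper wrapping the
// ICU call above; the surrounding test class and registration are omitted.
void testNfcNfdEquivalence()
{
    // U+00E9 (precomposed) and "e" + U+0301 (decomposed) must end up as the
    // same word before it is handed to the spelling dictionary.
    OUString aNfc(u"caf\u00e9");
    OUString aNfd(u"cafe\u0301");
    CPPUNIT_ASSERT_EQUAL(lcl_NormalizeToNFKC(aNfc), lcl_NormalizeToNFKC(aNfd));
}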
Comment 1 Buovjaga 2017-05-12 17:49:58 UTC
Ok -> NEW