116835 – Find & Replace: default value of "Diacritic-sensitive" (on/off) should be locale-specific

Status:	RESOLVED INVALID

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	LibreOffice
Version: (earliest affected)	Inherited From OOo
Hardware:	All All

Importance:	medium enhancement
Assignee:	Not Assigned

Blocks:	Find&Replace-Dialog

Reported:	2018-04-05 20:59 UTC by Mihkel Tõnnov
Modified:	2018-04-11 22:46 UTC
CC List:	5 users

See Also:	111846 115829

Description Mihkel Tõnnov 2018-04-05 20:59:34 UTC

This is a spin-off from bug 111846.

Eike Rathke wrote at bug 111846 comment #27:
> ... also the default presets were wrongly chosen, clearly one normally does
> not want to ignore diacritics.

Khaled Hosny wrote at bug 111846 comment #28:
> Not in Arabic or in languages where diacritics are not parts of the letters
> (in Arabic خالد and خَالِدْ are the same word). It is just like
> case-insensitive search being the default.

As there are quite big differences in the expected behaviour of letters with diacritics/accents/dots/etc., even among languages that are written in Latin script, could we have a per-language/locale default setting for the "Diacritic-sensitive" search option?

For languages like English, French, German, etc., the "Diacritic-sensitive" checkbox should be off by default, since ä, ö, ü, é/ê/è/ë etc. are considered variations of the "base" letter in these languages (so a/ä, o/ö, etc. are also collated together in dictionaries).

For languages like Estonian, Finnish, Icelandic, Swedish, Latvian, Hungarian, Polish, etc., "Diacritic-sensitive" should be on by default, as there the (native) letters with "diacritics" (ä, å, á, ā, etc.) are considered separate letters in their own right, and therefore they shouldn't be ignored/merged with the "base" letter during searching, at least not by default. Some accented letters that occur in loanwords or foreign names might be considered variations of the base letter in these languages too, but that's no reason to default to disabling diacritic-sensitivity. The default setting should reflect the most common usage.

For languages like Lithuanian, it's probably better to also have "Diacritic-sensitive" on by default: there, e.g. ą is considered an independent letter, while ã/à are considered variants of a (used mainly in dictionaries to indicate stress/length).
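
The per-language default requested above could be pictured as a simple lookup table. The following is only a sketch of the idea as stated in this comment, not anything that exists in LibreOffice: the table name, the function name and the fallback value are all made up for illustration, and the off/on grouping just mirrors the three paragraphs above (the grouping of German and French is disputed later in this report).

```python
# Hypothetical table: language code -> default state of "Diacritic-sensitive".
# The values only restate the grouping proposed in this comment.
DIACRITIC_SENSITIVE_BY_LANGUAGE = {
    "en": False, "fr": False, "de": False,            # proposed: off by default
    "et": True, "fi": True, "is": True, "sv": True,   # proposed: on by default
    "lv": True, "hu": True, "pl": True,
    "lt": True,                                       # proposed: probably on
}


def default_diacritic_sensitive(locale_tag: str, fallback: bool = True) -> bool:
    """Return the proposed default for a tag such as 'et-EE' or 'en_US'.

    `fallback` is an assumption for languages the comment does not mention.
    """
    # Keep only the language part of the tag, e.g. "et-EE" -> "et".
    language = locale_tag.replace("_", "-").split("-")[0].lower()
    return DIACRITIC_SENSITIVE_BY_LANGUAGE.get(language, fallback)


if __name__ == "__main__":
    for tag in ("et-EE", "sv-SE", "en_US", "lt-LT"):
        print(tag, default_diacritic_sensitive(tag))
```

A table like this only answers which default to use for one language; it says nothing about which language the searched text is actually in.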

Comment 1 V Stuart Foote 2018-04-06 04:24:35 UTC

No, that is not a correct understanding of the function of "Diacritic-sensitive" or "Kashida-sensitive" transliteration for search within LibreOffice.

The transliterations needed to ignore diacritics or kashida, along with the other CJK Unicode glyph conversions, carry a lot of overhead. Frankly, any of them kills search performance compared to a "sensitive" mode, i.e. one with no transliteration of text strings (see bug 115829, a duplicate of bug 116242).

Setting the default, and its impact on other search/replace functions, is going to involve more than just the locale and script in use: it also depends on the per-user configuration of Tools -> Options -> Language support and on the other settings in the Find & Replace dialog.

No doubt it could be done by locale and script, but not clear there is a need as now corrected defaults are reasonable.

Comment 2 Mihkel Tõnnov 2018-04-06 07:23:50 UTC

(In reply to V Stuart Foote from comment #1)
> No, that is not a correct understanding of the function of
> "Diacritic-sensitive" ...

How is it not? That's precisely how search works at the moment. If it's not meant to, then there's a bug in the implementation.

[ ] Diacritic-sensitive
Searching for "lääs" matches also "laas" and vice versa (but it shouldn't in Estonian),
"rad" matches also "råd" and vice versa (but it shouldn't in Swedish),
etc.

[x] Diacritic-sensitive
"lääs" only matches "lääs" and "laas" only matches "laas" (as expected in Estonian),
"rad" only matches "rad" and "råd" only matches "råd" (as expected in Swedish),
etc.
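
The two behaviours listed above can be reproduced with a small sketch. It is not LibreOffice's transliteration code: it assumes that ignoring diacritics can be approximated by decomposing the text (Unicode NFD) and dropping combining marks, and the function names are made up for illustration.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD and drop combining marks, so 'ä' becomes 'a'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(needle: str, haystack: str, diacritic_sensitive: bool) -> bool:
    """Substring search with the checkbox state passed as a flag."""
    # Normalise first so precomposed and decomposed input compare equal.
    needle = unicodedata.normalize("NFC", needle)
    haystack = unicodedata.normalize("NFC", haystack)
    if not diacritic_sensitive:
        needle = strip_diacritics(needle)
        haystack = strip_diacritics(haystack)
    return needle in haystack


if __name__ == "__main__":
    # [ ] Diacritic-sensitive: "lääs" and "laas" match each other.
    print(matches("lääs", "laas", diacritic_sensitive=False))  # True
    print(matches("rad", "råd", diacritic_sensitive=False))    # True
    # [x] Diacritic-sensitive: only exact matches.
    print(matches("lääs", "laas", diacritic_sensitive=True))   # False
    print(matches("råd", "råd", diacritic_sensitive=True))     # True
```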

Kashida-sensitivity is something else entirely (in usage, if not in implementation), so there's no need to include it in the discussion of this enhancement request.

> No doubt it could be done by locale and script, but not clear there is a
> need as now corrected defaults are reasonable.

So is the default now "Diacritic-sensitive" = on?

Comment 3 Eike Rathke 2018-04-06 10:00:24 UTC

The default is now Diacritic-sensitive, but has an effect only in new installations as the checkbox value is remembered in the user configuration, which in existing installations was remembered from the hidden Diacritic-ignore status.

Fwiw, contrary to what was said in comment 0, in German I do *not* expect diacritics to be ignored. Umlauts are distinct letters, not some "variation of a base" letter. Searching for Bar should not find Bär.
I could bet also in French accented characters are not to be mangled.

Trying to couple the Diacritic-sensitive default to locale/language IMHO is doomed to fail. There is no default locale with text, text either has a specific locale attribute or is set to None. Different portions of text can have different locales assigned. Often enough text is attributed with the user's current default locale but is written in another language and the attribute never changed. A coupled default would lead to bad user experience.

Rather tackle the reason why Diacritic-sensitive was introduced first-hand: for Arabic. So have a second Arabic-diacritic-sensitive checkbox and if (default) unchecked ignore diacritics only in text portions written in Arabic script.
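
A rough sketch of what "ignore diacritics only in Arabic script" could mean in practice is given below. It is an illustration under stated assumptions, not a proposed implementation: it treats the code points U+064B to U+065F and U+0670 as the Arabic marks to ignore, leaves every other character alone, and uses made-up names throughout.

```python
# Assumption: these code points are the Arabic marks (harakat, superscript
# alef) to be ignored; the exact set would need review by someone who knows
# the script.
ARABIC_MARKS = set(range(0x064B, 0x0660)) | {0x0670}


def fold_arabic_diacritics(text: str) -> str:
    """Drop Arabic marks only; Latin letters such as 'ä' stay untouched."""
    return "".join(ch for ch in text if ord(ch) not in ARABIC_MARKS)


def matches(needle: str, haystack: str,
            arabic_diacritic_sensitive: bool = False) -> bool:
    """Substring search that ignores diacritics in Arabic script only."""
    if not arabic_diacritic_sensitive:
        needle = fold_arabic_diacritics(needle)
        haystack = fold_arabic_diacritics(haystack)
    return needle in haystack


if __name__ == "__main__":
    plain = "\u062E\u0627\u0644\u062F"                    # خالد
    vocalised = "\u062E\u064E\u0627\u0644\u0650\u062F\u0652"  # خَالِدْ
    print(matches(plain, vocalised))   # True: Arabic marks are ignored
    print(matches("Bar", "Bär"))       # False: Latin diacritics still count
```

Because the filter works per character, it needs no locale attribute on the text.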

Comment 4 Mihkel Tõnnov 2018-04-06 18:31:08 UTC

(In reply to Eike Rathke from comment #3)
> The default is now Diacritic-sensitive, but has an effect only in new
> installations as the checkbox value is remembered in the user configuration,
> which in existing installations was remembered from the hidden
> Diacritic-ignore status.

OK, I can live with that :)

> Fwiw, contrary to what was said in comment 0, in German I do *not* expect
> diacritics to be ignored. Umlauts are distinct letters, not some "variation
> of a base" letter. Searching for Bar should not find Bär.
> I could bet also in French accented characters are not to be mangled.

Oh. Well, I might have read too much into the old default (diacritic-sensitive = off) and my admittedly fading knowledge of German. But if German (and French) should also default to diacritic-sensitive search, then all the better.

> Trying to couple the Diacritic-sensitive default to locale/language IMHO is
> doomed to fail. There is no default locale with text, text either has a
> specific locale attribute or is set to None. Different portions of text can
> have different locales assigned. Often enough text is attributed with the
> user's current default locale but is written in another language and the
> attribute never changed. A coupled default would lead to bad user experience.

Agreed that it would be tricky. But if we default to diacritic-sensitive search now (also for English with its varying spellings à la naive/naïve etc.), then the issue is pretty much resolved in my eyes.

> Rather tackle the reason why Diacritic-sensitive was introduced first-hand:
> for Arabic. So have a second Arabic-diacritic-sensitive checkbox and if
> (default) unchecked ignore diacritics only in text portions written in
> Arabic script.

I agree. Presumably a new enhancement request would be needed for that?

Comment 5 Heiko Tietze 2018-04-06 20:27:14 UTC

The idea of changing the behavior was, IIRC, that users get feedback when searching for BAR and finding BÄR too, considering also that a slightly longer result list is better than not knowing why something has not been found.

We do have a very complex F&R dialog and we should be careful with adding more options, at least to the initial, basic search.