163652 – Properties of the com.sun.star.util.SearchDescriptor do not cover Matchdiacritics

Status:	UNCONFIRMED

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	sdk
Version: (earliest affected)	24.8.2.1 release
Hardware:	All All

Importance:	medium normal
Assignee:	Not Assigned

URL:
Whiteboard:	QA:needsComment
Keywords:

Depends on:
Blocks:

Reported:	2024-10-28 06:10 UTC by madhavkiran.sodum
Modified:	2024-11-12 03:13 UTC
CC List:	0 users

See Also:
Crash report or crash signature:

Description madhavkiran.sodum 2024-10-28 06:10:55 UTC

The predefined properties of com.sun.star.util.SearchDescriptor include "SearchCaseSensitive" for case-sensitive searching, but there is no corresponding property such as "SearchDiacriticSensitive" for diacritic-sensitive searching.

The reason it should be there is:
1. Match Case and Match Diacritics are very similar in implementation. With SearchCaseSensitive off, "s" matches "s", "S" and "ß" (the sharp S used in German), whereas with SearchDiacriticSensitive off, "s" would match "s", "ś", "ṣ", etc.

2. Right now LO's implementation doesn't follow the collation strength rules. We can search while ignoring both case and accents, but we cannot ignore case alone, because accents are ignored by default. Ideally, IMHO, we should have a simple option to toggle between the first three collation strengths:

[Quote]
The Strength attribute determines whether accent or case is taken into account when collating or comparing text strings. In writing systems without case or accent, the Strength attribute controls similarly important features.
The possible values are: primary (1), secondary (2), tertiary (3), quaternary (4), and identity (I).

To ignore:

- accent and case, use the primary strength level
- case only, use the secondary strength level
- neither accent nor case, use the tertiary strength level

Almost all characters can be distinguished by the first three strength levels, therefore in most locales the default Strength attribute is set at the tertiary level. However if the Alternate attribute (described in a following row) is set to shifted, then the quaternary strength level can be used to break ties among white space characters, punctuation marks, and symbols that would otherwise be ignored.
[End of Quote]
https://www.ibm.com/docs/en/db2/11.5?topic=collation-unicode-algorithm-based-collations
https://www.php.net/manual/en/collator.setstrength.php

With the SearchCaseSensitive property we can switch between strength levels 2 and 3. By implementing a SearchDiacriticSensitive property we could also switch between strength levels 1 and 2.

3. The Microsoft Office API, Apple and OpenSearch all provide a feature for diacritic sensitivity or ASCII folding:
https://learn.microsoft.com/en-us/office/vba/api/word.find.matchdiacritics
https://developer.apple.com/documentation/foundation/nsstring/compareoptions/1412313-diacriticinsensitive
https://opensearch.org/docs/latest/analyzers/token-filters/asciifolding/

4. The idea should be that if we can type it, we should be able to find it.
With today's keyboard layouts (whether on Android or Apple phones or on computers), it is very easy to type diacritics and accents, so we should have a property to find them as well.
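
To make the three strength levels from point 2 concrete, here is a rough Python sketch. It is only an approximation built on the standard unicodedata module, not real ICU collation and not how LO implements its search. The helper names fold and contains are made up for illustration.

```python
import unicodedata


def fold(text, strength):
    """Reduce text to a comparison key for a given strength level.

    1 (primary):   ignore accents and case
    2 (secondary): ignore case only
    3 (tertiary):  ignore neither
    This only approximates collation strengths; it is not ICU.
    """
    if strength not in (1, 2, 3):
        raise ValueError("strength must be 1, 2 or 3")
    if strength <= 2:
        # casefold() is a more aggressive lower(); it maps "ß" to "ss"
        text = text.casefold()
    if strength == 1:
        # Decompose, then drop combining marks (accents, dots below, ...)
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Normalize so precomposed and decomposed input compare equal
    return unicodedata.normalize("NFC", text)


def contains(haystack, needle, strength=3):
    """Substring search at the given strength level."""
    return fold(needle, strength) in fold(haystack, strength)
```

With this sketch, contains("Śruti", "sruti", 1) is true, while the same call at strength 2 is false because the accent now counts. Strength 2 still matches "śruti" against "Śruti", and strength 3 does not.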

It would be great if LO also had an equivalent, since it would help with macros and other search-and-replace features.
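
For reference, this is roughly how a Python macro sets up a search today. It is a minimal sketch that assumes doc is an already loaded text document. The helper name make_descriptor is made up, and the commented-out line shows the property requested here, which does not exist in the current API.

```python
def make_descriptor(doc, text, case_sensitive=False):
    """Build a search descriptor for a loaded document (sketch only)."""
    desc = doc.createSearchDescriptor()  # from XSearchable
    desc.setSearchString(text)           # from XSearchDescriptor
    desc.setPropertyValue("SearchCaseSensitive", case_sensitive)
    # Hypothetical, requested by this report; no such property exists today:
    # desc.setPropertyValue("SearchDiacriticSensitive", True)
    return desc
```

A macro would then call something like doc.findAll(make_descriptor(doc, "sruti")), with no way to say whether "ś" or "ṣ" should count as a match.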

Comment 1 madhavkiran.sodum 2024-10-28 06:15:30 UTC

https://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html