169103 – Add option to treat similar characters as equivalent (spaces, hyphens, quotes) in Find & Replace across all applications

Status:	NEW

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	framework
Version: (earliest affected)	Inherited From OOo
Hardware:	All All

Importance:	medium enhancement
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	Find-Search Authors

Reported:	2025-10-27 17:44 UTC by Piotr Osada
Modified:	2025-10-30 20:03 UTC
CC List:	1 user

See Also:	126294 38261 https://github.com/gitxpy/libreoffice-alt-search/issues/128
Crash report or crash signature:


Description Piotr Osada 2025-10-27 17:44:16 UTC

Currently, LibreOffice cannot find non-breaking hyphens '‑' (U+2011, typed with Ctrl+Shift+-) when searching with a regular hyphen '-' (U+002D).

USE CASE: In scientific documents, I replaced all hyphens in a-C* abbreviations with non-breaking hyphens (U+2011) to prevent line breaks. However, this makes searching difficult: typing a regular hyphen in the Find dialog does not match non-breaking hyphens.

*a-C (amorphous carbon) 
https://en.wikipedia.org/wiki/Amorphous_carbon
https://www.sciencedirect.com/topics/materials-science/amorphous-carbon

COMPARISON WITH MS WORD:
Microsoft Word treats these characters as equivalent in Find operations:
U+002D  -       Hyphen-Minus (regular)
U+2011  ‑       Non-Breaking Hyphen

When searching for '-', Word automatically finds both variants.


The same problem exists with space variants and quotation mark variants.


ENHANCEMENT:
Add a character normalization option to Find & Replace dialogs.

Proposed checkbox label (choose one):
☑ Treat similar characters as equivalent
☑ Match character variants
☑ Ignore punctuation differences

Note: The label should clearly indicate this is separate from diacritic/accent handling.


BEHAVIOR WHEN ENABLED:
- All space variants → regular space (U+0020)
- All hyphen/dash variants → hyphen-minus (U+002D)
- All quotation mark variants → standard quotes (U+0022, U+0027)

EXAMPLES:
- Searching for "a-C" matches: a-C, a‑C, a–C, a—C
- Searching for "hello world" matches text with any space variant
- Searching for "test" matches: "test", “test”, «test», 『test』

BEHAVIOR WHEN DISABLED (default):
- Only exact character matches (current behavior)
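
The enabled behavior and the examples above amount to a one-to-one character fold applied to both the query and the document text. A minimal Python sketch follows, assuming a small hand-picked subset of the variants listed later in this report. The names `FOLD_TABLE`, `fold`, and `find_folded` are hypothetical and not part of LibreOffice.

```python
# Illustrative subset only; a real table would cover the full lists in this report.
SPACE_VARIANTS = "\u00A0\u2002\u2003\u2009\u202F\u3000"
DASH_VARIANTS = "\u2010\u2011\u2012\u2013\u2014"
DOUBLE_QUOTE_VARIANTS = "\u00AB\u00BB\u201C\u201D\u201E\u300E\u300F"
SINGLE_QUOTE_VARIANTS = "\u2018\u2019\u201A\u2039\u203A"

# Build a code point -> replacement map usable with str.translate.
FOLD_TABLE = {}
for variants, target in (
    (SPACE_VARIANTS, " "),
    (DASH_VARIANTS, "-"),
    (DOUBLE_QUOTE_VARIANTS, '"'),
    (SINGLE_QUOTE_VARIANTS, "'"),
):
    for ch in variants:
        FOLD_TABLE[ord(ch)] = target


def fold(text: str) -> str:
    """Map each listed variant to its ASCII representative (length is preserved)."""
    return text.translate(FOLD_TABLE)


def find_folded(haystack: str, needle: str) -> list:
    """Return the start offset of every match of needle in haystack after folding."""
    folded_haystack, folded_needle = fold(haystack), fold(needle)
    if not folded_needle:
        return []
    offsets = []
    start = folded_haystack.find(folded_needle)
    while start != -1:
        offsets.append(start)
        start = folded_haystack.find(folded_needle, start + 1)
    return offsets


text = "a-C, a\u2011C, a\u2013C and a\u2014C films"
print(find_folded(text, "a-C"))  # [0, 5, 10, 18]
```

Because every mapping replaces one character with one character, offsets in the folded text equal offsets in the original, so a match could be highlighted without a separate index map.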



OPEN QUESTION: How should primes and accents be handled?

Primes (′, ″, ‴):
- Used in mathematical/scientific notation: f′(x), 5′ 10″
- Currently often confused with quotation marks
- Recommendation: Keep primes separate to preserve mathematical meaning

Accents (`, ´, ^):
- Only partly covered by Unicode NFKC normalization (´ U+00B4 becomes space + combining acute; ` and ^ are left unchanged)
- Can be standalone (spacing) or combining characters
- Should these be included in character variant matching?

See: [3] https://en.wikipedia.org/wiki/Prime_(symbol)



To frame the discussion, I suggest the following schema to illustrate the scope of the topic:

Text Canonicalization
 └─ Character Folding
     ├─ Unicode Normalization (NFKC, NFKD)
     ├─ Case Folding
     ├─ Accent/Diacritic Folding
     └─ Character Class Normalization
         ├─ Whitespace normalization
         ├─ Hyphen/dash normalization
         └─ Quote normalization

┌──────┬─────────────┬───────────────┬──────────────────────────┐
│ Form │ Decomposed? │ Compatibility?│         Use case         │
├──────┼─────────────┼───────────────┼──────────────────────────┤
│ NFC  │      no     │       no      │ Storage, display         │
│ NFD  │     yes     │       no      │ Accent removal, analysis │
│ NFKC │      no     │      yes      │ Search                   │
│ NFKD │     yes     │      yes      │ Text simplification      │
└──────┴─────────────┴───────────────┴──────────────────────────┘
NFC   = Normalization Form Canonical Composition
NFD   = Normalization Form Canonical Decomposition
NFKC  = Normalization Form Compatibility Composition
NFKD  = Normalization Form Compatibility Decomposition
        [1] https://en.wikipedia.org/wiki/Unicode_equivalence
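
The normalization forms in the table above can be tried out with Python's standard `unicodedata` module. The sketch below applies NFKC to a few characters from this report; the helper name `nfkc_codepoints` and the sample selection are illustrative.

```python
import unicodedata


def nfkc_codepoints(ch: str) -> str:
    """Return the NFKC form of ch as a space-separated list of code points."""
    return " ".join(f"U+{ord(c):04X}" for c in unicodedata.normalize("NFKC", ch))


samples = {
    "no-break space": "\u00A0",
    "non-breaking hyphen": "\u2011",
    "en dash": "\u2013",
    "fullwidth hyphen-minus": "\uFF0D",
    "left double quotation mark": "\u201C",
    "acute accent (spacing)": "\u00B4",
    "subscript two": "\u2082",
}

for name, ch in samples.items():
    print(f"{name:28} U+{ord(ch):04X} -> {nfkc_codepoints(ch)}")
```

NFKC folds the no-break space to U+0020, the fullwidth hyphen-minus to U+002D, and subscript two to U+0032. It maps the non-breaking hyphen to U+2010 rather than U+002D, and it leaves the en dash and the curly quote unchanged. This suggests that NFKC alone would not cover the proposed option and that an additional character class table would be needed.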



SPACES (U+0020  )
Code	Char	Name
U+00A0		No-Break Space (NBSP)
U+1680		Ogham Space Mark
U+2000		En Quad
U+2001		Em Quad
U+2002		En Space
U+2003		Em Space
U+2004		Three-Per-Em Space
U+2005		Four-Per-Em Space
U+2006		Six-Per-Em Space
U+2007		Figure Space
U+2008		Punctuation Space
U+2009		Thin Space
U+200A		Hair Space
U+202F		Narrow No-Break Space
U+205F		Medium Mathematical Space
U+3000	　	Ideographic Space (CJK full-width space)
U+200B		Zero-Width Space (invisible)
U+200C	‌	Zero-Width Non-Joiner (invisible)
U+200D	‍	Zero-Width Joiner (invisible)
U+2060	⁠	Word Joiner (invisible)
U+FEFF		Zero-Width No-Break Space (BOM)
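
The list above mixes two kinds of characters. A quick check with Python's `unicodedata` module (a sketch; the variable names are illustrative) separates the entries with general category `Zs` (space separator) from those with `Cf` (format).

```python
import unicodedata

# Code points from the SPACES table above, in the order listed.
LISTED = [0x00A0, 0x1680, *range(0x2000, 0x200B), 0x202F, 0x205F, 0x3000,
          0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF]

space_separators = [cp for cp in LISTED if unicodedata.category(chr(cp)) == "Zs"]
format_controls = [cp for cp in LISTED if unicodedata.category(chr(cp)) == "Cf"]

print("Zs:", " ".join(f"U+{cp:04X}" for cp in space_separators))
print("Cf:", " ".join(f"U+{cp:04X}" for cp in format_controls))
```

The five zero-width entries fall under `Cf`. One possible design, offered here only as an assumption, is to map the `Zs` characters to U+0020 and to skip the `Cf` characters during matching, since turning an invisible joiner into a space would change what a query matches.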



HYPHEN/DASH VARIANTS (U+002D -)
Code	Char	Name
U+002D	-	Hyphen-Minus (ASCII standard)
U+1806	᠆	Mongolian Todo Soft Hyphen
U+2010	‐	Hyphen
U+2011	‑	Non-Breaking Hyphen
U+2012	‒	Figure Dash
U+2013	–	En Dash
U+2014	—	Em Dash
U+FE58	﹘	Small Em Dash
U+FE63	﹣	Small Hyphen-Minus
U+FF0D	－	Fullwidth Hyphen-Minus
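
Reference [2] in this report points to the Unicode general category `Pd` (dash punctuation). As a sketch, the dash set could be derived from that category instead of being maintained by hand. The function name `dash_punctuation` is hypothetical, and the size of the resulting set depends on the Unicode version bundled with the Python interpreter.

```python
import sys
import unicodedata


def dash_punctuation() -> set:
    """Collect every code point whose general category is Pd."""
    return {cp for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == "Pd"}


dashes = dash_punctuation()
for cp in (0x002D, 0x1806, 0x2011, 0x2014, 0xFE58, 0xFF0D, 0x2212):
    status = "Pd" if cp in dashes else unicodedata.category(chr(cp))
    print(f"U+{cp:04X} {status}")
```

U+2212 Minus Sign has category `Sm` (math symbol), so a category-based set would need it added explicitly if it is meant to match a typed hyphen.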



SINGLE QUOTES (U+0027 ')
Code	Char	Name
U+2018	‘	Left Single Quotation Mark
U+2019	’	Right Single Quotation Mark
U+201A	‚	Single Low-9 Quotation Mark
U+201B	‛	Single High-Reversed-9 Quotation Mark
U+2039	‹	Single Left-Pointing Angle Quotation Mark
U+203A	›	Single Right-Pointing Angle Quotation Mark
U+275B	❛	Heavy Single Turned Comma Quotation Mark Ornament
U+275C	❜	Heavy Single Comma Quotation Mark Ornament
U+276E	❮	Heavy Left-Pointing Angle Quotation Mark Ornament
U+276F	❯	Heavy Right-Pointing Angle Quotation Mark Ornament
U+FF07	＇	Fullwidth Apostrophe
U+300C	「	Left Corner Bracket (Chinese, Japanese, Korean)
U+300D	」	Right Corner Bracket



DOUBLE QUOTES (U+0022 ")
Code	Char	Name
U+00AB	«	Left-Pointing Double Angle Quotation Mark
U+00BB	»	Right-Pointing Double Angle Quotation Mark
U+201C	“	Left Double Quotation Mark
U+201D	”	Right Double Quotation Mark
U+201E	„	Double Low-9 Quotation Mark
U+201F	‟	Double High-Reversed-9 Quotation Mark
U+275D	❝	Heavy Double Turned Comma Quotation Mark Ornament
U+275E	❞	Heavy Double Comma Quotation Mark Ornament
U+2E42	⹂	Double Low-Reversed-9 Quotation Mark
U+301D	〝	Reversed Double Prime Quotation Mark
U+301E	〞	Double Prime Quotation Mark
U+301F	〟	Low Double Prime Quotation Mark
U+FF02	＂	Fullwidth Quotation Mark
U+300E	『	Left White Corner Bracket
U+300F	』	Right White Corner Bracket



APOSTROPHES (U+0027 ')
Code	Char	Name
U+0027	'	Apostrophe (standard)
U+02BC	ʼ	Modifier Letter Apostrophe
U+02BB	ʻ	Modifier Letter Turned Comma
U+02BD	ʽ	Modifier Letter Reversed Comma
U+02C8	ˈ	Modifier Letter Vertical Line (stress mark)
U+055A	՚	Armenian Apostrophe
U+2032	′	Prime (sometimes misused as apostrophe)



PRIMES (mathematical/scientific notation)
Code	Char	Name
U+2032	′	Prime (minutes, feet, derivatives)
U+2033	″	Double Prime (seconds, inches)
U+2034	‴	Triple Prime
U+2035	‵	Reversed Prime
U+2036	‶	Reversed Double Prime
U+2037	‷	Reversed Triple Prime
U+2057	⁗	Quadruple Prime
U+02B9	ʹ	Modifier Letter Prime
U+02BA	ʺ	Modifier Letter Double Prime



ACCENTS (grave, acute, circumflex)
Code	Char	Name
U+0060	`	Grave Accent (backtick)
U+00B4	´	Acute Accent (spacing)
U+005E	^	Circumflex Accent (caret)
U+02C6	ˆ	Modifier Letter Circumflex Accent
U+02C7	ˇ	Caron (háček)
U+02D8	˘	Breve
U+02D9	˙	Dot Above
U+02DA	˚	Ring Above
U+02DC	˜	Small Tilde
U+02DD	˝	Double Acute Accent
U+0300	̀	Combining Grave Accent
U+0301	́	Combining Acute Accent
U+0302	̂	Combining Circumflex Accent
U+0303	̃	Combining Tilde
U+0304	̄	Combining Macron




[1] https://en.wikipedia.org/wiki/Unicode_equivalence
[2] https://www.compart.com/en/unicode/category/Pd
[3] https://en.wikipedia.org/wiki/Prime_(symbol)


        CODE IMPLEMENTATIONS
[4] Normalize all UTF quotes in Javascript
https://gist.github.com/thanpolas/244d9a13151caf5a12e42208b6111aa6

[5] tehsis/normalize: Normalize a string with utf-8 characters.
https://github.com/tehsis/normalize


        RELATED LIBRARIES
[6] VitorLuizC/normalize-text: 📝 Provides a simple functions to normalize texts, whitespaces, paragraphs & diacritics.
https://github.com/VitorLuizC/normalize-text

[7] icu/icu4c/source/common/unicode/normalizer2.h at main · unicode-org/icu
https://github.com/unicode-org/icu/blob/main/icu4c/source/common/unicode/normalizer2.h


        GitHub SEARCH: 'NORMALIZATION'
[8] GitbookIO/normall: Normall: normalize filenames, accents etc ... in JS
https://github.com/GitbookIO/normall


        TECHNICAL DOCUMENTATION
[9] normalization · GitHub Topics
https://github.com/topics/normalization?l=python

[10] I18N/CanonicalNormalizationIssues - W3C Wiki
https://www.w3.org/wiki/I18N/CanonicalNormalizationIssues

[11] icu/docs/userguide/transforms/normalization/index.md at main · unicode-org/icu
https://github.com/unicode-org/icu/blob/main/docs/userguide/transforms/normalization/index.md

[12] Normalization | ICU Documentation
https://unicode-org.github.io/icu/userguide/transforms/normalization/

[13] Custom Normalization | ICU Documentation
https://unicode-org.github.io/icu/design/normalization/custom.html

[14] ICU 78.1: common/unicode/unorm.h File Reference
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/unorm_8h.html

Comment 1 Piotr Osada 2025-10-27 18:42:52 UTC

Since AI chatbots often produce text with Unicode subscripts and superscripts, it would be good to include these symbols in the character normalization used for searching.
As a result, a search for H2O would also match H₂O.

SUPERSCRIPT CHARACTERS
Code	Char	Name
U+2070	⁰	Superscript Zero
U+00B9	¹	Superscript One
U+00B2	²	Superscript Two
U+00B3	³	Superscript Three
U+2074	⁴	Superscript Four
U+2075	⁵	Superscript Five
U+2076	⁶	Superscript Six
U+2077	⁷	Superscript Seven
U+2078	⁸	Superscript Eight
U+2079	⁹	Superscript Nine
U+207A	⁺	Superscript Plus Sign
U+207B	⁻	Superscript Minus
U+207C	⁼	Superscript Equals Sign
U+207D	⁽	Superscript Left Parenthesis
U+207E	⁾	Superscript Right Parenthesis

SUBSCRIPT CHARACTERS
Code	Char	Name
U+2080	₀	Subscript Zero
U+2081	₁	Subscript One
U+2082	₂	Subscript Two
U+2083	₃	Subscript Three
U+2084	₄	Subscript Four
U+2085	₅	Subscript Five
U+2086	₆	Subscript Six
U+2087	₇	Subscript Seven
U+2088	₈	Subscript Eight
U+2089	₉	Subscript Nine
U+208A	₊	Subscript Plus Sign
U+208B	₋	Subscript Minus
U+208C	₌	Subscript Equals Sign
U+208D	₍	Subscript Left Parenthesis
U+208E	₎	Subscript Right Parenthesis

[15] https://symbl.cc/en/collections/superscript-and-subscript-numbers/

Comment 2 Piotr Osada 2025-10-27 19:11:32 UTC

Additionally, Unicode includes superscript and subscript letters (beyond just numbers) that are commonly used in scientific notation, phonetics (IPA), and mathematical expressions. These should also be included in character normalization.

MOTIVATION:
Scientific documents often mix formatting styles from different sources (manual typing, AI-generated content, copy-paste from different software), making it difficult to search for the same chemical formula or physical quantity written in different ways.

BASIC EXAMPLES:
- Searching for "xn" should match: xⁿ, xₙ
- Searching for "H2O" should match: H₂O
- Searching for "CO2" should match: CO₂, CO²
- Searching for "Ca2+" should match: Ca²⁺
- Searching for "Cp" should match: Cₚ
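
For the basic examples above, Unicode compatibility normalization already does the work: each superscript or subscript character used in them has a compatibility decomposition to its plain counterpart. A minimal Python sketch (the helper `matches_plain` is a hypothetical name):

```python
import unicodedata


def matches_plain(query: str, text: str) -> bool:
    """True if query occurs in text once both are NFKC-normalized."""
    return unicodedata.normalize("NFKC", query) in unicodedata.normalize("NFKC", text)


cases = [
    ("xn", "x\u207F"),            # superscript n
    ("xn", "x\u2099"),            # subscript n
    ("H2O", "H\u2082O"),          # subscript two
    ("CO2", "CO\u00B2"),          # superscript two
    ("Ca2+", "Ca\u00B2\u207A"),   # superscript two and superscript plus
    ("Cp", "C\u209A"),            # subscript p
]
for query, text in cases:
    print(query, "->", matches_plain(query, text))
```

The fold discards the distinction between subscript and superscript, so CO₂ and CO² become indistinguishable, which is one reason to keep the behavior optional.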


USE CASES IN SCIENTIFIC DOCUMENTS:

PHYSICS
Mechanics:
aₜ, aₙ — tangential and normal acceleration
σₓₓ, τₓᵧ — stress tensor components

Thermodynamics:
Cᵥ, Cₚ — heat capacity at constant volume/pressure

CHEMISTRY
Ca²⁺ — calcium cation
[H₃O⁺] — hydronium ion

MATERIAL SCIENCE
Crystal Structure:
t-ZrO₂ — tetragonal zirconia
α-Fe, γ-Fe — iron crystal phases

Material stoichiometry:
ZrO₂₋ₓ — zirconium oxide (non-stoichiometric compound)


SUPERSCRIPT LETTERS (MODIFIERS)
Code	Char	Name
U+1D43	ᵃ	Modifier Letter Small A
U+1D47	ᵇ	Modifier Letter Small B
U+1D9C	ᶜ	Modifier Letter Small C
U+1D48	ᵈ	Modifier Letter Small D
U+1D49	ᵉ	Modifier Letter Small E
U+1DA0	ᶠ	Modifier Letter Small F
U+1D4D	ᵍ	Modifier Letter Small G
U+02B0	ʰ	Modifier Letter Small H
U+2071	ⁱ	Superscript Latin Small Letter I
U+02B2	ʲ	Modifier Letter Small J
U+1D4F	ᵏ	Modifier Letter Small K
U+02E1	ˡ	Modifier Letter Small L
U+1D50	ᵐ	Modifier Letter Small M
U+207F	ⁿ	Superscript Latin Small Letter N
U+1D52	ᵒ	Modifier Letter Small O
U+1D56	ᵖ	Modifier Letter Small P
U+02B3	ʳ	Modifier Letter Small R
U+02E2	ˢ	Modifier Letter Small S
U+1D57	ᵗ	Modifier Letter Small T
U+1D58	ᵘ	Modifier Letter Small U
U+1D5B	ᵛ	Modifier Letter Small V
U+02B7	ʷ	Modifier Letter Small W
U+02E3	ˣ	Modifier Letter Small X
U+02B8	ʸ	Modifier Letter Small Y
U+1DBB	ᶻ	Modifier Letter Small Z

SUPERSCRIPT CAPITALS
Code	Char	Name
U+1D2C	ᴬ	Modifier Letter Capital A
U+1D2E	ᴮ	Modifier Letter Capital B
U+1D30	ᴰ	Modifier Letter Capital D
U+1D31	ᴱ	Modifier Letter Capital E
U+1D33	ᴳ	Modifier Letter Capital G
U+1D34	ᴴ	Modifier Letter Capital H
U+1D35	ᴵ	Modifier Letter Capital I
U+1D36	ᴶ	Modifier Letter Capital J
U+1D37	ᴷ	Modifier Letter Capital K
U+1D38	ᴸ	Modifier Letter Capital L
U+1D39	ᴹ	Modifier Letter Capital M
U+1D3A	ᴺ	Modifier Letter Capital N
U+1D3C	ᴼ	Modifier Letter Capital O
U+1D3E	ᴾ	Modifier Letter Capital P
U+1D3F	ᴿ	Modifier Letter Capital R
U+1D40	ᵀ	Modifier Letter Capital T
U+1D41	ᵁ	Modifier Letter Capital U
U+1D42	ᵂ	Modifier Letter Capital W

SUBSCRIPT LETTERS
Code	Char	Name
U+2090	ₐ	Latin Subscript Small Letter A
U+2091	ₑ	Latin Subscript Small Letter E
U+2095	ₕ	Latin Subscript Small Letter H
U+1D62	ᵢ	Latin Subscript Small Letter I
U+2C7C	ⱼ	Latin Subscript Small Letter J
U+2096	ₖ	Latin Subscript Small Letter K
U+2097	ₗ	Latin Subscript Small Letter L
U+2098	ₘ	Latin Subscript Small Letter M
U+2099	ₙ	Latin Subscript Small Letter N
U+2092	ₒ	Latin Subscript Small Letter O
U+209A	ₚ	Latin Subscript Small Letter P
U+1D63	ᵣ	Latin Subscript Small Letter R
U+209B	ₛ	Latin Subscript Small Letter S
U+209C	ₜ	Latin Subscript Small Letter T
U+1D64	ᵤ	Latin Subscript Small Letter U
U+1D65	ᵥ	Latin Subscript Small Letter V
U+2093	ₓ	Latin Subscript Small Letter X

GREEK SUBSCRIPTS
Code	Char	Name
U+1D66	ᵦ	Greek Subscript Small Letter Beta
U+1D67	ᵧ	Greek Subscript Small Letter Gamma
U+1D68	ᵨ	Greek Subscript Small Letter Rho
U+1D69	ᵩ	Greek Subscript Small Letter Phi
U+1D6A	ᵪ	Greek Subscript Small Letter Chi

[16] https://symbl.cc/en/collections/superscript-and-subscript-letters/
[17] https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts

Comment 3 fpy 2025-10-27 21:01:28 UTC

How about the regular expressions that are already supported? -> https://help.libreoffice.org/latest/en-US/text/shared/01/02100001.html

You can use the existing character properties with the \p{} syntax,
e.g. \p{Alnum}, \p{space}, \p{QMark}, ...
or define your own ranges, e.g. [₁-₉₊₋₌₍₎]

To batch all of these, the current recommendation is to use altsearch
https://extensions.libreoffice.org/en/extensions/show/70066
which evolves faster than the core F&R.
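
The regex approach can also be automated by expanding a plain query into a pattern, which is roughly what a checkbox option might do internally. The sketch below is in Python, whose standard `re` module does not support `\p{...}`, so explicit character classes are used. `CLASSES` and `expand_query` are hypothetical names, and the classes are a small illustrative subset.

```python
import re

# Each ASCII representative expands to a class of variants (subset only).
CLASSES = {
    "-": "-\u2010\u2011\u2012\u2013\u2014",
    " ": " \u00A0\u2009\u202F",
    "'": "'\u2018\u2019",
    '"': '"\u201C\u201D\u00AB\u00BB',
}


def expand_query(query: str) -> re.Pattern:
    """Compile a literal query into a regex that also accepts the listed variants."""
    parts = []
    for ch in query:
        if ch in CLASSES:
            parts.append("[" + re.escape(CLASSES[ch]) + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


pattern = expand_query("a-C")
print(pattern.findall("a-C a\u2011C a\u2013C aC"))
```

Here the user types only the plain query, and the expansion to character classes happens behind the scenes.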

Comment 4 Piotr Osada 2025-10-30 18:47:05 UTC

Thank you, @fpy, for the add-on recommendation.

(In reply to fpy from comment #3)
> e.g. \p{Alnum}, \p{space}, \p{QMark}, ...

[:hyphen:] or \p{hyphen} is good enough for finding all variants at once in my case.

But it is still time-consuming compared with a possible checkbox mode, as in MS Word:
"☑ Ignore punctuation characters"
where you can type <space>, <hyphen> "-" or "'"
and all variants get selected.

Typing:
"-" -- is instantaneous.
\p{space} -- requires memorization or a help lookup (say 5 to 30 seconds)

Interoperable search:
<spaces>
<quotes>
<hyphens>
...IMHO this use case makes it worth implementing.

Comment 5 Piotr Osada 2025-10-30 19:01:17 UTC

Another variant to add to:

HYPHEN/DASH VARIANTS (U+002D -)
Code	Char	Name
U+2212	−	Minus Sign

[20] https://unicode-explorer.com/c/2212
[21] https://www.compart.com/en/unicode/search?q=minus#characters