Currently, LibreOffice cannot find non-breaking hyphens '‑' (U+2011, Ctrl+Shift+-) when searching with a regular hyphen '-' (U+002D).

USE CASE:
In scientific documents, I replaced all hyphens in a-C abbreviations with non-breaking hyphens (U+2011) to prevent line breaks. However, this makes searching difficult: typing a regular hyphen in the Find dialog does not match non-breaking hyphens.

*a-C (amorphous carbon)
 https://en.wikipedia.org/wiki/Amorphous_carbon
 https://www.sciencedirect.com/topics/materials-science/amorphous-carbon

COMPARISON WITH MS WORD:
Microsoft Word treats these characters as equivalent in Find operations:
  U+002D  -  Hyphen-Minus (regular)
  U+2011  ‑  Non-Breaking Hyphen
When searching for '-', Word automatically finds both variants.

The same problem exists with space variants and quotation mark variants.

ENHANCEMENT:
Add a character normalization option to the Find & Replace dialog.

Proposed checkbox label (choose one):
  ☑ Treat similar characters as equivalent
  ☑ Match character variants
  ☑ Ignore punctuation differences

Note: The label should clearly indicate that this option is separate from diacritic/accent handling.

BEHAVIOR WHEN ENABLED:
- All space variants → regular space (U+0020)
- All hyphen/dash variants → hyphen-minus (U+002D)
- All quotation mark variants → standard quotes (U+0022, U+0027)

EXAMPLES:
- Searching for "a-C" matches: a-C, a‑C, a–C, a—C
- Searching for "hello world" matches text with any space variant
- Searching for "test" matches: "test", “test”, «test», 『test』

BEHAVIOR WHEN DISABLED (default):
- Only exact character matches (current behavior)

OPEN QUESTION:
How should primes and accents be handled?
Primes (′, ″, ‴):
- Used in mathematical/scientific notation: f′(x), 5′ 10″
- Currently often confused with quotation marks
- Recommendation: Keep primes separate to preserve mathematical meaning
See: [3]

Accents (`, ´, ^):
- Typically handled through Unicode normalization (decomposition such as NFD/NFKD, followed by removal of combining marks)
- Can be standalone (spacing) or combining characters
- Should these be included in character variant matching?

For the discussion, I suggest this schema to illustrate the range of the topic:

Text Canonicalization
└─ Character Folding
   ├─ Unicode Normalization (NFKC, NFKD)
   ├─ Case Folding
   ├─ Accent/Diacritic Folding
   └─ Character Class Normalization
      ├─ Whitespace normalization
      ├─ Hyphen/dash normalization
      └─ Quote normalization

┌──────┬─────────────┬────────────────┬──────────────────────────┐
│ Form │ Decomposed? │ Compatibility? │ Use case                 │
├──────┼─────────────┼────────────────┼──────────────────────────┤
│ NFC  │ no          │ no             │ Storage, display         │
│ NFD  │ yes         │ no             │ Accent removal, analysis │
│ NFKC │ no          │ yes            │ Search                   │
│ NFKD │ yes         │ yes            │ Text simplification      │
└──────┴─────────────┴────────────────┴──────────────────────────┘

NFC  = Normalization Form Canonical Composition
NFD  = Normalization Form Canonical Decomposition
NFKC = Normalization Form Compatibility Composition
NFKD = Normalization Form Compatibility Decomposition
See: [1]

SPACES (U+0020)
Code    Name
U+00A0  No-Break Space (NBSP)
U+1680  Ogham Space Mark
U+2000  En Quad
U+2001  Em Quad
U+2002  En Space
U+2003  Em Space
U+2004  Three-Per-Em Space
U+2005  Four-Per-Em Space
U+2006  Six-Per-Em Space
U+2007  Figure Space
U+2008  Punctuation Space
U+2009  Thin Space
U+200A  Hair Space
U+202F  Narrow No-Break Space
U+205F  Medium Mathematical Space
U+3000  Ideographic Space (CJK full-width space)
U+200B  Zero-Width Space (invisible)
U+200C  Zero-Width Non-Joiner (invisible)
U+200D  Zero-Width Joiner (invisible)
U+2060  Word Joiner (invisible)
U+FEFF  Zero-Width No-Break Space (BOM)

HYPHEN/DASH VARIANTS (U+002D -)
Code    Char  Name
U+002D  -     Hyphen-Minus (ASCII standard)
U+1806  ᠆     Mongolian Todo Soft Hyphen
U+2010  ‐     Hyphen
U+2011  ‑     Non-Breaking Hyphen
U+2012  ‒     Figure Dash
U+2013  –     En Dash
U+2014  —     Em Dash
U+FE58  ﹘     Small Em Dash
U+FE63  ﹣     Small Hyphen-Minus
U+FF0D  －     Fullwidth Hyphen-Minus

SINGLE QUOTES (U+0027 ')
Code    Char  Name
U+2018  ‘     Left Single Quotation Mark
U+2019  ’     Right Single Quotation Mark
U+201A  ‚     Single Low-9 Quotation Mark
U+201B  ‛     Single High-Reversed-9 Quotation Mark
U+2039  ‹     Single Left-Pointing Angle Quotation Mark
U+203A  ›     Single Right-Pointing Angle Quotation Mark
U+275B  ❛     Heavy Single Turned Comma Quotation Mark Ornament
U+275C  ❜     Heavy Single Comma Quotation Mark Ornament
U+276E  ❮     Heavy Left-Pointing Angle Quotation Mark Ornament
U+276F  ❯     Heavy Right-Pointing Angle Quotation Mark Ornament
U+FF07  ＇     Fullwidth Apostrophe
U+300C  「     Left Corner Bracket (Chinese, Japanese, Korean)
U+300D  」     Right Corner Bracket

DOUBLE QUOTES (U+0022 ")
Code    Char  Name
U+00AB  «     Left-Pointing Double Angle Quotation Mark
U+00BB  »     Right-Pointing Double Angle Quotation Mark
U+201C  “     Left Double Quotation Mark
U+201D  ”     Right Double Quotation Mark
U+201E  „     Double Low-9 Quotation Mark
U+201F  ‟     Double High-Reversed-9 Quotation Mark
U+275D  ❝     Heavy Double Turned Comma Quotation Mark Ornament
U+275E  ❞     Heavy Double Comma Quotation Mark Ornament
U+2E42  ⹂     Double Low-Reversed-9 Quotation Mark
U+301D  〝     Reversed Double Prime Quotation Mark
U+301E  〞     Double Prime Quotation Mark
U+301F  〟     Low Double Prime Quotation Mark
U+FF02  ＂     Fullwidth Quotation Mark
U+300E  『     Left White Corner Bracket
U+300F  』     Right White Corner Bracket

APOSTROPHES (U+0027 ')
Code    Char  Name
U+0027  '     Apostrophe (standard)
U+02BC  ʼ     Modifier Letter Apostrophe
U+02BB  ʻ     Modifier Letter Turned Comma
U+02BD  ʽ     Modifier Letter Reversed Comma
U+02C8  ˈ     Modifier Letter Vertical Line (stress mark)
U+055A  ՚     Armenian Apostrophe
U+2032  ′     Prime (sometimes misused as apostrophe)

PRIMES (mathematical/scientific notation)
Code    Char  Name
U+2032  ′     Prime (minutes, feet, derivatives)
U+2033  ″     Double Prime (seconds, inches)
U+2034  ‴     Triple Prime
U+2035  ‵     Reversed Prime
U+2036  ‶     Reversed Double Prime
U+2037  ‷     Reversed Triple Prime
U+2057  ⁗     Quadruple Prime
U+02B9  ʹ     Modifier Letter Prime
U+02BA  ʺ     Modifier Letter Double Prime

ACCENTS (grave, acute, circumflex)
Code    Char  Name
U+0060  `     Grave Accent (backtick)
U+00B4  ´     Acute Accent (spacing)
U+005E  ^     Circumflex Accent (caret)
U+02C6  ˆ     Modifier Letter Circumflex Accent
U+02C7  ˇ     Caron (háček)
U+02D8  ˘     Breve
U+02D9  ˙     Dot Above
U+02DA  ˚     Ring Above
U+02DC  ˜     Small Tilde
U+02DD  ˝     Double Acute Accent
U+0300   ̀     Combining Grave Accent
U+0301   ́     Combining Acute Accent
U+0302   ̂     Combining Circumflex Accent
U+0303   ̃     Combining Tilde
U+0304   ̄     Combining Macron

[1] https://en.wikipedia.org/wiki/Unicode_equivalence
[2] https://www.compart.com/en/unicode/category/Pd
[3] https://en.wikipedia.org/wiki/Prime_(symbol)

CODE IMPLEMENTATIONS
[4] Normalize all UTF quotes in Javascript
    https://gist.github.com/thanpolas/244d9a13151caf5a12e42208b6111aa6
[5] tehsis/normalize: Normalize a string with utf-8 characters.
    https://github.com/tehsis/normalize

RELATED LIBRARIES
[6] VitorLuizC/normalize-text: Provides a simple functions to normalize texts, whitespaces, paragraphs & diacritics.
    https://github.com/VitorLuizC/normalize-text
[7] icu/icu4c/source/common/unicode/normalizer2.h at main · unicode-org/icu
    https://github.com/unicode-org/icu/blob/main/icu4c/source/common/unicode/normalizer2.h

GitHub SEARCH: 'NORMALIZATION'
[8] GitbookIO/normall: Normall: normalize filenames, accents etc ... in JS
    https://github.com/GitbookIO/normall

TECHNICAL DOCUMENTATION
[9] normalization · GitHub Topics
    https://github.com/topics/normalization?l=python
[10] I18N/CanonicalNormalizationIssues - W3C Wiki
    https://www.w3.org/wiki/I18N/CanonicalNormalizationIssues
[11] icu/docs/userguide/transforms/normalization/index.md at main · unicode-org/icu
    https://github.com/unicode-org/icu/blob/main/docs/userguide/transforms/normalization/index.md
[12] Normalization | ICU Documentation
    https://unicode-org.github.io/icu/userguide/transforms/normalization/
[13] Custom Normalization | ICU Documentation
    https://unicode-org.github.io/icu/design/normalization/custom.html
[14] ICU 78.1: common/unicode/unorm.h File Reference
    https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/unorm_8h.html
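To make the proposed "BEHAVIOR WHEN ENABLED" concrete, here is a minimal Python sketch of character class folding for search. It is not LibreOffice code: the names FOLD, fold and find_all are hypothetical, and the tables contain only a subset of the code points listed above. Primes are deliberately left out, following the recommendation to keep them separate. Because every variant maps to exactly one character, match offsets in the folded text are also valid in the original text.

```python
# Hypothetical folding tables, built from a subset of the lists in this report.
SPACES = "\u00A0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u202F\u205F\u3000"
HYPHENS = "\u1806\u2010\u2011\u2012\u2013\u2014\uFE58\uFE63\uFF0D"
SINGLE_QUOTES = "\u2018\u2019\u201A\u201B\u2039\u203A\uFF07"
DOUBLE_QUOTES = "\u00AB\u00BB\u201C\u201D\u201E\u201F\uFF02\u300E\u300F"

# Map every variant to its ASCII representative (one character to one character).
FOLD = {}
for variants, target in (
    (SPACES, " "),
    (HYPHENS, "-"),
    (SINGLE_QUOTES, "'"),
    (DOUBLE_QUOTES, '"'),
):
    for ch in variants:
        FOLD[ord(ch)] = target


def fold(text: str) -> str:
    """Replace space, hyphen/dash and quote variants by their ASCII form."""
    return text.translate(FOLD)


def find_all(needle: str, haystack: str) -> list[int]:
    """Return the start offsets of all matches, comparing folded text."""
    folded_needle = fold(needle)
    folded_haystack = fold(haystack)
    hits = []
    start = folded_haystack.find(folded_needle)
    while start != -1:
        hits.append(start)
        start = folded_haystack.find(folded_needle, start + 1)
    return hits


if __name__ == "__main__":
    sample = "a-C, a\u2011C, a\u2013C, a\u2014C"
    print(find_all("a-C", sample))  # offsets of all four spellings
```

With the option disabled, the same search would simply skip the fold() calls and compare the raw strings, which is the current behavior.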
Since AI chatbots often produce text with Unicode subscripts and superscripts, it would be good to include these symbols in the search character normalization. As a result, a single search would match both H2O and H₂O.

SUPERSCRIPT CHARACTERS
Code    Char  Name
U+2070  ⁰     Superscript Zero
U+00B9  ¹     Superscript One
U+00B2  ²     Superscript Two
U+00B3  ³     Superscript Three
U+2074  ⁴     Superscript Four
U+2075  ⁵     Superscript Five
U+2076  ⁶     Superscript Six
U+2077  ⁷     Superscript Seven
U+2078  ⁸     Superscript Eight
U+2079  ⁹     Superscript Nine
U+207A  ⁺     Superscript Plus Sign
U+207B  ⁻     Superscript Minus
U+207C  ⁼     Superscript Equals Sign
U+207D  ⁽     Superscript Left Parenthesis
U+207E  ⁾     Superscript Right Parenthesis

SUBSCRIPT CHARACTERS
Code    Char  Name
U+2080  ₀     Subscript Zero
U+2081  ₁     Subscript One
U+2082  ₂     Subscript Two
U+2083  ₃     Subscript Three
U+2084  ₄     Subscript Four
U+2085  ₅     Subscript Five
U+2086  ₆     Subscript Six
U+2087  ₇     Subscript Seven
U+2088  ₈     Subscript Eight
U+2089  ₉     Subscript Nine
U+208A  ₊     Subscript Plus Sign
U+208B  ₋     Subscript Minus
U+208C  ₌     Subscript Equals Sign
U+208D  ₍     Subscript Left Parenthesis
U+208E  ₎     Subscript Right Parenthesis

[15] https://symbl.cc/en/collections/superscript-and-subscript-numbers/
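As a rough illustration of how sub- and superscript digits could be folded, the sketch below uses NFKC from Python's standard unicodedata module. The helper name fold_nfkc is hypothetical, and this says nothing about how LibreOffice would implement it. Note that NFKC is a blunt tool: it also rewrites ligatures, fullwidth forms and other compatibility characters, and it maps the non-breaking hyphen U+2011 to U+2010 rather than to the ASCII hyphen-minus, so it would not solve the original a-C case on its own.

```python
import unicodedata


def fold_nfkc(text: str) -> str:
    """Apply Unicode compatibility composition (NFKC) to the text."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # Subscript and superscript digits and signs fold to their plain forms.
    print(fold_nfkc("H\u2082O"))        # H2O
    print(fold_nfkc("Ca\u00B2\u207A"))  # Ca2+

    # NFKC turns the non-breaking hyphen into U+2010, not into U+002D.
    print(hex(ord(fold_nfkc("\u2011"))))  # 0x2010
```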
Additionally, Unicode includes superscript and subscript letters (beyond just digits) that are commonly used in scientific notation, phonetics (IPA), and mathematical expressions. These should also be included in character normalization.

MOTIVATION:
Scientific documents often mix formatting styles from different sources (manual typing, AI-generated content, copy-paste from different software), making it difficult to search for the same chemical formula or physical quantity written in different ways.

BASIC EXAMPLES:
- Searching for "xn" should match: xⁿ, xₙ
- Searching for "H2O" should match: H₂O
- Searching for "CO2" should match: CO₂, CO²
- Searching for "Ca2+" should match: Ca²⁺
- Searching for "Cp" should match: Cₚ

USE CASES IN SCIENTIFIC DOCUMENTS:

PHYSICS
  Mechanics:
    aₜ, aₙ: tangential and normal acceleration
    σₓₓ, τₓᵧ: stress tensor components
  Thermodynamics:
    Cᵥ, Cₚ: heat capacity at constant volume/pressure

CHEMISTRY
    Ca²⁺: calcium cation
    [H₃O⁺]: hydronium ion

MATERIALS SCIENCE
  Crystal structure:
    t-ZrO₂: tetragonal zirconia
    α-Fe, γ-Fe: iron crystal phases
  Material stoichiometry:
    ZrO₂₋ₓ: zirconium oxide (non-stoichiometric compound)

SUPERSCRIPT LETTERS (MODIFIERS)
Code    Char  Name
U+1D43  ᵃ     Modifier Letter Small A
U+1D47  ᵇ     Modifier Letter Small B
U+1D9C  ᶜ     Modifier Letter Small C
U+1D48  ᵈ     Modifier Letter Small D
U+1D49  ᵉ     Modifier Letter Small E
U+1DA0  ᶠ     Modifier Letter Small F
U+1D4D  ᵍ     Modifier Letter Small G
U+02B0  ʰ     Modifier Letter Small H
U+2071  ⁱ     Superscript Latin Small Letter I
U+02B2  ʲ     Modifier Letter Small J
U+1D4F  ᵏ     Modifier Letter Small K
U+02E1  ˡ     Modifier Letter Small L
U+1D50  ᵐ     Modifier Letter Small M
U+207F  ⁿ     Superscript Latin Small Letter N
U+1D52  ᵒ     Modifier Letter Small O
U+1D56  ᵖ     Modifier Letter Small P
U+02B3  ʳ     Modifier Letter Small R
U+02E2  ˢ     Modifier Letter Small S
U+1D57  ᵗ     Modifier Letter Small T
U+1D58  ᵘ     Modifier Letter Small U
U+1D5B  ᵛ     Modifier Letter Small V
U+02B7  ʷ     Modifier Letter Small W
U+02E3  ˣ     Modifier Letter Small X
U+02B8  ʸ     Modifier Letter Small Y
U+1DBB  ᶻ     Modifier Letter Small Z

SUPERSCRIPT CAPITALS
Code    Char  Name
U+1D2C  ᴬ     Modifier Letter Capital A
U+1D2E  ᴮ     Modifier Letter Capital B
U+1D30  ᴰ     Modifier Letter Capital D
U+1D31  ᴱ     Modifier Letter Capital E
U+1D33  ᴳ     Modifier Letter Capital G
U+1D34  ᴴ     Modifier Letter Capital H
U+1D35  ᴵ     Modifier Letter Capital I
U+1D36  ᴶ     Modifier Letter Capital J
U+1D37  ᴷ     Modifier Letter Capital K
U+1D38  ᴸ     Modifier Letter Capital L
U+1D39  ᴹ     Modifier Letter Capital M
U+1D3A  ᴺ     Modifier Letter Capital N
U+1D3C  ᴼ     Modifier Letter Capital O
U+1D3E  ᴾ     Modifier Letter Capital P
U+1D3F  ᴿ     Modifier Letter Capital R
U+1D40  ᵀ     Modifier Letter Capital T
U+1D41  ᵁ     Modifier Letter Capital U
U+1D42  ᵂ     Modifier Letter Capital W

SUBSCRIPT LETTERS
Code    Char  Name
U+2090  ₐ     Latin Subscript Small Letter A
U+2091  ₑ     Latin Subscript Small Letter E
U+2095  ₕ     Latin Subscript Small Letter H
U+1D62  ᵢ     Latin Subscript Small Letter I
U+2C7C  ⱼ     Latin Subscript Small Letter J
U+2096  ₖ     Latin Subscript Small Letter K
U+2097  ₗ     Latin Subscript Small Letter L
U+2098  ₘ     Latin Subscript Small Letter M
U+2099  ₙ     Latin Subscript Small Letter N
U+2092  ₒ     Latin Subscript Small Letter O
U+209A  ₚ     Latin Subscript Small Letter P
U+1D63  ᵣ     Latin Subscript Small Letter R
U+209B  ₛ     Latin Subscript Small Letter S
U+209C  ₜ     Latin Subscript Small Letter T
U+1D64  ᵤ     Latin Subscript Small Letter U
U+1D65  ᵥ     Latin Subscript Small Letter V
U+2093  ₓ     Latin Subscript Small Letter X

GREEK SUBSCRIPTS
Code    Char  Name
U+1D66  ᵦ     Greek Subscript Small Letter Beta
U+1D67  ᵧ     Greek Subscript Small Letter Gamma
U+1D68  ᵨ     Greek Subscript Small Letter Rho
U+1D69  ᵩ     Greek Subscript Small Letter Phi
U+1D6A  ᵪ     Greek Subscript Small Letter Chi

[16] https://symbl.cc/en/collections/superscript-and-subscript-letters/
[17] https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts
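One possible way to fold only the sub- and superscript characters listed above, without the wider side effects of full NFKC, is to look at each character's compatibility decomposition and act only on the <super> and <sub> tags. The sketch below does this with Python's standard unicodedata module; the function names are hypothetical and this is an illustration of the idea, not a proposal for the actual LibreOffice implementation.

```python
import unicodedata


def fold_script_char(ch: str) -> str:
    """Fold one character if it is a <super> or <sub> compatibility variant."""
    # decomposition() returns e.g. "<sub> 0032" for U+2082, or "" if none.
    parts = unicodedata.decomposition(ch).split()
    if parts and parts[0] in ("<super>", "<sub>"):
        return "".join(chr(int(code_point, 16)) for code_point in parts[1:])
    return ch


def fold_scripts_only(text: str) -> str:
    """Fold superscript/subscript characters and leave everything else alone."""
    return "".join(fold_script_char(ch) for ch in text)


if __name__ == "__main__":
    print(fold_scripts_only("x\u207F"))            # xn
    print(fold_scripts_only("C\u209A"))            # Cp
    print(fold_scripts_only("[H\u2083O\u207A]"))   # [H3O+]
    # Ligatures are not touched, unlike with NFKC.
    print(fold_scripts_only("\uFB01") == "\uFB01")  # True
```

One detail worth keeping in mind: the subscript and superscript minus (U+208B, U+207B) decompose to the minus sign U+2212, not to the hyphen-minus U+002D. A search for "ZrO2-x" would therefore only match ZrO₂₋ₓ if U+2212 were also treated as a hyphen variant.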
How about the regex support that already exists? -> https://help.libreoffice.org/latest/en-US/text/shared/01/02100001.html

You can use the existing character properties with the \p{} syntax, e.g. \p{Alnum}, \p{space}, \p{QMark}, ... or define your own ranges, e.g. [₁-₉₊₋₌₍₎].

To batch all of these, the current recommendation is to use altsearch https://extensions.libreoffice.org/en/extensions/show/70066 which moves faster than the core F&R.
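For readers who want to try the custom range idea outside LibreOffice, here is a small Python sketch. It uses Python's re module, which is not the regex engine used by LibreOffice and does not support the \p{} properties mentioned above, so only the plain bracket classes are shown. The pattern names are hypothetical.

```python
import re

# "a", then any of hyphen-minus or U+2010..U+2014, then "C".
# The leading "-" inside the brackets is a literal hyphen-minus.
HYPHEN_VARIANTS = re.compile("a[-\u2010-\u2014]C")

# The subscript class from the comment above: U+2081..U+2089 plus signs
# and parentheses (U+208A..U+208E), written out as escapes.
SUBSCRIPTS = re.compile("[\u2081-\u2089\u208A\u208B\u208C\u208D\u208E]")

if __name__ == "__main__":
    text = "a-C a\u2011C a\u2013C a\u2014C a_C"
    print(len(HYPHEN_VARIANTS.findall(text)))  # 4 (a_C is not matched)

    formula = "H\u2082O, SO\u2084\u00B2\u207B"
    print(SUBSCRIPTS.findall(formula))  # the two subscript digits only
```

The drawback compared with the requested checkbox is that the user has to write such a class for every hyphen, space or quote in the search term.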