106137 – Find & Replace: handle \uhhhh and \Uhhhhhhhh notation in replacement string

Status:	NEW

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	LibreOffice
Version: (earliest affected)	unspecified
Hardware:	All All

Importance:	medium enhancement
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:	difficultyMedium, easyHack, skillCpp

Duplicates (1):	113992
Depends on:
Blocks:	Find-Search Find&Replace-Regex

Reported:	2017-02-22 12:01 UTC by Wolfgang Jäger
Modified:	2025-03-27 16:48 UTC
CC List:	11 users

See Also:	102374 113992 45344 157303 159002
Crash report or crash signature:

Attachments
Broken results of Find and Replace using regular expression for replacement (57.79 KB, image/png) 2017-02-22 14:36 UTC, V Stuart Foote

Description Wolfgang Jäger 2017-02-22 12:01:38 UTC

F&R with 'Regular expressions' allowed under 'Other options':

Characters given by their Unicode code point, like \u000A (LF) or \u0041 (upper case letter A), in the 'Replace:' edit line are not interpreted. The notation is treated and inserted as a string of six literal characters.

The bug is present in Writer and in Calc as well.

I tested some older versions (since \uhhhh was introduced) and did not find one in which a \u replacement worked as expected in this respect.

Comment 1 V Stuart Foote 2017-02-22 14:36:25 UTC

Created attachment 131406
Broken results of Find and Replace using regular expression for replacement

Confirmed on master. The replacement string is not parsed for ICU regular expression syntax, but probably should be.

Version: 5.4.0.0.alpha0+
Build ID: f0c7cbe1d8505d3c1f5a2b2253efda35542c898b
CPU threads: 8; OS: Windows 6.19; UI render: GL; 
TinderBox: Win-x86@39, Branch:master, Time: 2017-02-22_04:35:00
Locale: en-US (en_US); Calc: CL

Comment 2 V Stuart Foote 2017-02-22 16:31:39 UTC

With RegEx search enabled, are the replacement strings parsed against ICU regex? I expect not, which would explain this behavior.

Looking at textsearch.hxx and textsearch.cxx, we do not, as far as I could see.

Allowing the option to interpret regex in the replacement string seems a reasonable enhancement to the Find and Replace dialog.

Comment 3 Eike Rathke 2017-02-22 17:27:59 UTC

The replacement string *IS NOT* a regular expression to match anything and it doesn't make sense to treat it as such.

However, we maybe could add the \u and \U notations to be used in the Replacement string as well, ignoring all control characters below 0x20 except CR, LF and TAB. Note that \t (and \n in Writer) are already supported, see https://help.libreoffice.org/Common/List_of_Regular_Expressions
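
The filtering rule proposed above can be written as a small predicate. This is only a minimal sketch: the function name is hypothetical and not part of any LibreOffice API, and the rejection of surrogate values and of values above U+10FFFF is an added assumption that the comment does not state.

```cpp
#include <cstdint>

// Hypothetical helper, not existing LibreOffice code.
// Rule from comment 3: control characters below 0x20 are ignored,
// except TAB (0x09), LF (0x0A) and CR (0x0D).
// Assumption beyond comment 3: lone surrogates and values above
// U+10FFFF are rejected as well, since they are not valid scalar values.
constexpr bool isAllowedReplacementCodePoint(std::uint32_t cp)
{
    return cp <= 0x10FFFF
        && !(cp >= 0xD800 && cp <= 0xDFFF)
        && (cp >= 0x20 || cp == 0x09 || cp == 0x0A || cp == 0x0D);
}
```

Under these assumptions a code point of 0x0A (LF) would be accepted and 0x07 (BEL) would be rejected.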

Comment 4 V Stuart Foote 2017-04-26 19:52:46 UTC

(In reply to Eike Rathke from comment #3)
> The replacement string *IS NOT* a regular expression to match anything and
> it doesn't make sense to treat it as such.
> 
> However, we maybe could add the \u and \U notations to be used in the
> Replacement string as well, ignoring all control characters below 0x20
> except CR, LF and TAB. Note that \t (and \n in Writer) are already
> supported, see
> https://help.libreoffice.org/Common/List_of_Regular_Expressions

While fixing the very annoying omission of the \t wildcard/regex for replacement strings against some Find results (bug 102374), could we handle the \u, \U flavors as well?

Rather than picking from the Special Character dialog, direct entry by Unicode point would be very appealing. Maybe even provide for the U+hhhhh notation we use with the <Alt>+X Unicode toggle?

Comment 5 Stephan Bergmann 2017-04-27 07:15:56 UTC

(In reply to V Stuart Foote from comment #4)
> Maybe even provide for the U+hhhhh
> notation we use with the <Alt>+X Unicode toggle?

how would you escape that if you want it verbatim?

Comment 6 V Stuart Foote 2017-04-27 12:10:33 UTC

(In reply to Stephan Bergmann from comment #5)
> (In reply to V Stuart Foote from comment #4)
> > Maybe even provide for the U+hhhhh
> > notation we use with the <Alt>+X Unicode toggle?
> 
> how would you escape that if you want it verbatim?

Maybe a leading "'" as used for text cells in Calc? Or even just use <Alt>+X while in the Replace box field to toggle to the glyph, as long as it is picked up in the replacement string.

Comment 7 Wolfgang Jäger 2017-04-27 13:45:59 UTC

Of course, readers here will understand that the replacement string is not a RegEx. That won't keep me from expressing the hope that one day the means to compose replacement strings will be as near as possible to those used for RegEx search expressions. I also would suggest the introduction of an option to use "StrictRegex" exactly supporting the features of the specific engine. Name the engine (provider/flavor) in the help texts. Link the complete specification. No help text can replace that. A huge waste of power!

This should include dropping the unsystematic "&" for 'Everything found'. Even though this is not supported as a RegEx by the engine itself, a "\0" is more logical or "respective" for replacement. And of course \1 ... used as backreferences in RegEx is better than the stubborn $1 ... in replacement strings, too. Using \n in gravely different meanings here and there is a sin.

Of course getting search expressions and replacement strings under the "same" (very similar) syntax will require more escaping in replacement.

I am told RegEx aren't used much. Maybe. But when used by experienced persons they are very valuable AND their support is a relevant advantage of this free Office over the main commercial competitor's. The compatibility strategy is not as successful as once expected and may even reach its end one day. Being better should then get relevant again.

Yes. I am aware of the fact that this is an "OffTopic" concerning the subject. Writing 5 new reports would not find anybody putting things together, I am afraid. Atomisation cannot be the way.

Comment 8 Wolfgang Jäger 2017-04-27 14:44:25 UTC

(In reply to Eike Rathke from comment #3)
> ... Note that \t (and \n in Writer) are already
> supported, see
> https://help.libreoffice.org/Common/List_of_Regular_Expressions

From the linked help: 
\n in the Replace text box stands for a paragraph break that can be entered with the Enter or Return key. 

In fact I do not know how to insert hard line breaks via a replacement string of 'F&R' currently. Have to use the commendable but loafing AltSearch - and to keep additional controls.

Comment 9 V Stuart Foote 2017-11-23 06:01:39 UTC

*** Bug 113992 has been marked as a duplicate of this bug. ***

Comment 10 Justin L 2018-02-08 08:26:48 UTC

(In reply to V Stuart Foote from comment #4)
>  Maybe even provide for the U+hhhhh
> notation we use with the <Alt>+X Unicode toggle?

Interesting idea to support Alt-X unicode toggle in the find/replace dialog. (Well, that's not exactly what you said, but that would be cool.) Unfortunately, Alt-X is already a shortcut for "find ne~xt", so that won't work.

The code path for handling keyboard entry in the find/replace box is vcl/source/control/edit.cxx:ImplHandleKeyEvent() and vcl/source/window/dlgctrl.cxx:ImplFindAccelWindow() for finding the current alt-x implementation.

Comment 11 Eike Rathke 2022-10-10 14:38:31 UTC

Operating systems / window managers nowadays should have functionality for that, e.g. in GNOME it's Shift+Ctrl+U to enter the numeric (decimal or hex) sequence of a Unicode character that also works in dialogs.
Implementing this at the application level IMHO is unnecessary (and supporting/overriding Alt+X in the dialog an ugly hack anyway where furthermore keyboard shortcuts depend on UI localization).

Btw, instead of the outdated https://help.libreoffice.org/Common/List_of_Regular_Expressions wiki page, rather refer to the current online help https://help.libreoffice.org/latest/en-US/text/shared/01/02100001.html that for \n in Replace says
"Has no special meaning in Calc, and is treated literally there."
(though the rest of that paragraph about paragraph break with Enter or Return is confusing because of the structure of the sentence).

Comment 12 Eike Rathke 2022-10-10 15:10:20 UTC

(just clarified, see https://bugs.documentfoundation.org/show_bug.cgi?id=43107#c27 )

Comment 13 Mike Kaganski 2024-09-09 13:37:57 UTC

This is rather straightforward: grep the codebase for places that make use of i18nutil::SearchOptions2::replaceString, and in those cases where this string is used to replace something, and the mode is regex, pre-process it to replace the \uhhhh and \Uhhhhhhhh with their Unicode counterpart strings (mind that the replacement may be more than a single character!). This should be implemented as a function in i18nutil; it must take comment 3 into account; and it must take care of the usual escaping (note comment 5, and take a look at how the replacement currently works for a regex in Writer with \t and \\t as replacement string).

Where possible, this replacement should be done once, when a function actually doing the search would get the replacement string for the first time. Where not easy, due to the actual algorithm, this optimization can be left for a follow-up.

Indeed, the fixes need to be modular - no need to do it at once for all modules. A commit adding the function to process the replacement string; a commit to introduce it to Writer; then to Calc, etc.

All commits need to have unit tests.
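
The pre-processing step described in this comment could look roughly like the following. It is a hedged sketch rather than the actual fix: all function names are hypothetical, std::u16string stands in for OUString so that the example is self-contained, dropping disallowed code points is only one reading of "ignoring" in comment 3, and rejecting surrogates and values above U+10FFFF is an added assumption. An escaped backslash pair is copied through unchanged so that a later stage handling \t and \\t still sees it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// All names below are hypothetical; none of this is existing i18nutil API.

// Comment 3 rule; rejecting surrogates and values above U+10FFFF is an assumption.
inline bool keepCodePoint(std::uint32_t cp)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;
    return cp >= 0x20 || cp == 0x09 || cp == 0x0A || cp == 0x0D;
}

// Read exactly `count` hex digits at `pos`; false if too short or not hex.
inline bool parseHex(const std::u16string& s, std::size_t pos,
                     std::size_t count, std::uint32_t& value)
{
    if (count > s.size() || pos > s.size() - count)
        return false;
    value = 0;
    for (std::size_t i = 0; i < count; ++i)
    {
        const char16_t c = s[pos + i];
        std::uint32_t digit;
        if (c >= u'0' && c <= u'9')      digit = c - u'0';
        else if (c >= u'a' && c <= u'f') digit = c - u'a' + 10;
        else if (c >= u'A' && c <= u'F') digit = c - u'A' + 10;
        else return false;
        value = (value << 4) | digit;
    }
    return true;
}

// Append a code point as UTF-16; values above U+FFFF become two code units.
inline void appendUtf16(std::u16string& out, std::uint32_t cp)
{
    if (cp < 0x10000)
    {
        out.push_back(static_cast<char16_t>(cp));
        return;
    }
    cp -= 0x10000;
    out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
    out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
}

// Expand the 4-digit and 8-digit escapes in a regex-mode replacement string.
inline std::u16string expandUnicodeEscapes(const std::u16string& in)
{
    std::u16string out;
    out.reserve(in.size());
    std::size_t i = 0;
    while (i < in.size())
    {
        if (in[i] != u'\\' || i + 1 >= in.size())
        {
            out.push_back(in[i++]); // ordinary character or trailing backslash
            continue;
        }
        const char16_t next = in[i + 1];
        const std::size_t digits = next == u'u' ? 4 : next == u'U' ? 8 : 0;
        std::uint32_t cp = 0;
        if (digits == 0 || !parseHex(in, i + 2, digits, cp))
        {
            // Escaped backslash, another escape, or a malformed one:
            // copy the pair unchanged for the later replacement stage.
            out.append(in, i, 2);
            i += 2;
            continue;
        }
        if (keepCodePoint(cp))
            appendUtf16(out, cp); // disallowed code points are dropped
        i += 2 + digits;
    }
    return out;
}

// Usage checks in the spirit of the unit tests requested above.
inline void usageExample()
{
    assert(expandUnicodeEscapes(u"x\\u0041y") == u"xAy");
    assert(expandUnicodeEscapes(u"\\\\u0041") == u"\\\\u0041"); // escaped backslash
    assert(expandUnicodeEscapes(u"\\U0001F600").size() == 2);   // surrogate pair
}
```

The sketch does not address one part of "the usual escaping": a character produced by the expansion, such as $, & or a backslash, might be reinterpreted by the later replacement stage and would probably need to be escaped on output.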