Bug 135538 - Search-Replace: Regular Expression engine fails on zero length matches
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version (earliest affected): 3.3.0 release
Hardware: All All
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Duplicates: 52504 132870 145856
Depends on:
Blocks: Find&Replace-Regex
 
Reported: 2020-08-07 15:51 UTC by masz0
Modified: 2023-10-04 16:20 UTC (History)
CC List: 9 users

See Also:
Crash report or crash signature:


Description masz0 2020-08-07 15:51:56 UTC
Description:
It seems the regular expression engine (in Search-Replace) expects in most instances to match a string of some length > 0. It fails on zero-length matches.

Steps to reproduce:
1. Enter text in a cell in Calc, or a paragraph in Writer.
   E.g. "abcde".
2. Attempt to Search-Replace using a regular expression that would make the "match" zero-width (using any valid and text-matching combination of look-behind and look-ahead).
   E.g. "(?<=ab)"
   (but not "(?<=de)" matching text at the end of a Calc cell - see notes below)

Current behavior:
No match is found.

Expected behavior:
- Minimum:
Instead of reporting "no match", report something like "matching a zero-length string is not allowed".

Something to indicate that there isn't necessarily anything wrong with the logic of the regular expression used - LO just hasn't implemented a way to process it - regardless of whether it's a design decision, unfinished functionality, or a bug. I personally spent hours trying to get this to work, thinking it was a user/application configuration error - even an OS configuration error.

This would also be an adequate stopgap if a more comprehensive solution (like my "preferred" scheme below) were chosen but, due to prioritization or delays, took a long time to arrive.

- Preferred:
Zero-width matches should be found normally - at least as long as they have some meaningful anchor, so that they are not pathological patterns like "(?=.?)" that match at every position.

If matching at every position (the pathological case) is not allowed, more accurate reporting would be preferable: "matching at every position not allowed".

Or limit matching at every position to the selection, and return "matching at every position only allowed for selection" when attempted elsewhere.
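
One possible way to tell an anchored zero-width pattern from the pathological case is to check whether it yields an empty match at every position of a sample text. The sketch below uses Python's re module purely as a stand-in for LibreOffice's engine; the helper name matches_every_position is hypothetical and the check is only a heuristic on the given text.

```python
import re

def matches_every_position(pattern, text):
    """Return True if pattern produces a zero-length match at every
    position of text (0 .. len(text)). Hypothetical helper, heuristic only."""
    empty_starts = {
        m.start() for m in re.finditer(pattern, text) if m.start() == m.end()
    }
    return empty_starts == set(range(len(text) + 1))

pathological = matches_every_position(r"(?=.?)", "abcde")   # matches at 0..5
anchored = matches_every_position(r"(?<=ab)", "abcde")      # matches only at 2
```

A result of True would correspond to the "matching at every position not allowed" message; False would let the zero-width match through.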

Reproducible: Always

User Profile Reset: Yes


This problem affects at least Calc and Writer - I suppose the entire suite shares the same regex engine.

It is present in both the current 7.0.0.3 and the 6.x version I used a few days ago. (I thought my install might be borked due to this, so went to download the latest version to reinstall. Turns out 7 had just come out.)


Additional notes/confirmation testing:

Assuming source text "abcde", these all will match:
    (?<=ab)c
    c(?=de)
    (?<=ab)c(?=de)

But if your match is zero width (you want to add something after, before, or between), it won't match:
    (?<=ab)
    (?=cd)
    (?<=ab)(?=cd)
or even
    ^
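
For comparison, this is how a regex engine that accepts zero-length matches treats the same patterns. It is a sketch using Python's re module, not LibreOffice itself; the helper name spans is an assumption made for illustration.

```python
import re

text = "abcde"

def spans(pattern, s):
    """Return (start, end) for every match, including zero-length ones."""
    return [m.span() for m in re.finditer(pattern, s)]

# Patterns that consume the character "c": one match of length 1.
with_width = {p: spans(p, text)
              for p in [r"(?<=ab)c", r"c(?=de)", r"(?<=ab)c(?=de)"]}

# Pure assertions: one match of length 0 (start == end).
zero_width = {p: spans(p, text)
              for p in [r"(?<=ab)", r"(?=cd)", r"(?<=ab)(?=cd)", r"^"]}

for pattern, found in {**with_width, **zero_width}.items():
    print(f"{pattern!r:22} -> {found}")
```

Here the first group reports the span (2, 3), the look-around-only patterns report (2, 2), and "^" reports (0, 0).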

Of course depending on the situation, this problem can be sidestepped by doing something like "(ab)" -> "$1addthis".
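
The equivalence behind that workaround can be sketched in Python's re module (again only a stand-in for LibreOffice; Python writes the back-reference as \1 where the Search-Replace dialog uses $1):

```python
import re

text = "abcde"

# Workaround: capture the context and put it back in the replacement.
via_capture = re.sub(r"(ab)", r"\1addthis", text)

# Intended form: insert at the zero-width position after "ab".
via_lookbehind = re.sub(r"(?<=ab)", "addthis", text)

print(via_capture, via_lookbehind)
```

Both calls produce the same string here, which is why the capture-group form works as a substitute in simple cases.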

Something special is going on with "end of line", in that
    $
    (?=$)
both work (in Calc and Writer).

In Calc, still assuming text "abcde", even
    (?<=de)
works when "de" is found at the end of a cell, but not elsewhere.



My 2 systems:
Windows 10 64-bit 1909 (Windows Beta Unicode UTF-8 support enabled)
Windows 10 64-bit 2004 (Windows Beta Unicode UTF-8 support enabled/disabled; also tried resetting profile)
Comment 1 Michael Warner 2020-08-08 15:02:31 UTC
I am able to confirm this in:

Version: 6.0.7.3
Build ID: 1:6.0.7-0ubuntu0.18.04.10
CPU threads: 4; OS: Linux 4.15; UI render: default; VCL: gtk3; 
Locale: en-US (en_US.UTF-8); Calc: group


I didn't trace through it while executing, so I may be looking at the wrong place for this particular test case, but core/i18npool/source/search/textsearch.cxx lines 942-952 state explicitly that they are there to ignore zero-length matches. The specific comment is this:
        // #i118887# ignore zero-length matches e.g. "a*" in "bc"

It was a decision made in OpenOffice (I added the link to their bug in the See Also field).

So this is intended behavior to avoid the matching-every-position case, not a bug. 

Whether it should be intended behavior and how to address it is another question. Personally, I tend to think that users searching for regular expressions are knowledgeable about the regex pattern they are providing (or should be) and therefore we should match the pattern as written.
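
To illustrate the effect of such a filter, here is a minimal sketch in Python. It is not the code from textsearch.cxx; the function find_first and its ignore_zero_length flag are hypothetical, and the sketch does not try to reproduce the end-of-line exception noted in the description.

```python
import re

def find_first(pattern, text, ignore_zero_length=True):
    """Return the span of the first accepted match, or None.
    With ignore_zero_length=True, matches whose start equals their
    end are skipped, mimicking the behaviour described above."""
    for m in re.finditer(pattern, text):
        if ignore_zero_length and m.start() == m.end():
            continue  # discard zero-length match, e.g. "a*" in "bc"
        return m.span()
    return None

filtered_star = find_first(r"a*", "bc")                 # every match is empty
filtered_anchor = find_first(r"(?<=ab)", "abcde")       # also discarded
unfiltered_anchor = find_first(r"(?<=ab)", "abcde", ignore_zero_length=False)
```

The filter suppresses the match-everywhere case for "a*", but it also hides the anchored look-behind match that the reporter expected.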
Comment 2 Michael Warner 2020-08-31 12:59:03 UTC
*** Bug 52504 has been marked as a duplicate of this bug. ***
Comment 3 Michael Warner 2020-08-31 13:00:58 UTC
*** Bug 132870 has been marked as a duplicate of this bug. ***
Comment 4 Heiko Tietze 2020-08-31 13:31:56 UTC
IIUC, the original request was to find digits like ABC1EFG per "\d *". Works for me with and without the code around nStartOfs/nEndOfs returning "Search key not found" for ABC-EFG.

I don't see much benefit from adding a note about zero-length matches to the UI, although it is easy to implement and would unobtrusively replace the "Search key not found" label. The point is that you get zero results anyway. But I have no objection to implementing this.
Comment 5 Michael Warner 2020-08-31 16:50:00 UTC
(In reply to Heiko Tietze from comment #4)
> IIUC, the original request was to find digits like ABC1EFG per "\d *". Works

If I am not mistaken, "\d *" has a minimum length of one (a single digit), so is not an example of this bug. Trying to match "\d*" instead would have zero length.
Comment 6 Michael Warner 2020-08-31 16:52:12 UTC
(In reply to Michael Warner from comment #5)
> (In reply to Heiko Tietze from comment #4)
> > IIUC, the original request was to find digits like ABC1EFG per "\d *". Works
> 
> If I am not mistaken, "\d *" has a minimum length of one (a single digit),
> so is not an example of this bug. Trying to match "\d*" instead would have
> zero length.

But searching for "\d*" would match everywhere, which is not actually that useful. Where allowing zero-length matches would be useful is with anchors, as in the original request of this bug or the other ones linked in the See Also section.
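
The difference can be seen in an engine that reports zero-length matches. This is a sketch in Python's re module, not LibreOffice; the look-behind "(?<=ABC)" is an example pattern chosen here for illustration.

```python
import re

text = "ABC-EFG"

# "\d*" finds no digits, so it matches the empty string at every position.
everywhere = [m.start() for m in re.finditer(r"\d*", text)]

# An anchored zero-width pattern matches at exactly one position.
anchored = [m.start() for m in re.finditer(r"(?<=ABC)", text)]

print(everywhere, anchored)
```

The unanchored pattern yields one empty match per position (eight for this seven-character string), while the anchored one yields a single, meaningful insertion point.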
Comment 7 masz0 2020-08-31 16:54:55 UTC
(In reply to Michael Warner from comment #5)
> (In reply to Heiko Tietze from comment #4)
> > IIUC, the original request was to find digits like ABC1EFG per "\d *". Works
> 
> If I am not mistaken, "\d *" has a minimum length of one (a single digit),
> so is not an example of this bug. Trying to match "\d*" instead would have
> zero length.

No, "\d *" tries to match for 1 digit, followed by 0+ spaces.
Comment 8 Michael Warner 2020-08-31 22:42:15 UTC
(In reply to masz0 from comment #7)
> (In reply to Michael Warner from comment #5)
> > (In reply to Heiko Tietze from comment #4)
> > > IIUC, the original request was to find digits like ABC1EFG per "\d *". Works
> > 
> > If I am not mistaken, "\d *" has a minimum length of one (a single digit),
> > so is not an example of this bug. Trying to match "\d*" instead would have
> > zero length.
> 
> No, "\d *" tries to match for 1 digit, followed by 0+ spaces.

Which is what I was trying to say. At any rate, I don't think it is a valid test case for the bug you reported, please correct me if I am wrong.
Comment 9 masz0 2020-08-31 23:09:56 UTC
(In reply to Michael Warner from comment #8)
> (In reply to masz0 from comment #7)
> > (In reply to Michael Warner from comment #5)
> > > (In reply to Heiko Tietze from comment #4)
> > > > IIUC, the original request was to find digits like ABC1EFG per "\d *". Works
> > > 
> > > If I am not mistaken, "\d *" has a minimum length of one (a single digit),
> > > so is not an example of this bug. Trying to match "\d*" instead would have
> > > zero length.
> > 
> > No, "\d *" tries to match for 1 digit, followed by 0+ spaces.
> 
> Which is what I was trying to say. At any rate, I don't think it is a valid
> test case for the bug you reported, please correct me if I am wrong.

Oh, sorry, I misunderstood.

Affirmative for "\d *" being an invalid test.

Since it requires and matches one digit, not having any in the input (ABC-EFG) will make it fail (legitimately; not through the artificial limitation).

If the input does have digits (ABC1EFG), the pattern will match each in turn. The matches will be length 1 (or more where followed by one or more spaces) - therefore LO won't discard them.

My problem was specifically about zero-width assertions "(?<=..)", "(?<!..)", "(?=..)", "(?!..)", "^", and combinations of them. Unlike them, standalone "X*" isn't very useful even though it too can be zero-length.
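
A quick check of why "\d *" is not a test case for this bug, sketched in Python's re module (a stand-in, not LibreOffice; the third input with spaces is an added example):

```python
import re

pattern = r"\d *"  # one digit, followed by zero or more spaces

no_digits = re.findall(pattern, "ABC-EFG")      # nothing to match
one_digit = re.findall(pattern, "ABC1EFG")      # the digit alone
with_spaces = re.findall(pattern, "ABC1  EFG")  # digit plus trailing spaces

# Every match has length >= 1, so a zero-length filter never applies.
lengths = [len(m) for m in one_digit + with_spaces]
print(no_digits, one_digit, with_spaces, lengths)
```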
Comment 10 Heiko Tietze 2020-09-01 09:39:52 UTC
Whatever the best example is, if someone volunteers, the label can be used to give feedback without any detrimental effect on usability.
Comment 11 Xisco Faulí 2021-02-09 14:08:24 UTC
Dear Michael Warner,
This bug has been in ASSIGNED status for more than 3 months without any
activity. Resetting it to NEW.
Please assign it back to yourself if you're still working on this.
Comment 12 Michael Warner 2021-11-19 15:32:25 UTC
*** Bug 145774 has been marked as a duplicate of this bug. ***
Comment 13 LeroyG 2021-11-23 20:23:19 UTC
*** Bug 145856 has been marked as a duplicate of this bug. ***
Comment 14 Edier Guzman 2021-11-23 20:43:10 UTC
Hi, same behaviour here with version 7.2.2.2:

Version: 7.2.2.2
Build ID: 20(Build:2)
CPU threads: 4; OS: Linux 5.14; UI render: default; VCL: gtk3
Locale: en-GB (en_GB.UTF-8); UI: en-US
Calc: threaded

After trying to find matches with regular expression '^' in order to put a single quote at the beginning of each cell, Calc will say that there is no match.

That is incorrect behaviour, as ^ is a valid regular expression that matches the beginning of a string.
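
For reference, the intended operation looks like this in an engine that accepts the zero-length match of "^". It is a sketch in Python's re module; the cell values are made-up sample data.

```python
import re

cells = ["abcde", "123"]  # hypothetical cell contents

# Replace the zero-width start-of-string match with a single quote.
quoted = [re.sub(r"^", "'", cell) for cell in cells]

print(quoted)
```

Each value comes back with a single quote prepended, which is the result expected from Calc.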