147875 – REGEX returns dummy empty matches for already processed chunks (?)

Status:	RESOLVED NOTABUG

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Calc
Version: (earliest affected)	unspecified
Hardware:	All All

Importance:	medium normal
Assignee:	Mike Kaganski

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2022-03-09 14:49 UTC by Mike Kaganski
Modified:	2022-03-18 14:00 UTC
CC List:	4 users

See Also:
Crash report or crash signature:

Description Mike Kaganski 2022-03-09 14:49:03 UTC

Put '=REGEX("111;;222;333;555";"[^;]*";;ROW())' into A1. Drag-copy down to A10.

The expectation is that the regex will split the string into tokens separated by ";". So the expected result is:

A1 = '111'
A2 = '' (empty string)
A3 = '222'
A4 = '333'
A5 = '555'
A6 and later = '#N/A'

The actual results with Version: 7.3.1.3 (x64) / LibreOffice Community
Build ID: a69ca51ded25f3eefd52d7bf9a5fad8c90b87951
CPU threads: 12; OS: Windows 10.0 Build 19044; UI render: Skia/Raster; VCL: win
Locale: ru-RU (ru_RU); UI: en-US
Calc: CL:

A1 = '111'
A2 = '' (empty string)
A3 = '' (empty string)
A4 = '222'
A5 = '' (empty string)
A6 = '333'
A7 = '' (empty string)
A8 = '555'
A9 = '' (empty string)
A10 and later = '#N/A'

So there are unexpected empty matches *after* each correct non-empty match.
Of course, the regex can easily be tweaked to work around this, e.g.

 =REGEX("111;;222;333;555";"(?<=^|;)[^;]*";;ROW())
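The match sequence reported above can be reproduced outside Calc. The following is a minimal sketch using Python's re module (3.7 or later) as a stand-in for the regex engine behind REGEX(); it is not the engine Calc uses, but for this pattern it yields the same sequence. One assumption to note: Python only accepts fixed-width look-behind, so the workaround is rewritten here as (?:^|(?<=;))[^;]*, which is intended to behave the same way for this input.

```python
import re

TEXT = "111;;222;333;555"

# Plain pattern: any position where zero non-';' characters can be
# matched also produces a zero-length match.
plain = re.findall(r"[^;]*", TEXT)

# Workaround from the report, adapted for Python's fixed-width
# look-behind: a match may start only at the beginning of the string
# or directly after a ';'.
anchored = re.findall(r"(?:^|(?<=;))[^;]*", TEXT)

print(plain)
print(anchored)
```

On Python 3.7+ the first list has nine entries (matching A1 to A9 in the actual results) and the second has the five expected tokens.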

Comment 1 Roman Kuznetsov 2022-03-09 21:05:34 UTC

https://gerrit.libreoffice.org/c/core/+/131269

Comment 2 Eike Rathke 2022-03-11 16:55:09 UTC

The empty matches aren't unexpected. The [^;]* pattern looks for any number of non-; characters, including 0, so each ; is a match as well. Using a look-behind assertion  (?<=^|;)[^;]*  is a correct pattern for the desired match. Expecting [^;]* alone to produce that result is a wrong expectation.

This is not a bug.

Comment 3 Mike Kaganski 2022-03-11 17:05:44 UTC

(In reply to Eike Rathke from comment #2)
> The empty matches aren't unexpected. The [^;]* pattern looks for any number
> of non-; characters, including 0, so each ; is a match as well.

I agree that this is not a bug - *in the sense that this is not OUR bug*. We use a regex library; it uses the approach that you explain. It looks like this approach is *predominant*, and thus it's not a bug.

However, I still believe that this predominant way of work is *wrong* (internally self-inconsistent). A greedy operator *must* return *everything* until non-matching token; if after that result, it returns *anything* else from the same range, it means that the greedy operator failed to include that empty string into the previous result.

I.e., empty strings between all characters must be treated as separate tokens from the algorithm PoV; matching those tokens must be performed in the same way as matching normal characters; and "[^;]*" in "111;;222;333;555" must match not only "111", but also the empty string between 111 and ;, so using "|" for empty strings, the original string must be

> ^1|1|1|;|;|2|2|2|;|3|3|3|;|5|5|5$

and the results (in tokens, delimited by square brackets) must be

> [^1|1|1|];[|];[|2|2|2|];[|3|3|3|];[|5|5|5$]
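The rule proposed in this comment can be approximated as a post-filter on top of a conventional engine: drop a zero-length match that starts exactly where the previously returned match ended. The sketch below uses Python's re module (3.7 or later); the function name matches_skipping_adjacent_empty is hypothetical and exists only for this illustration. It is not how any engine mentioned in this report is implemented.

```python
import re


def matches_skipping_adjacent_empty(pattern, text):
    """Return matched strings, dropping a zero-length match that starts
    exactly where the previously returned match ended."""
    result = []
    prev_end = None  # end offset of the last match that was kept
    for m in re.finditer(pattern, text):
        if m.start() == m.end() == prev_end:
            continue  # empty match glued to the end of the previous match
        result.append(m.group())
        prev_end = m.end()
    return result


tokens = matches_skipping_adjacent_empty(r"[^;]*", "111;;222;333;555")
print(tokens)
```

With this filter the empty token between the two adjacent ";" survives, because it does not start at the end of the preceding match, while the empty matches directly after "111", "222", "333" and "555" are dropped.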

Comment 4 Eike Rathke 2022-03-11 18:21:54 UTC

(In reply to Mike Kaganski from comment #3)
> A greedy operator *must* return *everything*
> until non-matching token; if after that result, it returns *anything* else
> from the same range, it means that the greedy operator failed to include
> that empty string into the previous result.
If you try this at https://regex101.com/ you'll see that all 7 regex flavors agree on the same 9 matches. If you switch to PCRE/PCRE2 with Ungreedy option set you'll see the single characters result similar to your example, but giving 29 matches (two additional null matches, one in front and one at end).
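The ungreedy count can be cross-checked with a small sketch. It assumes Python's re module (3.7 or later) rather than PCRE, and writes the lazy quantifier explicitly as [^;]*? because Python has no global Ungreedy option; other engines may count differently.

```python
import re

TEXT = "111;;222;333;555"

# Lazy variant: the engine prefers the shortest match, so it returns an
# empty match at a position first and a one-character match next.
lazy = re.findall(r"[^;]*?", TEXT)

empty = [m for m in lazy if m == ""]
single = [m for m in lazy if m != ""]

print(len(lazy), len(empty), len(single))
```

Under these assumptions the total is 29: one empty match at each of the 17 positions of the 16-character string (including the very start and the very end), plus the 12 single non-";" characters.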

Comment 5 Mike Kaganski 2022-03-11 18:26:13 UTC

(In reply to Eike Rathke from comment #4)
> If you try this at https://regex101.com/ you'll see that all 7 regex flavors
> agree on the same 9 matches.

Sure; I agree that current behavior is predominant :)

> If you switch to PCRE/PCRE2 with Ungreedy
> option set you'll see the single characters result similar to your example,
> but giving 29 matches (two additional null matches, one in front and one at
> end).

Yes, it finds the tokens [^] and [$] mentioned in my example :)

The one engine following my idea is mentioned at https://www.regular-expressions.info/zerolength.html.

Comment 6 Igor 2022-03-12 08:36:34 UTC

The behavior of the REGEX function has been changed, probably not intentionally: the function should skip empty strings by default.

Currently, the result of the function (regarding the return of empty strings) is different from the result of the searchForward method of com.sun.star.util.TextSearch object (service), where regex empty strings are skipped by default.

Comment 7 Eike Rathke 2022-03-14 13:45:06 UTC

(In reply to Igor from comment #6)
> The behavior of the REGEX function has been changed
No, it has not, it is the same since its first implementation in 6.2.z.

> Currently, the result of the function (regarding the return of empty
> strings) is different from the result of the searchForward method of
> com.sun.star.util.TextSearch object (service), where regex empty strings are
> skipped by default.
Yes (and you can't change it so it's not even a default), but css::util::TextSearch does all sorts of tweaking for backwards (also API behaviour) compatibility with the old OOo engine plus some, including skipping zero-length matches. The Calc REGEX() function uses the ICU regex matcher as is.

I can see why these zero-length matches may be regarded as unexpected, but omitting them for the REGEX() function would mean deviating from what 8 regex engines agree upon, which may be even more unexpected. As is, at least one can try, test and compare expressions with the (Java) one offered at, for example, regex101.com (note some subtle differences mentioned at https://unicode-org.github.io/icu/userguide/strings/regexp.html#differences-with-java-regular-expressions ).
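To illustrate the trade-off, here is a minimal sketch of what omitting all zero-length matches would do to the example from the description. It uses Python's re module (3.7 or later) as a stand-in and a plain filter; it is not the css::util::TextSearch code, only an assumption about the effect of skipping zero-length matches.

```python
import re

TEXT = "111;;222;333;555"

# Matches as the engine reports them, zero-length ones included.
as_is = [m.group() for m in re.finditer(r"[^;]*", TEXT)]

# Same matches with every zero-length match filtered out.
non_empty = [s for s in as_is if s]

print(as_is)
print(non_empty)
```

Filtering removes the empty matches after each token, but it also removes the empty field between the two adjacent ";", so the result has four tokens instead of five.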

Comment 8 Eike Rathke 2022-03-18 13:56:56 UTC

As a side note, funnily enough, bug 135538 complains that the regex search from the dialog does not find zero-length matches (that search uses exactly css::util::TextSearch, where those are filtered out).

Comment 9 Mike Kaganski 2022-03-18 14:00:03 UTC

(In reply to Eike Rathke from comment #8)

:-D

(Just to note that IMO empty matches *must* be found, just not when they are part of an already processed chunk - see the expected empty match (A2) in my description.)