118505 – When using regex and find.text different positions (begining or the end of string) are returned.

Bug 118505 - When using regex and find.text different positions (begining or the end of string) are returned.

Summary: When using regex and find.text different positions (begining or the end of st...

Status:	CLOSED NOTABUG

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Calc (show other bugs)
Version: (earliest affected)	6.0.5.1 rc
Hardware:	x86-64 (AMD64) All

Importance:	medium normal
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:	implementationError

Depends on:
Blocks:

Reported:	2018-07-02 15:53 UTC by Maciej Kotliński
Modified:	2018-07-03 16:16 UTC (History)
CC List:	2 users (show)

See Also:
Crash report or crash signature:

Attachments
An example in .ODS (10.31 KB, application/vnd.oasis.opendocument.spreadsheet) 2018-07-02 15:54 UTC, Maciej Kotliński	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Maciej Kotliński 2018-07-02 15:53:01 UTC

Description:
Find returns the beginning or the end of string found when using regex.
The easiest way to explain this is an example:

The string  result   formula
123456789     5     =SZUKAJ.TEKST("3?5"; A1)
123456789     3     =SZUKAJ.TEKST("345"; B2)
123456790     3     =SZUKAJ.TEKST("3.5"; B3)
123456791     5     =SZUKAJ.TEKST("3*5"; B4)
123456792     3     =SZUKAJ.TEKST("3[0-9]5"; B5)
123456795     3     =SZUKAJ.TEKST("3[0-9]*5"; B6)
(Polish szukaj.text = find)

One can notice that when we use "?" or "*" we get the position of the last character.
When "." or [0-9] or [0-9]* is used the position of the first character is returned which, I suppose, is correct.



Steps to Reproduce:
1. Use find function with '.' and '?' locate some substring in string e.g "345" (test "345", "3.5" and "3?5") in "123456789".
2. Compare the results (returned position)


Actual Results:
when searching "345" in "123456789" position 3 is returned (correct)
when searching "3?5" in "123456789" position 5 is returned (not correct)
when searching "3*5" in "123456789" position 5 is returned (not correct)
when searching "3[0-9]5" in "123456789" position 3 is returned (correct)

Expected Results:
I suppose that always the begining of found substring should be returned (3 in the above example).


Reproducible: Always


User Profile Reset: No


OpenGL enabled: Yes

Additional Info:
Tested on 6.0.5.1(Linux), 6.0.5.2 (win), 6.0.2.1 (win). Always the same.

Comment 1 Maciej Kotliński 2018-07-02 15:54:00 UTC

Created attachment 143267 [details]
An example in .ODS

Comment 2 V Stuart Foote 2018-07-02 18:23:22 UTC

Text search using glob() sytle "wildcard" syntax was implemented for bug 72196 [1] and compatibility with MS Excel products.

Otherwise IIUC Search and our Regex implementation is provided by ICU libs.

They are completely different search implementations. Personally I see no reason to go back in and adjust the search logic or resulting pointer--but others can decide that.

@Eike?

=-ref-=

[1] https://cgit.freedesktop.org/libreoffice/core/commit/?id=f1a4663c819bf698f95a75b5a3319506c66f2778

Comment 3 Eike Rathke 2018-07-02 21:11:06 UTC

This has nothing to do with Wildcard search, the document has Regular Expressions enabled.

This is not a bug. One has to understand how regular expressions work. Specifically the ? and * operators can not be used the same as in wildcards.

3?5 means zero or one 3 followed by one 5. For the sequence "345" this matches only 5.

3*5 means zero or more 3 followed by one 5. For the sequence "345" this matches only 5.

For the available metacharacters, operators and set expressions in ICU and thus LibreOffice see http://userguide.icu-project.org/strings/regexp

Comment 4 Maciej Kotliński 2018-07-03 13:06:03 UTC

Actually I've got this problem when striping the last digit (and possibly a character) from the name names of samples from some analyses.
I've got such names as in first column and would like to strip last digit and possibly letter. I was surprised to get different results for "KHe3r" and "KHe3" when using "[0-9]?$". One can deal with the problem with "[0-9][a-z]{0,1}$" which work as I expected.

 =SZUKAJ.TEKST(REGEX; AX)

         "[0-9]?$"  stripped text*  "[0-9][a-z]{0,1}$"  stripped text*
K-7dpi3      7          K-7dpi             7             K-7dpi
KHe1         4          KHe                4             KHe
KHe2s        6          KHe2s              4             KHe
KHe3         4          KHe                4             KHe
KHe3r        6          KHe3r              4             KHe
KL101a       7          KL101a             5             KL10
KL102        5          KL10               5             KL10
KL103        5          KL10               5             KL10
KL10-3dpi1  10          KL10-3dpi         10             KL10-3dpi
KL10-3dpi2  10          KL10-3dpi         10             KL10-3dpi
MM-10dpi1    9          MM-10dpi           9             MM-10dpi

*stripped text returned by =FRAGMENT.TEKSTU(A1; 1; [result from the coulumn with position]-1)

Thank you for clarifying the problem. I describe the solution which works for me. Maybe it will be helpful for somebody.

Comment 5 Eike Rathke 2018-07-03 16:16:34 UTC

Testing expressions in an online tester (for example https://regex101.com/) may help, it even gives some explanation about the expression entered and if quantifiers are greedy/docile/lazy/possessive. The default PCRE dialect is fine and quite close to the Java inspired regex of ICU, though there are subtle differences. Note that the match offsets in that tester (and others) are 0-based while in Calc they are 1-based. For the quantifiers subtleties see https://www.rexegg.com/regex-quantifiers.html