Bug 78840 - Add the regular expression (?ismwx-ismwx: ... ) flag settings: evaluate the parenthesized expression with the specified flags enabled or -disabled, to allow a case-sensitive mode in functions using regular expressions.
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version: 4.3.0.0.alpha1 (earliest affected)
Hardware: Other All
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard: target:6.5.0
Keywords:
Depends on:
Blocks: Find-Search
Reported: 2014-05-17 21:35 UTC by m_a_riosv
Modified: 2020-04-24 14:34 UTC (History)

See Also:
Crash report or crash signature:


Attachments
Sample file to test (26.62 KB, application/vnd.oasis.opendocument.spreadsheet)
2014-05-17 21:35 UTC, m_a_riosv

Description m_a_riosv 2014-05-17 21:35:31 UTC
Created attachment 99252 [details]
Sample file to test

Currently it is not possible to search in a case-sensitive mode in functions that use regular expressions, although the Find tool has its own option for it.

http://ask.libreoffice.org/en/question/33989/case-sensitivity-using-calc-countif-function/

e.g. =COUNTIF(B$1:B$4;"x") finds every 'x' in the range regardless of case.

Among the ICU regular expression operators (http://userguide.icu-project.org/strings/regexp) there are:

(?ismwx-ismwx: ... ) Flag settings. Evaluate the parenthesized expression with the specified flags enabled or -disabled.
(?ismwx-ismwx) 	Flag settings. Change the flag settings. Changes apply to the portion of the pattern following the setting. For example, (?i) changes to a case insensitive match.

But those don't seem to work. I guess they aren't implemented.

'i' is the flag for the case insensitive mode:

=COUNTIF(B$1:B$4;"(?-i)x")

(?-i) should deactivate case-insensitive mode, so the formula should find only lowercase 'x', but that is not the result.
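
The expected semantics of the flag settings can be tried outside Calc. The following is only a sketch using Python's `re` module as a stand-in for ICU, not LibreOffice code. The cell values are made up, whole-cell matching via `fullmatch` is an assumption, and the scoped form `(?-i: ... )` is used because that is the form Python documents for switching a flag off.

```python
import re

# Made-up cell contents standing in for a range such as B1:B3.
cells = ["x", "X", "y"]

# (?i) at the start of the pattern: the whole pattern is case-insensitive.
insensitive = [c for c in cells if re.fullmatch(r"(?i)x", c)]

# (?-i: ... ) scoped form: the group stays case-sensitive even though the
# caller asked for case-insensitive matching overall (re.IGNORECASE).
sensitive = [c for c in cells if re.fullmatch(r"(?-i:x)", c, re.IGNORECASE)]

print(insensitive)  # ['x', 'X']
print(sensitive)    # ['x']
```

The second result is what the request asks COUNTIF to honor: only the lowercase 'x' is counted.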
Comment 1 Joel Madero 2015-03-11 18:08:17 UTC
Talked to the dev room and they say this is valid - although might confuse users with yet more options.

As such marking as NEW.

Also got this from one of the devs: "well a workaround would be to use \uXXXX (see https://help.libreoffice.org/Common/List_of_Regular_Expressions )" -- no clue what that means but maybe it'll help those who are interested.
Comment 2 Eike Rathke 2015-03-11 18:16:01 UTC
Well, "valid request" doesn't mean it would come easy.. it would be yet another option in the calculation settings confusing more users, rework how text is fed to the regex matcher (with or without ignore-case transliteration), and last but not least would have to be stored in the document, yet another ODF extension I presume.
Comment 3 m_a_riosv 2015-03-11 20:14:15 UTC
Thanks a lot for taking a look.

I think the workaround of entering the Unicode character code covers it well when there are only a few characters to find.

From Eike's answer it seems that the search and replace text is analyzed and transformed before it is passed to the regex matcher, so this may mean a high effort for little benefit.

Maybe if a regex could be used directly without any transformation, but surely that is not an easy garden to cut either.

For me there is no problem with closing the request.
Comment 4 Mike Kaganski 2019-12-19 09:36:35 UTC
(In reply to m.a.riosv from comment #0)
> In the icu regular expression operators
> (http://userguide.icu-project.org/strings/regexp) there is:
> 
> (?ismwx-ismwx: ... ) Flag settings. Evaluate the parenthesized expression
> with the specified flags enabled or -disabled.
> (?ismwx-ismwx) 	Flag settings. Change the flag settings. Changes apply to
> the portion of the pattern following the setting. For example, (?i) changes
> to a case insensitive match.
> 
> But those seem don't work. I guess they aren't implemented.

They are implemented. It's just that COUNTIF disables case sensitivity for an unclear reason (see the "Never case-sensitive" comment in ScInterpreter::ScCountIf in sc/source/core/tool/interpr1.cxx): OpenFormula does not impose the insensitivity restriction on the function (unlike e.g. MATCH: cf. 6.13.9 and 6.14.9 at [1]).

There's a new REGEX function [2] implemented by Eike for 6.2, which doesn't disable the case sensitivity. Of course, using it with COUNTIF is more involved than plain regex in COUNTIF criteria.

I don't think that any new option in calculation settings is required. Since the standard does not impose the restriction on the function, it might simply be considered insensitive by default, and when regex is used, it may honor the advanced regex options mentioned in comment 0.

[1] http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part2.html
[2] https://help.libreoffice.org/6.3/en-US/text/scalc/01/func_regex.html
Comment 5 m_a_riosv 2019-12-20 00:28:33 UTC
Thanks for the hints, Mike.
I use the REGEX function, in which (?i) can be used in the pattern to change its case-sensitive default.
The count can be done with REGEX(), but it is far from easy: as far as I know it cannot be done inside COUNTIF and needs at least SUMPRODUCT or an array formula, plus some more functions to analyze the REGEX result.

I don't like the situation; the same regex pattern should work the same everywhere in LibreOffice. But here we are, and as Eike commented in tdf#113977 it's hard to change, especially because of backward incompatibilities.
Of course I don't like new options; we already have a lot of them, and many users don't spend even a minute reviewing them. That's how the world goes.

There are surely more needed enhancements to develop, so I think it's better for now to close as WONTFIX.
Comment 6 Mike Kaganski 2019-12-20 05:15:21 UTC
I don't believe there is a compatibility issue with currently case-insensitive functions like COUNTIF. I don't believe there are regexes actually in use there that include (?i). So I actually believe that simply adding sensitivity to those without any new settings is really enough.
Comment 7 Mike Kaganski 2019-12-20 05:18:43 UTC
... simply adding sensitivity to those *flags* without any new settings ...
Comment 8 Buovjaga 2019-12-20 10:06:45 UTC
If Mike is optimistic, let's keep as NEW
Comment 9 Mike Kaganski 2019-12-20 10:32:53 UTC
(In reply to Buovjaga from comment #8)
> If Mike is optimistic, let's keep as NEW

Heh - thanks! Yet I was hoping that Eike would comment - because my perception could be of course wrong if that could be actually a problem that I overlook...
Comment 10 Eike Rathke 2019-12-21 00:21:56 UTC
We maybe could do this: iff regular expressions are enabled and "(?-i)" or "(?i)" are present in the search string (note that those flags can appear anywhere), then disable the overall case-insensitive option. However, that should be done in a consistent manner for all functions that obey the regular expression setting.
Comment 11 Eike Rathke 2019-12-21 00:28:54 UTC
Though it can get a bit tricky, if "(?-i)" is present somewhere (not at the start) without a preceding "(?i)" then the user may have assumed that the overall setting was case insensitive.
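
A rough sketch of the detection idea from comment 10, written in Python for illustration only. The helper names `mentions_case_flag` and `effective_ignore_case` are hypothetical and do not exist in LibreOffice. The scan is deliberately naive: it does not handle escaped parentheses or flag-like text inside character classes, and it does not resolve the ambiguity raised in this comment about a "(?-i)" that appears mid-pattern without a preceding "(?i)".

```python
import re

# Hypothetical pattern: inline flag groups such as (?i), (?-i), (?s-i)
# or the scoped form (?i: ... ). Plain groups like (?: ... ) also match
# the shape but carry no flag letters.
_FLAG_GROUP = re.compile(r"\(\?([a-zA-Z]*)(?:-([a-zA-Z]*))?[:)]")


def mentions_case_flag(pattern: str) -> bool:
    """Return True if the pattern switches the 'i' flag on or off anywhere."""
    for m in _FLAG_GROUP.finditer(pattern):
        enabled, disabled = m.group(1), m.group(2) or ""
        if "i" in enabled or "i" in disabled:
            return True
    return False


def effective_ignore_case(pattern: str, regex_enabled: bool) -> bool:
    """Overall ignore-case option: on by default, switched off only when
    regular expressions are enabled and the pattern sets the 'i' flag itself."""
    return not (regex_enabled and mentions_case_flag(pattern))


print(effective_ignore_case("(?-i)x", regex_enabled=True))   # False
print(effective_ignore_case("x", regex_enabled=True))        # True
print(effective_ignore_case("(?-i)x", regex_enabled=False))  # True
```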
Comment 12 Mike Kaganski 2019-12-21 07:03:14 UTC
(In reply to Eike Rathke from comment #10)
> However, that should be done in a consistent manner for all functions
> that obey the regular expression setting.

... except possibly those that explicitly say about case-(in)sensitive behaviour in standard, like MATCH?

I didn't look into the code, but isn't there a way to set a *default* in the regex engine (case sensitive/insensitive)? Then, if regex is enabled, we wouldn't use any preprocessing of the string (transliteration) for insensitivity, but would instead set the regex engine to "insensitive by default" mode and rely on the engine to obey (?i) normally.
Comment 13 Mike Kaganski 2019-12-21 09:22:42 UTC
i18npool/source/search/textsearch.cxx

TextSearch::setOptions2 handles rOptions.transliterateFlags and initializes xTranslit for IGNORE_CASE. Then it calls RESrchPrepare, where it adds UREGEX_CASE_INSENSITIVE to nIcuSearchFlags. Then in searchForward/searchBackward it uses the transliteration service to get lowercase version of original string, thus making regex engine unable to make reasonable use of (?-i) flag.

There's a comment in TextSearch::RESrchPrepare:

> // Note that the search flag ALL_IGNORE_CASE is deprecated in UNO
> // probably because the transliteration flag IGNORE_CASE handles it as well.

So maybe in case of regex search, transliteration service should not be used/should do nothing? The UREGEX_CASE_INSENSITIVE should already do the trick?
Comment 14 Mike Kaganski 2019-12-21 09:29:51 UTC
By the way, this behaviour makes regex with (?-i) work wrong:

> A1   q
> A2   Q
> A3   =COUNTIF(A1:A2; "(?-i)q")

gives 2

> A1   q
> A2   Q
> A3   =COUNTIF(A1:A2; "(?-i)Q")

gives 0

because the contents of both cells are made lowercase "q" before being passed to the regex engine, which tests them against a regex that is explicitly set case-sensitive.
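
The two results above can be reproduced with a small model. This is a sketch in Python, not the LibreOffice implementation: `countif_lowercased` and `countif_engine_only` are hypothetical names, `str.lower()` stands in for the transliteration service, `re.IGNORECASE` stands in for UREGEX_CASE_INSENSITIVE, and the scoped `(?-i: ... )` form replaces ICU's `(?-i)` because that is the form Python documents for switching a flag off.

```python
import re


def countif_lowercased(cells, pattern):
    """Model of the behaviour described above: the cell text is lowercased
    before it reaches the regex engine."""
    rx = re.compile(pattern, re.IGNORECASE)
    return sum(1 for c in cells if rx.fullmatch(c.lower()))


def countif_engine_only(cells, pattern):
    """Model of the alternative: no preprocessing, the engine's own
    case-insensitive default applies and inline flags can override it."""
    rx = re.compile(pattern, re.IGNORECASE)
    return sum(1 for c in cells if rx.fullmatch(c))


cells = ["q", "Q"]  # A1 and A2 from the example above

print(countif_lowercased(cells, "(?-i:q)"))   # 2
print(countif_lowercased(cells, "(?-i:Q)"))   # 0
print(countif_engine_only(cells, "(?-i:q)"))  # 1
print(countif_engine_only(cells, "(?-i:Q)"))  # 1
print(countif_engine_only(cells, "q"))        # 2
```

In the second model a pattern without flags still matches both cells, so the case-insensitive default is preserved while explicit flags take effect.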
Comment 15 Mike Kaganski 2019-12-21 10:38:17 UTC
https://gerrit.libreoffice.org/85650 seems to pass tests. It's quite simple, but it should have quite a wide effect. Also, it doesn't treat MATCH etc. specially. Eike, could you please take a look and see if it makes sense?

As an aside: regardless of this being implemented, I believe we need a documentation addition that mentions the (default) case-sensitivity/insensitivity for all functions where it applies.
Comment 16 Commit Notification 2020-01-03 19:52:29 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/12b4590f3a9ba64bcc27e60185ee7366d9894cc7

tdf#78840: disable case-insensitive transliteration for regex search

It will be available in 6.5.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 17 Mike Kaganski 2020-01-03 20:05:58 UTC
Eike: thanks for the review!
Comment 18 Commit Notification 2020-01-03 20:14:36 UTC
Eike Rathke committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/ace8602466986e0249aa41845dce4e7da4fcafba

Elaborate comment what happens, tdf#78840 follow-up

It will be available in 6.5.0.

Comment 19 Mike Kaganski 2020-01-03 21:07:48 UTC
(In reply to Commit Notification from comment #18)
> Eike Rathke committed a patch related to this issue.
> It has been pushed to "master":
> 
> https://git.libreoffice.org/core/commit/
> ace8602466986e0249aa41845dce4e7da4fcafba
> 
> Elaborate comment what happens, tdf#78840 follow-up

Ah, thanks for that!^ I only now realized that I should have written "RESrchPrepare will consider TransliterationFlags::IGNORE_CASE in aSrchPara.transliterateFlags", not "... SearchAlgorithms2::REGEXP in aSrchPara.transliterateFlags". :-)