Bug 113977 - Implement REGEX.* function(s)
Implement REGEX.* function(s)
 Status: RESOLVED FIXED None LibreOffice Unclassified Calc (show other bugs) unspecified All All medium enhancement Eike Rathke target:6.2.0 target:6.2.0.1 (view as bug list) Calc-Function Show dependency tree / graph

 Reported: 2017-11-21 17:19 UTC by Dan Dascalescu 2019-04-16 09:53 UTC (History) 9 users (show) dlynch1319 erack jag maddes+docfoundation miguelangelrv raal turbobeholder winfrieddonkers xiscofauli

Attachments

 Note You need to log in before you can comment on or make changes to this bug.
 Dan Dascalescu 2017-11-21 17:19:24 UTC ```Description: Extracting a substring from a string is a royal pain with the existing text function. It has to be done using a combination of MID(), LEFT() and awkward repetitive arithmetic. Steps to Reproduce: https://help.libreoffice.org/Calc/Text_Functions only lists functions that return the start of a found string, or that extract strings based on an offset and number of characters (the latter causing the repetitive arithmetic I mentioned. Here's my best attempt to extract a currency value for JPY: A1: EUR:1.2,JPY:0.02,AUD:0.9,... A2: =SEARCH("(?<=JPY:)\d", A1) // ← value starts A3: =FIND(",", A7, SEARCH("(?<=JPY:)\d", A1)) // ← value ends right before that A4: =MID(A1, SEARCH("(?<=JPY:)\d", A1), FIND(",", A1, SEARCH("(?<=JPY:)\d", A1)) - SEARCH("(?<=JPY:)\d", A1)) // standalone formula #facepalm Actual Results: Horrible formulas Expected Results: Per the documentation at https://support.google.com/docs/answer/3098244?hl=en, REGEXEXTRACT would be far, far simpler: =REGEXEXTRACT(A1, "(?<=JPY:)[^,]+") Reproducible: Always User Profile Reset: No Additional Info: A user cares a lot more about the actual substring they want to extract, than about its offset and length within the original string. User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36``` Xisco Faulí 2017-11-21 20:29:05 UTC `@Winfried, what do you think ?` Winfried Donkers 2017-11-22 07:22:02 UTC ```(In reply to Xisco Faulí from comment #1) > @Winfried, what do you think ? There is no REGEXEXTRACT function defined in ODFF nor is it in Excel, so it's clearly an enhancement request. Personally I am both attracted by the request and reluctant to be enthusiastic. At first sight it looks like a proper enhancement, one function that with regex can extract the most wonderful parts out of a text string. However I see some aspects that reduces its usefulness: -regular expressions are not everybody's cup of tea. This makes the function only useful to a very small group of users; -Calc has a setting (in preferences) in which wild cards or regular expressions can be selected to be used in function arguments. This request is limited to regular expressions only and thereby somewhat conflicts with the wild card/regex setting; -As the function is neither in ODFF nor in Excel, it will be incompatible with other applications where LibreOffice strives for optimal interoperability with other applications; -Calculations (including those on text strings) tend to look very complex when the calculation goes further than just one simple calculation. This can be avoided by using more than one step (in the example in the description cells A2 and A3 can be used to make the formula in A4 much simpler: '=MID(A1;A2;A3-A2)', and the formula in A3 can be reduced to '=FIND(",", A1, A2)'). Creating new functions that combine other functions will only be useful if there is a common need for them. Otherwise there will be thousands of 'combination functions', each hardly used and very hard to to find the exact function a user has a need for. Right now there are already too much variations of functions (e.g. FIND and SEARCH) that confuse users and makes it hard to select the 'best' function. In programming it is the same: with some standard functions an almost infinite amount of functionalities can be programmed; only functionality that is widely used is put into a separate function. The complexity is the same too, a piece of code can be divided into easy to understand chunks or be written in a single line (that can only be deciphered by the programmer him/herself, provided (s)he remembers the clues). -A macro can provide the same functionality and be exactly suited to the user's need. Well, these are my thoughts, with 'my' meaning personal. The enhancement request is clear, so I set the status to new. @Dan : should you wish to do this yourself, LibreOffice has a mentor for those new to developing for LibreOffice. I can help you with some code pointers as well. The number of developers with both knowledge of the Calc code base and time available for enhancements like REGEXEXTRACT is very small, so it may take a long time before anyone picks this up.``` Winfried Donkers 2017-11-22 07:22:56 UTC ```(In reply to Dan Dascalescu from comment #0) > A3: =FIND(",", A7, SEARCH("(?<=JPY:)\d", A1)) // ← value ends right before > that IMHO A7 should be A1.``` Dan Dascalescu 2017-11-25 20:05:56 UTC ```> At first sight it looks like a proper enhancement, one function that > with regex can extract the most wonderful parts out of a text string. That's what I was going for, thank you. > However I see some aspects that reduces its usefulness: > -regular expressions are not everybody's cup of tea. > This makes the function only useful to a very small group of users; A lot of users are familiar with wildcards, e.g. *.jpg. Many regexp tasks are solvable with just "foo.*bar". Anyway, by the logic "a function only serves some users therefore we shouldn't have it", none of the trigonometry functions should be in LibreCalc. For example, I've never used "arctangent" in my life. Or "BAHTTEXT". Wish I had numbers to show that far more Google Docs users use REGEXEXTRACT than BAHTTEXT. > -Calc has a setting (in preferences) in which wild cards or regular > expressions can be selected to be used in function arguments. This request is > limited to regular expressions only and thereby somewhat conflicts with the > wild card/regex setting; It "conflicts" only as much as the regex pattern for existing functions that use regexps, like SEARCH, conflicts. We can solve this problem the same way we solved the SEARCH problem. > -As the function is neither in ODFF nor in Excel, it will be incompatible > with other applications where LibreOffice strives for optimal > interoperability with other applications; Do we limit formatting because it's not compatible with CSV? 1. If a document author wants to provide backwards compatibility with applications that don't support a given function, they won't use that function. 2. What other applications exactly are we talking about? Only Excel has significant market share, because Google Sheets supports REGEXEXTRACT. 3. Innovation requires breaking compatibility. By providing a new function, we enable users to force the developers of their applications to step up and provide the same function. This very process is at work now as I'm adovating for supporting a function from Google Sheets... 4. ...which is reason #4 why we should implement that function - so Google Sheets that use it could be opened in LibreCalc. > -Calculations (including those on text strings) tend to look very complex > when the calculation goes further than just one simple calculation. This can > be avoided by using more than one step I prefer not to litter my spreadsheet with cells that contain intermediate values. In my example with web service-based currency conversion (BTW a very common task that we don't have a good solution for - see bug 113974), the string "EUR:1.2,JPY:0.02,AUD:0.9,..." would require 2 intermediate cells per each currency. This would trade off formula complexity (usually invisible and forgotten about once you've figured out the formula) for the visible complexity of the intermediate cells (which, yes, could be hidden as well). Anyway, I'd love to have one clean function that can extract exactly what I need into the desired cell. > Creating new functions that combine other functions will only be useful if > there is a common need for them. I think Google Sheets shows there's a need for REGEXEXTRACT... > In programming it is the same: with some standard functions an almost > infinite amount of functionalities can be programmed; only functionality that > is widely used is put into a separate function. ... and so do all modern programming languages: "match" in JavaScript "search" in Python "matcher" in Java etc. > -A macro can provide the same functionality and be exactly suited to the > user's need. I would say a lot more users are comfortable using functions already provided by Calc/Excel/Sheets, than creating macros. > The enhancement request is clear, so I set the status to new. > @Dan : should you wish to do this yourself, LibreOffice has a mentor for > those new to developing for LibreOffice. I can help you with some code > pointers as well. Thank you. I hope to have time to tackle this on next year, though for now I've just been using a long formula, so my motivation has decreased somewhat. > IMHO A7 should be A1. Correct :) Which proves how brittle that sort of complex MID+SEARCH expression is.``` Eike Rathke 2018-01-30 15:56:05 UTC ```(In reply to Dan Dascalescu from comment #4) > Google Sheets supports REGEXEXTRACT. Out of curiosity: when saving/exporting/downloading such a Google Sheet as ODF, what is written as function name?``` Winfried Donkers 2018-01-31 09:17:12 UTC ```(In reply to Eike Rathke from comment #5) > (In reply to Dan Dascalescu from comment #4) > > Google Sheets supports REGEXEXTRACT. > Out of curiosity: when saving/exporting/downloading such a Google Sheet as > ODF, what is written as function name? with a common function looking like If this function is to be added, we should think about REGEXMATCH and REGEXREPLACE as well. (I'm interested in giving it a try, but only if there are more people saying the enhancement is wanted than just Dan Dascalescu and me.)``` Eike Rathke 2018-01-31 15:30:21 UTC ```(In reply to Winfried Donkers from comment #6) > table:formula="of:=IFERROR(__xludf.dummyfunction("REGEXEXTRACT( A1, > ""a."" )");"ab")" Horrible. Why don't they use a simple COM.GOOGLE.REGEXTRACT(...) instead like ODFF suggests. Their construct is terrible, it embeds the entire function call and parameters as string parameter into the __xludf.dummyfunction(). REGEXMATCH and REGEXREPLACE would indeed make sense if we implemented any REGEX* function.``` Winfried Donkers 2018-01-31 16:26:25 UTC ```(In reply to Eike Rathke from comment #7) > (In reply to Winfried Donkers from comment #6) > Horrible. Why don't they use a simple COM.GOOGLE.REGEXTRACT(...) instead > like ODFF suggests. Their construct is terrible, it embeds the entire > function call and parameters as string parameter into the > __xludf.dummyfunction(). And the IFERROR around that is not from me, but from Google Sheets too. As if they don't trust other applications to handle unknown functions properly.``` Eike Rathke 2018-02-01 11:32:24 UTC ```(In reply to Winfried Donkers from comment #8) > And the IFERROR around that is not from me, but from Google Sheets too. As > if they don't trust other applications to handle unknown functions properly. That again may make sense, if the function isn't handled and results in error but the formula is compiled and interpreted otherwise, then the final result is still the result that was displayed in Google Sheets, here "ab", instead of an error result. It will of course never follow changed input.``` Eike Rathke 2018-10-24 09:04:35 UTC `*** Bug 119488 has been marked as a duplicate of this bug. ***` Eike Rathke 2018-10-24 09:34:20 UTC ```Recap: I'm for implementing some REGEX function(s) to search for and also replace similar to /regex/ and s/regex/replacement/flags I'm against naming it (or them) the same REGEXMATCH or REGEXEXTRACT as in Google Sheets or mapping import to such function, because Google Sheets uses RE2 while we use ICU's Regex and there are subtle differences in the supported expressions and metacharacters, which could lead to different results for the same regex.``` m.a.riosv 2018-10-24 09:57:54 UTC Comment hidden (me-too) `+1` Winfried Donkers 2018-10-24 11:54:46 UTC Comment hidden (me-too) ```(In reply to Eike Rathke from comment #11) > Recap: > > I'm for implementing some REGEX function(s) to search for and also replace > similar to /regex/ and s/regex/replacement/flags > > I'm against naming it (or them) the same REGEXMATCH or REGEXEXTRACT as in > Google Sheets or mapping import to such function, because Google Sheets uses > RE2 while we use ICU's Regex and there are subtle differences in the > supported expressions and metacharacters, which could lead to different > results for the same regex. +1``` Commit Notification 2018-10-25 10:14:35 UTC ```Eike Rathke committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=8dec85a3b3f4cbd46b03f707458347a25cc22c15 Resolves: tdf#113977 implement REGEX() spreadsheet function It will be available in 6.2.0. The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.``` Eike Rathke 2018-10-25 10:33:19 UTC ``` REGEX( Text ; Expression [ ; Replacement ] ) Using ICU regular expressions http://userguide.icu-project.org/strings/regexp``` Xisco Faulí 2018-10-25 10:36:33 UTC `@Eike, This is wonderful. I think it makes sense to add it to the release notes -> https://wiki.documentfoundation.org/ReleaseNotes/6.2` Eike Rathke 2018-10-25 12:54:04 UTC `Of course. Done.` Commit Notification 2018-10-26 13:24:41 UTC ```Eike Rathke committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=2107e08fe4798d48610f7472fa581103462081aa Let REGEX() return #N/A instead of #VALUE! if no match, tdf#113977 follow-up It will be available in 6.2.0. The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.``` Eike Rathke 2018-10-26 13:25:54 UTC `Further work to be done on this, do not consider it fully implemented.` Commit Notification 2018-10-30 01:36:17 UTC ```Eike Rathke committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/core/+/9574076674acfed2e7f24a86929c10b7a539a35c%5E%21 Rewrite REGEX() to directly use icu::RegexMatcher, tdf#113977 It will be available in 6.2.0. The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.``` Commit Notification 2018-10-30 13:43:58 UTC ```Eike Rathke committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/core/+/61aa2f2e7e89268bab174447016f3a3e14a541df%5E%21 4th parameter Flags for REGEX(), tdf#113977 It will be available in 6.2.0. The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.``` Eike Rathke 2018-10-30 13:45:16 UTC ``` REGEX( Text ; Expression [ ; [ Replacement ] [ ; Flags ] ] ) REGEX(Text;Expression) extracts the first match of Expression in Text. If there is no match, #N/A is returned. REGEX(Text;Expression;Replacement) replaces the first match of Expression in Text, not extracted. If there is no match, Text is returned unmodified. REGEX(Text;Expression;Replacement;"g") replaces all matches of Expression in Text, not extracted. If there is no match, Text is returned unmodified.``` Commit Notification 2018-11-01 17:02:39 UTC ```Eike Rathke committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/core/+/0ee11c3f55ec1285d8febc065bda44224052eaae%5E%21 Add REGEX() 4th parameter also to Excel import/export, tdf#113977 It will be available in 6.2.0. The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.``` Maddes 2018-11-01 19:56:53 UTC ```Tested a daily build from TinderBox: Win-x86_64@62-TDF, Branch:MASTER, Time: 2018-10-31_22:39:43 According to the documenatation in this "bug" REGEX() returns the first match, but what if I only want a part of the whole match: For example I want to get the filename from a path: =REGEX("/path/to/file.txt"; "/([^/]*)\$") returns "/file.txt", but wanted result is "file.txt" only. =REGEX("/path/to/file.txt"; "/[^/]*\$") should return "/file.txt" as is.``` Maddes 2018-11-01 20:20:08 UTC ```Forgot to mention that I need to check for the slash separator. As otherwise I would have to handle the empty values and "MISSING" values separately with additional IF()s.``` Commit Notification 2018-11-13 20:25:19 UTC ```Zdeněk Crhonek committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/core/+/ab4e307fe4538c3d4d4ba4b9f23df93ce2d648a3%5E%21 tdf#113977 unit test for function REGEX It will be available in 6.2.0. The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.``` Eike Rathke 2018-11-22 18:41:31 UTC ```(In reply to Maddes from comment #24) > =REGEX("/path/to/file.txt"; "/([^/]*)\$") returns "/file.txt", but wanted > result is "file.txt" only. Then do not include the '/' slash in the pattern, =REGEX("/path/to/file.txt", "([^/]*)\$") If there shall be no match if there is no preceding slash then use a (?<=...) look-behind assertion, =REGEX("/path/to/file.txt", "(?<=/)([^/]*)\$")``` Eike Rathke 2018-11-22 18:56:45 UTC `And probably you do not want to match an empty string after a trailing slash (like in "/path/to/") so that should match at least one character instead of zero or more, and the capture group is unnecessary, so the expression should be "(?<=/)[^/]+\$"` David Lynch 2018-11-24 15:56:54 UTC ```This looks very useful: thank you are very much. I wish I'd had the chance to comment earlier. I would have made the following point: Change the fourth parameter (to be more useful and to be compatible with SUBSTITUTE) REGEX(Text;Expression;Replacement) replaces *** all *** matches of Expression in Text, not extracted. If there is no match, Text is returned unmodified. REGEX(Text;Expression;Replacement;n) replaces *** the nth *** match of Expression in Text, not extracted. If there is no match, Text is returned unmodified. (If n=0, all matches are replaced).``` Wolfgang Jäger 2018-11-24 17:41:21 UTC ```-1- I would like to back the suggestion by @David Lynch (Comment #29). -2- Details: The implementation of REGEX() function as found in V6.2.0.0.beta1 (x64) for Win supports te modifyers concerning case sensitivity (?i) and (?-i) ("insensitive", "not insenitive" (also inside a RegEx for the part to come). Default is (?-i). That's FINE. However, the standard functions supporting RegEx for a long time now are working with the default (?i) and do NOT support any changes concerning case sensitivity. To reduce the danger of annoying misunderstandings and of wasting time this way I would suggest to ALSO support (?-i) with SEARCH(), MATCH(), COUNTIS() ... and to change the default for REGEX() to (?i) to avoid new inconcistencies. This might best be done before 6.2. gets released. @ any QA person; asking for advice: This enhancement bug was closed as RESOLVED FIXED. That's basically ok. Should I better start a new thread with the above content or should I re-open this bug OR will YOU do so?``` Eike Rathke 2018-11-27 01:25:38 UTC ```(In reply to David Lynch from comment #29) > REGEX(Text;Expression;Replacement) replaces *** all *** matches I'm against this. Common implementations using search and replace with regular expressions replace only the first occurrence if no further flags are given (or specialized functions such as gawk gsub() or gensub() are used). It would be odd to reverse that logic. > REGEX(Text;Expression;Replacement;n) replaces *** the nth *** match That might be possible. I can investigate. > (If n=0, all matches are replaced). I'd not do that but rather stick with "g"(lobal) there, which is well known from other implementations. n=0 might return an unmodified Text though, not sure about that. Might be useful in calculated replacements. (In reply to Wolfgang Jäger from comment #30) > However, the standard functions supporting RegEx for a long time now are > working with the default (?i) and do NOT support any changes concerning case > sensitivity. SEARCH(), MATCH() and others are defined to match case-insensitive. > To reduce the danger of annoying misunderstandings and of wasting time this > way I would suggest to ALSO support (?-i) with SEARCH(), MATCH(), COUNTIS() (?-i) is ignored and thus can't be used if the overall matching is flagged case-insensitive, which we obviously can't change. We could only internally change an expression "regex" to "(?i)regex" and feed that to a case-sensitive matching. I'm not much a friend of this but it could work. However, any change there would introduce a backwards incompatibility as such new expressions using (?-i) would yield different results in older release. So no, I'm against this. > ... and to change the default for REGEX() to (?i) to avoid new > inconcistencies. The REGEX() function closely follows the implementation and API of the ICU regular expressions engine and the behaviour is described in detail at http://userguide.icu-project.org/strings/regexp . Its default, as with all common regular expression implementations, is case-sensitive and I'd rather not change that because I consider it expected behaviour.``` David Lynch 2018-11-27 07:32:47 UTC ```(In reply to Eike Rathke from comment #31) > > REGEX(Text;Expression;Replacement;n) replaces *** the nth *** match > That might be possible. I can investigate. > I'm happy with Eike's reply. What's important to me is to have the facility to replace the nth match.``` Commit Notification 2018-11-29 10:31:47 UTC ```Eike Rathke committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/core/+/e3af4947fd4b8d1411212775e8ffe42e330364c3%5E%21 Support occurrence number as REGEX() 4th argument, tdf#113977 follow-up It will be available in 6.3.0. The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.``` Eike Rathke 2018-11-29 10:33:30 UTC ``` REGEX( Text ; Expression [ ; [ Replacement ] [ ; Flags|Occurrence ] ] ) REGEX(Text;Expression) extracts the first match of Expression in Text. If there is no match, #N/A is returned. REGEX(Text;Expression;Replacement) replaces the first match of Expression in Text, not extracted. If there is no match, Text is returned unmodified. REGEX(Text;Expression;Replacement;"g") replaces all matches of Expression in Text with Replacement, not extracted. If there is no match, Text is returned unmodified. REGEX(Text;Expression;;Occurrence) extracts the n-th match of Expression in Text. If there is no n-th match, #N/A is returned. If Occurrence is 0, Text is returned unmodified. REGEX(Text;Expression;Replacement;Occurrence) replaces the n-th match of Expression in Text with Replacement, not extracted. If there is no n-th match, Text is returned unmodified. If Occurrence is 0, Text is returned unmodified.``` Eike Rathke 2018-11-29 10:34:02 UTC `Pending for 6-2 https://gerrit.libreoffice.org/64219` Commit Notification 2018-11-29 14:59:56 UTC ```Eike Rathke committed a patch related to this issue. It has been pushed to "libreoffice-6-2": https://git.libreoffice.org/core/+/e18c4c5f55f12725821e2d357f251ae1c548217a%5E%21 Support occurrence number as REGEX() 4th argument, tdf#113977 follow-up It will be available in 6.2.0.1. The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.``` Commit Notification 2018-11-29 20:17:57 UTC ```Eike Rathke committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/help/+/5f6e0b4ea97c5011a80613fc21a703d2178a8c6e%5E%21 Occurrence number as REGEX() 4th argument, tdf#113977 tdf#121084``` Commit Notification 2018-11-29 20:26:10 UTC ```Eike Rathke committed a patch related to this issue. It has been pushed to "libreoffice-6-2": https://git.libreoffice.org/help/+/ee142eec43e73bac7377380c3c72c31600a20355%5E%21 Occurrence number as REGEX() 4th argument, tdf#113977 tdf#121084``` Winfried Donkers 2019-04-16 09:53:38 UTC `*** Bug 123000 has been marked as a duplicate of this bug. ***`