Bug 139840 - Error in Basic Instr function in case of case insensitive search
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: BASIC
Version: 7.0.4.2 release (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Andreas Heinisch
URL:
Whiteboard: target:7.2.0 target:7.3.0 target:7.2.0.2
Keywords:
Depends on:
Blocks:
 
Reported: 2021-01-22 17:52 UTC by Vladimir Sokolinskiy
Modified: 2021-07-20 08:18 UTC

See Also:
Crash report or crash signature:


Description Vladimir Sokolinskiy 2021-01-22 17:52:31 UTC
Description:
The Basic InStr function returns a wrong result for a case-insensitive search with non-Latin letters.

Steps to Reproduce:
Run macro:

Sub TestInstrFunction()
  MsgBox Instr(1, "α", "Α", 1)
End Sub




Actual Results:
Result is 0.

Expected Results:
Must be 1. The arguments are the Greek letter alpha in lowercase (α) and uppercase (Α).

There is no error with Latin letters.


Reproducible: Always


User Profile Reset: No



Additional Info:
-
Comment 2 Vladimir Sokolinskiy 2021-02-06 16:47:02 UTC
Yes, the toAsciiUpperCase and toAsciiLowerCase functions MUST NOT be used when processing text that contains Unicode characters with code points >= U+0080.
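A minimal Python sketch of why an ASCII-only case conversion produces the reported result. The helper `ascii_upper` is hypothetical and only mimics the idea of an ASCII-only uppercase conversion, and `instr` is an illustrative stand-in for the Basic function; neither is LibreOffice code.

```python
def ascii_upper(text: str) -> str:
    """Uppercase only a-z and leave every other character untouched (hypothetical helper)."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in text)


def instr(haystack: str, needle: str, upper=ascii_upper) -> int:
    """Return the 1-based position of needle in haystack after case conversion, or 0."""
    return upper(haystack).find(upper(needle)) + 1


print(instr("a", "A"))                   # Latin letters are converted, so they match
print(instr("α", "Α"))                   # Greek alpha is left untouched, so no match
print(instr("α", "Α", upper=str.upper))  # a Unicode-aware conversion matches
```

With the ASCII-only helper, both Greek letters pass through unchanged and the comparison fails, which is consistent with the result of 0 in the description.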
Comment 3 Andreas Heinisch 2021-05-12 19:57:50 UTC
Confirmed in:

Version: 7.2.0.0.alpha0+ (x64) / LibreOffice Community
Build ID: db35b9086476259fa2c047f2e4dfe7862d026530
CPU threads: 6; OS: Windows 10.0 Build 19042; UI render: Skia/Raster; VCL: win
Locale: de-DE (de_DE); UI: en-US
Calc: CL
Comment 4 Commit Notification 2021-05-13 16:37:38 UTC
Andreas Heinisch committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/7a578c06352328799c644e0399f14d58b05246f9

tdf#139840 - Case-insensitive operation for non-ASCII characters

It will be available in 7.2.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 5 Stephan Bergmann 2021-05-14 14:58:58 UTC
Note that the Unicode standard defines a concept of locale-independent "default caseless matching" (D144 in section 3.13 "Default Case Algorithms", <https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf>), which might be more appropriate to use here than any specific locale-dependent approach.
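As a rough illustration of the idea, default caseless matching compares the case-folded forms of two strings. The sketch below uses Python's `str.casefold`, which applies Unicode case folding; it is an assumption-laden stand-in for the concept, not what LibreOffice implements, and `caseless_equal` is a hypothetical name.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Locale-independent comparison: two strings match if their case-folded forms are equal."""
    return a.casefold() == b.casefold()


print(caseless_equal("α", "Α"))             # Greek alpha, lowercase vs uppercase
print(caseless_equal("Straße", "STRASSE"))  # ß folds to "ss"
print(caseless_equal("Straße", "Strase"))   # a single s is not enough
```

No locale is passed in anywhere, which is the point of the locale-independent definition.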
Comment 6 Vladimir Sokolinskiy 2021-05-15 16:16:49 UTC
Colleagues, thank you very much for your attention to the topic and the bug fix!
Comment 7 Andreas Heinisch 2021-07-11 20:50:22 UTC
There is still an error with this fix. Consider:

Sub Test()
     MsgBox InStr(2, "Straße", "s")
End Sub


It should return 0 instead of 5. The error arises because the German ß is replaced with SS during the case conversion.
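A small Python sketch of the mechanism, assuming the case conversion behaves like Python's `str.upper` (which also maps ß to SS). The function `instr_upper` is a hypothetical stand-in for a search performed on uppercased copies, not the actual patch.

```python
def instr_upper(start: int, haystack: str, needle: str) -> int:
    """Search uppercased copies from the 1-based position start; return a 1-based hit or 0."""
    return haystack.upper().find(needle.upper(), start - 1) + 1


print("Straße".upper())                # the converted string is one character longer
print(instr_upper(2, "Straße", "s"))   # the hit lands on the S that came from ß
```

The search runs over the converted text, so the match is found inside the expansion of ß rather than at a literal s in the original string.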
Comment 8 Andreas Heinisch 2021-07-12 20:03:32 UTC
As Mike Kaganski pointed out in gerrit:

"It might be a tough problem. Note that in Writer, not using regular expressions nor case-sensitive search, the Find & Replace (Ctrl+H) behaves as InStr - it finds 's' in 'ß'.

Interesting fact: entering 'ß' in Google Chrome's search box finds 'ss'."

So,

Sub Test()
     MsgBox InStr(2, "Straße", "s")
End Sub

should indeed return 5 to make it consistent with Writer etc.
Comment 9 Andreas Heinisch 2021-07-12 20:05:43 UTC
The same holds for InstrRev which is implemented using the toAsciiUpperCase function.
Comment 10 Vladimir Sokolinskiy 2021-07-13 13:16:25 UTC
I don't think the Instr function implementation should conform to the "Unicode normalization forms" standard.

By the way: in the corresponding SEARCH (Calc, Excel), Instr (VBA) functions, the search is performed character by character and the text is not normalized. In addition, the Replace function is often executed after the Instr function, and normalizing the text would complicate things a lot.
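To illustrate the concern about running Replace after Instr: any conversion that changes the string length makes positions found in the converted copy unreliable for the original. The Python sketch below uses case folding as an assumed example of such a conversion; the variable names are illustrative only.

```python
original = "Straße 1"
folded = original.casefold()  # "ß" becomes "ss", so the copy grows by one character

print(len(original), len(folded))

pos = folded.find("1")          # 0-based index of "1" in the folded copy
print(pos)
print(repr(original[pos:pos + 1]))  # the same index no longer points at "1" in the original
```

A replacement applied to the original at that index would miss its target, so an implementation has to map positions back or search without changing lengths.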
Comment 11 Commit Notification 2021-07-16 07:29:57 UTC
Andreas Heinisch committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/afddd56a8049957b9c0e025992d47c04342dbb88

tdf#139840 - Use utl::TextSearch to implement the InStr function

It will be available in 7.3.0.

Comment 12 Commit Notification 2021-07-19 12:06:56 UTC
Andreas Heinisch committed a patch related to this issue.
It has been pushed to "libreoffice-7-2":

https://git.libreoffice.org/core/commit/632fd5fd504d9800d580ceeeb87bc2b5d626d56a

tdf#139840 - Use utl::TextSearch to implement the InStr function

It will be available in 7.2.0.2.

Comment 13 Eike Rathke 2021-07-19 12:53:03 UTC
(In reply to Vladimir Sokolinskiy from comment #10)
> By the way: in the corresponding SEARCH (Calc, Excel), [...]
> functions, the search is performed character by character and the text is
> not normalized.
In Calc SEARCH("Straße";"ss") returns 5 (as does SEARCH("Strasse";"ß")). That's not about normalization though but due to case-ignore transliteration.
Comment 14 Vladimir Sokolinskiy 2021-07-19 13:11:20 UTC
Hello Eike! In the en-US (and ru-RU :) ) localizations, both formulas you specified return #VALUE!
Comment 15 Vladimir Sokolinskiy 2021-07-19 13:40:20 UTC
SEARCH has a different argument order, so in Calc everything is essentially as described in comment #13 (sorry, I don't know how to edit a previously posted comment).
Excel returns #VALUE! in both cases (at least in the en-US localization).
Comment 16 Eike Rathke 2021-07-20 08:18:31 UTC
Sorry, of course it's SEARCH("ss";"Straße") and SEARCH("ß";"Strasse")

I should have copied instead of typing.