Description:
Error in the Basic InStr function for case-insensitive search with non-Latin letters.
Steps to Reproduce:
Run the macro:
    Sub TestInstrFunction()
        MsgBox Instr(1, "α", "Α", 1)
    End Sub
Actual Results:
Result is 0.
Expected Results:
Must be 1. The arguments are the Greek letter alpha (uppercase and lowercase). There is no error with Latin letters.
Reproducible: Always
User Profile Reset: No
Additional Info: -
source code pointer: https://opengrok.libreoffice.org/xref/core/basic/source/runtime/methods.cxx?r=3482f590#888
Yes, the toAsciiUpperCase and toAsciiLowerCase functions MUST NOT be used when processing text that may contain Unicode characters with code points >= U+0080.
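To illustrate the point, here is a minimal Python sketch (not the LibreOffice C++ code). The helper names `ascii_upper`, `instr_ascii` and `instr_unicode` are hypothetical; `ascii_upper` only mimics the idea of an ASCII-only uppercase conversion.

```python
def ascii_upper(text: str) -> str:
    """Uppercase only a-z and leave every other character untouched."""
    return "".join(
        chr(ord(ch) - 32) if "a" <= ch <= "z" else ch
        for ch in text
    )


def instr_ascii(haystack: str, needle: str) -> int:
    """1-based position of needle in haystack (0 if absent), ASCII-only case mapping."""
    return ascii_upper(haystack).find(ascii_upper(needle)) + 1


def instr_unicode(haystack: str, needle: str) -> int:
    """Same search, but using Unicode-aware uppercasing."""
    return haystack.upper().find(needle.upper()) + 1


print(instr_ascii("abc", "B"))   # 2: Latin letters work
print(instr_ascii("α", "Α"))     # 0: alpha is not mapped, so no match
print(instr_unicode("α", "Α"))   # 1
```

With the ASCII-only mapping, the Greek letters are compared unchanged, which matches the reported result of 0.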
Confirmed in: Version: 7.2.0.0.alpha0+ (x64) / LibreOffice Community Build ID: db35b9086476259fa2c047f2e4dfe7862d026530 CPU threads: 6; OS: Windows 10.0 Build 19042; UI render: Skia/Raster; VCL: win Locale: de-DE (de_DE); UI: en-US Calc: CL
Andreas Heinisch committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/core/commit/7a578c06352328799c644e0399f14d58b05246f9 tdf#139840 - Case-insensitive operation for non-ASCII characters It will be available in 7.2.0. The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
Note that the Unicode standard defines a concept of locale-independent "default caseless matching" (D144 in section 3.13 "Default Case Algorithms", <https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf>), which might be more appropriate to use here than any specific locale-dependent approach.
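As a rough illustration of default caseless matching, the following Python sketch uses `str.casefold()`, which Python documents as following the case folding described in section 3.13 of the Unicode Standard. The function name `caseless_equal` is made up for this example, and nothing here reflects how LibreOffice implements its comparison.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings after case folding each of them."""
    return a.casefold() == b.casefold()


print(caseless_equal("α", "Α"))             # True
print(caseless_equal("Straße", "STRASSE"))  # True: ß folds to "ss"
print(caseless_equal("Straße", "Strase"))   # False
```

Note that full case folding can change the length of a string (ß becomes "ss"), which matters for any function that reports character positions.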
Colleagues, thank you very much for your attention to the topic and the bug fix!
There is still an error with this fix. Consider:
    Sub Test()
        MsgBox InStr(2, "Straße", "s")
    End Sub
It should return 0, but it returns 5. The error comes from the fact that the German ß is replaced by SS.
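The value 5 can be reproduced in a small Python sketch, assuming the search uppercases both strings before comparing (an assumption about the mechanism, not a copy of the patched code). `instr_upper` and `instr_per_char` are hypothetical helpers.

```python
def instr_upper(start: int, haystack: str, needle: str) -> int:
    """1-based InStr-like search that uppercases both strings first."""
    return haystack.upper().find(needle.upper(), start - 1) + 1


def instr_per_char(start: int, haystack: str, needle: str) -> int:
    """1-based search using lowercasing, which keeps ß as a single character."""
    return haystack.lower().find(needle.lower(), start - 1) + 1


print("Straße".upper())                  # STRASSE
print(instr_upper(2, "Straße", "s"))     # 5
print(instr_per_char(2, "Straße", "s"))  # 0
```

Uppercasing expands ß into two characters, so the "s" is found at position 5 of the expanded string, where the original string has ß.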
As Mike Kaganski pointed out in Gerrit: "It might be a tough problem. Note that in Writer, not using regular expressions nor case-sensitive search, the Find & Replace (Ctrl+H) behaves as InStr - it finds 's' in 'ß'. Interesting fact: entering 'ß' in Google Chrome's search box finds 'ss'." So,
    Sub Test()
        MsgBox InStr(2, "Straße", "s")
    End Sub
should indeed return 5 to make it consistent with Writer etc.
The same holds for InStrRev, which is implemented using the toAsciiUpperCase function.
I don't think the Instr function implementation should conform to the "Unicode normalization forms" standard. By the way: in the corresponding SEARCH (Calc, Excel), Instr (VBA) functions, the search is performed character by character and the text is not normalized. In addition, the Replace function is often executed after the Instr function, and normalizing the text would complicate things a lot.
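The concern about a follow-up Replace can be sketched in Python, assuming the search runs on a length-changing transformation of the text (case folding is used here as a stand-in). The variable names are illustrative only.

```python
original = "Straße"
folded = original.casefold()   # "strasse": one character longer than the original

pos = folded.find("ss")        # 0-based offset in the folded text
print(len(original), len(folded), pos)   # 6 7 4

# Naively reusing the folded offset and match length on the original text
# removes too much: "ß" and the following "e" are both replaced.
naive = original[:pos] + "SS" + original[pos + 2:]
print(naive)                   # StraSS
```

Offsets and lengths found in the transformed text no longer line up with the original, so a caller has to map them back before replacing.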
Andreas Heinisch committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/core/commit/afddd56a8049957b9c0e025992d47c04342dbb88 tdf#139840 - Use utl::TextSearch to implement the InStr function It will be available in 7.3.0. The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
Andreas Heinisch committed a patch related to this issue. It has been pushed to "libreoffice-7-2": https://git.libreoffice.org/core/commit/632fd5fd504d9800d580ceeeb87bc2b5d626d56a tdf#139840 - Use utl::TextSearch to implement the InStr function It will be available in 7.2.0.2. The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
(In reply to Vladimir Sokolinskiy from comment #10)
> By the way: in the corresponding SEARCH (Calc, Excel), [...]
> functions, the search is performed character by character and the text is
> not normalized.
In Calc SEARCH("Straße";"ss") returns 5 (as does SEARCH("Strasse";"ß")). That's not about normalization though but due to case-ignore transliteration.
Hello Eike! In the en-US (and ru-RU :) ) localizations, both formulas you specified return the value #VALUE!
SEARCH has a different order of arguments, so in Calc everything is essentially the same as in comment #13 (sorry, I don't know how to edit a previously written comment). Excel returns #VALUE! in both cases (at least in the en-US localization).
Sorry, of course it's SEARCH("ss";"Straße") and SEARCH("ß";"Strasse"). I should have copied instead of typing.
Pierre F committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/help/commit/dc7a2b091a7652d4907391f8e39262b65ae6c780 clarify Instr(). tdf#129436, tdf#139840
Pierre F, many thanks!