Bug 99589 - tolower / toupper - dangerous to Turks ...
Summary: tolower / toupper - dangerous to Turks ...
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice (show other bugs)
Version:
(earliest affected)
5.1.0.1 rc
Hardware: All All
: medium normal
Assignee: Not Assigned
URL:
Whiteboard: target:5.2.0 target:5.3.0 target:5.4.0
Keywords: difficultyBeginner, easyHack, skillCpp, topicCleanup
Depends on:
Blocks:
 
Reported: 2016-04-30 13:23 UTC by Michael Meeks
Modified: 2017-05-18 07:54 UTC (History)
5 users (show)

See Also:
Crash report or crash signature:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Meeks 2016-04-30 13:23:09 UTC
We have had build failures when compiling in the Turkish locale - ie.

export LANG=tr_TR.UTF-8

This is because in this locale - toupper(i) is != 'I' and toupper(ı) == 'I' cf. wikipedia: https://en.wikipedia.org/wiki/Turkish_alphabet#Letters 

bug#99587 - is a first cut at fixing one instance of this but - I think we can conclude on this basis that any use of 'toupper' or 'tolower' in the code is a bug. Instead we should use our own ASCII comparison functions here - ideally using the Ascii variants in sal/ OString / OUString which cope with this.

This sort of thing looks plausibly safe in comparison - albeit ugly:

char easytolower( char in )
{
    if( in<='Z' && in>='A' )
        return in-('Z'-'z');
    return in;
}

$ git grep toupper
$ git grep tolower

Thanks ! =)
Comment 1 Michael Meeks 2016-04-30 18:13:21 UTC
Oh ye of little faith ;-) the demo app:

/* intended for ~/a.c */
#include <stdio.h>
#include <ctype.h>
#include <string.h>
#include <locale.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	setlocale(0,getenv("LANG"));
	fprintf(stderr, "%d -> %d\n", 'I', tolower('I'));
	return 0;
}

$ gcc ~/a.c && LANG=tr_TR.UTF-8 ./a.out
73 -> 73
$ gcc ~/a.c && LANG=C ./a.out
73 -> 105

=)
Comment 2 Julien Nabet 2016-04-30 19:08:48 UTC
Thought you might be interested in this other tdf since it also concerns a problem with I and i in Turkish.
Comment 3 Michael Meeks 2016-05-01 08:10:13 UTC
Perhaps the best way to fix this is a global / manual replace using:

#include <rtl/character.hxx>

and using:

inline sal_uInt32 toAsciiLowerCase(sal_uInt32 code)
inline sal_uInt32 toAsciiUpperCase(sal_uInt32 code)

in place of tolower / toupper.
Comment 4 Gökhan Gurbetoğlu 2016-05-01 08:30:48 UTC
I am starting to apply the fix suggested by Michael and try to make it less dangerous for Turks.
Comment 5 Eike Rathke 2016-05-03 19:10:08 UTC
isupper() and islower() are equally evil. Instead use

inline bool isAsciiLowerCase(sal_uInt32 code)
inline bool isAsciiUpperCase(sal_uInt32 code)

from rtl/character.hxx
Comment 6 Gökhan Gurbetoğlu 2016-05-04 14:35:44 UTC
Apparently, replacing tolower & toupper to toAsciiLowerCase & toAsciiUpperCase respectively breaks other things and the build fails for any language. It may be related to the islower & isupper functions that Eike Rathke mentioned.
Comment 7 Commit Notification 2016-05-09 09:33:07 UTC
Krishna Keshav committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=26d314d2e25945941d49a4872d7ffa27cfc2fdc8

tdf#99589 tolower / toupper - dangerous to Turks ...

It will be available in 5.2.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 8 Commit Notification 2016-05-10 15:27:44 UTC
Krishna Keshav committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=1fcd8dfb70124acc935c24e066dfd3e2144baec9

tdf#99589 tolower / toupper - dangerous to Turks ...

It will be available in 5.2.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 9 Commit Notification 2016-05-30 08:50:02 UTC
apurvapriyadarshi committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=2b99f832dc611355eedace357cd50d976723d1dd

tdf#99589 tolower / toupper - dangerous to Turks ...

It will be available in 5.3.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 10 Commit Notification 2016-06-07 09:25:45 UTC
krishna keshav committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=7a11763823eef04e9602c7de3fff44415061d7cb

tdf#99589 tolower / toupper - dangerous to Turks ...

It will be available in 5.3.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 11 jani 2016-06-14 09:56:22 UTC
Seems solved
Comment 12 Michael Meeks 2016-06-15 08:32:33 UTC
Is it really solved ? I do:

git grep 'toupper'

and still see rather a number of suspicious hits =)

Same for 'tolower'

Of course - each one needs auditing to see if it is a real use =)
Comment 13 Gökhan Gurbetoğlu 2016-06-21 07:34:36 UTC
There are a few missing replacements. Currently working on them (for all tolower, toupper, islower and isupper functions).
Comment 14 Gökhan Gurbetoğlu 2016-06-21 10:48:36 UTC
I have completed the basic replacement of the naughty functions in .cxx files:
tolower -> rtl::toAsciiLowerCase
toupper -> rtl::toAsciiUpperCase
islower -> rtl::isAsciiLowerCase
isupper -> rtl::isAsciiUpperCase

There are also several different similar functions that does similar things on characters, namely: toLower, ToLower, toUpper, ToUpper, etc. Should we check for their behavior if they cause similar problems?

In addition, the first set of functions also appear in .c, .py and .java files. Should we also modify them?
Comment 15 Gökhan Gurbetoğlu 2016-06-21 11:07:28 UTC
I am now starting a full build with locale tr_TR.UTF-8 and will post the results if it is successful.
Comment 16 Commit Notification 2016-06-23 10:00:32 UTC
Gökhan Gurbetoğlu committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=9b9e5cfd2fa629b2e1dc4a193e48a4a4e8d34126

tdf#99589 - tolower / toupper - dangerous to Turks ...

It will be available in 5.3.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 17 Commit Notification 2016-06-23 10:01:52 UTC
Gökhan Gurbetoğlu committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=ea5a5b1dbb669415586520c2b0c526b133aa07e4

tdf#99589 - tolower / toupper - dangerous to Turks ...

It will be available in 5.3.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 18 Stephan Bergmann 2016-06-23 10:23:57 UTC
Please note that replacing calls to isupper, toupper, etc. with calls to rtl::isAsciiUpperCase, rtl::toAsciiUpperCase, etc. is non-trivial.

The rtl functions all take a UTF-32 code unit parameter of type sal_uInt32, and assert that it is in the valid UTF-32 range 0--0x10FFFF.  When the input to the original isupper, toupper, etc. was of type 'char' or 'unsigned char', and the value may have been outside the ASCII range of 0--0x7F (i.e., may have been negative), blindly converting to rtl::isAsciiUpperCase, rtl::toAsciiUpperCase, etc. may cause those asserts to fire now (as those small negative char values are implicitly converted to large sal_uInt32 values).
Comment 19 Gökhan Gurbetoğlu 2016-06-23 10:59:44 UTC
(In reply to Stephan Bergmann from comment #18)
> Please note that replacing calls to isupper, toupper, etc. with calls to
> rtl::isAsciiUpperCase, rtl::toAsciiUpperCase, etc. is non-trivial.
> 
> The rtl functions all take a UTF-32 code unit parameter of type sal_uInt32,
> and assert that it is in the valid UTF-32 range 0--0x10FFFF.  When the input
> to the original isupper, toupper, etc. was of type 'char' or 'unsigned
> char', and the value may have been outside the ASCII range of 0--0x7F (i.e.,
> may have been negative), blindly converting to rtl::isAsciiUpperCase,
> rtl::toAsciiUpperCase, etc. may cause those asserts to fire now (as those
> small negative char values are implicitly converted to large sal_uInt32
> values).

I did not consider this when doing the patch. I tried to locate the functions and replace them accordingly as Michael Meeks suggested at comment #3. What do you suggest to fix these?
Comment 20 Stephan Bergmann 2016-06-23 11:05:47 UTC
One idea that comes to mind is to extend include/rtl/character.hxx:  For every declaration of a function taking such a 'sal_uInt32' parameter, add deleted definitions (using SAL_DELETED_FUNCTION, as character.hxx is URE API) of overloads taking 'char' and 'unsigned char' parameters.  That will prevent dangerous uses of those functions from even compiling.
Comment 21 jani 2016-07-24 06:46:36 UTC Comment hidden (obsolete)
Comment 22 Gökhan Gurbetoğlu 2016-07-25 14:44:13 UTC
(In reply to jan iversen from comment #21)
> A polite ping, still working on this bug ?
Pong!
I will not be contributing to this issue at the moment.
Comment 23 jani 2016-07-25 14:54:40 UTC
As said in comment
Comment 24 Commit Notification 2016-12-28 09:42:41 UTC
abdulwd committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=47f338c1cfca96db8f0384c3498755c4cd6ff2c4

tdf#99589 tolower changed to rtl::toAsciiLowerCase

It will be available in 5.4.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 25 Commit Notification 2017-01-02 14:09:21 UTC
Julien Nabet committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=f8e77c0894f5eb4495edb79a387f0deb63830226

Revert "tdf#99589 tolower changed to rtl::toAsciiLowerCase"

It will be available in 5.4.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 26 jani 2017-01-09 08:52:57 UTC
Does the revert mean there are still work to be done ?
Comment 27 Stephan Bergmann 2017-02-07 15:10:03 UTC
(In reply to jan iversen from comment #26)
> Does the revert mean there are still work to be done ?

To see whether this issue is done, check whether there are any remaining uses of islower, isupper, tolower, toupper in C++ code (and I spot exactly one such use on current master).
Comment 28 Michael Stahl (allotropia) 2017-02-07 15:22:01 UTC
in our own C++ code, there are 2 problematic areas left:

* rsc has a couple calls to tolower/toupper, one of which is for
  a command line flag -I

  these can be replaced with toAsciiLowerCase like comment #3 says

* setup_native/source/win32/wintools/
  there are also some tolower calls in here, one of which is
  a command line flag -I

  because these are MSI DLLs, it is not possible to use sal
  library and toAsciiLowerCase here.

  instead, find some header in there and add the "easytolower"
  function to it, and replace tolower calls with that.

  also, this MSI code can only be built on Windows platform.
Comment 29 Michael Stahl (allotropia) 2017-02-07 15:52:26 UTC
Stephan points out that the "rsc/source/rscpp" stuff is C not C++
so you can't use toAsciiLowerCase there, put another "easytolower"
there too i guess.
Comment 30 Commit Notification 2017-05-18 07:40:18 UTC
dilekuzulmez committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=33cbc99fea5a3e4b87e883fd60ec97d4503109f3

tdf#99589 tolower / toupper - dangerous to Turks ...

It will be available in 5.4.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 31 Stephan Bergmann 2017-05-18 07:47:12 UTC
(In reply to Stephan Bergmann from comment #20)
> One idea that comes to mind is to extend include/rtl/character.hxx:  For
> every declaration of a function taking such a 'sal_uInt32' parameter, add
> deleted definitions (using SAL_DELETED_FUNCTION, as character.hxx is URE
> API) of overloads taking 'char' and 'unsigned char' parameters.  That will
> prevent dangerous uses of those functions from even compiling.

meanwhile done as <https://cgit.freedesktop.org/libreoffice/core/commit/?id=7778d9f51bd1f4d086cafe95995406c3157afb89> "Prevent calls to rtl/character.hxx functions with (signed) char arguments"
Comment 32 Stephan Bergmann 2017-05-18 07:54:16 UTC
(In reply to Stephan Bergmann from comment #27)
> To see whether this issue is done, check whether there are any remaining
> uses of islower, isupper, tolower, toupper in C++ code (and I spot exactly
> one such use on current master).

The only remaining calls to islower, isupper, tolower, toupper are in C code (which cannot use rtl/character.hxx) in rsc/source/rscpp/cpp{3,5}.c.