Bug 125995 - C locale is currently broken for file handling
Summary: C locale is currently broken for file handling
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version: 6.4.0.0.alpha1+ (earliest affected)
Hardware: All Linux (All)
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: needsDevAdvice
Depends on:
Blocks: GTK KDE
Reported: 2019-06-18 18:06 UTC by Jan-Marek Glogowski
Modified: 2020-11-10 16:54 UTC (History)
CC List: 6 users

See Also:
Crash report or crash signature:


Attachments

Description Jan-Marek Glogowski 2019-06-18 18:06:58 UTC
Description:
This is the "extension" of bug 125971.

Something in the local file URL handling is currently broken when you use the C locale, at least on all unix backends. I can't test MacOS and Windows, but since I suspect an error in the URL handling with regard to the current locale setting, at least MacOS might be affected too. Has Windows some equivalent of C locale?

Steps to Reproduce:
1. Have a unicode / UTF8 file system (that's standard I guess)
2. Have a file name with non-ASCII characters (łąka.png - 'LC_ALL=C ls -b' will show the correct UTF8 encoding \305\202\304\205ka.png)
3. Start LO with LANG=C / LC_ALL=C
4. Open the file
5. Export the file

Actual Results:
1. The file picker for "gen" shows the wrong file names. kde5 and gtk3 are fine.
2. After opening, the window title has the file name with a wrong encoding.
3. The recent file list has the file name with wrong encoding (which actually works!)
4. The save dialog has the wrong default name.

5. Saving the file will generate the right file name only for gen.
For gen, the wrong default name on save is consequently correct, and it will ask before overwriting the existing opened file. kde5 and gtk3 will write a new file with a name that is now really wrong.

The saved file name for gtk3 and kde5 is: \303\205\302\202\303\204\302\205ka.png. 

That's the same encoded name I would generate for the fix for bug 125971 via

OUString aNewURL = uri::ExternalUriReferenceTranslator::create(m_xContext)->translateToInternal(toOUString(aURL.toEncoded()));

But this is actually some double encoding, because aURL.toEncoded() is already the correctly encoded UTF8, which LO expects. And it's probably the origin of most of the bug. 

FWIW this is the only encoding variant LO currently accepts from either the kde5 or the gtk3 file picker.
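
For illustration, the double encoding can be reproduced outside LO with a few lines of Python. This is only a sketch of the assumed mechanism (UTF-8 bytes misread as ISO-8859-1 and then encoded as UTF-8 a second time), not LO code; octal_escape is a made-up helper that roughly mimics the 'ls -b' output for non-ASCII bytes.

```python
# Sketch of the suspected double encoding; not taken from the LO sources.
name = "łąka.png"


def octal_escape(raw: bytes) -> str:
    """Roughly mimic `ls -b`: printable ASCII as-is, other bytes as octal escapes."""
    return "".join(chr(b) if 0x20 < b < 0x7F else f"\\{b:03o}" for b in raw)


# The name as it is stored on disk: plain UTF-8.
utf8_once = name.encode("utf-8")

# Misread those bytes as ISO-8859-1, then encode the result as UTF-8 again.
utf8_twice = utf8_once.decode("iso-8859-1").encode("utf-8")

print(octal_escape(utf8_once))   # \305\202\304\205ka.png
print(octal_escape(utf8_twice))  # \303\205\302\202\303\204\302\205ka.png
```

The second line of output matches the file name that gtk3 and kde5 write.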

Expected Results:
The correct file name is used in every place where it is currently wrong according to the "Actual Results".


Reproducible: Always


User Profile Reset: No



Additional Info:
Comment 1 Stephan Bergmann 2019-06-19 07:58:46 UTC
(In reply to Jan-Marek Glogowski from comment #0)
> Description:
> This is the "extension" of bug 125971.
> 
> Something in the local file URL handling is currently broken when you use
> the C locale, at least on all unix backends. I can't test MacOS and Windows,
> but since I suspect an error in the URL handling with regard to the current
> locale setting, at least MacOS might be affected too. Has Windows some
> equivalent of C locale?

I'm not sure why you qualify this issue with "currently".  The behavior should be as it is ever since OOo.

A traditional Unix (incl. Linux) file name is just a sequence of bytes, without a means of specifying in what encoding to interpret those bytes.  Ever since OOo was made Unicode-aware, it wanted to represent pathnames internally as Unicode (UTF-16) strings (whether or not that was a good decision, its consequences permeate the code base and it would probably be hard to change now).  It adopted the convention of translating between a pathname's bytes and the internal OUString according to the system locale that OOo is run with (i.e., LANG/LC_ALL; see osl_getThreadTextEncoding).  (That of course means that there can be problems, e.g. when a pathname consists of a sequence of bytes that is not valid according to osl_getThreadTextEncoding(), or when some internal OUString is to be translated to a pathname's sequence of bytes but contains Unicode letters that cannot be mapped to osl_getThreadTextEncoding().  OOo/LO have always been prone to such problems.  In practice, their impact is reduced by people using a single, consistent system locale (text encoding) to name their files and to run LO, and by many people exclusively using UTF-8 locales anyway these days.)
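
The two failure modes mentioned above can be sketched in Python. The helper names are hypothetical, and the "encoding" argument merely stands in for whatever osl_getThreadTextEncoding() would return; this is not how LO implements the conversion.

```python
# Hypothetical helpers: "encoding" stands in for the thread text encoding.
def pathname_to_internal(raw: bytes, encoding: str) -> str:
    """Interpret a pathname's bytes according to the given encoding."""
    return raw.decode(encoding)


def internal_to_pathname(text: str, encoding: str) -> bytes:
    """Translate an internal Unicode string back to pathname bytes."""
    return text.encode(encoding)


raw = b"\xc5\x82\xc4\x85ka.png"  # UTF-8 bytes of "łąka.png"

# Consistent use of one encoding: the round trip is lossless.
round_trip_ok = internal_to_pathname(pathname_to_internal(raw, "utf-8"), "utf-8") == raw
print("UTF-8 round trip lossless:", round_trip_ok)

# Problem 1: the bytes are not valid in the encoding mandated by the locale.
try:
    pathname_to_internal(b"\xffka.png", "utf-8")
    invalid_bytes_rejected = False
except UnicodeDecodeError:
    invalid_bytes_rejected = True
print("invalid byte sequence rejected:", invalid_bytes_rejected)

# Problem 2: the internal string contains letters the encoding cannot express.
try:
    internal_to_pathname("łąka.png", "iso-8859-1")
    unmappable_rejected = False
except UnicodeEncodeError:
    unmappable_rejected = True
print("unmappable letters rejected:", unmappable_rejected)
```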

> Steps to Reproduce:
> 1. Have a unicode / UTF8 file system (that's standard I guess)

Traditional Unix (incl. Linux) file systems do not have an encoding, see above.
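
A small Python sketch of that point, assuming a typical Linux file system: when the path is passed as bytes, the name is stored and listed as the same bytes, and no encoding is involved at any step. The temporary directory and the variable names exist only for the illustration.

```python
import os
import tempfile

# The bytes that happen to be the UTF-8 form of "łąka.png".
raw_name = b"\xc5\x82\xc4\x85ka.png"

with tempfile.TemporaryDirectory() as tmp:
    tmp_bytes = os.fsencode(tmp)
    # Create an empty file whose name is given as raw bytes.
    with open(os.path.join(tmp_bytes, raw_name), "wb"):
        pass
    # A bytes path in yields bytes names out: nothing is decoded.
    listed = os.listdir(tmp_bytes)

print(listed)
```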

> 2. Have a file name with non-ASCII characters (łąka.png - 'LC_ALL=C ls -b'
> will show the correct UTF8 encoding \305\202\304\205ka.png)
> 3. Start LO with LANG=C / LC_ALL=C

This is the "user mistake".  To operate well with files whose names are encoded in UTF-8, LO should be run with a UTF-8 locale.  Otherwise, problems are expected to occur (see above).

> 4. Open the file
> 5. Export the file
> 
> Actual Results:
> 1. The file picker for "gen" shows the wrong file names. kde5 and gtk3 are
> fine.

Arguably, according to my above explanation, the gen file picker shows the right file name here.  With LANG=C, osl_getThreadTextEncoding() effectively is RTL_TEXTENCODING_ISO_8859_1 (though technically it is RTL_TEXTENCODING_ASCII_US), so you get "ÅÄka.png".

The kde5 and gtk3 file pickers presumably use external library code that doesn't follow LO's convention of interpreting pathnames' byte sequences according to the system locale, but instead always interpret them as UTF-8.  That would explain why the kde5 file picker dialog shows the file's name as "łąka.png" instead of "ÅÄka.png".  But once the kde5 file picker has passed the <file:///.../%C5%82%C4%85ka.png> URL (which is the same URL as the gen file picker passes) to LO's internals, LO will again treat that as representing a pathname whose bytes are interpreted according to osl_getThreadTextEncoding().
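
To make the two readings of the same pathname bytes concrete, here is a minimal Python sketch. It is not LO, Qt or Gtk+ code; the variable names are invented, and it assumes that a picker which treats names as UTF-8 simply percent-encodes the raw bytes into the URL.

```python
from urllib.parse import quote

raw = b"\xc5\x82\xc4\x85ka.png"  # the pathname bytes on disk

# Reading 1: always UTF-8 (what the kde5 and gtk3 pickers presumably do).
as_utf8 = raw.decode("utf-8")

# Reading 2: ISO-8859-1 (what LO effectively uses under LANG=C).
as_latin1 = raw.decode("iso-8859-1")

# Code points of the first four characters of the ISO-8859-1 reading.
code_points = [f"U+{ord(c):04X}" for c in as_latin1[:4]]

# URL payload when the raw bytes are percent-encoded directly.
picker_url_payload = quote(raw)

print(as_utf8)             # łąka.png
print(code_points)         # ['U+00C5', 'U+0082', 'U+00C4', 'U+0085']
print(picker_url_payload)  # %C5%82%C4%85ka.png
```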

> 2. After opening, the window title has the file name with a wrong encoding.

Again, it is the right encoding according to the above.

> 3. The recent file list has the file name with wrong encoding (which
> actually works!)

ditto...
Comment 2 Stephan Bergmann 2019-06-19 10:14:44 UTC
(In reply to Stephan Bergmann from comment #1)
> Arguably, according to my above explanation, the gen file picker shows the
> right file name here.  With LANG=C, osl_getThreadTextEncoding() effectively
> is RTL_TEXTENCODING_ISO_8859_1 (though technically it is
> RTL_TEXTENCODING_ASCII_US), so you get "ÅÄka.png".

(Above and below, Bugzilla apparently dropped the C1 control characters \U+0082 and \U+0085 from "ÅÄka.png", where they should appear after "Å" and after "Ä", respectively.)

> The kde5 and gtk3 file pickers presumably use external library code that
> doesn't follow LO's convention of interpreting pathnames' byte sequences
> according to the system locale, but instead always interpret them as UTF-8. 
> That would explain why the kde5 file picker dialog shows the file's name as
> "łąka.png" instead of "ÅÄka.png".  But once the kde5 file picker has passed
> the <file:///.../%C5%82%C4%85ka.png> URL (which is the same URL as the gen
> file picker passes) to LO's internals, LO will again treat that as
> representing a pathname whose bytes are interpreted according to
> osl_getThreadTextEncoding().

Sorry, the above "which is the same URL as the gen file picker passes" is wrong:  With LANG=C, LO interprets that file name as written with the characters

  \U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
  \U+0082 <control>
  \U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS
  \U+0085 <control>
  \U+006B LATIN SMALL LETTER K
  ...

and "LO internal file URLs" always have their "payload" encoded as UTF-8 (see udkapi/com/sun/star/uri/XExternalUriReferenceTranslator.idl), so the LO internal file URL that the gen file picker returns is <file:///.../%C3%85%C2%82%C3%84%C2%85ka.png>.  (And when LO wants to access the actual file and converts that URL back to a pathname byte sequence under LANG=C, it first converts from the URL syntax "%C3%85%C2%82%C3%84%C2%85ka.png" to an OUString containing

  \U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
  \U+0082 <control>
  \U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS
  \U+0085 <control>
  \U+006B LATIN SMALL LETTER K
  ...

code units, and then, because of the osl_getThreadTextEncoding() mandated by LANG=C, to the correct byte sequence "\xC5\x82\xC4\x85ka.png".)
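
The round trip described above can be traced step by step in Python. This is a sketch that uses urllib.parse in place of LO's own URL code and ISO-8859-1 in place of the encoding that LANG=C effectively selects; only the final path segment is handled, since the directory part of the URL is elided above.

```python
from urllib.parse import quote, unquote

raw = b"\xc5\x82\xc4\x85ka.png"  # pathname bytes on disk

# Pathname bytes -> internal string, per the encoding in effect under LANG=C.
text = raw.decode("iso-8859-1")

# Internal string -> internal URL payload, which is always UTF-8 based.
internal_payload = quote(text)
print(internal_payload)  # %C3%85%C2%82%C3%84%C2%85ka.png

# And back: URL payload -> internal string -> pathname bytes.
back_text = unquote(internal_payload)
back_raw = back_text.encode("iso-8859-1")
print(back_raw == raw)  # True
```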
Comment 3 Jan-Marek Glogowski 2019-06-19 16:02:20 UTC
Thanks for the input. The gen file picker also works - as expected - if I set LANG=C.UTF-8. I forgot that, thanks for the reminder.

(In reply to Stephan Bergmann from comment #1)
> (In reply to Jan-Marek Glogowski from comment #0)
> 
> I'm not sure why you qualify this issue with "currently".  The behavior
> should be as it is ever since OOo.

Ok.

So I tried to find the definition for the C / POSIX locale and found: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html and http://port70.net/~nsz/c/c89/c89-draft.html

What I couldn't actually find is a setting that has any meaning with regard to the filesystem encoding. If you say file names are just strings, then LC_CTYPE might match, but IMHO that's a long shot today.

So now the question arises: should we, and can we, change the C locale to default to UTF-8, which is much more common today, instead of ISO_8859_1 for interpreting filenames?
I don't know if this actually does match. And I'm also not sure this is feasible in a sensible manner in LO, but somehow Qt and Gtk+ make this assumption and act accordingly.

I'm not aware of any other system API that could be queried for something like that. I tested gimp on Windowmaker with xterm and LANG=C. Still, there might be some system setting.
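
Just to sketch what such a default could look like, here is a hypothetical policy in Python. It is not what LO does today and not a claim about how Qt or Gtk+ implement it; filename_encoding and decode_filename are made-up names. The idea is to treat an ASCII-only locale encoding as UTF-8 for file names and to let undecodable bytes survive a round trip.

```python
import codecs


def filename_encoding(locale_encoding: str) -> str:
    """Hypothetical policy: an ASCII-only (C/POSIX) locale means UTF-8 file names."""
    canonical = codecs.lookup(locale_encoding).name
    return "utf-8" if canonical == "ascii" else canonical


def decode_filename(raw: bytes, locale_encoding: str) -> str:
    """Decode pathname bytes; undecodable bytes are kept as lone surrogates."""
    return raw.decode(filename_encoding(locale_encoding), errors="surrogateescape")


def encode_filename(text: str, locale_encoding: str) -> bytes:
    """Inverse of decode_filename under the same policy."""
    return text.encode(filename_encoding(locale_encoding), errors="surrogateescape")


raw = b"\xc5\x82\xc4\x85ka.png"
print(decode_filename(raw, "US-ASCII"))    # łąka.png
print(filename_encoding("ISO-8859-1"))     # iso8859-1

# Bytes that are not valid UTF-8 still round-trip unchanged.
odd = b"\xffka.png"
print(encode_filename(decode_filename(odd, "US-ASCII"), "US-ASCII") == odd)  # True
```

An explicitly configured non-ASCII, non-UTF-8 locale keeps its own encoding under this policy, so existing setups that name files consistently in such a locale would not change.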
Comment 4 Buovjaga 2020-11-10 16:54:30 UTC
(In reply to Jan-Marek Glogowski from comment #0)
> Steps to Reproduce:
> 1. Have a unicode / UTF8 file system (that's standard I guess)
> 2. Have a file name with non-ASCII characters (łąka.png - 'LC_ALL=C ls -b'
> will show the correct UTF8 encoding \305\202\304\205ka.png)
> 3. Start LO with LANG=C / LC_ALL=C
> 4. Open the file
> 5. Export the file
> 
> Actual Results:
> 1. The file picker for "gen" shows the wrong file names. kde5 and gtk3 are
> fine.
> 2. After opening, the window title has the file name with a wrong encoding.
> 3. The recent file list has the file name with wrong encoding (which
> actually works!)
> 4. The save dialog has the wrong default name.

I confirm with gen. Step 4 should be "Insert - Image". I don't understand the mention of "after opening", "window title" and "recent file list", because they don't apply to inserted images.

Arch Linux 64-bit
Version: 7.1.0.0.alpha1+
Build ID: c9b320c32aceab7e22d381b688e7b44030e01c2d
CPU threads: 8; OS: Linux 5.9; UI render: default; VCL: x11
Locale: fi-FI (fi_FI.UTF-8); UI: en-US
Calc: threaded
Built on 8 November 2020