Bug 115056 - FILESAVE Calc doesn't write CSV as UTF-8
Status: RESOLVED DUPLICATE of bug 82254
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version (earliest affected): 5.4.4.2 release
Hardware: x86-64 (AMD64) Windows (All)
Importance: medium normal
Assignee: Not Assigned
URL: http://www.unicode.org/faq/utf_bom.html
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-01-16 21:50 UTC by librebug
Modified: 2018-01-17 19:45 UTC (History)

See Also:
Crash report or crash signature:


Description librebug 2018-01-16 21:50:03 UTC
Description:
I create a CSV file "u.csv" in UTF-8 form with BOM which shows the single character "ü"; it contains exactly the seven hex bytes

    ef bb bf c3 bc 0d 0a

When I open it in Calc, "ü" appears correctly in cell A1. I add and immediately delete a space character, then save the file with "Save" and also with "Save as" under the name "uu.csv".

Both are written without the UTF-8 BOM, but with "ü" as a multi-byte character, so they are now incorrect: the file consists exactly of the four hex bytes

    c3 bc 0d 0a

and re-opening the files in Calc fails to recognize them as UTF-8, giving "Ã¼" in A1. Other programs may or may not treat the file as UTF-8, because it lacks the BOM.
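
The byte sequences above can be checked outside Calc. The following is a minimal Python sketch, not anything Calc itself runs. It assumes that the fallback encoding a reader picks for a BOM-less file is Windows-1252 ("cp1252"), which is typical for an en-US Windows system but is an assumption here.

```python
# The two byte sequences quoted in this report.
with_bom = bytes.fromhex("ef bb bf c3 bc 0d 0a")
without_bom = bytes.fromhex("c3 bc 0d 0a")

# "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
text_with_bom = with_bom.decode("utf-8-sig")

# Without the BOM the bytes are still valid UTF-8 when decoded as such.
text_as_utf8 = without_bom.decode("utf-8")

# Assumption: a reader falling back to cp1252 turns c3 bc into two characters.
text_as_cp1252 = without_bom.decode("cp1252")

# Compare the lengths: one character plus CR LF versus two characters plus CR LF.
lengths = (len(text_with_bom), len(text_as_utf8), len(text_as_cp1252))
```

Under that assumption the first two results are the single character "ü" followed by CR LF, while the third is the two-character sequence "Ã¼" followed by CR LF.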

This could hardly be plainer: Calc should write UTF-8 files with the BOM. At the very least it should offer the user the chance to specify the character set for writing a sheet, and act appropriately. Its behavior now is wrong.

Ideally it should be possible to specify a global default character set for text formats, with per-sheet formats possible and retained in .ODS files. The same should go for other components (e.g., Writer) where applicable.

Note: I deal with spreadsheets that mix a huge variety of languages, including Korean, Chinese, Russian, Polish, Thai, and all European languages, so handling UTF-8 correctly is extremely important to me.  With Calc right now it's quite painful to do this reliably when exchanging CSV files.

Steps to Reproduce:
1. Create a UTF-8 CSV file (with BOM) containing the single character "ü".
2. Read it with Calc.
3. Make a null change.
4. Save the file.
5. Calc can no longer read the file correctly. It contains a multi-byte character, but no BOM.
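
Steps 1 and 5 can be scripted so the test file is byte-exact and the result of the save can be inspected. This is only a sketch: the helper names `write_bom_csv` and `has_utf8_bom` are hypothetical, and the temporary directory stands in for wherever "u.csv" is kept.

```python
import codecs
import os
import tempfile


def write_bom_csv(path, text="ü"):
    """Write BOM + UTF-8 text + CR LF, i.e. ef bb bf c3 bc 0d 0a for "ü"."""
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + text.encode("utf-8") + b"\r\n")


def has_utf8_bom(path):
    """Return True if the file starts with the three-byte UTF-8 BOM."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


# Step 1: create the test file (a temporary directory is used for illustration).
workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "u.csv")
write_bom_csv(original)
bom_before_save = has_utf8_bom(original)  # True for the freshly written file

# After steps 2-4 in Calc, run has_utf8_bom() on the saved file;
# with the behavior described in this report it would return False.
```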

Actual Results:  
The file no longer signals itself as UTF-8, and Calc reads the contents as "Ã¼". UTF-8 CSV files with Chinese characters, Russian characters, Thai characters, accented European characters, etc. are all wrecked for Calc by its own actions.

Expected Results:
"ü" in A1.  Chinese characters, Russian characters, Thai characters, accented European characters, etc. appear correctly.


Reproducible: Always


User Profile Reset: Yes


OpenGL enabled: Yes

Additional Info:
Version: 5.4.4.2 (x64)
Build ID: 2524958677847fb3bb44820e40380acbe820f960
CPU threads: 8; OS: Windows 6.1; UI render: default; 
Locale: en-US (en_US); Calc: group


User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0
Comment 1 Regina Henschel 2018-01-17 19:45:51 UTC
UTF-8 may use a BOM, but does not need one. A BOM can be dangerous when a csv file is used in a database environment.

When you open a csv file, use the filter "Text - Choose Encoding" in Writer and the filter "Text CSV" in Calc. In both cases you get a dialog to choose the encoding. Both filters detect by themselves whether a BOM exists or not.

Because there are applications that are not able to handle UTF-8 without a BOM, users should have a means to decide whether to write a BOM or not. Such an enhancement request already exists as bug 82254.

*** This bug has been marked as a duplicate of bug 82254 ***
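
For illustration only, the kind of user-selectable BOM option discussed in this thread could look like the Python sketch below. It is not LibreOffice code and does not describe what bug 82254 implements; the function name `save_csv` and its `write_bom` parameter are hypothetical. It relies on Python's "utf-8-sig" codec, which writes a BOM when encoding.

```python
import csv


def save_csv(path, rows, write_bom=True):
    """Write rows as UTF-8 CSV, optionally preceded by a BOM (hypothetical helper)."""
    # "utf-8-sig" emits ef bb bf before the first character; "utf-8" does not.
    encoding = "utf-8-sig" if write_bom else "utf-8"
    # newline="" lets the csv module emit its default CR LF line terminator.
    with open(path, "w", encoding=encoding, newline="") as f:
        csv.writer(f).writerows(rows)
```

Calling `save_csv("u.csv", [["ü"]])` should reproduce the seven bytes from the description, and `write_bom=False` the four-byte form.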