115056 – FILESAVE Calc doesn't write CSV as UTF-8

Summary: FILESAVE Calc doesn't write CSV as UTF-8

Status:	RESOLVED DUPLICATE of bug 82254

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Calc
Version: (earliest affected)	5.4.4.2 release
Hardware:	x86-64 (AMD64) Windows (All)

Importance:	medium normal
Assignee:	Not Assigned

URL:	http://www.unicode.org/faq/utf_bom.html
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2018-01-16 21:50 UTC by librebug
Modified:	2018-01-17 19:45 UTC
CC List:	2 users

See Also:
Crash report or crash signature:

Description librebug 2018-01-16 21:50:03 UTC

Description:
I create a CSV file "u.csv", encoded as UTF-8 with a BOM, that contains the single character "ü"; it consists of exactly these seven bytes (hex):

ef bb bf c3 bc 0d 0a

When I open it in Calc, "ü" appears correctly in cell A1. I add and immediately delete a space character, then save the file with "Save" and also with "Save As" under the name "uu.csv".

Both are written without the UTF-8 BOM, but with "ü" as a multi-byte character, so they are now incorrect: each file consists of exactly these four bytes (hex):

c3 bc 0d 0a

and re-opening the files in Calc fails to recognize them as UTF-8, giving "Ã¼" in A1. Other programs may or may not treat the file as UTF-8, because it lacks the BOM.
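The byte sequences above can be checked outside Calc. The following is a minimal Python sketch, not anything Calc itself runs; it assumes that the fallback encoding applied to the BOM-less file is Windows-1252, which the report does not state but which is consistent with the "Ã¼" result.

```python
import codecs

# The two byte sequences quoted in the report.
original = bytes.fromhex("ef bb bf c3 bc 0d 0a")  # file as created, with BOM
resaved = bytes.fromhex("c3 bc 0d 0a")            # file after saving from Calc

# The saved file is the original minus the three-byte UTF-8 BOM.
assert original.startswith(codecs.BOM_UTF8)
assert resaved == original[len(codecs.BOM_UTF8):]

# The same four bytes, interpreted two ways.
as_utf8 = resaved.decode("utf-8")      # correct interpretation
as_cp1252 = resaved.decode("cp1252")   # ASSUMPTION: fallback is Windows-1252

print(repr(as_utf8))
print(repr(as_cp1252))
```

Under that assumption the second decode yields the two characters "Ã" and "¼" in place of "ü", which matches what appears in A1.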

This could hardly be plainer: Calc should write UTF-8 files with the BOM. At the very least it should offer the user the chance to specify the character set for writing a sheet, and act appropriately. Its behavior now is wrong.

Ideally it should be possible to specify a global default character set for text formats, with per-sheet formats possible and retained in .ODS files. The same should go for other components (e.g., Writer) where applicable.

Note: I deal with spreadsheets that mix a huge variety of languages, including Korean, Chinese, Russian, Polish, Thai, and all European languages, so handling UTF-8 correctly is extremely important to me. With Calc right now it's quite painful to do this reliably when exchanging CSV files.

Steps to Reproduce:
1. Create a UTF-8 CSV file (with BOM) containing the single character "ü".
2. Read it with Calc.
3. Make a null change.
4. Save the file.
5. Reopen the saved file in Calc. It is no longer read correctly: it contains a multi-byte character, but no BOM.
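Step 1 and the BOM check in step 5 can be scripted. This is a rough Python sketch; the helper names `write_test_csv` and `has_utf8_bom` are made up for illustration, and the file is written to a temporary directory rather than to wherever Calc would open it.

```python
import codecs
import os
import tempfile


def write_test_csv(path):
    """Write the seven-byte test file: UTF-8 BOM, 'ü', CRLF."""
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + "ü\r\n".encode("utf-8"))


def has_utf8_bom(path):
    """Return True if the file starts with the bytes EF BB BF."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


# Create the test file (hypothetical location; adjust as needed).
workdir = tempfile.mkdtemp()
test_path = os.path.join(workdir, "u.csv")
write_test_csv(test_path)

# True for the freshly written file; running the same check on the
# file after saving it from Calc shows whether the BOM survived.
print(has_utf8_bom(test_path))
```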

Actual Results:
The file no longer signals itself as UTF-8, and Calc reads the contents as "Ã¼". UTF-8 CSV files with Chinese characters, Russian characters, Thai characters, accented European characters, etc. are all wrecked for Calc by its own actions.

Expected Results:
"ü" in A1. Chinese characters, Russian characters, Thai characters, accented European characters, etc. appear correctly.

Reproducible: Always

User Profile Reset: Yes

OpenGL enabled: Yes

Additional Info:
Version: 5.4.4.2 (x64)
Build ID: 2524958677847fb3bb44820e40380acbe820f960
CPU threads: 8; OS: Windows 6.1; UI render: default;
Locale: en-US (en_US); Calc: group

User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0

Comment 1 Regina Henschel 2018-01-17 19:45:51 UTC

UTF-8 may use a BOM, but does not need one. A BOM can be dangerous when a CSV file is used in a database environment.

When you open a CSV file, use the filter "Text - Choose Encoding" in Writer or the filter "Text CSV" in Calc. In both cases you get a dialog to choose the encoding. Both filters detect by themselves whether a BOM exists.

Because there are applications that cannot handle UTF-8 without a BOM, users should have a means to decide whether to write a BOM or not. Such an enhancement request already exists as bug 82254.

*** This bug has been marked as a duplicate of bug 82254 ***
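
For anyone who needs the choice between BOM and no BOM today when producing CSV files outside LibreOffice, it can be sketched in Python with the standard `utf-8-sig` codec. This is only an illustration of the idea, not how Calc or the requested enhancement works; the function names `write_csv` and `read_csv` and the `with_bom` parameter are invented for this example.

```python
import csv
import os
import tempfile


def write_csv(path, rows, with_bom):
    """Write rows as UTF-8 CSV, with or without a leading BOM."""
    # "utf-8-sig" emits EF BB BF before the data; "utf-8" does not.
    encoding = "utf-8-sig" if with_bom else "utf-8"
    with open(path, "w", encoding=encoding, newline="") as f:
        csv.writer(f).writerows(rows)


def read_csv(path):
    """Read UTF-8 CSV, accepting files with or without a BOM."""
    # On reading, "utf-8-sig" skips a BOM if one is present.
    with open(path, "r", encoding="utf-8-sig", newline="") as f:
        return list(csv.reader(f))


workdir = tempfile.mkdtemp()
bom_path = os.path.join(workdir, "with_bom.csv")
plain_path = os.path.join(workdir, "without_bom.csv")

write_csv(bom_path, [["ü"]], with_bom=True)
write_csv(plain_path, [["ü"]], with_bom=False)

# Both variants read back to the same cell contents.
print(read_csv(bom_path), read_csv(plain_path))
```

Reading with `utf-8-sig` works for both variants only because the reader already knows the data is UTF-8; a program that guesses the encoding, as described in the report, may still need the BOM.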