Bug 82254 - FILESAVE: UTF-8 BOM removed from CSV when saving file
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version (earliest affected): 4.3.0.4 release
Hardware: All All
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Duplicates: 115056
Depends on:
Blocks: CSV-Export
Reported: 2014-08-06 15:32 UTC by Jose Lameira
Modified: 2019-05-05 20:27 UTC (History)
CC List: 4 users

See Also:
Crash report or crash signature:


Attachments
BOM-test (63 bytes, text/plain)
2014-08-06 15:32 UTC, Jose Lameira

Description Jose Lameira 2014-08-06 15:32:25 UTC
Created attachment 104157
BOM-test

When a file is saved in Calc, the UTF-8 BOM is removed even when it exists in the source file.

Steps to reproduce:
Open the BOM-test.csv (this one starts with UTF-8 BOM)
Choose Save As
The resulting file has the same content but no UTF-8 BOM

If the source file starts with a BOM, the exported one should also have a BOM, or at least we should be able to choose what we need in the export filter.
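
To check the result of the steps above without a hex editor, a minimal Python sketch can look at the first three bytes of the file. The function name `has_utf8_bom` is made up for this illustration; it is not part of LibreOffice.

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the UTF-8 BOM (EF BB BF)."""
    with open(path, "rb") as f:
        # Read only as many bytes as the BOM is long (3 for UTF-8).
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8
```

Running it on the original attachment and on the file written by Save As should return True for the first and False for the second if the bug is present.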
Comment 1 Robinson Tryon (qubit) 2014-08-16 08:59:53 UTC
TESTING on LO 4.3.1.1 + Ubuntu 12.04

(In reply to comment #0)
> When a file is saved in Calc UTF-8 BOM is removed even when it exists on the
> source file.

Confirmed.

> Steps to reproduce:
> Open the BOM-test.csv (this one starts with UTF-8 BOM)
> Choose Save As
> Resulting file will be similar but without UTF-8 BOM
> 
> If source file starts with BOM, exported one should also have BOM, or at
> least we should be able to choose on the export filter what we need.

+1

I can understand that in the import process, some aspects of a file (e.g. the BOM) might be stripped away and not re-included in the export process, although ideally there would be consistency, especially if someone wants to use LibreOffice to edit shared CSV files.

Sounds like a reasonable enhancement request.

Status -> NEW
Comment 2 Jose Lameira 2014-08-18 10:13:34 UTC
For us the problem is exactly this: a CSV is generated internally by one application, audited (and corrected if needed) with LO, and then uploaded to a 3rd party server. This 3rd party server detects the encoding through the BOM, so it needs it there.

Since LO strips the BOM starting from version 4.3, this is a blocker for us. For this task we are using LO 4.2.

Although I agree with you that this is not a blocker for the majority of people, this was introduced in LO 4.3, so I wonder whether it should be classified as a regression.
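
To illustrate why a missing BOM breaks this kind of workflow, here is a rough Python sketch of how a consumer might pick the encoding from the BOM. The actual logic of the 3rd party server is not known; `sniff_encoding` and the `cp1252` fallback are assumptions chosen for the example.

```python
import codecs

# Longer BOMs are checked first: the UTF-32 LE BOM begins with the
# same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(data, fallback="cp1252"):
    """Return (encoding, payload without BOM); use `fallback` when no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    # No BOM: the consumer has to guess (hypothetical legacy default).
    return fallback, data
```

With such a consumer, the same UTF-8 bytes are decoded correctly when the BOM is present and are misread in the fallback encoding when it has been stripped.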
Comment 3 Dag Bakke 2015-01-05 07:55:22 UTC
Ouch.

Not sure if this relates to shared code, but when saving a UTF-8 text file from Writer (4.3.1.2), it will be saved *with* a BOM, even if the source file did not have one.

And my application is not suited to handle the BOM and rejects the resulting file.
Comment 4 Robinson Tryon (qubit) 2015-01-10 23:56:58 UTC
As it sounds like this affects all platforms:
OS/Arch -> All

(In reply to Dag Bakke from comment #3)
> Not sure if this relates to shared code, but when saving a utf-8 text file
> from writer (4.3.1.2), it will be saved *with* a BOM, even if the source
> file did not have one. 

Yep, sounds like a similar problem. I've no idea how much is shared in the filesave process, though, so I'll check w/the devs.
Comment 5 Robinson Tryon (qubit) 2015-01-11 00:32:33 UTC
(In reply to Dag Bakke from comment #3)
> Not sure if this relates to shared code, but when saving a utf-8 text file
> from writer (4.3.1.2), it will be saved *with* a BOM, even if the source
> file did not have one. 

Dag: some information:
The Writer codepath is separate from the Calc one, so please file a separate bug for that issue.

Per Wikipedia (https://en.wikipedia.org/wiki/Byte_order_mark):
"The Unicode Standard permits the BOM in UTF-8,[2] but does not require or recommend its use....The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work."

In this case, your file didn't originally have a BOM, so I definitely suggest that we don't *add* one.
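
The behaviour suggested here (keep a BOM that was there, never add one that was not) can be sketched in Python as follows. This only illustrates the round-trip rule and is not how the LibreOffice import and export filters are implemented; `split_bom` and `roundtrip` are hypothetical names.

```python
import codecs


def split_bom(data):
    """Return (had_bom, payload) for UTF-8 encoded bytes."""
    if data.startswith(codecs.BOM_UTF8):
        return True, data[len(codecs.BOM_UTF8):]
    return False, data


def roundtrip(data, edit):
    """Decode, apply `edit` to the text, and re-encode, preserving the BOM state."""
    had_bom, payload = split_bom(data)
    text = edit(payload.decode("utf-8"))
    out = text.encode("utf-8")
    # Write a BOM only if the input had one.
    return codecs.BOM_UTF8 + out if had_bom else out
```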
Comment 6 Henrique Clemente 2015-09-23 08:52:30 UTC
+1

To me this is a major bug, since I'm using iMacros on a daily basis and I always have to re-encode my CSV files with Notepad++.
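
As a stopgap, the manual re-encoding step could be scripted. The sketch below assumes the file saved by Calc is already valid UTF-8 and only lacks the BOM; `add_utf8_bom` is a name invented for this example.

```python
import codecs


def add_utf8_bom(path):
    """Prepend the UTF-8 BOM to the file in place; return False if it already has one."""
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(codecs.BOM_UTF8):
        return False  # nothing to do, avoid writing a second BOM
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + data)
    return True
```

The function does not change anything when the BOM is already present, so it is safe to run on every exported file.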
Comment 7 Regina Henschel 2018-01-17 19:45:51 UTC
*** Bug 115056 has been marked as a duplicate of this bug. ***
Comment 8 librebug 2018-01-18 11:01:44 UTC
The state of how applications handle UTF-8 is a mess: some look for the BOM, some treat it as noise and ignore it, some require it for reading files but omit it on writing them, some treat it as an error and fail, and many don't even know that UTF-8 exists. I have no simple global solution to propose for this, but I do think LO itself should be consistent, and that the rules should be in the documentation and easy to find. It's not exactly obvious that one has to click "Save filter settings", and the process is clumsy for someone like me who has to do it almost every time.

I'd like to see an option -- global or per LO component -- to set a writing mode for plain text formats, if necessary even per nominal file type (.txt, .dif, .slk, .csv):

    Coding (UTF-8, UTF-16, Windows blahblah, Mac blahblah, Linux blahblah)

    If UTF, then with or without BOM

And let me set my own defaults.
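
A rough Python sketch of what such per-file-type defaults could look like. The names `TextExportSettings` and `DEFAULTS` and the values in the table are made up for illustration and do not correspond to any existing LibreOffice option.

```python
import codecs
from dataclasses import dataclass

# BOM bytes for the UTF encodings covered by this sketch.
_BOM_FOR = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
}


@dataclass
class TextExportSettings:
    encoding: str = "utf-8"
    write_bom: bool = False

    def encode(self, text):
        """Encode `text`; a BOM is written only for UTF encodings and only on request."""
        payload = text.encode(self.encoding)
        bom = _BOM_FOR.get(self.encoding.lower()) if self.write_bom else None
        return bom + payload if bom else payload


# Hypothetical user defaults per nominal file type.
DEFAULTS = {
    ".csv": TextExportSettings("utf-8", write_bom=True),
    ".txt": TextExportSettings("utf-8", write_bom=False),
}
```

For a non-UTF encoding the BOM flag is simply ignored, which matches the "If UTF, then with or without BOM" rule above.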