Bug 82254 - FILESAVE: UTF-8 BOM removed from CSV when saving file
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version (earliest affected): 4.3.0.4 release
Hardware: All All
Importance: medium enhancement
Assignee: Andreas Heinisch
URL:
Whiteboard: target:7.6.0
Keywords:
Duplicates: 115056
Depends on:
Blocks: CSV-Export
Reported: 2014-08-06 15:32 UTC by Jose Lameira
Modified: 2023-03-15 10:19 UTC (History)

Attachments
BOM-test (63 bytes, text/plain)
2014-08-06 15:32 UTC, Jose Lameira
Export Text File dialog (15.00 KB, image/png)
2023-01-23 10:40 UTC, Andreas Heinisch

Description Jose Lameira 2014-08-06 15:32:25 UTC
Created attachment 104157: BOM-test

When a file is saved in Calc, the UTF-8 BOM is removed even when it exists in the source file.

Steps to reproduce:
Open the BOM-test.csv (this one starts with UTF-8 BOM)
Choose Save As
Resulting file will be similar but without UTF-8 BOM

If the source file starts with a BOM, the exported one should also have a BOM, or at least we should be able to choose what we need in the export filter.
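
To verify the steps above outside of Calc, the first bytes of the file can be compared before and after saving. This is only a minimal sketch: the helper name `has_utf8_bom` is made up for illustration, and it assumes the file is readable from a local path.

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the UTF-8 BOM (EF BB BF)."""
    with open(path, "rb") as f:
        # Read exactly as many bytes as the BOM is long (3) and compare.
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8
```

Running it on BOM-test.csv before and after "Save As" should show whether the export kept the BOM.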
Comment 1 Robinson Tryon (qubit) 2014-08-16 08:59:53 UTC
TESTING on LO 4.3.1.1 + Ubuntu 12.04

(In reply to comment #0)
> When a file is saved in Calc UTF-8 BOM is removed even when it exists on the
> source file.

Confirmed.

> Steps to reproduce:
> Open the BOM-test.csv (this one starts with UTF-8 BOM)
> Choose Save As
> Resulting file will be similar but without UTF-8 BOM
> 
> If source file starts with BOM, exported one should also have BOM, or at
> least we should be able to choose on the export filter what we need.

+1

I can understand that in the import process, some aspects of a file (e.g. the BOM) might be stripped away and not re-included in the export process, although ideally there would be consistency, especially if someone wants to use LibreOffice to edit shared CSV files.

Sounds like a reasonable enhancement request.

Status -> NEW
Comment 2 Jose Lameira 2014-08-18 10:13:34 UTC
For us, the problem concerns a CSV that is generated internally by one application, audited (and corrected if needed) with LO, and then uploaded to a 3rd-party server. This 3rd-party server determines the encoding from the BOM, so it needs to be there.

Since LO strips the BOM starting from version 4.3, this is a blocker for us. For this task we are still using LO 4.2.

Although I agree with you that this is not a blocker for the majority of people, this behavior was introduced in LO 4.3, so I wonder whether it should be classified as a regression.
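
For illustration, a consumer that picks the encoding from the BOM might behave roughly like the sketch below. The actual 3rd-party server is not described in this report, so the function name `decode_by_bom` and the `cp1252` fallback are pure assumptions; the point is only that without a BOM such a consumer can fall back to the wrong encoding.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_by_bom(data, fallback="cp1252"):
    """Decode `data` using the encoding announced by its BOM.

    Returns (text, encoding). The fallback encoding is a hypothetical
    default for files that carry no BOM.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding), encoding
    return data.decode(fallback), fallback
```

With this kind of consumer, the same UTF-8 bytes decode correctly with the BOM and turn into mojibake without it.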
Comment 3 Dag Bakke 2015-01-05 07:55:22 UTC
Ouch.

Not sure if this relates to shared code, but when saving a utf-8 text file from writer (4.3.1.2), it will be saved *with* a BOM, even if the source file did not have one. 

And my application is not suited to handle the BOM and rejects the resulting file.
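
For the opposite case described here, where the consuming application rejects a BOM, one possible post-processing step is to drop the BOM after saving. This is a sketch only; `strip_utf8_bom` is an illustrative name and has nothing to do with the Writer code.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 BOM; other input is returned unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```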
Comment 4 Robinson Tryon (qubit) 2015-01-10 23:56:58 UTC
As it sounds like this affects all platforms:
OS/Arch -> All

(In reply to Dag Bakke from comment #3)
> Not sure if this relates to shared code, but when saving a utf-8 text file
> from writer (4.3.1.2), it will be saved *with* a BOM, even if the source
> file did not have one. 

Yep, sounds like a similar problem. I've no idea how much is shared in the filesave process, though, so I'll check w/the devs.
Comment 5 Robinson Tryon (qubit) 2015-01-11 00:32:33 UTC
(In reply to Dag Bakke from comment #3)
> Not sure if this relates to shared code, but when saving a utf-8 text file
> from writer (4.3.1.2), it will be saved *with* a BOM, even if the source
> file did not have one. 

Dag: some information:
The Writer codepath is separate from the Calc one, so please file a separate bug for that issue.

Per Wikipedia (https://en.wikipedia.org/wiki/Byte_order_mark):
"The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. ... The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work."

In this case, your file didn't originally have a BOM, so I definitely suggest that we don't *add* one.
Comment 6 Henrique Clemente 2015-09-23 08:52:30 UTC
+1

To me this is a major bug, since I use iMacros on a daily basis and always have to re-encode my CSV files with Notepad++.
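
The manual re-encoding step could in principle be scripted. The sketch below assumes the file Calc wrote is already valid UTF-8 and only lacks the BOM; `ensure_utf8_bom` is a made-up helper name.

```python
import codecs
from pathlib import Path


def ensure_utf8_bom(path):
    """Prepend the UTF-8 BOM to the file at `path` if it is missing.

    Returns True if the file was modified, False if it already had a BOM.
    Assumes the file content is already UTF-8 encoded.
    """
    p = Path(path)
    data = p.read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        return False
    p.write_bytes(codecs.BOM_UTF8 + data)
    return True
```

Calling it twice on the same file is harmless, since the second call finds the BOM and leaves the file alone.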
Comment 7 Regina Henschel 2018-01-17 19:45:51 UTC
*** Bug 115056 has been marked as a duplicate of this bug. ***
Comment 8 librebug 2018-01-18 11:01:44 UTC
The state of how applications handle UTF-8 is a mess: some look for the BOM, some treat it as noise and ignore it, some require it for reading files but omit it on writing them, some treat it as an error and fail, and many don't even know that UTF-8 exists. I have no simple global solution to propose for this, but I do think LO itself should be consistent, and that the rules should be in the documentation and easy to find. It's not exactly obvious to have to click "Save filter settings"; and the process is clumsy for someone like me who has to do it almost every time.

I'd like to see an option -- global or per LO component -- to set a writing mode for plain text formats, if necessary even per nominal file type (.txt, .dif, .slk, .csv):

    Coding (UTF-8, UTF-16, Windows blahblah, Mac blahblah, Linux blahblah)

    If UTF, then with or without BOM

And let me set my own defaults.
Comment 9 Andreas Heinisch 2022-06-28 11:12:02 UTC
Could this be solved like in https://bugs.documentfoundation.org/show_bug.cgi?id=75263

That is, check for the charset and the BOM, and include it in the export process?
Comment 10 Mike Kaganski 2022-08-31 06:31:11 UTC
(In reply to Andreas Heinisch from comment #9)

Rather, as in commit 162f5a20095c6937030d23ee03fb8f72c51eefa1
  tdf#142669 Consider BOM on text encoding detection

  Return a flag if the auto detected text has a BOM.
  Save the flag in SwAsciiOptions so that BOM gets set correctly when
  file is written.
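
The idea in that commit message (detect the BOM on load, remember it as a flag, honor the flag on write) can be sketched in a few lines. This is not the LibreOffice code: `TextOptions`, `load_text` and `save_text` are illustrative names, and the sketch handles UTF-8 only.

```python
import codecs
from dataclasses import dataclass


@dataclass
class TextOptions:
    """Hypothetical stand-in for per-document import/export options."""
    include_bom: bool = False


def load_text(data: bytes):
    """Decode UTF-8 `data` and record whether it started with a BOM."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    if has_bom:
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8"), TextOptions(include_bom=has_bom)


def save_text(text: str, options: TextOptions) -> bytes:
    """Encode `text` as UTF-8, writing a BOM only if the flag says so."""
    payload = text.encode("utf-8")
    return codecs.BOM_UTF8 + payload if options.include_bom else payload
```

With the flag carried along, a file that had a BOM keeps it and a file that had none does not gain one.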
Comment 11 Andreas Heinisch 2023-01-23 10:40:59 UTC
Created attachment 184841: Export Text File dialog

Should we include an option in the Export Text File dialog under "Save as" -> "Edit filter settings", or should LO detect it automatically without the possibility to include the BOM?
Comment 12 Mike Kaganski 2023-01-23 10:50:48 UTC
(In reply to Andreas Heinisch from comment #11)
> Should we include an option in the Export Text File dialog under "Save as"
> -> "Edit filter settings", or should LO detect it automatically without the
> possibility to include the BOM?

An option would be nice; but it's OK to implement it separately, and only have it as an autodetected and command line (filter string) option initially. IIRC, there's some option in the filter already, that has no corresponding UI (yet?).
Comment 13 Commit Notification 2023-02-18 20:06:37 UTC
Andreas Heinisch committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/509ab788baf54285b4e38f2560326657d97510fd

tdf#82254 - Don't remove UTF-8 BOM from CSV when saving file

It will be available in 7.6.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 14 Commit Notification 2023-03-15 10:19:41 UTC
Andreas Heinisch committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/help/commit/4be597da538b8cdb54f1f12fedfd940a1fa9c60e

tdf#82254 - Add UTF-8 BOM (Token 14) to CSV filter parameters
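
As a rough illustration of what "Token 14" means for a comma-separated filter options string, the sketch below pads a string to 14 tokens and sets the last one. Only the token position comes from the commit title above; the `true`/`false` spelling of the value, the meaning of the example tokens (44 for comma, 34 for double quote, 76 for UTF-8, 1 for the first line), and whether empty intermediate tokens are accepted are assumptions to check against the help page. `with_bom_token` is a made-up helper.

```python
BOM_TOKEN_POSITION = 14  # 1-based position, per the commit title above


def with_bom_token(filter_options: str, include_bom: bool) -> str:
    """Return `filter_options` with the BOM token (token 14) set.

    Missing tokens before position 14 are left empty. The "true"/"false"
    value format is an assumption, not confirmed by this report.
    """
    tokens = filter_options.split(",")
    # Pad with empty tokens so that index 13 (token 14) exists.
    while len(tokens) < BOM_TOKEN_POSITION:
        tokens.append("")
    tokens[BOM_TOKEN_POSITION - 1] = "true" if include_bom else "false"
    return ",".join(tokens)


# Example: a short options string extended with the BOM token.
options = with_bom_token("44,34,76,1", include_bom=True)
```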