Created attachment 104157 [details]
BOM-test

When a file is saved in Calc, the UTF-8 BOM is removed even when it exists in the source file.

Steps to reproduce:
Open BOM-test.csv (this file starts with a UTF-8 BOM)
Choose Save As
The resulting file will be similar, but without the UTF-8 BOM

If the source file starts with a BOM, the exported file should also have a BOM, or at least we should be able to choose what we need in the export filter.
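To check whether a given CSV file actually starts with the UTF-8 BOM (the bytes EF BB BF), before and after saving, something like the following could be used. This is only a minimal sketch for verifying the report; the function name `has_utf8_bom` is made up for illustration and is not part of LibreOffice.

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file at `path` begins with the UTF-8 BOM (EF BB BF)."""
    with open(path, "rb") as f:
        # Only the first three bytes matter; codecs.BOM_UTF8 is b"\xef\xbb\xbf".
        return f.read(3) == codecs.BOM_UTF8
```

Running it on BOM-test.csv and then on the file produced by Save As should show whether the BOM was dropped.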
TESTING on LO 4.3.1.1 + Ubuntu 12.04

(In reply to comment #0)
> When a file is saved in Calc UTF-8 BOM is removed even when it exists on the
> source file.

Confirmed.

> Steps to reproduce:
> Open the BOM-test.csv (this one starts with UTF-8 BOM)
> Choose Save As
> Resulting file will be similar but without UTF-8 BOM
>
> If source file starts with BOM, exported one should also have BOM, or at
> least we should be able to choose on the export filter what we need.

+1

I can understand that in the import process, some aspects of a file (e.g. the BOM) might be stripped away and not re-included in the export process, although ideally there would be consistency, especially if someone wants to use LibreOffice to edit shared CSV files.

Sounds like a reasonable enhancement request.

Status -> NEW
For us, the problem is exactly this: a CSV is generated internally by one application, audited (and corrected if needed) with LO, and then uploaded to a 3rd party server. This 3rd party server determines the encoding from the BOM, so it needs the BOM to be there. Since LO strips it starting from version 4.3, this is a blocker for us, and we are using LO 4.2 for this task.

I agree with you that this is not a blocker for the majority of people, but as this was introduced in LO 4.3, I wonder whether it should be classified as a regression.
Ouch. Not sure if this relates to shared code, but when saving a utf-8 text file from writer (4.3.1.2), it will be saved *with* a BOM, even if the source file did not have one. And my application is not suited to handle the BOM and rejects the resulting file.
As it sounds like this affects all platforms:
OS/Arch -> All

(In reply to Dag Bakke from comment #3)
> Not sure if this relates to shared code, but when saving a utf-8 text file
> from writer (4.3.1.2), it will be saved *with* a BOM, even if the source
> file did not have one.

Yep, sounds like a similar problem. I've no idea how much is shared in the filesave process, though, so I'll check w/the devs.
(In reply to Dag Bakke from comment #3)
> Not sure if this relates to shared code, but when saving a utf-8 text file
> from writer (4.3.1.2), it will be saved *with* a BOM, even if the source
> file did not have one.

Dag: some information:

The Writer codepath is separate from the Calc one, so please file a separate bug for that issue.

Per Wikipedia (https://en.wikipedia.org/wiki/Byte_order_mark):
"The Unicode Standard permits the BOM in UTF-8,[2] but does not require or recommend its use.... The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work."

In this case, your file didn't originally have a BOM, so I definitely suggest that we don't *add* one.
+1

To me this is a major bug, since I use iMacros on a daily basis and always have to re-encode my CSV files with Notepad++.
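Until the export keeps the BOM, the manual re-encoding step could in principle be scripted. The following is a rough sketch, assuming the exported file is already valid UTF-8 and only lacks the BOM; the helper names `ensure_utf8_bom` and `add_bom_to_file` are hypothetical.

```python
import codecs


def ensure_utf8_bom(data: bytes) -> bytes:
    """Return `data` with a UTF-8 BOM prepended, unless it already has one."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


def add_bom_to_file(path):
    """Rewrite the file at `path` in place so that it starts with a UTF-8 BOM."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(ensure_utf8_bom(data))
```

The check in `ensure_utf8_bom` makes the operation safe to run twice on the same file, since a second BOM is never added.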
*** Bug 115056 has been marked as a duplicate of this bug. ***
The state of how applications handle UTF-8 is a mess: some look for the BOM, some treat it as noise and ignore it, some require it for reading files but omit it on writing them, some treat it as an error and fail, and many don't even know that UTF-8 exists.

I have no simple global solution to propose for this, but I do think LO itself should be consistent, and that the rules should be in the documentation and easy to find.

It's not exactly obvious to have to click "Save filter settings"; and the process is clumsy for someone like me who has to do it almost every time.

I'd like to see an option -- global or per LO component -- to set a writing mode for plain text formats, if necessary even per nominal file type (.txt, .dif, .slk, .csv):

Coding (UTF-8, UTF-16, Windows blahblah, Mac blahblah, Linux blahblah)
If UTF, then with or without BOM

And let me set my own defaults.
Could this be solved as in https://bugs.documentfoundation.org/show_bug.cgi?id=75263? That is, check for the charset and the BOM on import, and include the BOM again in the export process?
(In reply to Andreas Heinisch from comment #9)

Rather, as in commit 162f5a20095c6937030d23ee03fb8f72c51eefa1:

tdf#142669 Consider BOM on text encoding detection

Return a flag if the auto detected text has a BOM. Save the flag in SwAsciiOptions so that BOM gets set correctly when file is written.
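The idea described in that commit message (remember at detection time whether the text had a BOM, and use that flag when writing) can be illustrated outside LibreOffice with a small Python sketch. This is not the actual C++ implementation; `decode_utf8` and `encode_utf8` are made-up names, and only UTF-8 is handled here.

```python
import codecs


def decode_utf8(data: bytes):
    """Decode UTF-8 bytes and report whether they started with a BOM."""
    had_bom = data.startswith(codecs.BOM_UTF8)
    # "utf-8-sig" drops a leading BOM if present and otherwise decodes as plain UTF-8.
    text = data.decode("utf-8-sig")
    return text, had_bom


def encode_utf8(text: str, with_bom: bool) -> bytes:
    """Encode text as UTF-8, writing a BOM only if the flag says so."""
    payload = text.encode("utf-8")
    return codecs.BOM_UTF8 + payload if with_bom else payload
```

Passing the `had_bom` value from `decode_utf8` back into `encode_utf8` gives a byte-for-byte round trip in both cases: a file that had a BOM keeps it, and a file without one does not gain one.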
Created attachment 184841 [details]
Export Text File dialog

Should we include an option in the Export Text File dialog under "Save as" -> "Edit filter settings", or should LO detect it automatically without the possibility to include the BOM?
(In reply to Andreas Heinisch from comment #11)
> Should we include an option in the Export Text File dialog under "Save as"
> -> "Edit filter settings", or should LO detect it automatically without the
> possibility to include the BOM?

An option would be nice; but it's OK to implement it separately, and only have it as an autodetected and command line (filter string) option initially. IIRC, there's some option in the filter already, that has no corresponding UI (yet?).
Andreas Heinisch committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/core/commit/509ab788baf54285b4e38f2560326657d97510fd tdf#82254 - Don't remove UTF-8 BOM from CSV when saving file It will be available in 7.6.0. The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
Andreas Heinisch committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/help/commit/4be597da538b8cdb54f1f12fedfd940a1fa9c60e tdf#82254 - Add UTF-8 BOM (Token 14) to CSV filter parameters
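For scripted exports, the new token would presumably be set as the 14th comma-separated entry of the CSV filter option string. The sketch below only builds such a string and rests on several assumptions that should be checked against the updated help page: that the first three tokens are field separator, text delimiter and character set (44, 34 and 76 being the commonly used values for comma, double quote and UTF-8), that tokens left empty fall back to their defaults, and that token 14 takes `true` or `false`. The function name `csv_filter_options` is hypothetical.

```python
def csv_filter_options(include_bom, separator=44, quote=34, charset=76):
    """Build a CSV filter option string with the BOM flag as token 14.

    Assumptions (not confirmed in this report): tokens 1-3 are separator,
    text delimiter and character set; tokens 4-13 may be left empty to
    keep their defaults; token 14 accepts "true" or "false".
    """
    tokens = [str(separator), str(quote), str(charset)]
    tokens += [""] * 10  # tokens 4 to 13, left empty on purpose
    tokens.append("true" if include_bom else "false")  # token 14: UTF-8 BOM
    return ",".join(tokens)
```

The resulting string would be handed to the CSV export filter as its filter options; how empty tokens are treated in practice is worth verifying with a daily build before relying on this.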