Bug 44291 - Functionality request: option for removing BOM from beginning of saved text files
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 3.5.0 Beta2
Hardware: x86 (IA32) Linux (All)
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard: target:6.2.0
Keywords:
Depends on:
Blocks: Save-Text
Reported: 2011-12-29 13:29 UTC by Bruce Fowler
Modified: 2018-11-13 19:51 UTC

See Also:
Crash report or crash signature:


Attachments
Video demo of the patch. (882.95 KB, image/gif)
2018-02-26 18:44 UTC, Martin van Zijl

Description Bruce Fowler 2011-12-29 13:29:58 UTC
Extra hex bytes are being inserted into text files saved
from LibreOffice database queries.  To show this, do the following:

1) Open up a simple database and run a query
2) Open a new text (.odt) document
3) Drag the query by the upper-left corner onto the text document
[ A window titled "Insert Database Columns" will open ]
4) Choose "Insert data as: text" on the top line
5) pick a database column or two, and then click OK
[ The data will be inserted into the text document ]
6) Save the document as ".txt", i.e., plain ascii text
7) View the document with the linux "less" command (or with
   any program that will show the hex-byte content of the file)
8) Note that preceding any of the ASCII data from the database are
   three extra bytes, "0xefbbbf", which "less" shows as "U+FEFF"
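
The check in step 8 can also be done programmatically. The following is a minimal Python sketch, not part of the original report; the helper name has_utf8_bom and the sample data are hypothetical, and in practice the bytes would come from reading the saved ".txt" file in binary mode.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 byte-order mark."""
    # codecs.BOM_UTF8 is the three-byte sequence b"\xef\xbb\xbf"
    return data.startswith(codecs.BOM_UTF8)


# Simulated file contents (hypothetical sample data); for a real file use
# open(path, "rb").read() instead.
with_bom = codecs.BOM_UTF8 + b"John Smith\n"
without_bom = b"John Smith\n"

print(has_utf8_bom(with_bom))     # True
print(has_utf8_bom(without_bom))  # False
```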

These three extra bytes cause me grief when I use this general
scheme to create address labels.  I didn't ask for them and they
don't belong at the beginning of the output file.  It works this
way on all versions of LObase, up through 3.5.

Thanks for listening...
Comment 1 Bruce Fowler 2012-01-21 18:03:37 UTC
Further experimentation revealed that this problem is not related to "base" but shows up simply by saving a "writer" file as "plain text".  So I am changing the component from base to writer.  To show it, one need only start with a short ".odt" file and follow steps 6-8 in the original bug report.
Comment 2 sasha.libreoffice 2012-03-14 08:00:07 UTC
Thanks for the bug report.
An explanation of these 3 bytes is here:
http://en.wikipedia.org/wiki/Byte_order_mark

Please tell us: which program has a problem with it?
Comment 3 Bruce Fowler 2012-03-29 19:29:24 UTC
Thanks for the reference.  I have read the Wikipedia article.  It appears to relate entirely to Unicode encoding.  In relation to UTF-8 it says, "The Unicode Standard does permit the BOM in UTF-8, but does not require or recommend its use."  It further states, "the need for a BOM arises in the context of text interchange, rather than in normal text processing within a closed environment"

In any case, I don't want my data saved in UTF-8 for this particular application, but rather in plain ASCII.  I tried setting the Tools/Options/Load save->HTML compatibility/Character set to Western Europe (ASCII/US), but the BOM is still there.  I can appreciate the utility of the BOM for information interchange, but not for local work with Postscript programs and shell scripts.  Perhaps the appropriate fix is to have an option in "load/save" that says, "I really want plain ASCII."

I wish I were knowledgeable enough to send you a patch, but the LibreOffice code is a bit formidable!  Thanks for your interest and help.
Comment 4 sasha.libreoffice 2012-03-29 23:05:14 UTC
> Perhaps the appropriate fix is to have an option in "load/save" that says, "I
> really want plain ASCII."
I agree with this. But currently we have very few developers, so this may take several years. Sorry about the situation.

> but not for local work with Postscript programs and shell scripts.
But it may be faster to add removal of this BOM to your script, and to ask the authors of the Postscript programs to fix their programs.
Comment 5 leighman 2012-09-10 20:10:15 UTC
It's easy enough to stop the BOM being written but I presume we want to preserve it in existing documents.
Comment 6 Alex Thurgood 2015-01-03 17:39:34 UTC Comment hidden (no-value)
Comment 7 Bruce Fowler 2015-01-03 21:59:17 UTC
Glad to see that this bug is still alive.  I fixed my immediate problem with a simple "tr" command in my shell script, but I am still not happy with extraneous stuff being inserted in my text data.  The easy fix would seem to be to have "Save Text as UTF-8" and "Save Text as ASCII" options available as a preference I can set.  Thanks for your continued interest.
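
The exact "tr" command used above is not shown in this report. As an alternative illustration only, the same kind of post-processing could be sketched in Python; the helper name strip_utf8_bom is hypothetical, and the sketch removes the three bytes only when they appear at the very start of the data.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 byte-order mark, if present."""
    if data.startswith(codecs.BOM_UTF8):
        # Drop exactly the three BOM bytes; the rest is left untouched
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical sample data standing in for a saved ".txt" file
saved = codecs.BOM_UTF8 + b"Jane Doe\n12 Main St\n"
print(strip_utf8_bom(saved))  # b'Jane Doe\n12 Main St\n'
```

To apply this to a real file, one would read it with open(path, "rb"), pass the bytes through the function, and write the result back in binary mode.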
Comment 8 Martin van Zijl 2018-02-26 18:44:51 UTC
Created attachment 140162 [details]
Video demo of the patch.
Comment 9 Martin van Zijl 2018-02-26 18:48:14 UTC
I created a patch for review. With this patch, you can do the following:

1) File --> Save As...
2) Choose Type = "Text (Choose Encoding)"
3) Click "Use Text - ..." 
4) The final dialog will have a checkbox "Include byte-order-mark". If you uncheck it, the BOM will not be included in the output.

Video demo attached.

Review link:
https://gerrit.libreoffice.org/#/c/50388/
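
The effect of such a checkbox on the saved bytes can be illustrated outside LibreOffice. The sketch below is not the patch's code; it uses Python's standard "utf-8-sig" and "utf-8" codecs as an analogy, and the function name encode_text and its include_bom parameter are hypothetical.

```python
def encode_text(text: str, include_bom: bool) -> bytes:
    """Encode text as UTF-8, optionally prefixed with a byte-order mark."""
    # "utf-8-sig" prepends b"\xef\xbb\xbf" when encoding; plain "utf-8" does not
    return text.encode("utf-8-sig" if include_bom else "utf-8")


print(encode_text("abc", include_bom=True))   # b'\xef\xbb\xbfabc'
print(encode_text("abc", include_bom=False))  # b'abc'
```

For text containing only ASCII characters, the BOM-free UTF-8 output is byte-for-byte identical to plain ASCII, which is the output requested earlier in this report.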
Comment 10 Mike Kaganski 2018-11-13 13:04:10 UTC
Thanks to Martin van Zijl, this is fixed in 6.2.
Comment 11 Commit Notification 2018-11-13 13:04:24 UTC
Martin van Zijl committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/+/607ab542d043c24bfbd6a08bb62fbebd095114e3%5E%21

Fix tdf#44291. Allow saving text without byte-order mark.

It will be available in 6.2.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.