Bug 60166 - FILEOPEN: BOM not ignored when using text file as database
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Base
(earliest affected) release
Hardware: All Linux (All)
Importance: low normal
Assignee: Not Assigned
Depends on:
Blocks: Database-Import
Reported: 2013-02-01 18:44 UTC by pasqual milvaques
Modified: 2018-11-14 03:09 UTC

See Also:
Crash report or crash signature:

text file to use as data source (in utf-8) (77 bytes, text/plain)
2013-02-01 18:45 UTC, pasqual milvaques
database which feeds from the text file (1.79 KB, application/vnd.oasis.opendocument.database)
2013-02-01 18:46 UTC, pasqual milvaques
odt file with the first column of the database added as database field (9.25 KB, application/vnd.oasis.opendocument.text)
2013-02-01 18:48 UTC, pasqual milvaques
context.xml from inside of the testing file (3.72 KB, application/xml)
2013-02-01 18:51 UTC, pasqual milvaques
testcase files including a procedure to reproduce the problem (230.00 KB, application/x-gzip)
2013-02-21 00:32 UTC, pasqual milvaques

Description pasqual milvaques 2013-02-01 18:44:14 UTC
I use text files as the data source for mail merges. After finding some inconsistencies in the generated documents, I observed that LibreOffice Base does not ignore the BOM (byte order mark) at the beginning of the file when working with UTF-8 text files.
On Windows this is not a problem, because you have to use ANSI encoding if you want special characters to appear correctly, but on Linux you have to use UTF-8 to get the same result, and this is where the problem appears.
When you add the database fields to a Writer document, the BOM in the first column name is carried over into the ODT file. You can later have problems working with this document on other platforms.
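
To illustrate the behaviour described above, here is a minimal Python sketch (not LibreOffice code). It assumes a hypothetical two-column CSV whose first header is Column1; the sample bytes and the helper first_header are invented for illustration. Decoding as plain UTF-8 keeps the BOM as the character U+FEFF at the front of the first column name, while Python's utf-8-sig codec drops it.

```python
import csv
import io

# Hypothetical file content: a UTF-8 BOM followed by a two-column CSV.
raw = b"\xef\xbb\xbfColumn1,Column2\nvalue1,value2\n"


def first_header(data: bytes, encoding: str) -> str:
    """Decode the bytes and return the first column name of the header row."""
    reader = csv.reader(io.StringIO(data.decode(encoding), newline=""))
    return next(reader)[0]


naive = first_header(raw, "utf-8")        # BOM survives as U+FEFF
cleaned = first_header(raw, "utf-8-sig")  # BOM is dropped while decoding

print(repr(naive))
print(repr(cleaned))
```

A consumer that keeps U+FEFF sees a first field name that differs from the plain "Column1", which is the kind of mismatch that can surface later in documents built from the data source.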

I am attaching some files that help to show the problem.
Comment 1 pasqual milvaques 2013-02-01 18:45:41 UTC
Created attachment 74058
text file to use as data source (in utf-8)
Comment 2 pasqual milvaques 2013-02-01 18:46:41 UTC
Created attachment 74059
database which feeds from the text file
Comment 3 pasqual milvaques 2013-02-01 18:48:09 UTC
Created attachment 74061
odt file with the first column of the database added as database field
Comment 4 pasqual milvaques 2013-02-01 18:51:38 UTC
Created attachment 74062
context.xml from inside of the testing file

If you run:
od -c testing_file_content.xml

you can see the raw content of the file, with the BOM (octal bytes 357 273 277) sitting in front of the database field name:
0007100   x   t   :   c   o   l   u   m   n   -   n   a   m   e   =   "
0007120 357 273 277   C   o   l   u   m   n   1   "       t   e   x   t
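
The octal values in the dump above are the UTF-8 BOM, EF BB BF in hexadecimal. As a rough cross-check, the following Python sketch converts the octal values and searches a byte string for embedded BOMs. The sample bytes and the helper find_embedded_boms are hypothetical and only mimic the fragment shown by od.

```python
from typing import List

UTF8_BOM = b"\xef\xbb\xbf"

# The three octal values printed by od, converted to bytes.
od_octal = ["357", "273", "277"]
from_od = bytes(int(value, 8) for value in od_octal)


def find_embedded_boms(data: bytes) -> List[int]:
    """Return the byte offset of every UTF-8 BOM sequence found in data."""
    offsets = []
    start = data.find(UTF8_BOM)
    while start != -1:
        offsets.append(start)
        start = data.find(UTF8_BOM, start + 1)
    return offsets


# Hypothetical fragment modelled on the od output, not the real attachment.
prefix = b'text:column-name="'
sample = prefix + UTF8_BOM + b'Column1" text'

print(from_od == UTF8_BOM)
print(find_embedded_boms(sample))
```

Any offset other than 0 means the BOM is no longer a file signature and has become part of the content, as in the field name above.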
Comment 5 Joel Madero 2013-02-20 03:47:28 UTC
Rainer - do you have any ideas about this one?
Comment 6 Rainer Bielefeld Retired 2013-02-20 07:50:41 UTC
Only reading tells me nothing, and I have no time for puzzling around.

Can you see what the problem might be and how to make it reproducible for non-database-mailmerge-experts?

@pasqual milvaques:
Please attach such test kits zipped as a single attachment!
Thank you for your report; unfortunately, important information is missing.
Maybe the hints at <http://wiki.documentfoundation.org/BugReport> will help you find out what information is useful for reproducing your problem.
Please add all of the information requested below:
- Write a meaningful Summary describing exactly what the problem is
- Explain what a "BOM" is
- Contribute a document-related step-by-step instruction containing every
  key press and every mouse click needed to reproduce your problem
  (similar to the example in Bug 43431). Nobody here has time to puzzle out how to reproduce the problem with your documents.
- If possible, contribute an instruction on how to create a sample document
  from scratch
- Add information
  -- what EXACTLY is unexpected (in the step-by-step instruction)
  -- and WHY you believe it is unexpected (cite Help or Documentation!)
  -- concerning your PC (video card, ...)
  -- concerning your Operating System (Version, Distribution, Language)
  -- concerning your LibO version (with Build ID if it's not a public release)
     and localization (UI language, Locale setting)
  -- LibO settings that might be related to your problem
  -- how you launch LibO and how you opened the sample document
  -- whether your problem persists when you rename your user profile
     before you launch LibO (please see
  -- whether that worked in earlier versions
  -- everything else that crosses your mind after you read the linked texts
Comment 7 Robert Großkopf 2013-02-20 20:12:01 UTC

This isn't specifically a database problem, but a problem with reading text files in Base. I had a look at BOM on Wikipedia. It is a mark at the beginning of a text file in UTF-16 and UTF-32 encoding; it seems to be optional in UTF-8. And this is what I read at http://en.wikipedia.org/wiki/Byte_order_mark :
"Java does not support UTF-8 with BOM and does not intend to implement it in future releases."
That could be the problem described in this bug.
Comment 8 Rainer Bielefeld Retired 2013-02-20 20:34:07 UTC
Thank you for the research, sounds plausible. As soon as the reporter contributes step-by-step instructions I will do further tests.
Comment 9 pasqual milvaques 2013-02-21 00:32:16 UTC
Created attachment 75215
testcase files including a procedure to reproduce the problem

The tar.gz contains all the files needed to reproduce the problem, plus a text document called testcase_procedure.odt in which I try to detail the steps to reproduce it.
I hope it answers all the questions, but if you need more detail, please tell me.

Comment 10 Joel Madero 2013-04-17 15:40:38 UTC
I have been able to confirm the issue on:
Platform: Bodhi Linux 2.2 x64
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
As I've been able to confirm this problem on an earlier release, I am changing the version number: Version is the earliest version in which we can confirm the bug, and we use comments to note that the bug exists in newer versions as well.

Marking as:

New (confirmed)
Normal - can prevent high quality work under certain circumstances
Low - unfortunately this one probably isn't affecting many users at all. Furthermore, there may be wacky workarounds such as bringing it into a spreadsheet, saving it as an ODS, then bringing it into Base (I know, not ideal, just a possible workaround for the time being).

Thanks so much for the clear and concise instructions.

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
LibreOffice is powered by a team of volunteers, every bug is confirmed (triaged) by human beings who mostly give their time for free. We invite you to join our triaging by checking out this link:

There are also other ways to get involved including with marketing, UX, documentation, and of course developing -  http://www.libreoffice.org/get-help/mailing-lists/. 

Lastly, good bug reports help tremendously in making the process go more smoothly. Please always provide reproducible steps (even if it seems easy) and attach any and all relevant material.
Comment 11 Joel Madero 2013-04-17 15:42:43 UTC
*** Bug 60336 has been marked as a duplicate of this bug. ***
Comment 12 Regina Henschel 2013-04-17 19:40:43 UTC
I'm not sure whether this is a bug. Have you set the character set to "UTF-8" in Edit > Database > Properties? I can use the database with the text file for a mail merge without problem on Windows7. The field name has no strange characters.
Comment 13 pasqual milvaques 2013-04-24 04:08:58 UTC
Regina, changing the encoding of the database on Windows in the way you describe makes the UTF-8 characters legible, but the BOM is still not ignored, just as on Linux. In the end that can create problems in some special situations.

Checked on Windows 8 with LibreOffice 4.0.2.

The problem reported in bug 60336 can't be addressed this way either: the XML file declares that it is in UTF-8, but the BOM is not ignored, which causes erroneous behavior.

I have observed that the option to choose the encoding is not present in the database wizard at creation time. It would be a nice improvement to have this option in the wizard, as it is a bit hard to find (I didn't notice that it existed) and requires reopening the database to take effect.

Comment 14 Julien Nabet 2014-11-30 22:18:09 UTC
On pc Debian x86-64 with master sources updated today, I could reproduce this.

I noticed these 2 commits about BOM:
- http://cgit.freedesktop.org/libreoffice/core/commit/?id=f38277dc0337df15f3ea689096a2c18a03354a61
- http://cgit.freedesktop.org/libreoffice/core/commit/?id=5eb408a3bb8df204452f0b931a254dad5f0cf35b

Then, we need a code pointer to know where to start for this case.
Comment 15 Julien Nabet 2014-11-30 22:46:24 UTC
Ok found a start to dig, see:
Comment 16 Julien Nabet 2014-12-01 21:43:34 UTC
Unwinding on gdb, I found different places to put a fix:
- http://opengrok.libreoffice.org/xref/core/sw/source/ui/fldui/flddb.cxx#187
- http://opengrok.libreoffice.org/xref/core/sw/source/uibase/dbui/dbtree.cxx#420
Digging further leads to the svtools module:
- http://opengrok.libreoffice.org/xref/core/svtools/source/contnr/treelist.cxx#464
but I wonder whether this last one is really a good option.

About the fix itself:
I thought about using or copying (?) the lcl_RemoveUTF8ByteOrderMarker method from http://opengrok.libreoffice.org/xref/core/l10ntools/source/lngmerge.cxx#lcl_RemoveUTF8ByteOrderMarker

Finally, even if we fix this one, should we also consider the other encodings, UTF-16 and UTF-32 (including the little-endian and big-endian variants)? See http://en.wikipedia.org/wiki/Byte_order_mark
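
As a rough sketch of what covering all of these encodings could involve, here is a small Python example. It is not the LibreOffice C++ code and makes no claim about how lcl_RemoveUTF8ByteOrderMarker works. The helpers strip_bom and decode_without_bom are hypothetical names, and the BOM constants come from Python's standard codecs module.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so checking UTF-16 first would misdetect UTF-32 LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def strip_bom(data: bytes, default: str = "utf-8"):
    """Return (payload without BOM, encoding implied by the BOM or default)."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return data[len(bom):], encoding
    return data, default


def decode_without_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode data, dropping a leading BOM if one is present."""
    payload, encoding = strip_bom(data, default)
    return payload.decode(encoding)


# Hypothetical header bytes in two of the encodings discussed above.
utf8_header = codecs.BOM_UTF8 + "Column1".encode("utf-8")
utf16_header = codecs.BOM_UTF16_LE + "Column1".encode("utf-16-le")
print(decode_without_bom(utf8_header))
print(decode_without_bom(utf16_header))
```

The ordering of the table is the main design point: a UTF-16 LE file whose first character is U+0000 would still look like UTF-32 LE, so a real fix would probably also want to respect the character set chosen in the database properties.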

Lionel: any thoughts?
Comment 17 Alex Thurgood 2015-01-03 17:39:18 UTC Comment hidden (no-value)
Comment 18 QA Administrators 2016-01-17 20:04:33 UTC Comment hidden (obsolete)
Comment 19 pasqual milvaques 2016-01-28 16:40:57 UTC
I have verified that the problem is still present in LibreOffice, tested on Windows 10 (32-bit). The behaviour of the bug has not changed.
Comment 20 QA Administrators 2017-03-06 14:26:26 UTC Comment hidden (obsolete)
Comment 21 Julien Nabet 2018-11-13 19:51:10 UTC
2 things which might help:
- tdf#63673 which has been fixed in 5.4.0
- tdf#44291 which has just been fixed in master sources.

Any update here?