Bug 120067 - Headless conversion txt→odt→txt inserts unnecessary characters (Linux-only)
Status: RESOLVED NOTABUG
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version (earliest affected): 3.6.7.2 release
Hardware: All Linux (All)
Importance: medium minor
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-09-22 18:56 UTC by Zetok
Modified: 2021-10-16 12:33 UTC

See Also:
Crash report or crash signature:


Description Zetok 2018-09-22 18:56:39 UTC
Description:
When converting txt→odt→txt, LO inserts unnecessary characters at the beginning of the resulting text.

It shouldn't insert any characters by itself, and instead it should preserve text bytes just as they are.

Steps to Reproduce:
1.

```
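echo 'foobar' > text.txt                 # 7-byte plain-ASCII input
libreoffice --convert-to odt text.txt    # txt → odt
mv text.txt{,.orig}                      # keep the original for comparison
libreoffice --convert-to txt text.odt    # odt → txt, writes a new text.txt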
```

Actual Results:
$ sha256sum text.txt{,.orig}
a2969248c7632f26c29e6d46a177f1f63334202d1d6170928e2e737d2b0f7c29  text.txt
aec070645fe53ee3b3763059376134f058cc337247c978add178b6ccdfb0019f  text.txt.orig

$ hexdump -C text.txt.orig
00000000  66 6f 6f 62 61 72 0a                              |foobar.|
00000007

$ hexdump -C text.txt
00000000  ef bb bf 66 6f 6f 62 61  72 0a                    |...foobar.|
0000000a

Expected Results:
The SHA-256 hashes of the original file and the conversion output should be the same.


Reproducible: Always


User Profile Reset: No


OpenGL enabled: Yes

Additional Info:
Version: 6.1.0.3
Build ID: 10(Build:3)
CPU threads: 8; OS: Linux 4.18; UI render: default; VCL: gtk2; 
Locale: en-GB (en_GB.UTF-8); Calc: group threaded
Comment 1 Buovjaga 2018-10-14 15:24:32 UTC
Repro.

I tried regression testing. With 3.3.0 the commands I found (old syntax) did not work, but 3.6.7 worked and gave the same bad result.

libo36 --headless --convert-to odt:"OpenDocument Text Flat XML" text.txt
libo36 --headless --convert-to txt:text text.odt

Arch Linux 64-bit
Version: 6.2.0.0.alpha0+
Build ID: 00e10ae3189a4407ffb1a48f836cd52dc9a1b6df
CPU threads: 8; OS: Linux 4.18; UI render: default; VCL: gtk3_kde5; 
Locale: fi-FI (fi_FI.UTF-8); Calc: threaded
Built on 13 October 2018
Comment 4 Michael Warner 2021-10-15 10:05:52 UTC
No repro in:
Version: 7.2.0.4 (x64) / LibreOffice Community
Build ID: 9a9c6381e3f7a62afc1329bd359cc48accb6435b
CPU threads: 6; OS: Windows 6.1 Service Pack 1 Build 7601; UI render: Skia/Raster; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: threaded

Maybe it's Linux-only, or maybe it has been fixed.
Comment 5 Buovjaga 2021-10-15 12:42:11 UTC
Yep, I can still repro with 7.2.1 on Linux.
Comment 6 Michael Warner 2021-10-16 12:31:28 UTC
The three bytes that are inserted at the beginning of the converted file:

   ef bb bf

are the Unicode byte order mark (BOM) for UTF-8.
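
A quick way to check whether a given file starts with these three bytes is sketched below. This is only an illustration: `has_utf8_bom` is a hypothetical helper name, and the two files are recreated with `printf` in a temporary directory rather than produced by LibreOffice.

```shell
# Hypothetical helper: succeeds if the file starts with the UTF-8 BOM (ef bb bf).
has_utf8_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# Recreate the two files from the report without running LibreOffice.
workdir=$(mktemp -d)
printf 'foobar\n' > "$workdir/text.txt.orig"        # original input
printf '\357\273\277foobar\n' > "$workdir/text.txt" # same text with a leading BOM

for f in "$workdir/text.txt" "$workdir/text.txt.orig"; do
    if has_utf8_bom "$f"; then
        echo "$(basename "$f"): BOM present"
    else
        echo "$(basename "$f"): no BOM"
    fi
done
```

The helper compares only the first three bytes, so it also works on empty or very short files.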

You can explicitly specify whether you want it there or not by providing options to the text output filter. For example:

   libreoffice --convert-to "txt:Text (encoded):UTF8"  text.odt

will include the BOM, and:

   libreoffice --convert-to "txt:Text (encoded):ASCII" text.odt

will not include the BOM.

When I follow the STR but explicitly specify ASCII encoding, I get exactly the same bytes as in the original text file, with no BOM.
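
That check can be scripted. The following is only a sketch under stated assumptions: `roundtrip_ascii` is a hypothetical helper name, `libreoffice` is assumed to be on `PATH`, the input is assumed to have a `.txt` extension, and the filter string is the one quoted above. It uses `cmp` instead of comparing hashes.

```shell
# Hypothetical helper: convert txt → odt → txt with the explicit ASCII filter
# and compare the result byte for byte with the input.
# Returns 0 if the files are identical, 2 if the input file is missing,
# and another non-zero status if a conversion fails or the bytes differ.
roundtrip_ascii() {
    src=$1
    [ -f "$src" ] || return 2
    dir=$(mktemp -d) || return 2          # keep outputs away from the input
    base=$(basename "$src" .txt)
    libreoffice --headless --convert-to odt --outdir "$dir" "$src" >/dev/null &&
    libreoffice --headless --convert-to "txt:Text (encoded):ASCII" \
        --outdir "$dir" "$dir/$base.odt" >/dev/null &&
    cmp -s "$src" "$dir/$base.txt"
}
```

Usage, assuming `text.txt` exists in the current directory: `roundtrip_ascii text.txt && echo identical || echo different`.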

It is a bit odd that it defaulted one way for Linux and another way for Windows, but there is probably some good reason for that, too. So, I think this should be resolved as either NOTABUG or WONTFIX.