Bug 36313 - CLI: Encoding issue when Converting documents: esp. UTF-8 in headless mode
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version (earliest affected): 3.3.2 release
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard: target:4.4.0 target:4.3.2
Keywords:
Duplicates: 61960 89452
Depends on:
Blocks:
 
Reported: 2011-04-16 16:05 UTC by V. A. Zoukos
Modified: 2018-09-08 18:21 UTC

Description V. A. Zoukos 2011-04-16 16:05:42 UTC
Users can convert documents in headless mode. For example, you can
convert a CSV file to an ODS file, using the command line.
The problem is that LibreOffice assumes, by default, that the initial
encoding is ISO-8859-1, and there is no option yet to change this.
Documents with other encodings have the text corrupted.

HOW TO REPLICATE:
a. Create the following mytest.csv file:

$ cat /tmp/mytest.csv
"First","Second"
"áéŕó","ṫřåiṅ"
$ _

b. Then convert with the following command line:

$ libreoffice -headless -convert-to ods mytest.csv
convert /tmp/mytest.csv -> /tmp/mytest.ods using OpenDocument Spreadsheet Flat XML
Warning: at xsl:stylesheet on line 2 of file:///usr/lib/libreoffice/basis3.3/share/xslt/odfflatxml/odfflatxmlexport.xsl:
Running an XSLT 1.0 stylesheet with an XSLT 2.0 processor
$ _

c. Finally inspect the generated mytest.ods:

<table:table-cell office:value-type="string">
<text:p>áéŕó</text:p>
</table:table-cell>
<table:table-cell office:value-type="string">
<text:p>ṫřåiṅ</text:p>

The text shows that a conversion from ISO-8859-1 to UTF-8 was forced,
which corrupts the text.
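
The effect can be reproduced outside LibreOffice. The following is a minimal Python sketch, assuming only that the file's bytes are UTF-8 and that they are decoded as ISO-8859-1 (the default described above). The helper name `misdecode` is hypothetical and is not part of LibreOffice.

```python
def misdecode(text: str, assumed: str = "iso-8859-1") -> str:
    """Encode text as UTF-8, then decode those bytes with the wrong charset."""
    return text.encode("utf-8").decode(assumed)


sample = "\u00e1\u00e9"        # the characters á and é
garbled = misdecode(sample)
print(repr(garbled))           # each 2-byte UTF-8 sequence becomes two characters
print(len(sample), len(garbled))
```

Every accented character in the sample occupies two bytes in UTF-8, so the misread string is twice as long as the original.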

WHAT SHOULD HAPPEN:
There should be an option to specify the initial encoding
so that no forced conversion takes place.

RELEVANT LINKS:
http://listarchives.libreoffice.org/www/users/msg03444.html
Comment 1 Caolán McNamara 2011-04-18 06:52:45 UTC
This is possible with a direct UNO call with some extra filter options, IIRC, though presumably it is not available from the command line.
Comment 2 Kohei Yoshida 2011-04-18 11:48:38 UTC
It would probably be more generally useful if we added a command line option to pass extra filter options, rather than explicitly adding an encoding option which would only benefit a few filters.

Anyway, this is a feature request.
Comment 3 Brandon Simmons 2011-07-11 06:30:59 UTC
Can anyone comment if this is the same bug as in 38311?

https://bugs.freedesktop.org/show_bug.cgi?id=38311
Comment 4 Simos Xenitellis 2011-07-11 07:17:30 UTC
(In reply to comment #3)
> Can anyone comment if this is the same bug as in 38311?
> 
> https://bugs.freedesktop.org/show_bug.cgi?id=38311

It's not the same bug. 

The bug here has to do with the encoding being messed up when converting (at least) from csv to ods. 

The problem you are facing has to do with font selection; in the problematic PDF output, you can easily see that different fonts are chosen. You can check with your PDF viewer (see in the Properties) that in the problematic PDF, different fonts are being selected.

However, both bugs relate to how to pass parameters when performing conversions in headless mode, so they have something in common.
Comment 5 Björn Michaelsen 2011-12-23 12:07:20 UTC
[This is an automated message.]
This bug was filed before the changes to Bugzilla on 2011-10-16. Thus it
started right out as NEW without ever being explicitly confirmed. The bug is
changed to state NEEDINFO for this reason. To move this bug from NEEDINFO back
to NEW please check if the bug still persists with the 3.5.0 beta1 or beta2 prereleases.
Details on how to test the 3.5.0 beta1 can be found at:
http://wiki.documentfoundation.org/QA/BugHunting_Session_3.5.0.-1

more detail on this bulk operation: http://nabble.documentfoundation.org/RFC-Operation-Spamzilla-tp3607474p3607474.html
Comment 6 Florian Reisinger 2012-08-14 14:00:32 UTC
Dear bug submitter!

Because there are a lot of NEEDINFO bugs with no answer within the last six months, we are closing all of these bugs.

To keep this message short, more information is available at https://wiki.documentfoundation.org/QA/NeedinfoClosure#Statement

Thanks for understanding and hopefully updating your bug, so that everything is prepared for developers to fix your problem.

Yours!

Florian
Comment 10 Jehan 2012-08-28 03:29:21 UTC
Hello,

I can tell this is still happening in LibreOffice 3.6.
I have been trying to convert various documents in various languages (so with non-ASCII characters, e.g. Chinese, Japanese, French). Non-ASCII characters often fail to convert.

For xls or xlsx files, I use the command:
$ oowriter --headless --convert-to csv <some file>.xls

For docs:
$ oowriter --headless --convert-to txt:TEXT <some file>.doc

And I get the Japanese or Chinese characters unconverted.
Actually, sometimes it works fine. For instance, if I create a file from scratch in LibreOffice and write some Japanese in it, it converts fine (at least in my basic test). Yet if I take a doc/xls from someone else (most probably created in Microsoft Word/Excel), the Japanese characters all end up as '?' characters.

Note that if I convert the same document to pdf, the Japanese characters are well displayed.
What kind of information do you need to debug this? I would have a hard time sending the documents that fail because we are not allowed to. And I don't have easy access to a Windows machine to try to create a doc there in order to reproduce the issue. But I may still try.

I guess I cannot change the status myself, to reopen the ticket, can I? The statement link above seems to say that a QA member has to do it.
Comment 11 sasha.libreoffice 2012-08-31 09:48:41 UTC
it is RFE, reopening
Comment 12 orange47 2012-12-04 10:55:31 UTC
problem same in 3.6.3
can someone please help?
Comment 13 sasha.libreoffice 2013-06-07 05:22:24 UTC
Thanks for additional testing
Sorry, but "Version" is where the bug first appears, not the current version of LO. If the bug disappears, we just close the bug report.
Changing back to 3.3.2.
Comment 14 Simos Xenitellis 2013-06-07 12:38:43 UTC
(The bug still exists in LibreOffice 4.0.3.3 and probably in master)
Comment 15 Owen Genat (retired) 2013-08-10 10:16:49 UTC
*** Bug 61960 has been marked as a duplicate of this bug. ***
Comment 16 Commit Notification 2014-06-04 10:51:56 UTC
Tomas Hlavaty committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=45ba4d79d968f81f74ef0c4588fd15b1ce91153f

fdo#36313: allow passing FilterOptions via cli



The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 17 Stephan Bergmann 2014-06-04 10:57:55 UTC
With the fix from comment 16, what works for me to resolve the problem from comment 0 is to run

  soffice --headless --convert-to ods --infilter=CSV:44,34,UTF8 mytest.csv

where "44" denotes the field separator character (,) and "34" denotes the quote character (").  The problem is that you cannot leave these obscure values out (e.g., --infilter=CSV:,,UTF8) and the "documentation" for the CSV filter's FilterOptions string format is ScAsciiOptions::ReadFromString (sc/source/ui/dbgui/asciiopt.cxx).

That is, I am not sure whether the fix from comment 16 is already a practical-enough solution?
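
The two numeric values are simply the decimal code points of the separator and quote characters. The following is a small Python sketch of how such an argument could be assembled. The function name `csv_infilter` is hypothetical, and only the first three fields of the option string are covered, as in the command above.

```python
def csv_infilter(separator: str, quote: str, charset: str) -> str:
    """Build an --infilter argument for the CSV filter.

    The separator and quote characters are written as decimal code points
    (44 for ',' and 34 for '"'); the charset field is passed through as-is.
    """
    if len(separator) != 1 or len(quote) != 1:
        raise ValueError("separator and quote must be single characters")
    return f"--infilter=CSV:{ord(separator)},{ord(quote)},{charset}"


print(csv_infilter(",", '"', "UTF8"))
print(csv_infilter(";", '"', "UTF8"))
```

The first call reproduces the argument used in the command above; the second shows the same construction for a semicolon-separated file.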
Comment 18 Commit Notification 2014-09-10 10:18:33 UTC
Tomas Hlavaty committed a patch related to this issue.
It has been pushed to "libreoffice-4-3":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=4b8a0159ca80dad05ddcaad5897b786484fc8afb&h=libreoffice-4-3

fdo#36313: allow passing FilterOptions via cli


It will be available in LibreOffice 4.3.2.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 19 Owen Genat (retired) 2014-09-13 14:30:20 UTC
Tested under GNU/Linux using 4.4.0.0.alpha0+
Build ID: 037d03b9facb414ba6be01fa6ee92fc7ca89f70c
TinderBox: Linux-rpm_deb-x86_64@46-TDF, Branch:master, Time: 2014-09-11_00:32:52

$ cat test.csv 
2014-09-13;123;"abc"
$ /opt/libreofficedev4.4/program/soffice --headless --convert-to ods --infilter=CSV:59,34,UTF8 a.csv 
convert /data/temp/LO_test/a.csv -> /data/temp/LO_test/a.ods using calc8
Overwriting: /data/temp/LO_test/a.ods
$ unzip -p a.ods content.xml | xmllint --format - | grep "<text:p>[^<]*"
            <text:p>2014-09-13</text:p>
            <text:p>123</text:p>
            <text:p>abc</text:p>
$ /opt/libreofficedev4.4/program/soffice --headless --convert-to ods --infilter=CSV:59,,UTF8 a.csv 
convert /data/temp/LO_test/a.csv -> /data/temp/LO_test/a.ods using calc8
Overwriting: /data/temp/LO_test/a.ods
$ unzip -p a.ods content.xml | xmllint --format - | grep "<text:p>[^<]*"
            <text:p>2014-09-13</text:p>
            <text:p>123</text:p>
            <text:p>"abc"</text:p>
$ /opt/libreofficedev4.4/program/soffice --headless --convert-to ods --infilter=CSV:,,UTF8 a.csv 
convert /data/temp/LO_test/a.csv -> /data/temp/LO_test/a.ods using calc8
Overwriting: /data/temp/LO_test/a.ods
$ unzip -p a.ods content.xml | xmllint --format - | grep "<text:p>[^<]*"
            <text:p>2014-09-13;123;"abc"</text:p>

$ cat b.csv 
"First","Second"
"áéŕó","ṫřåiṅ"
$ /opt/libreofficedev4.4/program/soffice --headless --convert-to ods --infilter=CSV:44,34,UTF8 b.csv 
convert /data/temp/LO_test/b.csv -> /data/temp/LO_test/b.ods using calc8
Overwriting: /data/temp/LO_test/b.ods
$ unzip -p b.ods content.xml | xmllint --format - | grep "<text:p>[^<]*"
            <text:p>First</text:p>
            <text:p>Second</text:p>
            <text:p>áéŕó</text:p>
            <text:p>ṫřåiṅ</text:p>
$ /opt/libreofficedev4.4/program/soffice --headless --convert-to ods --infilter=CSV:44,,UTF8 b.csv 
convert /data/temp/LO_test/b.csv -> /data/temp/LO_test/b.ods using calc8
Overwriting: /data/temp/LO_test/b.ods
$ unzip -p b.ods content.xml | xmllint --format - | grep "<text:p>[^<]*"
            <text:p>"First"</text:p>
            <text:p>"Second"</text:p>
            <text:p>"áéŕó"</text:p>
            <text:p>"ṫřåiṅ"</text:p>
$ /opt/libreofficedev4.4/program/soffice --headless --convert-to ods --infilter=CSV:,,UTF8 b.csv 
convert /data/temp/LO_test/b.csv -> /data/temp/LO_test/b.ods using calc8
Overwriting: /data/temp/LO_test/b.ods
$ unzip -p b.ods content.xml | xmllint --format - | grep "<text:p>[^<]*"
            <text:p>"First","Second"</text:p>
            <text:p>"áéŕó","ṫřåiṅ"</text:p>

Are only decimal values available for character specification or can hex values be used also? I could not find a hex notation that worked. In any case, well done Tomas and thank you.
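
The `unzip | xmllint | grep` check above can also be done without external tools. The following is a minimal Python sketch, assuming the .ods file is a regular zipped OpenDocument package containing `content.xml`. The function name `paragraphs` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the text:p elements in OpenDocument content.xml.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def paragraphs(ods_path):
    """Return the text of every <text:p> element found in content.xml."""
    with zipfile.ZipFile(ods_path) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    return ["".join(p.itertext()) for p in root.iter(f"{{{TEXT_NS}}}p")]
```

Calling `paragraphs("b.ods")` on a converted file would return the cell texts as a list, which makes it easy to compare them against the CSV input in a script.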
Comment 20 Owen Genat (retired) 2014-10-01 00:13:12 UTC
(In reply to comment #19)
> $ cat test.csv 
> 2014-09-13;123;"abc"

Sorry, obviously test.csv should read a.csv.
Comment 21 V. A. Zoukos 2014-10-04 18:29:55 UTC
Dear community members,
    I would like to thank all of you for your work to fix the bug I reported.
    With this problem solved, I am now able to fully automate parts of my work.
                Many thanks again,
                Sincerely yours,

                Vassilios A. Zoukos
                Public Power Corporation Greece, Athens
Comment 22 markling 2014-12-09 11:44:53 UTC
1) User documentation would be very helpful for this patch.

2) I would also like to propose a modification.

--

1) Users are most unlikely to know potential options enough even to discover the operation of this function by trial and error. If the documentation is available somewhere, it has not made it to me. I can see no mention of it in the man page. I am unable to put it to use without help: batch-convert some 60 excel sheets from potentially variable number of unknown windows-encodings to utf-8.

2) This function would be improved greatly if it did as much as possible without being told specifically which encoding/format to input. This is crucial for users. It is possible, after all, to read a file into the OpenOffice GUI without knowing beforehand what the format/encoding is.

If you want to batch-convert a collection of source files from say, a public body's open data scheme, those would have been collated from a variety of sources over a period of years into a single group of files. Their encodings would be differing and most likely unspecified.

This would not matter if the LibreOffice batch convert function used whatever the GUI used to read files without having to be told in advance how they were made.

The batch function might be unable to detect all types/encodings reliably. This would not matter. If only it did the detection as well as the GUI, it could report/pass over/exit on files that could not be detected. But arguably the majority would be handled automatically. And the majority of users then might simply use this tool.
Comment 23 Bill C Riemers 2015-02-18 21:55:25 UTC
*** Bug 89452 has been marked as a duplicate of this bug. ***
Comment 24 Bill C Riemers 2015-02-18 22:09:29 UTC
The following is the most comprehensive webpage I could find that describes the input filters:

https://wiki.openoffice.org/wiki/Documentation/DevGuide/Spreadsheets/Filter_Options

This is for OpenOffice, not LibreOffice. The most obvious deficiency is that it doesn't mention that you can append a character encoding to the end of the options list, as in the example in this thread:

oocalc --infilter='csv:44,34,UTF8'

It also doesn't list many of the formats available in LibreOffice, nor that you can usually just use the file extension, as in csv.

LibreOffice has some very powerful command line options, but they are fairly useless if users cannot find documentation on how to use them.

Even the command line itself does not print out correct usage information. For example, it lists:

--infilter "Text (encoded):UTF8,LF,,," 

But as has been shown here, UTF8 should be the last parameter, not the first. Also, even after reading the OpenOffice document, I have no clue what the parameters for "Text (encoded)" would mean. However, I doubt LF is a correct parameter, as the parameters are to be provided as ASCII decimal codes.

I'm still struggling to find what are the correct parameters for --convert-to.   For example, how would one specify binary ods output instead of raw xml?
Comment 25 Trond Husø 2015-08-31 07:52:23 UTC
As I would like to convert ODS and Excel files coming in from different sources to CSV or XML before doing more computing on the content, I am also trying to set the correct encoding for a file.

The current file I am working on has Turkish encoding, I see the current conversion is not working correctly.

I have searched for how to set the correct parameters for infilter, but currently without much success.

My last attempt looks like this:
libreoffice --headless --convert-to csv test.ods --infilter='csv:44,34,UTF8'

Hope someone will help figure out how to do this correctly. 

Best
Trond
Comment 26 Stephan Bergmann 2015-09-07 07:58:39 UTC
(In reply to Stephan Bergmann from comment #17)
> With the fix from comment 16, what works for me to resolve the problem from
> comment 0 is to run
> 
>   soffice --headless --convert-to ods --infilter=CSV:44,34,UTF8 mytest.csv
> 
> where "44" denotes the field separator character (,) and "34" denotes the
> quote character (").  The problem is that you cannot leave these obscure
> values out (e.g., --infilter=CSV:,,UTF8) and the "documentation" for the CSV
> filter's FilterOptions string format is ScAsciiOptions::ReadFromString
> (sc/source/ui/dbgui/asciiopt.cxx).

It is no longer clear to me why I assumed that "UTF8" would cause the CSV filter to interpret the input as UTF-8.  Rather, according to ScAsciiOptions::ReadFromString (sc/source/ui/dbgui/asciiopt.cxx) calling ScGlobal::GetCharsetValue (sc/source/core/data/global.cxx), this charset field would be interpreted as follows:

* if it is numeric, interpret it as the corresponding RTL_TEXTENCODING_* from include/rtl/textenc.h;

* if it is one of the (case-ignoring) legacy strings ANSI, MAC, IBMPC, IBMPC_437, IBMPC_850, IBMPC_860, IBMPC_861, IBMPC_863, IBMPC_865, interpret it as the corresponding RTL_TEXTENCODING_*;

* otherwise, fall back to the "system encoding" (osl_getThreadTextEncoding; which is typically UTF-8 at least on Linux and Mac, so an input of "UTF8" often will result in causing the CSV filter to interpret the input as UTF-8 "by accident").
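
The three branches above can be modelled as follows. This is a simplified Python sketch of the decision order only, not the actual C++ code in ScGlobal::GetCharsetValue; the function name `classify_charset_field` and the returned labels are hypothetical, and the real numeric check may differ in detail.

```python
# Legacy names listed above; matching ignores case.
LEGACY_NAMES = {
    "ANSI", "MAC", "IBMPC", "IBMPC_437", "IBMPC_850",
    "IBMPC_860", "IBMPC_861", "IBMPC_863", "IBMPC_865",
}


def classify_charset_field(field: str) -> str:
    """Say which of the three described branches a charset field would take."""
    if field.isascii() and field.isdigit():
        return "numeric"          # read as an RTL_TEXTENCODING_* value
    if field.upper() in LEGACY_NAMES:
        return "legacy"           # mapped to the corresponding encoding
    return "system-default"       # falls back to the thread text encoding


for value in ("76", "ansi", "UTF8"):
    print(value, "->", classify_charset_field(value))
```

Under this model "UTF8" lands in the fallback branch, which matches the observation that it only worked where the system encoding happened to be UTF-8.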
Comment 27 Stephan Bergmann 2015-09-07 08:09:27 UTC
(In reply to Trond Husø from comment #25)
> The current file I am working on has Turkish encoding, I see the current
> conversion is not working correctly.

> My last attempt looks like this:
> libreoffice --headless --convert-to csv test.ods --infilter='csv:44,34,UTF8'

There are various text encodings suitable for Turkish, so you need to be more precise in which exact encoding the data is in.  Presumably it is either the global UTF-8, or the Turkish-specific ISO 8859-9 or its close cousin Windows-1254.  According to comment 26, the corresponding value in the --infilter=CSV:44,34,... argument (instead of "UTF8") should be any of

  76

for UTF-8 (i.e., RTL_TEXTENCODING_UTF8),

  20

for ISO 8859-9 (i.e., RTL_TEXTENCODING_ISO_8859_9),

  36

for Windows-1254 (i.e., RTL_TEXTENCODING_MS_1254).
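
For scripted conversions, the numeric values above can be kept in a small lookup table. The following is a Python sketch that only builds the argument list and does not run LibreOffice. The names `ENCODING_IDS` and `convert_command` are hypothetical, and `soffice` being on the PATH is an assumption.

```python
# Numeric charset values quoted above (RTL_TEXTENCODING_* identifiers).
ENCODING_IDS = {"UTF-8": 76, "ISO-8859-9": 20, "Windows-1254": 36}


def convert_command(source, target_format, encoding, separator=",", quote='"'):
    """Return the argument list for a headless CSV conversion."""
    charset = ENCODING_IDS[encoding]   # KeyError for encodings not listed above
    return [
        "soffice", "--headless", "--convert-to", target_format,
        f"--infilter=CSV:{ord(separator)},{ord(quote)},{charset}",
        source,
    ]


print(convert_command("mytest.csv", "ods", "Windows-1254"))
```

The returned list could be passed to `subprocess.run(cmd, check=True)`. Building a list rather than a shell string avoids quoting problems with the quote character in the filter options.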
Comment 28 Eike Rathke 2016-04-27 16:34:40 UTC
So let's close this which is not a bug anymore. Incidentally https://cgit.freedesktop.org/libreoffice/core/commit/?id=0445de5e0d9bccd7634911ca3547c0e14f4f47c5 implements UTF8 as an accepted charset value as well. (which of course doesn't help if the Turkish case here was something else)