Users can convert documents in headless mode. For example, you can convert a CSV file to an ODS file using the command line. The problem is that LibreOffice assumes, by default, that the initial encoding is ISO-8859-1, and there is no option yet to change this. Documents with other encodings have their text corrupted.

HOW TO REPLICATE:

a. Create the following test.csv file:

$ cat /tmp/mytest.csv
"First","Second"
"áéŕó","ṫřåiṅ"
$ _

b. Then convert with the following command line:

$ libreoffice -headless -convert-to ods mytest.csv
convert /tmp/mytest.csv -> /tmp/mytest.ods using OpenDocument Spreadsheet Flat XML
Warning: at xsl:stylesheet on line 2 of file:///usr/lib/libreoffice/basis3.3/share/xslt/odfflatxml/odfflatxmlexport.xsl:
  Running an XSLT 1.0 stylesheet with an XSLT 2.0 processor
$ _

c. Finally, inspect the generated mytest.ods:

<table:table-cell office:value-type="string">
  <text:p>áéŕó</text:p>
</table:table-cell>
<table:table-cell office:value-type="string">
  <text:p>ṫřåiá¹…</text:p>

The text shows that a conversion from ISO-8859-1 to UTF-8 was forced, which corrupts the text.

WHAT SHOULD HAPPEN:

» There should be an option to specify the initial encoding so that no forced conversion takes place.

RELEVANT LINKS:
http://listarchives.libreoffice.org/www/users/msg03444.html
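The corruption mechanism described above (UTF-8 input read as ISO-8859-1, then re-encoded to UTF-8) can be reproduced outside LibreOffice with iconv. This is only an illustrative sketch of the suspected mechanism; the sample characters are chosen for clarity and are not the exact bytes from the test file:

```shell
# Sketch of the suspected corruption: the file's bytes are UTF-8, but the
# importer treats them as ISO-8859-1 and re-encodes them to UTF-8.
# 'á' (U+00E1) is the UTF-8 byte pair C3 A1; read as ISO-8859-1, those two
# bytes are the characters 'Ã' and '¡' -- classic mojibake.
printf 'áé' | iconv -f ISO-8859-1 -t UTF-8   # prints Ã¡Ã©
```

This matches the kind of damage seen in the generated content.xml above.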
This is possible with a direct UNO call and some extra filter options, IIRC. Though presumably it is not available from the command line.
It would probably be more generally useful if we added a command-line option to pass extra filter options, rather than explicitly adding an encoding option that would only benefit a few filters. Anyway, this is a feature request.
Can anyone comment on whether this is the same bug as bug 38311?

https://bugs.freedesktop.org/show_bug.cgi?id=38311
(In reply to comment #3)
> Can anyone comment if this is the same bug as in 38311?
>
> https://bugs.freedesktop.org/show_bug.cgi?id=38311

It's not the same bug. The bug here has to do with the encoding being mangled when converting (at least) from CSV to ODS. The problem you are facing has to do with font selection: in the problematic PDF output, you can easily see (check the Properties in your PDF viewer) that different fonts are being selected. However, both bugs relate to how parameters are passed when performing conversions in headless mode, so they do have something in common.
[This is an automated message.]

This bug was filed before the changes to Bugzilla on 2011-10-16, so it started out as NEW without ever being explicitly confirmed. The bug is therefore changed to state NEEDINFO. To move this bug from NEEDINFO back to NEW, please check whether the bug still persists with the 3.5.0 beta1 or beta2 prereleases.

Details on how to test the 3.5.0 beta1 can be found at:
http://wiki.documentfoundation.org/QA/BugHunting_Session_3.5.0.-1

More detail on this bulk operation:
http://nabble.documentfoundation.org/RFC-Operation-Spamzilla-tp3607474p3607474.html
Dear bug submitter!

Because there are a lot of NEEDINFO bugs that have had no answer within the last six months, we are closing all of them. To keep this message short, more information is available at https://wiki.documentfoundation.org/QA/NeedinfoClosure#Statement

Thanks for understanding, and hopefully updating your bug, so that everything is prepared for developers to fix your problem.

Yours!
Florian
Hello,

I can confirm this is still happening in LibreOffice 3.6. I have been trying to convert various documents from various languages (so with non-ASCII characters, e.g. from Chinese, Japanese, French), and non-ASCII characters often fail to convert.

For xls or xlsx files, I use the command:
$ oowriter --headless --convert-to csv <some file>.xls

For doc files:
$ oowriter --headless --convert-to txt:TEXT <some file>.doc

And the Japanese or Chinese characters come out unconverted. Actually, it sometimes works fine: if, for instance, I create a file from scratch in LibreOffice and write down some Japanese in it, it converts fine (at least in my basic test). Yet if I take a doc/xls from someone else (most probably created in Microsoft Word/Excel), the Japanese characters all end up as '?' characters. Note that if I convert the same document to PDF, the Japanese characters are displayed correctly.

What kind of information do you need to debug this? I would have a hard time sending the documents that fail, because we are not allowed to share them. And I don't have easy access to a Windows machine to create a doc there in order to reproduce the issue, but I may still try.

I guess I cannot change the status myself to reopen the ticket, can I? The statement link above seems to say that a QA member has to do it.
This is an RFE; reopening.
The problem is the same in 3.6.3. Can someone please help?
Thanks for the additional testing. Sorry, but the "Version" field is the version where the bug first appeared, not the current version of LO. If the bug disappears, we just close the bug report. Changing it back to 3.3.2.
(The bug still exists in LibreOffice 4.0.3.3 and probably in master)
*** Bug 61960 has been marked as a duplicate of this bug. ***
Tomas Hlavaty committed a patch related to this issue. It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=45ba4d79d968f81f74ef0c4588fd15b1ce91153f

fdo#36313: allow passing FilterOptions via cli

The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
With the fix from comment 16, what works for me to resolve the problem from comment 0 is to run

soffice --headless --convert-to ods --infilter=CSV:44,34,UTF8 mytest.csv

where "44" denotes the field separator character (,) and "34" denotes the quote character ("). The problem is that you cannot leave these obscure values out (e.g., --infilter=CSV:,,UTF8), and the "documentation" for the CSV filter's FilterOptions string format is ScAsciiOptions::ReadFromString (sc/source/ui/dbgui/asciiopt.cxx). That is, I am not sure whether the fix from comment 16 is already a practical-enough solution.
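The decimal codes in --infilter=CSV:44,34,UTF8 need not be memorized; a POSIX shell can derive them from the characters themselves. This is only a convenience sketch and not part of LibreOffice:

```shell
# POSIX printf: a leading single quote makes %d yield a character's code,
# so the separator and quote fields of the FilterOptions string can be
# computed instead of hard-coded.
sep=$(printf '%d' "',")     # 44, the comma
quote=$(printf '%d' "'\"")  # 34, the double quote
echo "CSV:${sep},${quote},UTF8"   # prints CSV:44,34,UTF8
```

The same trick gives 59 for a semicolon, 9 for a tab, and so on.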
Tomas Hlavaty committed a patch related to this issue. It has been pushed to "libreoffice-4-3":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=4b8a0159ca80dad05ddcaad5897b786484fc8afb&h=libreoffice-4-3

fdo#36313: allow passing FilterOptions via cli

It will be available in LibreOffice 4.3.2.

The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Tested under GNU/Linux using 4.4.0.0.alpha0+
Build ID: 037d03b9facb414ba6be01fa6ee92fc7ca89f70c
TinderBox: Linux-rpm_deb-x86_64@46-TDF, Branch:master, Time: 2014-09-11_00:32:52

$ cat test.csv
2014-09-13;123;"abc"

$ /opt/libreofficedev4.4/program/soffice --headless --convert-to ods --infilter=CSV:59,34,UTF8 a.csv
convert /data/temp/LO_test/a.csv -> /data/temp/LO_test/a.ods using calc8
Overwriting: /data/temp/LO_test/a.ods
$ unzip -p a.ods content.xml | xmllint --format - | grep "<text:p>[^<]*"
<text:p>2014-09-13</text:p>
<text:p>123</text:p>
<text:p>abc</text:p>

$ /opt/libreofficedev4.4/program/soffice --headless --convert-to ods --infilter=CSV:59,,UTF8 a.csv
convert /data/temp/LO_test/a.csv -> /data/temp/LO_test/a.ods using calc8
Overwriting: /data/temp/LO_test/a.ods
$ unzip -p a.ods content.xml | xmllint --format - | grep "<text:p>[^<]*"
<text:p>2014-09-13</text:p>
<text:p>123</text:p>
<text:p>"abc"</text:p>

$ /opt/libreofficedev4.4/program/soffice --headless --convert-to ods --infilter=CSV:,,UTF8 a.csv
convert /data/temp/LO_test/a.csv -> /data/temp/LO_test/a.ods using calc8
Overwriting: /data/temp/LO_test/a.ods
$ unzip -p a.ods content.xml | xmllint --format - | grep "<text:p>[^<]*"
<text:p>2014-09-13;123;"abc"</text:p>

$ cat b.csv
"First","Second"
"áéŕó","ṫřåiṅ"

$ /opt/libreofficedev4.4/program/soffice --headless --convert-to ods --infilter=CSV:44,34,UTF8 b.csv
convert /data/temp/LO_test/b.csv -> /data/temp/LO_test/b.ods using calc8
Overwriting: /data/temp/LO_test/b.ods
$ unzip -p b.ods content.xml | xmllint --format - | grep "<text:p>[^<]*"
<text:p>First</text:p>
<text:p>Second</text:p>
<text:p>áéŕó</text:p>
<text:p>ṫřåiṅ</text:p>

$ /opt/libreofficedev4.4/program/soffice --headless --convert-to ods --infilter=CSV:44,,UTF8 b.csv
convert /data/temp/LO_test/b.csv -> /data/temp/LO_test/b.ods using calc8
Overwriting: /data/temp/LO_test/b.ods
$ unzip -p b.ods content.xml | xmllint --format - | grep "<text:p>[^<]*"
<text:p>"First"</text:p>
<text:p>"Second"</text:p>
<text:p>"áéŕó"</text:p>
<text:p>"ṫřåiṅ"</text:p>

$ /opt/libreofficedev4.4/program/soffice --headless --convert-to ods --infilter=CSV:,,UTF8 b.csv
convert /data/temp/LO_test/b.csv -> /data/temp/LO_test/b.ods using calc8
Overwriting: /data/temp/LO_test/b.ods
$ unzip -p b.ods content.xml | xmllint --format - | grep "<text:p>[^<]*"
<text:p>"First","Second"</text:p>
<text:p>"áéŕó","ṫřåiṅ"</text:p>

Are only decimal values available for character specification, or can hex values be used also? I could not find a hex notation that worked.

In any case, well done Tomas, and thank you.
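Regarding the hex question above: I cannot confirm whether the filter itself accepts hex, but a shell can convert a hex character code into the decimal form shown to work in this thread. This is a workaround sketch, not a statement about the filter's parser:

```shell
# POSIX printf evaluates its numeric arguments like C constants, so a
# 0x-prefixed hex code can be turned into the decimal value that the CSV
# FilterOptions string demonstrably accepts.
printf '%d\n' 0x2C   # comma     -> 44
printf '%d\n' 0x22   # quote     -> 34
printf '%d\n' 0x3B   # semicolon -> 59
```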
(In reply to comment #19)
> $ cat test.csv
> 2014-09-13;123;"abc"

Sorry, obviously test.csv should read a.csv.
Dear community members,

I would like to thank all of you for your work to fix the bug I reported. With this problem solved, I am now able to fully automate parts of my work.

Many thanks again,
Sincerely yours,
Vassilios A. Zoukos
Public Power Corporation
Greece, Athens
1) User documentation would be very helpful for this patch.
2) I would also like to propose a modification.

1) Users are most unlikely to know the potential options well enough even to discover the operation of this function by trial and error. If the documentation is available somewhere, it has not made it to me; I can see no mention of it in the man page. I am unable to put it to use without help: batch-converting some 60 Excel sheets from a potentially variable number of unknown Windows encodings to UTF-8.

2) This function would be improved greatly if it did as much as possible without being told specifically which encoding/format to input. This is crucial for users. It is possible, after all, to read a file into the OpenOffice GUI without knowing beforehand what the format/encoding is. If you want to batch-convert a collection of source files from, say, a public body's open-data scheme, those would have been collated from a variety of sources over a period of years into a single group of files. Their encodings would differ and most likely be unspecified. This would not matter if the LibreOffice batch-convert function used whatever the GUI uses to read files, without having to be told in advance how they were made. The batch function might be unable to detect all types/encodings reliably; this would not matter. If only it did the detection as well as the GUI, it could report/pass over/exit on files that could not be detected, but arguably the majority would be handled automatically. And the majority of users then might simply use this tool.
*** Bug 89452 has been marked as a duplicate of this bug. ***
The following is the most comprehensive webpage I could find that describes the input filters:

https://wiki.openoffice.org/wiki/Documentation/DevGuide/Spreadsheets/Filter_Options

This is for OpenOffice, not LibreOffice. Its most obvious deficiency is that it doesn't mention that you can append a character encoding to the end of the options list, as in the example in this thread:

oocalc --infilter='csv:44,34,UTF8'

It also doesn't list many of the formats available for LibreOffice, nor that you can usually just use the file extension, as with csv. LibreOffice has some very powerful command-line options, but they are also fairly useless if users cannot find documentation on how to use them. Even the command line itself does not print out correct usage information. For example, it lists as an example:

--infilter "Text (encoded):UTF8,LF,,,"

But as has been shown here, UTF8 should be the last parameter, not the first. Also, even after reading the OpenOffice document, I have no clue what the parameters for "Text (encoded)" would mean; I doubt LF is a correct parameter, as parameters should apparently be provided as ASCII decimal codes. I'm still struggling to find the correct parameters for --convert-to. For example, how would one specify binary ods output instead of raw XML?
As I would like to convert ODS and Excel files coming in from different sources to CSV or XML before doing more computing on the content, I am also trying to set the correct encoding for a file. The current file I am working on has a Turkish encoding, and I can see that the current conversion is not working correctly. I have searched for how to set the correct parameters for --infilter, but so far without much success. My last attempt looks like this:

libreoffice --headless --convert-to csv test.ods --infilter='csv:44,34,UTF8'

Hope someone can help me figure out how to do this correctly.

Best,
Trond
(In reply to Stephan Bergmann from comment #17)
> With the fix from comment 16, what works for me to resolve the problem from
> comment 0 is to run
>
> soffice --headless --convert-to ods --infilter=CSV:44,34,UTF8 mytest.csv
>
> where "44" denotes the field separator character (,) and "34" denotes the
> quote character ("). The problem is that you cannot leave these obscure
> values out (e.g., --infilter=CSV:,,UTF8) and the "documentation" for the CSV
> filter's FilterOptions string format is ScAsciiOptions::ReadFromString
> (sc/source/ui/dbgui/asciiopt.cxx).

It is no longer clear to me why I assumed that "UTF8" would cause the CSV filter to interpret the input as UTF-8. Rather, according to ScAsciiOptions::ReadFromString (sc/source/ui/dbgui/asciiopt.cxx) calling ScGlobal::GetCharsetValue (sc/source/core/data/global.cxx), this charset field is interpreted as follows:

* if it is numeric, interpret it as the corresponding RTL_TEXTENCODING_* from include/rtl/textenc.h;
* if it is one of the (case-ignoring) legacy strings ANSI, MAC, IBMPC, IBMPC_437, IBMPC_850, IBMPC_860, IBMPC_861, IBMPC_863, IBMPC_865, interpret it as the corresponding RTL_TEXTENCODING_*;
* otherwise, fall back to the "system encoding" (osl_getThreadTextEncoding), which is typically UTF-8 at least on Linux and Mac, so an input of "UTF8" will often cause the CSV filter to interpret the input as UTF-8 "by accident".
(In reply to Trond Husø from comment #25)
> The current file I am working on has Turkish encoding, I see the current
> conversion is not working correctly.
> My last attempt looks like this:
> libreoffice --headless --convert-to csv test.ods --infilter='csv:44,34,UTF8'

There are various text encodings suitable for Turkish, so you need to be more precise about which exact encoding the data is in. Presumably it is either the global UTF-8, the Turkish-specific ISO 8859-9, or its close cousin Windows-1254. According to comment 26, the corresponding value in the --infilter=CSV:44,34,... argument (instead of "UTF8") should be one of:

* 76 for UTF-8 (i.e., RTL_TEXTENCODING_UTF8)
* 20 for ISO 8859-9 (i.e., RTL_TEXTENCODING_ISO_8859_9)
* 36 for Windows-1254 (i.e., RTL_TEXTENCODING_MS_1254)
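Putting the pieces from this thread together, a small shell helper can assemble the --infilter value from a separator character, a quote character, and one of the numeric encoding codes above. The helper name and structure are my own sketch; only the CSV:<sep>,<quote>,<charset> token layout comes from the thread:

```shell
# Hypothetical helper: build a CSV FilterOptions string such as
# CSV:44,34,76 (comma-separated, double-quoted, UTF-8) from readable
# arguments. Encoding codes per this thread: 76 = UTF-8,
# 20 = ISO 8859-9, 36 = Windows-1254.
make_infilter() {
  sep=$(printf '%d' "'$1")     # character code of the field separator
  quote=$(printf '%d' "'$2")   # character code of the quote character
  echo "CSV:${sep},${quote},$3"
}

make_infilter ',' '"' 76   # prints CSV:44,34,76
# usage (sketch):
#   soffice --headless --convert-to ods \
#     --infilter="$(make_infilter ',' '"' 76)" input.csv
```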
So let's close this, which is not a bug anymore. Incidentally, https://cgit.freedesktop.org/libreoffice/core/commit/?id=0445de5e0d9bccd7634911ca3547c0e14f4f47c5 implements UTF8 as an accepted charset value as well (which of course doesn't help if the Turkish case here was something else).
*** Bug 101975 has been marked as a duplicate of this bug. ***