Created attachment 101245
minimal csv file containing a 0x16 character

Problem description:

Steps to reproduce:
1. Import a CSV file that includes a low-value control character (e.g. 0x16)
2. Save the resulting spreadsheet in xlsx format
3. Attempt to re-open the spreadsheet in LibreOffice or Excel

Current behavior:
The resulting .xlsx file is treated as corrupt (invalid UTF8) by Excel. LibreOffice truncates columns after the corrupted cell when the xlsx file is reloaded.

Expected behavior:
The CSV import filter could reject the file or strip out the control characters, or the XLSX export could use an encoding that copes with the corrupt characters.

Although the initial csv file is obviously malformed, such corrupt data exists in the wild and is hard to detect. LibreOffice's silent truncation of columns after the corrupt cell is problematic, as it can easily be overlooked. (In my original example, rows after the corrupt cell stayed but with right-hand columns missing; no error was reported by LibreOffice.)

Operating System: Windows 7
Version: 4.2.4.2 release
I just tried this in both 4.2.4.2 and 4.2.5.2, and Calc was able to import this file fine; the 2nd and 3rd rows came in fine. I am on Fedora 20. Is there a specific manner in which you brought the file into Calc? The method I used was File > Open, then accepting the defaults (character set: Unicode (UTF-8); language: Default - English (USA); from row: 1; separator options: tab, comma, semicolon).
Dear Bug Submitter, This bug has been in NEEDINFO status with no change for at least 6 months. Please provide the requested information as soon as possible and mark the bug as UNCONFIRMED. Due to regular bug tracker maintenance, if the bug is still in NEEDINFO status with no change in 30 days the QA team will close the bug as INVALID due to lack of needed information. For more information about our NEEDINFO policy please read the wiki located here: https://wiki.documentfoundation.org/QA/FDO/NEEDINFO If you have already provided the requested information, please mark the bug as UNCONFIRMED so that the QA team knows that the bug is ready to be confirmed. Thank you for helping us make LibreOffice even better for everyone! Warm Regards, QA Team Message generated on: 10/01/2015
Confirming. The Excel file contains the literal byte 0x16, instead of the required representation as "_x0016_". The side issue is that LO drops that character entirely when saving to ODS.
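One way to check a saved file for the problem described above is to scan the XML parts inside the .xlsx package for raw control bytes. This is only a rough sketch: the helper names `control_bytes_in` and `find_raw_control_bytes` are made up for illustration, and it assumes the parts are UTF-8 encoded (a UTF-16 part would contain 0x00 bytes and produce false hits).

```python
import re
import zipfile

# Bytes below 0x20 other than tab (0x09), LF (0x0A) and CR (0x0D).
_CONTROL_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def control_bytes_in(data):
    """Return the sorted, distinct control byte values found in data."""
    return sorted({m.group()[0] for m in _CONTROL_BYTES.finditer(data)})


def find_raw_control_bytes(xlsx):
    """Map each XML part of an .xlsx package to the control bytes it contains.

    xlsx may be a path or a binary file-like object; parts with no hits
    are left out of the result.
    """
    hits = {}
    with zipfile.ZipFile(xlsx) as zf:
        for name in zf.namelist():
            # Assumption: only .xml and .rels parts are of interest here.
            if not name.endswith((".xml", ".rels")):
                continue
            found = control_bytes_in(zf.read(name))
            if found:
                hits[name] = found
    return hits
```

Calling `find_raw_control_bytes("export.xlsx")` on a file saved from the attached CSV (the file name is hypothetical) should then list the part that carries the literal 0x16, if the export wrote it unescaped.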
** Please read this message in its entirety before responding **

To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.

There have been thousands of bug fixes and commits since anyone checked on this bug report. During that time, it's possible that the bug has been fixed, or the details of the problem have changed. We'd really appreciate your help in getting confirmation that the bug is still present.

If you have time, please do the following:

Test to see if the bug is still present on a currently supported version of LibreOffice (5.0.4 or later)
https://www.libreoffice.org/download/

If the bug is present, please leave a comment that includes the version of LibreOffice and your operating system, and any changes you see in the bug behavior

If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave a short comment that includes your version of LibreOffice and Operating System

Please DO NOT:
- Update the version field
- Reply via email (please reply directly on the bug tracker)
- Set the bug's Status field to RESOLVED - FIXED (this status has a particular meaning that is not appropriate in this case)

If you want to do more to help you can test to see if your issue is a REGRESSION. To do so:
1. Download and install oldest version of LibreOffice (usually 3.3 unless your bug pertains to a feature added after 3.3)
http://downloadarchive.documentfoundation.org/libreoffice/old/
2. Test your bug
3. Leave a comment with your results.
4a. If the bug was present with 3.3 - set version to "inherited from OOo";
4b. If the bug was not present in 3.3 - add "regression" to keyword

Feel free to come ask questions or to say hello in our QA chat: http://webchat.freenode.net/?channels=libreoffice-qa

Thank you for your help!

-- The LibreOffice QA Team

This NEW Message was generated on: 2016-01-17
Characters below 0x20, except tab, carriage return and linefeed, are illegal characters in XML, see https://www.w3.org/TR/xml/#charsets. This holds EVEN for the &#xhhhh; character reference representation, see https://www.w3.org/TR/xml/#sec-references "Well-formedness constraint: Legal Character". This is the reason why they are not saved to .ods. Actually they should also be dropped when saving to .xlsx, as you see Excel otherwise stumbles. Excel apparently came up with its own unspecified invention to save such characters as _x0016_. As you have to be able to distinguish a literal "_x0016_", they write that as "_x005F_x0016_". Oh glory. That's "encoding illegal XML characters in SQL" http://dcx.sap.com/1200/en/dbusage/xmldraftchapter-b-3488944b.html where "_x" is encoded as "_x005F_x", so they borrowed that from their SQL-Server guys.
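The legal-character rule can be illustrated with a small sketch. The ranges below are the Char production from the XML 1.0 specification linked above; the function names (`is_xml10_char`, `strip_illegal_xml_chars`, `parses`) are invented for this example and are not LibreOffice code. The `parses` helper relies on Python's bundled expat parser to show that a character reference to 0x16 is rejected just like the raw character.

```python
import xml.etree.ElementTree as ET


def is_xml10_char(ch):
    """True if ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def strip_illegal_xml_chars(text):
    """Drop every character that XML 1.0 cannot represent."""
    return "".join(ch for ch in text if is_xml10_char(ch))


def parses(xml_text):
    """Report whether the parser accepts xml_text as well-formed."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True
```

For instance, `strip_illegal_xml_chars("a\x16b\tc")` keeps the tab but drops the 0x16, which corresponds to the "drop on save" behaviour described for .ods.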
So this is in fact specified somewhere. Citing from "ECMA-376 Part 1" (OOXML), page 3732:

22.4.2.4 bstr (Basic String)

This element defines a binary basic string variant type, which can store any valid Unicode character. Unicode characters that cannot be directly represented in XML as defined by the XML 1.0 specification, shall be escaped using the Unicode numerical character representation escape character format _xHHHH_, where H represents a hexadecimal character in the character's value. [Example: The Unicode character 8 is not permitted in an XML 1.0 document, so it shall be escaped as _x0008_. end example] To store the literal form of an escape sequence, the initial underscore shall itself be escaped (i.e. stored as _x005F_). [Example: The string literal _x0008_ would be stored as _x005F_x0008_. end example]

The possible values for this element are defined by the W3C XML Schema string datatype.
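The escaping rule quoted above can be sketched as follows. This is an illustration of the _xHHHH_ scheme as the cited paragraph describes it, not the code from the LibreOffice fix; `ooxml_escape` and `ooxml_unescape` are hypothetical names, and treating lowercase hex digits as part of an escape sequence is an assumption the quoted text does not settle.

```python
import re

# An underscore that starts something looking like an _xHHHH_ escape.
_ESCAPE_LIKE = re.compile(r"_(?=x[0-9A-Fa-f]{4}_)")
_ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")


def _is_xml10_char(cp):
    """True if code point cp is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def ooxml_escape(text):
    """Escape text using the _xHHHH_ convention."""
    # First protect literal escape-like sequences by escaping their
    # leading underscore, so they survive a later unescape unchanged.
    text = _ESCAPE_LIKE.sub("_x005F_", text)
    # Then replace every character XML 1.0 cannot carry. All such
    # code points are below 0x10000, so four hex digits suffice.
    return "".join(
        ch if _is_xml10_char(ord(ch)) else "_x%04X_" % ord(ch)
        for ch in text
    )


def ooxml_unescape(text):
    """Reverse ooxml_escape."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
```

The order of the two steps in `ooxml_escape` matters: if the illegal characters were replaced first, the freshly written `_x0016_` would itself be treated as a literal and get its underscore escaped.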
*** Bug 103828 has been marked as a duplicate of this bug. ***
Should be fixed with commit 8b25b67d5268abbb260da968cc23b6f6c8dd31af for 5.4
Xisco Fauli committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/core/commit/9f89ee7c5076f700589d3b07f3d6a50f9af7d13a tdf#80149: sc_subsequent_export: Add unittest It will be available in 7.2.0. The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.