I have been trying to convert an xls table to an xlsx or csv table, but the conversion fails without even providing a proper error message. I believe the issue is due to a weird character inside a cell of the xls table. Interestingly, the conversion to html works fine instead. Steps to reproduce the problem: 1) Download problematic xls table: # wget --user-agent="" http://www.sciencedirect.com/science/MiamiMultiMediaURL/1-s2.0-S0140673614617050/1-s2.0-S0140673614617050-mmc2.xls/271074/html/S0140673614617050/8362fd82294e4f65e12123ba7f198387/mmc2.xls 2) Convert table to html: # localc --convert-to html mmc2.xls 3) Identify problematic cell (sheet 5, row 129, column 9): # grep "Nephrotic syndrome, type 2 ." mmc2.xls # grep "Nephrotic syndrome, type 2 ." mmc2.html 4) Convert table to xlsx: # localc --convert-to xlsx mmc2.xls 5) Open converted table # localc mmc2.xlsx You will notice that in sheet 5 column 4 and column 9 after row 129 are all missing data. The file also seems to be malformed itself, as the following command: # xlsx2csv mmc2.xlsx Fails with error "xml.parsers.expat.ExpatError: not well-formed (invalid token): line 2, column 48431"
I believe I have tracked down the issue more precisely. It seems like the problem is that some strings in the problematic xls file are ended with a vertical tab (\v or \x11 or ^K) rather than a line feed (\n or \x10). The following commands: # wget --user-agent="" http://www.sciencedirect.com/science/MiamiMultiMediaURL/1-s2.0-S0140673614617050/1-s2.0-S0140673614617050-mmc2.xls/271074/html/S0140673614617050/8362fd82294e4f65e12123ba7f198387/mmc2.xls # sed -i -e 's/Nephrotic syndrome, type 2 \v/Nephrotic syndrome, type 2 \n/g' \ -e 's/Spinocerebellar ataxia 29, congenital nonprogressive\v/Spinocerebellar ataxia 29, congenital nonprogressive\n/g' \ -e 's/Congenital disorder of glycosylation, type IIk \v/Congenital disorder of glycosylation, type IIk \n/g' \ -e 's/Charcot-Marie-Tooth disease, type 4J \v/Charcot-Marie-Tooth disease, type 4J \n/g' \ -e 's/Diaphragmatic hernia 3 \v/Diaphragmatic hernia 3 \n/g' \ -e 's/Tetralogy of Fallot \v/Tetralogy of Fallot \n/g' \ -e 's/Phosphoserine aminotransferase deficiency\v/Phosphoserine aminotransferase deficiency\n/g' \ -e 's/Deafness, autosomal dominant 48 \v/Deafness, autosomal dominant 48 \n/g' \ -e 's/Periventricular heterotopia with microcephaly\v/Periventricular heterotopia with microcephaly\n/g' \ -e 's/Epileptic encephalopathy, early infantile, 8 \v/Epileptic encephalopathy, early infantile, 8 \n/g' \ -e 's/Congenital disorder of glycosylation, type Is \v/Congenital disorder of glycosylation, type Is \n/'g mmc2.xls # localc --convert-to xlsx mmc2.xls Do work fine and the file is properly converted. The sed commands replace the offending vertical tabs with the appropriate line feeds. The issue remains that localc is unable to process the vertical tabs during conversion to xlsx but it is perfectly capable of processing the vertical tabs when opening the xls file or when converting to html.
Hello, could you prepare a minimal example? Thanks
I am unable to generate an xls file from scratch with the same issue. However, if I take the original xls file with the issue (http://www.sciencedirect.com/science/MiamiMultiMediaURL/1-s2.0-S0140673614617050/1-s2.0-S0140673614617050-mmc2.xls/271074/html/S0140673614617050/8362fd82294e4f65e12123ba7f198387/mmc2.xls), I copy the cell from sheet 5, row 129, column 9 (the one with the vertical tab at the end of the text) and I copy it into a new spreadsheet and save it as a new xls file with localc (release 5.0.5.2), then I now have a minimalist xls file with the same issue. You can see the result here: https://www.dropbox.com/sh/6a6a2dz7dpfcesd/AAB8d-Uv2CvxSnxc2lZJg2QRa?dl=0 This works fine: # localc test.xls This shows that the text disappeared: # localc --convert-to xlsx test.xls # localc test.xlsx I want to add that this is indeed a very serious issue as the problem arises also when you convert the file using "Save As..." from the GUI, that is, no error message is displayed and yet several cells of the spreadsheet disappear.
I have actually found an easier way to reproduce the problem from scratch. It seems to be related to mishandling the vertical tab character when converting to xlsx. I don't know whether other characters are also causing similar issues. Here is an example. Generate a minimal table: # echo -e "abcde\v" > test.tsv Convert the table to xls and then back to csv: # localc --convert-to xls test.tsv # localc --convert-to csv test.xls # cat test.csv The output is > abcde Convert the table to xlsx and then back to csv: # localc --convert-to xlsx test.tsv # localc --convert-to csv test.xlsx # cat test.csv The output is empty.
Apparently 28 ASCII control characters cause the conversion to xlsx to fail, that is, all control characters with the exception of \x00, \x09, \x0A, and \x0D, corresponding to \0, \t, \n, and \r. The following bash script shows that the xlsx file generated by localc is not well-formed due to an invalid token: for c in {0,1}{{0..9},{A..F}}; do echo -en "\\\\x$c\t" echo -en "abcde \x$c vwxyz" > test.csv localc --convert-to xlsx test.csv > /dev/null xlsx2csv test.xlsx 2>&1 | tail -n1 done
The bug is still present with localc version 5.1.2.2
The bug is still present with localc version 5.2.2.2
Created attachment 132335 [details] tsv file with vertical tabs inside some cells
Created attachment 132336 [details] csv file with vertical tabs inside some cells
Can anyone generate a MS Excel 2003 file from my attachments? It will make it easier for us to verify the bug. I have set the bug's status to 'NEEDINFO'. Please change it back to 'UNCONFIRMED' once the requested document is provided.
Created attachment 132337 [details] minimal xls file with a vertical tab inside a cell
I can confirm that the bug is present in Version: 5.3.1.2 (x64) Build ID: e80a0e0fd1875e1696614d24c32df0f95f03deb2 CPU Threads: 4; OS Version: Windows 6.19; UI Render: default; Layout Engine: new; Locale: en-US (en_US); Calc: group You can use https://bugs.documentfoundation.org/attachment.cgi?id=132337 to reproduce it or create a minimal file by yourself using freeseek's original link. Note: The conversion (using the GUI) "succeeds" but data is lost.
Changing version back to the earliest version affected.
Please try with 5.4.0.0.beta1 or later, this should be fixed now. The cause is that in XML 1.0 control characters except carriage return, line feed and horizontal tab are not allowed, not even as entities. OOXML circumvents that by escaping them to _xHHHH, which LibreOffice 5.4 implements.