Bug 98965 - FILESAVE: Conversion to xlsx when spreadsheet cell contains vertical tab character fails
Summary: FILESAVE: Conversion to xlsx when spreadsheet cell contains vertical tab char...
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc (show other bugs)
Version:
(earliest affected)
5.0.5.2 release
Hardware: All All
: medium normal
Assignee: Not Assigned
URL:
Whiteboard: target:5.4.0
Keywords: dataLoss
Depends on:
Blocks:
 
Reported: 2016-03-29 18:41 UTC by freeseek
Modified: 2018-01-28 04:59 UTC (History)
3 users (show)

See Also:
Crash report or crash signature:


Attachments
tsv file with vertical tabs inside some cells (244 bytes, text/tab-separated-values)
2017-04-04 16:27 UTC, Carlos
Details
csv file with vertical tabs inside some cells (250 bytes, text/csv)
2017-04-04 16:28 UTC, Carlos
Details
minimal xls file with a vertical tab inside a cell (7.00 KB, application/vnd.ms-excel)
2017-04-04 16:58 UTC, Carlos
Details

Note You need to log in before you can comment on or make changes to this bug.
Description freeseek 2016-03-29 18:41:36 UTC
I have been trying to convert an xls table to an xlsx or csv table, but the conversion fails without even providing a proper error message. I believe the issue is due to a weird character inside a cell of the xls table. Interestingly, the conversion to html works fine instead.

Steps to reproduce the problem:
1) Download problematic xls table:
# wget --user-agent="" http://www.sciencedirect.com/science/MiamiMultiMediaURL/1-s2.0-S0140673614617050/1-s2.0-S0140673614617050-mmc2.xls/271074/html/S0140673614617050/8362fd82294e4f65e12123ba7f198387/mmc2.xls
2) Convert table to html:
# localc --convert-to html mmc2.xls
3) Identify problematic cell (sheet 5, row 129, column 9):
# grep "Nephrotic syndrome, type 2 ." mmc2.xls
# grep "Nephrotic syndrome, type 2 ." mmc2.html
4) Convert table to xlsx:
# localc --convert-to xlsx mmc2.xls
5) Open converted table
# localc mmc2.xlsx

You will notice that in sheet 5 column 4 and column 9 after row 129 are all missing data. The file also seems to be malformed itself, as the following command:
# xlsx2csv mmc2.xlsx
Fails with error "xml.parsers.expat.ExpatError: not well-formed (invalid token): line 2, column 48431"
Comment 1 freeseek 2016-03-29 19:46:40 UTC
I believe I have tracked down the issue more precisely.

It seems like the problem is that some strings in the problematic xls file are ended with a vertical tab (\v or \x11 or ^K) rather than a line feed (\n or \x10). The following commands:
# wget --user-agent="" http://www.sciencedirect.com/science/MiamiMultiMediaURL/1-s2.0-S0140673614617050/1-s2.0-S0140673614617050-mmc2.xls/271074/html/S0140673614617050/8362fd82294e4f65e12123ba7f198387/mmc2.xls
# sed -i -e 's/Nephrotic syndrome, type 2 \v/Nephrotic syndrome, type 2 \n/g' \
  -e 's/Spinocerebellar ataxia 29, congenital nonprogressive\v/Spinocerebellar ataxia 29, congenital nonprogressive\n/g' \
  -e 's/Congenital disorder of glycosylation, type IIk \v/Congenital disorder of glycosylation, type IIk \n/g' \
  -e 's/Charcot-Marie-Tooth disease, type 4J \v/Charcot-Marie-Tooth disease, type 4J \n/g' \
  -e 's/Diaphragmatic hernia 3 \v/Diaphragmatic hernia 3 \n/g' \
  -e 's/Tetralogy of Fallot \v/Tetralogy of Fallot \n/g' \
  -e 's/Phosphoserine aminotransferase deficiency\v/Phosphoserine aminotransferase deficiency\n/g' \
  -e 's/Deafness, autosomal dominant 48 \v/Deafness, autosomal dominant 48 \n/g' \
  -e 's/Periventricular heterotopia with microcephaly\v/Periventricular heterotopia with microcephaly\n/g' \
  -e 's/Epileptic encephalopathy, early infantile, 8 \v/Epileptic encephalopathy, early infantile, 8 \n/g' \
  -e 's/Congenital disorder of glycosylation, type Is \v/Congenital disorder of glycosylation, type Is \n/'g mmc2.xls
# localc --convert-to xlsx mmc2.xls

Do work fine and the file is properly converted. The sed commands replace the offending vertical tabs with the appropriate line feeds. The issue remains that localc is unable to process the vertical tabs during conversion to xlsx but it is perfectly capable of processing the vertical tabs when opening the xls file or when converting to html.
Comment 2 raal 2016-03-30 11:34:21 UTC
Hello, could you prepare a minimal example? Thanks
Comment 3 freeseek 2016-03-30 13:55:05 UTC
I am unable to generate an xls file from scratch with the same issue. However, if I take the original xls file with the issue (http://www.sciencedirect.com/science/MiamiMultiMediaURL/1-s2.0-S0140673614617050/1-s2.0-S0140673614617050-mmc2.xls/271074/html/S0140673614617050/8362fd82294e4f65e12123ba7f198387/mmc2.xls), I copy the cell from sheet 5, row 129, column 9 (the one with the vertical tab at the end of the text) and I copy it into a new spreadsheet and save it as a new xls file with localc (release 5.0.5.2), then I now have a minimalist xls file with the same issue. You can see the result here: https://www.dropbox.com/sh/6a6a2dz7dpfcesd/AAB8d-Uv2CvxSnxc2lZJg2QRa?dl=0

This works fine:
# localc test.xls
This shows that the text disappeared:
# localc --convert-to xlsx test.xls
# localc test.xlsx

I want to add that this is indeed a very serious issue as the problem arises also when you convert the file using "Save As..." from the GUI, that is, no error message is displayed and yet several cells of the spreadsheet disappear.
Comment 4 freeseek 2016-03-30 15:16:12 UTC
I have actually found an easier way to reproduce the problem from scratch. It seems to be related to mishandling the vertical tab character when converting to xlsx. I don't know whether other characters are also causing similar issues.

Here is an example. Generate a minimal table:
# echo -e "abcde\v" > test.tsv

Convert the table to xls and then back to csv:
# localc --convert-to xls test.tsv
# localc --convert-to csv test.xls
# cat test.csv
The output is
> abcde

Convert the table to xlsx and then back to csv:
# localc --convert-to xlsx test.tsv
# localc --convert-to csv test.xlsx
# cat test.csv
The output is empty.
Comment 5 freeseek 2016-03-30 15:48:04 UTC
Apparently 28 ASCII control characters cause the conversion to xlsx to fail, that is, all control characters with the exception of \x00, \x09, \x0A, and \x0D, corresponding to \0, \t, \n, and \r. The following bash script shows that the xlsx file generated by localc is not well-formed due to an invalid token:

for c in {0,1}{{0..9},{A..F}}; do
  echo -en "\\\\x$c\t"
  echo -en "abcde \x$c vwxyz" > test.csv
  localc --convert-to xlsx test.csv > /dev/null
  xlsx2csv test.xlsx 2>&1 | tail -n1
done
Comment 6 freeseek 2016-04-22 03:23:01 UTC
The bug is still present with localc version 5.1.2.2
Comment 7 freeseek 2016-11-12 22:10:31 UTC
The bug is still present with localc version 5.2.2.2
Comment 8 Carlos 2017-04-04 16:27:45 UTC
Created attachment 132335 [details]
tsv file with vertical tabs inside some cells
Comment 9 Carlos 2017-04-04 16:28:50 UTC
Created attachment 132336 [details]
csv file with vertical tabs inside some cells
Comment 10 Carlos 2017-04-04 16:33:10 UTC Comment hidden (obsolete)
Comment 11 Carlos 2017-04-04 16:58:49 UTC
Created attachment 132337 [details]
minimal xls file with a vertical tab inside a cell
Comment 12 Carlos 2017-04-04 17:02:56 UTC
I can confirm that the bug is present in

Version: 5.3.1.2 (x64)
Build ID: e80a0e0fd1875e1696614d24c32df0f95f03deb2
CPU Threads: 4; OS Version: Windows 6.19; UI Render: default; Layout Engine: new; 
Locale: en-US (en_US); Calc: group

You can use https://bugs.documentfoundation.org/attachment.cgi?id=132337
to reproduce it or create a minimal file by yourself using freeseek's original link. 

Note: The conversion (using the GUI) "succeeds" but data is lost.
Comment 13 Xisco Faulí 2017-06-12 11:03:53 UTC
Changing version back to the earliest version affected.
Comment 14 Eike Rathke 2017-06-16 21:04:38 UTC
Please try with 5.4.0.0.beta1 or later, this should be fixed now.
The cause is that in XML 1.0 control characters except carriage return, line feed and horizontal tab are not allowed, not even as entities. OOXML circumvents that by escaping them to _xHHHH, which LibreOffice 5.4 implements.