Created attachment 109452 [details]
Minimal example of a working file, and one non-working file.
An .xlsx document was created in MS Excel 2010. After it is opened in LibreOffice and saved again as an .xlsx document (MS Excel 2007/'10/'13 XML format), it seems perfectly readable and writeable in both Excel and LibreOffice.
However, opening this .xlsx file fails in pandas, the scientific Python package, which uses the Python "xlrd" module to read .xlsx files.
WORKAROUND: The non-working .xlsx file can be made to work by selecting all unused columns (from C to AMZ, or whatever the last column name is), right-clicking, choosing "Delete columns...", and then saving again as an .xlsx file.
I have attached a zip file with two documents: "works_no.xlsx" does not open in Python pandas, whereas "works_yes.xlsx" does work. File "works_no.xlsx" can be made to open correctly by applying the work-around.
The test to perform is, using Python (I use the IPython notebook and interactive shell) and the pandas package:
import pandas as pd
myFile = pd.ExcelFile("works_no.xlsx")
The result of this operation is an AssertionError.
When the "works_yes.xlsx" file is used, the expected output (that is, the sheet names for the file) is displayed correctly.
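To compare both attached files in one run, the test can be wrapped in a small helper. This is only a sketch: `report_openable` and its `opener` parameter are hypothetical names introduced here, and the default opener assumes pandas is installed.

```python
def report_openable(paths, opener=None):
    """Try to open each path and record whether an AssertionError was raised.

    `opener` is any callable taking a path; when omitted, pandas.ExcelFile
    is used (assumption: pandas is installed).
    """
    if opener is None:
        import pandas as pd  # imported lazily so the helper loads without pandas
        opener = pd.ExcelFile

    results = {}
    for path in paths:
        try:
            opener(path)
            results[path] = "ok"
        except AssertionError:
            # This is the failure reported above for "works_no.xlsx".
            results[path] = "AssertionError"
    return results
```

Calling `report_openable(["works_no.xlsx", "works_yes.xlsx"])` in the directory holding the unzipped attachment should then show which of the two files triggers the error.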
It seems that LibreOffice does write some additional data into some of the empty columns, but I was unable to find any hidden character.
For the test, could you try the latest stable LO version, 4.3.3?
Tested on Ubuntu 14.10 (fully updated 2014-11-14).
Libreoffice: Version: 188.8.131.52, Build ID: 430m0(Build:2)
I have opened both attached documents in LO 184.108.40.206, saved as .xlsx, and closed. The "works_no" file cannot be loaded from python/pandas, whereas "works_yes" will do so. The workaround works as expected.
Thus: similar behavior with LO 220.127.116.11 on my system.
Ad1 -- When I open the "works_no.xlsx" container using an archive manager, the file "/xl/worksheets/sheet1.xml" contains a long string of entries for the extra columns. No clue if this is significant, but this string is removed after applying the workaround and saving the file again. The string looks like: <c r="C1" s="0"/><c r="D1" s="0"/><c r="E1" s="0"/><c r="F1" s="0"/><c r="G1" s="0"/><c r="H1" s="0"/> etc etc.
Ad2 -- I am able to convert the "works_no.xlsx" file, but I haven't been able to reproduce the "works_no.xlsx" file from scratch. It appeared somewhere in my workflow from Excel to LibreOffice. Not sure how much time to put into this bug, but I'm reporting as it might point to differences between LO/Excel.
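The extra `<c>` elements described under Ad1 can also be listed without an archive manager. The following is a minimal sketch using only the Python standard library; `empty_cell_refs` is a hypothetical helper name, and it assumes the sheet lives at "xl/worksheets/sheet1.xml" as in the attached files. It only reports cells that carry no value; it does not establish whether those cells cause the error.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by worksheet XML in .xlsx (SpreadsheetML) files.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def empty_cell_refs(xlsx_source, sheet_path="xl/worksheets/sheet1.xml"):
    """Return the r="..." references of <c> elements that have no children.

    A cell such as <c r="C1" s="0"/> has a style but no <v> value element,
    so it shows up here; a cell holding a value does not.
    """
    with zipfile.ZipFile(xlsx_source) as archive:
        root = ET.fromstring(archive.read(sheet_path))
    return [cell.get("r") for cell in root.iter(NS + "c") if len(cell) == 0]
```

Running `empty_cell_refs("works_no.xlsx")` before and after the workaround should show whether the list of value-less cells shrinks.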
I put it as UNCONFIRMED since I don't have more questions.
Created attachment 109820 [details]
Add test file individually
Created attachment 109821 [details]
Add attachment individually
TESTING with LO 18.104.22.168.beta1 + Ubuntu 14.04 (and Pandas 0.13.1, for both Python 2 and Python 3)
(In reply to douwe van der veen from comment #0)
> An .xlsx document is created in MS Excel 2010. When opened in Libreoffice,
> and saved again as .xlsx document (MS Excel 2007/'10/'13 xml format), it
> seems perfectly readable and writeable in both Excel and LibreOffice.
> However, opening of said .xlsx file fails using the scientific python
> package, pandas (which uses the python "xlrd" module to read/write .xlsx
Given that the file continues to work in Excel and LibreOffice, could this be a bug in Pandas?
1) $ sudo apt-get install python-pandas python3-pandas python3-xlrd
2) Test against pandas
$ python # or python3
>>> import pandas as pd
>>> myFile = pd.ExcelFile("works_no.xlsx")
Result: PARTIAL CONFIRMATION -- This test actually throws an AssertionError BEFORE we even query for sheet_names.
File "/usr/lib/python2.7/dist-packages/xlrd/xlsx.py", line 89, in cell_name_to_rowx_colx
assert 0 <= colx < X12_MAX_COLS
Same error with Python3:
File "/usr/lib/python3/dist-packages/xlrd/xlsx.py", line 89, in cell_name_to_rowx_colx
assert 0 <= colx < X12_MAX_COLS
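For readers unfamiliar with this kind of check, here is a rough sketch of converting a cell reference such as "C1" into zero-based row and column indices and rejecting out-of-range columns. This is not xlrd's code: `cell_ref_to_row_col` and `MAX_COLS` are names made up for illustration, and `MAX_COLS = 16384` assumes the XLSX limit of columns A through XFD.

```python
import re

MAX_COLS = 16384  # assumption: XLSX column limit, A..XFD

# One or more uppercase letters followed by a row number that cannot start with 0.
_CELL_REF = re.compile(r"^([A-Z]+)([1-9][0-9]*)$")


def cell_ref_to_row_col(ref):
    """Convert a reference like "C1" to zero-based (row, column) indices."""
    match = _CELL_REF.match(ref)
    if match is None:
        raise ValueError("malformed cell reference: %r" % ref)
    letters, digits = match.groups()

    # Column letters form a bijective base-26 number: A=1, Z=26, AA=27, ...
    col = 0
    for ch in letters:
        col = col * 26 + (ord(ch) - ord("A") + 1)
    col -= 1  # shift to zero-based

    if not 0 <= col < MAX_COLS:
        raise ValueError("column out of range in %r" % ref)
    return int(digits) - 1, col
```

With this sketch, "C1" maps to row 0, column 2, and "AMZ1" maps to column 1039, which is well inside the limit.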
> The result of this operation is an AssertionError.
> When the "works_yes.xlsx" file is used, the correct output (that is, the
> sheet names for the file) are correctly displayed.
I tested the two XLSX files using Office-o-tron, and both passed validation:
Given that the files validate, it looks like this is a bug in Pandas. Since Pandas is the one throwing the error, please talk to their devs first and see if they have an idea of how the file is malformed. If they think it's a bug in LibreOffice, please provide a comment with reasoning and change the status back to UNCONFIRMED.
Status -> RESOLVED NOTOURBUG
Thanks for testing both .xlsx files.
I have reported this issue to the 'python-excel' xlrd project. xlrd is the Python module that pandas uses under the hood, and pandas is where I came across this error.
It appears that this bug/issue has been reported there before, so I have added a reference to the earlier reported issue in the xlrd bug tracker:
For information only:
The python-excel developers appear to have fixed this issue; see https://github.com/python-excel/xlrd/issues/56