86277 – Filesave: Libreoffice vs Excel 2010 worksheet saved as .xlsx not opened correctly in (python) pandas/xlrd: hidden column information

Bug 86277 - Filesave: Libreoffice vs Excel 2010 worksheet saved as .xlsx not opened correctly in (python) pandas/xlrd: hidden column information

Summary: Filesave: Libreoffice vs Excel 2010 worksheet saved as .xlsx not opened corre...

Status:	RESOLVED NOTOURBUG

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Calc (show other bugs)
Version: (earliest affected)	4.2.6.3 release
Hardware:	x86-64 (AMD64) All

Importance:	medium minor
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2014-11-14 10:53 UTC by douwe van der veen
Modified:	2016-06-03 19:02 UTC (History)
CC List:	3 users (show)

See Also:
Crash report or crash signature:

Attachments
Minimal example of a working file, and one non-working file. (8.35 KB, application/zip) 2014-11-14 10:53 UTC, douwe van der veen	Details
Works_No (4.75 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet) 2014-11-21 22:18 UTC, Robinson Tryon (qubit)	Details
Works_Yes (4.62 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet) 2014-11-21 22:18 UTC, Robinson Tryon (qubit)	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description douwe van der veen 2014-11-14 10:53:51 UTC

Created attachment 109452 [details]
Minimal example of a working file, and one non-working file.

An .xlsx document is created in MS Excel 2010. When opened in Libreoffice, and saved again as .xlsx document (MS Excel 2007/'10/'13 xml format), it seems perfectly readable and writeable in both Excel and LibreOffice.

However, opening of said .xlsx file fails using the scientific python package, pandas (which uses the python "xlrd" module to read/write .xlsx files).

WORK AROUND: The non-working .xlsx file can be made working by selecting all not used columns (from C to AMZ or whatever is the last column name), then rightclick and "delete columns...", and followed by saving again as .xlsx file.

I have attached a zip file with two documents: "works_no.xlsx" does not open in Python pandas, whereas "works_yes.xlsx" does work. File "works_no.xlsx" can be made to open correctly by applying the work-around.

The test to perform is, using Python (I use the IPython notebook and interactive shell) and the pandas package:

import pandas as pd
myFile = pd.ExcelFile("works_no.xlsx")
myFile.sheet_names

The result of this operation is an AssertionError. 
When the "works_yes.xlsx" file is used, the correct output (that is, the sheet names for the file) are correctly displayed.


It seems that LibreOffice does write some additional data into some of the empty columns, but I was unable to find any hidden character.

Comment 1 Julien Nabet 2014-11-14 11:17:37 UTC

For the test, could you give a try to last stable LO version 4.3.3?

Comment 2 douwe van der veen 2014-11-14 20:27:37 UTC

Tested on Ubuntu 14.10 (fully updated 2014-11-14). 
Libreoffice: Version: 4.3.3.2,  Build ID: 430m0(Build:2)

I have opened both attached documents in LO 4.3.3.2, saved as .xlsx, and closed. The "works_no" file cannot be loaded from python/pandas, whereas "works_yes" will do so. The workaround works as expected.

Thus: similar behavior with LO 4.3.3.2 on my system.

Ad1 -- When I open the "works_no.xlsx" container using an archive manager, in the file "/xl/worksheets/sheet1.xml" there is a large string of information on extra columns. No clue if this is significant, but this line is removed after applyign workaround and saving file again. The string looks like: <c r="C1" s="0"/><c r="D1" s="0"/><c r="E1" s="0"/><c r="F1" s="0"/><c r="G1" s="0"/><c r="H1" s="0"/> etc etc.

Ad2 -- I am able to convert the "works_no.xlsx" file, but I haven't been able to reproduce the  "works_no.xlsx" file from scratch. It appeared somewhere in my workflow from Excel to LibreOffice. Not sure how much time to put into this bug, but I'm reporting as it might point to differences between LO/Excel.

Thanks!

Comment 3 Julien Nabet 2014-11-17 21:07:19 UTC

I put it as UNCONFIRMED since I don't have more questions.

Comment 4 Robinson Tryon (qubit) 2014-11-21 22:18:25 UTC

Created attachment 109820 [details]
Works_No

Add test file individually

Comment 5 Robinson Tryon (qubit) 2014-11-21 22:18:55 UTC

Created attachment 109821 [details]
Works_Yes

Add attachment individually

Comment 6 Robinson Tryon (qubit) 2014-11-21 23:16:24 UTC

TESTING with LO 4.4.0.0.beta1 + Ubuntu 14.04 (and Pandas 0.13.1, for both Python 2 and Python 3)

(In reply to douwe van der veen from comment #0)
> An .xlsx document is created in MS Excel 2010. When opened in Libreoffice,
> and saved again as .xlsx document (MS Excel 2007/'10/'13 xml format), it
> seems perfectly readable and writeable in both Excel and LibreOffice.
> 
> However, opening of said .xlsx file fails using the scientific python
> package, pandas (which uses the python "xlrd" module to read/write .xlsx
> files).

Given that the file continues to work in Excel and LibreOffice, could this be a bug in Pandas?

REPRO steps:

1) $ sudo apt-get install python-pandas python3-pandas python3-xlrd
2) Test against pandas

$ python  # or python3
>>> import pandas as pd
>>> myFile = pd.ExcelFile("works_no.xlsx")
>>> myFile.sheet_names

Result: PARTIAL CONFIRMATION -- This test actually throws an Assertion Error BEFORE we even query for sheet_names.

---
 File "/usr/lib/python2.7/dist-packages/xlrd/xlsx.py", line 89, in cell_name_to_rowx_colx
    assert 0 <= colx < X12_MAX_COLS
AssertionError
---

Same error with Python3:
---
  File "/usr/lib/python3/dist-packages/xlrd/xlsx.py", line 89, in cell_name_to_rowx_colx
    assert 0 <= colx < X12_MAX_COLS
AssertionError
---

> The result of this operation is an AssertionError. 
> When the "works_yes.xlsx" file is used, the correct output (that is, the
> sheet names for the file) are correctly displayed.

Yes, CONFIRMED.


I tested the two XLSX files using Office-o-tron, and both passed validation:
https://wiki.documentfoundation.org/ODF#Office-o-tron

Given that the files validate, it looks like this a bug in Pandas. Given that it's the one throwing the error, please talk to their devs first, and see if they have an idea of how the file is malformed. If they think it's a bug in LibreOffice, please provide a comment with reasoning and change the status back to UNCONFIRMED.

Status -> RESOLVED NOTOURBUG

Comment 7 douwe van der veen 2014-11-23 13:37:56 UTC

Thanks for testing both .xlsx files.

I have reported this issue to the 'python-excel' xlrd package, which is the Python module that is used under the hood by pandas package in which I came across this error.

It appears that this bug/issue has been reported there before, so I have added a reference to the earlier reported issue in the xlrd bug tracker: 

https://github.com/python-excel/xlrd/issues/56


Best regards,
Douwe

Comment 8 douwe van der veen 2016-06-03 19:02:36 UTC

For reference:
For information only:

the python-excel package people appears to have fixed this issue, see https://github.com/python-excel/xlrd/issues/56