Bug 144732 - [FILEOPEN] Missing content from bare BIFF8 XLS Files
Summary: [FILEOPEN] Missing content from bare BIFF8 XLS Files
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc (show other bugs)
Version:
(earliest affected)
unspecified
Hardware: All All
: medium normal
Assignee: Not Assigned
URL:
Whiteboard: target:7.3.0 target:7.2.3
Keywords:
Depends on:
Blocks:
 
Reported: 2021-09-26 07:05 UTC by SheetJS
Modified: 2021-09-28 12:13 UTC (History)
3 users (show)

See Also:
Crash report or crash signature:


Attachments
Demonstration of issue (266 bytes, application/vnd.ms-excel)
2021-09-26 07:05 UTC, SheetJS
Details

Note You need to log in before you can comment on or make changes to this bug.
Description SheetJS 2021-09-26 07:05:02 UTC
Description:
LO does not complain, but the files show up as blank

Steps to Reproduce:
Open attached file

Actual Results:
Worksheet is blank

Expected Results:
A1 should be 1, B1 should be 2, etc (compare with Excel 2019)


Reproducible: Always


User Profile Reset: Yes



Additional Info:
Originally noticed in a file that was reported in https://github.com/SheetJS/sheetjs/issues/2390

It appears to have been generated by a third party tool
Comment 1 SheetJS 2021-09-26 07:05:46 UTC
Created attachment 175276 [details]
Demonstration of issue
Comment 2 Mike Kaganski 2021-09-26 08:02:56 UTC
The file is indeed not a proper BIFF8 file as described in [MS-XLS]. The documentation requires (2.1.1 Compound File) that "A file of the type specified by this document MUST be an OLE compound file as specified in [MS-CFB]".

The latter documentation requires (2.2 Compound File Header):

> The Compound File Header structure MUST be at the beginning of the file (offset 0).
> ...
> Header Signature (8 bytes): Identification signature for the compound file structure,
and MUST be set to the value 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1.

Attachment 175276 [details] starts with 0x09 0x08 0x08 0x00, is it some separate record of BIFF? Is there a specification regarding such partial files somewhere?

[MS-XLS] https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-xls/cd03cb5f-ca02-4934-a391-bb674cb8aa06
[MS-CFB] https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-cfb/53989ce4-7b05-4f8d-829b-d08d6148375b
Comment 3 Roman Kuznetsov 2021-09-26 10:07:00 UTC
>It appears to have been generated by a third party tool

I think it's NOTOURBUG
Comment 4 Mike Kaganski 2021-09-26 10:32:26 UTC
(In reply to Roman Kuznetsov from comment #3)

It's not granted. Excel opens that; and I have tested that:

1. Opening in Excel;
2. Saving as XLS (97);
3. Opening the resulting compound file with 7-zip;
4. Extracting "Workbook" stream;
5. Checking that it resembles the original file in structure (starts with 09 08; ends with 0a 00 00 00);
6. Opening in in Calc

results in normal opening. So we obviously support such files for some extent - and need to make sure if the file is indeed broken, or if it's our bug...
Comment 5 Julien Nabet 2021-09-26 11:37:10 UTC
On pc Debian x86-64 with master sources updated today, I could reproduce this.
I noticed this on console:
warn:sc:38679:38679:sc/source/filter/excel/xlroot.cxx:158: XclRootData::XclRootData - cannot get output device info com.sun.star.uno.RuntimeException message: invalid attempt to assign an empty interface of type com.sun.star.frame.XFrame! /home/julien/lo/libreoffice/include/com/sun/star/uno/Reference.hxx:105

Since it's a third party tool + following Mike's comment, should it be NOTOURBUG?
Or if Excel can open it, should LO absolutely be able to do it too even if it doesn't respect specs from MS?
Comment 6 Julien Nabet 2021-09-26 11:38:18 UTC
argh, I had written my comment before seeing the very last Mike's comment.
Forget my question in my comment then.
Comment 7 Mike Kaganski 2021-09-27 06:03:06 UTC
Code pointer:

ScFormatFilterPluginImpl::ScImportExcel starts importing the file.
In it, a call to XclImpStream::DetectBiffVersion returns EXC_BIFF5 (because the file does not define a specific BIFF version). Later, in ImportExcel::Bof5, it also reads subtype (bytes 06 and 07), that is 0x1000 in the bugdoc, while the expected value is in the range 0x05 - 0x40. The result state in ImportExcel::Read is Z_BiffNull, and nothing is read.

I tend to agree with comment 3, comment 5, and comment in XclImpStream::DetectBiffVersion:

> there are some *really* broken documents out there...

So I close it as NOTOURBUG. The generator is just extra lazy, creating something that only Excel seems to be able to comprehend (so they rely on "if Excel didn't choke, it's good enough" approach, instead of implementing a standardized output using published standards). If OP has any documentation about ways to read such data that exercises Excel's extra-permissive rules of the non-standardized files, please provide them and reopen the bug. Thanks.
Comment 8 Mike Kaganski 2021-09-27 08:22:33 UTC
By the way, forgot to add: returning "Biff5" in pExcRoot->eDateiTyp from ImportExcel::Bof5 when there's no match for the subtype, looks to "fix" the issue. However, no idea if that uncovers another can of worms. I have no insight into this format, unfortunately.
Comment 9 Mike Kaganski 2021-09-27 08:33:05 UTC
OTOH, ImportExcel::Bof{2..4} explicitly do just that:

> // #i51490# Excel interprets invalid indexes as worksheet

So let me fix it this way for ImportExcel::Bof5.
Comment 10 SheetJS 2021-09-27 18:51:52 UTC
The spec situation is confusing.  The old 97-2007 spec [1] (page 18) refers to a "Simple Save" structure.  Our initial interpretation was that the Simple files have no mini FAT.

Daniel Rentz and the OpenOffice developers [2] (page 14) may have interpreted that as bare workbook streams:

> All document types and BIFF versions can be stored in a simple stream file, most of them are always stored this way.  The only exception are BIFF5-BIFF8 workbook documents, which are usually stored as compound document files (see below). If these documents are stored as stream files, the entire file consists of the “Book” stream (BIFF5) or “Workbook” stream (BIFF8) only.



  [1] https://www.loc.gov/preservation/digital/formats/digformatspecs/Excel97-2007BinaryFileFormat%28xls%29Specification.pdf

  [2] https://www.openoffice.org/sc/excelfileformat.pdf
Comment 11 Commit Notification 2021-09-28 03:55:22 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/e753de4630fb74ac837d0339be23a305f18d0725

tdf#144732: treat invalid subtype value as worksheet

It will be available in 7.3.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 12 Commit Notification 2021-09-28 12:13:38 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "libreoffice-7-2":

https://git.libreoffice.org/core/commit/a3010249f185b8d2b7ade2cbf10600255bb935fb

tdf#144732: treat invalid subtype value as worksheet

It will be available in 7.2.3.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.