Bug 163384 - CALC shows a POI SXSSF file as corrupted unless Zip64Mode.AlwaysWithCompatibility is used.
Status: RESOLVED DUPLICATE of bug 162944
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version (earliest affected): 24.8.2.1 release
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Not Assigned
URL: https://lists.apache.org/thread/fr1dd...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-10-11 00:30 UTC by Andreas Reichel
Modified: 2024-11-07 12:23 UTC
CC List: 2 users

See Also:
Crash report or crash signature:


Attachments
A sample from https://ask.libreoffice.org/t/is-spreadsheet-corrupt-or-not/113519 (3.58 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2024-11-06 14:28 UTC, Mike Kaganski
The testfile from comment 0 (5.01 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2024-11-06 14:33 UTC, Mike Kaganski
File with holes identified (5.01 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2024-11-06 22:29 UTC, Mike Kaganski

Description Andreas Reichel 2024-10-11 00:30:48 UTC
Description:
Greetings.

We write XLSX files using Apache POI in XSSF and SXSSF mode. The XML content of those files is the same, but LibreOffice shows a "Corrupted" message when opening the SXSSF version. The file can be repaired, though, without any damage or loss.

Excel, Gnumeric and Google Sheets open those files without any complaints.

Forcing Zip64Mode.AlwaysWithCompatibility also works around the problem, but increases the file size.
Careful: Zip64Mode.Never will freeze LibreCalc in an endless loop forever!

Sample java code to reproduce the problem is shown below.
Sample XLSX file is here: https://manticore-projects.com/download/manticore_7841765197550883476.xlsx

Version: 24.2.6.2 (X86_64) / LibreOffice Community
Build ID: 5d815fb18c57fdadb2819d0f77b22a22936c58ed
CPU threads: 12; OS: Linux 6.11; UI render: default; VCL: gtk3
Locale: en-US (en_US.UTF-8); UI: en-US
24.2.6-5.1
Calc: threaded




Steps to Reproduce:
// Sample Java code to produce such files
import org.apache.commons.compress.archivers.zip.Zip64Mode;
import org.apache.poi.xssf.streaming.SXSSFSheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class Main {
    public static void main(String[] args) {
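        // SXSSF (streaming) workbook: Calc reports the resulting file as corrupted
        SXSSFWorkbook wb = new SXSSFWorkbook(new XSSFWorkbook(), 100, true, true);
        SXSSFSheet sheet = wb.createSheet("test");
        // Workaround: uncomment to force Zip64 with compatibility (larger file)
        // wb.setZip64Mode(Zip64Mode.AlwaysWithCompatibility);

        // Plain XSSF workbook: the resulting file opens without a warning
        // XSSFWorkbook wb = new XSSFWorkbook();
        // XSSFSheet sheet = wb.createSheet("test");

        try {
            File outputFile = File.createTempFile("poitest_", ".xlsx");
            // try-with-resources closes the stream even if write() throws
            try (FileOutputStream fileOutputStream = new FileOutputStream(outputFile)) {
                wb.write(fileOutputStream);
            } finally {
                wb.close();
            }
            System.out.println("Wrote " + outputFile.getAbsolutePath());
        } catch (IOException e) {
            throw new RuntimeException(e);
        }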
    }
}

Actual Results:
Calc shows a corruption warning when opening the file.

Expected Results:
The file should open as valid (since it can be unzipped and also opens in Excel, Gnumeric and Google Sheets).


Reproducible: Always


User Profile Reset: No

Additional Info:
Pl
Comment 1 Andreas Reichel 2024-10-11 00:31:42 UTC
Also observed with latest:

Version: 24.8.2.1 (X86_64) / LibreOffice Community
Build ID: d9536723ce640b70ba821dfcf645a2e946c0bf76
CPU threads: 12; OS: Linux 6.11; UI render: default; VCL: gtk3
Locale: en-US (en_US.UTF-8); UI: en-US
24.8.2-1.1
Calc: threaded
Comment 2 Mike Kaganski 2024-11-06 14:28:34 UTC
Created attachment 197442 [details]
A sample from https://ask.libreoffice.org/t/is-spreadsheet-corrupt-or-not/113519

It may or may not be the same issue; but in that specific case, the error detected after commit efae4fc42d5fe3c0a69757226f38efc10d101194 was: "Zip file has holes! It will leak!" - likely meaning that the file has "unused" areas, which is strange / bad for a ZIP.
Comment 3 Mike Kaganski 2024-11-06 14:33:04 UTC
Created attachment 197444 [details]
The testfile from comment 0

Indeed, it is the same "Zip file has holes" detection from https://opengrok.libreoffice.org/xref/core/package/source/zipapi/ZipFile.cxx?r=279f42fa&mo=61140&fi=1527#1527.

IMO, it's a bug in the generator.
Comment 4 Andreas Reichel 2024-11-06 14:59:17 UTC
Thank you for your feedback.
Please what does "bug in the generator" mean in this context? Do you refer to Apache Commons Compress, which is used for Apache POI SXSSF?

If so, why the file can be Unzipped without any problems and also be opened in Excel, Gnumeric and Google Sheet? (It also could be opened in LibreOffice before 2nd Quarter 2024 or so).
Comment 5 Mike Kaganski 2024-11-06 15:48:23 UTC
(In reply to Andreas Reichel from comment #4)
> Please what does "bug in the generator" mean in this context? Do you refer
> to Apache Commons Compress, which is used for Apache POI SXSSF?

Yes.

> If so, why the file can be Unzipped without any problems and also be opened
> in Excel, Gnumeric and Google Sheet? (It also could be opened in LibreOffice
> before 2nd Quarter 2024 or so).

And in current LibreOffice, too - but only after it informed you about the problem it found.
The commit mentioned in comment 3 was "package: add additional consistency checks for local file header". And it does what it says: it checks the ZIP for additional inconsistencies, among others, for "gaps and overlaps".

And the files generated by Apache POI do indeed have gaps: some bytes that are not used by the content. Since they are unused, they don't *prevent* the content from being read; but their existence means that they may contain arbitrary data. The hardened check in newer LibreOffice warns about this, because it has no idea what the creator had in mind when creating those suspicious gaps. Once you have been told about them, you are free to continue loading the file in repair mode.

Specifically attachment 197444 [details] (i.e., https://manticore-projects.com/download/manticore_7841765197550883476.xlsx) has the gaps at offsets

  017C to 0184
  02A2 to 02AA
  0379 to 0381
  04C6 to 04CE
  0588 to 0590
  0811 to 0819
  0931 to 0939
  0A56 to 0A5E
  11B2 to 11BA

(the values are from the unallocated object in the patch; I didn't check whether it applies some additional offset). Anyway, whether the numbers are exactly these or slightly different, the program tells you that it found gaps, and they are suspicious. A good generator should not create such gaps.
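
To illustrate what a gap check of this kind does, here is a minimal Java sketch. It is not LibreOffice's code (that lives in ZipFile.cxx); the class name ZipGapFinder, the method findGaps and the two example ranges are assumptions made for illustration only. It takes the byte ranges a reader has accounted for and reports whatever lies between them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ZipGapFinder {
    /** Returns the [start, end) ranges of [0, fileSize) not covered by any allocated range. */
    static List<long[]> findGaps(long[][] allocated, long fileSize) {
        long[][] sorted = allocated.clone();
        Arrays.sort(sorted, Comparator.comparingLong((long[] r) -> r[0]));
        List<long[]> gaps = new ArrayList<>();
        long cursor = 0; // first byte not yet known to be covered
        for (long[] range : sorted) {
            if (range[0] > cursor) {
                gaps.add(new long[] {cursor, range[0]});
            }
            cursor = Math.max(cursor, range[1]);
        }
        if (cursor < fileSize) {
            gaps.add(new long[] {cursor, fileSize});
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Hypothetical layout: one entry accounted for up to 0x017C,
        // the next one starting at 0x0184 and ending at 0x02A2.
        long[][] allocated = { {0x0000, 0x017C}, {0x0184, 0x02A2} };
        for (long[] gap : findGaps(allocated, 0x02A2)) {
            System.out.printf("gap %04X to %04X (%d bytes)%n", gap[0], gap[1], gap[1] - gap[0]);
        }
    }
}
```

With these illustrative ranges, the only uncovered range is 017C to 0184, which is 8 bytes long.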
Comment 6 Mike Kaganski 2024-11-06 16:03:24 UTC
Michael, what do you think - maybe it's reasonable to only fail this check after testing if the gaps are all zeroes?
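
The relaxation proposed here could look roughly like the following Java sketch. The class and method names are hypothetical and this is not what LibreOffice implements; it only shows the test itself: a gap would be tolerated if every byte in it is zero.

```java
class GapZeroCheck {
    /** Returns true if every byte of data in [start, end) is zero. */
    static boolean isAllZero(byte[] data, int start, int end) {
        for (int i = start; i < end; i++) {
            if (data[i] != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] file = new byte[32];                  // stand-in for the raw ZIP bytes
        System.out.println(isAllZero(file, 8, 16));  // gap holds only zeroes
        file[10] = 'h';                              // arbitrary data placed in the gap
        System.out.println(isAllZero(file, 8, 16));  // gap is no longer all zeroes
    }
}
```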
Comment 7 Martin Leiblinger 2024-11-06 21:35:09 UTC
I also investigated the problem, and I am pretty sure there are no loopholes in the mentioned Excel file, but there is an issue that the data descriptor in the ZIP file should look like this:

4.3.9  Data descriptor (https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT)
compressed size                 4 bytes
uncompressed size               4 bytes

However, in the corrupted file, they are actually stored in 8 bytes each, as is typical in the ZIP64 format.


When I modified this line https://github.com/LibreOffice/core/blob/master/package/source/zipapi/ZipFile.cxx#L1077 to always true, I was able to open the mentioned Excel file in LibreOffice without errors.

I also tried to create a standard ZIP64 package using the commons-compress library, but MS Excel rejects that as corrupted :(

So it seems that Excel uses a non-standard implementation of ZIP64, according to this article https://rzymek.github.io/post/excel-zip64/
Comment 8 Mike Kaganski 2024-11-06 22:29:39 UTC
Created attachment 197463 [details]
File with holes identified

(In reply to Martin Leiblinger from comment #7)
> I also investigated the problem, and I am pretty sure there are no loopholes
> in the mentioned Excel file

The attachment is a copy of your file, in which I replaced the zero bytes in the holes with the ASCII bytes of the word "hole". The file now contains my "arbitrary byte changes", yet it still opens fine in Excel, and ZIP tools say it is OK. So yes, your file (and, in general, files generated by the library) has holes.
Comment 9 Andreas Reichel 2024-11-06 23:21:21 UTC
@Mike: thank you very much for your time and excellent analysis and explanation. I am discussing this further with the POI team. So far we have gathered this understanding:

1) Zip64Mode.Always is a customized Zip64 implementation in POI (which likely writes those holes)

2) Any other modes are forwarded to Commons Compress and fail with MS Excel -- so I can't use those. However, they succeed with LibreOffice.


I do understand why you are checking for those "holes" and that they could carry a dangerous payload. However, this extra dialog is super annoying, especially since it does not really provide any useful information.

Is there any way to disable those checks (yes, of course I can compile from source myself and disable the switches), at least until POI can be fixed/improved? My argument is the complexity involved, as well as the fact that more common spreadsheet applications read those files without any problems.
Comment 10 Mike Kaganski 2024-11-07 07:12:19 UTC
(In reply to Andreas Reichel from comment #9)

If you use the LibreOffice API to load the files (as opposed to a simple 'soffice path/to/file' command line), then you can pass the RepairPackage flag in the MediaDescriptor [1]. Note, however, that files opened in repair mode (either with the explicitly set flag, or by accepting the repair prompt during the load) are in fact copies of the original file, and you will be asked to choose a name when saving them.

If using the API (and repaired files) is not satisfactory, you may comment out the throw in your own build. The code pointer is in comment 3.

[1] https://api.libreoffice.org/docs/idl/ref/servicecom_1_1sun_1_1star_1_1document_1_1MediaDescriptor.html#ab5ae6f2c9a82bcb8f006f4b46fee1691
Comment 11 Michael Stahl (allotropia) 2024-11-07 12:04:50 UTC
apparently bug 162944 was also about Apache POI, let's see...

so if i look here:

00000160: 0144 312c e343 0e31 7cef e937 504b 0708  .D1,.C.1|..7PK..
00000170: 912c 28bc 3b01 0000 0000 0000 1d04 0000  .,(.;...........
00000180: 0000 0000 504b 0304 2d00 0800 0800 0000  ....PK..-.......

at 016C the data descriptor signature for the zip entry, at 0184 the signature of the local file header of the next zip entry...

0170-0173 CRC
0174-017B 64-bit size, LE: 13b
017C-0183 64-bit size, LE: 41d

looks like a Zip64 DD?
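
The byte layout above can be checked with a short Java sketch. It parses the 24 bytes starting at 016C from the hex dump twice, once with 32-bit sizes and once with 64-bit sizes. The class name DataDescriptorDemo and its helper methods are made up for this illustration; only the byte values come from the dump.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class DataDescriptorDemo {
    static final int DD_SIGNATURE = 0x08074b50; // bytes 50 4b 07 08, little-endian

    /** Converts a hex string (spaces allowed) to bytes. */
    static byte[] hex(String s) {
        s = s.replace(" ", "");
        byte[] out = new byte[s.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(s.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** Returns {crc, compressedSize, uncompressedSize, descriptorLength}. */
    static long[] parse(byte[] bytes, boolean zip64) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt() != DD_SIGNATURE) {
            throw new IllegalArgumentException("no data descriptor signature");
        }
        long crc = buf.getInt() & 0xFFFFFFFFL;
        long compressed = zip64 ? buf.getLong() : (buf.getInt() & 0xFFFFFFFFL);
        long uncompressed = zip64 ? buf.getLong() : (buf.getInt() & 0xFFFFFFFFL);
        return new long[] {crc, compressed, uncompressed, buf.position()};
    }

    public static void main(String[] args) {
        // Bytes 016C..0183 of the file, copied from the hex dump
        byte[] dd = hex("504b 0708 912c 28bc 3b01 0000 0000 0000 1d04 0000 0000 0000");
        for (boolean zip64 : new boolean[] {false, true}) {
            long[] r = parse(dd, zip64);
            System.out.printf("zip64=%b crc=%08x compressed=%d uncompressed=%d length=%d%n",
                    zip64, r[0], r[1], r[2], r[3]);
        }
    }
}
```

Read with 32-bit sizes, the descriptor is 16 bytes long and ends at 017C, while the next local file header only starts at 0184. Read with 64-bit sizes, it is 24 bytes long and ends exactly at 0184.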

then we have the local file header:

00000000: 504b 0304 2d00 0800 0800 0000 0000 0000  PK..-...........
00000010: 0000 0000 0000 0000 0000 1300 0000 5b43  ..............[C

the extra field length at 001C-001D is 0 - so there is no Zip64 extra field, and no indication that this is a Zip64 entry.

so the problem is the same as bug 162944.

IMHO if there is no Zip64 extra field, a Zip consumer cannot be required to "guess" how long the data descriptor is; you either provide a Zip64 extra field and then the sizes are 64-bit, or you don't and then the sizes are 32-bit.

also, see the Apache commons-compress code i found and pasted in https://bugs.documentfoundation.org/show_bug.cgi?id=162944#c7 ... it's unclear to me how such files can even be created.

*** This bug has been marked as a duplicate of bug 162944 ***
Comment 12 Michael Stahl (allotropia) 2024-11-07 12:23:16 UTC
ah, now i read https://rzymek.github.io/post/excel-zip64/ - checking for the version number 45 in the local file header might work to turn on 64-bit mode, that's a bit of a non-obvious solution... but apparently Excel declared it to be the industry standard, so what can you do... i'll experiment and reopen bug 162944 and keep this as duplicate...
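
A rough Java sketch of that idea follows, applied to the local file header bytes from comment 11. It is only an illustration of the heuristic, not the LibreOffice change: the class LocalHeaderZip64Check and its methods are hypothetical names. The sketch treats the data descriptor as 64-bit if the header carries a Zip64 extra field (header ID 0x0001) or if "version needed to extract" is at least 45.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class LocalHeaderZip64Check {
    static final int LFH_SIGNATURE = 0x04034b50; // bytes 50 4b 03 04, little-endian
    static final int ZIP64_EXTRA_ID = 0x0001;
    static final int ZIP64_MIN_VERSION = 45;     // version 4.5

    static byte[] hex(String s) {
        s = s.replace(" ", "");
        byte[] out = new byte[s.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(s.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    static ByteBuffer wrap(byte[] header) {
        ByteBuffer b = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        if (b.getInt(0) != LFH_SIGNATURE) {
            throw new IllegalArgumentException("not a local file header");
        }
        return b;
    }

    static int u16(ByteBuffer b, int offset) {
        return b.getShort(offset) & 0xFFFF;
    }

    /** "Version needed to extract", stored at offset 4. */
    static int versionNeeded(byte[] header) {
        return u16(wrap(header), 4);
    }

    /** Scans the extra field (after the file name) for a Zip64 record. */
    static boolean hasZip64Extra(byte[] header) {
        ByteBuffer b = wrap(header);
        int extraLength = u16(b, 28);
        if (extraLength == 0) {
            return false;
        }
        int pos = 30 + u16(b, 26);               // fixed part + file name length
        int end = pos + extraLength;
        if (end > header.length) {
            throw new IllegalArgumentException("extra field truncated");
        }
        while (pos + 4 <= end) {
            if (u16(b, pos) == ZIP64_EXTRA_ID) {
                return true;
            }
            pos += 4 + u16(b, pos + 2);          // skip record header and data
        }
        return false;
    }

    /** Lenient rule: Zip64 extra field present, or version needed >= 45. */
    static boolean assume64BitDescriptor(byte[] header) {
        return hasZip64Extra(header) || versionNeeded(header) >= ZIP64_MIN_VERSION;
    }

    public static void main(String[] args) {
        // First 32 bytes of the file, copied from the hex dump in comment 11
        byte[] header = hex("504b 0304 2d00 0800 0800 0000 0000 0000"
                + "0000 0000 0000 0000 0000 1300 0000 5b43");
        System.out.println("version needed: " + versionNeeded(header));
        System.out.println("Zip64 extra field: " + hasZip64Extra(header));
        System.out.println("assume 64-bit descriptor: " + assume64BitDescriptor(header));
    }
}
```

For this header the extra field is empty, so a strict reader sees no Zip64 marker, while the version field holds 45 and the lenient rule switches to 64-bit sizes.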

oh, another thing:

> Careful: Zip64Mode.Never will freeze LibreCalc in an endless loop forever!

sounds concerning, could you please attach a reproducer?