Description:
Greetings. We write XLSX files using Apache POI in both XSSF and SXSSF mode. The XML content of those files is the same, but LibreOffice shows a "Corrupted" message when opening the SXSSF version. The file can be repaired without any damage or loss, though. Excel, Gnumeric and Google Sheets open those files without any complaints.

Forcing Zip64Mode.AlwaysWithCompatibility also works around the problem, but increases the file size.

Careful: Zip64Mode.Never will freeze LibreCalc in an endless loop forever!

Sample Java code to reproduce the problem is shown below. A sample XLSX file is here:
https://manticore-projects.com/download/manticore_7841765197550883476.xlsx

Version: 24.2.6.2 (X86_64) / LibreOffice Community
Build ID: 5d815fb18c57fdadb2819d0f77b22a22936c58ed
CPU threads: 12; OS: Linux 6.11; UI render: default; VCL: gtk3
Locale: en-US (en_US.UTF-8); UI: en-US
24.2.6-5.1
Calc: threaded

Steps to Reproduce:
// Sample Java code to produce such files
import org.apache.commons.compress.archivers.zip.Zip64Mode;
import org.apache.poi.xssf.streaming.SXSSFSheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class Main {
    public static void main(String[] args) {
        // this will cause corruption
        SXSSFWorkbook wb = new SXSSFWorkbook(new XSSFWorkbook(), 100, true, true);
        SXSSFSheet sheet = wb.createSheet("test");
        // wb.setZip64Mode(Zip64Mode.AlwaysWithCompatibility);

        // this will work
        // XSSFWorkbook wb = new XSSFWorkbook();
        // XSSFSheet sheet = wb.createSheet("test");

        File outputFile = null;
        try {
            outputFile = File.createTempFile("poitest_", ".xlsx");
            // try-with-resources closes the stream even if write() fails
            try (FileOutputStream fileOutputStream = new FileOutputStream(outputFile)) {
                wb.write(fileOutputStream);
            }
            wb.close();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}

Actual Results:
File shows Corruption Warning when opening in LibreCalc.

Expected Results:
File should be opened as valid (since it can be unzipped and also opens in Excel, Gnumeric and Google Sheets)

Reproducible: Always

User Profile Reset: No

Additional Info:
Pl
Also observed with the latest version:

Version: 24.8.2.1 (X86_64) / LibreOffice Community
Build ID: d9536723ce640b70ba821dfcf645a2e946c0bf76
CPU threads: 12; OS: Linux 6.11; UI render: default; VCL: gtk3
Locale: en-US (en_US.UTF-8); UI: en-US
24.8.2-1.1
Calc: threaded
Created attachment 197442
A sample from https://ask.libreoffice.org/t/is-spreadsheet-corrupt-or-not/113519

It may or may not be the same issue; but in that specific case, the error detected after commit efae4fc42d5fe3c0a69757226f38efc10d101194 was: "Zip file has holes! It will leak!" - likely meaning that the file has "unused" areas, which is strange / bad for a ZIP.
Created attachment 197444
The testfile from comment 0

Indeed, it is the same "Zip file has holes" detection from https://opengrok.libreoffice.org/xref/core/package/source/zipapi/ZipFile.cxx?r=279f42fa&mo=61140&fi=1527#1527.

IMO, it's a bug in the generator.
Thank you for your feedback.

What does "bug in the generator" mean in this context? Do you refer to Apache Commons Compress, which is used for Apache POI SXSSF?

If so, why can the file be unzipped without any problems and also be opened in Excel, Gnumeric and Google Sheets? (It could also be opened in LibreOffice before the 2nd quarter of 2024 or so.)
(In reply to Andreas Reichel from comment #4)
> What does "bug in the generator" mean in this context? Do you refer
> to Apache Commons Compress, which is used for Apache POI SXSSF?

Yes.

> If so, why can the file be unzipped without any problems and also be opened
> in Excel, Gnumeric and Google Sheets? (It could also be opened in LibreOffice
> before the 2nd quarter of 2024 or so.)

And in current LibreOffice, too - but only after it has informed you about the problem it found.

The commit mentioned in comment 2 was "package: add additional consistency checks for local file header". And it does what it says: it checks the ZIP for additional inconsistencies, among others, for "gaps and overlaps". And the files generated by Apache POI indeed have gaps - some bytes that are not used by the content. Since they are unused, they indeed don't *prevent* the content from being read; but their existence means that they may contain arbitrary content. The hardened check in newer LibreOffice warns about it, because it has no idea what the creator had in mind when creating those suspicious gaps. Once you have been informed, you are free to continue loading the file in repair mode.

Specifically, attachment 197444 (i.e., https://manticore-projects.com/download/manticore_7841765197550883476.xlsx) has gaps at these offsets:

017C to 0184
02A2 to 02AA
0379 to 0381
04C6 to 04CE
0588 to 0590
0811 to 0819
0931 to 0939
0A56 to 0A5E
11B2 to 11BA

(The values are from the unallocated object in the patch; I didn't check whether it applies some additional offset.) Anyway, whether these numbers are exact or slightly off, the program tells you that it found the gaps, and they are suspicious. A good generator should not create such gaps.
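To illustrate the kind of "gaps" check described above, here is a minimal sketch in Java. It is not the LibreOffice implementation: `ZipGapFinder` and `findGaps` are hypothetical names, and the two occupied ranges in `main` are an assumption inferred from the first two gaps in the list (first entry read up to 017C, next entry from 0184 to 02A2). It only reports gaps, not overlaps.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper (not LibreOffice code): reports unused byte ranges in a ZIP. */
class ZipGapFinder {

    /**
     * @param occupied  half-open ranges {start, end} covered by known ZIP records
     * @param totalSize offset up to which the records are expected to cover the file
     * @return gaps as half-open ranges {start, end}, in ascending order
     */
    static List<long[]> findGaps(List<long[]> occupied, long totalSize) {
        List<long[]> sorted = new ArrayList<>(occupied);
        sorted.sort(Comparator.comparingLong((long[] r) -> r[0]));
        List<long[]> gaps = new ArrayList<>();
        long cursor = 0; // first byte not yet accounted for
        for (long[] r : sorted) {
            if (r[0] > cursor) {
                gaps.add(new long[] {cursor, r[0]});
            }
            cursor = Math.max(cursor, r[1]);
        }
        if (cursor < totalSize) {
            gaps.add(new long[] {cursor, totalSize});
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Assumed ranges, inferred from the gap list in this comment.
        List<long[]> occupied = new ArrayList<>();
        occupied.add(new long[] {0x0000, 0x017C});
        occupied.add(new long[] {0x0184, 0x02A2});
        for (long[] gap : findGaps(occupied, 0x02A2)) {
            System.out.printf("gap: %04X to %04X%n", gap[0], gap[1]);
        }
    }
}
```

A reader that accounts for every record it has parsed can run a pass like this and warn when the result is non-empty, which is roughly the behaviour the comment describes.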
Michael, what do you think - would it be reasonable to fail this check only if the gaps contain something other than zeroes?
I also investigated the problem, and I am pretty sure there are no holes in the mentioned Excel file. The actual issue is the data descriptor, which in a ZIP file should look like this:

4.3.9 Data descriptor (https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT)
    compressed size      4 bytes
    uncompressed size    4 bytes

However, in the corrupted file, each of them is actually stored in 8 bytes, as is typical in the ZIP64 format.

When I modified this line https://github.com/LibreOffice/core/blob/master/package/source/zipapi/ZipFile.cxx#L1077 to always be true, I was able to open the mentioned Excel file in LibreOffice without errors.

I also tried to create a standard ZIP64 package using the commons-compress library, but MS Excel rejects that file as corrupted :( So it seems that Excel uses a non-standard implementation of ZIP64, according to this article: https://rzymek.github.io/post/excel-zip64/
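The difference between the two descriptor layouts can be sketched in Java as follows. This is an illustration only: `DataDescriptorSketch` is a hypothetical name, the sketch assumes the optional "PK\7\8" signature is present, and the 24 sample bytes are copied from the hex dump of offsets 016C-0183 quoted later in this thread.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustrative sketch: decodes a ZIP data descriptor with 32-bit or 64-bit sizes. */
class DataDescriptorSketch {
    static final int SIGNATURE = 0x08074b50; // bytes 50 4b 07 08, read little-endian

    /** Returns {crc32, compressedSize, uncompressedSize, bytesConsumed}. */
    static long[] read(byte[] buf, int offset, boolean sizes64) {
        ByteBuffer bb = ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN);
        bb.position(offset);
        if (bb.getInt() != SIGNATURE) {
            throw new IllegalArgumentException("no data descriptor signature at offset " + offset);
        }
        long crc = bb.getInt() & 0xFFFFFFFFL;
        long compressed = sizes64 ? bb.getLong() : (bb.getInt() & 0xFFFFFFFFL);
        long uncompressed = sizes64 ? bb.getLong() : (bb.getInt() & 0xFFFFFFFFL);
        return new long[] {crc, compressed, uncompressed, bb.position() - offset};
    }

    /** Converts a hex string such as "504b0708" into bytes. */
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        // Bytes 016C-0183 of the sample file, as shown in the thread's hex dump.
        byte[] dd = fromHex("504b0708912c28bc3b010000000000001d04000000000000");
        long[] wide = read(dd, 0, true);
        long[] narrow = read(dd, 0, false);
        System.out.printf("64-bit: compressed=%x uncompressed=%x length=%d%n", wide[1], wide[2], wide[3]);
        System.out.printf("32-bit: compressed=%x uncompressed=%x length=%d%n", narrow[1], narrow[2], narrow[3]);
    }
}
```

Read with 64-bit sizes, the descriptor consumes all 24 bytes. Read with 32-bit sizes, it consumes only 16, which leaves 8 bytes unaccounted for before the next local file header.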
Created attachment 197463
File with holes identified

(In reply to Martin Leiblinger from comment #7)
> I also investigated the problem, and I am pretty sure there are no holes
> in the mentioned Excel file

The attachment is a copy of your file, in which I replaced the zero bytes in the hole locations with bytes representing the ASCII word "hole". The file now carries my "arbitrary byte changes", yet it still opens fine in Excel, and ZIP tools say it is OK. So yes, your file (and, in general, files generated by the library) has holes.
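The experiment described above can be approximated with a few lines of Java. This is a sketch under assumptions, not the tool that was actually used: `HoleMarker` and `markHoles` are hypothetical names, and the hole ranges have to be supplied by the caller as half-open {start, end} offsets.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: overwrites given byte ranges with the repeated ASCII word "hole". */
class HoleMarker {

    /** Returns a copy of data in which every range {start, end} is filled with "holehole...". */
    static byte[] markHoles(byte[] data, int[][] holes) {
        byte[] out = data.clone();
        byte[] word = "hole".getBytes(StandardCharsets.US_ASCII);
        for (int[] h : holes) {
            for (int i = h[0]; i < h[1]; i++) {
                out[i] = word[(i - h[0]) % word.length];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy input: 16 zero bytes with one 8-byte hole at offsets 4..11.
        byte[] marked = markHoles(new byte[16], new int[][] {{4, 12}});
        System.out.println(new String(marked, 4, 8, StandardCharsets.US_ASCII));
    }
}
```

If a file modified this way still opens in other applications, those applications evidently never read the marked bytes as content.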
@Mike: thank you very much for your time and your excellent analysis and explanation. I am discussing this further with the POI team. So far we have gathered this understanding:

1) Zip64Mode.Always is a customized Zip64 implementation in POI (which likely writes those holes).
2) Any other modes are forwarded to Commons Compress and fail with MS Excel -- so I can't use those. However, they succeed with LibreOffice.

I do understand why you are checking for those "holes" and that they could carry a dangerous payload. However, this extra dialog is super annoying -- especially since the dialog does not really provide any useful information.

Is there any way to disable those checks (yes, of course I can compile from source myself and disable the switches), at least until POI can be fixed/improved? My argument is the complexity involved, as well as the fact that more common spreadsheet applications read those files without any problems.
(In reply to Andreas Reichel from comment #9)

If you use the LibreOffice API to load the files (as opposed to a simple 'soffice path/to/file' command line), you can pass the RepairPackage flag in the MediaDescriptor [1]. Note, however, that files opened in repair mode (either with the explicitly set flag, or by accepting the repair prompt during loading) are in fact copies of the original file, and they will ask you to choose a name when saved.

If using the API (and repaired files) is not satisfactory, you may comment out the throw in your own build. The code pointer is in comment 3.

[1] https://api.libreoffice.org/docs/idl/ref/servicecom_1_1sun_1_1star_1_1document_1_1MediaDescriptor.html#ab5ae6f2c9a82bcb8f006f4b46fee1691
apparently bug 162944 was also about Apache POI, let's see...

so if i look here:

00000160: 0144 312c e343 0e31 7cef e937 504b 0708  .D1,.C.1|..7PK..
00000170: 912c 28bc 3b01 0000 0000 0000 1d04 0000  .,(.;...........
00000180: 0000 0000 504b 0304 2d00 0800 0800 0000  ....PK..-.......

at 016C the data descriptor signature for the zip entry,
at 0184 the signature of the local file header of the next zip entry...

0170-0173 CRC
0174-017B 64-bit size, LE: 13b
017C-0183 64-bit size, LE: 41d

looks like a Zip64 DD?

then we have the local file header:

00000000: 504b 0304 2d00 0800 0800 0000 0000 0000  PK..-...........
00000010: 0000 0000 0000 0000 0000 1300 0000 5b43  ..............[C

the extra field length at 001C-001D is 0 - so there is no Zip64 extra field, and no indication that this is a Zip64 entry.

so the problem is the same as bug 162944.

IMHO if there is no Zip64 extra field, a Zip consumer cannot be required to "guess" how long the data descriptor is; you either provide a Zip64 extra field and then the sizes are 64-bit, or you don't and then the sizes are 32-bit.

also, see the Apache commons-compress code i found and pasted in https://bugs.documentfoundation.org/show_bug.cgi?id=162944#c7 ... it's unclear to me how such files can even be created.

*** This bug has been marked as a duplicate of bug 162944 ***
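For readers who want to check the local file header fields discussed above, here is a minimal Java sketch that decodes the first 32 bytes of the sample file as shown in the dump. `LocalHeaderSketch` is a hypothetical name; the field offsets follow the standard local file header layout (version needed at 4, flags at 6, method at 8, name length at 26, extra field length at 28).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustrative sketch: reads selected fields of a ZIP local file header. */
class LocalHeaderSketch {
    static final int SIGNATURE = 0x04034b50; // bytes 50 4b 03 04, read little-endian

    /** Returns {versionNeeded, flags, method, nameLength, extraLength}. */
    static int[] read(byte[] buf) {
        ByteBuffer bb = ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN);
        if (bb.getInt(0) != SIGNATURE) {
            throw new IllegalArgumentException("not a local file header");
        }
        int versionNeeded = bb.getShort(4) & 0xFFFF;
        int flags = bb.getShort(6) & 0xFFFF;
        int method = bb.getShort(8) & 0xFFFF;
        int nameLength = bb.getShort(26) & 0xFFFF;
        int extraLength = bb.getShort(28) & 0xFFFF; // offsets 001C-001D
        return new int[] {versionNeeded, flags, method, nameLength, extraLength};
    }

    /** Converts a hex string such as "504b0304" into bytes. */
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        // Bytes 0000-001F of the sample file, copied from the dump above.
        byte[] header = fromHex("504b03042d0008000800000000000000"
                + "00000000000000000000130000005b43");
        int[] f = read(header);
        boolean hasDataDescriptor = (f[1] & 0x0008) != 0; // general purpose bit 3
        System.out.printf("versionNeeded=%d dataDescriptor=%b method=%d nameLen=%d extraLen=%d%n",
                f[0], hasDataDescriptor, f[2], f[3], f[4]);
    }
}
```

On these bytes the version needed is 45 (0x2d), bit 3 of the flags is set (a data descriptor follows the entry data), and the extra field length is 0, matching the reading given in the comment.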
ah, now i read https://rzymek.github.io/post/excel-zip64/ - checking for the version number 45 in the local file header might work to turn on 64-bit mode, that's a bit of a non-obvious solution... but apparently Excel declared it to be the industry standard, so what can you do...

i'll experiment and reopen bug 162944 and keep this as duplicate...

oh, another thing:

> Careful: Zip64Mode.Never will freeze LibreCalc in an endless loop forever!

sounds concerning, could you please attach a reproducer?
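The heuristic mentioned above could look roughly like the following Java sketch. It is not the eventual LibreOffice fix: `DescriptorWidthHeuristic` and its methods are hypothetical names, and treating any version of 45 or higher as the trigger is an assumption (the comment only mentions the value 45). The offsets in `main` are the ones from the hex dump of the sample file.

```java
/** Hypothetical sketch of a "version needed 45 implies 64-bit data descriptor" heuristic. */
class DescriptorWidthHeuristic {
    static final int ZIP64_VERSION_NEEDED = 45;

    /** Decides whether the data descriptor sizes should be read as 64-bit values. */
    static boolean sizesAre64Bit(int versionNeeded, boolean hasZip64ExtraField) {
        // Assumption: 45 or higher counts; a Zip64 extra field always counts.
        return hasZip64ExtraField || versionNeeded >= ZIP64_VERSION_NEEDED;
    }

    /** Offset just past a descriptor: signature (4) + CRC (4) + two size fields. */
    static long descriptorEnd(long descriptorOffset, boolean sizes64) {
        int sizeBytes = sizes64 ? 8 : 4;
        return descriptorOffset + 4 + 4 + 2L * sizeBytes;
    }

    public static void main(String[] args) {
        long descriptorOffset = 0x016C; // data descriptor signature in the sample file
        long nextHeaderOffset = 0x0184; // next local file header in the sample file

        // Strict reading: no Zip64 extra field, so 32-bit sizes.
        long strictEnd = descriptorEnd(descriptorOffset, false);
        // Lenient reading: version needed is 45, so 64-bit sizes.
        long lenientEnd = descriptorEnd(descriptorOffset, sizesAre64Bit(45, false));

        System.out.printf("strict:  ends at %04X, gap of %d bytes%n", strictEnd, nextHeaderOffset - strictEnd);
        System.out.printf("lenient: ends at %04X, gap of %d bytes%n", lenientEnd, nextHeaderOffset - lenientEnd);
    }
}
```

With the strict reading the descriptor ends at 017C and leaves an 8-byte gap up to 0184, which matches the first reported hole; with the lenient reading the descriptor ends exactly where the next header starts.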