Bug 162944 - opening Zip64 files produced by Apache POI is indicated as corrupted
Summary: opening Zip64 files produced by Apache POI is indicated as corrupted
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: filters and storage
Version (earliest affected): 24.8.1.2 release
Hardware: All Linux (All)
Importance: medium normal
Assignee: Michael Stahl (allotropia)
URL:
Whiteboard: target:25.2.0
Keywords: bibisected, bisected, regression
Duplicates: 163384
Depends on:
Blocks: XLSX
 
Reported: 2024-09-13 08:41 UTC by saveurlinux
Modified: 2024-11-08 18:34 UTC (History)

See Also:
Crash report or crash signature:


Attachments
Exemple file (18.61 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2024-09-13 08:41 UTC, saveurlinux
screenshot (27.96 KB, image/png)
2024-09-13 08:42 UTC, saveurlinux

Description saveurlinux 2024-09-13 08:41:39 UTC
Created attachment 196421 [details]
Exemple file

Certain files with the 'xlsx' extension are reported as corrupted when opened.

Name        : libreoffice-calc
Epoch       : 1
Version     : 24.8.0.3
Release     : 1bdk_mga9
Architecture: x86_64
Install Date: ven. 23 août 2024 12:20:31
Group       : Office/Spreadsheet
Size        : 40604549
License     : MPL-2.0 and Apache-2.0 and LGPL-3.0-only and LGPL-3.0-or-later and CC0-1.0 and BSD-3-Clause and (LGPL-2.1-only or SISSL) and (MPL-2.0 or LGPL-3.0-or-later) and (MPL-2.0 or LGPL-2.1-or-later) and (MPL-1.1 or GPL-2.0-only or LGPL-2.1-only)
Signature   : DSA/SHA1, ven. 23 août 2024 03:53:19, Key ID d1e9294d2d9835d8
Source RPM  : libreoffice-24.8.0.3-1bdk_mga9.src.rpm
Build Date  : ven. 23 août 2024 02:43:17
Build Host  : GamerRyzen7
Packager    : katnatek
Vendor      : BDK-packagers
URL         : https://www.libreoffice.org/
Summary     : LibreOffice Spreadsheet Application
Description :
The LibreOffice Spreadsheet application.
Comment 1 saveurlinux 2024-09-13 08:42:05 UTC
Created attachment 196422 [details]
screenshot
Comment 2 saveurlinux 2024-09-13 08:46:19 UTC
The same file does not appear corrupted on version 7.6.7.2
Comment 3 Xisco Faulí 2024-09-13 09:20:10 UTC
Regression introduced by:

commit efae4fc42d5fe3c0a69757226f38efc10d101194
author Michael Stahl <michael.stahl@allotropia.de> Tue Jul 16 12:12:09 2024 +0200
committer Michael Stahl <michael.stahl@allotropia.de> Tue Jul 16 15:57:43 2024 +0200
tree 5e7fe7051a76f04b1b8b2ab9c46c271e3f8ff666
parent 2f81046033bb4082f888edfa94685d2dcc2689aa

package: add additional consistency checks for local file header

Bisected with: bibisect-linux64-25.2
Comment 4 Xisco Faulí 2024-09-13 09:20:32 UTC
I tried to open the document with Excel 2016 and it opens it without any complaint.
Comment 5 saveurlinux 2024-09-13 09:21:48 UTC
(In reply to Xisco Faulí from comment #4)
> I tried to open the document with Excel 2016 and it opens it without any
> complaint

Yes, the problem occurs only with LibreOffice.
Comment 6 Michael Stahl (allotropia) 2024-09-16 19:23:37 UTC
hmm ... apparently this was produced by "Apache POI"?

the problem is we detect an 8 byte gap following the data descriptor of every zip entry...

it looks like the data descriptor uses 64-bit sizes, but there is no Zip64 extra field on the local header, the extension length is 0...

there does not appear to be a Zip64 extra field anywhere in the file, nor is there a Zip64 end of central directory record ... how is one supposed to know these sizes are 64-bit?
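The 8 byte gap follows directly from the two possible data descriptor layouts in the ZIP specification: with the optional signature, a descriptor with 4-byte sizes is 16 bytes long, while one with 8-byte sizes is 24 bytes long. The following is a minimal sketch of that arithmetic, not LibreOffice or POI code; the class and method names are hypothetical and chosen only for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustrative only: hypothetical names, not taken from LibreOffice or POI. */
final class DataDescriptorSketch {
    /** Data descriptor signature as defined in the ZIP APPNOTE. */
    static final int DD_SIG = 0x08074b50;

    /** Size in bytes of a data descriptor that includes the signature. */
    static int descriptorLength(boolean zip64) {
        int sizeFieldWidth = zip64 ? Long.BYTES : Integer.BYTES;
        // signature + CRC-32 + compressed size + uncompressed size
        return Integer.BYTES + Integer.BYTES + 2 * sizeFieldWidth;
    }

    /** Encodes a descriptor with 8-byte sizes, as found in the attached file. */
    static byte[] encodeZip64(int crc, long compressedSize, long size) {
        ByteBuffer buf = ByteBuffer.allocate(descriptorLength(true))
                .order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(DD_SIG).putInt(crc).putLong(compressedSize).putLong(size);
        return buf.array();
    }

    /** Reads the descriptor as if it had 4-byte sizes; returns the bytes left unread. */
    static int unreadBytesWhenParsedAs32Bit(byte[] descriptor) {
        ByteBuffer buf = ByteBuffer.wrap(descriptor).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt() != DD_SIG) {
            throw new IllegalArgumentException("missing data descriptor signature");
        }
        buf.getInt(); // CRC-32
        buf.getInt(); // low half of the 8-byte compressed size
        buf.getInt(); // high half of the compressed size, misread as the uncompressed size
        return buf.remaining();
    }
}
```

A reader that assumes 4-byte sizes stops 8 bytes short of the next record, which is the gap the consistency check reports.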
Comment 7 Michael Stahl (allotropia) 2024-09-17 11:07:34 UTC
the file does look invalid to me, 64-bit data descriptor but no zip64 extra field:

      4.3.9.2 When compressing files, compressed and uncompressed sizes 
      SHOULD be stored in ZIP64 format (as 8 byte values) when a 
      file's size exceeds 0xFFFFFFFF.   However ZIP64 format MAY be 
      used regardless of the size of a file.  When extracting, if 
      the zip64 extended information extra field is present for 
      the file the compressed and uncompressed sizes will be 8
      byte values.  

and in any case, the file is opened by LO in "Repair" mode, so i think that's good enough, resolving NOTABUG for now.

(the Repair mode appears to "guess" if it's zip64 based on a following signature)
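A guess of this kind could look roughly like the sketch below. It is an assumption about the general technique, not LibreOffice's actual Repair code: it checks whether a known record signature follows the descriptor at the 16-byte or the 24-byte position. All names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustrative only: hypothetical heuristic, not LibreOffice's implementation. */
final class DescriptorWidthGuess {
    static final int LOCAL_HEADER_SIG = 0x04034b50;   // local file header
    static final int CENTRAL_HEADER_SIG = 0x02014b50; // central directory file header

    /** True if a local or central header signature starts at the given offset. */
    static boolean isRecordSignatureAt(byte[] zip, int offset) {
        if (offset < 0 || offset + Integer.BYTES > zip.length) {
            return false;
        }
        int sig = ByteBuffer.wrap(zip, offset, Integer.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
        return sig == LOCAL_HEADER_SIG || sig == CENTRAL_HEADER_SIG;
    }

    /**
     * Guesses the length of a signed data descriptor starting at descriptorOffset:
     * 16 for 4-byte sizes, 24 for 8-byte sizes, -1 if neither position is
     * followed by a known record signature.
     */
    static int guessDescriptorLength(byte[] zip, int descriptorOffset) {
        if (isRecordSignatureAt(zip, descriptorOffset + 16)) {
            return 16;
        }
        if (isRecordSignatureAt(zip, descriptorOffset + 24)) {
            return 24;
        }
        return -1;
    }
}
```

Such a heuristic can be fooled when a size field happens to contain the same bytes as a signature, which is one reason to prefer an explicit indicator in the headers when one is available.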

POI would be using Apache Commons-Compress; the code to write the data descriptor is
in https://github.com/apache/commons-compress/blob/master/src/main/java/org/apache/commons/compress/archivers/zip/ZipArchiveOutputStream.java

    protected void writeDataDescriptor(final ZipArchiveEntry ze) throws IOException {
        if (!usesDataDescriptor(ze.getMethod(), false)) {
            return;
        }
        writeCounted(DD_SIG);
        writeCounted(ZipLong.getBytes(ze.getCrc()));
        if (!hasZip64Extra(ze)) {
            writeCounted(ZipLong.getBytes(ze.getCompressedSize()));
            writeCounted(ZipLong.getBytes(ze.getSize()));
        } else {
            writeCounted(ZipEightByteInteger.getBytes(ze.getCompressedSize()));
            writeCounted(ZipEightByteInteger.getBytes(ze.getSize()));
        }
    }

contains the obvious check that there is a Zip64 extra field - which the attached file doesn't have.

this has been substantially changed since 2011 when Zip64 support was introduced:

https://issues.apache.org/jira/browse/COMPRESS-150

really not clear how this file was produced...
Comment 8 saveurlinux 2024-09-23 11:26:53 UTC
Thanks for the detail.
Indeed, this file was generated by Apache POI.
I solved the issue by calling
org.apache.poi.xssf.streaming.SXSSFWorkbook#setZip64Mode
with Zip64Mode.Never, which stops POI from using Zip64 extensions.
Now LO does not complain any more.
Comment 9 Michael Stahl (allotropia) 2024-11-07 12:04:50 UTC
*** Bug 163384 has been marked as a duplicate of this bug. ***
Comment 10 Andreas Reichel 2024-11-07 12:24:37 UTC
I believe I can explain what exactly happens here:

1) Commons Compress Zip64 files are correct but cannot be read by Excel

2) As a workaround, a customized Zip64 compressor was adopted for SXSSF, which produces readable files, but with those holes

Everything was very fine until this additional check was introduced.

So the way forward is likely adopting the Commons Compress `writeDataDescriptor()` method. I will try to work on it over the weekend.

Thanks a lot for the explanation and analysis.
The biggest challenge here was to understand at first what causes the problem and which part of the software was to blame. You helped me a lot with establishing this understanding.
Comment 11 Michael Stahl (allotropia) 2024-11-07 13:07:14 UTC
reopening based on new info in duplicate bug - it may be possible to use the "version needed to extract" in local file header to distinguish Zip64.
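In the ZIP specification, a "version needed to extract" of 4.5 (stored as 45) marks an entry that uses ZIP64 format extensions, and general purpose flag bit 3 marks an entry followed by a data descriptor. A minimal sketch of using both as a hint is shown below. It is an assumption about the approach, not the committed LibreOffice patch, and all names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustrative only: hypothetical names, not the LibreOffice patch. */
final class Zip64VersionHint {
    static final int LOCAL_HEADER_SIG = 0x04034b50;
    /** Version 4.5, the first ZIP specification version with ZIP64 extensions. */
    static final int ZIP64_VERSION = 45;
    /** General purpose bit 3: sizes and CRC are stored in a trailing data descriptor. */
    static final int FLAG_DATA_DESCRIPTOR = 0x0008;

    /**
     * Inspects the first 8 bytes of a local file header and reports whether the
     * entry's data descriptor plausibly uses 8-byte sizes.
     */
    static boolean descriptorLikelyZip64(byte[] localHeader) {
        ByteBuffer buf = ByteBuffer.wrap(localHeader).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.remaining() < 8 || buf.getInt() != LOCAL_HEADER_SIG) {
            throw new IllegalArgumentException("not a local file header");
        }
        // Only the low byte is compared here; treating it as the version
        // number is a simplification made for this sketch.
        int versionNeeded = Short.toUnsignedInt(buf.getShort()) & 0xFF;
        int flags = Short.toUnsignedInt(buf.getShort());
        return (flags & FLAG_DATA_DESCRIPTOR) != 0 && versionNeeded >= ZIP64_VERSION;
    }
}
```

Unlike probing for a following signature, this check depends only on fields that precede the entry data, so it can be made before the descriptor is read.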
Comment 12 Commit Notification 2024-11-07 14:48:16 UTC
Michael Stahl committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/0f39e6fbb48dae29778c305ddd576d698a8251ad

tdf#162944 package: try to detect Zip64 via version

It will be available in 25.2.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 13 Michael Stahl (allotropia) 2024-11-07 15:53:38 UTC
the CI passed with this change, let's hope this is fixed...
Comment 14 Andreas Reichel 2024-11-08 00:51:26 UTC
Thank you so much!

I have built from source and tested with POI's SXSSF files and it works now (again).

Recommendation: could you add the provided example to your test suite in order to avoid such regressions in the future? In my opinion, POI plays a large role in server-generated XLS/XLSX files and so deserves to be part of the tests.

Thank you again, a lot and cheers!


Version: 25.2.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 42533c94ec1a52c49b2587e53ab55e67fc4a449a
CPU threads: 12; OS: Linux 6.11; UI render: default; VCL: gtk3
Locale: en-US (en_US.UTF-8); UI: en-US
Calc: threaded
Comment 15 Commit Notification 2024-11-08 18:34:45 UTC
Michael Stahl committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/d79790da8a4de4758f46ae4a8573382c681af974

tdf#162944 package: add test file

It will be available in 25.2.0.
