Bug 96401 - FILEOPEN: DOCX - Specific file reported as corrupted (openable in MSO but not in other programs because of unzip error, backslash "\" as filename separator)
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): Inherited From OOo
Hardware: All All
Importance: lowest normal
Assignee: Mike Kaganski
URL:
Whiteboard: interoperability target:24.2.0
Keywords: filter:docx
Duplicates: 97379
Depends on:
Blocks: DOCX-Opening
Reported: 2015-12-11 07:51 UTC by petur
Modified: 2023-12-14 10:50 UTC (History)
CC List: 10 users

See Also:
Crash report or crash signature:


Attachments
This file is reported as corrupted (9.32 KB, application/vnd.ms-word.document.12)
2015-12-11 07:51 UTC, petur

Description petur 2015-12-11 07:51:35 UTC
Created attachment 121217
This file is reported as corrupted

From time to time I get a .docx file that LibreOffice reports as corrupted and refuses to open, although the file opens fine in MS Office.

I thought I'd take the time to submit one for debugging.

It is only half a page of text. A quick look inside showed no problems to me (I can unzip it and open each entry with a text editor).
Comment 1 raal 2015-12-11 17:38:46 UTC
I can confirm with Version: 5.2.0.0.alpha0+
Build ID: de9d0e797903e7ecc19be2b05c7e89d5936ae02d
Threads 4; Ver: Linux 4.2; Render: default; 
TinderBox: Linux-rpm_deb-x86_64@70-TDF, Branch:master, Time: 2015-12-03_04:13:00
and 4.4.0.0.alpha2+

from command line:
:1: parser error : Document is empty
PK
^


I can open the file with Word 2010.
Comment 2 Oliver Specht (CIB) 2015-12-16 15:00:05 UTC
The filter detection in oox/source/core/filterdetect.cxx tries to parse the stream in "_rels/.rels" but cannot open it ("_rels").

aParser.parseStream( aZipStorage, "_rels/.rels" );

Unzipping + rezipping the docx fixes the problem.
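
The unzip-and-rezip workaround can be approximated in a few lines of Python. This is only a sketch: it assumes the cause is the backslash-separated entry names identified later in this report, the helper name `repack_with_forward_slashes` is made up for illustration, and it does not preserve timestamps or other per-entry metadata.

```python
import io
import zipfile


def repack_with_forward_slashes(package: bytes) -> bytes:
    """Copy every entry into a new ZIP, storing names with '/' separators."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            # Only the stored name changes; the entry content is copied as-is.
            fixed_name = info.filename.replace("\\", "/")
            dst.writestr(fixed_name, src.read(info))
    return out.getvalue()
```

To use it on a file, read the .docx as bytes, pass them through the function, and write the result to a new file.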
Comment 3 Cor Nouws 2016-09-06 09:59:57 UTC Comment hidden (obsolete)
Comment 4 Telesto 2016-12-07 19:48:02 UTC
Confirming with:
Version: 5.4.0.0.alpha0+
Build ID: a9f56091b6422ec8c42f09b8472200ae4ab12548
CPU Threads: 4; OS Version: Windows 6.19; UI Render: default; 
TinderBox: Win-x86@42, Branch:master, Time: 2016-12-05_23:12:26
Locale: nl-NL (nl_NL); Calc: CL
Comment 5 Timur 2017-07-18 17:29:51 UTC
This DOCX is not correct and is corrupted. Not only LO but some other programs refuse to open it, complaining about an unzip error, as Oliver noted. Even the "Open XML SDK 2.5 Productivity Tool" (http://www.microsoft.com/en-us/download/details.aspx?id=30425) refuses it.
It looks like it comes from MSO 2007. Some MSO bug? Saved in MSO again, it opens fine.
Is the bug valid? Maybe: if MSO has a workaround, LO might also have one. But this bug was confirmed too soon, without a decision on whether it's worth fixing.
Comment 6 petur 2017-07-18 18:00:08 UTC Comment hidden (no-value)
Comment 7 Mike Kaganski 2017-07-18 18:43:53 UTC
Most probably, the problem is related to the version of the package (ZIP): it is APPNOTE 2.0, which is wrong as per ECMA-376 and ISO/IEC 29500. Both mandate that an OOXML package conform to PKWARE Inc. Zip APPNOTE Version 6.2.0. Why that is so is unclear: either some repacking happened on the route from source to destination, or the generating software (claimed to be MS Word in the app.xml content) produces this under some circumstances.

I suppose that LO *could* allow such packages. But please note that trying to mimic any non-standard behavior of MS Word (and be bug-to-bug compatible with it) is generally not among LO's goals.
Comment 8 petur 2017-07-18 19:02:05 UTC Comment hidden (off-topic)
Comment 9 Mike Kaganski 2017-07-18 19:07:14 UTC Comment hidden (off-topic)
Comment 10 QA Administrators 2018-07-19 02:41:32 UTC Comment hidden (noise, obsolete)
Comment 11 Timur 2018-07-19 08:32:45 UTC
Repro 6.2+. LO asks if it should repair the file, but fails.
Comment 12 Julien Nabet 2020-01-01 15:18:09 UTC Comment hidden (obsolete)
Comment 13 petur 2020-01-01 16:08:27 UTC
OP here... since fewer and fewer people use the MSO version that had this specific quirk, be my guest and close this. Four years of waiting has been enough anyway.

And for the last time, it is not corrupt; you can unzip the file perfectly. It merely doesn't follow the standard.

I am removing myself from this thread and community.
Comment 14 Julien Nabet 2020-01-01 16:17:42 UTC Comment hidden (obsolete)
Comment 15 md-work 2020-02-10 17:06:55 UTC
I have an identical bug for xlsx files. I didn't create those files, but they seem to be by this software:
Bio-Rad CFX Maestro 1.1 Version 4.1.2433.1219

It looks like this is a problem in the zip implementation that was used to create those OfficeOpenXml files. The delimiter for folders is stored as a backslash \ instead of a slash /. And although a slash seems to be the default folder delimiter for zip files, Microsoft products open those zip files flawlessly.

The Windows-Explorer (Windows 7) is able to extract those files. And the Microsoft OneDrive online-office can also open them.
https://onedrive.live.com

You can actually take an OfficeOpenXML file created by LibreOffice, extract it on Linux and convert it into such a messed up file. Just rename and convert all slashes (directories) to backslashes.
mv _rels/.rels _rels\\.rels
Repeat for all files, delete the empty folders, repack the zip, rename to docx/xlsx and upload to OneDrive.

In the end, I think this shouldn't be hard to fix. Especially because there shouldn't be a legal case for "real" backslashes inside filenames, inside OfficeOpenXML files.
So LibreOffice can simply interpret all backslashes inside filenames as slashes.

Note: I also opened a ticket for 7-Zip, to see what those zip experts say.
https://sourceforge.net/p/p7zip/bugs/227/
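
The idea of reading backslashes as slashes can be tried out without touching LibreOffice. The Python sketch below builds a small in-memory package with backslash-separated names (the same kind of file the `mv` recipe above produces) and then looks up an entry by its standard name. The function names are hypothetical, the entry contents are placeholders, and this is not how LibreOffice's package code is structured.

```python
import io
import zipfile


def make_backslash_package() -> bytes:
    """Build an in-memory ZIP whose entry names use a backslash as separator."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in ("[Content_Types].xml", "_rels\\.rels", "word\\document.xml"):
            info = zipfile.ZipInfo("placeholder")
            # Assign after construction so the name is stored verbatim,
            # even on platforms where ZipInfo normalises separators.
            info.filename = name
            info.orig_filename = name
            zf.writestr(info, "<x/>")
    return buf.getvalue()


def read_entry(package: bytes, wanted: str) -> bytes:
    """Find `wanted` (a '/'-separated name), reading stored backslashes as '/'."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        for info in zf.infolist():
            if info.filename.replace("\\", "/") == wanted:
                return zf.read(info)
    raise KeyError(wanted)
```

With these helpers, `read_entry(make_backslash_package(), "_rels/.rels")` finds the entry that an exact-name lookup would miss.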
Comment 16 Julien Nabet 2020-02-28 09:58:01 UTC
About "\", I had proposed a patch here: https://bugs.documentfoundation.org/show_bug.cgi?id=97379#c9
I just wonder whether we should be strict both when writing and when reading zips, or strict only when writing them.
Also, perhaps the other apps should just read the standard and follow it.
Comment 17 Kevin Suo 2021-11-04 08:58:43 UTC
1. The file uses backslashes as file name separator:

$ zipinfo /home/suokunlong/下载/tmp/failing_doc.docx 
Archive:  /home/suokunlong/下载/tmp/failing_doc.docx
Zip file size: 9547 bytes, number of entries: 13
-rw----     2.0 fat     1576 b- defN 80-Jan-01 00:00 [Content_Types].xml
-rw----     2.0 fat      685 b- defN 15-Dec-08 10:52 docProps\app.xml
-rw----     2.0 fat      619 b- defN 15-Dec-08 10:52 docProps\core.xml
-rw----     2.0 fat     4188 b- defN 15-Dec-08 10:52 word\document.xml
-rw----     2.0 fat      971 b- defN 15-Dec-08 10:52 word\endnotes.xml
-rw----     2.0 fat     1595 b- defN 80-Jan-01 00:00 word\fontTable.xml
-rw----     2.0 fat      977 b- defN 15-Dec-08 10:52 word\footnotes.xml
-rw----     2.0 fat     2440 b- defN 80-Jan-01 00:00 word\settings.xml
-rw----     2.0 fat    16648 b- defN 80-Jan-01 00:00 word\styles.xml
-rw----     2.0 fat      260 b- defN 80-Jan-01 00:00 word\webSettings.xml
-rw----     2.0 fat     6999 b- defN 80-Jan-01 00:00 word\theme\theme1.xml
-rw----     2.0 fat     1081 b- defN 80-Jan-01 00:00 word\_rels\document.xml.rels
-rw----     2.0 fat      590 b- defN 80-Jan-01 00:00 _rels\.rels
13 files, 38629 bytes uncompressed, 8069 bytes compressed:  79.1%

2. Backslash is not allowed by PK ZIP Specs:
https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT

   4.4.17 file name: (Variable)

       4.4.17.1 The name of the file, with optional relative path.
       The path stored MUST NOT contain a drive or
       device letter, or a leading slash.  All slashes
       MUST be forward slashes '/' as opposed to
       backwards slashes '\' for compatibility with Amiga
       and UNIX file systems etc.  If input came from standard
       input, there is no file name field.

3. Actually a lot of third-party software still uses backslashes. See e.g. bug 76115 (which has a duplicate bug 131575).

This bug is for docx, bug 76115 is for xlsx. But I think they use the same package/source/zippackage code. For bug triaging purpose, should this be marked as a duplicate of bug 76115?
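
For triage, the three rules in the quoted APPNOTE paragraph can be checked mechanically. The following Python sketch is an illustration only: `appnote_name_problems` is a made-up helper, and the drive-letter test (one ASCII letter followed by a colon) is an assumption about what the specification means by "drive or device letter".

```python
import re

# Assumption: a "drive letter" looks like "C:" at the start of the name.
_DRIVE_PREFIX = re.compile(r"^[A-Za-z]:")


def appnote_name_problems(name: str) -> list[str]:
    """Return the ways `name` breaks the quoted 4.4.17.1 rules (empty if none)."""
    problems = []
    if "\\" in name:
        problems.append("backslash")
    if name.startswith("/"):
        problems.append("leading slash")
    if _DRIVE_PREFIX.match(name):
        problems.append("drive letter")
    return problems
```

Applied to the listing above, `docProps\app.xml` is reported with a "backslash" problem, while `[Content_Types].xml` passes.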
Comment 18 Kevin Suo 2021-11-04 09:06:00 UTC
*** Bug 97379 has been marked as a duplicate of this bug. ***
Comment 19 Kevin Suo 2021-11-04 09:11:26 UTC
As explained in https://bugs.documentfoundation.org/show_bug.cgi?id=97379#c9
the code pointer would be in function OStorageHelper::IsValidZipEntryFileName()
in comphelper/source/misc/storagehelper.cxx:536
Comment 20 QA Administrators 2023-11-05 03:13:26 UTC Comment hidden (noise)
Comment 21 petur 2023-11-05 09:54:37 UTC
As per request, I just verified that the issue is still present, though I have not received any such problematic documents lately, so there is a slight chance that Microsoft has changed something on their end.

However, the file I originally included with this report still fails to open.

Tested on a fully updated Debian Sid with:

Version: 7.5.8.2 (X86_64) / LibreOffice Community
Build ID: 50(Build:2)
CPU threads: 8; OS: Linux 6.5; UI render: default; VCL: gtk3
Locale: nl-BE (en_GB.UTF-8); UI: en-GB
Debian package version: 4:7.5.8-1
Calc: threaded
Comment 22 Mike Kaganski 2023-11-25 14:32:04 UTC
Since commit fa66eeb587f11bea88ab5950ffd94aee221d6b31, there is a "recovery mode" in ZIP package, triggered by "RepairPackage" media descriptor property [1].

Since commit 426a2f22678f89706b4db474243ab27b4a4d6c06 (for #i104759#), this mode also handles the backslashes in packages (it is done explicitly to handle this problem).

The missing bit is to make sure that, when such a situation is detected during the load, and a warning is shown to the user asking to try to *repair*, we don't switch to the recovery mode.

[1] https://api.libreoffice.org/docs/idl/ref/servicecom_1_1sun_1_1star_1_1document_1_1MediaDescriptor.html#ab5ae6f2c9a82bcb8f006f4b46fee1691
Comment 23 Mike Kaganski 2023-11-25 14:41:48 UTC
(In reply to Mike Kaganski from comment #22)
> The missing bit is to make sure that, when such a situation is detected
> during the load, and a warning is shown to the user asking to try to
> *repair*, we don't switch to the recovery mode.

Hmm. filter/source/storagefilterdetect/filterdetect.cxx has the code to do exactly that [1]; and so, it is the wrong implementation (or a breakage) of what was implemented in 426a2f22678f89706b4db474243ab27b4a4d6c06.

[1] https://opengrok.libreoffice.org/xref/core/filter/source/storagefilterdetect/filterdetect.cxx?r=b1560344#121
Comment 24 Mike Kaganski 2023-11-26 06:40:15 UTC
So: you are able to open the file, *if* you open it using an *explicitly selected* DOCX filter in the Open dialog.

Why it fails when opened normally:
1. It uses a normal auto-detect procedure.
2. In it, it iterates all filters, asking each to try to detect the file.
3. In the list, DOCX filters happen to come prior to ODF ones ...
4. But DOCX filters, when encountering the ZIP error, fail silently.
5. ODF ones, when they see the same ZIP error, produce the warning and then proceed with the procedure from comment 23, which is described there by the comment "We don't do any type detection on broken packages (f.e. because it might be impossible), so for repairing we'll use the requested type, which was detected by the flat detection".

This makes LibreOffice use an ODF filter unconditionally on this file, which, as expected, eventually fails elsewhere.

The problem is: we need to handle the ZIP error early, and introduce the repair mode early, still keeping the autodetection with it. Because it won't help to allow DOCX filters do the same as ODF, which would then disallow broken ODF opening - DOCX would intercept them then, and the problem would be reversed.
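
The ordering problem can be reproduced with a toy model. Everything in the Python sketch below is hypothetical (the `ToyFilter` class, both functions, and the `real_type` stand-in for content-based detection); it models only the reasoning in this comment, not the actual detection code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyFilter:
    name: str
    warns_on_zip_error: bool  # True models the ODF behaviour described above


def detect_old(filters: list[ToyFilter], zip_is_broken: bool,
               real_type: str) -> Optional[str]:
    """Old flow: each filter meets the ZIP error on its own, in list order."""
    for f in filters:
        if zip_is_broken:
            if f.warns_on_zip_error:
                # Warn, then repair with this filter's own type, unconditionally.
                return f.name
            continue  # fail silently and let the next filter try
        if f.name == real_type:
            return f.name
    return None


def detect_early(filters: list[ToyFilter], zip_is_broken: bool,
                 real_type: str) -> tuple[Optional[str], bool]:
    """Proposed flow: decide on repair mode first, then autodetect as usual."""
    repair_mode = zip_is_broken  # decided once, before any filter is asked
    for f in filters:
        if f.name == real_type:  # stands in for detection run in repair mode
            return f.name, repair_mode
    return None, repair_mode
```

With the list `[ToyFilter("DOCX", False), ToyFilter("ODF", True)]`, `detect_old` picks "ODF" for a broken DOCX. Making both filters warn picks "DOCX" for a broken ODF file, which is the reversed problem mentioned above. `detect_early` returns the real type in both cases.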
Comment 25 Commit Notification 2023-11-26 20:04:58 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/86c682273d907c77404637c89e584047de1c1099

tdf#96401: allow to detect a broken ZIP package

It will be available in 24.2.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 26 Commit Notification 2023-11-27 10:48:08 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/657f98d9272dd97e4f4c6e03cce4a0fa9f526819

Related: tdf#96401 Set PROP_ASTEMPLATE for broken ZIP package

It will be available in 24.2.0.

Comment 27 Commit Notification 2023-12-01 06:24:10 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/93357349ff1998b41ea1ebedf09dc1cc5da316f7

Related: tdf#96401 Check ZIP magic number, to avoid false detections

It will be available in 24.2.0.
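
A magic-number check of the kind named in the commit title might look like the Python sketch below. It is an assumption about the general technique, not a transcription of the commit: it tests only for the local file header signature "PK\x03\x04" at offset 0, and the function name is made up. The "PK" echoed in the parser error in comment 1 is the start of that same signature.

```python
# Signature of a ZIP local file header, as defined in PKWARE's APPNOTE.
ZIP_LOCAL_FILE_HEADER = b"PK\x03\x04"


def starts_like_zip(data: bytes) -> bool:
    """Return True if the data begins with a ZIP local file header signature."""
    return data[:4] == ZIP_LOCAL_FILE_HEADER
```

A check like this lets the detection skip files that are clearly not ZIP packages before any repair prompt is considered.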

Comment 28 petur 2023-12-14 10:50:59 UTC
Thanks Mike, I just tried 24.2.0 (dev). It now complains about the file being corrupted, and after I choose to repair it, the file opens correctly.

Thanks for taking care!!