Bug 96401 - FILEOPEN: DOCX - Specific file reported as corrupted (openable in MSO but not in other programs because of unzip error)
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: Inherited From OOo (earliest affected)
Hardware: All All
Importance: lowest normal
Assignee: Not Assigned
URL:
Whiteboard: interoperability
Keywords: filter:docx
Depends on:
Blocks: DOCX-Opening
Reported: 2015-12-11 07:51 UTC by petur
Modified: 2020-02-28 09:58 UTC
CC List: 8 users

See Also:
Crash report or crash signature:


Attachments
This file is reported as corrupted (9.32 KB, application/vnd.ms-word.document.12)
2015-12-11 07:51 UTC, petur

Description petur 2015-12-11 07:51:35 UTC
Created attachment 121217
This file is reported as corrupted

From time to time I get a .docx file that LibreOffice reports as corrupted and refuses to open, even though the file opens fine in MS Office.

I thought I'd take the time to submit one for debugging.

It is only half a page of text. A quick look at the inside showed no problems to me (I can unzip it and open each entry inside with a text editor).
Comment 1 raal 2015-12-11 17:38:46 UTC
I can confirm with Version: 5.2.0.0.alpha0+
Build ID: de9d0e797903e7ecc19be2b05c7e89d5936ae02d
Threads 4; Ver: Linux 4.2; Render: default; 
TinderBox: Linux-rpm_deb-x86_64@70-TDF, Branch:master, Time: 2015-12-03_04:13:00
and 4.4.0.0.alpha2+

from command line:
:1: parser error : Document is empty
PK
^


I can open the file with Word 2010.
Comment 2 Oliver Specht (CIB) 2015-12-16 15:00:05 UTC
The filter detection in oox/source/core/filterdetect.cxx tries to parse the stream "_rels/.rels" but cannot open it ("_rels"):

aParser.parseStream( aZipStorage, "_rels/.rels" );

Unzipping + rezipping the docx fixes the problem.
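
The unzip-and-rezip workaround can be sketched in Python with the standard zipfile module. This is only an illustration, not something taken from the LibreOffice code: the helper name repack_zip and the file paths are hypothetical, and whether the repacked copy opens depends on what is actually wrong with the original container.

```python
import zipfile


def repack_zip(src_path, dst_path):
    """Copy every entry of src_path into a freshly written archive.

    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            # Writing by name makes zipfile generate new headers
            # instead of copying the original ones.
            dst.writestr(info.filename, src.read(info))
    return dst_path
```

Calling repack_zip("broken.docx", "repacked.docx") (both names are placeholders) writes a second file and leaves the original untouched, so the two can be compared.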
Comment 3 Cor Nouws 2016-09-06 09:59:57 UTC Comment hidden (obsolete)
Comment 4 Telesto 2016-12-07 19:48:02 UTC
Confirming with:
Version: 5.4.0.0.alpha0+
Build ID: a9f56091b6422ec8c42f09b8472200ae4ab12548
CPU Threads: 4; OS Version: Windows 6.19; UI Render: default; 
TinderBox: Win-x86@42, Branch:master, Time: 2016-12-05_23:12:26
Locale: nl-NL (nl_NL); Calc: CL
Comment 5 Timur 2017-07-18 17:29:51 UTC
This DOCX is not correct; it is corrupted. Not only LO but also some other programs refuse to open it, complaining about an unzip error, as Oliver noted. Even the "Open XML SDK 2.5 Productivity Tool" (http://www.microsoft.com/en-us/download/details.aspx?id=30425) fails.
Looks like it was saved by MSO 2007. Some MSO bug? Saved in MSO again, it opens fine.
Is the bug valid? Maybe: if MSO has a workaround, LO might have one too. But this bug was confirmed too soon, without a decision on whether it is worth fixing.
Comment 6 petur 2017-07-18 18:00:08 UTC Comment hidden (no-value)
Comment 7 Mike Kaganski 2017-07-18 18:43:53 UTC
Most probably, the problem is related to the version of the package (ZIP): APPNOTE 2.0 is wrong as per ECMA-376 and ISO/IEC 29500, which mandate that an OOXML package conform to PKWARE Inc. Zip APPNOTE Version 6.2.0. Why that is so is unclear: either some repacking happened on the route from source to destination, or the generating software (claimed to be MS Word in the app.xml content) produces this under some circumstances.

I suppose that LO *could* allow such packages. But please note that trying to mimic any non-standard behavior of MS Word (and be bug-to-bug compatible with it) is generally not among LO's goals.
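
The version fields of a package can be listed with Python's zipfile module, which exposes them per entry as create_version and extract_version. A minimal sketch under stated assumptions: zip_versions is a hypothetical helper name, and the function only reports the stored values (20 stands for 2.0); it does not decide whether a package is conformant.

```python
import zipfile


def zip_versions(path):
    """Return (name, create_version, extract_version) for each entry.

    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(path) as archive:
        return [
            (info.filename, info.create_version, info.extract_version)
            for info in archive.infolist()
        ]
```

Comparing the output for the attached file with the output for the same document re-saved in MSO would show whether the version fields differ between the two.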
Comment 8 petur 2017-07-18 19:02:05 UTC Comment hidden (off-topic)
Comment 9 Mike Kaganski 2017-07-18 19:07:14 UTC Comment hidden (off-topic)
Comment 10 QA Administrators 2018-07-19 02:41:32 UTC Comment hidden (obsolete)
Comment 11 Timur 2018-07-19 08:32:45 UTC
Reproduced in 6.2+. LO asks whether it should repair the file, but the repair fails.
Comment 12 Julien Nabet 2020-01-01 15:18:09 UTC Comment hidden (obsolete)
Comment 13 petur 2020-01-01 16:08:27 UTC
OP here... since fewer and fewer people use the MSO version that had this specific quirk, be my guest and close this. 4 years of waiting has been enough anyway.

And for the last time, it is not corrupt; you can unzip the file perfectly. It merely doesn't follow the standard.

I am removing myself from this thread and community.
Comment 14 Julien Nabet 2020-01-01 16:17:42 UTC Comment hidden (obsolete)
Comment 15 md-work 2020-02-10 17:06:55 UTC
I have an identical bug for xlsx files. I didn't create those files, but they seem to have been created by this software:
Bio-Rad CFX Maestro 1.1 Version 4.1.2433.1219

It looks like this is a problem in the zip implementation that was used to create those OfficeOpenXml files. The folder delimiter is stored as a backslash (\) instead of a slash (/). And although a slash seems to be the default folder delimiter for zip files, Microsoft products open those zip files flawlessly.

Windows Explorer (Windows 7) is able to extract those files, and the Microsoft OneDrive online office can also open them.
https://onedrive.live.com

You can actually take an OfficeOpenXML file created by LibreOffice, extract it on Linux and convert it into such a messed-up file. Just rename the entries so that all slashes (directories) become backslashes:
mv _rels/.rels _rels\\.rels
Repeat for all files, delete the empty folders, repack the zip, rename to docx/xlsx and upload to OneDrive.

In the end, I think this shouldn't be hard to fix, especially because there shouldn't be a legitimate case for "real" backslashes inside filenames in OfficeOpenXML files.
So LibreOffice can simply interpret all backslashes inside filenames as slashes.

Note: I also opened a ticket for 7-Zip, to see what those zip experts say.
https://sourceforge.net/p/p7zip/bugs/227/
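
The idea of treating backslashes in entry names as slashes can be sketched outside LibreOffice with Python's zipfile module. This is a rough illustration, not the proposed LibreOffice fix: normalize_entry_names is a hypothetical helper, and it assumes no entry legitimately contains a backslash in its name.

```python
import zipfile


def normalize_entry_names(src_path, dst_path):
    """Write dst_path as a copy of src_path with backslashes in
    entry names replaced by forward slashes.

    Hypothetical helper for illustration only.
    """
    renamed = []
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            fixed = info.filename.replace("\\", "/")
            if fixed != info.filename:
                # Keep track of what was changed, for reporting.
                renamed.append((info.filename, fixed))
            dst.writestr(fixed, src.read(info))
    return renamed
```

The returned list of (old, new) name pairs makes it easy to see whether a given xlsx or docx file was affected at all; an empty list means no entry name was changed.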
Comment 16 Julien Nabet 2020-02-28 09:58:01 UTC
About "\", I had proposed a patch here: https://bugs.documentfoundation.org/show_bug.cgi?id=97379#c9
I just wonder whether we should be strict both when writing and when reading zips, or strict only when writing zips.
Also, perhaps the other apps should just read the standard and follow it.