Bug 85324 - FILEOPEN: docx file with doc extension triggers the repair dialog
Summary: FILEOPEN: docx file with doc extension triggers the repair dialog
Status: VERIFIED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: filters and storage (show other bugs)
Version:
(earliest affected)
4.4.0.0.alpha0+ Master
Hardware: Other All
: medium normal
Assignee: Maxim Monastirsky
URL:
Whiteboard: target:4.4.0
Keywords: regression
Depends on:
Blocks:
 
Reported: 2014-10-22 10:46 UTC by Yousuf Philips (jay) (retired)
Modified: 2014-10-26 21:55 UTC (History)
1 user (show)

See Also:
Crash report or crash signature:


Attachments
corrupt dialog (22.07 KB, image/png)
2014-10-22 10:46 UTC, Yousuf Philips (jay) (retired)
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Yousuf Philips (jay) (retired) 2014-10-22 10:46:58 UTC
Created attachment 108228 [details]
corrupt dialog

When opening attachment 105386 [details] in master, it first states the file is corrupt, which isnt correct as the file opens fine with 4.3 daily and word 2007, and when i'm given the choice of yes and no of whether to repair the file, i click no, a new dialog appears stating "The file 'Sample Doc.doc' could not be repaired and therefore cannot be opened.". So i click the 'OK' button and then the file gets loaded.

Version: 4.3.4.0.0+
Build ID: bea74d73f4782b17f5286327f15db60ab3ae85de
TinderBox: Linux-rpm_deb-x86@45-TDF, Branch:libreoffice-4-3, Time: 2014-10-15_05:34:17

Version: 4.4.0.0.alpha0+
Build ID: c68642d535f2ebb7f1cd866ad19b1fd018e7cd6d
TinderBox: Linux-rpm_deb-x86@45-TDF, Branch:master, Time: 2014-10-18_23:03:32
Comment 1 Maxim Monastirsky 2014-10-22 11:54:26 UTC
Hi Jay,

So this is a docx file with a 'doc' extension, and I can confirm that this happens now with any such file.

A bit of history:
This bug was fixed once in Bug 59426. Later some of the changes reverted in [1], which didn't reintroduce this bug, but instead caused Bug 74978. Later I completely removed the original fix in [2] which fixed Bug 74978, and still didn't reintroduce this bug. Now it's broken again. I guess it's since [3], and I'll see what I can do with it.

[1] http://cgit.freedesktop.org/libreoffice/core/commit/?id=e4003b67062e575f9b77772488f9b9691fa9fc38
[2] http://cgit.freedesktop.org/libreoffice/core/commit/?id=f82f7bf3dd5053049259f6933d1335f6c9e314dd
[3] http://cgit.freedesktop.org/libreoffice/core/commit/?id=ed00d0ddd663085a5fd180301cdc82af80bc8077
Comment 2 Maxim Monastirsky 2014-10-22 13:53:02 UTC
Would like to know whether http://cgit.freedesktop.org/libreoffice/core/commit/?id=ce59240410e58352917fcda94cab3465f059d11d makes any difference?
Comment 3 Maxim Monastirsky 2014-10-24 11:01:50 UTC
Seems to work.
Comment 4 Yousuf Philips (jay) (retired) 2014-10-24 16:43:38 UTC
Loads fine now. About the corrupt dialog, is it supposed to trigger in this funny way and completely do the same thing with a yes or no user response.
Comment 5 Maxim Monastirsky 2014-10-25 21:24:56 UTC
(In reply to Jay Philips from comment #4)
> About the corrupt dialog, is it supposed to trigger in this
> funny way and completely do the same thing with a yes or no user response.
Of course not. For a corrupted ODF file, when you click yes it should try to repair it. Try it with attachment 98315 [details]. The problem arises when this code gets non-ODF ZIP based file (such as docx), and it suggest to repair although we actually unable to repair this non-ODF file. So what I did in the above commit, is that the docx filter will catch this file *before* it gets to the repair dialog code.

Regarding the "No" button: The current behavior is to continue trying to load the file with other filters (since the ODF had failed and showed the corruption dialog).
Comment 6 Maxim Monastirsky 2014-10-25 21:38:56 UTC
(In reply to Maxim Monastirsky from comment #5)
> The problem arises when
> this code gets non-ODF ZIP based file (such as docx), and it suggest to
> repair although we actually unable to repair this non-ODF file.
Just to clarify: This file is not corrupted, but the ODF code expects ODF, and knows nothing about docx, so when it gets docx it claims that it's corrupted. Sure it's not ideal, and we could be smarter about such situation in the future.
Comment 7 Yousuf Philips (jay) (retired) 2014-10-25 21:50:11 UTC
(In reply to Maxim Monastirsky from comment #5)
> Regarding the "No" button: The current behavior is to continue trying to
> load the file with other filters (since the ODF had failed and showed the
> corruption dialog).

Tried it with attachment 98315 [details] and i guess it trying to load with other filters gives me a General input/output error dialog. But i think stating "The file 'Trabajo Final - Adhesivos - Carlos D Salvador.odt' could not be repaired and therefore cannot be opened." should imply that it stops there.
Comment 8 Yousuf Philips (jay) (retired) 2014-10-25 21:51:31 UTC
(In reply to Maxim Monastirsky from comment #6)
> Just to clarify: This file is not corrupted, but the ODF code expects ODF,
> and knows nothing about docx, so when it gets docx it claims that it's
> corrupted. Sure it's not ideal, and we could be smarter about such situation
> in the future.

I think there should be some checking of the file regardless of extension.
Comment 9 Maxim Monastirsky 2014-10-25 22:19:22 UTC
(In reply to Jay Philips from comment #7)
> Tried it with attachment 98315 [details] and i guess it trying to load with
> other filters gives me a General input/output error dialog.
Right.

(In reply to Jay Philips from comment #8)
> I think there should be some checking of the file regardless of extension.
In fact there is such checking, but the ODF checking thinks that any ZIP file which doesn't have ODF layout is corrupted (and calling the docx checking from inside the ODF checking is against the current code design, which isn't going to be changed anytime soon). One possible solution I thought, is to not show the corruption dialog immediately when the ODF detection fails, but wait and show it only if no other filter catched the file in the meantime. Such change will also cause that after clicking "No", it won't try to load the file anymore.
Comment 10 Maxim Monastirsky 2014-10-25 22:34:29 UTC
BTW You can see similar behavior to this "No" button in Bug 80999. Both are affected by the same design of the code: When one type detection fails, it still tries other types.
Comment 11 Yousuf Philips (jay) (retired) 2014-10-26 10:42:25 UTC
(In reply to Maxim Monastirsky from comment #9)
> (In reply to Jay Philips from comment #8)
> > I think there should be some checking of the file regardless of extension.
> In fact there is such checking, but the ODF checking thinks that any ZIP
> file which doesn't have ODF layout is corrupted (and calling the docx
> checking from inside the ODF checking is against the current code design,
> which isn't going to be changed anytime soon). One possible solution I
> thought, is to not show the corruption dialog immediately when the ODF
> detection fails, but wait and show it only if no other filter catched the
> file in the meantime. Such change will also cause that after clicking "No",
> it won't try to load the file anymore.

To me, this kind of checking is incorrect. It should check the files before sending the file to an import filter. If anything, there should be a ZIP file check that determines if its an ODF or DOCX and then sends it to the necessary import filter.
Comment 12 Maxim Monastirsky 2014-10-26 11:36:21 UTC
(In reply to Jay Philips from comment #11)
> It should check the files before sending the file to an import filter.
That's exactly what we have now, and this code is the one that triggers the repair dialog, not the filter.

> If anything, there should be a ZIP
> file check that determines if its an ODF or DOCX and then sends it to the
> necessary import filter.
Note that we support also other ZIP based formats, that handled by external libraries, and the detection occurs inside each such library. How would you make sure that such file won't trigger the repair dialog?
Comment 13 Yousuf Philips (jay) (retired) 2014-10-26 13:36:00 UTC
(In reply to Maxim Monastirsky from comment #12)
> (In reply to Jay Philips from comment #11)
> > It should check the files before sending the file to an import filter.
> That's exactly what we have now, and this code is the one that triggers the
> repair dialog, not the filter.

Well you stated "This file is not corrupted, but the ODF code expects ODF, and knows nothing about docx, so when it gets docx it claims that it's corrupted.", which sounds as if its in the ODF import filter.

> > If anything, there should be a ZIP
> > file check that determines if its an ODF or DOCX and then sends it to the
> > necessary import filter.
> Note that we support also other ZIP based formats, that handled by external
> libraries, and the detection occurs inside each such library. How would you
> make sure that such file won't trigger the repair dialog?

Well i'd assume that libreoffice would have some file testing code which would easily identifies the type of file it is and then send it to the correct filter. Like when i created an image viewer before, i had signatures per file to easily identify the type of file it is, like GIFs having the GIF89 marker as it first few bytes, and stuff like that.
Comment 14 Maxim Monastirsky 2014-10-26 19:24:08 UTC
(In reply to Jay Philips from comment #13)
> Well you stated "This file is not corrupted, but the ODF code expects ODF,
> and knows nothing about docx, so when it gets docx it claims that it's
> corrupted.", which sounds as if its in the ODF import filter.
There are ODF filter and ODF type detector. Both deal only with ODF, but they're completely two different pieces of code.

To make things clear: At http://opengrok.libreoffice.org/xref/core/filter/source/config/fragments/types/ you can find a list of formats supported by LO. Most of them have a "DetectService" tag, which is a name of a UNO component that is able to look into a file and confirm whether it's from this type or not based on the file contents (ZIP contents, first few bytes etc.). During file open we gather all the types and sort them by matching of extension, then by the types related to the current open component (Writer, Calc etc.), then all the rest. Now we run one by one the "DetectService" of each type, and it return whether it's from this type or not. When some "DetectService" return a positive answer, we stop this loop and send the file to the import filter that is associated with the "detected" type.

So this is not a monolithic code like you might expect, but a plugable framework which uses many "detectors", each of them has it's own code base, and lives his own life. (This allows to write and distribute new import filters as extensions, which can be plugged into this framework, rather than compiling into LO.)

Now, the ODF detector has this repair dialog inside. So in order to support DOCX nicely we must do one of the two:

1. Make sure that the DOCX detector runs before the ODF one, so it will catch that file. This always should be the case when a file has the correct extension, which is true in most of the cases (because then the DOCX detector runs before any other). And in case of a wrong extension, we must do our best to make it sorted somewhere before the ODF one. That's what I did in this commit. Of course it won't help in case the DOCX file has some ODF extension.

2. What I suggested in comment 9 - Move the repair dialog out of the ODF detector and the detection loop, and show it after that loop, and only if no other detector claimed for a detection in the meantime.

Hope it's clear now.
Comment 15 Yousuf Philips (jay) (retired) 2014-10-26 21:55:25 UTC
Yes quite clear now. Thanks for the explanation.