Bug 123476 - Detect 0-byte files based on extension (esp. for MS Office and ODF formats)
Status: VERIFIED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: filters and storage
Version: unspecified (earliest affected)
Hardware: All All
Importance: medium enhancement
Assignee: Miklos Vajna
URL:
Whiteboard: target:7.1.0 target:7.2.0 target:7.1....
Keywords:
Duplicates: 90613 98127 104819 120822 133164
Depends on:
Blocks: FormatDetection
Reported: 2019-02-15 05:19 UTC by Aron Budea
Modified: 2021-05-16 18:40 UTC
CC List: 7 users

See Also:
Crash report or crash signature:


Attachments
Screencast (2.93 MB, image/gif)
2021-05-03 12:54 UTC, Mike Kaganski

Description Aron Budea 2019-02-15 05:19:04 UTC
Currently all empty (0-byte) files are detected as HTML/Web files (this is actually a change in 6.2.0.3 since 6.1.0.3, until then they were detected as text files). Since there's no other identifying information, they should be detected based on extension.

Current behavior is a problem, because in Windows right click -> New -> <various MS Office formats> tend to create 0-byte files with MS Office installed, and opening and editing them in LibreOffice can cause confusion and potential data loss if the user doesn't notice the wrong format before saving their document.
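The requested behavior can be sketched roughly as follows. This is only an illustration of the idea, not LibreOffice's detection code: the function name, the table, and the format labels are all hypothetical.

```python
import os
import tempfile

# Hypothetical table: extension -> format label. The labels are illustrative,
# not LibreOffice's internal filter names.
FORMAT_BY_EXTENSION = {
    ".docx": "DOCX", ".doc": "DOC",
    ".xlsx": "XLSX", ".xls": "XLS",
    ".pptx": "PPTX", ".ppt": "PPT",
    ".odt": "ODT", ".ods": "ODS", ".odp": "ODP",
}


def detect_empty_file_format(path):
    """Return a format label for a 0-byte file based on its extension.

    Returns None when the file has content (content-based detection should
    decide) or when the extension is not in the table.
    """
    if os.path.getsize(path) != 0:
        return None
    extension = os.path.splitext(path)[1].lower()
    return FORMAT_BY_EXTENSION.get(extension)


# Demo: an empty .docx similar to what Explorer's New menu can create.
with tempfile.TemporaryDirectory() as folder:
    empty_docx = os.path.join(folder, "New Document.DOCX")
    open(empty_docx, "wb").close()  # creates a 0-byte file
    print(detect_empty_file_format(empty_docx))  # prints DOCX
```

Returning None for non-empty files keeps the extension as a last resort only, which matches the point above that the extension matters because there is no other identifying information.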
Comment 1 Mike Kaganski 2019-02-15 05:48:17 UTC
Well - then not only detection should be changed, but also something needs to be done with document initialization as well - because *reading* from such a file using detected filter would be impossible.
Comment 3 Mike Kaganski 2019-02-15 07:57:39 UTC
Regarding the initialization: should something specific be done with the new document depending on the filter? E.g., for a 0-byte .docx, should we simply create a new Writer document (using default template) and set its filter to DOCX, or should we initialize it as if that is a DOCX - which would mean different default fonts, compatibility options, etc. (whatever is done in DOCX importer when a valid DOCX is imported, before actual DOCX data is read)? Should all filters be modified to be able to do that then? Would that require to have own default templates for all filters, or should the one default template for the module be used anyway, with application of filter-specific modifications (with a risk of the resulting new document to differ from the template as used in normal new document creation)?
Comment 4 Aron Budea 2019-02-15 08:13:53 UTC
(In reply to Mike Kaganski from comment #1)
> Well - then not only detection should be changed, but also something needs
> to be done with document initialization as well - because *reading* from
> such a file using detected filter would be impossible.
Sure, the point is not to read from an empty file, but to correctly initialize one.
Comment 5 Aron Budea 2019-02-15 12:27:35 UTC
(In reply to Mike Kaganski from comment #3)
> Regarding the initialization: should something specific be done with the new
> document depending on the filter? E.g., for a 0-byte .docx, should we simply
> create a new Writer document (using default template) and set its filter to
> DOCX, or should we initialize it as if that is a DOCX - which would mean
> different default fonts, compatibility options, etc. (whatever is done in
> DOCX importer when a valid DOCX is imported, before actual DOCX data is
> read)? Should all filters be modified to be able to do that then? Would that
> require to have own default templates for all filters, or should the one
> default template for the module be used anyway, with application of
> filter-specific modifications (with a risk of the resulting new document to
> differ from the template as used in normal new document creation)?
All very good questions. I'd say just create an empty document/spreadsheet/presentation, set the export type to the identified one, and do an export+import cycle. The last of those steps is optional if it would be a larger task; the most important thing is to start in the correct application and set the correct save format.

While I think the above method would be applicable to most formats, it's really relevant for the formats that could come in as 0-byte files in real life, i.e. the ones that can be created by MS Office via Explorer context menu.
Comment 6 Miklos Vajna 2020-10-27 13:06:33 UTC
I plan to look at this.
Comment 7 Commit Notification 2020-10-28 18:35:06 UTC
Miklos Vajna committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/ada07f303e7cd1e39c73abe0741aefe7d9d73a57

tdf#123476 filter: try to detect 0-byte files based on extension

It will be available in 7.1.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 8 Commit Notification 2021-01-28 13:09:51 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/2854362f429e476d4a1ab4759c6a1f1c04150280

tdf#123476 filter: Also handle empty ODF

It will be available in 7.2.0.

Comment 9 Commit Notification 2021-01-29 12:29:47 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "libreoffice-7-1":

https://git.libreoffice.org/core/commit/e3307e5e76d5c35ee79b262d519c4a777acce536

tdf#123476 filter: Also handle empty ODF

It will be available in 7.1.1.

Comment 10 Justin L 2021-04-27 05:40:41 UTC
*** Bug 90613 has been marked as a duplicate of this bug. ***
Comment 11 Mike Kaganski 2021-04-27 06:01:14 UTC
*** Bug 98127 has been marked as a duplicate of this bug. ***
Comment 12 Mike Kaganski 2021-04-27 06:01:25 UTC
*** Bug 104819 has been marked as a duplicate of this bug. ***
Comment 13 Mike Kaganski 2021-04-27 06:06:45 UTC
*** Bug 120822 has been marked as a duplicate of this bug. ***
Comment 14 Mike Kaganski 2021-04-27 06:06:58 UTC
*** Bug 133164 has been marked as a duplicate of this bug. ***
Comment 15 junk_2010 2021-05-01 10:38:50 UTC
The comment in this bug report on 2021-01-29 12:29:47 UTC said regarding the patch:
"It will be available in 7.1.1."

I was the original reporter of Bug 104819 back in LibreOffice version 5.2.3.3 on 20/Dec/2016. That bug report was resolved as a duplicate which leads to this bug report.

After the posted update comments in this bug report about the fix being available in 7.1.1, I have tested using version 7.1.2.2 on Windows 10 (Version 20H2 OS build 19042.928) and unfortunately the patch does not fully resolve the issues I reported.

The steps I followed were:
1. Create a new Microsoft Word Doc (.docx) in a folder (using the right mouse button option) on Windows 10 with Microsoft Word from Microsoft Office Professional Plus 2013 (15.0.4875.1001)

2. Add some header and footer text

3. Add some main document text using format styles, Title, Heading 1, Heading 2, Text Body etc

4. Save document

5. Quit LibreOffice Writer

6. Reopen document in LibreOffice Writer

The reopened document:
a) There was no header or footer text.
b) All of the main document text formatted using format styles was no longer formatted.
The main document text was all present but it was all "Preformatted Text"
c) If I try to open the saved document using Microsoft Word it says that the document cannot be opened as it is corrupt.
Comment 16 Aron Budea 2021-05-02 04:17:00 UTC
Works fine here in 7.1.2.2 when checked with an empty file with .docx extension.
Comment 17 junk_2010 2021-05-02 12:58:21 UTC
Aron,

Thank-you for checking and getting back so quickly.
I am currently at a loss to explain the different behaviour you and I are seeing.

I have checked the behaviour I saw yesterday and it is 100% repeatable for me. I have also checked I get the same behaviour using a different account on the same Windows 10 PC.

When I reported the original bug, one issue that became apparent explained why the file was "silently" saving and causing data loss for me, while others were being prompted that there was a "potential issue".

This is the "Ask when not saving in ODF or default format" LibreOffice option. I had turned this option off, using the checkbox on the "Confirm File Format" prompt window the first time it was presented to me. I turned the option back on today and repeated the experiment using a newly created (right-mouse button click) .docx document. This was a 0-byte document. Now when I saved the document after edits I got the "Confirm File Format" prompt window, and the only two document format options it offered me were "Use text Format" and "Use ODF Format". So I still have to use "Save As" to save in Microsoft Word .docx format to avoid data loss the first time I save a .docx document.

I also checked the behaviour by manually creating a 0 byte .doc document. The behaviour was the same for me as with a new .docx document, data loss on saving.
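For anyone who wants to repeat the manual test, 0-byte files can be generated with a few lines of Python. This is just a convenience sketch; the helper name and the file names are made up for illustration.

```python
from pathlib import Path
import tempfile


def create_empty_files(folder, extensions=(".docx", ".doc")):
    """Create one 0-byte file per extension in `folder` and return the paths."""
    created = []
    for extension in extensions:
        path = Path(folder) / f"empty{extension}"
        path.touch()  # creates the file without writing any bytes
        created.append(path)
    return created


# Demo in a throwaway folder: every created file is exactly 0 bytes long.
with tempfile.TemporaryDirectory() as folder:
    for path in create_empty_files(folder):
        print(path.name, path.stat().st_size)
```

Opening the resulting files in Writer, editing, and saving should then show whether the save format follows the extension.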

Thank you for your work in looking at this issue. However, for me at the moment the issue is not resolved. If you would like me to try any other experiments with the version of LibreOffice I have installed please let me know.

I have double checked the version of LibreOffice from the help menu. The details are:
Version: 7.1.2.2 (x64) / LibreOffice Community
Build: 8a45595d069ef5570103caea1b71cc9d82b2aae4
       https://git.libreoffice.org/core/+log/8a45595d069ef5570103caea1b71cc9d82b2aae4
Environment: CPU threads: 4; OS: Windows 10.0 Build 19042
User Interface: UI render: Skia/Raster; VCL: win
Locale: en-GB (en_GB); UI: en-GB
Misc: Calc: threaded
Comment 18 Miklos Vajna 2021-05-03 07:33:00 UTC
This bug was fixed about half a year ago. It has a cppunit test that ensures it remains fixed. If you have a related problem, could you please open a follow-up bug instead? Thanks.
Comment 19 Orwel 2021-05-03 08:57:39 UTC
(In reply to Miklos Vajna from comment #18)
> This bug was fixed about half a year ago. It has a cppunit test that ensures
> it remains fixed. If you have a related problem, could you please open a
> follow-up bug instead? Thanks.

I can confirm what junk_2010@live.co.uk has written: the bug WAS NOT fixed and is still present (LO 7.0, 7.1). I reported a duplicate of this bug (Bug 90613), so I regularly check whether it works. There has been no working fix in any version of Windows/LO I have had. Currently Win 20H2 + LO:
Version: 7.1.2.2 (x64) / LibreOffice Community
Build ID: 8a45595d069ef5570103caea1b71cc9d82b2aae4
CPU threads: 16; OS: Windows 10.0 Build 19042; UI render: Skia/Raster; VCL: win
Locale: sk-SK (sk_SK); UI: en-GB
Calc: CL

LO still opens a 0-byte file as unformatted text, and if you do not notice it, you will lose all formatting after you save the document, as there is still no warning about saving such a file in plain .txt format.
This bug should be REOPENED, as it was fixed neither in 7.0 nor in 7.1.
Comment 20 Mike Kaganski 2021-05-03 09:04:41 UTC
(In reply to junk_2010 from comment #15)
(In reply to Orwel from comment #19)

junk_2010, Orwel: could you please check whether it also fails with a clean user profile?

Also, could you please record a screencast where you create such a document in Explorer, open it in Writer, show Help->About, then save, showing the filter name that is displayed in the warning about file format?

Thanks!
Comment 21 Orwel 2021-05-03 12:26:52 UTC
Hi,

I have tested with a clean user profile. I noticed a pop-up window, Confirm File Format ("This document may contain formatting or content that cannot be saved in the currently selected file format “Text”...").
In my own profile this option (found under Options - Load/Save - General - Warn when not saving in ODF or default format) is deactivated. If I enable it in my profile, I get the same pop-up window. But this is not a solution for the described bug:

1. For people who use a lot of .docx, .doc, .odt files, this pop-up dialog is very annoying, as it appears for every single .docx/.doc file you want to save/re-save (Save/Save as). So for each save you have to take 2 steps if you want to keep the docx format (click Save, then click Keep Format). Therefore I have deactivated it.

2. With the warning enabled in Load/Save - General, the pop-up window appears only with the SAVE function. With SAVE AS, you only get the Save As window, where you can see the proposed extension as .txt.

But the BUG itself is that an empty .docx file SHOULD NOT be interpreted as a .txt file in any way.
If you open an empty .docx (created in Win Explorer by right click) in MS Office, you get the default template opened.
Likewise, if you open an empty .odt file (created in Win Explorer by right click), the created .odt file is not a TXT file...
So why does LO interpret an empty .docx file as a txt file? The pop-up window is not a solution, because I need it to be deactivated (see point 1 above). This is the bug we are talking about...

(In reply to Mike Kaganski from comment #20)
> 
> 
> could you please record a screen cast

Do you mean a screen video record of proposed steps? If yes, do you still need it?
Comment 22 Mike Kaganski 2021-05-03 12:41:25 UTC
(In reply to Orwel from comment #21)
> (In reply to Mike Kaganski from comment #20)
> > 
> > 
> > could you please record a screen cast
> 
> Do you mean a screen video record of proposed steps?

Yes

> If yes, do you still need it?

Yes
Comment 23 Mike Kaganski 2021-05-03 12:54:08 UTC
Created attachment 171605 [details]
Screencast

Actually, I was able to repro myself. And I agree with "reopened" state, since it was never fixed.

The problem here is using *Writer* to open the file. I always had Word associated with .DOCX on my system, and always tested with "Open With->LibreOffice", and that works as intended.

But if I use "Open With->LibreOffice Writer", or associate DOCX with Writer (as opposed to simple LibreOffice), the problem appears. Given that in normal installation, where user chooses to associate MSO files with LibreOffice, DOCX are associated with Writer, this problem is indeed still not fixed for users.

The problem is likely the '--writer' command line option used in this case.
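To illustrate how a preselected module could interact with extension-based detection, here is a rough sketch. It is an assumption about the general shape of the logic, loosely following the title of the later commit in this report ("also use filter by extension when its service is the same"); the table, names, and fallback are hypothetical and do not mirror the actual code.

```python
# Hypothetical table: extension -> (format label, document service).
FILTER_BY_EXTENSION = {
    ".docx": ("DOCX", "writer"),
    ".doc": ("DOC", "writer"),
    ".odt": ("ODT", "writer"),
    ".xlsx": ("XLSX", "calc"),
    ".pptx": ("PPTX", "impress"),
}


def pick_filter_for_empty_file(extension, requested_service=None, fallback="Text"):
    """Pick a format label for an empty file.

    requested_service models a module forced from outside, for example by a
    command line option such as --writer. The extension-based choice is used
    when no service was requested or when it belongs to the requested service.
    What should happen on a mismatch is an assumption here: the sketch simply
    returns the fallback.
    """
    entry = FILTER_BY_EXTENSION.get(extension.lower())
    if entry is None:
        return fallback
    label, service = entry
    if requested_service is None or requested_service == service:
        return label
    return fallback


print(pick_filter_for_empty_file(".docx"))            # prints DOCX
print(pick_filter_for_empty_file(".docx", "writer"))  # prints DOCX
```

In this picture, the failure described above would correspond to the second call returning the fallback instead of DOCX even though the requested service and the extension's service agree.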
Comment 24 Miklos Vajna 2021-05-03 13:07:31 UTC
The bug was created by Aron, the original scope was Online. The fix works for Online, as far as I know. If you want to have this working in a wider scope, that's fine, but please let's have a separate, follow-up bug for that. Thanks.
Comment 25 junk_2010 2021-05-03 14:25:49 UTC
I was about to create a screencast but it seems it is no longer needed.

I would add that I believe this issue occurs with both new 0 byte .docx and
.doc files, though I appreciate a .doc file is no longer very common.

> The bug was created by Aron, the original scope was Online. The fix works for
> Online, as far as I know. If you want to have this working in a wider scope,
> that's fine, but please let's have a separate, follow-up bug for that. Thanks

I would suggest that if there is no wish to re-open this bug report, a new bug report is not required, as you could just re-open one of the bug reports that were closed as a duplicate of this report. All of the reports below appear to me to describe the issue:

Bug 90613 2015-04-14
Bug 98127 2016-02-24
Bug 104819 2016-12-20
Bug 120822 2018-10-23
Comment 26 Commit Notification 2021-05-03 16:39:43 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/dff586735b6618d9b011823594a33287d8f7f223

tdf#123476: also use filter by extension when its service is the same

It will be available in 7.2.0.

Comment 27 Commit Notification 2021-05-06 08:42:25 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "libreoffice-7-1":

https://git.libreoffice.org/core/commit/a8e84a2d6e634c03d62e17bcc1b617238dcc9eb1

tdf#123476: also use filter by extension when its service is the same

It will be available in 7.1.4.

Comment 28 junk_2010 2021-05-16 16:41:37 UTC
Thank-you for putting a fix in place for this.

I have downloaded and installed the 7.2.0.0.alpha1+(x64) daily build from:
https://dev-builds.libreoffice.org/daily/master/current.html
2021-05-16 04:50:54

Version: 7.2.0.0.alpha1+ (x64) / LibreOffice Community
Build ID: 58b0c95ad50139a62bddb348d10f94053c09cd5b
CPU threads: 4; OS: Windows 10.0 Build 19042; UI render: Skia/Raster; VCL: win
Locale: en-GB (en_GB); UI: en-GB
Calc: CL

With this build I can confirm that the issue I reported has been resolved. Specifically, the steps I followed on Windows 10 were:

1. Create a new Microsoft Word Doc (.docx) in a folder (using the right mouse button option) on Windows 10 with Microsoft Word from Microsoft Office Professional Plus 2013 (15.0.4875.1001)

2. Add some header and footer text

3. Add some main document text using format styles, Title, Heading 1, Heading 2, Text Body etc

4. Save document

5. Quit LibreOffice Writer

6. Reopen document in LibreOffice Writer

The re-opened document has now retained all of the formatting. I was also able to open the document in Microsoft Word without any issues.
Comment 29 BogdanB 2021-05-16 18:40:44 UTC
Based on comment 28 (from the original reporter of the bug) I will mark this bug as verified.