Bug 78731 - FILEOPEN: Specific SYLK file with .XLS extension opens in Writer instead of Calc
Summary: FILEOPEN: Specific SYLK file with .XLS extension opens in Writer instead of Calc
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: filters and storage
Version: 4.1.6.2 release (earliest affected)
Hardware: All Windows (All)
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Duplicates: 64894 80016 82393 96098 133282
Depends on:
Blocks: RTF FormatDetection
Reported: 2014-05-15 08:27 UTC by Jean-Luc
Modified: 2023-10-24 22:39 UTC (History)
CC List: 9 users

See Also:
Crash report or crash signature:


Attachments
It is a SLK format with a .XLS extension. (535 bytes, text/spreadsheet)
2014-05-15 08:27 UTC, Jean-Luc
RTF file with .doc extension (5.06 KB, application/octet-stream)
2015-11-27 13:48 UTC, Fabio Bas

Description Jean-Luc 2014-05-15 08:27:10 UTC
Created attachment 99068
It is a SLK format with a .XLS extension.

Some files in an internal SLK format, but with a .XLS extension, are opened in Writer.
This has happened since 4.1 (tested with 4.1.6.2). It worked with 3.5.
It works fine if the file extension is .SLK.
Comment 1 Maxim Monastirsky 2014-05-15 13:22:54 UTC
Confirmed with a master build from Tb39 (Build ID: 45c89d62b527abec07072074484bd596ab1aa04a). Can't reproduce with other SYLK files I have.
Comment 2 Jean-Luc 2014-05-15 13:39:39 UTC
(In reply to comment #1)
> Confirmed with a master build from Tb39 (Build ID:
> 45c89d62b527abec07072074484bd596ab1aa04a). Can't reproduce with other SYLK
> files I have.

The bug reproduces with files that have a .xls extension, are in SYLK format, and contain 2 acute accent characters ( é ). It works with 1 or 3 "é" characters.
Comment 3 Jean-Luc 2014-05-15 13:58:48 UTC
To reproduce :
 -Create a new empty file with calc, and save it in SYLK format ( .SLK ).
 -Close the file, and rename it with .XLS extension.
 -Reopen with Calc, and write words in cells

   - with 1 "é" character, save sylk format and .xls extension, reopen = ok Calc
   - with 2 "é" characters, save sylk format and .xls extension, reopen = Writer.
   - change extension in ".slk", open = ok Calc, write a 3rd "é", save.
   - change extension in ".xls", open = ok Calc
Comment 4 Jean-Luc 2014-05-15 14:01:40 UTC
To reproduce :
 -Create a new empty file with calc, and save it in SYLK format ( .SLK ).
 -Close the file, and rename it with .XLS extension.
 -Reopen with Calc, and write words in cells

   - with 1 "é" character, save sylk format and .xls extension, reopen = ok Calc
   - with 2 "é" characters, save sylk format and .xls extension, reopen = Writer.
   - change extension in ".slk", open = ok Calc, write a 3rd "é", save.
   - change extension in ".xls", open = ok Calc
   ... failed with 4 "é", ok with 5 "é" etc ...
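
The steps above can be approximated without Calc by generating test files directly. This is a minimal sketch: the three-record SYLK layout (`ID`, `C`, `E`) is a simplified assumption and is not what Calc itself writes, and `make_sylk` and `write_samples` are hypothetical helper names. The relevant detail is that each "é" is stored as the single byte 0xE9 (Latin-1).

```python
from pathlib import Path


def make_sylk(n_accents: int) -> bytes:
    """Build a minimal SYLK document whose single cell holds n 'é' characters."""
    text = "caf" + "é" * n_accents
    records = [
        "ID;PWXL;N;E",         # SYLK header record
        f'C;Y1;X1;K"{text}"',  # cell A1 with a quoted string value
        "E",                   # end-of-file record
    ]
    # Latin-1 encodes 'é' as the single byte 0xE9.
    return ("\r\n".join(records) + "\r\n").encode("latin-1")


def write_samples(directory: Path, max_accents: int = 5) -> None:
    """Write one SYLK file per accent count, each with a misleading .xls extension."""
    directory.mkdir(parents=True, exist_ok=True)
    for n in range(1, max_accents + 1):
        (directory / f"sylk_{n}_accents.xls").write_bytes(make_sylk(n))
```

Calling `write_samples(Path("samples"))` and opening each file would show whether the odd/even pattern reported here holds for a given build.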
Comment 5 Maxim Monastirsky 2014-05-15 21:59:43 UTC
@Fridrich: This file is erroneously detected by libwpd. Can it be solved in some way on the libwpd side, or do we need to think of another solution?
Comment 6 Fridrich Strba 2014-05-16 08:03:28 UTC
I have a big problem with this. I see that libwpd is detecting it as a WP 4.2 file and I understand why. Now, if you change the extension of this file to *.sylk, it will load well. This can be a workaround for these rare cases.

The problem lies in the fact that the WP 4.2 file format is a text file without a header, with WordPerfect codes embedded. We use a dry-parsing heuristic to detect this kind of file. We try to check whether the "codes" in the file follow the logic that "codes" in a WP file would follow. The problem is that this file contains an even number of "é" characters encoded as 0xE9. 0xE9 in a WP42 file is a variable-length function and we normally scan for a closing 0xE9 if we find an opening one. Their being even in number makes us believe that it is a WP42 file with two 0xE9 codes. The problem is that I cannot do much with this kind of logic. Without this heuristic I am unable to detect any WP42 file at all. It is true that some special cases of text files can pass through this filter, but the workaround is to rename the extension of the file.
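
A simplified sketch of the pairing logic described above, not libwpd's actual code: it only models the rule that every opening 0xE9 must be followed by a closing 0xE9, and ignores every other WP 4.2 code. The function name `looks_like_wp42` is hypothetical.

```python
WP42_FUNCTION_BYTE = 0xE9  # 'é' in Latin-1; a variable-length function code in WP 4.2


def looks_like_wp42(data: bytes) -> bool:
    """Dry-parse the data and accept it if every opening 0xE9 has a closing 0xE9."""
    inside_function = False
    closed_groups = 0
    for byte in data:
        if byte == WP42_FUNCTION_BYTE:
            if inside_function:
                closed_groups += 1  # this byte closes the open function
            inside_function = not inside_function
    # Accept only if at least one function was seen and none is left open.
    return closed_groups > 0 and not inside_function
```

Under this simplification, text with 2 or 4 "é" bytes passes as WP 4.2 while text with 1, 3 or 5 does not, which matches the pattern reported in comments 3 and 4.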
Comment 7 Jean-Luc 2014-05-16 08:30:19 UTC
(In reply to comment #6)
> I have a big problem with this. I see that libwpd is detecting it as a WP
> 4.2 file and I understand why. Now, if you change the extension of this file
> to *.sylk, it will load well. This can be a workaround for these rare
> cases.
> 
> The problem lies in the fact that the WP 4.2 file format is a text file without
> a header, with WordPerfect codes embedded. We use a dry-parsing heuristic to
> detect this kind of file. We try to check whether the "codes" in the file
> follow the logic that "codes" in a WP file would follow. The problem is that
> this file contains an even number of "é" characters encoded as 0xE9. 0xE9 in
> a WP42 file is a variable-length function and we normally scan for a closing
> 0xE9 if we find an opening one. Their being even in number makes us believe
> that it is a WP42 file with two 0xE9 codes. The problem is that I cannot
> do much with this kind of logic. Without this heuristic I am unable to
> detect any WP42 file at all. It is true that some special cases of text
> files can pass through this filter, but the workaround is to rename the
> extension of the file.

Thank you for your explanations. Renaming the files so that the extension matches their true format is the logical approach.
This behavior is different from previous versions of LibreOffice, OpenOffice or MS Excel ... We will have to adapt the legacy applications that generate such files.

I am very impressed with your responsiveness.
Comment 8 Maxim Monastirsky 2014-08-10 05:15:00 UTC
*** Bug 82393 has been marked as a duplicate of this bug. ***
Comment 9 Urmas 2014-08-10 05:43:34 UTC
So all the text file formats are sacrificed because of one obscure format which botches up the autodetection? WP4.2 should be triggered by extension only or by manual filter selection, not applied to any file then.
Comment 10 Maxim Monastirsky 2015-06-17 13:08:01 UTC
*** Bug 80016 has been marked as a duplicate of this bug. ***
Comment 11 Maxim Monastirsky 2015-06-17 13:15:57 UTC
*** Bug 64894 has been marked as a duplicate of this bug. ***
Comment 12 Maxim Monastirsky 2015-11-27 12:41:30 UTC
*** Bug 96098 has been marked as a duplicate of this bug. ***
Comment 13 Fabio Bas 2015-11-27 13:48:50 UTC
Created attachment 120839
RTF file with .doc extension

Attached another test case. The same bug happens for rich text format (RTF) files using a .doc extension (this was quite common in the past).
Has anyone bisected this to find the commit that caused the problem?
I see a simple solution in lowering the priority of the WordPerfect import filter, i.e. trying to load the file with all the other formats first, and then testing WordPerfect last. Could this work?
Comment 14 Fabio Bas 2015-11-27 13:51:59 UTC
Reopening: is there a reason why this has been marked as wontfix without an explanation?
Comment 15 Maxim Monastirsky 2015-11-28 17:51:35 UTC
(In reply to Fabio Bas from comment #13)
> Has anyone bisected this to find the commit that caused the problem?
What exactly do you want to bisect? The problem is explained in comment 6. There is no way to make the WP detection "smarter" about this.

> I see a simple solution in lowering the priority of the WordPerfect import
> filter, i.e. trying to load the file with all the other formats first, and
> then testing WordPerfect last. Could this work?
This is an interesting idea. Looking at the detection priority list in filter/source/config/cache/typedetection.cxx reveals that the WP detection indeed has higher priority than other text-based formats. I tried to lower it, and it indeed solved the issue. It would be good to test it more before pushing such a change.

Unfortunately such an approach won't solve all issues, e.g. CSV or just plain text files with a strange extension (see the duplicates of this bug) would still fail, because there is no way to "detect" such files before the WP detection catches them.
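
The effect of the reordering, and its limit, can be illustrated with a toy first-match detector. This is only a sketch and not the logic in typedetection.cxx: the detector functions, the type names other than calc_SYLK, and the even-0xE9 stand-in for the WP check are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

Detector = Tuple[str, Callable[[bytes], bool]]


def is_sylk(data: bytes) -> bool:
    return data.startswith(b"ID;")        # SYLK files begin with an ID record


def is_rtf(data: bytes) -> bool:
    return data.startswith(b"{\\rtf")     # RTF files begin with {\rtf


def is_wp42_like(data: bytes) -> bool:
    count = data.count(b"\xe9")           # crude stand-in for the paired-0xE9 check
    return count > 0 and count % 2 == 0


def detect(data: bytes, detectors: List[Detector]) -> Optional[str]:
    """Return the name of the first detector, in priority order, that accepts the data."""
    for name, accepts in detectors:
        if accepts(data):
            return name
    return None


# Hypothetical type names, except calc_SYLK which appears in this thread.
WP_FIRST: List[Detector] = [("wordperfect", is_wp42_like), ("calc_SYLK", is_sylk), ("rtf", is_rtf)]
WP_LAST: List[Detector] = [("calc_SYLK", is_sylk), ("rtf", is_rtf), ("wordperfect", is_wp42_like)]
```

With the WP check last, a SYLK file containing two 0xE9 bytes is claimed by the SYLK detector first. A tab-delimited text file with two 0xE9 bytes has no signature of its own, so the WP check still claims it in either order.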

(In reply to Fabio Bas from comment #14)
> Reopening: is there a reason why this has been marked as wontfix without an
> explanation?
Well, the explanation is in comment 6. Anyway REOPENED isn't the right status for this, let's keep it as NEW instead.
Comment 16 David Tardon 2015-12-02 19:42:15 UTC
(In reply to Maxim Monastirsky from comment #15)
> This is an interesting idea. Looking at the detection priority list in
> filter/source/config/cache/typedetection.cxx reveals that the WP detection
> has indeed higher priority than other text based formats.

That's because only WP 4.2 format is text-based. The other WordPerfect formats supported by libwpd are binary. And the list prioritizes binary formats. But if moving it down the list works, I've got nothing against it.
Comment 17 Maxim Monastirsky 2015-12-02 21:24:10 UTC
(In reply to David Tardon from comment #16)
> That's because only WP 4.2 format is text-based.
Yes, and also WP1.

> And the list prioritizes binary formats.
BTW, there are some odd things in this list, like calc_SYLK & calc_DIF which are text based formats, but listed together with binary formats. Any idea why it was done that way?

> But if moving it down the list works, I've got nothing against it.
And by moving it below "generic_HTML", it should also be possible to avoid workarounds like the "calc_HTML" one.
Comment 18 David Tardon 2015-12-03 09:11:25 UTC
(In reply to Maxim Monastirsky from comment #17)
> BTW, there are some odd things in this list, like calc_SYLK & calc_DIF which
> are text based formats, but listed together with binary formats. Any idea
> why it was done that way?

Not really. Maybe "binary" is used loosely, in the sense "if the format has a standard header, which can be used for detecting it, it is binary"? It would explain why T602 is in that section too.


> And by moving it below "generic_HTML", it should also be possible to avoid
> workarounds like the "calc_HTML" one.

Yes, likely.
Comment 19 Ian Stuart 2017-04-23 16:17:04 UTC
We have just encountered a minor variation of this problem at a site where the opening of files with .xls extensions had been working flawlessly on an old version of LibreOffice. The upgrade has had, shall we say, deleterious effects.

The behaviour on LibreOffice 5.3.2.2 on Mint and OpenSuse Leap 42, however, is erratic: sometimes LibreOffice opens with scalc and other times it ignores the --calc switch specified on the command line.

The files generated are simple text files containing tab-delimited data from a single application on a Linux server. The number of columns can vary from file to file but the basic structure is the same. The files are generated "on the fly" and LibreOffice is invoked by the application on demand; the user has no choice and does not have the facility to change the extension from .xls to .csv, i.e. the files are opened "for them".

The file formats are identified by file -b as either ISO-8859 text with CR line terminators or ASCII text with CR line terminators. However, changing the extension of the files identified as ASCII from .xls to .csv results in the file opening with calc, yet the file type remains ASCII.
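
Where the generating application cannot be changed, the extension workaround could be applied by a small wrapper that the application calls instead of LibreOffice. This is only a sketch under assumptions: the helper names are hypothetical, `soffice` is assumed to be on the PATH, and the observation above covers only the files identified as ASCII, so the ISO-8859 files may behave differently.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def copy_with_csv_extension(source: Path) -> Path:
    """Copy the file into a fresh temporary directory under a .csv name."""
    target_dir = Path(tempfile.mkdtemp(prefix="lo-open-"))
    target = target_dir / (source.stem + ".csv")
    shutil.copyfile(source, target)
    return target


def calc_command(source: Path) -> List[str]:
    """Build the command line that opens the renamed copy with the --calc switch."""
    return ["soffice", "--calc", str(copy_with_csv_extension(source))]
```

The wrapper would then run something like `subprocess.run(calc_command(Path("report.xls")))`. Edits made in Calc land in the temporary copy, not in the original file.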

If it will help or contribute in any way to resolving what has now become a significant problem in the life of the users affected by this feature, example files of those that open in calc and those that open in writer can be posted

Having read the explanation regarding filtering and parsing etc. (comment 8 I think), it just seems to me and the users affected by this feature that when a user specifies the application to use (in this case calc), it should be left to the user to override/determine the application.

Does Microsoft Office decide to use Word when a user opens a file with Excel?  Surely there must be a way to override LibreOffice and determine/decide what application is required.
Kind regards
Comment 20 QA Administrators 2018-04-28 02:31:35 UTC Comment hidden (obsolete)
Comment 21 Fabio Bas 2018-04-28 08:02:11 UTC
The bug is still present.
Libreoffice info:
Version: 6.0.2.1
Build ID: f7f06a8f319e4b62f9bc5095aa112a65d2f3ac89
CPU threads: 4; OS: Mac OS X 10.13.4; UI render: default; 
Locale: it-IT (it_IT.UTF-8); Calc: group
Comment 22 Maxim Monastirsky 2020-06-14 08:05:41 UTC
*** Bug 133282 has been marked as a duplicate of this bug. ***
Comment 23 QA Administrators 2022-06-15 03:41:19 UTC Comment hidden (obsolete)
Comment 24 Fabio Bas 2022-06-15 08:31:22 UTC
The bug is still present; I just tested both attachments.

Version: 7.3.4.2 / LibreOffice Community
Build ID: 30(Build:2)
CPU threads: 8; OS: Linux 5.18; UI render: default; VCL: gtk3
Locale: it-IT (it_IT.UTF-8); UI: it-IT
SlackBuild for 7.3.4 by Eric Hameleers
Calc: threaded