Bug 137175 - FILEOPEN .XLS/HTML incorrect codepage while open many reports
Summary: FILEOPEN .XLS/HTML incorrect codepage while open many reports
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version: 6.4.4.2 release (earliest affected)
Hardware: All All
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: HTML-Import
Reported: 2020-10-01 08:54 UTC by Andrew
Modified: 2023-04-02 16:26 UTC

See Also:
Crash report or crash signature:


Attachments
Original .XLS report (14.36 KB, application/wps-office.xls)
2020-10-01 08:54 UTC, Andrew
Screenshot opening original file and its content (195.93 KB, image/png)
2020-10-01 08:54 UTC, Andrew
Original .XLS report translated to cp1251 (14.57 KB, text/html)
2020-10-01 08:55 UTC, Andrew
Screenshot opening translated file and its content (188.65 KB, image/png)
2020-10-01 08:55 UTC, Andrew

Description Andrew 2020-10-01 08:54:03 UTC
Created attachment 165991 [details]
Original .XLS report

Many report builders save their files with an .XLS extension even though the content is really HTML. LibreOffice now opens such a file, but does not recognize Unicode or CP-1251 encoding.
Comment 1 Andrew 2020-10-01 08:54:55 UTC
Created attachment 165992 [details]
Screenshot opening original file and its content
Comment 2 Andrew 2020-10-01 08:55:27 UTC
Created attachment 165993 [details]
Original .XLS report translated to cp1251
Comment 3 Andrew 2020-10-01 08:55:58 UTC
Created attachment 165994 [details]
Screenshot opening translated file and its content
Comment 4 Roman Kuznetsov 2020-10-02 06:09:50 UTC
I confirm the behavior in current 7.1.

Mike, what do you think about it? Should LO know that it isn't really an XLS file but an HTML report? NOTOURBUG?
Comment 5 Mike Kaganski 2020-10-02 06:20:07 UTC
(In reply to Roman Kuznetsov from comment #4)
> Mike, what do you think about it? Should LO know that it isn't really an XLS
> file but an HTML report? NOTOURBUG?

LibreOffice knows it's an HTML. Otherwise, it would not import its structure correctly.

The so-called "builders" are so cute - they naturally consider proper HTML structure (with header/body, and a meta element carrying the encoding, etc.) rocket science, relying on some magic auto-detection of the encoding by the software. The "HTML" lacks everything, including even the <html> and <body> tags. So LibreOffice detects HTML, sees the absent metadata (encoding info), and just assumes cp-1252.

As an enhancement to encourage those "builders" (the brilliant samples of shitcode) to keep generating those awful reports, we could try to use something like what was implemented recently for tdf#60145 when the absence of metadata is detected.
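
To illustrate the cp-1252 fallback described above, here is a minimal Python sketch. It is not LibreOffice code; the sample string and variable names are made up for illustration, and it only assumes that the report was written in cp1251 without any encoding metadata.

```python
# Illustration only: not LibreOffice code, sample text is made up.
report_text = "<table><tr><td>Отчет</td></tr></table>"

# A report builder writes the bytes in cp1251 and adds no encoding metadata.
raw = report_text.encode("cp1251")

# With no metadata available, the importer falls back to cp1252,
# so every Cyrillic byte is mapped to an unrelated Latin-1 style character.
wrong = raw.decode("cp1252", errors="replace")

# Decoding with the encoding the file was actually written in restores the text.
right = raw.decode("cp1251")

print(wrong)
print(right)
```

The ASCII markup survives either way, which is consistent with the structure being imported correctly while the cell text turns into garbage.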
Comment 6 Andrew 2020-10-02 06:26:59 UTC
I understand that it is "shitcode". But it lives on ...

However, the open dialog contains an option to select the encoding, but it does not work. That is a problem.
Comment 7 Mike Kaganski 2020-10-02 06:29:09 UTC
(In reply to Andrew from comment #6)
> However, the open dialog contains an option to select the encoding, but it
> does not work. That is a problem.

It doesn't contain options for encoding, only for language. The language is used to decide which locale to use to detect numbers (i.e., which decimal/thousand separators, currency, etc. to use). It has nothing to do with encoding.
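
As a rough illustration of what the language option influences, the Python sketch below parses the same cell text under two locales. The `SEPARATORS` table and `parse_number` helper are hypothetical and only stand in for real locale data; this is not how LibreOffice implements it.

```python
# Hypothetical separator table; a real importer takes these from locale data.
SEPARATORS = {
    "en-US": {"decimal": ".", "group": ","},
    "de-DE": {"decimal": ",", "group": "."},
}

def parse_number(text: str, language: str) -> float:
    """Parse a cell string using the separators of the chosen language."""
    seps = SEPARATORS[language]
    # Drop grouping separators first, then normalize the decimal separator.
    normalized = text.replace(seps["group"], "").replace(seps["decimal"], ".")
    return float(normalized)

cell = "1.234,56"
as_german = parse_number(cell, "de-DE")
as_english = parse_number(cell, "en-US")
print(as_german, as_english)
```

The characters of the cell are identical in both calls; only the numeric interpretation changes, which is why this setting cannot repair a wrongly decoded file.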
Comment 8 Andrew 2020-10-02 06:36:14 UTC
Hmm... Ok. This is not so clear from the text of the dialogue.

Thanks. I think this can be seen as an improvement.
Comment 9 Mike Kaganski 2020-10-02 06:37:26 UTC
Anyway, introducing a generic method to detect encoding for texts (which should be used by various filters when they fail to recognize the encoding themselves, presumably based on ICU as tdf#60145 fix does) is a valid enhancement request...
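
A very rough Python sketch of such a shared fallback might look like the following. It is deliberately much simpler than a statistical detector such as the ICU-based one mentioned above: it only checks for a byte order mark, then tries strict UTF-8, and otherwise returns a caller-supplied legacy code page. The function name `guess_encoding` and the `legacy_fallback` parameter are assumptions made for this example.

```python
import codecs

def guess_encoding(data: bytes, legacy_fallback: str = "cp1251") -> str:
    """Guess a text encoding when the document declares none (illustrative only)."""
    # A byte order mark is the strongest hint available.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Legacy single-byte Cyrillic text is very unlikely to be valid UTF-8,
    # so a strict decode is a cheap discriminator.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: the caller supplies a sensible legacy code page,
        # for example one derived from the user's locale.
        return legacy_fallback

sample = "Отчет".encode("cp1251")
print(sample.decode(guess_encoding(sample)))
```

A filter would call something like this only after failing to find an explicit encoding declaration, so well-formed documents are unaffected.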
Comment 10 Mike Kaganski 2020-10-02 06:43:42 UTC
(In reply to Andrew from comment #8)

Another valid enhancement request is to improve the wording of the dialog (kompi: a hint ;-) - I'd make that a separate request)
Comment 11 Xisco Faulí 2020-11-11 15:02:52 UTC
(In reply to Mike Kaganski from comment #9)
> Anyway, introducing a generic method to detect encoding for texts (which
> should be used by various filters when they fail to recognize the encoding
> themselves, presumably based on ICU as tdf#60145 fix does) is a valid
> enhancement request...

Moving to NEW and changing to enhancement