137175 – FILEOPEN .XLS/HTML incorrect codepage while open many reports

Status:	NEW

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Calc
Version: (earliest affected)	6.4.4.2 release
Hardware:	All All

Importance:	medium enhancement
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	HTML-Import

Reported:	2020-10-01 08:54 UTC by Andrew
Modified:	2023-04-02 16:26 UTC
CC List:	6 users

See Also:	HTML-Import 150964
Crash report or crash signature:

Attachments
Original .XLS report (14.36 KB, application/wps-office.xls) 2020-10-01 08:54 UTC, Andrew
Screenshot opening original file and its content (195.93 KB, image/png) 2020-10-01 08:54 UTC, Andrew
Original .XLS report translated to cp1251 (14.57 KB, text/html) 2020-10-01 08:55 UTC, Andrew
Screenshot opening translated file and its content (188.65 KB, image/png) 2020-10-01 08:55 UTC, Andrew

Description Andrew 2020-10-01 08:54:03 UTC

Created attachment 165991
Original .XLS report

Many report builders save their output with an .XLS extension even though the content is actually HTML. LibreOffice now opens such files, but it does not recognize Unicode or CP-1251 text in them.

Comment 1 Andrew 2020-10-01 08:54:55 UTC

Created attachment 165992
Screenshot opening original file and its content

Comment 2 Andrew 2020-10-01 08:55:27 UTC

Created attachment 165993
Original .XLS report translated to cp1251

Comment 3 Andrew 2020-10-01 08:55:58 UTC

Created attachment 165994
Screenshot opening translated file and its content

Comment 4 Roman Kuznetsov 2020-10-02 06:09:50 UTC

I confirm the behavior in current 7.1.

Mike, what do you think about it? Should LO know that it isn't really an XLS file but an HTML report? NOTOURBUG?

Comment 5 Mike Kaganski 2020-10-02 06:20:07 UTC

(In reply to Roman Kuznetsov from comment #4)
> Mike, what do you think about it? Should LO know that it isn't really an XLS
> file but an HTML report? NOTOURBUG?

LibreOffice knows it's an HTML. Otherwise, it would not import its structure correctly.

The so-called "builders" are so cute - they naturally consider proper HTML structure (with header/body, a meta element declaring the encoding, etc.) to be rocket science, and rely on some magic auto-detection of the encoding by the software. The "HTML" lacks everything, including even the <html> and <body> tags. So LibreOffice detects HTML, sees that the metadata (encoding info) is absent, and just assumes cp-1252.

As an enhancement to encourage those "builders" (the brilliant samples of shitcode) to keep generating those awful reports, we could try to use something like what was implemented recently for tdf#60145 when the absence of metadata is detected.
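The effect described above can be reproduced in a few lines. This is a minimal sketch, not LibreOffice code: the sample fragment and the `has_declared_charset` helper are assumptions made up for illustration. It shows that a bare cp-1251 fragment carries no encoding declaration, and that decoding it as cp-1252 succeeds without error but produces unreadable text.

```python
import re

# Hypothetical report fragment: no <html>, <body>, or <meta charset=...>,
# with the bytes written in cp-1251 as a Russian report builder might do.
fragment = "<table><tr><td>Отчёт</td></tr></table>".encode("cp1251")


def has_declared_charset(data: bytes) -> bool:
    """Return True if the bytes contain a charset declaration in a <meta> tag."""
    return re.search(rb"<meta[^>]*charset\s*=", data, re.IGNORECASE) is not None


declared = has_declared_charset(fragment)

# Decoding with the assumed default (cp-1252) does not fail, it just yields
# the wrong characters; only the correct code page restores the text.
as_cp1252 = fragment.decode("cp1252")
as_cp1251 = fragment.decode("cp1251")

print(declared)
print(as_cp1252)
print(as_cp1251)
```

Because the wrong decoding raises no error, an importer has no signal that its assumption was wrong unless it inspects the content itself.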

Comment 6 Andrew 2020-10-02 06:26:59 UTC

I understand that it is "shitcode". But it lives on ...

However, the open dialog contains an option to select the encoding, and it does not work. That is a problem.

Comment 7 Mike Kaganski 2020-10-02 06:29:09 UTC

(In reply to Andrew from comment #6)
> However, the open dialog contains an option to select the encoding, and it
> does not work. That is a problem.

It doesn't contain options for encoding, only for language. The language is used to decide which locale to use to detect numbers (i.e., which decimal/thousand separators, currency, etc. to use). It has nothing to do with encoding.
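To illustrate why the language choice matters for numbers but not for encoding, here is a small sketch. The `parse_number` helper and the `SEPARATORS` table are hypothetical and are not taken from LibreOffice or its locale data; they only show how the same cell text can yield different values depending on the assumed separators.

```python
def parse_number(text: str, decimal_sep: str, group_sep: str) -> float:
    """Parse a number string using the given decimal and grouping separators."""
    cleaned = text.replace(group_sep, "").replace(decimal_sep, ".")
    return float(cleaned)


# Hypothetical separator table: (decimal separator, grouping separator).
SEPARATORS = {
    "en-US": (".", ","),
    "de-DE": (",", "."),
}

cell = "1,234"
as_english = parse_number(cell, *SEPARATORS["en-US"])  # comma groups thousands
as_german = parse_number(cell, *SEPARATORS["de-DE"])   # comma is the decimal point

print(as_english, as_german)
```

A real locale-aware parser would also validate group positions, currency symbols, and so on; the point here is only that this setting operates on already-decoded text.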

Comment 8 Andrew 2020-10-02 06:36:14 UTC

Hmm... Ok. This is not so clear from the text of the dialogue.

Thanks. I think this can be seen as an improvement.

Comment 9 Mike Kaganski 2020-10-02 06:37:26 UTC

Anyway, introducing a generic method to detect encoding for texts (which should be used by various filters when they fail to recognize the encoding themselves, presumably based on ICU as tdf#60145 fix does) is a valid enhancement request...
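As a rough sketch of what such a fallback chain could look like: the code below is not the tdf#60145 fix and does not use ICU. The `guess_encoding` function, its order of checks, and the crude final heuristic (more high bytes than ASCII letters in the visible text suggests a Cyrillic code page) are all assumptions chosen for illustration.

```python
import codecs
import re

META_CHARSET = re.compile(rb"""<meta[^>]*charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE)
TAG = re.compile(rb"<[^>]*>")


def guess_encoding(data: bytes) -> str:
    """Guess an encoding for HTML-like bytes using a chain of fallbacks."""
    # 1. A byte order mark is the strongest signal.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # 2. An explicit declaration in a <meta> tag comes next.
    declared = META_CHARSET.search(data)
    if declared:
        return declared.group(1).decode("ascii").lower()
    # 3. Legacy single-byte text with high bytes rarely forms valid UTF-8,
    #    so a successful strict decode is good evidence for UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # 4. Crude heuristic (assumption): in Cyrillic text most letters are
    #    high bytes, while in Western European text they are a minority.
    text = TAG.sub(b"", data)
    high = sum(byte >= 0x80 for byte in text)
    ascii_letters = sum(chr(byte).isalpha() for byte in text if byte < 0x80)
    return "cp1251" if high > ascii_letters else "cp1252"


report = "<table><tr><td>Отчёт за месяц</td></tr></table>".encode("cp1251")
print(guess_encoding(report))
```

A statistical detector such as the one in ICU weighs many more candidate encodings and reports a confidence value, which is why a shared helper based on it would be preferable to per-filter heuristics like step 4.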

Comment 10 Mike Kaganski 2020-10-02 06:43:42 UTC

(In reply to Andrew from comment #8)

Improving the wording of the dialog is also a valid enhancement request (kompi: a hint ;-) - I'd make that a separate request)

Comment 11 Xisco Faulí 2020-11-11 15:02:52 UTC

(In reply to Mike Kaganski from comment #9)
> Anyway, introducing a generic method to detect encoding for texts (which
> should be used by various filters when they fail to recognize the encoding
> themselves, presumably based on ICU as tdf#60145 fix does) is a valid
> enhancement request...

Moving to NEW and changing to enhancement