Bug 88821 - META CHARSET support for CALC HTML import
Summary: META CHARSET support for CALC HTML import
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version (earliest affected): 4.3.5.2 release
Hardware: Other All
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard: target:5.3.0
Keywords:
Depends on:
Blocks:
 
Reported: 2015-01-27 13:03 UTC by grofaty
Modified: 2016-11-23 11:59 UTC
CC List: 5 users

See Also:
Crash report or crash signature:


Attachments
slovenian_utf-8.csv (54 bytes, text/csv), 2015-01-27 13:03 UTC, grofaty
slovenian_utf-8.html (367 bytes, text/html), 2015-01-27 13:04 UTC, grofaty
slovenian_windows-1250.csv (48 bytes, text/csv), 2015-01-27 13:05 UTC, grofaty
slovenian_windows-1250.html (368 bytes, text/html), 2015-01-27 13:05 UTC, grofaty
Tools_Options_Language_settings.png (41.45 KB, image/png), 2015-01-27 13:05 UTC, grofaty
Screenshot (15.85 KB, image/png), 2015-01-28 00:16 UTC, m.a.riosv
Problem_explained_in_detail.png (384.91 KB, image/png), 2015-01-28 07:29 UTC, grofaty
correct import (403 bytes, text/html), 2015-01-28 18:42 UTC, raal
5_1_master_additional_tests.png (29.05 KB, image/png), 2015-11-19 11:20 UTC, grofaty
Example of FR file not imported OK (11.34 KB, text/html), 2016-09-19 15:02 UTC, bureautiquelibre

Description grofaty 2015-01-27 13:03:43 UTC
Created attachment 112821
slovenian_utf-8.csv

In LibreOffice Calc 4.3.5.2 on Windows 7, Calc wrongly assumes that an HTML input file is ALWAYS in the Windows-1252 code page.

MY SETTINGS:
1. Tools | Options | Language Settings | Languages
a) User interface: English (USA)
b) Locale settings: Slovenian
See attachment Tools_Options_Language_settings.png


TEST 1:
1. Start Calc.
2. File | Open
3. Select slovenian_utf-8.csv
4. Text Import dialog opens.
4a. Character set: "Unicode (UTF-8)".
4b. Language selection: Slovenian.
4c. Separated by: Semicolon
4d. Check: Quoted field as text and check Detect special numbers.
4e. Click on Open button.
Result: WORKS FINE. Non-English characters in the Text field are correctly recognized as UTF-8 (setting 4a), the decimal separator is correctly recognized as a comma (setting 4b), and the Date field is correctly recognized as a date (setting 4b).


TEST 2:
The same as Test 1 except:
3. Select slovenian_windows-1250.csv
4a.  Character set: "Eastern Europe Windows-1250".
Result: WORKS FINE. Non-English characters in Text field are correctly recognized as Windows-1250 code page.


TEST 3:
The same as Test 1 except:
3. Select slovenian_utf-8.html
4. The source of the problem is probably the "Import Options" dialog: it has no option to select a character set. The only option is language selection. I selected Custom: Slovenian.
Result: PROBLEM. Non-English characters in A2 field are corrupted.

If the same file is opened (File | Open File) in the Firefox 35 web browser, the non-English characters are displayed correctly: the file contains the HTML tag meta charset=UTF-8, so the browser correctly recognizes the character set. But if you now select View | Character Encoding | Western in Firefox, you get the same corrupted non-English text as in LibreOffice. So LibreOffice just assumes that an input HTML file is ALWAYS in the Windows-1252 (Western) code page.


TEST 4:
The same as Test 1 except:
3. Select: slovenian_windows-1250.html
4. The same problem as in Test 3: the "Import Options" dialog does not contain a character set option to select from. So I just selected Custom: Slovenian.
Result: PROBLEM. Non-English characters in A2 field are corrupted.

If the same file is opened in the Firefox 25 web browser, the non-English characters are correctly recognized as Windows-1250 (because of the HTML meta charset tag). Now change the encoding with View | Character Encoding | Western and you get the same corrupted non-English characters as in LibreOffice. So LibreOffice just assumes that an input HTML file is ALWAYS in the Windows-1252 (Western) code page.


In my humble opinion, the problem with importing HTML files is that LibreOffice assumes the source HTML file is ALWAYS encoded in the Western (Windows-1252) code page.
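
The corruption described in Tests 3 and 4 can be reproduced outside LibreOffice. The following is a minimal Python sketch, assuming two short illustrative Slovenian words rather than the contents of the attached files: bytes written in UTF-8 or Windows-1250 are decoded as Windows-1252, which is what the "Western" setting does.

```python
# Illustrative strings only; these are not the contents of the attachments.
utf8_bytes = "šola".encode("utf-8")      # b'\xc5\xa1ola'
cp1250_bytes = "čaj".encode("cp1250")    # b'\xe8aj'

# Decoding with the wrong code page (Windows-1252) corrupts the text.
wrong_utf8 = utf8_bytes.decode("cp1252")
wrong_cp1250 = cp1250_bytes.decode("cp1252")

# Decoding with the code page the file was written in recovers it.
right_utf8 = utf8_bytes.decode("utf-8")
right_cp1250 = cp1250_bytes.decode("cp1250")

print(wrong_utf8, wrong_cp1250)   # Å¡ola èaj
print(right_utf8, right_cp1250)   # šola čaj
```

Note that the bytes never change; only the code page used to interpret them does.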


How the problem should be fixed:

Quick fix: According to statistics (see e.g. https://en.wikipedia.org/wiki/UTF-8), most web pages nowadays are UTF-8 encoded. UTF-8 is also a universal encoding for ANY language in the world. Set the default code page of the Calc HTML import filter to UTF-8.


Permanent fix: Create the same import dialog for HTML files as the one for importing CSV files (or at least add a "Character set" option). In that case, please check the meta charset tag in the HTML file and preselect the code page it declares, if it exists. If the meta tag does not exist, default to UTF-8.
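
The selection rule proposed above (use the declared charset if there is one, otherwise UTF-8) could look roughly like the following Python sketch. It is an illustration of the idea only, not LibreOffice code; `MetaCharsetFinder` and `guess_html_encoding` are hypothetical names, and scanning only the first 4096 bytes is an assumption.

```python
import re
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Remembers the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if attributes.get("charset"):
            # Short form: <meta charset="UTF-8">
            self.charset = attributes["charset"].strip()
            return
        if (attributes.get("http-equiv") or "").lower() == "content-type":
            # Long form: <meta http-equiv="content-type" content="...; charset=UTF-8">
            match = re.search(r"charset\s*=\s*([\w.:-]+)",
                              attributes.get("content") or "", re.I)
            if match:
                self.charset = match.group(1)


def guess_html_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Return the declared charset in lower case, or the default if none is declared."""
    finder = MetaCharsetFinder()
    # Latin-1 maps every byte to a character, so the ASCII markup survives
    # regardless of the real encoding of the file.
    finder.feed(raw[:4096].decode("latin-1"))
    return (finder.charset or default).lower()
```

The returned name would then be the preselected entry in a "Character set" option, which the user could still override.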
Comment 1 grofaty 2015-01-27 13:04:27 UTC
Created attachment 112822
slovenian_utf-8.html
Comment 2 grofaty 2015-01-27 13:05:02 UTC
Created attachment 112823
slovenian_windows-1250.csv
Comment 3 grofaty 2015-01-27 13:05:22 UTC
Created attachment 112824
slovenian_windows-1250.html
Comment 4 grofaty 2015-01-27 13:05:44 UTC
Created attachment 112825
Tools_Options_Language_settings.png
Comment 5 m.a.riosv 2015-01-28 00:16:51 UTC
Created attachment 112844
Screenshot

Hi, thanks for reporting.

In the Open dialog there is an option to select "HTML Document (Calc)".

It works fine for me.
Comment 6 grofaty 2015-01-28 06:20:02 UTC
@m.a.riosv, I haven't noticed that option.
I did:
1. Open Calc
2. File | Open
3. In Open dialog at right-bottom changed "All files *.*" to "HTML Document (Calc) (*.html, *.htm)" and clicked Open button.
4. But now the same "Import Option" window opens that only has a language option, so I chose Slovenian
Comment 7 grofaty 2015-01-28 07:28:51 UTC
I was probably not clear enough. I am now attaching one big screenshot with more information, including evidence from external web sources.

If you can't reproduce this problem, then please specify which is the LibreOffice version and operating system name/version you are testing on.
Comment 8 grofaty 2015-01-28 07:29:39 UTC
Created attachment 112851
Problem_explained_in_detail.png
Comment 9 grofaty 2015-01-28 07:36:34 UTC
One additional tip: on this bug tracker, please do NOT click on an attachment and copy/paste its text into a text editor, because you can mess up the encoding (you really need to know what you are doing not to mess it up). Instead, in Firefox right-click the link of the attachment slovenian_utf-8.html and/or slovenian_windows-1250.html, select "Save Link As..." and save the file to your disk. Then try to open it with LibreOffice.
Comment 10 raal 2015-01-28 18:42:34 UTC
Created attachment 112899
correct import

Hi grofaty,
please try to import this file. 
Open my file and yours in a text editor and compare.

The difference is in the meta tag:
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />

Your file is probably HTML5, and LO recognizes only the HTML 4 form - see http://www.w3schools.com/tags/att_meta_charset.asp

I think this is not a bug, but maybe an enhancement.
Comment 11 grofaty 2015-01-29 10:01:06 UTC
@raal,
I tried your solution and it works. You also made two little mistakes:
a) You ended the tag with /, which is deprecated XHTML syntax. I removed the trailing / character (and the space before it).
b) You put the meta tag at the top of the document, but it should be inside the "head" tags.
Despite these two little mistakes your file works fine; I fixed them, and the new file also works fine.

I would probably never have guessed this HTML 5 vs. HTML 4.x problem, because browsers accept either of the two forms, and even OLD browsers accept both! I tested both files in the ancient Internet Explorer 6 on Windows 2003, and it accepts the HTML 5 syntax without a problem.

In the browser world this HTML 5 change was designed to be backward compatible, but it looks like LibreOffice was not designed to handle it. Whether this is a bug or an enhancement is a matter of perspective. In my humble opinion it is a bug.

It is probably easy to fix... The code for HTML 4.0 works fine; just add a few lines of code to support HTML 5.0.

Just a note: I have no influence on how web pages are designed (with the old or the new syntax), and I can't ask web server administrators to change back to the HTML 4.x syntax, because this is not a bug in browsers but in LibreOffice.

Thanks for testing this problem.

P.S. I also tested the Windows-1250 code page: I converted the HTML 5 syntax into the old HTML 4.x syntax, and LibreOffice now opens the files correctly.
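
For affected versions, the manual conversion mentioned in the P.S. can be scripted. This is a rough Python sketch of such a workaround, assuming the declaration appears as a plain `<meta charset=...>` tag; `to_html4_meta` and the regular expression are hypothetical helpers, not part of any LibreOffice tooling.

```python
import re

# Matches the short form, e.g. <meta charset="UTF-8"> or <meta charset=UTF-8 />.
HTML5_META = re.compile(
    r'<meta\s+charset\s*=\s*["\']?([\w.:-]+)["\']?\s*/?>', re.IGNORECASE)


def to_html4_meta(raw: bytes) -> bytes:
    """Rewrite the first short-form charset declaration into the HTML 4 form."""
    # Latin-1 round-trips every byte unchanged, so the body text is not
    # re-encoded, whatever its real code page is.
    text = raw.decode("latin-1")
    text = HTML5_META.sub(
        r'<meta http-equiv="content-type" content="text/html; charset=\1">',
        text, count=1)
    return text.encode("latin-1")
```

Working on raw bytes avoids the copy/paste encoding pitfalls: only the declaration changes, and every other byte of the file stays as it was.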
Comment 12 grofaty 2015-01-29 10:05:31 UTC
I changed title of bug from:
Import HTML file into Calc wrongly assumes source code-page is Windows-1252

to:
Import HTML file into Calc wrongly assumes source code-page is Windows-1252 for HTML 5 files. It correctly recognizes code page for old HTML 4.x files.

to make it more clear what the problem is.
Comment 13 raal 2015-01-29 10:52:17 UTC
grofaty,
thanks for testing. Setting as new.
Comment 14 grofaty 2015-03-18 09:12:47 UTC
Retested in LibreOffice Calc 4.4.1.2 on Windows 7 and problem still appears.
Comment 15 grofaty 2015-03-18 10:20:00 UTC
Playing around in LibreOffice 4.4.1.2:
Besides the HTML 4 XML-style variant:
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
and original html-4 variant:
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
it also works:
<meta http-equiv="content-type" content="application/vnd.ms-excel; charset=UTF-8">
and if stripped down it also works:
<meta http-equiv="content-type" content="x/x; charset=UTF-8">
(or any other combination of one or more characters instead of x).


But as stated before it does not work (this bug report) for:
<meta charset="UTF-8">
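
The pattern in the variants above is consistent with an importer that reads only the charset= parameter inside a content attribute and ignores the media type in front of it. The following Python toy model reproduces that pattern; it is an assumption made for illustration and not LibreOffice's actual parsing code, and `legacy_charset` is a hypothetical name.

```python
import re

# Toy rule: accept a charset only when it appears inside content="...".
LEGACY = re.compile(r'content\s*=\s*"[^"]*charset=([\w-]+)', re.IGNORECASE)

VARIANTS = [
    '<meta http-equiv="content-type" content="text/html; charset=UTF-8" />',
    '<meta http-equiv="content-type" content="text/html; charset=UTF-8">',
    '<meta http-equiv="content-type" content="application/vnd.ms-excel; charset=UTF-8">',
    '<meta http-equiv="content-type" content="x/x; charset=UTF-8">',
    '<meta charset="UTF-8">',
]


def legacy_charset(tag: str):
    """Return the charset found by the toy rule, or None if it finds nothing."""
    match = LEGACY.search(tag)
    return match.group(1) if match else None


for tag in VARIANTS:
    print(legacy_charset(tag), tag)
```

Under this model the first four variants yield UTF-8 and the last one yields nothing, which matches the behaviour reported here.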
Comment 16 grofaty 2015-07-06 10:59:33 UTC
Retested on LibreOffice v.5.0.0.2 on Windows 7 and problem still persists.
Comment 17 grofaty 2015-07-06 11:08:32 UTC
Retested on master on Windows 7 and problem still persists.

From Help | About dialog:

Version: 5.1.0.0.alpha1+ (x64)
Build ID: 89c77994d4638c86635c70535fab6508e2f3d900
TinderBox: Win-x86_64@62-TDF, Branch:MASTER, Time: 2015-07-06_06:19:35
Locale: sl-SI (sl_SI)
Comment 18 bureautiquelibre 2015-11-17 10:11:13 UTC
Hi,

Do you have any news about this issue?
Is the fix still planned for version 5.1?

I have the same problem when importing UTF-8 HTML in Calc.
Calc asks me to select a language, but it has no effect (I use French). All accents are dropped.
Writer seems to handle it fine.

Best Regards,
Eric Ficheux
Comment 19 grofaty 2015-11-17 14:04:49 UTC
@bureautiquelibre@nantesmetropole.fr,
I haven't tested this bug since 2015-07-06, on "Version: 5.1.0.0.alpha1+ (x64)", where the bug was still present. Since no developer has touched it, the bug most probably still exists.

I don't think this bug is planned to be fixed in 5.1.x. No developer is assigned to it yet, so most probably it will not be fixed in 5.1.0.

As you know, this is open-source software. A team of volunteers decides on bug importance, and if they decide a bug is not critical, it is probably not going to be fixed anytime soon. You have two options: wait until someone fixes the problem, which can take years, or pay someone to fix it, such as a professional company: https://www.libreoffice.org/get-help/professional-support/
In my case the bug is annoying but not critical enough to require paid professional support. I still hope it is going to be fixed some day.
Comment 20 grofaty 2015-11-19 11:20:43 UTC
Created attachment 120659
5_1_master_additional_tests.png
Comment 21 grofaty 2015-11-19 11:21:01 UTC
I did additional tests with version 5.0.3.2 and today's master (see version details at the bottom of this post).

Calc/Writer 5.0.3.2 and Calc/Writer from today's master behave 100% the same, so there is nothing new in master regarding this problem.

Test 1 - Open LibreOffice Calc, then Open icon from toolbar, select file and in next dialog select Automatic.
1. Open file correct_import.html  OK.
2. Open file slovenian_utf-8.html PROBLEM.
3. Open file slovenian_windows-1250.html PROBLEM.

Test 2 - Close all Calc files, Open LibreOffice Writer, then Open icon from toolbar and select file.
1. Open file correct_import.html  OK.
2. Open file slovenian_utf-8.html OK!!! Surprise.
3. Open file slovenian_windows-1250.html PROBLEM.

I am attaching a new file, 5_1_master_additional_tests.png, showing a comparison of the above tests.

What is interesting is that in Test 2/2 the file is opened correctly in Writer, but the same file is opened incorrectly in Calc.

P.S. It is also interesting that, besides the Open dialog, Calc opens a second dialog, Import Options, and I can't figure out whether it has any effect on .html files. It looks like it has zero effect.

==============
MASTER VERSION
==============
Version: 5.1.0.0.alpha1+
Build ID: 66d2b72667792cb18b25805387824d636e2a455c
TinderBox: Win-x86@39, Branch:master, Time: 2015-11-18_02:35:53
Locale: sl-SI (sl_SI)
Comment 22 bureautiquelibre 2015-11-19 11:45:46 UTC
OK, thanks for the information.

The issue I had was with a data export from third party software.

Currently, we're asking our supplier to switch from HTML export (with an .xls extension so that it opens in Calc, which is very bad practice) to CSV export (better practice).

I should have access to professional support for LibreOffice in a few months but this bug isn't critical for my organization ATM, so it's unlikely that we ask for a fix.

Best Regards,
Eric Ficheux
Comment 23 grofaty 2015-12-21 09:23:42 UTC
I retested all 6 tests from "Comment 21" on master on Windows 7 and the result is the same as in the previously attached image "5_1_master_additional_tests.png". So the problem still persists.



From Help | About:

Version: 5.2.0.0.alpha0+
Build ID: 014633f83e44ae8ba33087b6f38e8e253e281969
CPU Threads: 3; OS Version: Windows 6.1; UI Render: default; 
TinderBox: Win-x86@62-merge-TDF, Branch:MASTER, Time: 2015-12-15_06:21:44
Locale: sl-SI (sl_SI)
Comment 24 grofaty 2016-06-29 07:46:58 UTC
Retested on LibreOffice master (see full version below); the problem is exactly the same as in "Comment 21" and the result is exactly the same as in the attached 5_1_master_additional_tests.png.


==============
MASTER VERSION
==============
Version: 5.3.0.0.alpha0+
Build ID: 0325b22a2a2b537a71f53b7c5d3e6c13fef68911
Comment 25 bureautiquelibre 2016-09-19 15:02:13 UTC
Created attachment 127433
Example of FR file not imported OK
Comment 26 Jan Holesovsky 2016-11-03 22:13:24 UTC
When investigating this bug, it turned out that the problem from comment 25 has a different root cause than the problem from comment 1.

The problem from comment 1 is simply missing support for <meta charset="...">.

The problem from comment 25 is deeper - turns out that the UTF-8 BOM (Byte Order Mark) is confusing the setting of the text converter that is used for the parsing of the file.

Either way, I've fixed both these issues: I fixed the BOM handling (for the bug from comment 25), and implemented the support for <meta charset="..."> (comment 1 and comment 2).

I'll push the fixes to master shortly.
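
As a rough illustration of the BOM case described above (not the committed patch), the following Python sketch treats a leading UTF-8 BOM as settling the encoding and strips it before the markup is parsed. `split_utf8_bom` is a hypothetical helper name.

```python
import codecs


def split_utf8_bom(raw: bytes):
    """Return (encoding, payload); encoding is None when there is no UTF-8 BOM."""
    if raw.startswith(codecs.BOM_UTF8):
        # The three BOM bytes EF BB BF are not part of the document text.
        return "utf-8", raw[len(codecs.BOM_UTF8):]
    return None, raw


encoding, payload = split_utf8_bom(codecs.BOM_UTF8 + "<p>é</p>".encode("utf-8"))
print(encoding, payload.decode(encoding))   # utf-8 <p>é</p>
```

When no BOM is found, a caller would fall back to whatever the meta tag declares.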
Comment 27 Commit Notification 2016-11-03 22:18:17 UTC
Jan Holesovsky committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=b297f7bbfed83f87398231740e910afe6ebfbb97

tdf#88821: Set the encoding correctly for HTML files with a BOM.

It will be available in 5.3.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 28 Commit Notification 2016-11-03 22:18:25 UTC
Jan Holesovsky committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=84400eae86d7ae8e66f8247f4c4f3a717d90f8c0

tdf#88821: Implement support for <meta charset="..."> for HTML import.

It will be available in 5.3.0.

Comment 29 Jan Holesovsky 2016-11-03 22:21:38 UTC
Setting to FIXED.
Comment 30 grofaty 2016-11-23 11:59:44 UTC
Hi,
on master I have performed all six tests from Comment 21, and I can confirm that the problem is now fixed.

Thanks a lot for fixing this problem.
Regards

==============
MASTER VERSION
==============
Version: 5.3.0.0.alpha1+
Build ID: f965a629fba10ecba7bad938a0c1c9c3db1e510d
CPU Threads: 3; OS Version: Windows 6.1; UI Render: default; Layout Engine: new; 
TinderBox: Win-x86@62-merge-TDF, Branch:MASTER, Time: 2016-11-23_00:13:10
Locale: sl-SI (sl_SI); Calc: group