Bug 95217 - Persian text not imported with "Link to External Data..." in Calc
Summary: Persian text not imported with "Link to External Data..." in Calc
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc
Version: 5.0.2.2 release (earliest affected)
Hardware: Other All
Importance: medium minor
Assignee: Giuseppe Castagno (aka beppec56)
URL:
Whiteboard: target:5.2.0 target:5.1.1
Keywords:
Depends on:
Blocks: Calc-External-Datalink
Reported: 2015-10-21 11:18 UTC by Chris Sherlock
Modified: 2017-08-28 19:15 UTC

See Also:
Crash report or crash signature:


Attachments
The HTML file that is used as the "external data" (982 bytes, text/html)
2015-10-21 11:18 UTC, Chris Sherlock
Resulting calc ODS document with corrupted Persian characters. (11.88 KB, application/vnd.oasis.opendocument.spreadsheet)
2015-10-21 11:19 UTC, Chris Sherlock
Corrupted on Linux (170.10 KB, image/png)
2015-12-26 01:43 UTC, Chris Sherlock

Description Chris Sherlock 2015-10-21 11:18:11 UTC
Created attachment 119818 [details]
The HTML file that is used as the "external data"

1. Open a new Calc spreadsheet
2. Go into Insert -> Link to External Data...
3. Browse to the text.html file (see attached)
4. Pick Automatic in the Import Options dialog box, then click on OK

Persian text is not imported. 

The original URL for the data is http://www.tsetmc.com/Loader.aspx?ParTree=15 - issue originally reported by Farid Tofighi on Google+ LibreOffice Community channel.
Comment 1 Chris Sherlock 2015-10-21 11:19:10 UTC
Created attachment 119819 [details]
Resulting calc ODS document with corrupted Persian characters.
Comment 2 Buovjaga 2015-10-21 17:38:32 UTC
Works OK here. Please attach a screenshot of the corruption.

Set to NEEDINFO.
Change back to UNCONFIRMED after you have provided the screenshot.

Win 7 Pro 64-bit, Version: 5.0.2.2 (x64)
Build ID: 37b43f919e4de5eeaca9b9755ed688758a8251fe
Locale: fi-FI (fi_FI)
Comment 3 Chris Sherlock 2015-12-26 00:41:32 UTC
Trying again - I can't seem to get LibreOffice to pull the data when I link to http://bug-attachments.documentfoundation.org/attachment.cgi?id=119818
Comment 4 Chris Sherlock 2015-12-26 01:20:51 UTC
Even more strange - the latest build of LibreOffice (from master!) is now missing this menu item.
Comment 5 Chris Sherlock 2015-12-26 01:37:11 UTC
Still occurring. My steps were a bit unclear. 

1. Open a new Calc spreadsheet
2. Go into Insert -> Link to External Data...
3. Point to https://bugs.documentfoundation.org/attachment.cgi?id=119818
4. Pick Automatic in the Import Options dialog box, then choose the all table, then click on OK

LibreOffice has problems importing. If you actually take the html file directly and import as a file, it doesn't have a problem. 

There are a lot of SAL_WARNs, though:

(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
(pkix_CacheCert_Add: PKIX_PL_HashTable_Add for Certs skipped: entry existed
warn:ucb.ucp.webdav:3764:1:ucb/source/ucp/webdav-neon/NeonSession.cxx:1703: Neon received http error: '200 OK'
warn:ucb.ucp.webdav:3764:1:ucb/source/ucp/webdav-neon/NeonSession.cxx:1703: Neon received http error: '200 OK'
warn:ucb.ucp.webdav:3764:1:ucb/source/ucp/webdav-neon/NeonSession.cxx:1703: Neon received http error: '200 OK'
:1: parser error : StartTag: invalid element name
<!doctype html>
 ^
warn:oox.storage:3764:1:oox/source/helper/zipstorage.cxx:66: ZipStorage::ZipStorage exception opening input storage: 
:1: parser error : StartTag: invalid element name
<!doctype html>
 ^
:1: parser error : StartTag: invalid element name
<!doctype html>
 ^
warn:oox.storage:3764:1:oox/source/helper/zipstorage.cxx:66: ZipStorage::ZipStorage exception opening input storage: 
warn:oox.storage:3764:1:oox/source/helper/zipstorage.cxx:66: ZipStorage::ZipStorage exception opening input storage: 
warn:oox.storage:3764:1:oox/source/helper/zipstorage.cxx:66: ZipStorage::ZipStorage exception opening input storage: 
warn:oox.storage:3764:1:oox/source/helper/zipstorage.cxx:66: ZipStorage::ZipStorage exception opening input storage: 
warn:oox.storage:3764:1:oox/source/helper/zipstorage.cxx:66: ZipStorage::ZipStorage exception opening input storage: 
warn:oox.storage:3764:1:oox/source/helper/zipstorage.cxx:66: ZipStorage::ZipStorage exception opening input storage: 
warn:oox.storage:3764:1:oox/source/helper/zipstorage.cxx:66: ZipStorage::ZipStorage exception opening input storage: 
warn:oox.storage:3764:1:oox/source/helper/zipstorage.cxx:66: ZipStorage::ZipStorage exception opening input storage: 
warn:oox.storage:3764:1:oox/source/helper/zipstorage.cxx:66: ZipStorage::ZipStorage exception opening input storage: 
warn:oox.storage:3764:1:oox/source/helper/zipstorage.cxx:66: ZipStorage::ZipStorage exception opening input storage: 
warn:oox.storage:3764:1:oox/source/helper/zipstorage.cxx:66: ZipStorage::ZipStorage exception opening input storage: 
warn:oox.storage:3764:1:oox/source/helper/zipstorage.cxx:66: ZipStorage::ZipStorage exception opening input storage: 
warn:oox.storage:3764:1:oox/source/helper/zipstorage.cxx:66: ZipStorage::ZipStorage exception opening input storage: 
warn:oox.storage:3764:1:oox/source/helper/zipstorage.cxx:66: ZipStorage::ZipStorage exception opening input storage: 
warn:oox.storage:3764:1:oox/source/helper/zipstorage.cxx:66: ZipStorage::ZipStorage exception opening input storage: 
warn:ucb.ucp.webdav:3764:1:ucb/source/ucp/webdav-neon/NeonSession.cxx:1703: Neon received http error: '200 OK'
warn:ucb.ucp.webdav:3764:1:ucb/source/ucp/webdav-neon/NeonSession.cxx:1703: Neon received http error: '200 OK'
warn:ucb.ucp.webdav:3764:1:ucb/source/ucp/webdav-neon/NeonSession.cxx:1703: Neon received http error: '200 OK'
warn:ucb.ucp.webdav:3764:1:ucb/source/ucp/webdav-neon/NeonSession.cxx:1703: Neon received http error: '200 OK'
warn:ucb.ucp.webdav:3764:1:ucb/source/ucp/webdav-neon/NeonSession.cxx:1703: Neon received http error: '200 OK'
warn:ucb.ucp.webdav:3764:1:ucb/source/ucp/webdav-neon/NeonSession.cxx:1703: Neon received http error: '200 OK'
warn:vcl:3764:1:vcl/source/window/winproc.cxx:862: ImplHandleKey: Keyboard-Input is sent to a frame without focus
warn:sfx.doc:3764:1:sfx2/source/doc/docfile.cxx:693: Physical name not convertible!
warn:ucb.ucp.webdav:3764:1:ucb/source/ucp/webdav-neon/NeonSession.cxx:1703: Neon received http error: '200 OK'
warn:ucb.ucp.webdav:3764:1:ucb/source/ucp/webdav-neon/NeonSession.cxx:1703: Neon received http error: '200 OK'
warn:ucb.ucp.webdav:3764:1:ucb/source/ucp/webdav-neon/NeonSession.cxx:1703: Neon received http error: '200 OK'
warn:legacy.tools:3764:1:editeng/source/editeng/eehtml.cxx:54: EditHTMLParser::EditHTMLParser: Where does the encoding come from?
warn:ucb.ucp.webdav:3764:1:ucb/source/ucp/webdav-neon/NeonSession.cxx:1703: Neon received http error: '200 OK'
warn:ucb.ucp.webdav:3764:1:ucb/source/ucp/webdav-neon/NeonSession.cxx:1703: Neon received http error: '200 OK'
warn:ucb.ucp.webdav:3764:1:ucb/source/ucp/webdav-neon/NeonSession.cxx:1703: Neon received http error: '200 OK'
warn:vcl:3764:1:vcl/source/window/winproc.cxx:862: ImplHandleKey: Keyboard-Input is sent to a frame without focus
warn:ucb.ucp.webdav:3764:1:ucb/source/ucp/webdav-neon/NeonSession.cxx:1703: Neon received http error: '200 OK'
warn:ucb.ucp.webdav:3764:1:ucb/source/ucp/webdav-neon/NeonSession.cxx:1703: Neon received http error: '200 OK'
warn:ucb.ucp.webdav:3764:1:ucb/source/ucp/webdav-neon/NeonSession.cxx:1703: Neon received http error: '200 OK'
warn:sfx.doc:3764:1:sfx2/source/doc/docfile.cxx:693: Physical name not convertible!
warn:ucb.ucp.webdav:3764:1:ucb/source/ucp/webdav-neon/NeonSession.cxx:1703: Neon received http error: '200 OK'
warn:ucb.ucp.webdav:3764:1:ucb/source/ucp/webdav-neon/NeonSession.cxx:1703: Neon received http error: '200 OK'
warn:ucb.ucp.webdav:3764:1:ucb/source/ucp/webdav-neon/NeonSession.cxx:1703: Neon received http error: '200 OK'
warn:legacy.tools:3764:1:editeng/source/editeng/eehtml.cxx:54: EditHTMLParser::EditHTMLParser: Where does the encoding come from?
warn:ucb.ucp.webdav:3764:1:ucb/source/ucp/webdav-neon/NeonSession.cxx:1703: Neon received http error: '200 OK'
warn:ucb.ucp.webdav:3764:1:ucb/source/ucp/webdav-neon/NeonSession.cxx:1703: Neon received http error: '200 OK'
warn:ucb.ucp.webdav:3764:1:ucb/source/ucp/webdav-neon/NeonSession.cxx:1703: Neon received http error: '200 OK'
warn:legacy.tools:3764:1:svl/source/items/poolitem.cxx:114: destroying item in use


In other words, the editeng HTML control:

1. Doesn't recognize valid HTML 5 syntax (it complains about <!doctype html>, which is valid)
2. Seems to have trouble detecting the encoding.
Comment 6 Chris Sherlock 2015-12-26 01:43:17 UTC
Created attachment 121553 [details]
Corrupted on Linux
Comment 7 Chris Sherlock 2015-12-26 01:44:59 UTC
It looks like it detects the encoding by checking whether the file starts with a BOM.

Unfortunately, that's not how web pages are sent. Instead, we should be looking at the headers that are returned from the web server:

HTTP/1.1 200 OK
Server: nginx/1.2.1
Date: Sat, 26 Dec 2015 01:41:30 GMT
Content-Type: text/html; name="text.html"; charset=UTF-8
Content-Length: 982
Connection: keep-alive
X-xss-protection: 1; mode=block
Content-disposition: inline; filename="text.html"
X-content-type-options: nosniff
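
As a rough illustration of the idea above (not LibreOffice code), here is a minimal Python sketch that prefers the charset declared in the Content-Type header and only falls back to a BOM check when the header gives nothing. The function names and the "windows-1252" default are assumptions made for the example.

```python
import codecs
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the value it finds.
    return msg.get_content_charset()


def detect_encoding(body, content_type=None, default="windows-1252"):
    """Pick an encoding for a downloaded page (hypothetical helper)."""
    # 1. Prefer what the server declared in the response header.
    if content_type:
        charset = charset_from_content_type(content_type)
        if charset:
            return charset
    # 2. Only then look for a UTF-8 byte order mark in the body.
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # 3. Assumed last-resort default, chosen for this example only.
    return default


header = 'text/html; name="text.html"; charset=UTF-8'
page = "سلام".encode("utf-8")
print(page.decode(detect_encoding(page, header)))
```

With the header shown above, the body has no BOM, so a BOM-only check would fall through to the default and garble the Persian text.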
Comment 8 Chris Sherlock 2015-12-26 04:06:37 UTC
So this goes through WebDAV, and at this point I've got no idea how it works. But stepping through the code, it's very suspicious that WebDAV sees 200 responses as errors.
Comment 9 Buovjaga 2015-12-26 14:02:23 UTC
(In reply to Chris Sherlock from comment #5)
> Still occurring. My steps were a bit unclear. 
> 
> 1. Open a new Calc spreadsheet
> 2. Go into Insert -> Link to External Data...
> 3. Point to https://bugs.documentfoundation.org/attachment.cgi?id=119818
> 4. Pick Automatic in the Import Options dialog box, then choose the all
> table, then click on OK

Ok now I could repro and got garbled characters.
For step 3, we have to press enter after pasting the url (a bit unclear UX there).

Win 7 Pro 64-bit Version: 5.2.0.0.alpha0+
Build ID: a4764cfa80270f973da5861d0ddc28298bf16f4d
CPU Threads: 4; OS Version: Windows 6.1; UI Render: default; 
TinderBox: Win-x86@62-merge-TDF, Branch:MASTER, Time: 2015-12-24_22:45:12
Locale: fi-FI (fi_FI)
Comment 10 Giuseppe Castagno (aka beppec56) 2016-01-28 15:45:27 UTC
(In reply to Chris Sherlock from comment #8)
> So this goes through WebDAV, and at this point I've got no idea how it
> works. But stepping through the code, it's very suspicious that WebDAV sees
> 200 responses as errors.

This means the neon library returned an error, but with an HTTP status of '200 OK'; apparently, on this Web server, that means 'PROPFIND method is not available'.
I probably should have chosen a different wording for the message.

If you enable +INFO.ucb.ucp.webdav, you'll see almost the whole protocol exchange.

The content-type property should be mapped to the ucb property MediaType.
On a WebDAV server (or a Web server with read-only WebDAV enabled), MediaType is mapped to the 'getcontenttype' DAV property, which gives you the correct value.

I need to see what happens when a plain web link is processed.
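
For readers unfamiliar with the WebDAV side, the following Python sketch shows where 'getcontenttype' lives in a PROPFIND multistatus response. The sample XML is hand-written for illustration and was not captured from any server in this report; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample of a PROPFIND reply; not taken from a real server.
SAMPLE_MULTISTATUS = """\
<D:multistatus xmlns:D="DAV:">
  <D:response>
    <D:href>/docs/text.html</D:href>
    <D:propstat>
      <D:prop>
        <D:getcontenttype>text/html; charset=UTF-8</D:getcontenttype>
      </D:prop>
      <D:status>HTTP/1.1 200 OK</D:status>
    </D:propstat>
  </D:response>
</D:multistatus>
"""


def media_type_from_propfind(xml_text):
    """Return the DAV getcontenttype value, or None if it is absent."""
    root = ET.fromstring(xml_text)
    # The WebDAV namespace URI is the literal string "DAV:".
    node = root.find(".//{DAV:}getcontenttype")
    if node is None or not node.text:
        return None
    return node.text.strip()


print(media_type_from_propfind(SAMPLE_MULTISTATUS))
```

A plain Web server that does not implement PROPFIND never produces such a reply, so the value has to come from somewhere else.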
Comment 11 Giuseppe Castagno (aka beppec56) 2016-01-29 09:24:06 UTC
@Chris:

I found some time to analyze this bug.
It seems that when a client application of the webdav provider asks for the MediaType property and the target URL is on a simple Web site, in this member function:

<http://opengrok.libreoffice.org/xref/core/ucb/source/ucp/webdav-neon/webdavcontent.cxx#1228>

the fallback that obtains the value from the content-type header of the HEAD method response doesn't work as it should.
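
The intended fallback can be sketched in Python as follows. This is only an outline of the control flow described above, not the code in webdavcontent.cxx; `PropfindUnavailable`, `resolve_media_type`, and the two callables are hypothetical names.

```python
class PropfindUnavailable(Exception):
    """Hypothetical error: the server does not support PROPFIND."""


def resolve_media_type(propfind, head):
    """Return the media type, trying PROPFIND first and HEAD second.

    `propfind` and `head` are hypothetical callables that each return
    a media type string or None; `propfind` may raise
    PropfindUnavailable on a plain Web server.
    """
    try:
        media_type = propfind()
    except PropfindUnavailable:
        media_type = None
    if not media_type:
        # Fall back to the Content-Type header of a HEAD response.
        media_type = head()
    return media_type


def propfind_on_plain_web_server():
    raise PropfindUnavailable()


print(resolve_media_type(propfind_on_plain_web_server,
                         lambda: "text/html; charset=UTF-8"))
```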

As Mark Hung pointed out on the dev list, there may also be a header-name character-case problem, possibly in the callback functions that analyze the response.

I'll have a look into it.
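
To show what a header-name case problem looks like, here is a small Python sketch. HTTP header field names are case-insensitive, so a lookup has to compare them that way; `get_header` is a hypothetical helper, not a function from neon or LibreOffice.

```python
def get_header(headers, name):
    """Return the first value whose field name matches, ignoring case.

    `headers` is a list of (name, value) pairs as received.
    """
    wanted = name.lower()
    for key, value in headers:
        if key.lower() == wanted:
            return value
    return None


# Same header, but sent in lower case by the server.
response_headers = [
    ("content-type", "text/html; charset=UTF-8"),
    ("Content-Length", "982"),
]

# An exact-match lookup misses the lower-case name.
print(dict(response_headers).get("Content-Type"))
# The case-insensitive lookup finds it.
print(get_header(response_headers, "Content-Type"))
```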
Comment 12 Giuseppe Castagno (aka beppec56) 2016-01-29 16:39:50 UTC
@Chris:

I pushed to gerrit:
<https://gerrit.libreoffice.org/#/c/21907/1>

The patches fix the bug; both are needed. The fix is split into two patches to help bisecting.

If you find the time, please test them.
Comment 13 Commit Notification 2016-01-30 07:44:56 UTC
Giuseppe Castagno committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=e973b342826e54f147251b132c3325d30749e312

Related tdf#95217: Http header names are case insensitive

It will be available in 5.2.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 14 Commit Notification 2016-01-30 07:47:11 UTC
Giuseppe Castagno committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=d61352f58a7f750d3b0b0a9c2d6498fbb7a6e10d

Related tdf#95217: Force HEAD method in Web access if PROPFIND failed

It will be available in 5.2.0.

Comment 15 Chris Sherlock 2016-01-31 10:13:33 UTC
Giuseppe - that's fantastic work! Sorry I took so long to respond - nice bit of troubleshooting, and nice to see such an elegant fix :-) I tip my hat to you.

I'll build LO again and test this, then sign off on it.
Comment 16 Chris Sherlock 2016-02-01 03:33:41 UTC
Excellent - I can confirm this is now working as intended - the Persian text is now rendering correctly. 

Many thanks Giuseppe!
Comment 17 Commit Notification 2016-02-02 12:21:03 UTC
Giuseppe Castagno committed a patch related to this issue.
It has been pushed to "libreoffice-5-1":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=3d03b2f51912e7ca49251befca3fa61021dc6154&h=libreoffice-5-1

Related tdf#95217: Http header names are case insensitive

It will be available in 5.1.1.

Comment 18 Commit Notification 2016-02-02 12:21:07 UTC
Giuseppe Castagno committed a patch related to this issue.
It has been pushed to "libreoffice-5-1":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=abec158e8b0a5c07380cd2bc7f7c5edbef878bed&h=libreoffice-5-1

Related tdf#95217: Force HEAD method in Web access if PROPFIND failed

It will be available in 5.1.1.
