Bug 76080 - FILESAVE: URLs encoded into UTF-8 after saving HTML
Status: RESOLVED INVALID
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 4.2.1.1 release
Hardware: All Windows (All)
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard: BSA
Keywords:
Depends on:
Blocks:
 
Reported: 2014-03-12 14:39 UTC by Tomáš Tunkl
Modified: 2015-02-12 07:59 UTC

See Also:
Crash report or crash signature:


Attachments
Test target file (6.96 KB, application/pdf), 2014-06-10 12:43 UTC, Tomáš Tunkl
File before editing in LO (291 bytes, text/html), 2014-06-10 12:43 UTC, Tomáš Tunkl
File after edit in LO (515 bytes, text/html), 2014-06-10 12:44 UTC, Tomáš Tunkl
Visual code comparison (21.20 KB, image/png), 2014-06-10 12:55 UTC, Tomáš Tunkl
modified version of the before edit in LibreOffice html file (800 bytes, text/html), 2014-06-10 18:44 UTC, Yousuf Philips (jay) (retired)
File before editing in LO (utf-8) (284 bytes, text/html), 2014-06-11 06:59 UTC, Tomáš Tunkl

Description Tomáš Tunkl 2014-03-12 14:39:40 UTC
Problem description: 
LO saves links in HTML files as UTF-8. This is a problem with Internet Explorer, because it doesn't support the UTF-8 link format (see http://support.microsoft.com/kb/941052). As soon as a link contains a non-US-ASCII character, LO transforms it into a UTF-8 string. I asked about this at http://ask.libreoffice.org/en/question/31061/can-url-encoding-be-disabled/ but did not get a good solution.

Steps to reproduce:
1. Open an HTML file with links that include a non-US-ASCII character
2. Edit anything
3. Save
4. Open the saved file in IE: the links do not work

Current behavior: Links written as UTF-8 strings do not work in IE.

Expected behavior: Working links that use the non-US-ASCII characters directly.

              
Operating System: Windows 7
Version: 4.1.0.4 release
Comment 1 Stephan Bergmann 2014-03-12 15:46:40 UTC
Looks like the conversion of file URLs between LO's internal representation (path payload always in UTF-8) and its external representation (path payload according to platform expectations) does not happen for HTML export (and possibly import).  (The example at <http://ask.libreoffice.org/en/question/31061/can-url-encoding-be-disabled/> apparently involves file URLs, albeit relative ones, which is an extra challenge for the conversion between internal and external, because that conversion bases its work on the scheme of a URI, which is necessarily absolute.)
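
As an illustration of the internal/external distinction described in this comment, here is a minimal Python sketch. It is not LibreOffice's conversion code: the function names, the file name, and the choice of windows-1250 (cp1250) as the "platform" encoding are assumptions made for the example. It only shows that the same path has different percent-encoded forms depending on which byte encoding the payload uses.

```python
from urllib.parse import quote, unquote


def internal_to_external(path: str, codepage: str) -> str:
    """Re-encode a percent-encoded UTF-8 path into the given 8-bit code page.

    Hypothetical helper for illustration only.
    """
    text = unquote(path, encoding="utf-8", errors="strict")
    return quote(text, encoding=codepage, errors="strict")


def external_to_internal(path: str, codepage: str) -> str:
    """Re-encode a percent-encoded code-page path into UTF-8 (hypothetical)."""
    text = unquote(path, encoding=codepage, errors="strict")
    return quote(text, encoding="utf-8", errors="strict")


# Hypothetical relative link target containing Czech characters.
internal = quote("čšřžě.pdf", encoding="utf-8")
external = internal_to_external(internal, "cp1250")

print(internal)  # %C4%8D%C5%A1%C5%99%C5%BE%C4%9B.pdf
print(external)  # %E8%9A%F8%9E%EC.pdf
```

If the conversion step is skipped on export, the UTF-8 form ends up in the saved HTML, which matches the symptom reported above.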
Comment 2 Yousuf Philips (jay) (retired) 2014-06-10 03:47:52 UTC
Hi Tomas,

Which version of LibreOffice did you try this on? I tried 4.0, 4.1 and 4.2 and didn't get the output you mentioned on Ask LibreOffice.
Comment 3 Tomáš Tunkl 2014-06-10 12:43:30 UTC
Created attachment 100812
Test target file
Comment 4 Tomáš Tunkl 2014-06-10 12:43:50 UTC
Created attachment 100813
File before editing in LO
Comment 5 Tomáš Tunkl 2014-06-10 12:44:09 UTC
Created attachment 100814
File after edit in LO
Comment 6 Tomáš Tunkl 2014-06-10 12:45:25 UTC
I have updated to the latest stable release 4.2.4.2 and it does the same.
Comment 7 Tomáš Tunkl 2014-06-10 12:55:47 UTC
Created attachment 100815
Visual code comparison
Comment 8 Yousuf Philips (jay) (retired) 2014-06-10 16:08:45 UTC
Hi Tomas,

I did some testing, and this issue is Windows-only: on Linux, LO automatically changes the charset to UTF-8. So on Windows, when you edit the link, first click on the 'web' tab rather than editing it in the 'document' tab, and the link should be fine.
Comment 9 Tomáš Tunkl 2014-06-10 17:15:11 UTC
Hello. It happens not only when editing the link. You just open the document and change e.g. the font size, and all the links in the document are transformed into UTF-8 strings.
Comment 10 Yousuf Philips (jay) (retired) 2014-06-10 18:44:48 UTC
Created attachment 100838
modified version of the before edit in LibreOffice html file

(In reply to comment #9)
> Hello. It happens not only when editing the link. You just open the document
> and change e.g. the font size, and all the links in the document are
> transformed into UTF-8 strings.

Hi. I tried to reproduce TestHTML_after_edit_LO.html from TestHTML_before_edit_LO.html. The only change between the two files was that you modified the URL from an absolute location on your hard disk to a relative location within the same directory. I did the same, and I can confirm that the result worked in IE.

As you stated that doing more to the file would turn everything into UTF-8 strings, I made further changes: I added text, changed the font and added another link. Everything went well for me, and the file could still be opened in IE. I did notice that LibreOffice in my case changed the charset to windows-1252 and changed a number of the characters to their '&#' numeric equivalents.
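
The '&#' observation can be reproduced in principle with a short Python sketch. This is only an illustration of the general mechanism (characters that the target charset cannot represent become numeric character references), not a description of how LibreOffice's HTML export is implemented; the sample string is an assumption.

```python
# Hypothetical sample text with Czech characters.
text = "čšřžě"

# windows-1252 has š and ž, but not č, ř or ě, so those three
# fall back to decimal numeric character references.
as_cp1252 = text.encode("cp1252", errors="xmlcharrefreplace")

# windows-1250 covers all five characters, so no references are needed.
as_cp1250 = text.encode("cp1250", errors="xmlcharrefreplace")

print(as_cp1252)  # b'&#269;\x9a&#345;\x9e&#283;'
print(as_cp1250)  # b'\xe8\x9a\xf8\x9e\xec'
```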
Comment 11 Tomáš Tunkl 2014-06-11 06:11:01 UTC
I think that is the catch. In my case it keeps windows-1250 and converts the links to UTF-8 strings. I think it has something to do with the default charset. Here is a screen recording: https://www.youtube.com/watch?v=4x0FRHkCNzQ
Comment 12 Tomáš Tunkl 2014-06-11 06:59:10 UTC
Created attachment 100862
File before editing in LO (utf-8)
Comment 13 Tomáš Tunkl 2014-06-11 07:12:52 UTC
Hello again. I have tried to reproduce the problem on more machines, and I have found that the bigger problem is this:

The HTML file is encoded in UTF-8 before editing in LO. On most machines, LO changes the encoding to windows-1250 and converts the links to UTF-8 strings.

On some PCs (like mine) it does so even with HTML files that were encoded in windows-1250 from the beginning. Most PCs in fact keep the links OK, as you stated.

I have tested this issue on the following setups:
Win 7 LO (3.5.4.2); Win XP LO (3.3.2); Win 7 LO (3.4.4); Win 7 LO (4.2.4.2)

So I'm sorry, my initial file "File before editing in LO" was probably a bad example. Please try the newly uploaded "File before editing in LO (utf-8)", as this one should reproduce the problem better.
Comment 14 Urmas 2014-06-14 02:57:08 UTC
That file is not in UTF-8.
Comment 15 QA Administrators 2015-01-10 18:06:00 UTC
Dear Bug Submitter,

This bug has been in NEEDINFO status with no change for at least 6 months. Please provide the requested information as soon as possible and mark the bug as UNCONFIRMED. Due to regular bug tracker maintenance, if the bug is still in NEEDINFO status with no change in 30 days the QA team will close the bug as INVALID due to lack of needed information.

For more information about our NEEDINFO policy please read the wiki located here: 
https://wiki.documentfoundation.org/QA/FDO/NEEDINFO

If you have already provided the requested information, please mark the bug as UNCONFIRMED so that the QA team knows that the bug is ready to be confirmed.


Thank you for helping us make LibreOffice even better for everyone!


Warm Regards,
QA Team

Message generated on: 10/01/2015
Comment 16 Stephan Bergmann 2015-01-12 09:12:28 UTC
Attachment 100862 is both broken and has "problematic" content:

For one, as comment 14 notes, the HTML file is labelled as "charset=utf-8", but contains the raw bytes E8 9A F8 9E EC that do not constitute UTF-8.  How has this broken file been generated?
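
The claim about those raw bytes can be checked with a few lines of Python. This is a minimal sketch; the helper name is made up for the example, and decoding the bytes as windows-1250 is an assumption based on the charset mentioned earlier in this report.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The raw bytes quoted above.
raw = bytes.fromhex("E8 9A F8 9E EC")

# E8 starts a three-byte sequence, but F8 is not a continuation byte.
print(is_valid_utf8(raw))    # False

# Assumption: the bytes were really written in windows-1250.
print(raw.decode("cp1250"))  # čšřžě
```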

For another, the file URL contained in the <a> link is problematic:

First, that file URL, as written in the HTML file, contains raw non-ASCII bytes (see above).  How they should be interpreted when "extracting" the URL from the HTML file depends on the HTML file's encoding (UTF-8), but as noted above the file is broken and those bytes cannot be interpreted meaningfully.  Different software in different scenarios (OS's locale settings, etc.) will likely respond in different ways when confronted with such broken input.

Second, even if the URL could meaningfully be "extracted" from the HTML file, it would contain non-ASCII bytes.  URLs are written in a subset of ASCII.  If a URL's "payload" (which is, roughly, a sequence of arbitrary byte values) shall contain values that are outside ASCII, they need to be escaped as %XX sequences.  Again, different software in different scenarios (OS's locale settings, etc.) will likely respond in different ways when confronted with such broken input.
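
A short Python sketch of the %XX escaping described in this paragraph, using the standard `urllib.parse` module. The file name is hypothetical; the point is only that the escaped form is pure ASCII, and that the escape sequences depend on which byte encoding the payload was in before escaping.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical link target; not taken from the attachment.
name = "čšřžě.pdf"

# Escape the payload bytes as %XX, once per candidate byte encoding.
utf8_escaped = quote(name.encode("utf-8"))
cp1250_escaped = quote(name.encode("cp1250"))

print(utf8_escaped)    # %C4%8D%C5%A1%C5%99%C5%BE%C4%9B.pdf
print(cp1250_escaped)  # %E8%9A%F8%9E%EC.pdf

# Unescaping yields the original bytes, with no hint of their encoding.
print(unquote_to_bytes(cp1250_escaped))  # b'\xe8\x9a\xf8\x9e\xec.pdf'
```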

Third, even if the file URL's "payload" (i.e., a representation of a Windows pathname) could meaningfully be "extracted," as it contains non-ASCII bytes, it would be unclear how to interpret it as an actual Windows pathname.  Windows pathnames are basically sequences of (16-bit) UTF-16 code units.  An alternative way to access pathnames is via the OS's selected 8-bit character set (like windows-1250 etc.), where Windows internally translates between that 8-bit character set and UTF-16.  However, some valid UTF-16 pathnames cannot be represented in certain 8-bit character sets, and the same 8-bit input sequence can denote different UTF-16 pathnames depending on the actually selected OS 8-bit character set.  It is unspecified how (encodings of) non-ASCII bytes in a file URL's "payload" are to be interpreted on Windows, but general consensus appears to be to interpret them according to the OS's selected 8-bit character set (all the shortcomings of that approach notwithstanding).  That, again, means that software in different scenarios (i.e., OS's locale settings) will likely respond in different ways when confronted with such "problematic" input.
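
Both shortcomings of the 8-bit approach can be shown with a small Python sketch. It uses Python's own codecs as a stand-in for the translation Windows performs, which is an assumption made for illustration; the helper name is hypothetical.

```python
def representable(name: str, codepage: str) -> bool:
    """Return True if every character of name exists in codepage (hypothetical helper)."""
    try:
        name.encode(codepage)
    except UnicodeEncodeError:
        return False
    return True


# The same 8-bit sequence denotes different names under different code pages.
payload = bytes.fromhex("E8 9A F8 9E EC")
decoded = {cp: payload.decode(cp) for cp in ("cp1250", "cp1252")}
print(decoded)  # {'cp1250': 'čšřžě', 'cp1252': 'èšøžì'}

# Some valid names cannot be expressed in a given 8-bit character set.
print(representable("ř", "cp1250"))  # True
print(representable("ř", "cp1252"))  # False

# As UTF-16, the cp1250 reading is five 16-bit code units.
print(len(decoded["cp1250"].encode("utf-16-le")) // 2)  # 5
```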
Comment 17 QA Administrators 2015-02-11 19:51:43 UTC
Dear Bug Submitter,

Please read this message in its entirety before proceeding.

Your bug report is being closed as INVALID due to inactivity and a lack of information which is needed in order to accurately reproduce and confirm the problem. We encourage you to retest your bug against the latest release. If the issue is still present in the latest stable release, we need the following information (please ignore any that you've already provided):

a) Provide details of your system including your operating system and the latest version of LibreOffice that you have confirmed the bug to be present

b) Provide easy to reproduce steps – the simpler the better

c) Provide any test case(s) which will help us confirm the problem

d) Provide screenshots of the problem if you think it might help

e) Read all comments and provide any requested information

Once all of this is done, please set the bug back to UNCONFIRMED and we will attempt to reproduce the issue. 

Please do not:
a) respond via email 
b) update the version field in the bug or any of the other details on the top section of FDO
Message generated on: 2015-02-11
Comment 18 Stephan Bergmann 2015-02-12 07:59:00 UTC
(might be related to, or even a duplicate of, bug 76291, but hard to tell with the information given here)