LO saves links in HTML files as UTF-8. This is a problem with Internet Explorer, because it doesn't support the UTF-8 link format (more at http://support.microsoft.com/kb/941052). So once a link contains a non-US-ASCII character, LO transforms it into a UTF-8 string. I asked at http://ask.libreoffice.org/en/question/31061/can-url-encoding-be-disabled/ but got no good solution.
Steps to reproduce:
1. Open an HTML file with links that include non-US-ASCII characters
2. Edit whatever
4. Not working with IE
Current behavior: Links written in UTF-8 do not work in IE.
Expected behavior: Working links that use non-US-ASCII characters directly
Operating System: Windows 7
Version: 184.108.40.206 release
Looks like conversion of file URLs between LO's internal (path payload always in UTF-8) and external (path payload according to platform expectations) representations does not happen for HTML (im-?)/export. (The example at <http://ask.libreoffice.org/en/question/31061/can-url-encoding-be-disabled/> apparently involves file URLs, albeit relative ones, which is an extra challenge for the conversion between internal and external, which bases its work on the scheme of a---necessarily absolute---URI.)
Which version of LibreOffice did you try this on? I tried 4.0, 4.1 and 4.2 and didn't get the output you mentioned on Ask LibreOffice.
Created attachment 100812 [details]
Test target file
Created attachment 100813 [details]
File before editing in LO
Created attachment 100814 [details]
File after edit in LO
I have updated to the latest stable release 220.127.116.11 and it does the same.
Created attachment 100815 [details]
Visual code comparison
I did some testing, and this issue is Windows-only: on Linux, LO automatically changes the charset to utf8. On Windows, when you edit the link, click on the 'web' tab first rather than editing it in the 'document' tab, and the link should be fine.
Hello. It happens not only when editing the link. You just open the document and change f.e. font size and all the links in the document transform into UTF-8 strings.
Created attachment 100838 [details]
modified version of the before edit in LibreOffice html file
(In reply to comment #9)
> Hello. It happens not only when editing the link. You just open the document
> and change f.e. font size and all the links in the document transform into
> UTF-8 strings.
Hi. I tried to reproduce TestHTML_after_edit_LO.html from TestHTML_before_edit_LO.html. The only change between the two files was that you modified the URL from an absolute location on your hard disk to a relative location within the same directory, and I confirm that the result worked in IE.
Since you stated that doing more things to the file would turn everything into UTF-8 strings, I made further changes: I added text, changed the font and added another link. Everything went well for me, and the file could still be opened in IE. I did notice that LibreOffice in my case changed the charset to windows-1252 and that a number of the characters changed to their '&#' numeric equivalents.
I think that is the catch. In my case it keeps windows-1250 and converts the links to UTF-8 strings. I think it has something to do with the default charset. Here is a screen recording: https://www.youtube.com/watch?v=4x0FRHkCNzQ
Created attachment 100862 [details]
File before editing in LO (utf-8)
Hello again. I have tried to reproduce the problem on more machines and found that the bigger problem is this:
When the HTML file is encoded in UTF-8 before editing in LO, on most machines LO changes the encoding to windows-1250 and converts the links to UTF-8 strings.
On some PCs (like mine) it does so even with HTML files encoded in windows-1250 from the beginning. Most PCs in fact keep the links OK, as you stated.
I have tested this issue on the following setups:
Win 7 LO (18.104.22.168); Win XP LO (3.3.2); Win 7 LO (3.4.4); Win 7 LO (22.214.171.124)
So I'm sorry, my initial file "File before editing in LO" was probably a bad example. Please try the newly uploaded "File before editing in LO (utf-8)", as this one should reproduce the problem better.
That file is not in UTF-8.
Dear Bug Submitter,
This bug has been in NEEDINFO status with no change for at least 6 months. Please provide the requested information as soon as possible and mark the bug as UNCONFIRMED. Due to regular bug tracker maintenance, if the bug is still in NEEDINFO status with no change in 30 days the QA team will close the bug as INVALID due to lack of needed information.
For more information about our NEEDINFO policy please read the wiki located here:
If you have already provided the requested information, please mark the bug as UNCONFIRMED so that the QA team knows that the bug is ready to be confirmed.
Thank you for helping us make LibreOffice even better for everyone!
Message generated on: 10/01/2015
Attachment 100862 [details] is both broken and has "problematic" content:
For one, as comment 14 notes, the HTML file is labelled as "charset=utf-8", but contains the raw bytes E8 9A F8 9E EC that do not constitute UTF-8. How has this broken file been generated?
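The claim about those five bytes can be checked with a short sketch. It only uses the byte values quoted above; decoding them as windows-1250 is an assumption, based on that charset being mentioned earlier in this report, and is not something the attachment itself states.

```python
# Byte values quoted in the comment above (from attachment 100862).
raw = bytes.fromhex("E8 9A F8 9E EC")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The sequence is rejected as UTF-8: 0xE8 starts a three-byte sequence,
# but 0xF8 is not a valid continuation byte.
print(is_valid_utf8(raw))

# ASSUMPTION: the bytes were really written in windows-1250.
as_cp1250 = raw.decode("cp1250")
print(as_cp1250)

# Re-encoding that text properly gives bytes that do match "charset=utf-8".
print(is_valid_utf8(as_cp1250.encode("utf-8")))
```

If the windows-1250 assumption holds, the file was probably saved in that encoding while keeping a "charset=utf-8" label, which would explain the mismatch.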
For another, the file URL contained in the <a> link is problematic:
First, that file URL, as written in the HTML file, contains raw non-ASCII bytes (see above). How they should be interpreted when "extracting" the URL from the HTML file depends on the HTML file's encoding (UTF-8), but as noted above the file is broken and those bytes cannot be interpreted meaningfully. Different software in different scenarios (OS's locale settings, etc.) will likely respond in different ways when confronted with such broken input.
Second, even if the URL could meaningfully be "extracted" from the HTML file, it would contain non-ASCII bytes. URLs are written in a subset of ASCII. If a URL's "payload" (which is, roughly, a sequence of arbitrary byte values) shall contain values that are outside ASCII, they need to be escaped as %XX sequences. Again, different software in different scenarios (OS's locale settings, etc.) will likely respond in different ways when confronted with such broken input.
Third, even if the file URL's "payload" (i.e., a representation of a Windows pathname) could meaningfully be "extracted," as it contains non-ASCII bytes, it would be unclear how to interpret it as an actual Windows pathname. Windows pathnames are basically sequences of (16-bit) UTF-16 code units. An alternative way to access pathnames is via the OS's selected 8-bit character set (like windows-1250 etc.), where Windows internally translates between that 8-bit character set and UTF-16, and some valid UTF-16 pathnames can not be represented in certain 8-bit character sets, and the same 8-bit input sequence can denote different UTF-16 pathnames depending on the actually selected OS 8-bit character set. It is unspecified how (encodings of) non-ASCII bytes in a file URL's "payload" are to be interpreted on Windows, but general consensus appears to be to interpret them according to the OS's selected 8-bit character set (all the shortcomings of that approach notwithstanding). That, again, means that software in different scenarios (i.e., OS's locale settings) will likely respond in different ways when confronted with such "problematic" input.
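The second and third points can be illustrated with a minimal sketch using Python's standard `urllib.parse`. The file name and the helper `representable` are hypothetical and chosen only for illustration; nothing here reproduces what LO or IE actually do internally.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical file name containing one non-ASCII character (c with caron).
name = "\u010d.html"

# The same name escapes to different %XX sequences depending on which
# byte encoding is used for the URL payload.
escaped_utf8 = quote(name, encoding="utf-8")
escaped_cp1250 = quote(name, encoding="cp1250")
print(escaped_utf8, escaped_cp1250)

# Conversely, one escaped payload denotes different names depending on
# the 8-bit character set used to interpret it.
payload = unquote_to_bytes(escaped_cp1250)
as_cp1250 = payload.decode("cp1250")
as_cp1252 = payload.decode("cp1252")
print(as_cp1250, as_cp1252)


def representable(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Some valid names cannot be expressed at all in a given 8-bit set.
print(representable("\u0159", "cp1250"), representable("\u0159", "cp1252"))
```

A consumer that assumes the OS's 8-bit character set will therefore resolve a UTF-8-escaped payload to a different name than the one intended, which is consistent with the behavior reported for IE above.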
Dear Bug Submitter,
Please read this message in its entirety before proceeding.
Your bug report is being closed as INVALID due to inactivity and a lack of information which is needed in order to accurately reproduce and confirm the problem. We encourage you to retest your bug against the latest release. If the issue is still present in the latest stable release, we need the following information (please ignore any that you've already provided):
a) Provide details of your system including your operating system and the latest version of LibreOffice that you have confirmed the bug to be present
b) Provide easy to reproduce steps – the simpler the better
c) Provide any test case(s) which will help us confirm the problem
d) Provide screenshots of the problem if you think it might help
e) Read all comments and provide any requested information
Once all of this is done, please set the bug back to UNCONFIRMED and we will attempt to reproduce the issue.
Please do not:
a) respond via email
b) update the version field in the bug or any of the other details on the top section of FDO
Message generated on: 2015-02-11
(might be related to, or even a duplicate of, bug 76291, but hard to tell with the information given here)