Bug 96413 - Basic ConvertToURL fails to URL encode many characters (see comment 18)
Summary: Basic ConvertToURL fails to URL encode many characters (see comment 18)
Status: RESOLVED NOTABUG
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: BASIC (show other bugs)
Version:
(earliest affected)
Inherited From OOo
Hardware: All All
: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-12-11 14:04 UTC by Andreas Säger
Modified: 2023-04-17 11:53 UTC (History)
3 users (show)

See Also:
Crash report or crash signature:


Attachments
Spreadsheet with user-defined Basic functions (19.80 KB, application/vnd.oasis.opendocument.spreadsheet)
2015-12-11 14:04 UTC, Andreas Säger
Details
bt with debug symbols (8.92 KB, text/plain)
2015-12-29 19:14 UTC, Julien Nabet
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Andreas Säger 2015-12-11 14:04:58 UTC
Created attachment 121223 [details]
Spreadsheet with user-defined Basic functions

OK: ConvertToURL("C:\Foo Bar") --> file:///C:/Foo&20Bar
Fail: ConvertToURL("C:\Foo&Bar") --> file:///C:/Foo&Bar
Expected: ConvertToURL("C:\Foo&Bar") --> file:///C:/Foo&26Bar

Problem occurs with characters
!
$
&
'
(
)
*
+
,
/
:
=
@

ConvertToURL correctly encodes the follwing characters:
<space> 
"
#
%
;
?
[
\
]
{
|
}


Reference:
https://tools.ietf.org/html/rfc3986#page-12
Comment 1 Andreas Säger 2015-12-11 14:25:30 UTC
Same with PyUNO:
>>> s = '/foo & bar/concept.odt'
>>> uno.systemPathToFileUrl(s)
'file:///foo%20&%20bar/concept.odt'
Comment 2 Buovjaga 2015-12-14 15:02:02 UTC
What is the correct procedure to test this bug?

I ran the Path2URL macro and it said
A Scripting Framework error occurred while running the Basic script Standard.Module1.Path2URL.

Message: wrong number of parameters!

Set to NEEDINFO.
Change back to UNCONFIRMED after you have provided the information.
Comment 3 Andreas Säger 2015-12-14 16:07:54 UTC
The functions are called from the spreadsheet cells. Look at the cell formulas.
Comment 4 Buovjaga 2015-12-14 16:33:48 UTC
Sorry about that, I guess the concept was too alien to me.

Confirmed.

Win 7 Pro 64-bit Version: 5.2.0.0.alpha0+
Build ID: 917d59a84124d1022bd1912874e7a53c674784f1
CPU Threads: 4; OS Version: Windows 6.1; UI Render: default; 
TinderBox: Win-x86@62-merge-TDF, Branch:MASTER, Time: 2015-12-12_12:17:04
Locale: fi-FI (fi_FI)
Comment 5 Cor Nouws 2015-12-18 12:29:32 UTC
Thanks for filing. IMO an enhancement ?
Comment 6 Andreas Säger 2015-12-18 13:25:14 UTC
(In reply to Cor Nouws from comment #5)
> Thanks for filing. IMO an enhancement ?

No enhancement. One of my Basic macros truncated valid file paths when I was assuming that function convertToURL is able to deal with valid paths like C:/foo & bar/document.odt

Since Basic does not even provide basic string functions helping to implement my own URL encoding, I start my code with an error message and program exit when the following function returns True:

Function IsBadURL(sURL$) As Boolean
Dim a(), s$, nSplit&, sPN$
	nSplit = 9 'file:///
	If getGUIType = 1 then nSplit = 11 'file:///C:
	sPN = mid(sURL, nSplit)
	a() = Array("!", "$", "&", "'", "(", ")", "*", "+", ",", ":", "=", "@")
	for each s in a()
		if instr(sPN, s) > 0 then 
			IsBadURL = True
			exit function
		endif
	next
End Function
Comment 7 Cor Nouws 2015-12-18 14:09:29 UTC
OK.
Comment 8 Andreas Säger 2015-12-20 15:54:17 UTC
Weird work-around for the Basic fanboys among us:

Function WorkAround_ConvertToURL(sysPath) As String
Dim oSrv, a(), sURL$, nSplit&, sCode$, sProtocol$, sPathName$, sChar$, sRoot$
	sURL = convertToURL(sysPath)
	a() = Array("!", "$", "&", "'", "(", ")", "*", "+", ",", ":", "=", "@")
	If getGUIType = 1 then
		nSplit = 11
	else
		nSplit = 8
	endif
	sProtocol = left(sURL, nSPlit)
	sPathName = mid(sURL, nSplit +1)
	oSrv = createUnoService("com.sun.star.sheet.FunctionAccess")
	For each sChar in a()
		if instr(sPathName, sChar) > 0 then
			sCode = "%"& hex(asc(sChar))
			sPathName = oSrv.callFunction("SUBSTITUTE", Array(sPathName, sChar, sCode))
		endif
	next
	WorkAround_ConvertToURL = sProtocol & sPathName
End Function
Comment 9 Julien Nabet 2015-12-29 19:14:48 UTC
Created attachment 121619 [details]
bt with debug symbols

After some tests with gdb, it seems it decides to escape or not from here:
http://opengrok.libreoffice.org/xref/core/tools/source/fsys/urlobj.cxx#541

    541 inline bool mustEncode(sal_uInt32 nUTF32, INetURLObject::Part ePart)
    542 {
    543     return !rtl::isAscii(nUTF32) || !(aMustEncodeMap[nUTF32] & ePart);
    544 }
(gdb indicates ePart=INetURLObject::PART_PCHAR)

which gives then:
http://opengrok.libreoffice.org/xref/core/tools/source/fsys/urlobj.cxx#441
a 128 elements array.

If I apply this patch:
diff --git a/tools/source/fsys/urlobj.cxx b/tools/source/fsys/urlobj.cxx
index 771c3bb..bf279f16f 100644
--- a/tools/source/fsys/urlobj.cxx
+++ b/tools/source/fsys/urlobj.cxx
@@ -442,7 +442,7 @@ static sal_uInt32 const aMustEncodeMap[128]
     = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 /*   */                                              PP,
-/* ! */ PA   +PD+PE+PF+PG+PH+PI+PJ+PK+PL+PM+PN+PO+PP+PQ+PR,
+/* ! */ PA   +PD+PE+PF+PG+PH+PI+PJ+PK   +PM+PN+PO+PP+PQ+PR,
 /* " */                                  PM+PN   +PP,
 /* # */                                  PM,
 /* $ */ PA   +PD+PE+PF+PG+PH+PI+PJ+PK+PL+PM+PN+PO+PP+PQ+PR,

"!" is escaped.
However, I don't know the impact of this change on other parts of LO.
Comment 10 QA Administrators 2017-01-03 19:57:30 UTC Comment hidden (obsolete)
Comment 11 Andreas Säger 2017-01-05 18:19:58 UTC
This is still in 5.2.3.
Comment 12 QA Administrators 2018-06-10 02:45:30 UTC Comment hidden (obsolete)
Comment 13 QA Administrators 2020-06-10 03:48:01 UTC Comment hidden (obsolete)
Comment 14 QA Administrators 2022-06-11 03:29:26 UTC Comment hidden (obsolete)
Comment 15 Andreas Säger 2022-06-11 10:43:13 UTC
Still in

Version: 7.3.4.2 / LibreOffice Community
Build ID: 728fec16bd5f605073805c3c9e7c4212a0120dc5
CPU threads: 4; OS: Linux 5.4; UI render: default; VCL: x11
Locale: de-DE (de_DE.UTF-8); UI: en-US
Calc: threaded
Comment 16 Mike Kaganski 2023-04-16 11:46:59 UTC
Why should & be URL-encoded? It only has a special meaning in the query part, i.e. after question mark.
Comment 17 Eike Rathke 2023-04-16 11:54:21 UTC
IMHO it is not the task of ConvertToURL() to percent-encode anything except space as %20, as that is the only neither unreserved nor reserved character in an URI and what characters of the reserved characters are allowed depends on the actual URI scheme and protocol.
See https://datatracker.ietf.org/doc/html/rfc3986#section-2 and following.

It is the task of the application programmer (or another function) to know which reserved characters act as delimiters in a specific URI scheme and need to be percent-encoded.
Comment 18 Mike Kaganski 2023-04-16 12:07:51 UTC
The URI scheme is known in the function - it is 'file:' URL scheme. It should encode more characters, including # (which otherwise would be misinterpreted as start of fragment) or backslash (in non-Windows paths), which is invalid in URLs. But the characters listed in comment 0 are all normal in file URLs. Note that forward slash delimits hierarchical path parts.
Comment 19 Mike Kaganski 2023-04-16 17:13:27 UTC
(In reply to Andreas Säger from comment #0)
> Fail: ConvertToURL("C:\Foo&Bar") --> file:///C:/Foo&Bar
> Expected: ConvertToURL("C:\Foo&Bar") --> file:///C:/Foo%26Bar

This is an incorrect expectation. The URL is not required to percent-encode every non-alphanumeric character. This character - as well as the rest listed in the "Problem occurs with characters" - is OK in the URL unchanged. Not percent-encoding it is OK.

> Reference:
> https://tools.ietf.org/html/rfc3986#page-12

The reference describes *how* to percent-encode octets, and also *when*; and the latter is phrased "when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component". And this is exactly why the characters discussed in the comment 0 are OK in file URL.

(In reply to Andreas Säger from comment #6)
> One of my Basic macros truncated valid file paths when I was
> assuming that function convertToURL is able to deal with valid paths like
> C:/foo & bar/document.odt

So the bug is somewhere in the said macro. It likely has some incorrect assumptions about URL syntax.

Closing NOTABUG.
Comment 20 Mike Kaganski 2023-04-17 11:53:46 UTC
(In reply to Andreas Säger from comment #0)

Just to expand the last comment.

RFC 8089 [1] sect. 2 defines 'file' URL as:

> file-URI       = file-scheme ":" file-hier-part
> 
> file-hier-part = ( "//" auth-path )
>                / local-path
> 
> auth-path      = [ file-auth ] path-absolute
> 
> local-path     = path-absolute
> 
> file-auth      = "localhost"
>                / host

with the explanation "importing the "host" and "path-absolute" rules from [RFC3986] (as updated by [RFC6874])".

RFC 3986 defines them this way:

> host          = ...
> path-absolute = "/" [ segment-nz *( "/" segment ) ]
> segment       = *pchar
> segment-nz    = 1*pchar
> pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
> pct-encoded   = "%" HEXDIG HEXDIG
> unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
> sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
>               / "*" / "+" / "," / ";" / "="

And finally, ALPHA, DIGIT, and HEXDIG are defined in RFC 2234 in an obvious way.

The scheme [1] is known both in Basic's ConvertToURL, and in Python's uno.systemPathToFileUrl, as mentioned in comment 18, simply because the functions are defined to convert system path to file URL.

The system path characters are converted to characters of the URL's 'file-hier-part'; more precisely, to 'path-absolute', because 'file-auth' is not applicable to system-path-to-URL conversion (the auth information is not available in the system path). Further, I omitted the 'host' part, because the issue did not consider "hosted" system paths, like UNC, so this allows to avoid additional complexity (which doesn't change anything, just clutters the text).

Now let us consider the issue character by character:

> Problem occurs with characters
> !
> $
> &
> '
> (
> )
> *
> +
> ,
> =

All the above are 'sub-delims', explicitly allowed in 'pchar', which constitute both 'segment' and 'segment-nz' of 'path-absolute'.

> /

This one is explicitly shown as the character delimiting segments in 'path-absolute'. Both Linux filesystems, and Windows filesystems use this character as hierarchy delimiter, so whenever it appears in the source path, it gets "converted" to itself in the resulting URL.

> :
> @

These two are allowed explicitly in 'pchar'.

> ConvertToURL correctly encodes the follwing characters:
> <space> 
> "
> {
> |
> }

These are not listed at all among the "very limited set", from which URIs consist (RFC 3986 sect. 1.2.1; Appendix a), so it must be percent-encoded.

> \

This one gets converted to "/" on Windows, since it's a hierarchy separator there; on Linux, it behaves the same as the five above.

> %

This is the character used in 'pct-encoded'; and its conversion is explicitly defined in RFC 3986 sect. 2.4.

> #
> ?
> [
> ]

These are 'gen-delims' that aren't explicitly mentioned as allowed in the 'path-absolute' components.

> ;

And this one is interesting. It is part of 'sub-delims', so in theory, could stay as is. Interestingly, both INetURLObject [2] and osl_getFileURLFromSystemPath [3] (which are used in Basic and Python, respectively), percent-encode it. The most likely reason is that the previous version of the "Uniform Resource Identifiers (URI): Generic Syntax" standard, RFC 2396, didn't allow that character there.

[1] https://www.rfc-editor.org/rfc/rfc8089
[2] https://opengrok.libreoffice.org/xref/core/include/tools/urlobj.hxx?r=485300f9#179
[3] https://opengrok.libreoffice.org/xref/core/include/osl/file.h?r=0ce7c84c#1443