96413 – Basic ConvertToURL fails to URL encode many characters (see comment 18)

Bug 96413 - Basic ConvertToURL fails to URL encode many characters (see comment 18)

Summary: Basic ConvertToURL fails to URL encode many characters (see comment 18)

Status:	RESOLVED NOTABUG

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	BASIC (show other bugs)
Version: (earliest affected)	Inherited From OOo
Hardware:	All All

Importance:	medium normal
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2015-12-11 14:04 UTC by Andreas Säger
Modified:	2023-04-17 11:53 UTC (History)
CC List:	3 users (show)

See Also:
Crash report or crash signature:

Attachments
Spreadsheet with user-defined Basic functions (19.80 KB, application/vnd.oasis.opendocument.spreadsheet) 2015-12-11 14:04 UTC, Andreas Säger	Details
bt with debug symbols (8.92 KB, text/plain) 2015-12-29 19:14 UTC, Julien Nabet	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Andreas Säger 2015-12-11 14:04:58 UTC

Created attachment 121223 [details]
Spreadsheet with user-defined Basic functions

OK: ConvertToURL("C:\Foo Bar") --> file:///C:/Foo&20Bar
Fail: ConvertToURL("C:\Foo&Bar") --> file:///C:/Foo&Bar
Expected: ConvertToURL("C:\Foo&Bar") --> file:///C:/Foo&26Bar

Problem occurs with characters
!
$
&
'
(
)
*
+
,
/
:
=
@

ConvertToURL correctly encodes the follwing characters:
<space> 
"
#
%
;
?
[
\
]
{
|
}


Reference:
https://tools.ietf.org/html/rfc3986#page-12

Comment 1 Andreas Säger 2015-12-11 14:25:30 UTC

Same with PyUNO:
>>> s = '/foo & bar/concept.odt'
>>> uno.systemPathToFileUrl(s)
'file:///foo%20&%20bar/concept.odt'

Comment 2 Buovjaga 2015-12-14 15:02:02 UTC

What is the correct procedure to test this bug?

I ran the Path2URL macro and it said
A Scripting Framework error occurred while running the Basic script Standard.Module1.Path2URL.

Message: wrong number of parameters!

Set to NEEDINFO.
Change back to UNCONFIRMED after you have provided the information.

Comment 3 Andreas Säger 2015-12-14 16:07:54 UTC

The functions are called from the spreadsheet cells. Look at the cell formulas.

Comment 4 Buovjaga 2015-12-14 16:33:48 UTC

Sorry about that, I guess the concept was too alien to me.

Confirmed.

Win 7 Pro 64-bit Version: 5.2.0.0.alpha0+
Build ID: 917d59a84124d1022bd1912874e7a53c674784f1
CPU Threads: 4; OS Version: Windows 6.1; UI Render: default; 
TinderBox: Win-x86@62-merge-TDF, Branch:MASTER, Time: 2015-12-12_12:17:04
Locale: fi-FI (fi_FI)

Comment 5 Cor Nouws 2015-12-18 12:29:32 UTC

Thanks for filing. IMO an enhancement ?

Comment 6 Andreas Säger 2015-12-18 13:25:14 UTC

(In reply to Cor Nouws from comment #5)
> Thanks for filing. IMO an enhancement ?

No enhancement. One of my Basic macros truncated valid file paths when I was assuming that function convertToURL is able to deal with valid paths like C:/foo & bar/document.odt

Since Basic does not even provide basic string functions helping to implement my own URL encoding, I start my code with an error message and program exit when the following function returns True:

Function IsBadURL(sURL$) As Boolean
Dim a(), s$, nSplit&, sPN$
	nSplit = 9 'file:///
	If getGUIType = 1 then nSplit = 11 'file:///C:
	sPN = mid(sURL, nSplit)
	a() = Array("!", "$", "&", "'", "(", ")", "*", "+", ",", ":", "=", "@")
	for each s in a()
		if instr(sPN, s) > 0 then 
			IsBadURL = True
			exit function
		endif
	next
End Function

Comment 7 Cor Nouws 2015-12-18 14:09:29 UTC

OK.

Comment 8 Andreas Säger 2015-12-20 15:54:17 UTC

Weird work-around for the Basic fanboys among us:

Function WorkAround_ConvertToURL(sysPath) As String
Dim oSrv, a(), sURL$, nSplit&, sCode$, sProtocol$, sPathName$, sChar$, sRoot$
	sURL = convertToURL(sysPath)
	a() = Array("!", "$", "&", "'", "(", ")", "*", "+", ",", ":", "=", "@")
	If getGUIType = 1 then
		nSplit = 11
	else
		nSplit = 8
	endif
	sProtocol = left(sURL, nSPlit)
	sPathName = mid(sURL, nSplit +1)
	oSrv = createUnoService("com.sun.star.sheet.FunctionAccess")
	For each sChar in a()
		if instr(sPathName, sChar) > 0 then
			sCode = "%"& hex(asc(sChar))
			sPathName = oSrv.callFunction("SUBSTITUTE", Array(sPathName, sChar, sCode))
		endif
	next
	WorkAround_ConvertToURL = sProtocol & sPathName
End Function

Comment 9 Julien Nabet 2015-12-29 19:14:48 UTC

Created attachment 121619 [details]
bt with debug symbols

After some tests with gdb, it seems it decides to escape or not from here:
http://opengrok.libreoffice.org/xref/core/tools/source/fsys/urlobj.cxx#541

    541 inline bool mustEncode(sal_uInt32 nUTF32, INetURLObject::Part ePart)
    542 {
    543     return !rtl::isAscii(nUTF32) || !(aMustEncodeMap[nUTF32] & ePart);
    544 }
(gdb indicates ePart=INetURLObject::PART_PCHAR)

which gives then:
http://opengrok.libreoffice.org/xref/core/tools/source/fsys/urlobj.cxx#441
a 128 elements array.

If I apply this patch:
diff --git a/tools/source/fsys/urlobj.cxx b/tools/source/fsys/urlobj.cxx
index 771c3bb..bf279f16f 100644
--- a/tools/source/fsys/urlobj.cxx
+++ b/tools/source/fsys/urlobj.cxx
@@ -442,7 +442,7 @@ static sal_uInt32 const aMustEncodeMap[128]
     = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 /*   */                                              PP,
-/* ! */ PA   +PD+PE+PF+PG+PH+PI+PJ+PK+PL+PM+PN+PO+PP+PQ+PR,
+/* ! */ PA   +PD+PE+PF+PG+PH+PI+PJ+PK   +PM+PN+PO+PP+PQ+PR,
 /* " */                                  PM+PN   +PP,
 /* # */                                  PM,
 /* $ */ PA   +PD+PE+PF+PG+PH+PI+PJ+PK+PL+PM+PN+PO+PP+PQ+PR,

"!" is escaped.
However, I don't know the impact of this change on other parts of LO.

Comment 10 QA Administrators 2017-01-03 19:57:30 UTC Comment hidden (obsolete)

** Please read this message in its entirety before responding **

To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.

There have been thousands of bug fixes and commits since anyone checked on this bug report. During that time, it's possible that the bug has been fixed, or the details of the problem have changed. We'd really appreciate your help in getting confirmation that the bug is still present.

If you have time, please do the following:

Test to see if the bug is still present on a currently supported version of LibreOffice
(5.1.6 or 5.2.3 https://www.libreoffice.org/download/

If the bug is present, please leave a comment that includes the version of LibreOffice and
your operating system, and any changes you see in the bug behavior

If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave
a short comment that includes your version of LibreOffice and Operating System

Please DO NOT

Update the version field
Reply via email (please reply directly on the bug tracker)
Set the bug's Status field to RESOLVED - FIXED (this status has a particular meaning that is not
appropriate in this case)

If you want to do more to help you can test to see if your issue is a REGRESSION. To do so:
1. Download and install oldest version of LibreOffice (usually 3.3 unless your bug pertains to a feature added after 3.3)

http://downloadarchive.documentfoundation.org/libreoffice/old/

2. Test your bug
3. Leave a comment with your results.
4a. If the bug was present with 3.3 - set version to "inherited from OOo";
4b. If the bug was not present in 3.3 - add "regression" to keyword

Feel free to come ask questions or to say hello in our QA chat: http://webchat.freenode.net/?channels=libreoffice-qa

Thank you for helping us make LibreOffice even better for everyone!

Warm Regards,
QA Team

MassPing-UntouchedBug-20170103

Comment 11 Andreas Säger 2017-01-05 18:19:58 UTC

This is still in 5.2.3.

Comment 12 QA Administrators 2018-06-10 02:45:30 UTC Comment hidden (obsolete)

** Please read this message in its entirety before responding **

To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.

If you have time, please do the following:

Test to see if the bug is still present with the latest version of LibreOffice from https://www.libreoffice.org/download/

If the bug is present, please leave a comment that includes the information from Help - About LibreOffice.

If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave a comment that includes the information from Help - About LibreOffice.

Please DO NOT

2. Test your bug
3. Leave a comment with your results.
4a. If the bug was present with 3.3 - set version to 'inherited from OOo';
4b. If the bug was not present in 3.3 - add 'regression' to keyword

Feel free to come ask questions or to say hello in our QA chat: https://kiwiirc.com/nextclient/irc.freenode.net/#libreoffice-qa

Thank you for helping us make LibreOffice even better for everyone!

Warm Regards,
QA Team

MassPing-UntouchedBug

Comment 13 QA Administrators 2020-06-10 03:48:01 UTC Comment hidden (obsolete)

Dear Andreas Säger,

To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.

If you have time, please do the following:

Test to see if the bug is still present with the latest version of LibreOffice from https://www.libreoffice.org/download/

If the bug is present, please leave a comment that includes the information from Help - About LibreOffice.

If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave a comment that includes the information from Help - About LibreOffice.

Please DO NOT

2. Test your bug
3. Leave a comment with your results.
4a. If the bug was present with 3.3 - set version to 'inherited from OOo';
4b. If the bug was not present in 3.3 - add 'regression' to keyword

Feel free to come ask questions or to say hello in our QA chat: https://kiwiirc.com/nextclient/irc.freenode.net/#libreoffice-qa

Thank you for helping us make LibreOffice even better for everyone!

Warm Regards,
QA Team

MassPing-UntouchedBug

Comment 14 QA Administrators 2022-06-11 03:29:26 UTC Comment hidden (obsolete)

Dear Andreas Säger,

To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.

If you have time, please do the following:

Test to see if the bug is still present with the latest version of LibreOffice from https://www.libreoffice.org/download/

If the bug is present, please leave a comment that includes the information from Help - About LibreOffice.

If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave a comment that includes the information from Help - About LibreOffice.

Please DO NOT

2. Test your bug
3. Leave a comment with your results.
4a. If the bug was present with 3.3 - set version to 'inherited from OOo';
4b. If the bug was not present in 3.3 - add 'regression' to keyword

Feel free to come ask questions or to say hello in our QA chat: https://web.libera.chat/?settings=#libreoffice-qa

Thank you for helping us make LibreOffice even better for everyone!

Warm Regards,
QA Team

MassPing-UntouchedBug

Comment 15 Andreas Säger 2022-06-11 10:43:13 UTC

Still in

Version: 7.3.4.2 / LibreOffice Community
Build ID: 728fec16bd5f605073805c3c9e7c4212a0120dc5
CPU threads: 4; OS: Linux 5.4; UI render: default; VCL: x11
Locale: de-DE (de_DE.UTF-8); UI: en-US
Calc: threaded

Comment 16 Mike Kaganski 2023-04-16 11:46:59 UTC

Why should & be URL-encoded? It only has a special meaning in the query part, i.e. after question mark.

Comment 17 Eike Rathke 2023-04-16 11:54:21 UTC

IMHO it is not the task of ConvertToURL() to percent-encode anything except space as %20, as that is the only neither unreserved nor reserved character in an URI and what characters of the reserved characters are allowed depends on the actual URI scheme and protocol.
See https://datatracker.ietf.org/doc/html/rfc3986#section-2 and following.

It is the task of the application programmer (or another function) to know which reserved characters act as delimiters in a specific URI scheme and need to be percent-encoded.

Comment 18 Mike Kaganski 2023-04-16 12:07:51 UTC

The URI scheme is known in the function - it is 'file:' URL scheme. It should encode more characters, including # (which otherwise would be misinterpreted as start of fragment) or backslash (in non-Windows paths), which is invalid in URLs. But the characters listed in comment 0 are all normal in file URLs. Note that forward slash delimits hierarchical path parts.

Comment 19 Mike Kaganski 2023-04-16 17:13:27 UTC

(In reply to Andreas Säger from comment #0)
> Fail: ConvertToURL("C:\Foo&Bar") --> file:///C:/Foo&Bar
> Expected: ConvertToURL("C:\Foo&Bar") --> file:///C:/Foo%26Bar

This is an incorrect expectation. The URL is not required to percent-encode every non-alphanumeric character. This character - as well as the rest listed in the "Problem occurs with characters" - is OK in the URL unchanged. Not percent-encoding it is OK.

> Reference:
> https://tools.ietf.org/html/rfc3986#page-12

The reference describes *how* to percent-encode octets, and also *when*; and the latter is phrased "when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component". And this is exactly why the characters discussed in the comment 0 are OK in file URL.

(In reply to Andreas Säger from comment #6)
> One of my Basic macros truncated valid file paths when I was
> assuming that function convertToURL is able to deal with valid paths like
> C:/foo & bar/document.odt

So the bug is somewhere in the said macro. It likely has some incorrect assumptions about URL syntax.

Closing NOTABUG.

Comment 20 Mike Kaganski 2023-04-17 11:53:46 UTC

(In reply to Andreas Säger from comment #0)

Just to expand the last comment.

RFC 8089 [1] sect. 2 defines 'file' URL as:

> file-URI       = file-scheme ":" file-hier-part
> 
> file-hier-part = ( "//" auth-path )
>                / local-path
> 
> auth-path      = [ file-auth ] path-absolute
> 
> local-path     = path-absolute
> 
> file-auth      = "localhost"
>                / host

with the explanation "importing the "host" and "path-absolute" rules from [RFC3986] (as updated by [RFC6874])".

RFC 3986 defines them this way:

> host          = ...
> path-absolute = "/" [ segment-nz *( "/" segment ) ]
> segment       = *pchar
> segment-nz    = 1*pchar
> pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
> pct-encoded   = "%" HEXDIG HEXDIG
> unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
> sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
>               / "*" / "+" / "," / ";" / "="

And finally, ALPHA, DIGIT, and HEXDIG are defined in RFC 2234 in an obvious way.

The scheme [1] is known both in Basic's ConvertToURL, and in Python's uno.systemPathToFileUrl, as mentioned in comment 18, simply because the functions are defined to convert system path to file URL.

The system path characters are converted to characters of the URL's 'file-hier-part'; more precisely, to 'path-absolute', because 'file-auth' is not applicable to system-path-to-URL conversion (the auth information is not available in the system path). Further, I omitted the 'host' part, because the issue did not consider "hosted" system paths, like UNC, so this allows to avoid additional complexity (which doesn't change anything, just clutters the text).

Now let us consider the issue character by character:

> Problem occurs with characters
> !
> $
> &
> '
> (
> )
> *
> +
> ,
> =

All the above are 'sub-delims', explicitly allowed in 'pchar', which constitute both 'segment' and 'segment-nz' of 'path-absolute'.

> /

This one is explicitly shown as the character delimiting segments in 'path-absolute'. Both Linux filesystems, and Windows filesystems use this character as hierarchy delimiter, so whenever it appears in the source path, it gets "converted" to itself in the resulting URL.

> :
> @

These two are allowed explicitly in 'pchar'.

> ConvertToURL correctly encodes the follwing characters:
> <space> 
> "
> {
> |
> }

These are not listed at all among the "very limited set", from which URIs consist (RFC 3986 sect. 1.2.1; Appendix a), so it must be percent-encoded.

> \

This one gets converted to "/" on Windows, since it's a hierarchy separator there; on Linux, it behaves the same as the five above.

> %

This is the character used in 'pct-encoded'; and its conversion is explicitly defined in RFC 3986 sect. 2.4.

> #
> ?
> [
> ]

These are 'gen-delims' that aren't explicitly mentioned as allowed in the 'path-absolute' components.

> ;

And this one is interesting. It is part of 'sub-delims', so in theory, could stay as is. Interestingly, both INetURLObject [2] and osl_getFileURLFromSystemPath [3] (which are used in Basic and Python, respectively), percent-encode it. The most likely reason is that the previous version of the "Uniform Resource Identifiers (URI): Generic Syntax" standard, RFC 2396, didn't allow that character there.

[1] https://www.rfc-editor.org/rfc/rfc8089
[2] https://opengrok.libreoffice.org/xref/core/include/tools/urlobj.hxx?r=485300f9#179
[3] https://opengrok.libreoffice.org/xref/core/include/osl/file.h?r=0ce7c84c#1443