Bug 91192 - AutoCorrect: Writer not recognizing a URL's trailing carat, hash mark, question mark, backslash, or pipe
Status: REOPENED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 4.2.8.2 release
Hardware: x86-64 (AMD64) All
Importance: medium major
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: AutoCorrect-Complete
Reported: 2015-05-09 20:38 UTC by Nick Levinson
Modified: 2021-03-19 04:17 UTC (History)
CC List: 9 users

See Also:
Crash report or crash signature:


Attachments
screnshoot (226.26 KB, image/png)
2015-05-09 23:10 UTC, m.a.riosv
examples of URLs including how tooltip interprets one URL as without trailing character (134.43 KB, image/png)
2019-12-14 20:00 UTC, Nick Levinson

Description Nick Levinson 2015-05-09 20:38:59 UTC
Writer usually auto-formats a URL differently from the surrounding text, as a hyperlink. However, if the URL ends with a caret, a hash mark, a question mark, or a backslash, Writer treats that character as not part of the URL. It should treat it as part of the URL. Here's one such URL, as accessed May 7, 2015:

http://www.llbean.com/llb/shop/504714?gnrefine=1*WARMTH_RATING*Warmest^

In this case, the URL with or without the caret gets the intended page, although one is without a certain blank space. Nonetheless, we should not assume that a character is disposable.

Carets in URLs are rare. I don't know if they're technically improper, but I doubt it.

Here's one with a hash mark: http://www.example.com/#

URLs ending in hash marks are becoming common. Although the hash mark is officially meant for a fragment identifier, apparently Ajax programming supports a URL ending with a hash mark.

The question mark is officially reserved for query strings. I don't know if a URL ending with a question mark exists in the wild, but it's probably safer to assume that it does.

The backslash is found in Windows locations but I don't know if it's barred from URLs. In case it's not, it should be accepted as part of a URL.

Even if these characters are improper as URL characters, Writer should not make that judgment, except for spaces and for less-than and greater-than symbols. That judgment should be left to website servers and domain name servers. The history of the hash mark supports that point.

If a newer Writer version has solved this, that's good enough.

I'm guessing at my hardware in the Hardware field.
Comment 1 m.a.riosv 2015-05-09 23:10:27 UTC
Created attachment 115478
screnshoot

Hi Nick,

I don't think I can reproduce this on Windows.

Maybe it depends on some AutoCorrect option.
Comment 2 Nick Levinson 2015-05-12 00:38:19 UTC
You fell right into the trap I was implying was present. If only part of a URL is linked, a user clicking on the link gets a wrong result or no result. In your test, the original URL had a trailing caret ("^") which you didn't copy. Please try it again, but with the trailing caret (above the "6" on U.S. keyboards).

By the way, you're using an earlier Writer version. If you don't have the problem and my 4.2.8.2 does, then something was changed later that made Writer worse, and the later version must be fixed even if yours is unaffected (unless it was already fixed in a version later than either of us apparently has).

The pipe character ("|") causes the same problem in Writer. Example: <https://bugzilla.mozilla.org/enter_bug.cgi#h=bugForm|Bugzilla>.
Comment 3 m.a.riosv 2015-05-12 20:14:08 UTC
I can reproduce now.

But using Menu/Insert/Hyperlink works fine for me.
Comment 4 Nick Levinson 2015-05-16 19:22:37 UTC
That's more complicated and not a friendly solution. Insert > Hyperlink appears to be meant for cases where you wish to type a text (perhaps a hyperlink but not necessarily) but have it link to something that might be different, such as to have "Pat's home page" or "example.net" as the link text that should link to "https://example.com". This bug report is about typing a URL in the text and having it unintentionally link somewhere else without the user noticing it, thus misleading whoever clicks on the link text later.
Comment 5 m.a.riosv 2015-05-17 12:25:26 UTC
The report was marked as NEW with comment 3, where I mention another way to do it. Wow, really complicated: [Ctrl+K] [Ctrl+V] [Enter]
Comment 6 Nick Levinson 2015-05-17 20:23:25 UTC
Yes, it is. Geeks can use a command line interface (CLI) and may not need a gooey (GUI). Nongeeks are people who want to get something done and expect the computer to just do it. They often don't know the difference between RAM and a hard drive and don't want to know. The process in comments 3/5 would be explained to a nongeek something like this: After you type the URL, you select it manually (we're talking about URLs that fail to be fully selected when done automatically), copy (ctrl-c) the selection, type ctrl-k, type ctrl-v, and press the Enter key. And most nongeeks are intimidated by unfamiliar control commands, so what you'd usually have to say is something like this: After you type the URL, you select it manually (ignoring how it was selected when done automatically), copy the selection, open the Insert menu, select the Hyperlink command ("what's that?" "it's what you click on to go to a website"), paste, and press the Enter key. And they'd have to remember that for every URL they type that didn't fully link, a failure they'd have to recognize each time, and you missed one when you were looking for it.

It would be a lot friendlier to users if the automatic linking of a string followed the rules about what characters can be in a URL. That would normally be any except the space, the opening angle bracket, and the closing angle bracket. (Domains have more limited rules but URLs don't, and if a subdomain violates the domain label rules a name server might allow it anyway, so LibreOffice should allow it, too.) Then users can go about being productive without getting stuck on the constructs in a module.
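The rule proposed above can be sketched in a few lines. This is only an illustration of the proposal, not LibreOffice's code: the pattern name `PERMISSIVE_URL`, the helper `find_urls`, and the restriction to the http, https, and ftp schemes are all assumptions made for the example.

```python
import re

# Hypothetical pattern for the proposed rule: after the scheme, accept every
# character except whitespace and the two angle brackets.
PERMISSIVE_URL = re.compile(r"(?:https?|ftp)://[^\s<>]+")


def find_urls(text):
    """Return every substring that the permissive rule would hyperlink."""
    return PERMISSIVE_URL.findall(text)


print(find_urls("See http://www.example.com/# and <https://example.com/a|b> now"))
# ['http://www.example.com/#', 'https://example.com/a|b']
```

Under this rule the trailing hash mark and the embedded pipe both stay inside the link, while the angle brackets act as boundaries.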

Thanks for the Status setting.
Comment 7 QA Administrators 2017-09-01 11:17:42 UTC Comment hidden (obsolete)
Comment 8 QA Administrators 2019-12-03 14:38:17 UTC Comment hidden (obsolete)
Comment 9 m.a.riosv 2019-12-06 01:26:41 UTC
Works for me.
Version: 6.5.0.0.alpha0+ (x64)
Build ID: 60e8941fd581bb06cbf6be62edb8c387e7c07812
CPU threads: 4; OS: Windows 10.0 Build 19035; UI render: default; VCL: win; 
Locale: es-ES (es_ES); UI-Language: en-US
Calc: CL
Comment 10 Nick Levinson 2019-12-14 20:00:11 UTC
Created attachment 156589
examples of URLs including how tooltip interprets one URL as without trailing character
Comment 11 Nick Levinson 2019-12-14 20:17:38 UTC
This still fails for all five characters when each one is trailing in a URL and the URL was typed or pasted directly into a document, as most users would do. Most users would not use the obscure kludge Insert > Hyperlink if they don't need a link text that differs from the link destination. They wouldn't even know they should, since they would simply type the URL and, if they're not geeks, probably wouldn't notice that the link is incomplete. Most people don't proofread character by character.

The screenshot I've uploaded not only shows the links as incomplete but shows the tooltip giving a URL and the URL in the tooltip is incomplete.

I run Fedora 31 Linux, kept evergreen, and LO was part of an update in the last week or so, yet I don't have the version cited in comment 9. The version cited in comment 9 was not available at the first link in comment 8. If that unavailable version works, what matters is what's available to the public. What is available as essentially stable through the first link in comment 8 is older than mine, thus irrelevant. The slightly newer version is evidently not meant for the general nongeek public. Therefore, I'm changing the status, because worksforme is inappropriate for a version cited in comment 9 but not available to the public.

Info from the About LibreOffice dialog in Writer as installed on my platform and updated by Fedora and still having the problem in this bug report:
Version: 6.3.3.2.0+
Build ID: 6.3.3.2-7.fc31
CPU threads: 2; OS: Linux 5.3; UI render: default; VCL: gtk3; 
Locale: en-US (en_US.UTF-8); UI-Language: en-US
Calc: threaded

I tried correcting the bug's Summary to include the pipe, but I'll try again.
Comment 12 sdc.blanco 2021-01-31 23:49:51 UTC
With AutoCorrect "URL Recognition" [T] and Tools > AutoCorrect > While Typing enabled.

Can reproduce all examples shown in attachment 156589 using 7.2.0.0.alpha0+

Additional Information:

1. If additional text follows # or ?, then there is URL recognition

http://example.com/directory#testing  
http://example.com/directory?testing  

Both these examples are recognized as URLs.

2. For ^ | \

URL conversion stops with these characters, even if additional text is appended to them. 

e.g.,  http://example.com/directory^testing  (URL stops at 'y' in directory)
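The two observations above can be summarized as a rough model. This is a sketch of the reported behaviour only, not the actual recognition code; the function name `observed_link_end` is hypothetical, and the assumption that exactly one trailing # or ? is dropped goes beyond what was tested here.

```python
def observed_link_end(url):
    """Approximate which prefix of `url` ends up hyperlinked, per the observations above."""
    # Observation 2: recognition stops at the first ^, | or \ even if text follows.
    for i, ch in enumerate(url):
        if ch in "^|\\":
            url = url[:i]
            break
    # Observation 1: a trailing # or ? is left out, but it is kept when text follows it.
    # Assumption: only a single trailing character is dropped.
    if url.endswith(("#", "?")):
        url = url[:-1]
    return url


print(observed_link_end("http://example.com/directory^testing"))  # http://example.com/directory
print(observed_link_end("http://example.com/directory#testing"))  # http://example.com/directory#testing
print(observed_link_end("http://www.example.com/#"))              # http://www.example.com/
```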


Asking for UXEval:  Two questions.

1.  Is it considered a "bug" that a potential URL ending with # (or ?) does not include the # (or ?) in the URL recognition?

(but, as noted, no problem if text follows # or ? )

2.  Is it a problem that the three characters:  ^ | \ are not recognized as part of a URL (and URL recognition stops with these characters)?

Relevant to note that these three characters are considered "unsafe" and should have percent-encoding ( https://www.ietf.org/rfc/rfc1738.txt )

Could consider an enhancement request to percent-encode ^ | \ as part of URL Recognition.
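For reference, percent-encoding those three characters looks like this. It is a minimal sketch using Python's standard `urllib.parse.quote`; the helper name `encode_unsafe` and the constant `UNSAFE` are made up for the example, and nothing here reflects how LibreOffice would implement it.

```python
from urllib.parse import quote

# The three characters discussed above: caret, pipe, backslash.
UNSAFE = "^|\\"


def encode_unsafe(url):
    """Percent-encode only ^, | and \\, leaving every other character untouched."""
    return "".join(quote(ch, safe="") if ch in UNSAFE else ch for ch in url)


print(encode_unsafe("http://example.com/directory^testing"))
# http://example.com/directory%5Etesting
print(encode_unsafe("https://bugzilla.mozilla.org/enter_bug.cgi#h=bugForm|Bugzilla"))
# https://bugzilla.mozilla.org/enter_bug.cgi#h=bugForm%7CBugzilla
```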
Comment 13 Heiko Tietze 2021-02-01 11:37:34 UTC
With the pipe breaking the hyperlink it is clearly a bug to me. Not sure if all the other characters are proper in URLs, but why should LibreOffice guard the web? I would take everything up to the next whitespace (all kinds of spaces, tab, CR) into the URL.

> Relevant to note that these three characters are considered "unsafe" and
> should have percent-encoding ( https://www.ietf.org/rfc/rfc1738.txt )

Adding people with more expertise to get opinions. Btw, the issue is also relevant for LibreOffice Online.
Comment 14 Guilhem Moulin 2021-02-01 11:59:27 UTC
(In reply to Heiko Tietze from comment #13)
> With the pipe breaking the hyperlink it is clearly a bug to me.

I tend to disagree, any compliant URL-parser would stop there as well.

> Not sure if all the other characters are proper in URLs, but why should LibreOffice guard the web? I would take everything up to the next whitespace (all kinds of spaces, tab, CR) into the URL.

How about URLs enclosed in parentheses, square/angle brackets, or even pipes?  Formatted URLs shouldn't include the markers.  How about punctuation following a URL?  It makes sense to parse URLs greedily following RFC 3986/3987 and stick to that, IMHO.
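To make the concern concrete, here is what a purely whitespace-delimited rule would pick up from ordinary prose. The pattern `WHITESPACE_DELIMITED` and the helper `greedy` are illustrative names only, and the http/https restriction is an assumption for the example.

```python
import re

# Hypothetical "everything up to the next whitespace" rule.
WHITESPACE_DELIMITED = re.compile(r"https?://\S+")


def greedy(text):
    """Return what a whitespace-delimited rule would turn into links."""
    return WHITESPACE_DELIMITED.findall(text)


print(greedy("(see http://example.com/page), or |http://example.com/a| instead."))
# ['http://example.com/page),', 'http://example.com/a|']
```

The closing parenthesis, the comma, and the closing pipe all end up inside the links, which is the false-positive risk raised here.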
Comment 15 Nick Levinson 2021-02-02 23:09:58 UTC
The backslash should be accepted for another reason: If I type a URL with an incorrect backslash directly into certain browsers, the browser changes the incorrect backslash into a correct slash. Examples in Firefox 84.0.2 (64-bit): http:\\slashsleep.com loads http://slashsleep.com/ and http://slashsleep.com\3\you-will-sleep\1\sleep-always-wins.html loads http://slashsleep.com/3/you-will-sleep/1/sleep-always-wins.html (that's my website and I don't have an alias or redirection set up for the backslashes so either the browser or the hosting server is doing it for all URLs).
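One way such a rewrite could work is sketched below. This is a guess at the kind of normalization involved, not a description of what Firefox or the hosting server actually does; the function name `normalize_backslashes` and the choice to leave the query and fragment untouched are assumptions.

```python
def normalize_backslashes(url):
    """Turn backslashes into slashes in the part of `url` before any ? or #."""
    # Find where the query or fragment starts, if there is one.
    cut = len(url)
    for mark in "?#":
        pos = url.find(mark)
        if pos != -1:
            cut = min(cut, pos)
    # Assumption: only the scheme/host/path part is rewritten.
    return url[:cut].replace("\\", "/") + url[cut:]


print(normalize_backslashes(r"http:\\slashsleep.com"))
# http://slashsleep.com
print(normalize_backslashes(r"http://slashsleep.com\3\you-will-sleep\1\sleep-always-wins.html"))
# http://slashsleep.com/3/you-will-sleep/1/sleep-always-wins.html
```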

This fails but shouldn't: http://example.com?age=293 . However, this is properly hyperlinked in LO Writer: https://example.com/?age=293 . The sole difference is in the slash after the TLD; I'm not sure if a server could be configured to accept the slashless version, so LO should hyperlink it, just in case.

I favor recognizing characters that are questionable in URLs on the same principle that early on applied to emailing: be strict in what you send but generous in what you accept. LO should generously recognize a typist's text as a URL with the boundaries being spaces or angle brackets. The worst that can happen is failing to arrive at the URL when clicked and even that can be corrected in the browser's address bar, which is easier for nongeeks than figuring out what should have been in the URL in the LO document. This example uses a nonexistent TLD and yet is generously hyperlinked as a URL by LO Writer: http://google.quibble

Parentheses, square brackets, and pipes (unfamiliar to me as a URL boundary but here accepted arguendo) can be identified as URL boundaries if they appear spacelessly both before and after the string that otherwise is a URL. Examples: (example.com), [ftp://example.com], and |example.com| . However, spacelessness must be at both ends; if it's at only one end, I don't know exactly what should be hyperlinked.
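A minimal sketch of that both-ends rule, applied to one whitespace-delimited token at a time. The names `PAIRS` and `strip_paired_delimiters` are hypothetical, and the one-ended case is simply left unchanged because the right behaviour there is undecided.

```python
# Opening delimiter -> closing delimiter that must appear at the other end.
PAIRS = {"(": ")", "[": "]", "<": ">", "|": "|"}


def strip_paired_delimiters(token):
    """Drop enclosing delimiters only when a matching pair wraps the whole token."""
    if len(token) >= 2 and PAIRS.get(token[0]) == token[-1]:
        return token[1:-1]
    # A delimiter at only one end is left alone.
    return token


for token in ["(example.com)", "[ftp://example.com]", "|example.com|", "(example.com"]:
    print(token, "->", strip_paired_delimiters(token))
```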

Angle brackets are already known to be boundaries. While <http://example.com> properly hyperlinks in LO without hyperlinking the angle brackets, <example.com> does not hyperlink in LO, but should.

A comma following a URL's directory, file, query, fragment, or slash should be treated as part of the URL because the host's server might recognize it. But a comma-and-space following an apparent TLD should be treated as not part of the URL, although it's too burdensome to have LO check if a domain label is a known or actually proposed TLD listed at iana.org or icann.org.

If a URL ends with a TLD, it may be followed by a period or not without changing the URL. (I forgot which RFC says so.)
Comment 16 Guilhem Moulin 2021-02-02 23:39:58 UTC
My 2¢: Browsers have it much easier, since whatever is entered into the URL bar is assumed to be a URL: the browser can apply whatever heuristics to turn the *entire string* into a valid RFC-compliant URL.  Trying to mimic that logic in LO, with unclear boundaries and arbitrary text, will lead to false positives.  But what do I know.  Removing myself from CC since I'm not involved in the decision nor implementation.
Comment 17 Stephan Bergmann 2021-02-04 17:52:08 UTC
The code that guesses which part of a larger text shall be auto-detected as a URI is URIHelper::FindFirstURLInText (svl/source/misc/urihelper.cxx, containing detailed documentation).  Of necessity, it needs to apply some heuristics, and, also of necessity, the algorithm's outcome will not necessarily match any given user's exact expectations.  That said:

(In reply to sdc.blanco from comment #12)
> Asking for UXEval:  Two questions.
> 
> 1.  Is it considered a "bug" that a potential URL ending with # (or ?) does
> not include the # (or ?) in the URL recognition?
> 
> (but, as noted, no problem if text follows # or ? )

Especially with "?" (and similar to e.g. "," and "."), the heuristics conservatively try to avoid including trailing punctuation (for which it is assumed that it was not meant to be part of the URI).

> 2.  Is it a problem that the three characters:  ^ | \ are not recognized as
> part of a URL (and URL recognition stops with these characters)?
> 
> Relevant to note that these three characters are considered "unsafe" and
> should have percent-encoding ( https://www.ietf.org/rfc/rfc1738.txt )

That's not a "should" but a "must".  None of those three characters can appear in a URI as-is; they always need to be percent-encoded.  In general, the heuristics used do not treat a character that cannot appear in a URI as part of a to-be-detected URI.
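The character-set point can be checked against the RFC 3986 grammar directly. The sketch below only tests which characters are allowed to appear unencoded (unreserved, gen-delims, sub-delims, and % for percent-encoding); it does not validate URI structure, and it is not the logic in urihelper.cxx. The helper name `disallowed_chars` is made up for the example.

```python
import string

# Character classes from RFC 3986, plus "%" for percent-encoded octets.
UNRESERVED = string.ascii_letters + string.digits + "-._~"
GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="
URI_CHARS = set(UNRESERVED + GEN_DELIMS + SUB_DELIMS + "%")


def disallowed_chars(candidate):
    """Return, in order of first appearance, the characters that may not appear unencoded."""
    seen = []
    for ch in candidate:
        if ch not in URI_CHARS and ch not in seen:
            seen.append(ch)
    return seen


print(disallowed_chars("http://www.llbean.com/llb/shop/504714?gnrefine=1*WARMTH_RATING*Warmest^"))
# ['^']
print(disallowed_chars("https://bugzilla.mozilla.org/enter_bug.cgi#h=bugForm|Bugzilla"))
# ['|']
```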
Comment 18 Heiko Tietze 2021-02-05 08:46:41 UTC
So the question mark should be included in the algorithm but the hash mark is attributed unsafe by the RFC. Anything else is a must not.

Users probably do not understand the algorithm easily (but know the "workaround"). The documentation should explain what happens and why.