Bug 152492 - Don't spell check URLs within hyperlinks (i.e. spell check only plain words within hyperlinks to avoid of false alarms)
Summary: Don't spell check URLs within hyperlinks (i.e. spell check only plain words w...
Status: VERIFIED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer (show other bugs)
Version:
(earliest affected)
7.5.0.0 alpha0+
Hardware: All All
: medium normal
Assignee: Not Assigned
URL: http://specs.openoffice.org/appwide/l...
Whiteboard: target:7.6.0 target:7.5.0.0.beta2 tar...
Keywords: regression
Depends on:
Blocks:
 
Reported: 2022-12-13 10:54 UTC by NISZ LibreOffice Team
Modified: 2023-10-17 22:19 UTC (History)
1 user (show)

See Also:
Crash report or crash signature:


Attachments
Test document (10.79 KB, application/vnd.oasis.opendocument.text)
2022-12-13 13:55 UTC, ⁨خالد حسني⁩
Details
Screenhost from test document with the proposed patch applied (53.70 KB, image/png)
2022-12-13 13:57 UTC, ⁨خالد حسني⁩
Details

Note You need to log in before you can comment on or make changes to this bug.
Description NISZ LibreOffice Team 2022-12-13 10:54:18 UTC
Description:
In new documents, spell checking started to report
spelling mistakes within the recognized URLs.

Steps to Reproduce:
1. Create in new document.
2. Create URL in the document.

Actual Results:
In new documents, spell checking started to report
spelling mistakes within the recognized URLs.

Expected Results:
Spell checking started to report
spelling mistakes within the recognized URLs.


Reproducible: Always


User Profile Reset: No

Additional Info:
Version: 7.5.0.0.alpha1+ (X86_64) / LibreOffice Community
Build ID: 52c75986adc2b370eb55ce918ab1db0a95831c83
CPU threads: 4; OS: Windows 10.0 Build 19044; UI render: Skia/Raster; VCL: win
Locale: hu-HU (hu_HU); UI: en-US
Calc: CL threaded
Comment 1 László Németh 2022-12-13 12:04:57 UTC
Regression from commit 2cca160f8bfc4597cf0ad3aaaf0017a5210ea0ec
"tdf#126657, tdf#145104: Don’t set language to none on defined
styles".

Suggested fix: https://gerrit.libreoffice.org/c/core/+/143946
Comment 2 ⁨خالد حسني⁩ 2022-12-13 13:55:25 UTC
Created attachment 184128 [details]
Test document

(In reply to László Németh from comment #1)
> Regression from commit 2cca160f8bfc4597cf0ad3aaaf0017a5210ea0ec
> "tdf#126657, tdf#145104: Don’t set language to none on defined
> styles".
> 
> Suggested fix: https://gerrit.libreoffice.org/c/core/+/143946

This change seems to disable spell checking for any hyperlink text, but hyperlink text is not always a URL. This is one of the undesired aspects of the old approach.

On the other hand, URLs that are not hyperlinks gets marked as misspellings.

See the attached document.

IMHO, we probably have some code that can recognize URLs (since Writer knows how to turn a URL into a hyperlink), so may be this code can be reused to recognize URLs and skip spell check for them instead of using character styles.
Comment 3 ⁨خالد حسني⁩ 2022-12-13 13:57:23 UTC
Created attachment 184129 [details]
Screenhost from test document with the proposed patch applied

This a screenshot with the proposed patch applied, notice how the same text is marked a misspelling or not merely based on the character style.
Comment 4 ⁨خالد حسني⁩ 2022-12-13 14:03:09 UTC
Another idea, from the linked OO.o spec document:

> we will only except URLs from spelling already recognized/formatted
> as URLs and if the URL and the text representation are equal.

I think the last part is a good idea, if the hyperlink URL and the Text are the same, then skip spell checking. No idea if this is easier than trying to detect URLs and it will not also handle URLs not formatted as hyperlinks, but might be worth trying if it is simpler/more robust.
Comment 5 László Németh 2022-12-13 14:20:12 UTC
(In reply to خالد حسني from comment #2)
> Created attachment 184128 [details]
> Test document
> 
> (In reply to László Németh from comment #1)
> > Regression from commit 2cca160f8bfc4597cf0ad3aaaf0017a5210ea0ec
> > "tdf#126657, tdf#145104: Don’t set language to none on defined
> > styles".
> > 
> > Suggested fix: https://gerrit.libreoffice.org/c/core/+/143946
> 
> This change seems to disable spell checking for any hyperlink text, but
> hyperlink text is not always a URL. This is one of the undesired aspects of
> the old approach.

You are right, but it's better to avoid of the false alarms (which could be multiple on a single URL), so I started to fix the regression first, but continued on the other problem, too. The issue about the missing spell checking was reported in tdf#45949, and the proposed fix in https://gerrit.libreoffice.org/c/core/+/144044

> 
> On the other hand, URLs that are not hyperlinks gets marked as misspellings.
> 
> See the attached document.
> 
> IMHO, we probably have some code that can recognize URLs (since Writer knows
> how to turn a URL into a hyperlink), so may be this code can be reused to
> recognize URLs and skip spell check for them instead of using character
> styles.

I've made a minimized URL part recognition, using the information that the text was recognized as an URL likely, because the automatic URL recognition doesn't recognize the simplified URLs, e.g. "gerrit.libreoffice.org", i.e. an edited version of the recognized URL. It would be fine to check Impress, too, where it's possible to type a misspelled text, immediately selecting it and adding a hyperlink without red underline/spell checking, i.e. noticing the spelling mistake.

Thanks for your feedback!
Comment 6 László Németh 2022-12-13 14:29:05 UTC
(In reply to خالد حسني from comment #4)
> Another idea, from the linked OO.o spec document:
> 
> > we will only except URLs from spelling already recognized/formatted
> > as URLs and if the URL and the text representation are equal.
> 
> I think the last part is a good idea, if the hyperlink URL and the Text are
> the same, then skip spell checking. No idea if this is easier than trying to
> detect URLs and it will not also handle URLs not formatted as hyperlinks,
> but might be worth trying if it is simpler/more robust.

Indeed. It seems for me, it's not easier, because without modifying the word break algorithm, we must handle something with the multiple bad words of an URL. (Modifying word breaking can result bad page layout, too.)
Also the visible text can be the simplified version of the original URL (but that is easily recognizable, if it's part of the URL, e.g. "gerrit.libreoffice.org" is part of "http://www.gerrit.libreoffice.org/index.html"). We could create an advanced URL/text checking to detect the typos of the visible URL and hyper-reference, and suggest a fix at the right place, but this is more complex task, and the result is questionable.
Comment 7 ⁨خالد حسني⁩ 2022-12-13 14:40:24 UTC
(In reply to László Németh from comment #5)
> (In reply to خالد حسني from comment #2)
> > Created attachment 184128 [details]
> > Test document
> > 
> > (In reply to László Németh from comment #1)
> > > Regression from commit 2cca160f8bfc4597cf0ad3aaaf0017a5210ea0ec
> > > "tdf#126657, tdf#145104: Don’t set language to none on defined
> > > styles".
> > > 
> > > Suggested fix: https://gerrit.libreoffice.org/c/core/+/143946
> > 
> > This change seems to disable spell checking for any hyperlink text, but
> > hyperlink text is not always a URL. This is one of the undesired aspects of
> > the old approach.
> 
> You are right, but it's better to avoid of the false alarms (which could be
> multiple on a single URL), so I started to fix the regression first, but
> continued on the other problem, too. The issue about the missing spell
> checking was reported in tdf#45949, and the proposed fix in
> https://gerrit.libreoffice.org/c/core/+/144044

I just wanted to make sure you are aware of the limitation. Personally, since I misspell text a lot, I’m more concerned about false -ve (missing misspelled words) than false +ve (flaghing text too eagerly), so the current state is an improvement for me, but I’ll let it to your judgement which is the more pressing issue.
Comment 8 Commit Notification 2022-12-13 14:42:02 UTC
László Németh committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/ec8fdffec29de7c80da0c2a440e467c35a297119

tdf#152492 sw: fix unwanted spell checking of parts of URLs

It will be available in 7.6.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 9 László Németh 2022-12-13 14:44:09 UTC
Commit description:

tdf#152492 sw: fix unwanted spell checking of parts of URLs

In new documents, spell checking started to report
spelling mistakes within the recognized URLs.

Regression from commit 2cca160f8bfc4597cf0ad3aaaf0017a5210ea0ec
"tdf#126657, tdf#145104: Don’t set language to none on defined
styles".

Note: skipping spell checking of recognized URLs was specified by
http://specs.openoffice.org/appwide/linguistic/Spellcheckdialog.sxw,
see also https://bz.apache.org/ooo/show_bug.cgi?id=40133.
Unfortunately, the original implementation caused possible
regressions in text layout, so it was reverted in the
previous commit.
Comment 10 László Németh 2022-12-13 14:49:27 UTC
(In reply to خالد حسني from comment #7)
> (In reply to László Németh from comment #5)
> > (In reply to خالد حسني from comment #2)
> > > Created attachment 184128 [details]
> > > Test document
> > > 
> > > (In reply to László Németh from comment #1)
> > > > Regression from commit 2cca160f8bfc4597cf0ad3aaaf0017a5210ea0ec
> > > > "tdf#126657, tdf#145104: Don’t set language to none on defined
> > > > styles".
> > > > 
> > > > Suggested fix: https://gerrit.libreoffice.org/c/core/+/143946
> > > 
> > > This change seems to disable spell checking for any hyperlink text, but
> > > hyperlink text is not always a URL. This is one of the undesired aspects of
> > > the old approach.
> > 
> > You are right, but it's better to avoid of the false alarms (which could be
> > multiple on a single URL), so I started to fix the regression first, but
> > continued on the other problem, too. The issue about the missing spell
> > checking was reported in tdf#45949, and the proposed fix in
> > https://gerrit.libreoffice.org/c/core/+/144044
> 
> I just wanted to make sure you are aware of the limitation. Personally,
> since I misspell text a lot, I’m more concerned about false -ve (missing
> misspelled words) than false +ve (flaghing text too eagerly), so the current
> state is an improvement for me, but I’ll let it to your judgement which is
> the more pressing issue.

Now we have both already
Comment 11 László Németh 2022-12-13 14:52:19 UTC
... solving bug 45949: reporting the bad words in hyperlinks, but not the URLs.
Comment 12 László Németh 2022-12-13 14:55:19 UTC
@Khaled: thanks for fixing 126657, it solved the accessbility issue 39794, too. I planned something similar related to Bug 45949, so it was a surprise me to notice your change (I had to bibisect it to find your commit), and helped to do the right thing. :)
Comment 13 László Németh 2022-12-13 14:59:39 UTC
@Khaled: Possible improvement: add Check URL option to Language Settings->Writing Aids to check skipped URL parts.
Comment 14 László Németh 2022-12-13 15:03:53 UTC
(In reply to László Németh from comment #13)
> @Khaled: Possible improvement: add Check URL option to Language
> Settings->Writing Aids to check skipped URL parts.

@Khaled: sorry, not a request addressed to you, just a TODO item. (I planned to do, but likely not now.)
Comment 15 ⁨خالد حسني⁩ 2022-12-13 16:11:31 UTC
All seems good, thanks László for the fixes!
Comment 16 Commit Notification 2022-12-13 17:47:33 UTC
László Németh committed a patch related to this issue.
It has been pushed to "libreoffice-7-5":

https://git.libreoffice.org/core/commit/cb1531fa9e3030cb630c6b9c4861af47582f4988

tdf#152492 sw: fix unwanted spell checking of parts of URLs

It will be available in 7.5.0.0.beta2.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 17 NISZ LibreOffice Team 2022-12-19 12:18:57 UTC
Verified in:
Version: 7.6.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: ad387d5b984c6666906505d25685065f710ed55d
CPU threads: 4; OS: Windows 10.0 Build 19044; UI render: Skia/Vulkan; VCL: win
Locale: hu-HU (hu_HU); UI: en-US
Calc: CL threaded
Comment 18 Peter Wilkins 2023-02-08 17:58:08 UTC
I was beginning to get quite frustrated with v7.4.4 spell-checking every single URL - an almost mind-numbing experience when spell-checking a long technical document with countless URLs.

As others have commented, URLs are typically cryptic and often include words or phrases that could be interpreted as mispelled when, in fact, they are part of the actual URL - so I am not clear what spell-checking is meant to accomplish.

What might be helpful is a check for dead links which would, presumably, require LO to ping URLs to verify their existance. However this may not work for URLs that require login for access.

I have updated to v7.5.0.3 this problem seems to have gone away :) So, I have a few questions:

1) what spell-checking of URLs (if any) is still being done?
2) if there is still some spell-checking of URLs - when, and to what extent, does that happen?
3) what is the objective of spell-checking URLs (if still being done)?
~~
Comment 19 László Németh 2023-02-09 21:54:08 UTC
(In reply to Peter Wilkins from comment #18)
> I was beginning to get quite frustrated with v7.4.4 spell-checking every
> single URL - an almost mind-numbing experience when spell-checking a long
> technical document with countless URLs.
> 
> As others have commented, URLs are typically cryptic and often include words
> or phrases that could be interpreted as mispelled when, in fact, they are
> part of the actual URL - so I am not clear what spell-checking is meant to
> accomplish.
> 
> What might be helpful is a check for dead links which would, presumably,
> require LO to ping URLs to verify their existance. However this may not work
> for URLs that require login for access.
> 
> I have updated to v7.5.0.3 this problem seems to have gone away :) So, I
> have a few questions:
> 
> 1) what spell-checking of URLs (if any) is still being done?
> 2) if there is still some spell-checking of URLs - when, and to what extent,
> does that happen?
> 3) what is the objective of spell-checking URLs (if still being done)?
> ~~

1) In theory, no spell-checking of URLs, only plain text with hyperlinks (previously also plain text was skipped, see Bug 45949). Now instead of spelling the cryptic parts or skipping the text of the hyperlinks completely, the program  recognizes the cryptic text parts checking their context, i.e. the URI (http:// etc.), the domain name, also the slashes, the encoded acuted letters etc., checking only the normal words of the hyperlinks.
 
Sorry for the previous inaccurate description of this bug report. Now is "Don't spell check URLs within hyperlinks (i.e. spell check only plain words within hyperlinks to avoid of false alarms)")

There is a plan to add a new spelling option(s) to the Tools->Options->Language Settings->Writing Aids to check URLs. Checking the cryptic parts of an URL is mostly unimportant, but storing the domain names in the custom dictionary may be useful. Also making a comparison within the text and the associated URL can be useful to detect a typo, e.g. http://www.example.org and http://ww.example.org (missing "w").

Thanks for your feedback and questions!
Comment 20 Commit Notification 2023-10-17 22:19:56 UTC
Xisco Fauli committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/7acd895d8a6ed16611a20873f3b2216fccef0634

tdf#45949, tdf#152492: move UITest to CppUnittest

It will be available in 24.2.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.