Bug 159607 - RegEx based search should apply to entire document, not just current paragraph
Summary: RegEx based search should apply to entire document, not just current paragraph
Status: ASSIGNED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Documentation
Version: 7.4.6.2 release (earliest affected)
Hardware: Other All
Importance: medium enhancement
Assignee: Not Assigned
URL: https://extensions.libreoffice.org/en...
Whiteboard: target:25.2.0
Keywords:
Depends on:
Blocks: Find&Replace-Regex
Reported: 2024-02-06 22:05 UTC by -t
Modified: 2024-08-24 10:30 UTC (History)
CC List: 5 users

See Also:
Crash report or crash signature:


Attachments
Notepad++ example of "Extended Mode" search (20.75 KB, image/png)
2024-02-06 22:10 UTC, -t
regex can't cross OVER the paragraph boundary (21.25 KB, image/png)
2024-08-24 06:10 UTC, fpy
match strings at the Paragraph bounds, e.g. xyz$ or ^abc (23.65 KB, image/png)
2024-08-24 06:12 UTC, fpy

Description -t 2024-02-06 22:05:26 UTC
Description:
The documentation states "A search using a regular expression will work only within one paragraph. To search using a regular expression in more than one paragraph, do a separate search in each paragraph." THIS IS A BUG, not a feature, and needs to be fixed, since A REGULAR EXPRESSION MAY INCLUDE ONE OR MANY PARAGRAPH BREAKS.  Obviously this is an archaic way of avoiding a bug, or limitation, that prevented the search feature from working correctly when CR/LFs were encountered. 

Steps to Reproduce:
1. Open help file:///C:/Program%20Files/LibreOffice/help/en-US/text/swriter/guide/search_regexp.html?&DbPAR=WRITER&System=WIN
2. Observe tip: "A search using a regular expression will work only within one paragraph. To search using a regular expression in more than one paragraph, do a separate search in each paragraph."
3. Fail, because your intended search "Find" term includes a paragraph break in the middle.

Actual Results:
Search online in vain for a workaround.  Find one AltSearch extension, 5 years out of date, unsupported on current LO version.

Expected Results:
I expect to be able to form a search term that is supported by MS Word and/or Notepad++.


Reproducible: Always


User Profile Reset: No

Additional Info:
It should support searching an entire document even when regular expressions are used. Optimally it should support searching an entire document when using a Boost regular expression engine. At the very least it should support searching an entire document when using BASIC EXTENDED CODES, please see https://npp-user-manual.org/docs/searching/#extended-search-mode for examples. Also see https://github.com/notepad-plus-plus/notepad-plus-plus and  https://extensions.libreoffice.org/en/extensions/show/alternative-dialog-find-replace-for-writer
Comment 1 -t 2024-02-06 22:10:07 UTC
Created attachment 192440 [details]
Notepad++ example of "Extended Mode" search

Explained at https://npp-user-manual.org/docs/searching/#extended-search-mode
Comment 2 -t 2024-02-06 22:13:29 UTC Comment hidden (no-value)
Comment 3 -t 2024-02-06 22:18:41 UTC Comment hidden (no-value)
Comment 4 m_a_riosv 2024-02-07 00:35:01 UTC
Please don't set your own reports to NEW; someone else must do it, unless you are going to fix it yourself.
Comment 5 fpy 2024-08-15 07:24:37 UTC
(In reply to m_a_riosv from comment #4)
>  some[one] else 

well, can definitely "confirm" the problem, indeed far from new ...
e.g.
https://ask.libreoffice.org/t/find-replace-including-a-paragraph-mark/1390/9
Comment 6 V Stuart Foote 2024-08-15 15:33:07 UTC
Actually the help article (par_id3153414) is not accurate. 

Calc's ICU lib regexp search/replace is a bit less "global" than ICU regexp searches in Writer. And there is also a "match" mode for wildcards and interoperability with Excel XLS and XLSX sheet formats. [1][2][3]

In other words, full document search and replace with ICU lib regular expressions *already* works by default in all modules, with the F&R dialog offering an "off" mode toggle in Writer or Calc.

The "Replace" field of the F&R dialog, however, does not directly execute the regexp.

The comment "will work only within one paragraph..." in the help article (par_id3153414)  should have been removed when Wildcard content was reworked for bug 142574 [4]. 

A broader rework of regexp (more seamless use) is covered in bug 38261.

IMHO this bug reports a documentation issue.

=-ref-=
[1] https://books.libreoffice.org/en/CG24/CG2402-EnteringandEditingData.html#toc72
[2] https://books.libreoffice.org/en/WG75/WG7503-TextAdvanced.html#toc15
[3] https://help.libreoffice.org/24.8/en-US/text/swriter/guide/search_regexp.html?DbPAR=WRITER
[4] https://gerrit.libreoffice.org/c/help/+/120573
Comment 7 Mike Kaganski 2024-08-15 15:57:35 UTC
(In reply to V Stuart Foote from comment #6)

I don't see how this "works". The problem is that it can't match anything across paragraphs' bounds, like "two last characters of paragraph and three first characters of the next paragraph". This is not a documentation issue; this is not "INVALID". But I'm sure we have this filed somewhere already.
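
The limitation described in this comment can be modelled outside LibreOffice. The following is a minimal Python sketch, assuming a document is simply a list of paragraph strings and that the search is applied to each paragraph separately. It uses Python's `re` module rather than ICU, and the names `search_per_paragraph` and `paragraphs` are hypothetical, not part of any LibreOffice API.

```python
import re

def search_per_paragraph(pattern, paragraphs):
    """Apply the pattern to each paragraph on its own (assumed model of the behaviour)."""
    regex = re.compile(pattern)
    hits = []
    for index, text in enumerate(paragraphs):
        for match in regex.finditer(text):
            hits.append((index, match.group(0)))
    return hits

paragraphs = ["first paragraph", "second paragraph"]

# A pattern that stays inside a paragraph is found in every paragraph in one pass.
inside = search_per_paragraph(r"paragraph", paragraphs)

# "Two last characters of a paragraph and three first characters of the next":
# "ph", a break, then "sec". No single paragraph contains that, so nothing matches.
across = search_per_paragraph(r"ph\nsec", paragraphs)

print(inside)
print(across)
```

In this model the first search reports one hit per paragraph, while the second reports none, because the text on either side of the boundary is never presented to the regex engine as one string.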
Comment 8 V Stuart Foote 2024-08-15 16:22:09 UTC
OK, we don't/can't test "paragraph" breaks with regexp because there are none (that ICU can parse) and IIRC we would have to refactor to support the regexp look-behind/look-ahead modes.

Otherwise, full document runs can be edited with regexp, just not in a single pass.

But it is trivial to replace all paragraph ends (our $, as represented by the Pilcrow glyph) with a marker, then parse the entire document text run, and after any changes restore the structure by replacing the "marker" with a new paragraph ('\n') to recreate the paragraph breaks.

Obviously can have issues with a heavily styled document.
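
The marker workaround above can be sketched in plain Python. This is only an illustration under stated assumptions: paragraphs are modelled as a list of unstyled strings, `MARKER` and `replace_across_paragraphs` are hypothetical names, and Python's `re` module stands in for ICU. It does not call any LibreOffice API, and since the strings carry no formatting it also shows why a heavily styled document could suffer.

```python
import re

MARKER = "\u241E"  # hypothetical marker character; must not occur in the document text

def replace_across_paragraphs(paragraphs, pattern, replacement):
    """Join paragraphs with a marker, run one regex pass, then split again."""
    if any(MARKER in p for p in paragraphs):
        raise ValueError("marker already occurs in the text")
    # Step 1: replace every paragraph end with the marker, giving one text run.
    run = MARKER.join(paragraphs)
    # Step 2: run the regex over the whole run; the pattern names the
    # paragraph break explicitly through the marker.
    run = re.sub(pattern, replacement, run)
    # Step 3: turn the remaining markers back into paragraph breaks.
    return run.split(MARKER)

paragraphs = ["ends with xyz", "abc starts here", "unrelated"]
pattern = "xyz" + re.escape(MARKER) + "abc"
result = replace_across_paragraphs(paragraphs, pattern, "xyz abc")
print(result)
```

Here the match consumes the marker between the first two paragraphs, so they are merged and the third paragraph is left untouched.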
Comment 9 V Stuart Foote 2024-08-15 16:23:16 UTC
And sorry, meant to leave it NEW against documentation, not Resolved Invalid...
Comment 10 fpy 2024-08-16 20:05:02 UTC
(In reply to V Stuart Foote from comment #6)
> Actually the help article (par_id3153414) is not accurate. 

fixed here https://gerrit.libreoffice.org/c/help/+/171538
feel free to review ;)

> A broader rework of regexp (more seamless use) is in see also bug 38261. 

trying to, progressively.
Comment 11 Commit Notification 2024-08-23 20:06:18 UTC
Pierre F committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/help/commit/e3caa53e99709b7099611b67cf73e9bdbd8801ea

more (simple) regex examples + fix note on paragraph limitation. tdf#38261, tdf#159607
Comment 12 V Stuart Foote 2024-08-23 21:07:12 UTC
(In reply to Commit Notification from comment #11)
> Pierre F committed a patch related to this issue.
> It has been pushed to "master":
> 
> https://git.libreoffice.org/help/commit/
> e3caa53e99709b7099611b67cf73e9bdbd8801ea
> 
> more (simple) regex examples + fix note on paragraph limitation. tdf#38261,
> tdf#159607

Thanks Pierre, documentation and examples getting better, but not sure we yet have quite the correct wording about Regexp matches and Paragraphs.

"A search using a regular expression will work only within one paragraph. That is, a \n will match a line break within a paragraph."

We can in fact match a string in *every* paragraph of a document in one pass.

We just can't match strings at the Paragraph bounds, e.g. a Paragraph ending with xyz$ or starting with ^abc (we would need to implement support for look-ahead / look-behind syntax to be able to construct a pattern to match).
Comment 13 fpy 2024-08-24 06:07:45 UTC
(In reply to V Stuart Foote from comment #12)

thanks Stuart for the feedback, 

> ... not sure we
> yet have quite the correct wording about Regexp matches and Paragraphs.
> 
> "A search using a regular expression will work only within one paragraph.
> That is, a \n will match a line break within a paragraph."
> 
> We can in fact match a string in *every* paragraph of a document in one pass.

yes. what I'm trying to say, is a regex can't cross OVER the paragraph boundary (see attachment)
feel free to suggest a better wording of course.

> We just can't match strings at the Paragraph bounds, e.g. a Paragraph ending
> with xyz$ or starting with ^abc 

huh? we definitely can! (see 2nd attachment)
the only limitation is for "^" to be followed by something, example given in the help.

> (needing to implement support look-ahead /
> look-behind syntax to be able to construct a pattern to match).




PS. this bugzilla is a pain for simple editing. need to upgrade or move to Ask! :/
Comment 14 fpy 2024-08-24 06:10:57 UTC
Created attachment 195994 [details]
regex can't cross OVER the paragraph boundary
Comment 15 fpy 2024-08-24 06:12:01 UTC
Created attachment 195995 [details]
match strings at the Paragraph bounds, e.g.  xyz$ or  ^abc
Comment 16 V Stuart Foote 2024-08-24 10:23:54 UTC
(In reply to fpy from comment #13)
> 
> yes. what I'm trying to say, is a regex can't cross OVER the paragraph
> boundary (see attachment)
> feel free to suggest a better wording of course.
> 

Yes, exactly.

> > We just can't match strings at the Paragraph bounds, e.g. a Paragraph ending
> > with xyz$ or starting with ^abc 
> 
> huh? we definitely can! see 2nd attachment)
> the only limitation is for "^" to be followed by something, example given in
> the help.
> 

Sorry, a fingerflub. Knew that was not perfect after I sent it, thought about submitting "s/xyz$ or starting with ^abc/xyz$.*^abc/" correction to merge the strings. But... BZ

> 
> PS. this bugzilla is a pain for simple editing. need to upgrade or move to
> Ask! :/

Yep BZ can be tedious/unforgiving, but not sure Ask or SE style would be any sort of improvement for organizing issues that BZ does well.

Thanks for working on this.
Comment 17 V Stuart Foote 2024-08-24 10:30:54 UTC
(In reply to V Stuart Foote from comment #16)
> > yes. what I'm trying to say, is a regex can't cross OVER the paragraph
> > boundary (see attachment)
> > feel free to suggest a better wording of course.
> > 
> 
> Yes, exactly.

And that needs to be the guidance in the Help articles and user guide: that the regexp pattern can't match OVER|ACROSS bounds between paragraphs (currently, without the needed look-behind/look-ahead implementation).
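
The guidance agreed on in this thread can be illustrated with a small Python model. It is a sketch only, assuming each paragraph is searched as an independent string; `find_in_paragraphs` is a hypothetical helper, and Python's `re` module is used in place of ICU, so details of LibreOffice's own behaviour may differ.

```python
import re

def find_in_paragraphs(pattern, paragraphs):
    """Return (paragraph index, matched text) pairs, searching each paragraph alone."""
    regex = re.compile(pattern)
    return [(i, m.group(0))
            for i, p in enumerate(paragraphs)
            for m in regex.finditer(p)]

paragraphs = ["some text xyz", "abc more text"]

# Anchored at a paragraph end: matches within the first paragraph.
ends = find_in_paragraphs(r"xyz$", paragraphs)

# Anchored at a paragraph start: matches within the second paragraph.
starts = find_in_paragraphs(r"^abc", paragraphs)

# The merged pattern from comment 16 would have to cross the boundary,
# so it finds nothing when paragraphs are searched one at a time.
merged = find_in_paragraphs(r"xyz$.*^abc", paragraphs)

print(ends, starts, merged)
```

The first two searches succeed because each anchor is satisfied inside a single paragraph; the third cannot, because no single paragraph contains both ends of the pattern.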