Bug 137737 - REPLACE not working if there is a footnote or endnote anchor at end of paragraph
Summary: REPLACE not working if there is a footnote or endnote anchor at end of paragraph
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer (show other bugs)
Version:
(earliest affected)
7.0.0.3 release
Hardware: All All
: medium normal
Assignee: Not Assigned
URL:
Whiteboard: target:7.3.0 target:7.2.1
Keywords: bibisected, bisected, regression
Depends on:
Blocks: Find-Search
  Show dependency treegraph
 
Reported: 2020-10-25 16:17 UTC by R. Green
Modified: 2021-09-22 10:38 UTC (History)
9 users (show)

See Also:
Crash report or crash signature:


Attachments
Writer file showing problem with replace function (120.11 KB, application/vnd.oasis.opendocument.text)
2020-10-25 16:17 UTC, R. Green
Details

Note You need to log in before you can comment on or make changes to this bug.
Description R. Green 2020-10-25 16:17:12 UTC
Created attachment 166701 [details]
Writer file showing problem with replace function

Version: 7.0.0.3 (x64)
Build ID: 8061b3e9204bef6b321a21033174034a5e2ea88e
CPU threads: 4; OS: Windows 10.0 Build 19041; UI render: Skia/Raster; VCL: win
Locale: en-GB (en_GB); UI: en-GB
Calc: threaded

Open the attached file.

1. With the cursor at the start of the document, open the Find & Replace dialog.
2. Tick "Regular Expressions". Put "$ (quotation dollar) in "Find" and ” (right double quotation mark: Alt + 0148) in "Replace". Press "Find Next".
3. The quotation mark at the end of paragraph 1 is selected. Now press "Replace".

EXPECTED RESULT: The quotation mark is replaced by a curly right quotation mark.
ACTUAL RESULT: The selection moves to the next instance without replacing anything.

AFAICS, this bug occurs when ANY (looked-for) character is followed by a footnote or endnote anchor at the end of a paragraph.
Comment 1 pavlog 2020-11-15 19:25:57 UTC
Thank you for reporting the bug. I can confirm that the bug is present in 
Version: 7.0.3.1 (x64)
Build ID: d7547858d014d4cf69878db179d326fc3483e082
CPU threads: 4; OS: Windows 10.0 Build 18363; UI render: Skia/Raster; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: CL

As I see it appears with any other symbol in Replace field, not only right double quotation mark. 
Also if in Step 3 click Replace All instead of Replace it works right.
Comment 2 Buovjaga 2020-11-18 10:04:37 UTC
Bibisected with linux64-7.0 to
https://git.libreoffice.org/core/commit/0cd3b7926cafc01d06b589124215e9cb7c148f19
tdf#65038: use trailing string characters for look-ahead assertions

Adding Cc: to Mike Kaganski
Comment 3 Justin L 2021-08-12 11:00:01 UTC
This is fairly easily resolved with
-  if (pRegexMatcher && endPos < searchStr.getLength())
+  if (pRegexMatcher && endPos < searchStr.getLength() && !aSrchPara.searchString.endsWith("$"))

and similar for startsWith("^").

However, making a unit test will be the hardest part. Comment 1 indicates that a replaceAll works, so copying Mike's test code won't work. I'm also puzzled why it WOULD work in the first place.
Comment 4 Justin L 2021-08-12 19:41:01 UTC
proposed fix at http://gerrit.libreoffice.org/c/core/+/120410
Comment 5 Commit Notification 2021-08-25 07:30:00 UTC Comment hidden (obsolete)
Comment 6 Commit Notification 2021-08-25 08:52:38 UTC Comment hidden (obsolete)
Comment 7 Buovjaga 2021-08-25 10:11:06 UTC Comment hidden (obsolete)
Comment 8 R. Green 2021-08-25 11:26:08 UTC Comment hidden (obsolete)
Comment 9 Buovjaga 2021-08-25 11:27:57 UTC Comment hidden (obsolete)
Comment 10 R. Green 2021-08-25 11:31:48 UTC Comment hidden (obsolete)
Comment 11 Mike Kaganski 2021-08-25 15:07:12 UTC
(In reply to Commit Notification from comment #5)

This introduces a regression.

1. Create a lorem ipsum text.
2. Select any random part inside any paragraph.
3. Using F&R, use regex "^." (any character in the beginning of the paragraph), ad replace with "A" (character A).
4. Use "Current selection only"
5. Replace All.

The expected result is that the text is not found (unless we also selected the very first character of the paragraph, in which case the very first character of the paragraph should be replaced). The actual result now is that all the selected text is replaced with AAAAA...
Comment 12 Xisco Faulí 2021-08-25 18:34:09 UTC
(In reply to Mike Kaganski from comment #11)
> (In reply to Commit Notification from comment #5)
> 
> This introduces a regression.

Reported in bug so we don't mix bug ids, for future reference, etc...
Comment 13 Commit Notification 2021-08-25 22:40:59 UTC
Xisco Fauli committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/e73c3942de474004b04c1d538c89354183d44ae9

tdf#144089: Revert "tdf#137737 i18n search: don't expand start/end with regex ^ or $"

It will be available in 7.3.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 14 Justin L 2021-08-26 06:24:15 UTC
(In reply to Mike Kaganski from comment #11)
> 3. Using F&R, use regex "^." (any character in the beginning of the
> paragraph), ad replace with "A" (character A).
I don't think ^ specifically means "beginning of paragraph" in every context. I would expect it to be the beginning of the search string. So if only a selection of text was chosen, it would refer to the beginning of the selection (and if the selection contained multiple lines[aka paragraphs], it would also match the beginning of these other lines).

But your regression steps make it plain there is more involved than just this.
Comment 15 Mike Kaganski 2021-08-26 06:43:42 UTC
(In reply to Justin L from comment #14)
> I don't think ^ specifically means "beginning of paragraph" in every
> context. I would expect it to be the beginning of the search string. So if
> only a selection of text was chosen, it would refer to the beginning of the
> selection (and if the selection contained multiple lines[aka paragraphs], it
> would also match the beginning of these other lines).

No. It must match start of the paragraph to behave according to least surprise principle.

The caret is a special kind of a look-behind assertion, and as such, it needs to consider the data *to the left* of current position. Consider a look-behind condition like

  (?<=\s).

It must only succeed if the previous character was a "space". Then in a text like

  lorem ipsum

when you select only the second word "ipsum", and use "current selection only", you would have the "i" *matching* the regex. If you would not consider the text outside of the selection as the context, you would not be able to match the starting character, or you would need to select the character before this. (Yes, there are ways to create more sophisticated queries, but they would likely be less specific - e.g., using ^ as alternative in (?<=^|\s) would also match starts of paragraphs in selections spanning over paragraph boundaries, which is likely not what user wanted.)

Regex engines also have this idea, expressed in e.g. icu::RegexMatcher::region [1] allowing to search inside a specific region, but considering the data outside of it as the context (configurable though).

[1] https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1RegexMatcher.html#a556e40af5e725121ad23944b24e0fb1b
Comment 16 Commit Notification 2021-08-26 07:23:35 UTC Comment hidden (obsolete)
Comment 17 Mike Kaganski 2021-08-26 08:47:54 UTC
DoSearch in sw/source/core/crsr/findtxt.cxx uses lcl_CleanStr, which does some utter magic with special characters in the text. Namely, in case of RES_TXTATR_FTN (among other things), it replaces it with '\x7f' in the middle of the text, and drops in the end. The input looks like

  "He heard quiet steps behind him. That didn't bode well. Who could be following\x01 him this late at night and in this deadbeat part of town."\x01

and the output looks like

  "He heard quiet steps behind him. That didn't bode well. Who could be following\x7f him this late at night and in this deadbeat part of town."

That latter string arrives to the TextSearch::searchForward's searchStr; and in it, the "$ matches (because the footnote anchor is dropped). Note how you will be unable to find regex 'following ' (without quotes; i.e. the word "following" followed by a space) in this text - because there is a footnote anchor between the word and the space.

Personally I would simply not strip the anchor from the string; but that needs a way to treat those meta-characters specially - having some syntax for them; another option would be to remove them *everywhere*, not only at the end (so that the 'following ' example would work). But I digress.

OTOH, ReplaceBackReferences (again in sw/source/core/crsr/findtxt.cxx) does not pre-process the string obtained from the paragraph using that lcl_CleanStr, and passes the original string to TextSearch::searchForward as is. Hence the difference between the results of find and replace. The replace previously relied on the "find" step limiting the endPos, and so the truncation happened inside TextSearch::searchForward using that position; and now the function sees the actual anchor following the position, and the regex does not match.

Trying to find a way to "synchronize" this - likely the "natural" option is to make ReplaceBackReferences use lcl_CleanStr.
Comment 18 Mike Kaganski 2021-08-26 09:20:58 UTC
(In reply to Mike Kaganski from comment #17)

It's even more complicated. After doing the ReplaceBackReferences, SwView::ExecSearch also does SwView::Replace, which again calls DoSearch (indirectly), but now passes nStart = nEnd = 139. In this case, lcl_CleanStr in it does not replace anything (because the anchors are outside of the start-end range). Trying to outsmart ourselves. Fun.
Comment 19 Mike Kaganski 2021-08-26 10:18:49 UTC
FTR: although complicated, the regex

  "(?=[\p{Control}]?$)

works as expected (of course, this does *not* mean that this is not a bug, because it only works because at "search" time it finds the quote in the end of the search string, and at replace time, it finds this quote followed by the control character at the end of the search string).
Comment 20 Mike Kaganski 2021-08-26 11:45:44 UTC
My proposal now is:

1. Fix this by excluding special processing of footnote/endnote anchor. This will change the behavior: the footnote anchor will be considered a "character" (non-word, so e.g. regex g\b will find "following" in the first body paragraph). Note: "$ will *not* find the first body paragraph's last " anymore, while "[:control:]$ will (but of course, one would need to use captures or look-ahead assertions to not replace the footnote/endnote character itself, also removing the note itself).

2. In the long run, see how to process the rest of the special characters consistently. Namely, they are RES_TXTATR_FLYCNT (fly anchor), RES_TXTATR_FIELD (field), RES_TXTATR_REFMARK (reference), RES_TXTATR_TOXMARK (index entry), RES_TXTATR_META and RES_TXTATR_METAFIELD (no idea when these are used). The problem here is that each of them will create the same problem; and for those that conceptually can't be considered "control characters", we need to drop them from the string at *all* times, not only when they are at start/end of string, and also when they are outside of the nStart/rEnd boundaries.

Please reply what you think specifically about #1 here. Note again, that this would fix it in a very different way compared to what OP wanted.
Comment 21 Justin L 2021-08-27 04:49:53 UTC
(In reply to Mike Kaganski from comment #20)
> Please reply what you think specifically about #1 here. Note again, that
> this would fix it in a very different way compared to what OP wanted.

This is difficult because we have two very real situations here. The most common one would be OPs - where LO implementation specifics are not cared about, and the regex is only concerned about the "text". So I think this is the one that needs the most focus, even though regex is a "pro user" feature.

The second would be someone who is actually searching for footnote markers etc. But that "developer class" level of user will be very rare.

I personally expect that find/replace would ignore all of these internal things - and that is obviously how previous implementers also viewed it. (And they also acknowledged it is basically impossible to avoid mid-paragraph.)
Comment 22 Mike Kaganski 2021-08-27 05:40:49 UTC
(In reply to Justin L from comment #21)
> This is difficult because we have two very real situations here. The most
> common one would be OPs - where LO implementation specifics are not cared
> about, and the regex is only concerned about the "text". So I think this is
> the one that needs the most focus, even though regex is a "pro user" feature.
> 
> The second would be someone who is actually searching for footnote markers
> etc. But that "developer class" level of user will be very rare.

Thank you!

Let me mention tdf#70142 for some context here. People *in general* expect that what they *see* on screen is searchable (which includes fields, numbering, etc.); and in this regard, footnote anchor at lease should *participate* in search (even if its character is not (yet) searchable).

OTOH, making dynamic text inside special objects searchable has some drawback: e.g., replacing would be impossible when find-and-replace searches inside fields (how would it replace inside or across boundaries of a field?).

So conceptually, there might be two modes for find: "normal" search, and "search inside fields" (the latter one disabling replace).

But anyway, recognizing special objects as parts of text (even if just as "control characters" in normal mode) would change workflow, but not disallow things (possibly just require somewhat more complex expressions - consider comment 19). I suppose that separating "normal power users" from "advanced power users" when talking about regexes is too much, and IMO just mentioning in the release notes (and related help page) that now the selected special objects (footnote anchors) match [\p{Control}] would be enough - this adds flexibility.

OTOH, things like floating object anchors or reference or index entry are completely "metadata". We need to modify the code to remove those objects from the search string completely (which should be possible IMO, just needs more work; and doing that would enable tdf#70142, because that way, we would be able to replace the special characters *universally* including in the middle of the string).
Comment 23 Mike Kaganski 2021-08-27 07:42:43 UTC
https://gerrit.libreoffice.org/c/core/+/121134 is the proposed WIP change implementing #1 from comment 20. I do not intend to push it right now, before UX could discuss this - only for testing.
Comment 24 Roman Kuznetsov 2021-08-30 16:12:32 UTC
(In reply to Mike Kaganski from comment #23)
> https://gerrit.libreoffice.org/c/core/+/121134 is the proposed WIP change
> implementing #1 from comment 20. I do not intend to push it right now,
> before UX could discuss this - only for testing.

UX-team, Mike needs your opinion here
Comment 25 Heiko Tietze 2021-09-06 10:58:31 UTC
No objection within a week, I guess that's a silent agreement. Like Justin I also think there are not many users who complain about missing anchors in a regex search.
Comment 26 Mike Kaganski 2021-09-22 08:18:15 UTC
So I believe that I have exercised all due precautions (I nod to bug 142214 comment 47 ;D - and I did all the following even before that controversy). I had set needsUXEval and asked additional review at 2021-08-27; I had described the background and implications in comment 20 and comment 22; I waited for more opinions even after comment 25 for two weeks more. I thank Justin for providing the feedback. Providing a configuration option here is *not* an option for now, at least until there is a proper "remove all control characters everywhere" change, as explained in the last paragraph of comment 22. For now, the current behavior is broken. I am going to push the patch now.
Comment 27 Commit Notification 2021-09-22 10:38:18 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/657ec7ff6863ebc79dcea228898b619d43543a6c

tdf#137737: treat foot/endnote anchor as a "normal" (control) character

It will be available in 7.3.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.