Bug 62603 - Find/Replace affects formatting in undesired ways
Summary: Find/Replace affects formatting in undesired ways
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version (earliest affected): Inherited From OOo
Hardware: All All
Importance: high critical
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: Find-Search
Reported: 2013-03-21 18:19 UTC by Christian Gagné
Modified: 2019-05-23 12:51 UTC
CC List: 7 users

See Also:
Crash report or crash signature:


Attachments
findReplace_formatTest.doc: simplify testing with a copy/paste/highlight doc. (9.00 KB, application/msword)
2016-09-09 06:18 UTC, Justin L
Example file to reproduce the bug (13.70 KB, application/vnd.oasis.opendocument.text)
2017-10-07 05:10 UTC, Luke Kendall

Description Christian Gagné 2013-03-21 18:19:37 UTC
This bug corresponds to the Apache OpenOffice bug 121482, which was previously marked as RESOLVED FIXED but has now been reopened.

Since LibreOffice 4.0 now uses the same ICU-based regexp engine as AOO 3.4, it also suffers from the same formatting-related problems. Regexp-based search and replace operations now affect a text portion’s formatting, even though no style-related operation was specified.

For example, searching for the regexp `"([:alnum:])` and replacing it with `“$1`, in order to turn straight opening quotes into curly ones, also affects the formatting (specifically, italics in this case are removed from part of the text portion).

This seems to suggest that search and replace operations using regular expressions now not only operate on the underlying text content, but also interfere with the text’s *representation*, which is of great concern since one of the most important principles of both AOO and LibO is that they are supposed to cleanly separate the “model” or content structure from the “frame” or visible representation of the content.

This bug raises an underlying question: does the ICU regexp engine really allow clean separation between content and presentation? Is the problem solely related to AOO and LibO’s implementation of ICU or is there an inherent problem in ICU?

It appears necessary for the LibO project to fix this bug by themselves and independently of AOO, since it is assumed that LibO will not re-base their code on AOO’s in the future. The fact that AOO once thought that the bug was fixed but then changed their mind and realized that they were not sure is troubling.

Eventually, fixing this bug might require cleaning up and improving the API’s *search descriptors*, especially with regard to the way text portions are treated by search descriptors. The very old and as yet unfixed enhancement bug (OOo/AOO bug 2997) asking for the addition of character style searches to the search and replace dialog comes to mind. It is troubling that this particular issue was never fixed in more than ten years. The search descriptors’ use of the `awt` module for locating character formatting in paragraphs might be a hint to understanding this issue.
Comment 1 Christian Gagné 2013-03-22 15:06:52 UTC
Note: In the previous comment, the reference to the AOO issue number 2997 inadvertently links to an unrelated issue that pertains to another software module; please disregard the link.
Comment 2 manj_k 2013-03-22 15:25:37 UTC
Added AOO bug URLs to "See Also".
Comment 3 Thomas Hackert 2013-06-21 15:32:55 UTC
Hello Christian, *,
thank you for reporting this bug :) I am only one of the QA guys and do not understand enough of your description, but I can reproduce the bug as follows:

1. Open a new Writer document
2. Enter <quote>"test "test</quote>, where the second "test" is set to italic
3. <Ctrl>+<H>
4. Enter <quote>"([:alnum:])</quote> in "Search for"
5. Enter <quote>“$1</quote> in "Replace with"
6. Click on "Other Options" and mark "Regular expressions"
7. Now click on "Replace All"

Result: Quotation marks are replaced, but the "t" from the second test is not italic any more ...
Expected result: LO should replace the quotation marks, but not touch the formatting
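
For illustration, here is a minimal Python sketch of what the steps above should do to the plain text, leaving formatting aside. It is a stand-in under stated assumptions, not LibreOffice's ICU engine: Python's `re` module has no `[:alnum:]` class, so `[^\W_]` is used in its place, and `\1` plays the role of `$1`. The function name is made up for this example.

```python
import re

# Assumption: "alphanumeric" means any Unicode letter or digit, which is what
# [^\W_] matches in Python (a word character that is not an underscore).
SEARCH = r'"([^\W_])'
REPLACE = "\u201c\\1"  # curly opening quote, then the captured character

def curl_opening_quotes(text: str) -> str:
    """Replace a straight quote before a letter or digit with a curly one."""
    return re.sub(SEARCH, REPLACE, text)

before = '"test "test'
after = curl_opening_quotes(before)

# Positions at which the text actually changed.
changed = [i for i, (a, b) in enumerate(zip(before, after)) if a != b]
print(after, changed)
```

Only the two quotation marks differ between `before` and `after`; every other character is unchanged, so nothing else should need a formatting change.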

I should add that I had to copy and paste the straight quotation mark into Writer, because Writer replaced the ones I typed with curly quotation marks ... :(

LO: Version: 4.1.0.1 Build ID: 1b3956717a60d6ac35b133d7b0a0f5eb55e9155 with German langpack and helppack
OS: Debian Testing AMD64
HTH
Thomas.
Comment 4 Thomas van der Meulen 2013-06-21 15:46:39 UTC
Thank you for your bug report, I can reproduce this bug running LibreOffice Version: 4.1.0.1
Build ID: 1b3956717a60d6ac35b133d7b0a0f5eb55e9155 on Mac OS X 10.8.4.

I just followed the steps that Thomas gave and the "t" wasn't italic anymore.
Comment 5 QA Administrators 2015-03-17 00:10:20 UTC Comment hidden (obsolete)
Comment 6 Buovjaga 2015-03-31 07:02:55 UTC
(In reply to thackert from comment #3)
Reproduced.

Win 7 Pro 64-bit Version: 4.5.0.0.alpha0+
Build ID: 8c3cf9dd48e40604867d3a28bddaccd65142df17
TinderBox: Win-x86@62-TDF, Branch:MASTER, Time: 2015-03-27_15:15:18
Locale: fi_FI
Comment 7 Luke Kendall 2015-12-11 02:43:42 UTC
Yes, I found that somehow, all my non-breaking spaces in my <nbsp><sp> pairs of spaces between sentences had been lost: changed into just plain <sp><sp>.  I did a regexp search and replace to fix them all.  Some days later, I noticed that wherever the replacement was made and the end of the sentence was in italics, it copied the italic style across to the first character of the next sentence.  

Likewise, where the end of the 1st sentence was Regular, and the 2nd sentence was italic, the 1st character of the italic sentence was changed to Regular.

Caused me a lot of work to manually find and fix them.  This is a truly horrible bug.  Because you can't search for text and include attribute style changes, you can't easily find them, and you certainly can't fix them with a big search and replace.  Given that my novel is over 130,000 words long, this bug is really punishing me.

That's twice, now, this bug has affected my novel, and I've had to manually fix it.  At least this time I found this bug report, and found what caused it, so I can at least use a search for likely sentence-starts with italic format, to narrow the search, even if each fix must be done manually.
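
The narrowing search mentioned above might look roughly like the following sketch. It works on a hypothetical list of `(character, is_italic)` pairs, not on a real Writer document, and the function name and the sentence-start heuristic are assumptions: it flags a sentence's first letter when its italic state differs from that of the letter after it.

```python
import re
from typing import List, Tuple

Run = Tuple[str, bool]  # (character, is_italic); toy model, not Writer's real structure

def suspicious_sentence_starts(doc: List[Run]) -> List[int]:
    """Return indexes of sentence-initial letters whose italic state
    differs from the letter that follows them."""
    text = "".join(ch for ch, _ in doc)
    hits = []
    # Sentence end, then whitespace (\s also matches a non-breaking space),
    # then a word character that is followed by another word character.
    for m in re.finditer(r"[.!?]\s+(\w)(?=\w)", text):
        i = m.start(1)
        if doc[i][1] != doc[i + 1][1]:
            hits.append(i)
    return hits

# "End." is regular, then two spaces, then "One" where only "ne" is still italic.
sample = [(c, False) for c in "End.  O"] + [(c, True) for c in "ne"]
hits = suspicious_sentence_starts(sample)

# A sentence that is italic throughout should not be flagged.
clean = [(c, False) for c in "End.  "] + [(c, True) for c in "One"]
clean_hits = suspicious_sentence_starts(clean)
```

Such a scan only shortens the list of places to inspect; each hit would still have to be checked and fixed by hand.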

An odd thing I noticed while fixing this, too: when the Find&Replace dialog has focus, all the comments in the document are not displayed.  But when I click back into the document and select the text to correct the italicisation of, the comments are still not displayed: they instantly reappear however when I either clear the formatting or set the text to italic: not before.  That seems odd, so I thought I'd mention it.
Comment 8 Luke Kendall 2015-12-11 02:44:27 UTC
I should add, I'm using LO 5.0.3, on Linux (Ubuntu 14.04).
Comment 9 Luke Kendall 2015-12-14 06:14:48 UTC
I would like to suggest that this bug's severity is probably critical, as the correct encoding of text attributes is lost.  This is data loss.  F&R damages the document content, and the more replacements that are made, the bigger the damage caused.

In some ways the damage is made worse by the fact that the damage it causes is a little subtle, so you may not even notice the errors introduced until (like me), you have made so many other changes that you can't simply revert to an earlier copy of your document (assuming you have one).

It also means that Find&Replace with regular expressions cannot be used - or at least, cannot be used to do the actual replacement.  So it is also a major loss of function: but the data loss is worse.

I do not have permission to increase the bug severity: I hope that I have argued the case for such a change, however.
Comment 10 Buovjaga 2015-12-14 06:39:53 UTC
Ok, adjusting per https://wiki.documentfoundation.org/images/0/06/Prioritizing_Bugs_Flowchart.jpg

Luke: would you like to join the QA team? https://wiki.documentfoundation.org/QA
Then we could discuss prioritizing and give you the rights to do it.
Comment 11 Luke Kendall 2016-01-12 09:03:55 UTC
Sorry, I've been absolutely flat-out getting my novel published, and Christmas and New Year (and I still am, as I'm now finalising the print edition(s) and starting to plan the book launch), but I'd be happy enough to do so.

I had a look at the link provided, and had a look at the flowchart, which is clear and makes good sense to me.  So, yes, I'm happy to get involved as you suggest.

My apologies for the long delay in replying.  I have now added myself to the Cc list, so I'll notice further emails much more quickly (rather than noticing it because I happened to come back to see if anything had changed).

Cheers,
luke
Comment 12 Buovjaga 2016-01-12 09:33:18 UTC
Ok, Luke, I added you to contributors. Drop by on IRC whenever you feel like discussing things: https://wiki.documentfoundation.org/QA/IRC
Comment 13 Buovjaga 2016-01-12 10:55:04 UTC
Setting priority to high.
Comment 14 Luke Kendall 2016-04-01 07:31:39 UTC
Could this be assigned to someone?  There are some global edits I need to do, but do not dare to, because of this bug: the work it would cause me to fix the text afterwards would be about the same as doing the global edits manually.

I'm thinking I may need to switch to some other .docx or .odt editor (like Office) just to do the global edit.  But I'm not optimistic, because round-tripping tends to break some header/footer/paragraph styles in strange ways.

And because of https://bugs.documentfoundation.org/show_bug.cgi?id=99015 I can't edit the XML directly, either, as an ugly/desperation workaround.
Comment 15 Justin L 2016-09-09 06:18:55 UTC
Created attachment 127228
findReplace_formatTest.doc: simplify testing with a copy/paste/highlight doc.

Confirmed: this is still a bug in LO 5.3dev, and it has existed since at least bibisect-43all LO3.6 (the oldest I can test in Ubuntu 16.04).

Regex is irrelevant - it happens without regex as well.  Basically, Find/Replace applies the format of the first character across the entire replaced string.
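
The behaviour described above can be mimicked with a toy model. The sketch below is not LibreOffice code: the `(character, is_italic)` pairs, `make_doc` and `naive_replace` are hypothetical names invented for illustration, and Python's `re` stands in for the real search engine. It only shows how giving the whole replacement the format of the first matched character produces the symptom from comment 3.

```python
import re
from typing import List, Tuple

Run = Tuple[str, bool]  # (character, is_italic); toy stand-in for text portions

def make_doc(*parts: Tuple[str, bool]) -> List[Run]:
    """Expand (text, is_italic) parts into one entry per character."""
    return [(ch, italic) for text, italic in parts for ch in text]

def naive_replace(doc: List[Run], pattern: str, repl: str) -> List[Run]:
    """Replace matches; the whole replacement inherits the first matched
    character's format (the suspected faulty behaviour)."""
    text = "".join(ch for ch, _ in doc)
    out: List[Run] = []
    pos = 0
    for m in re.finditer(pattern, text):
        if m.start() == m.end():
            continue  # ignore empty matches in this sketch
        out.extend(doc[pos:m.start()])
        first_italic = doc[m.start()][1]
        out.extend((ch, first_italic) for ch in m.expand(repl))
        pos = m.end()
    out.extend(doc[pos:])
    return out

# '"test "' is regular (including the second quote); the second "test" is italic.
doc = make_doc(('"test "', False), ("test", True))
result = naive_replace(doc, r'"([^\W_])', "\u201c\\1")
new_text = "".join(ch for ch, _ in result)
```

In `result`, the "t" of the second "test" has picked up the regular format of the quotation mark in front of it, while "est" stays italic.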
Comment 16 Luke Kendall 2016-09-10 03:17:01 UTC
Thanks, Justin - it's especially valuable to know it happens for non-regex uses too!  That means I'd just been lucky so far in those few cases I used it for a simple text replacement - because such strings tend to have far fewer matches, and usually of whole words with a space or punctuation character alongside I suppose.

I still dare not use Replace All, as it does too much damage to my novel(s). I will be *so* glad when this bug is finally fixed!
Comment 17 Xisco Faulí 2016-09-14 14:49:52 UTC
Only regressions should use the keyword 'preBibisect'. Removing it...
Comment 18 Luke Kendall 2017-05-13 07:30:18 UTC
(In reply to Justin L from comment #15)
Could I just add that the effect goes one character beyond the end of the replaced string? That is, I think the font attribute of the last character of the match is applied to the character just past the replaced string.

And since searching for text attributes (e.g., italic) makes Find & Replace unreliable (https://bugs.documentfoundation.org/show_bug.cgi?id=103400), there's also no way to find the damaged characters with a search; you have to scan every character of your document by eye to find them.
Comment 19 Luke Kendall 2017-10-07 05:10:20 UTC
Created attachment 136821
Example file to reproduce the bug

I am just reporting that this bug is still present in 5.4.1.2.

Attached is a very short file with a simple example of the broken behaviour.  I hope this helps.

Here are the reproduction steps:

1. Open the provided file (RegExp-format-ruin.odt)
2. Open Find and Replace.
3. Copy and paste the Find string provided in the doc into the Find field
4. Copy and paste the Replace string provided in the doc into the Replace field
5. Click Find.
6. Click Replace

Result: note that the Italic style capital "O" changes to Regular.

From the user perspective, only the space was changed (of course, the whole matched text was replaced).

All the same, the format or attributes of matched sub-patterns ($1, $2 etc.) should be preserved when they are replaced.
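
One way to picture the expected behaviour is sketched below, using a toy model rather than anything from the LibreOffice code base. The `(character, is_italic)` pairs, the function name `preserving_replace`, the sample pattern and the choice to give literal replacement text the format of the first matched character are all assumptions made for illustration; only `$0` to `$9` references are handled.

```python
import re
from typing import List, Tuple

Run = Tuple[str, bool]  # (character, is_italic); toy stand-in for text portions

GROUP_REF = re.compile(r"\$(\d)")  # $0 .. $9 in the replacement template

def preserving_replace(doc: List[Run], pattern: str, template: str) -> List[Run]:
    """Replace matches, but copy each $n group with its original formatting."""
    text = "".join(ch for ch, _ in doc)
    out: List[Run] = []
    pos = 0
    for m in re.finditer(pattern, text):
        if m.start() == m.end():
            continue  # ignore empty matches in this sketch
        out.extend(doc[pos:m.start()])
        literal_italic = doc[m.start()][1]  # assumed format for literal text
        last = 0
        for ref in GROUP_REF.finditer(template):
            out.extend((ch, literal_italic) for ch in template[last:ref.start()])
            start, end = m.span(int(ref.group(1)))
            if start != -1:  # the group took part in the match
                out.extend(doc[start:end])  # original characters and formats
            last = ref.end()
        out.extend((ch, literal_italic) for ch in template[last:])
        pos = m.end()
    out.extend(doc[pos:])
    return out

# "End." and two spaces are regular; "One" is italic.
doc = [(c, False) for c in "End.  "] + [(c, True) for c in "One"]
# Hypothetical pattern: turn the first of two spaces into a non-breaking space.
result = preserving_replace(doc, r"(\.)  (\w)", "$1\u00a0 $2")
new_text = "".join(ch for ch, _ in result)
```

Here the capital "O" is part of the match, yet it comes out of the replacement still italic, because it is copied together with its own format instead of being retyped.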

PS:
===

I feel this is related to similar problems when manually replacing selected text: the new text typed in is given the format (italic/regular/bold) of the text immediately to its left, instead of the format of the text that is being replaced.

Maybe you would prefer a second bug report for that; I have filed one as Bug 112961.
Comment 20 QA Administrators 2018-10-08 02:48:12 UTC Comment hidden (obsolete)
Comment 21 Luke Kendall 2018-10-08 05:08:28 UTC
I can confirm the bug is still there, exactly as described in the example file supplied to reproduce the bug, RegExp-format-ruin.odt

Tested with:

Version: 6.1.2.1
Build ID: 65905a128db06ba48db947242809d14d3f9a93fe
CPU threads: 4; OS: Linux 4.4; UI render: default; VCL: gtk2; 
Locale: en-GB (en_AU.UTF-8); Calc: group threaded
Comment 22 Luke Kendall 2019-05-23 12:51:10 UTC
Just a note that I'm hopeful that Phil Krylov may have fixed this and some larger issues, as noted in the (related, I think) https://bugs.documentfoundation.org/show_bug.cgi?id=79717