Bug 31480 - Find/replace non-printing characters easily
Summary: Find/replace non-printing characters easily
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version: unspecified (earliest affected)
Hardware: All All
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: accessibility
Duplicates: 31509 137219
Depends on:
Blocks: Find-Search
Reported: 2010-11-08 16:01 UTC by David Nelson
Modified: 2024-04-23 13:15 UTC

See Also:
Crash report or crash signature:


Attachments
Simple example of a tiny bit of text with formatting. (8.45 KB, application/vnd.oasis.opendocument.text)
2021-02-19 16:41 UTC, Jan-Marek Glogowski
Dump of the internal SwNodes structure (3.31 KB, text/xml)
2021-02-19 16:48 UTC, Jan-Marek Glogowski

Description David Nelson 2010-11-08 16:01:27 UTC
Hi, :-)

In Microsoft Office, when you do a find/replace, you have a dropdown list enabling you to easily include many special characters to search for, such as

- carriage return
- new line
- tab mark
- page break
- non-breaking spaces
- and various others.

In LibO, I think you can only do this via regular expressions. But your average user is incapable of using regular expressions.

Could you possibly add a similar dropdown box?

Thanks if so, and thanks very much for your work. :-)
Comment 1 Don't use this account, use tml@iki.fi 2010-11-09 01:47:33 UTC
But note that being able to search for "carriage return", "new line" and "page break" (and possibly also the other ones you mention) depends on those being present in the internal representation of text. I am not at all sure these "typographical concepts" exist in the internal representation of text in OpenOffice.org/LibreOffice. I think I have been told that OOo/LO uses a much more "structured" approach with separate objects for paragraphs etc., and maybe then even stores forced line breaks just as data structures, not as actual embedded carriage return and/or new line characters.

So implementing this might be much more complex than it perhaps is in MS Office. That doesn't mean it wouldn't be useful, of course. Even if we keep the traditional OOo way to store text in LibreOffice, we could present the user with the illusion that the formatting characters you mention are actually present. That might be useful for people migrating from MS Office.

On the other hand, for the (few...) people who actually prefer to think of documents in a structured fashion and not as a stream of characters including formatting characters, being able to search for, say, carriage returns would surely seem unnatural. In an ideal world, that is how one should conceptualize documents, no?

Of course, I might be totally misunderstanding stuff above, and in that case, feel free to correct me, and/or ignore my rambling.
Comment 2 David Nelson 2010-11-09 03:09:26 UTC
Hi Tor, 

Thank you for your comments. I think you've indeed understood what I was on about, but:

I understand what you're talking about as regards LibO/OOo's internal storage.

However, that is invisible to the end user.

I, the dumb end user, pressed the carriage return key while typing. I don't care how the software stores it. But I want to be able to search for that carriage return after.

Same thing when I press Shift-Enter (a "new line" or "soft return"). I want to be able to find those "new line" "characters" after.

Same thing for tab "characters". Etc.

Since I made those keystrokes and they have a result on-screen, they are obviously being stored in some form or other. Otherwise, next time I open it, my doc would look different from the way it looked when I typed it, no? ;-)

I sometimes need to search for the "new line" characters and replace them with a "carriage return" and thus create new paragraphs, etc. Or I need to search for 8 space characters and replace them with a "tab mark" instead.

In MS Office, I have a dropdown list of such "special characters" and it makes life very simple to use them in find/replaces.

Could we get that in LibO, too, please?

Thanks if so. ;-)

Please let me know if I haven't explained clearly. :-)
Comment 3 Kohei Yoshida 2010-11-09 11:30:30 UTC
*** Bug 31509 has been marked as a duplicate of this bug. ***
Comment 4 David Nelson 2010-11-09 11:36:34 UTC
Please note that the term I meant was NON-PRINTING CHARACTERS, not "special characters"...
Comment 5 Gudmund 2011-04-16 09:49:45 UTC
(In reply to comment #3)
> *** Bug 31509 has been marked as a duplicate of this bug. ***

(In reply to comment #2)
> Hi Tor, 
> 
> Thank you for your comments. I think you've indeed understood what I was on
> about, but:
> 
> I understand what you're talking about as regards LibO/OOo's internal storage.
> 
> However, that is invisible to the end user.

Indeed, unlike plain text, where you actually can search for and replace newlines (LF), carriage returns (CR) or combinations (CRLF) if you use the right text handling tools.

Some Unicode pointers:
 LF:    Line Feed, U+000A
 CR:    Carriage Return, U+000D
 CR+LF: CR (U+000D) followed by LF (U+000A)
 NEL:   Next Line, U+0085
 LS:    Line Separator, U+2028
 PS:    Paragraph Separator, U+2029

(I wonder how LibreOffice handles plain text files internally, since those characters really *are* there then...)
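
For plain text, where these characters really are present, the code points listed above can be searched for directly. The following is only a minimal Python sketch of that idea; the function name `count_terminators` is made up for illustration and has nothing to do with LibreOffice's own code.

```python
import re
from collections import Counter

# Names and code points as listed in the comment above.
TERMINATORS = {
    "\r\n": "CR+LF",
    "\n": "LF",
    "\r": "CR",
    "\u0085": "NEL",
    "\u2028": "LS",
    "\u2029": "PS",
}

# CR+LF is tried first so the pair is counted once, not as a CR plus an LF.
_TERMINATOR_RE = re.compile("\r\n|[\n\r\u0085\u2028\u2029]")


def count_terminators(text: str) -> Counter:
    """Count each kind of line terminator found in a plain-text string."""
    return Counter(TERMINATORS[m.group()] for m in _TERMINATOR_RE.finditer(text))
```

For example, `count_terminators("a\r\nb\nc")` reports one CR+LF and one LF.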

> I, the dumb end user, pressed the carriage return key while typing. I don't
> care how the software stores it. But I want to be able to search for that
> carriage return after.
> 
> Same thing when I press Shift-Enter (a "new line" or "soft return"). I want to
> be able to find those "new line" "characters" after.

I can't see why LibreOffice couldn't handle these things by allowing the user an easy way to *both* search *and* replace arbitrary combinations of CR and LF, by handling these things inside the content.xml.

> Since I made those keystrokes and they have a result on-screen, they are
> obviously being stored in some form or other. Otherwise, next time I open it,
> my doc would look different from the way it looked when I typed it, no? ;-)


This is what it can look like:
"<text:p text:style-name="Standard">Two paragraphs starting with this line ending</text:p>
<text:p text:style-name="Standard"/>
<text:p text:style-name="Standard">Two newlines starting with this line ending<text:line-break/>
<text:line-break/>"

It looks like there may be a few cases to handle. Paragraphs seem to have an opening tag (<text:p text:style-name="Standard"/>), and a closing tag (</text:p>) only if there was text in the line, while newlines only have closing tags (<text:line-break/>).

Writing a textutils script that can handle this simple example is a bit of work, but surely not too hard, even for a non-programmer like me, so a pro like the LibreOffice developers shouldn't find it hard at all ;). 
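
As a rough illustration of such a script, here is a Python sketch that reads a small fragment shaped like the one above and turns each paragraph into a string, with every `<text:line-break/>` rendered as "\n". Note that in XML terms `<text:p .../>` and `<text:line-break/>` are empty elements rather than opening or closing tags. The `SAMPLE` fragment, its `<root>` wrapper and the function names are assumptions made up for this example; a real content.xml has far more structure (spans, lists, tables) that this sketch does not handle specially.

```python
import xml.etree.ElementTree as ET

# The ODF "text" namespace, needed to resolve the text: prefix.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Hypothetical fragment modelled on the excerpt above, wrapped in a root
# element so that it is well-formed on its own.
SAMPLE = f"""<root xmlns:text="{TEXT_NS}">
<text:p text:style-name="Standard">First paragraph</text:p>
<text:p text:style-name="Standard"/>
<text:p text:style-name="Standard">Line one<text:line-break/>Line two</text:p>
</root>"""


def paragraph_text(elem):
    """Flatten one element to text, rendering line breaks as '\\n'."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == f"{{{TEXT_NS}}}line-break":
            parts.append("\n")
        else:
            # Any other child element: keep its text content.
            parts.append(paragraph_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


def paragraphs(xml):
    """Return one flattened string per <text:p> element."""
    root = ET.fromstring(xml)
    return [paragraph_text(p) for p in root.iter(f"{{{TEXT_NS}}}p")]
```

With the sample above, `paragraphs(SAMPLE)` gives three entries: the first paragraph, an empty string for the empty paragraph, and "Line one\nLine two".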

The only potential problem I can see in this simple example, is the "Standard" style-name bit inside the opening tag. Is there a policy for this in LibreOffice, like using the closest preceding one, or polling a standard template?

My guess at why LibreOffice handles it this way, is that it helps make it handle paragraphs, newlines etc. in a uniform way across platforms that have different ways of handling new lines.

> I sometimes need to search for the "new line" characters and replace them with
> a "carriage return" and thus create new paragraphs, etc. Or I need to search
> for 8 space characters and replace them with a "tab mark" instead.

You're not alone in this. It's a showstopper for me too and many others, reducing LibreOffice to a very limited number of tasks, forcing me to keep MS Office, which I want to get rid of.
Comment 7 sasha.libreoffice 2012-03-23 07:28:49 UTC
Not implemented yet in 3.5.1.
> - carriage return
> - new line
> - tab mark
IMHO it would be easier to add information to the context Help and tooltips on how to search for these characters using regular expressions than to actually implement them.
The same goes for replacing them.

The problem is with these:
> - page break
> - non-breaking spaces
I do not know how to find them using regular expressions.
Comment 8 Gryllida 2012-04-26 17:43:22 UTC
Implementing a graphical user interface (drop-down list) for at least the existing regular expressions, such as \t, \n, $, ^, would be useful to novice users.

There is an add-on ("alternative find and replace" [1]) which does the job, probably by working around the way LibreOffice stores text: it can actually handle \n in a way different from what the regular expressions page [2] describes. It could probably be helpful in implementing this request.

[1] http://extensions.openoffice.org/en/project/AltSearch
[2] http://help.libreoffice.org/Common/List_of_Regular_Expressions
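
To illustrate the drop-down idea, here is a small Python sketch in which a user picks named tokens instead of typing regular-expression syntax. It uses Python's `re` module as a stand-in; the `Special` enum and `build_pattern` are invented for this example, and the meaning of tokens such as \n in LibreOffice's own search differs from Python's, so this only shows the UI-to-pattern mapping, not LibreOffice behaviour.

```python
import re
from enum import Enum


class Special(Enum):
    """Entries a hypothetical drop-down list could offer."""
    TAB = r"\t"
    LINE_BREAK = r"\n"
    NBSP = r"\u00A0"


def build_pattern(*parts):
    """Join literal text and drop-down tokens into one regular expression.

    Literal text is escaped, so the user never has to know regex syntax.
    """
    return "".join(
        p.value if isinstance(p, Special) else re.escape(p) for p in parts
    )
```

For instance, `re.sub(build_pattern(" " * 8), "\t", text)` replaces runs of eight spaces with a tab, and `build_pattern("a", Special.TAB, "b")` matches an "a" and a "b" separated by a tab.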
Comment 10 sasha.libreoffice 2014-10-24 11:08:51 UTC
Not implemented yet in 4.3.1.2.
Comment 11 Adolfo Jayme Barrientos 2014-12-25 11:17:11 UTC
*** Bug 87645 has been marked as a duplicate of this bug. ***
Comment 13 Daveo 2020-10-18 08:35:33 UTC
*** Bug 137219 has been marked as a duplicate of this bug. ***
Comment 14 Eyal Rozenberg 2020-10-18 09:52:44 UTC
Author of dupe bug 137219 here...

I believe this bug mixes up several issues - because the MS Word feature it refers to mixes up these issues:

1. The ability to search for multi-line / multi-paragraph patterns
2. The ability to search for non-printing characters which relate to LO document structure, e.g. LF, CR, PS (see comment #5) and maybe others.
3. The UI for exposing these abilities, and whether it needs to be similar to MS-Word's in referring to "low-level" concepts like LF and CR and their effect, or only to "high-level" concepts like lines and paragraphs.
4. The ability to search for characters which don't relate to document structure, but which can't easily be entered using the keyboard, like non-breaking space and tab, or non-printing characters such as ZWJ, RLM, LRM and so on.
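
The characters in item 4 are ordinary Unicode code points, so on plain text they can be located by their general category. A minimal Python sketch follows; the function name `find_invisible` and the choice of categories (Cf, Cc, and Zs other than the ordinary space) are assumptions for illustration only.

```python
import unicodedata


def find_invisible(text):
    """Return (index, name) for characters that are hard to see or type.

    Covers format characters (Cf: ZWJ, LRM, RLM, ...), control characters
    (Cc: tab, ...) and space separators other than the ordinary space
    (Zs: NBSP, ...).
    """
    hits = []
    for i, ch in enumerate(text):
        cat = unicodedata.category(ch)
        if cat in ("Cf", "Cc") or (cat == "Zs" and ch != " "):
            # Control characters have no Unicode name; fall back to U+XXXX.
            hits.append((i, unicodedata.name(ch, f"U+{ord(ch):04X}")))
    return hits
```

For example, `find_invisible("a\u200db\u00a0c")` reports a ZERO WIDTH JOINER at index 1 and a NO-BREAK SPACE at index 3.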

I therefore suggest that these issues be split up into separate bugs with appropriate dependencies/relations between them.

Why?

* NBSP and tab searching are really just a UI shortcoming: they can be searched for even now if you paste these characters from elsewhere, and there's no reason to hold that up for the other, deeper, more complex issues.
* Part of what can be / needs to be done is allowing for regular expressions like "foo\n.*bar" to work in some reasonable way. That can be done without deciding whether LFs/CRs are exposed to the user or not.


I would like to ask for the CC list members' opinions on this suggestion.
Comment 23 Jan-Marek Glogowski 2021-02-19 16:41:36 UTC
Created attachment 169902 [details]
Simple example of a tiny bit of text with formatting.
Comment 24 Jan-Marek Glogowski 2021-02-19 16:48:37 UTC
Created attachment 169903 [details]
Dump of the internal SwNodes structure

Just to give you an idea of how Writer internally sees this simple document. There is simply no way to search for newlines with a regexp. NBSP, OTOH, has a direct representation in UTF-8 (C2 A0) / Unicode; that's why you can search for it.

Obviously it's not an unsolvable problem, but nobody has yet found it important enough to implement, or even got budget for it.
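
The NBSP point can be checked with a few lines of Python: U+00A0 is a single character in the text itself, and its UTF-8 encoding is the two bytes C2 A0, so a plain character search finds it. The helper names below are made up for this illustration.

```python
NBSP = "\u00A0"  # NO-BREAK SPACE


def utf8_hex(ch):
    """Return the UTF-8 bytes of a string as space-separated hex."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


def nbsp_positions(text):
    """Return the index of every NBSP in a plain string."""
    return [i for i, ch in enumerate(text) if ch == NBSP]
```

Here `utf8_hex(NBSP)` yields "C2 A0", and `nbsp_positions("10\u00a0km")` yields `[2]`.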
Comment 25 Eyal Rozenberg 2021-02-19 17:55:40 UTC
(In reply to Jan-Marek Glogowski from comment #24)
> There is simply no way to search for newlines with an regexp.

I believe you're misunderstanding... the requested feature is not to apply a regexp to the serialization of LO's representation of the document, but rather to implement regexp search in such a way that a more complex search (e.g. using XPath or what-not) performs a close equivalent of what applying the regexp to a purely textual document would.

Specifically, in your example document, the regexp /Two\nThree/ would find a match, starting before the T of Two and ending after the Three, on the next line.

It is up to LO to make this "magic" happen.
Comment 26 Jan-Marek Glogowski 2021-02-19 18:14:34 UTC
(In reply to Eyal Rozenberg from comment #25)
> (In reply to Jan-Marek Glogowski from comment #24)
> > There is simply no way to search for newlines with an regexp.
> 
> Specifically, in your example document, the regexp /Two\nThree/ would find a
> match, starting before the T of Two and ending after the Three, on the next
> line.

Just for this match to happen, you would need to convert the internal representation of newline to "\n", so a regexp can match, and somehow convert the result back. And IMHO that would be really non-trivial. You don't want to write a regexp abstraction over LO's internal representation. That would probably be even harder.
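
The "convert, match, convert back" approach described here can be sketched in Python on a toy model, assuming the document is nothing more than a list of paragraph strings. That assumption is a large simplification of Writer's node structure, and the function name is invented for this example; the sketch only shows the offset bookkeeping, not the hard part of dealing with real document nodes.

```python
import bisect
import re


def search_across_paragraphs(paragraphs, pattern):
    """Search a flattened copy of the text and map matches back.

    Returns a list of ((start_par, start_off), (end_par, end_off)) tuples,
    where each position is a paragraph index plus an offset inside it.
    """
    if not paragraphs:
        return []

    # Offset of each paragraph's first character in the flattened text;
    # the +1 accounts for the "\n" inserted between paragraphs.
    starts = []
    pos = 0
    for p in paragraphs:
        starts.append(pos)
        pos += len(p) + 1
    flat = "\n".join(paragraphs)

    def locate(offset):
        # Last paragraph whose start is at or before the offset.
        idx = bisect.bisect_right(starts, offset) - 1
        return idx, offset - starts[idx]

    return [
        (locate(m.start()), locate(m.end()))
        for m in re.finditer(pattern, flat)
    ]
```

With the hypothetical paragraphs `["One", "Two", "Three", "Four"]`, the pattern `r"Two\nThree"` produces one match running from paragraph 1, offset 0 to paragraph 2, offset 5.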
Comment 27 Eyal Rozenberg 2021-02-19 18:39:38 UTC
(In reply to Jan-Marek Glogowski from comment #26)
> Just for this match to happen, you would need to convert the internal
> representation of newline to "\n", so a regexp can match

You would likely not want to run a regexp search. Perhaps something like an XQuery or XPath lookup, i.e. you would probably transform the regex, not the document.
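
One very naive reading of "transform the regex, not the document" is to split the pattern at each \n and match the pieces against consecutive paragraphs. The Python sketch below does this instead of the XQuery/XPath lookup suggested above, and it again assumes a toy document made of paragraph strings. The function name is hypothetical, and the splitting is deliberately simplistic: it mishandles an escaped backslash followed by "n", or a \n inside a character class or group.

```python
import re


def search_split_pattern(paragraphs, pattern):
    """Match a pattern containing \\n against consecutive paragraphs.

    Returns (start_par, start_off, end_par, end_off) tuples. The first
    piece must end at the end of its paragraph, middle pieces must match
    whole paragraphs, and the last piece must match at a paragraph start.
    """
    parts = pattern.split(r"\n")  # naive: splits on backslash + "n"
    results = []
    if len(parts) == 1:
        for i, p in enumerate(paragraphs):
            for m in re.finditer(pattern, p):
                results.append((i, m.start(), i, m.end()))
        return results

    n = len(parts)
    for i in range(len(paragraphs) - n + 1):
        first = re.search(f"(?:{parts[0]})\\Z", paragraphs[i])
        if first is None:
            continue
        middle_ok = all(
            re.fullmatch(parts[k], paragraphs[i + k]) for k in range(1, n - 1)
        )
        last = re.match(parts[-1], paragraphs[i + n - 1])
        if middle_ok and last is not None:
            results.append((i, first.start(), i + n - 1, last.end()))
    return results
```

The document itself is never flattened here; only the pattern is taken apart, which is why results come back directly as paragraph positions.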


> And IMHO that would be really non-trivial.

Yes, of course it would be non-trivial - it would require quite a bit of programming work. And that is another reason why I think this bug should be split up.