165931 – Regular expressions must be able to match non-break line endings

Summary: Regular expressions must be able to match non-break line endings

Status:	ASSIGNED

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Writer
Version: (earliest affected)	Inherited From OOo
Hardware:	All All

Importance:	low enhancement
Assignee:	László Németh

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	Find&Replace-Regex

Reported:	2025-03-27 13:02 UTC by Eyal Rozenberg
Modified:	2025-05-26 11:06 UTC
CC List:	6 users

See Also:
Crash report or crash signature:

Attachments
selecting last words of the text lines in Find & Replace (after adding U+200d to the end of the lines) (179.92 KB, image/png) 2025-03-28 16:39 UTC, László Németh

Description Eyal Rozenberg 2025-03-27 13:02:26 UTC

In a plain text editor, regular expressions are able to match line start and line end; and since lines only end with a newline character, i.e. hard-broken, that is fine (ignoring multi-line match capability of course).

Now, in LibreOffice, lines may be broken because of the paragraph dimensions - the text area width. This is not an ephemeral word-wrap like in a text editor, which is not really a feature of the content, just of the editor window.

It is thus legitimate for the user to want to match Lines ending ...:

1. regardless-of-the-reason, i.e. either wrapping, line-break, paragraph break or end-of-content (in a table cell, in a section, in the document etc.)
2. due to word wrap only.
3. due to line-break.
4. due to end-of-content.
5. due to paragraph break (including section break, page break etc.)

What LO currently offers is matching (3.) with \n and matching (4.) with $  - and that's not nearly good enough.
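To make the five cases concrete, here is a minimal Python sketch that labels every line end in a simulated document. It is an illustration only, not LibreOffice code: `textwrap` stands in for Writer's layout, a `\n` inside a string models a manual line break, and the list of strings models the paragraphs of one piece of content. The function name `classify_line_ends` is made up for this example.

```python
import textwrap

def classify_line_ends(paragraphs, width):
    """Return (line_text, reason) pairs for every visual line."""
    result = []
    for p_index, paragraph in enumerate(paragraphs):
        last_paragraph = p_index == len(paragraphs) - 1
        segments = paragraph.split("\n")  # "\n" models a manual line break
        for s_index, segment in enumerate(segments):
            last_segment = s_index == len(segments) - 1
            lines = textwrap.wrap(segment, width) or [""]
            for l_index, line in enumerate(lines):
                if l_index < len(lines) - 1:
                    reason = "wrap"             # case 2
                elif not last_segment:
                    reason = "line-break"       # case 3
                elif not last_paragraph:
                    reason = "paragraph-break"  # case 5
                else:
                    reason = "end-of-content"   # case 4
                result.append((line, reason))
    return result

doc = ["alpha beta gamma delta", "one\ntwo"]
for line, reason in classify_line_ends(doc, width=12):
    print(f"{line!r:16} {reason}")
```

Case 1 (any line end, whatever the reason) is simply every entry of the returned list.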

Comment 1 Eyal Rozenberg 2025-03-27 13:05:35 UTC

Note that this issue is independent of whether text can be matched past the line-end (past line-break, past-paragraph-break etc.; see bug 159607)

Comment 2 Mike Kaganski 2025-03-27 13:12:12 UTC

(In reply to Eyal Rozenberg from comment #0)
> It is thus legitimate for the user to want to match Lines ending ...:

This is another "I claim people need it, without telling why" request. IMO, this is a strong WF (should I also not tell why? But I'll tell: it would be a HUGE increase of complexity, splitting the search between the used regex engine, and the layout - finding these *truly ephemeral* (at least technically) breaks); but maybe there really *is* a compelling use case?

Comment 3 Heiko Tietze 2025-03-28 08:12:17 UTC

Second the missing use case argument. And what exactly is a "non-break line ending"? CR/LF for sure not.

Comment 4 Eyal Rozenberg 2025-03-28 14:57:03 UTC

(In reply to Mike Kaganski from comment #2)
> This is another "I claim people need it, without telling why" request.

(In reply to Heiko Tietze from comment #3)
> Second the missing use case argument. And what exactly is a "non-break line
> ending"? CR/LF for sure not.

I'm sorry, I mistakenly assumed it was somewhat-obvious to everyone because it was very obvious to me; which, indeed, is an occasional failing of mine. So, let me spell out what I mean.

=====
Use case: The user wants to locate some quote or piece of text in their document, and while they don't remember the text exactly, they do remember it had a certain word or phrase at the end of the line.
=====

You'll agree that this is relatively common, right?

OK, onwards.

As we write documents, many, or even most, transitions between lines are not due to breaks inserted by the user (such as paragraph breaks or line breaks), but rather - stretches of text which are too long to fit on one line, and get wrapped to additional lines. If a typical paragraph in our document takes up, say, 4 lines - we have at least 3 ends-of-lines which are not due to a break and at most one that is.

Additionally, the last (or sometimes single) paragraph within a larger entity, e.g. a drawing object, does not end with a paragraph break; it ends with the end of the containing entity.

Given that many or most lines end without a break, catering to the use case I listed above requires the ability to indicate that a certain point in a pattern is located at the end of a line. And - since we can already more-or-less locate line breaks and paragraph breaks in patterns (with /\n/ and /$/ respectively), being able to locate an end-of-content end-of-line and a wrap end-of-line, say with patterns FOO and BAR, will "complete the series", and then we would be able to say /(\n|$|FOO|BAR)/ to match _any_ end-of-line.

Comment 5 László Németh 2025-03-28 16:37:22 UTC

Regex library search operates on the plain text conversion of the document, where a single text line contains the full text of a paragraph (i.e. paragraph/line). We always need a plain text conversion (back and forth) of the document for regex search, and we have only a single \n for line end (i.e. in plain text editors, you cannot search for paragraph end without adding some extra syntax or heuristic – similarly, in Writer plain text import, there is a heuristic to recognize shorter lines as paragraph boundaries).
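The effect of searching the one-line-per-paragraph text can be shown with a small Python sketch. It assumes Python's `re` in place of the regex library used by Writer and `textwrap` in place of the layout, so it only illustrates the principle: the engine sees one long line, and the words sitting at wrap points are not distinguishable from any other word.

```python
import re
import textwrap

paragraph = "The quick brown fox jumps over the lazy dog near the river bank"

# What the regex engine sees: the whole paragraph as a single text line,
# so "$" can only anchor at the paragraph end.
engine_view = re.findall(r"\w+$", paragraph)

# What the reader sees: wrapped lines (textwrap is a stand-in for layout).
visual_lines = textwrap.wrap(paragraph, width=20)
reader_view = [re.search(r"\w+$", line).group() for line in visual_lines]

print(engine_view)
print(reader_view)
```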

Fortunately there are possible solutions or workarounds: 1) Easy command line, 2) Macro + Find & Replace 3) Macro only (first step for an add-on development) 

== 1) Easy command line ==

1) Export your document to PDF.

2) Grep your plain text content of the PDF, showing the matching lines in Linux/macOS/Cygwin command line: 

$ less document.pdf | grep '” *$'

Note: When I did some research for hyphenation development (https://numbertext.org/typography/automatikus_magyar_elv%C3%A1laszt%C3%A1s_a_LibreOffice-ban.pdf), I used this, generating hundreds of documents with pyUNO, and the basic Linux tool "less" converted the PDFs to plain text documents with the requested line breaks immediately.
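For readers without grep, the same filter can be sketched in Python. This assumes the PDF has already been converted to plain text with its line breaks preserved (the conversion step itself is not shown), and the function name is hypothetical.

```python
import re

def lines_ending_with_quote(text):
    """Return the lines that end with a closing quotation mark,
    optionally followed by spaces (same pattern as the grep call)."""
    pattern = re.compile(r"” *$")
    return [line for line in text.splitlines() if pattern.search(line)]

sample = "He said “hello”\nno quote here\n“end” \nmid ” dle"
print(lines_ending_with_quote(sample))
```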

== 2) Macro + Find & Replace ==

1. Mark line ends with neutral Unicode characters using UNO, e.g. with zero-width joiner (it depends on your text).

2. Apply Find & Replace with regex pattern matching, e.g. "\w+\W?\u200d" to select last line words (with an optional punctuation mark) using Find All.

3. Format the selected words, e.g. underline them (but other formatting, e.g. applying bold text would change the following line ends, so sometimes it's better to use only macro). 

4. Remove the neutral Unicode characters using Find & Replace.

For example, the Basic code for inserting ZWJ (U+200d): 

'''''''''''''
Sub RunArg(command, args)
dim document   as object
dim dispatcher as object
document   = ThisComponent.CurrentController.Frame
dispatcher = createUnoService("com.sun.star.frame.DispatchHelper")
dispatcher.executeDispatch(document, command, "", 0, args)
End Sub

Sub Run(command)
RunArg(command, Array())
End Sub

Sub HardBreak()
dim args1(1) as new com.sun.star.beans.PropertyValue
cursor = ThisComponent.CurrentController.getViewCursor()
Run(".uno:Escape")
Run(".uno:GoToEndOfDoc")
Do
	' insert ZWJ (zero-width joiner, U+200D) character at the end of the line
	Run(".uno:GoToEndOfLine") 
	args1(0).Name = "Text"
	args1(0).Value = "‍" ' ZWJ within quotation marks
	RunArg(".uno:InsertText", args1)
	' go to the previous line
	Run(".uno:GoLeft")
	Run(".uno:GoToStartOfLine")
	origStart = cursor.Start
	Run(".uno:GoUp")
	' loop until the cursor position doesn't change any more
Loop Until cursor.Text.compareRegionStarts(origStart, cursor.Start) = 0
End Sub
''''''''''''''''''''

Note: it seems, ZWJ can modify hyphenation (maybe a bug), see the attached screenshot.
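The marker idea of this section can also be tried outside Writer. The following Python sketch is only a simulation under stated assumptions: `textwrap` stands in for the layout, the ZWJ is placed directly after the last character of each visual line, and `mark_line_ends` is a hypothetical helper, not a UNO call.

```python
import re
import textwrap

ZWJ = "\u200d"  # zero-width joiner, the neutral marker used above

def mark_line_ends(paragraph, width):
    """Step 1: append ZWJ to the end of every simulated visual line."""
    lines = textwrap.wrap(paragraph, width)
    return " ".join(line + ZWJ for line in lines)

paragraph = "The quick brown fox jumps over the lazy dog."
marked = mark_line_ends(paragraph, width=20)

# Step 2: last word of each line, with an optional punctuation mark.
last_words = re.findall(r"\w+\W?" + ZWJ, marked)

# Step 4: remove the markers again.
restored = marked.replace(ZWJ, "")

print([word.replace(ZWJ, "") for word in last_words])
```

The formatting step (step 3) has no plain-text equivalent, so it is left out here.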

== 3) Macro-only ==

When the regex replace modifies line breaking, i.e. the line ends, it's better to use a macro-only solution, e.g. extending the previous macro to do everything automatically. For example, selecting the document line by line using UNO dispatcher calls:

	Run(".uno:GoToEndOfLine")
	Run(".uno:StartOfLineSel")

and calling Find & Replace with Search In Selection:

Sub SearchInSelection(regex)
dim args1(22) as new com.sun.star.beans.PropertyValue
args1(0).Name = "SearchItem.StyleFamily"
args1(0).Value = 2
args1(1).Name = "SearchItem.CellType"
args1(1).Value = 0
args1(2).Name = "SearchItem.RowDirection"
args1(2).Value = true
args1(3).Name = "SearchItem.AllTables"
args1(3).Value = false
args1(4).Name = "SearchItem.SearchFiltered"
args1(4).Value = false
args1(5).Name = "SearchItem.Backward"
args1(5).Value = false
args1(6).Name = "SearchItem.Pattern"
args1(6).Value = false
args1(7).Name = "SearchItem.Content"
args1(7).Value = false
args1(8).Name = "SearchItem.AsianOptions"
args1(8).Value = false
args1(9).Name = "SearchItem.AlgorithmType"
args1(9).Value = 1
args1(10).Name = "SearchItem.SearchFlags"
args1(10).Value = 71680 ' code for search in selection
args1(11).Name = "SearchItem.SearchString"
args1(11).Value = regex
args1(12).Name = "SearchItem.ReplaceString"
args1(12).Value = ""
args1(13).Name = "SearchItem.Locale"
args1(13).Value = 255
args1(14).Name = "SearchItem.ChangedChars"
args1(14).Value = 2
args1(15).Name = "SearchItem.DeletedChars"
args1(15).Value = 2
args1(16).Name = "SearchItem.InsertedChars"
args1(16).Value = 2
args1(17).Name = "SearchItem.TransliterateFlags"
args1(17).Value = 1073743104
args1(18).Name = "SearchItem.Command"
args1(18).Value = 1
args1(19).Name = "SearchItem.SearchFormatted"
args1(19).Value = false
args1(20).Name = "SearchItem.AlgorithmType2"
args1(20).Value = 2
args1(21).Name = "Quiet"
args1(21).Value = true
args1(22).Name = "SynchronMode"
args1(22).Value = true
RunArg(".uno:ExecuteSearch", args1())
end sub

(Note the SynchronMode argument: it is there so that the document and its text lines are updated before the next line is selected.)

Note: adding the ZWJ or other mark to the line is still needed.

So this works for me (especially because regex is already a feature for advanced users), but if you think otherwise, please file an enhancement request or reopen this issue. Maybe it's worth adding a complete macro-only solution.

Comment 6 László Németh 2025-03-28 16:39:53 UTC

Created attachment 200063 [details]
selecting last words of the text lines in Find & Replace (after adding U+200d to the end of the lines)

Comment 7 Eyal Rozenberg 2025-03-28 18:11:03 UTC

(In reply to László Németh from comment #5)
> Regex library search operates on the plain text conversion of the document,

First, the presumption to treat a document as a sequence of sequences of (paragraph-level) plain text characters - is itself a bug. A document is not that. Other office suites, like MSO for example, do not presume to limit searches to plain-text representations.

Would you rather we also filed that as an underlying bug which blocks many of the feature requests?

> Fortunately there are possible solutions or workarounds: etc. etc.

Laszlo, with respect - suggestions involving complicated procedures, certainly exporting PDFs and working outside the app, are not a way that LibreOffice "works for you". There are workarounds to lots of bugs - that does not invalidate them.

Given what you've said, please consider confirming.

Comment 8 Mike Kaganski 2025-03-28 18:22:36 UTC

(In reply to Eyal Rozenberg from comment #4)
> Use case: The user wants to locate some quote or piece of text in their
> document, and while they don't remember the text exactly, they do remember
> it had a certain word or phrase at the end of the line.
> =====
> 
> You'll agree that this is relatively common, right?

Heh, no. I would consider that highly unlikely, almost impossible case (and my reasoning about great complexity to solve this likely niche use case would mean WF). Additionally - remembering such things in a *text flow*, i.e. when any insertion above could change this, would make such a memory unreliable (the lines shuffled since then - you'll agree that this is relatively common, right?). But I can of course be mistaken wrt how common this could be.

Comment 9 Eyal Rozenberg 2025-03-28 18:53:56 UTC

(In reply to Mike Kaganski from comment #8)
> my reasoning about great complexity to solve this likely niche use case
> ... But I can of course be mistaken wrt how common this could be.

Ah, but think of what happens if you save your document as a text file; or if you were working on a text file to begin with. Now, the use case I described is simply "look for a pattern followed by \n" - because in text files, all lines end with \n (if we ignore \r anyway).

So if you "back-translate" this use case - one of the most common you can think of in a text file - to a Writer document - you get this bug (possibly without the distinction between non-break and with-break).

Indeed, supporting this would mean quite a bit of work - it is not a trivial feature to implement. But finding is in the core-of-cores of the functionality of an editor, and a document (rather than plain text) editor needs to be able to find stuff. So this is not a kind of bells-and-whistles request which one can decide that we don't want to have in LO on principle.

Comment 10 Mike Kaganski 2025-03-28 18:57:45 UTC

(In reply to Eyal Rozenberg from comment #9)

But the *whole* difference is that Writer *is* not a text editor, but a *word processor*, with the core idea of text flow *in its heart*. Trying hard to push a scenario from a *different* class of software here is ideologically wrong. It is *very important* for a word processor user to stop thinking in categories not related to this software. And you are doing exactly that - making users feel like they use a text editor.

The more I read, the more I prefer WF.

Comment 11 Mike Kaganski 2025-03-28 18:59:56 UTC

(And even in a text editor, I *myself* would never ever recall that I saw some word at the end of a line - only if it was at the end of a paragraph, even if that paragraph was in my head - so for me, even in a text editor, this would be ~impossible scenario; but indeed, that may be me.)

Comment 12 László Németh 2025-03-29 01:30:21 UTC

@Eyal: thanks for your suggestion and persistence! I found more attractive examples and an opportunity for extension, see the end of my comment (after my long hesitation). 

(In reply to Eyal Rozenberg from comment #7)
> (In reply to László Németh from comment #5)
> > Regex library search operates on the plain text conversion of the document,
> 
> First, the presumption to treat a document as a sequence of sequences of
> (paragraph-level) plain text characters - is itself a bug. A document is not
> that. Other office suites, like MSO for example, do not presume to limit
> searches to plain-text representations.

Any known implementation helps to prove the legitimacy of an enhancement. I've tried to check Adobe InDesign's GREP regexes, which are based on the boost library. MSO uses/used a more simplified regex-like pattern matching; I haven't checked it yet.

There is a clear requirement for matching the end (and start) of the lines in the case of optical margin. I've added something in Linux Libertine G, using Graphite's regex-like pattern matching, but that was – a possible – solution for a very special problem.

> 
> Would you rather we also filed that as an underlying bug which blocks many
> of the feature requests?
> 
> > Fortunately there are possible solutions or workarounds: etc. etc.
> 
> Laszlo, with respect - suggestions involving complicated procedures,
> certainly exporting PDFs and working outside the app, are not a way that
> LibreOffice "works for you". There are workarounds to lots of bugs - that
> does not invalidate them.

The problem is that my most difficult solution is still less complicated than the possible core implementation of the proposed feature, which would break rules, i.e. normal behavior of regex search by mixing different *standards*/areas of Writer core, as Mike pointed out. Designing and implementing a new regex search, which cannot support replace, or adding a new layer, which operates the layout update during the replace is an unaffordable price for the recently [i.e. before changing my mind :] known importance/interoperability of this feature.

On the other hand, UNO's XLineCursor is a perfect API for this and much more. I almost gave a ready solution for a macro solution, which can be the core of an add-on. An experienced LibreOffice add-on developer can finish it within a few days.

> 
> Given what you've said, please consider confirming.

My pleasure. Moreover, I just realized that the layout regex with an extended syntax seems highly useful for my upcoming typography developments: adding 4 different regex layout boundary marks for 1) line end, 2) column end 3) page end and 4) spread end, and a 5th mark for the hyphenation, e.g. (maybe used by ICU regex) \L, \C, \P, \S and \H. Their usage in the regex pattern in Find&Replace enables the layout text export automatically instead of the recent one (document model).

For example:

1) Search too short last paragraph lines (ten or less characters – a real typographical problem):

\L[^\L]{1,10}\n

2) Search too short hyphenation at the end of the pages (also a real typographical problem)

\b\w{1,2}\H\P

3) Search for five or more consecutive hyphenated lines (also an issue):

([^\L\C]*\H){5,}

4) Select (and count) the hyphenated lines:

[^\L\P]*\H 
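These patterns can be tried today with a small emulation. The Python sketch below is an assumption-laden illustration, not the proposed implementation: the layout marks are replaced by private-use code points before the pattern is compiled, the "layout text" is written by hand, and the helper name is invented. `\S` is left out because it already means "non-whitespace" in Python's `re`.

```python
import re

# Private-use code points standing in for the proposed layout marks
# (hypothetical mapping; \S is omitted, it is a standard regex escape).
MARKS = {"\\L": "\ue000", "\\C": "\ue001", "\\P": "\ue002", "\\H": "\ue004"}
LINE_END, HYPHEN = MARKS["\\L"], MARKS["\\H"]

def compile_layout_regex(pattern):
    # Naive textual substitution: an escaped backslash followed by a mark
    # letter (for example "\\\\L") is not handled correctly.
    for escape, sentinel in MARKS.items():
        pattern = pattern.replace(escape, sentinel)
    return re.compile(pattern)

# Hand-made layout text for two paragraphs, each ending with "\n".
para1 = "A first line that fills the width" + LINE_END + "here.\n"
para2 = "We like typo" + HYPHEN + LINE_END + "graphy a lot, really.\n"

short_last_line = compile_layout_regex(r"\L[^\L]{1,10}\n")  # example 1
hyphenated_line = compile_layout_regex(r"[^\L\P]*\H")       # example 4

print(short_last_line.findall(para1), short_last_line.findall(para2))
print(hyphenated_line.findall(para1), hyphenated_line.findall(para2))
```

Only the first paragraph has a too-short last line, and only the second one contains a hyphenated line, so each pattern reports exactly one hit.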

It's not clear yet whether this is the best way to solve my problems, but I like the freedom it would give us to analyse and adjust typography.
 
"The best fixes are the ones you get for free by fixing something else :-)" :)
(https://bugs.documentfoundation.org/show_bug.cgi?id=108025#c18).

Comment 13 Eyal Rozenberg 2025-03-29 09:41:32 UTC

(In reply to Mike Kaganski from comment #11)

I search for stuff at the ends of lines and the beginnings of lines all the time, in text editors. Admittedly, in paragraphs, much less frequently.

> It is *very important* for a word processor user to stop thinking 
> in categories not related to this software. 

Well, for that to actually be the case, we would need to offer 'structural' search rather than just a linear textual pattern, e.g. let people say: "Find me two vertically-adjacent table cells both containing a certain word, in a table that's aligned to the left of the page". But - we don't offer anything like that, we only allow textual search patterns, and some simple style constraints on the whole match. So we should at least bolster the textual pattern capabilities somewhat.

Comment 14 Eyal Rozenberg 2025-03-29 09:59:23 UTC

(In reply to László Németh from comment #12)
> I just realized that the layout regex with an
> extended syntax seems highly useful for my upcoming typography developments:
> adding 4 different regex layout boundary marks for 1) line end, 2) column
> end 3) page end and 4) spread end, and a 5th mark for the hyphenation, e.g.
> (maybe used by ICU regex) \L, \C, \P, \S and \H. Their usage in the regex
> pattern in Find&Replace enables the layout text export automatically instead
> of the recent one (document model).

That's a pleasing compromise between what we have now and the behemoth project of actual  document-structural search. It's also generally in line with the approach of MSO.

An important point to note is the distinction between breaks-of-things and ends-of-things. I focused on paragraphs, but if you're looking at pages or columns - sometimes you flow into the next column or page, sometimes you manually break. 

Also, in regular expressions, there are marks for the beginning and the end of the unit of content you're allowed to search, which right now is a paragraph; and there isn't a generic boundary for those (unlike words, where we do have word boundaries). 

I liked your examples... although there is a nitpick w.r.t. the use of \n; see bug 108256; it doesn't match ends-of-paragraphs, nor paragraph breaks, only line breaks, for now.

Another idea in the same vein as your examples: Search for mid-paragraph lines with too few characters before the non-break line end, i.e. too much justification space:

(^|\L).{,40}\L

(using your syntax for a line break.)
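One detail worth checking with such a pattern is that it consumes the mark closing each match, so the next line no longer has a mark in front of it. A Python sketch under the same kind of assumptions as before (a private-use code point stands in for the proposed \L, the layout text is hand-made, the function name is hypothetical) compares it with a variant that uses lookarounds instead:

```python
import re

LINE_END = "\ue000"  # hypothetical stand-in for the proposed \L mark

def short_wrapped_lines(layout_text, limit=40):
    """Lines of 1..limit characters that end at a wrap, not at a break."""
    not_mark = "[^" + LINE_END + "\\n]"
    pattern = re.compile(
        "(?:^|(?<=" + LINE_END + "))"            # line start, mark not consumed
        + not_mark + "{1," + str(limit) + "}"
        + "(?=" + LINE_END + ")",                # wrap follows, mark not consumed
        re.MULTILINE,
    )
    return pattern.findall(layout_text)

layout_text = (
    "short" + LINE_END
    + "also short" + LINE_END
    + "this line is definitely much longer than the limit" + LINE_END
    + "end of paragraph\n"
)

# Direct transcription of the consuming form, with a limit of 12.
consuming = re.compile("(?:^|" + LINE_END + ").{0,12}" + LINE_END, re.MULTILINE)

print(short_wrapped_lines(layout_text, limit=12))
print(len(consuming.findall(layout_text)))
```

With this sample the lookaround variant reports both short lines, while the consuming form reports only the first, because the mark in front of the second line was already used up.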

> It's not clear yet, this is the best way to solve my problems, but I like
> the freedom it would give us to analyse and adjust typography.
>  
> "The best fixes are the ones you get for free by fixing something else :-)"

Reminds me of the adage about the best programs being the ones you write while working on something :-)

All that being said, I suggest that before you tighten all the bolts on an implementation here, and in respect to Mike's opinion, someone mention this at the ESC and/or the weekly design meeting, for people to possibly get mad and sound alarms if they want to...

And - if you start working on this, please also have a look at the other bugs in the similar vein blocking the meta-bug which might get automatically resolved by your approach.

Comment 15 V Stuart Foote 2025-03-29 12:39:59 UTC

(In reply to László Németh from comment #12)
> extended syntax seems highly useful for my upcoming typography developments:
> adding 4 different regex layout boundary marks for 1) line end, 2) column
> end 3) page end and 4) spread end, and a 5th mark for the hyphenation, e.g.
> (maybe used by ICU regex) \L, \C, \P, \S and \H. Their usage in the regex
> pattern in Find&Replace enables the layout text export automatically instead
> of the recent one (document model).

Awesome, but please do not use a '\' indicator for the pattern description, it already would conflict with ICU regex \P{} (for {UNICODE PROPERTY NAME}).

Maybe adopt the MSO structural meta marker of '^', even allow the ^p as an analog of $ and ^l for \n,  and clearly show it is not an ICU regex implementation?
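The clash with \P{...} can be illustrated with a small pattern pre-processor. This Python sketch is hypothetical throughout (the sentinel, the function name and the whole approach are assumptions made for illustration): it walks over the backslash escapes of a pattern and replaces `\P` by a private-use sentinel only when no `{` follows.

```python
import re

PAGE_END = "\ue002"  # hypothetical stand-in for a page-end layout mark

# Match one escape at a time, so that "\\\\P" (escaped backslash, then a
# literal P) is consumed as "\\\\" and its P is left alone.
_ESCAPE = re.compile(r"\\(?:(P)(?!\{)|(.))", re.DOTALL)

def translate_page_marks(pattern):
    def replace(match):
        # group(1) is set only for "\P" that is not followed by "{"
        return PAGE_END if match.group(1) else match.group(0)
    return _ESCAPE.sub(replace, pattern)

print(translate_page_marks(r"\b\w{1,2}\H\P") == r"\b\w{1,2}\H" + PAGE_END)
print(translate_page_marks(r"\P{Lu}+\P") == r"\P{Lu}+" + PAGE_END)
```

The lookahead keeps `\P{Lu}` intact, but it cannot tell a quantified page mark such as `\P{2}` from a property expression, which is the kind of ambiguity a different marker character would avoid.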

Comment 16 Heiko Tietze 2025-03-31 11:11:55 UTC

Volunteer assigned, removing UX keyword. Please remember to document, i.e. add changes to the help page (for a recent complaint see bug 165725 comment 15).