Bug 116146 - Index concordance entry "Search term" does not accept regular expression
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 6.0.2.1 release (earliest affected)
Hardware: x86-64 (AMD64) All
Importance: medium enhancement
Assignee: Not Assigned
Blocks: Concordance-File
Reported: 2018-03-02 19:23 UTC by Eric Bright
Modified: 2018-10-14 06:35 UTC


Attachments
A concordance.sdi file to be used along with the .odt file. (141 bytes, text/plain)
2018-03-10 16:02 UTC, Eric Bright
A .odt example file. (15.22 KB, application/vnd.oasis.opendocument.text)
2018-03-10 16:05 UTC, Eric Bright

Description Eric Bright 2018-03-02 19:23:06 UTC
The "Search term" field in a concordance file, i.e. a .sdi file, does not accept regular expressions. If one is entered, the whole row is ignored.

Example:

consciousness;;;;;
metaphysics;;;;;
biology;;;;;1
Dennett(?!, however);Dennett, Daniel;;;;
(?<!with )Daniel(?! .2| A);Dennett, Daniel;;;;

In the above example, the last two rows are ignored when the index is built, although they still show up in the "Edit Concordance File" dialogue box.
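
To make the row layout explicit, here is a minimal Python sketch of how each row above splits into fields. It assumes the six-field layout described in the LibreOffice help for concordance files (search term; alternative entry; 1st key; 2nd key; match case; word only). The names `ConcordanceRow` and `parse_row` are hypothetical and only for illustration; this is not LibreOffice code.

```python
from typing import NamedTuple


class ConcordanceRow(NamedTuple):
    # Field order is an assumption taken from the concordance file help page.
    search_term: str
    alternative_entry: str
    key1: str
    key2: str
    match_case: bool
    word_only: bool


def parse_row(line: str) -> ConcordanceRow:
    """Split one .sdi row on ';' into its six fields (hypothetical helper)."""
    # Pad with empty strings so rows with fewer than six fields still parse.
    fields = (line.rstrip("\n").split(";") + [""] * 6)[:6]
    term, alt, key1, key2, case, word = fields
    # "1" means the flag is set; anything else is treated as unset.
    return ConcordanceRow(term, alt, key1, key2, case == "1", word == "1")


if __name__ == "__main__":
    for line in ("biology;;;;;1", "Dennett(?!, however);Dennett, Daniel;;;;"):
        print(parse_row(line))
```

Note that the regular expressions in the last two rows contain no semicolons, so they split cleanly into fields; the rows are well-formed as far as the file format goes.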

To fix this issue, at least two things need to be addressed:

1- Make the "Search term" behave exactly as a search term behaves in the "Find & Replace" function, and use the same code base for both.

2- Allow regular expressions to be enabled for the "Search term" cells of a concordance file as an option. The option could be set in the "Table of Contents, Index or Bibliography" dialogue box, the same way it appears in the "Find & Replace" dialogue box, which would make the whole indexing process much easier.
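
As a rough sketch of what point 2 asks for, the Python below treats a search term either as literal text or as a regular expression, depending on an option. The function `build_matcher` and its `use_regex` parameter are hypothetical names, and Python's `re` module only stands in for whatever engine "Find & Replace" uses; this is not how LibreOffice is implemented.

```python
import re


def build_matcher(term, use_regex=False, match_case=False, word_only=False):
    """Compile a concordance search term (hypothetical helper)."""
    # With the option off, metacharacters are escaped and matched literally.
    pattern = term if use_regex else re.escape(term)
    if word_only:
        # Require word boundaries on both sides of the term.
        pattern = r"\b(?:" + pattern + r")\b"
    flags = 0 if match_case else re.IGNORECASE
    return re.compile(pattern, flags)


if __name__ == "__main__":
    term = "Dennett(?!, however)"
    text = "Dennett, however, disagrees. Dennett agrees."
    print(bool(build_matcher(term).search(text)))                  # literal
    print(bool(build_matcher(term, use_regex=True).search(text)))  # regex
```

With the option off, the term only matches the literal characters `Dennett(?!, however)`, which never occur in ordinary prose. That would explain why such rows currently contribute nothing to the index.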

This fix would make LibreOffice considerably more flexible and would make external indexing software almost unnecessary. To make such software completely unnecessary, LibreOffice would also need ways to exclude pages and words from the index, to undo changes made in the text by the index functions, to update the index when new words are typed, and so on.

As it stands now, the Index function is almost useless: there is no way to exclude pages or word combinations, so a lot of junk gets into the final index and cannot be cleaned up easily. Regular expressions would allow terms to be excluded (as shown in the example above).
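
To illustrate the kind of exclusion meant here, the Python sketch below runs the two patterns from the example rows against a made-up sentence. The sample text is hypothetical and is not taken from the attached document. Python's `re` module is used only as a stand-in, on the assumption that lookahead and fixed-width lookbehind behave the same way in the engine LibreOffice uses.

```python
import re

# The two search terms from the last two rows of the example.
DENNETT = re.compile(r"Dennett(?!, however)")
DANIEL = re.compile(r"(?<!with )Daniel(?! .2| A)")

# Hypothetical sample text, invented for this illustration.
TEXT = (
    "Dennett argues that consciousness is real. "
    "Dennett, however, disagrees. "
    "I spoke with Daniel yesterday. "
    "Daniel Dennett wrote it."
)


def matched_offsets(pattern, text):
    """Return the start offset of every match of pattern in text."""
    return [m.start() for m in pattern.finditer(text)]


if __name__ == "__main__":
    print(matched_offsets(DENNETT, TEXT))
    print(matched_offsets(DANIEL, TEXT))
```

The negative lookahead skips "Dennett, however", and the negative lookbehind skips "with Daniel", while the other occurrences are still found. A plain-text search term cannot express either exclusion.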
Comment 1 Buovjaga 2018-03-10 14:55:09 UTC
Ok, I tried reproducing it with the help of https://helponline.libreoffice.org/latest/en-US/text/swriter/01/04120250.html
Could you show concretely how the last two rows are ignored? Perhaps you could attach an example document.
Comment 2 Eric Bright 2018-03-10 16:02:51 UTC
Created attachment 140549
A concordance.sdi file to be used along with the .odt file.
Comment 3 Eric Bright 2018-03-10 16:05:23 UTC
Created attachment 140550
A .odt example file.

Use along with the attached .sdi file. The text was chosen to demonstrate the effects of the bug in conjunction with the concordance.sdi file.
Comment 4 Eric Bright 2018-03-10 16:05:52 UTC
Try to add an index at the end of the attached .odt document as follows:

1. Insert > Table of Contents and Index > Table of Contents, Index or Bibliography
2. under tab Type, 'Type and Title' > 'Alphabetical Index'
3. under tab Type, 'Options' > 'Concordance file', put a check mark
4. under tab Type, 'Options' > 'File' dropdown > 'Open'
5. select concordance.sdi and click Open
6. click OK

Now the index is created. Notice that Dennett is not indexed.

If you remove the regular expressions from the .sdi file and Update the index, then Dennett will be indexed, but all the wrong Dennetts will also be indexed.

The regular expressions are used for two reasons:

1. the Index function must not index the Index itself; and
2. the "Daniel" entry on the 2nd page needs to be excluded.
Comment 5 Buovjaga 2018-03-10 17:17:27 UTC
Ok, thanks. I will set this to an enhancement.

Arch Linux 64-bit
Version: 6.1.0.0.alpha0+
Build ID: 22b1d4784d02070ae1933c59cf2c9bb5a5284773
CPU threads: 8; OS: Linux 4.15; UI render: default; VCL: kde4; 
Locale: fi-FI (fi_FI.UTF-8); Calc: group
Built on March 10th 2018