The workflow and use case of the similarity search are difficult to understand. In particular, Combine reads as if the parameters are joined with a logical OR when it is checked, while without it all parameters have to be met. However, that is not clear, and the documentation lacks an example.
Maybe this can shed some light: The algorithm used is a Weighted Levenshtein Distance (WLD), including the wildcards ? and *. The mathematical definition of the real WLD means EITHER a maximum of X replacements OR Y characters shorter OR Z characters longer, where a mix of operations is allowed but every operation draws from one shared 100% pool. The relaxed mode (Combine in the UI, SplitCount internally) allows a maximum of X replacements AND/OR Y characters shorter AND/OR Z characters longer. Here only insertions and deletions share one pool from which they draw; replacements use a second, independent pool. This is closer to what a user expects if not familiar with WLD. More details and an example can be found in the comments at https://opengrok.libreoffice.org/xref/core/i18npool/source/search/levdis.hxx?r=ee8f0a10#26
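The pool description above can be sketched in a few lines of Python. This is only an illustration of the wording in this comment, not the code in levdis.hxx: the function names (`pool_share`, `accepts_strict`, `accepts_relaxed`) are hypothetical, the edit counts are assumed to be given rather than computed from two strings, and the handling of a limit of 0 is an assumption.

```python
from fractions import Fraction


def pool_share(count, limit):
    """Share of a 100% pool used by `count` operations when at most `limit` are allowed."""
    if count == 0:
        return Fraction(0)
    if limit == 0:
        # Assumption: an operation with limit 0 is forbidden, so any use overdraws the pool.
        return Fraction(2)
    return Fraction(count, limit)


def accepts_strict(replaced, shorter, longer, x, y, z):
    """Strict mode as described above: all three operations draw from one shared pool."""
    used = pool_share(replaced, x) + pool_share(shorter, y) + pool_share(longer, z)
    return used <= 1


def accepts_relaxed(replaced, shorter, longer, x, y, z):
    """Relaxed mode as described above: replacements have their own pool,
    while deletions and insertions share a second one."""
    replace_pool = pool_share(replaced, x)
    length_pool = pool_share(shorter, y) + pool_share(longer, z)
    return replace_pool <= 1 and length_pool <= 1


# With X = Y = Z = 2:
print(accepts_strict(1, 0, 1, 2, 2, 2))   # True: 1/2 + 1/2 uses exactly the shared pool
print(accepts_strict(2, 0, 1, 2, 2, 2))   # False: 2/2 + 1/2 overdraws the shared pool
print(accepts_relaxed(2, 0, 1, 2, 2, 2))  # True: replacements and length changes are separate
print(accepts_relaxed(2, 2, 1, 2, 2, 2))  # False: 2/2 + 1/2 overdraws the length pool
```

Under this reading, the relaxed mode accepts everything the strict mode accepts, plus matches that use up the replacement limit and still change the length.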
(In reply to Eike Rathke from comment #1) > Maybe this can shed some light: ...but not enough. I was going to try to improve the help text, but I am not sure I understand completely. If "Combine" is UNchecked in the Similarity Search dialog, what happens (i.e., how is it different from when Combine is checked)? The source code says EITHER; does that mean that each parameter is used exactly for a match (which in everyday thinking sounds like "combine")? The mathematical explanation of relaxed WLD sounds like what one would expect (in everyday language use) if the descriptions of each parameter (on the help page) are combined. A guess at possible text for the help page under the Combine heading: "If unchecked, then search matches any item that matches one of the three parameters. If checked, then an intelligent combination of the settings for exchange, add, and remove characters is used." cc: Eike Rathke. Just curious: if these interpretations are correct, then it is hard to understand how checking or unchecking Combine will make a big difference in practice. If that naive speculation is completely wrong, then a practical "tip" about when it is better to choose one or the other would be good (and could be added to the help page in a "tip" box).
Created attachment 166244 Document with instructions for testing similarity search. Is similarity search supposed to be able to find two words (i.e., two strings of letters with a space between them)? If yes, then maybe there is a bug. If no, I will include a note in the documentation. See the attached file for simple, detailed instructions on how to reproduce the behavior (tested with 7.1.0.0.alpha0+).
(In reply to sdc.blanco from comment #3) > Is similarity search supposed to be able to find two words (i.e, two letter > strings with a space between them)? See bug 126294 for a similar problem.
Seth Chaiklin committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/help/commit/bbb9b402a4197f412a411efeef434b168d0ce96d Partially resolves: tdf#129492 (and related to: tdf#64739) improve explanation of Similarity search
(In reply to Heiko Tietze from comment #0) > The workflow and use case of the similarity search is difficult to > understand. In particular, Combine reads as if the parameters are put > together with a logical OR while without all parameters have to be met. The logic of Combine should now be explained, along with some tips about usage. > an example is missing at the documentation. It is still missing, hence "partially resolved". Who can provide a good, short, useful example?
https://en.wikipedia.org/wiki/Levenshtein_distance#Example However, as Wikipedia is CC-BY-SA and mentioning every BY in the help is quite cumbersome (or do we already do that with a template?), it is better to link to the example or to create a new one with a similarly small number of steps. Or just link to the article as a whole.
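For readers who want to try the kind of example linked above, here is a minimal sketch of the plain, unweighted Levenshtein distance in Python. It is not the weighted variant used by the similarity search (no weights, no wildcards), and the function name `levenshtein` is only illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character replacements, insertions,
    and deletions needed to turn `a` into `b` (unweighted)."""
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # replace ca with cb, or keep it if equal
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3: two replacements and one insertion
```

In terms of the dialog, turning "kitten" into "sitting" needs two exchanged characters and one added character, which is the sort of short step-by-step example a help page could walk through.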
Olivier Hallot committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/help/commit/9d2a16b7eb33cf0ff58e010d502d64c6dfcdff4f tdf#129492 Similarity search examples