50786 – Phone Book sort algorithm

Bug 50786 - Phone Book sort algorithm

Summary: Phone Book sort algorithm

Status:	NEW

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Calc (show other bugs)
Version: (earliest affected)	3.5.3 release
Hardware:	All All

Importance:	medium enhancement
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:

Duplicates (1):	134747 (view as bug list)
Depends on:
Blocks:	Sorting
	Show dependency tree / graph

Reported:	2012-06-06 06:21 UTC by Bert Van Vreckem
Modified:	2020-12-13 00:31 UTC (History)
CC List:	3 users (show)

See Also:
Crash report or crash signature:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Bert Van Vreckem 2012-06-06 06:21:44 UTC

The sort algorithm (apparently) sorts on ascii values of the characters in text strings. Consequently, a space character is also considered to be a letter. However, a space should not have an influence on the sorting order and should be ignored.

If it is not possible to implement this in the standard sort function (as it would change the behaviour of a feature that is used quite a lot), then I hope it can at least be implemented in "natural sort".

An example: Flemish surnames often start with "De" or "Van de" and these names can be written either with or without spaces. LibreOffice Spreadsheet sorts these names incorrectly:

DESIRED RESULT:

1stname	Surname
Bart	Debacker
Peter	De Decker
Mark	Demol
Rudy	De Wilde
Wim	Dewilde

ACTUAL RESULT:

1stname	Surname
Peter	De Decker
Rudy	De Wilde
Bart	Debacker
Mark	Demol
Wim	Dewilde


WORKAROUND:

As a workaround, you add a column calculated from the the one to be sorted with spaces removed (e.g. =SUBSTITUTE(B2," ","") ). However, I find this rather kludgy.

Comment 1 Heiko Tietze 2015-10-26 09:46:40 UTC

Some more additions:

In #95192 the sorting of street numbers is discussed. To generalize I'd ask how to sort with multiple numbers. My suggestion is to use all until the first whitespace (or rather the first stop character)

Text 1/15
Text 2/11
Text 3/9
Text 4/3

With respect to the name there might be also the need to sort independent from the name affix. The German rules define that "von der Lippe" has to be collated at L - affixes "von" plus "der" are not part of the sort order unlike the Flemish "de". So why don't we trust the system settings of collation here? Guess there is something similar to Linux' LC_COLLATE under Windows and Mac Os. On the other hand, how should LibO Calc knows what kind of text is in the cells? Adding more switches makes the ugly dialog even worse. So we should rather let the user calculate per SUBSITUTE() or RIGHT() in order to sort properly.

Comment 2 Heiko Tietze 2020-12-09 09:23:10 UTC

*** Bug 134747 has been marked as a duplicate of this bug. ***

Comment 3 Heiko Tietze 2020-12-09 09:26:12 UTC

"Ignore spaces" sounds very special. But since Writer can do a "Phone Book" search (additional to alphanumeric and numeric) it could be expected as such in Calc too.

Comment 4 Eike Rathke 2020-12-09 20:43:13 UTC

Calc already offers different algorithms if the locale defines such, see for example {de-DE} German (Germany) (or other German), "Alphanumeric" and "Phone book".

The {nl-BE} Dutch (Belgium) locale does not define additional algorithms. I currently don't know if there is a specific phonebook collation order offered for nl-BE in ICU or if the general phonebook could be used.

Is that (ignoring spaces) a thing in the Netherlands as well?
Apart from that I doubt it's a general "ignore spaces" thing but depends on the prefix "De" or "Van de" or such.

Natural Sort is a specific handling of numbers within text and not suited to implement wild alternatives.

Comment 5 foxtrot 2020-12-09 21:02:21 UTC

It's not only for literal phone book sorting or "ignoring spaces" - my issue (134747) was marked as a duplicate of this, and it's actually "only consider letters", as I try to sort various text data, like song titles that also can contain punctuation marks. So there is more than just sorting locale names...

Comment 6 Heiko Tietze 2020-12-10 08:34:01 UTC

(In reply to Eike Rathke from comment #4)
> Calc already offers different algorithms...

Tested the duplicate bug 134747 and sorting in Calc never resulted in this "phone book" order that Writer can do, which is expected by foxtrot.

Comment 7 Eike Rathke 2020-12-10 11:32:03 UTC

I don't know what you tried. If I use the 4 example lines of that bug description copied twice to Writer, selecting each section and sorting with German (Germany) but one Alphanumeric and one Phone book algorithm I get identical results for both:

I want...
I'm gonna...
Ich habe...
In der...

Same in Calc.

I also don't see why there should be any difference because AFAIK the German phone book algorithm is about the position of Umlauts, not blanks or apostrophes.

Comment 8 foxtrot 2020-12-10 17:05:44 UTC

The point is: I never talked about "phone book", My issue was linked to this as a "duplicate". I just would like to sort those titles right. Would you sort them the way the algorithm does? The algorithm would sort it right, if it just considered letters - that's my point.

Comment 9 Heiko Tietze 2020-12-11 07:46:36 UTC

(In reply to Eike Rathke from comment #7)
> I don't know what you tried.
Me neither anymore. When I tried first Calc sorted differently to Writer with PB-mode enabled. Anyway, if this mode would take spaces into consideration I imagine it's a solution to both issues and a valuable enhancement. But feel free to resolve/reopen tickets.

Comment 10 Robert Lacroix 2020-12-13 00:31:00 UTC

I think a more general approach would be to allow regular expression matching to ignore parts of a sort string. Calc is already most of the way there since the Find and Replace feature has regular expression matching as an option. For people who don't know how to use regular expressions, add a modifiable list of common patterns to choose from, similar to what we have for date formats. Ignoring spaces would simply be one of these selectable predefined patterns.