Created attachment 202635: An example showing the bug
In a spreadsheet, I copy/paste these three strings in the correct Tibetan / Dzongkha alphabetical order: ཀ་ སྐ་ ས་
But when I set the document's language to Dzongkha or Tibetan and sort the data in ascending order, the result is: ཀ་ ས་ སྐ་ (see the attached example).
CLDR has collation rules for Tibetan and Dzongkha: https://github.com/unicode-org/cldr/blob/main/common/collation/ (bo.xml and dz.xml)
LibreOffice has collation rules for Dzongkha: https://github.com/LibreOffice/core/blob/4efd03d69ac7f6ae463aa56cea6f0e80f289f6e3/i18npool/source/collator/data/dz_charset.txt
glibc has also implemented the rules: https://sourceware.org/bugzilla/show_bug.cgi?id=21547
I don't think I can help here. This is more something for Eike or Jonathan. I don't know whether the expected order is the correct one. Excel and OnlyOffice sort the data in the same order as LibreOffice.
I can answer any question on the order if needed. I think there's only one peer-reviewed paper about the Tibetan alphabetical order (although more focused on the historical aspect): https://d1i1jdw69xsqx0.cloudfront.net/digitalhimalaya/collections/journals/ret/pdf/ret_63_02.pdf
(In reply to Elie Roux from comment #2)
> I can answer any question on the order if needed.
I don't think we have any Tibetan script expert here on Bugzilla, so I'll ask a few questions that may seem obvious to you but are hard for me as a non-user. I hope your answers will help other QA people and developers as well.
You listed three sets of collation rules in comment #0:
> CLDR has collation rules for Tibetan and Dzongkha:
> https://github.com/unicode-org/cldr/blob/main/common/collation/ (bo.xml and dz.xml)
>
> LibreOffice has collation rules for Dzongkha:
> https://github.com/LibreOffice/core/blob/4efd03d69ac7f6ae463aa56cea6f0e80f289f6e3/i18npool/source/collator/data/dz_charset.txt
>
> The GLibC also has implemented the rules:
> https://sourceware.org/bugzilla/show_bug.cgi?id=21547
Are the character orders in them correct from your perspective? Do they give the same collation orders?
Regarding the three strings in your example (I added their Unicode code points):
ཀ་ (U+0F40 U+0F0B)
སྐ་ (U+0F66 U+0F90 U+0F0B)
ས་ (U+0F66 U+0F0B)
I see that both ཀ and ས are in the 30-consonant list for Tibetan (I am Chinese, so it's easy for me to search for information about Tibetan; if Dzongkha is somehow different, let me know), but སྐ is not on the list and involves the character U+0F90 (SUBJOINED LETTER KA). Is there some general sorting rule for strings with subjoined letters?
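For anyone who wants to check the code points listed above, here is a minimal Python sketch using only the standard `unicodedata` module. The helper name `describe` is made up for illustration. It also shows that a plain code point comparison puts U+0F66 U+0F0B before U+0F66 U+0F90 U+0F0B, which happens to match the order reported in comment #0; this does not establish which collator LibreOffice actually applies.

```python
import unicodedata

# The three strings from comment #0, written as escapes so the
# code points are unambiguous.
strings = [
    "\u0F40\u0F0B",        # KA + tsheg
    "\u0F66\u0F90\u0F0B",  # SA + SUBJOINED KA + tsheg
    "\u0F66\u0F0B",        # SA + tsheg
]

def describe(text):
    """Return 'U+XXXX NAME' for every character in text (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

for s in strings:
    print(describe(s))

# Python's default string comparison is by code point, with no tailoring.
codepoint_order = sorted(strings)
print([describe(s) for s in codepoint_order])
```

Because the tsheg (U+0F0B) has a lower code point than the subjoined KA (U+0F90), the untailored comparison places the bare SA syllable ahead of the stacked one.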
Thanks for your questions!
> Are the character orders in them correct from your perspective? Do they give the same collation orders?
The rules for Dzongkha are very slightly outdated. The most up-to-date collation rules are: https://github.com/unicode-org/cldr/blob/main/common/collation/bo.xml
But the differences only show up in very niche cases, not in the very simple example I gave.
> Is there some general sorting rule for strings with subjoined letters?
In Tibetan there is the idea of a root letter, which is the primary letter to sort on. In the example of སྐ, the root letter is ཀ, and the superscript letter is ས. སྐ is thus sorted with ཀ, not ས.
A few resources:
- https://web.archive.org/web/20220709105007/http://www.dit.gov.bt/sites/default/files/Collation_in_Dzongkha.pdf
- https://download.mimer.com/pub/developer/charts/Chilton_slides.pdf
- https://download.mimer.com/pub/developer/charts/tibetan.htm
- http://cjc.ict.ac.cn/eng/qwjse/view.asp?id=1502
- https://doi.org/10.5070/H917135529
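To make the root-letter idea concrete, here is a rough Python sketch of a sort key. It is not the CLDR or LibreOffice rule set and covers far less: it only looks at the first two characters of a syllable, assumes that a head letter ར, ལ or ས followed by a subjoined consonant (other than subjoined wa, ya, ra, la) is a superscript over the root, and falls back to code point order for the root letters. Prefixes, suffixes, vowels and special cases are ignored. The names `root_letter_key`, `SUPERSCRIPTS` and `SUBSCRIPTS` are made up for illustration.

```python
# Simplified sketch of root-letter sorting; NOT the real tailoring rules.
SUPERSCRIPTS = {"\u0F62", "\u0F63", "\u0F66"}           # RA, LA, SA as head letters
SUBSCRIPTS = {"\u0FAD", "\u0FB1", "\u0FB2", "\u0FB3"}   # subjoined WA, YA, RA, LA
SUBJOINED_OFFSET = 0x50  # U+0F90 (SUBJOINED KA) minus U+0F40 (KA)

def root_letter_key(syllable):
    """Sort key: (root letter, has superscript, superscript, original text)."""
    head = syllable[0]
    second = syllable[1] if len(syllable) > 1 else ""
    is_subjoined = second != "" and "\u0F90" <= second <= "\u0FB8"
    if head in SUPERSCRIPTS and is_subjoined and second not in SUBSCRIPTS:
        # Assumption: the subjoined consonant is the root, the head is a superscript.
        root = chr(ord(second) - SUBJOINED_OFFSET)
        return (root, 1, head, syllable)
    # Otherwise treat the first letter as the root.
    return (head, 0, "", syllable)

# The order LibreOffice produced in comment #0: KA, SA, SA+subjoined KA.
words = ["\u0F40\u0F0B", "\u0F66\u0F0B", "\u0F66\u0F90\u0F0B"]
ordered = sorted(words, key=root_letter_key)
print([ascii(w) for w in ordered])
```

With this key the stacked syllable is grouped under ཀ, after the bare ཀ་ and before ས་, which is the order given as correct in comment #0.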
I found an answer from Eike to a similar question on Ask: https://ask.libreoffice.org/t/calc-data-sorting-not-following-ascii-unicode-order/87503/12 The folder mentioned there for the zh-TW "tailoring" contains a file for dz too.
I have put Eike in CC. He can likely tell whether it is a bug in LibreOffice or not.
Yes, one way to test these tailoring / collation rules is to use the online ICU app: https://icu4c-demos.unicode.org/icu-bin/collation.html If you select "bo" in the list in the top left corner (which is set to "und" by default), you will see that it sorts ཀ སྐ ས in that same order. I'll add a screenshot as an attachment.
Created attachment 202669: screenshot of ICU app for Tibetan
I get the result you expect when I use the language "Tibetan (PR China)" instead of "Dzongkha". Can you please test that language setting?
Ah yes, that works, thanks!