Bug 145853 - Add a few language codes for Slavic auxiliary zonal languages
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Linguistic
Version: unspecified (earliest affected)
Hardware: All All
Importance: medium enhancement
Assignee: Eike Rathke
URL:
Whiteboard: target:7.4.0 target:7.3.0.0.beta2
Keywords:
Depends on:
Blocks:
 
Reported: 2021-11-23 15:51 UTC by Yaroslav Serhieiev
Modified: 2021-12-27 16:56 UTC

Description Yaroslav Serhieiev 2021-11-23 15:51:54 UTC
Description:
First of all, thanks for looking into this!

As a member of the Interslavic (constructed language) volunteer programming team, I’d like to ask whether you could help add a few reserved codes for pan-Slavic languages. According to BCP 47 notation, the best fit for us would be the “sla-Latn” and “sla-Cyrl” language tags (Interslavic has two alphabets: Latin and Cyrillic). That would be optimal for our current use case, since Interslavic does not yet have an assigned ISO 639-3 code, although I'd note it already has a code on Glottolog (https://glottolog.org/resource/languoid/id/inte1263).

The context is the following. We have a Hunspell dictionary in the works (https://github.com/medzuslovjansky/dictionary-hunspell) that is quite close to its first beta release, and we’d also like to ship it as a LibreOffice extension. In the worst case we could probably use one of the Serbian locales, but that does not feel like a good solution: Serbian users of the Interslavic spell checker would be forced to give up their native-language spell checker.

What could you advise in our case?
Again, thanks a lot for your time! We’d be happy to hear back from you!

Steps to Reproduce:
Open the "Language" picker in the Character dialog.

Actual Results:
You don't see "Slavic (Latin)" and "Slavic (Cyrillic)" options.

Expected Results:
I'd like to see "Slavic (Latin)" and "Slavic (Cyrillic)".


Reproducible: Always


User Profile Reset: No



Additional Info:
See https://git.libreoffice.org/core/+/master/include/i18nlangtag/lang.h

I can't find anything resembling generic Slavic language codes there, so I guess this issue has been valid from the very beginning.
Comment 1 Ming Hua 2021-11-25 02:12:03 UTC
(In reply to Yaroslav Serhieiev from comment #0)
> According to the BCP47 notation, we’d fit
> best at “sla-Latn”, “sla-Cyrl” language tags (since Interslavic has two
> alphabets: Latin and Cyrillic).
According to Eike[1], who is the developer in charge of locale stuff, languages defined in extensions but not in LibreOffice's built-in language list would just show up in text / font / character attribution with their BCP47 notation, like {ll} or {ll-CC}.  I assume {lll-Ssss} notations would work as well.

> That would be optimal for our current use
> case, since Interslavic doesn’t have (yet) an assigned ISO-639-3 code
I believe even unofficial BCP47 notation should work, but I'm not sure.  And I think the biggest concern would probably be that you use "sla" now and it gets assigned to another language by ISO later, causing incompatibility issues.

I've added Eike to CC so that he can better answer your questions and correct my mistakes if any.

1. https://bugs.documentfoundation.org/show_bug.cgi?id=135403#c2
Comment 2 Yaroslav Serhieiev 2021-11-25 07:15:32 UTC
Hello, Ming Hua!

Thanks for your reply!

First, I'll address:

> And I think the biggest concern would probably be that you use "sla" now and it gets assigned to another language by ISO later, causing incompatibility issues.

Not exactly. The "sla" ISO code has already been assigned: it is an umbrella code for all the "Slavic languages".

On the other hand, it seems your input is valid. I followed your advice, and indeed I can see my non-standard BCP47 code at the top of the languages list, i.e.: "✓ {{sla-Latn-x-isv}}".

Now I see it is not a blocker at all. But may I proceed with the "improvement" part of my question, please?

Honestly, with my code, "sla-Latn-x-isv" I was hoping to see it rendered in a list as "Slavic languages (Latin) {x-isv}".

Presumably, to satisfy this behavior, there should be two conditions met:
1) LibreOffice source code should have a mapping to interpret "sla" → "Slavic languages" (and sla-Latn → Slavic languages (Latin), respectively)
2) The language picker rendering (populating) logic should be extended to handle private BCP47 code extensions (x-...) in the described way: [interpret primary code] [(interpret script if there is one)] [{print the remaining x-extension}].

If that makes sense to the contributors as well, I could create another issue for (2) and see if I can find anyone who can help to resolve this.

Thanks in advance for your time and hard work on LibreOffice itself!
Comment 3 Eike Rathke 2021-11-25 22:09:16 UTC
Below I'm just copy-pasting my answer from
https://ask.libreoffice.org/t/could-you-add-a-few-language-codes-for-slavic-auxiliary-zonal-languages/70794/3
the question you apparently created as well.

You can use any syntactically valid (!) BCP 47 language tag as document/paragraph/character attribution language. You can simply enter such a tag in the font attribution’s Language combobox or, even better, if you already have a Hunspell dictionary LibreOffice extension, it can announce its supported language tag(s) in the Locales property of dictionaries.xcu, and they will automatically be added to the language list (displayed for example as {sla-Latn} or {sla-Cyrl}). With the upcoming version 7.3, even the language/script/country names are displayed along with such a tag, if known.

Btw, Glottolog IDs are not relevant for BCP 47 IANA language tags. If adding your language to ISO 639-3 turns out to be unsuccessful, you might try to register an IANA language variant subtag if you have a very good case.

Also, to prevent future clashes, I recommend rethinking the use of sla-Latn and sla-Cyrl, because sla is a generic collective code for Slavic languages. Adding a private-use subtag may be better suited, and as you already seem to use x-isv (according to your GitHub page), that would be sla-Latn-x-isv and sla-Cyrl-x-isv. However, x-isv is a private-use subtag: it is not interoperable, and its interpretation by applications depends solely on private agreement.
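
To make the structure of such tags concrete, here is a minimal Python sketch (not LibreOffice code) that splits a tag into its primary language, optional script, and private-use part. The names `TagParts` and `split_tag` are hypothetical, and the rules are deliberately simplified assumptions: the first subtag is taken as the language, a four-letter subtag as the script, and everything after a singleton `x` as private use. That covers the tags discussed here, not the full BCP 47 grammar.

```python
from typing import NamedTuple, Optional


class TagParts(NamedTuple):
    language: str
    script: Optional[str]
    private_use: Optional[str]


def split_tag(tag: str) -> TagParts:
    """Split a simple language tag such as 'sla-Latn-x-isv'.

    Simplified on purpose: regions, variants and extensions are ignored.
    """
    subtags = tag.split("-")
    language = subtags[0].lower()
    script = None
    private_use = None
    for index, subtag in enumerate(subtags[1:], start=1):
        if subtag.lower() == "x":
            # Everything after the singleton 'x' is private use.
            private_use = "-".join(subtags[index + 1:]) or None
            break
        if script is None and len(subtag) == 4 and subtag.isalpha():
            # Script subtags are four letters, conventionally title-cased.
            script = subtag.title()
    return TagParts(language, script, private_use)


print(split_tag("sla-Latn-x-isv"))
print(split_tag("art-Cyrl-x-interslv"))
```

The private-use part is returned separately because, as noted above, its meaning is a matter of agreement between the parties using it rather than something an application can look up.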
Comment 4 Eike Rathke 2021-11-25 23:02:06 UTC
(In reply to Yaroslav Serhieiev from comment #2)
> and indeed I can see my non-standard BCP47 code at the top of the languages
> list, i.e.: "✓ {{sla-Latn-x-isv}}".
That looks wrong, it should be "✓ {sla-Latn-x-isv}" (note the single {} braces). Could it be that you literally defined {sla-Latn-x-isv} in your extension's dictionaries.xcu? It should only be the BCP 47 tag sla-Latn-x-isv without braces.


> Honestly, with my code, "sla-Latn-x-isv" I was hoping to see it rendered in
> a list as "Slavic languages (Latin) {x-isv}".
Starting with LibreOffice 7.3 we rely on the ICU library to provide names for language tags that aren't defined internally by LO. Unfortunately the 'sla' tag seems not to have a proper name there either and 'sla-Latn' is displayed as
"sla (Latin) {sla-Latn}"


> Presumably, to satisfy this behavior, there should be two conditions met:
> 1) LibreOffice source code should have a mapping to interpret "sla" →
> "Slavic languages" (and sla-Latn → Slavic languages (Latin), respectively)
We will certainly not add yet another table of subtag-to-name mappings for hundreds of languages and tags, including translations (the IANA Language Subtag Registry currently has 9079 entries). ICU (International Components for Unicode) provides that with its CLDR (Common Locale Data Repository), and if a subtag name is missing then the best place to add it is there. See https://icu.unicode.org/ and https://cldr.unicode.org/


> 2) The language picker rendering (populating) logic should be extended to
> handle private BCP47 code extensions (x-...) in the described way:
> [interpret primary code] [(interpret script if there is one)] [{print the
> remaining x-extension}].
Seeing that, I tried to enter 'sla-Latn-x-isv', and it is indeed refused because of its private-use subtag. I would even say for good reason, because such tags are not meant to pollute the document space, but I'm not sure whether that is actually the cause here or whether there is another reason.


> If that makes sense to the contributors as well, I could create another
> issue for (2) and see if I can find anyone who can help to resolve this.
I'll investigate, for now we can just keep this bug here.
Comment 5 Eike Rathke 2021-11-26 00:16:09 UTC
So, we disallow entering private-use subtags in the combobox for exactly that reason: to prevent accidental pollution and to protect users from shooting themselves in the foot. This should not be changed.

A dictionary extension, however, can add such a tag.
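
As a rough illustration of that rule (a sketch only, not the actual LibreOffice implementation): `has_private_use` and `accept_tag` are hypothetical names, and the `source` values "combobox" and "extension" are made up for this example. The sketch checks only for a private-use part and does not validate the tag syntax otherwise.

```python
def has_private_use(tag: str) -> bool:
    """Return True if the tag contains a private-use part ('x' singleton)."""
    return "x" in (subtag.lower() for subtag in tag.split("-"))


def accept_tag(tag: str, source: str) -> bool:
    """Accept private-use tags only when a dictionary extension announces them.

    source is either "combobox" (typed by the user) or "extension".
    """
    if source == "extension":
        return True
    return not has_private_use(tag)


print(accept_tag("sla-Latn-x-isv", "combobox"))   # False
print(accept_tag("sla-Latn-x-isv", "extension"))  # True
print(accept_tag("sla-Latn", "combobox"))         # True
```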
Comment 6 Eike Rathke 2021-11-26 20:03:21 UTC
Btw, is this about Interslavic, a "pan-Slavic auxiliary" language?
https://en.wikipedia.org/wiki/Interslavic
https://en.wikipedia.org/wiki/Pan-Slavic_language
The first mentions Glottolog inte1263 and as IETF code art-x-interslv, so that should probably be art-Latn-x-interslv and art-Cyrl-x-interslv instead?

Fwiw, I find the approach of https://www.kreativekorp.com/clcr/, which assigns ISO 639-3 codes from the "reserved for local use" range (qaa-qtz) to constructed languages, a wrong one, and art-x-... better suited.
Comment 7 Yaroslav Serhieiev 2021-11-27 11:18:44 UTC
Hello, Eike Rathke!

Thanks for sharing your insights.

> That looks wrong, it should be "✓ {sla-Latn-x-isv}" (note the single {} braces)

Indeed, the double {} braces were just my inaccurate memory. Here's the real screenshot: https://cdn.discordapp.com/attachments/909783580903890964/913353621452754974/unknown.png


> ICU (International Components for Unicode) provides that with its CLDR (Common Locale Data Repository), and if a subtag name is missing then the best place to add it is there.

Thanks. I've checked the contents of their latest release, and indeed there are no "art" (Artificial) or "sla" (Slavic collective language group) codes inside.

> So, we disallow entering private-use subtags in the combobox... A dictionary extension however can add such tag.

So, is there anything you think can be improved on the LibreOffice source code side, in regards to the third-party dictionary extensions? Or, the other way around, is there any naming convention that an extension can leverage to be rendered as "<unrecognized-iso-code> (Latin) {<private-subtag>}"? I'm trying to understand the next action items from our discussions.

> The first mentions Glottolog inte1263 and as IETF code art-x-interslv, so that should probably be art-Latn-x-interslv and art-Cyrl-x-interslv instead?

Hmm, I don't find the ConLang Code Registry (https://www.kreativekorp.com/clcr/) to be the most authoritative source here, because nowhere on their site do they distinguish between constructed and planned (semi-natural) languages, although they might be right. I am not sure, and this doubt made me post a question on SO today: https://stackoverflow.com/questions/70134064/are-bcp-47-collective-language-code-more-suitable-for-zonal-auxiliary-languages

If I am lucky, maybe Doug Ewell (related to SIL) might look into this.

Getting back to the technical side of the question, won't it be the same? Using "art" or "sla" language code, we'll anyway (probably) be getting the data strings from here:

https://github.com/unicode-org/icu/blob/49dda34fb175240a7724c7e039a270126ff7d900/icu4c/source/data/lang/en.txt

If you search the file, you'll find no mention of either the "art" or the "sla" code. I mean, I don't see any tangible difference if I follow the "art-" suggestion.

I'm very grateful for your participation in this issue, and I'd be happy to know what else can be done on your side or mine.

Best regards,
Yaroslav.
Comment 8 Eike Rathke 2021-11-29 13:30:54 UTC
(In reply to Yaroslav Serhieiev from comment #7)
> So, is there anything you think can be improved on the LibreOffice source
> code side, in regards to the third-party dictionary extensions? Or, the
> other way around, is there any naming convention that an extension can
> leverage to be rendered as "<unrecognized-iso-code> (Latin)
> {<private-subtag>}"? I'm trying to understand the next action items from our
> discussions.
I don't quite get what you're asking here. Tags that are not predefined are rendered as I outlined: we ask ICU for a display string and append " {language-tag}" for clarity, whatever ICU gives there. The only thing we could do is add yet another two tags to the predefined languages list to get proper "Slavic auxiliary (whatever)" language list entries, which almost no one will use, and the list is already overpopulated. But that requires agreeing on the final tag anyway, so either 'sla-Latn-x-isv' or 'art-Latn-x-interslv' or whatever.


> > The first mentions Glottolog inte1263 and as IETF code art-x-interslv, so that should probably be art-Latn-x-interslv and art-Cyrl-x-interslv instead?
> 
> Hmm, I don't find the ConLang Code Registry's
> (https://www.kreativekorp.com/clcr/) to be the most authoritative source
> here, because they nowhere on their site distinguish between constructed and
> planned (semi-natural) languages here, although they might be right.
Of course they are not an authoritative source, but there seems to be some agreement there on using art-x-interslv, I don't know. So using that seems better to me than sla-x-isv, which no other application would understand.

> I am
> not sure, and this doubt made me post this question today on SO:
> https://stackoverflow.com/questions/70134064/are-bcp-47-collective-language-
> code-more-suitable-for-zonal-auxiliary-languages
Let's see if that leads anywhere..


> Getting back to the technical side of the question, won't it be the same?
> Using "art" or "sla" language code, we'll anyway (probably) be getting the
> data strings from here:
> 
> > 
> https://github.com/unicode-org/icu/blob/
> 49dda34fb175240a7724c7e039a270126ff7d900/icu4c/source/data/lang/en.txt
Yes, but if LibreOffice defines the language list entries we can name the display strings anything we like, independent of ICU. There just needs to be a decision on which language tag to use. (That applies even more when relying only on the entries generated from the dictionary extension, because once a tag is in the wild in documents it needs to stick.)
Comment 9 Yaroslav Serhieiev 2021-11-29 13:35:06 UTC
(In reply to Eike Rathke from comment #8)

>  The only thing we could do is adding yet two other tags to the predefined languages list...

Probably this is what I wanted to clarify.

> Let's see if that leads anywhere...

I had to repost it to IETF directly because the question is not simple.

https://mailarchive.ietf.org/arch/browse/ietf-languages/

So, let's wait for some input from there before we make any further decisions.

Thanks for your participation!
Comment 10 Yaroslav Serhieiev 2021-12-08 16:25:35 UTC
So, the final conclusion was to use the "art-x-interslv" code (plus "art-Cyrl-x-interslv").

I'll ask one more time (so we can decide whether to close this issue): is there any chance we may see artificial languages rendered as "Artificial {x-extname}" in the language picker, or "ain't gonna happen", so to say?

Thanks in advance!
Comment 11 Eike Rathke 2021-12-15 17:20:53 UTC
As explained earlier, LibreOffice provides UI names (and their translations) for predefined known language tags in the language list, not all arbitrary language codes (for which it asks ICU, so if even that doesn't provide a name for the 'art' tag then there will be no specific UI name other than the tag itself).

What we can do is add two entries to the list of known languages, like
art-x-interslv: Interslavic Latin
art-Cyrl-x-interslv: Interslavic Cyrillic

But, are you sure you want to use art-x-interslv with an implicit but suppressed script Latn instead of explicitly tagging it art-Latn-x-interslv?
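
The lookup order explained above can be sketched roughly as follows. This is an illustration only, not LibreOffice or ICU code: the `PREDEFINED` and `SCRIPT_NAMES` tables and the function `display_name` are hypothetical, the small script table merely stands in for what ICU would supply, and the fallback string imitates the format of the examples quoted in this thread. That predefined entries are shown by name alone is also an assumption.

```python
# Hypothetical table of predefined UI names (entries as proposed above).
PREDEFINED = {
    "art-x-interslv": "Interslavic Latin",
    "art-Cyrl-x-interslv": "Interslavic Cyrillic",
}

# Hypothetical stand-in for the script names ICU would supply.
SCRIPT_NAMES = {"Latn": "Latin", "Cyrl": "Cyrillic"}


def display_name(tag: str) -> str:
    """Return a UI string: predefined name first, generated fallback otherwise."""
    if tag in PREDEFINED:
        return PREDEFINED[tag]
    language, *rest = tag.split("-")
    details = []
    private_use = []
    in_private_use = False
    for subtag in rest:
        if in_private_use:
            private_use.append(subtag)
        elif subtag == "x":
            in_private_use = True
        elif subtag in SCRIPT_NAMES:
            details.append(SCRIPT_NAMES[subtag])
    if private_use:
        details.append("Private-Use=" + "-".join(private_use))
    # No name is known for the language, so the code itself is shown.
    name = language
    if details:
        name += " (" + ", ".join(details) + ")"
    # Append the raw tag in braces for clarity.
    return name + " {" + tag + "}"


print(display_name("art-Cyrl-x-interslv"))  # Interslavic Cyrillic
print(display_name("sla-Latn"))             # sla (Latin) {sla-Latn}
```

With such a scheme, a tag that is missing from the predefined table still gets a usable, if unfriendly, list entry, which is why the choice of the final tag matters more than the display string.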
Comment 12 Commit Notification 2021-12-18 00:41:54 UTC
Eike Rathke committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/430a6fea4012752eed0c61bff4936e9c366aa750

Resolves: tdf#145853 Add Interslavic Latin|Cyrillic to language list

It will be available in 7.4.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 13 Commit Notification 2021-12-18 01:37:35 UTC
Eike Rathke committed a patch related to this issue.
It has been pushed to "libreoffice-7-3":

https://git.libreoffice.org/core/commit/c25067e46d7d849a584295de365e32c6c7af11bf

Resolves: tdf#145853 Add Interslavic Latin|Cyrillic to language list

It will be available in 7.3.0.0.beta2.

Comment 14 Yaroslav Serhieiev 2021-12-18 05:48:14 UTC
Wow, thanks a lot for adding the codes directly! Huge thanks from our volunteers team!

As for the idea of adding an explicit `art-Latn-x-interslv` code: it sounds reasonable, since the official site of the Interslavic language states:

> Both orthographies are equal, and in published texts, it deserves recommendation to provide versions in both Latin and Cyrillic, so that they can be understood on both sides of the frontier.

If it is not too late, I'd say that it is indeed better to add an explicit Latin script subtag. Thanks in advance!
Comment 15 Yaroslav Serhieiev 2021-12-18 05:58:29 UTC
P.S. Two more clarifications:
1. I did not mean adding another tag, but rather changing art-x-interslv to art-Latn-x-interslv. Maybe I did not phrase it well, so I am repeating it to be on the safe side.
2. Would you need help with providing a list of translations for the entry? I mean like:

en Interslavic
ru Межславянский язык
pl Język międzysłowiański
uk Міжслов'янська мова
sr Међусловенски језик
cs Mezislovanština
sh Međuslovenski jezik
bg Междуславянски език
sk Medzislovančina
hr Međuslavenski jezik
be Міжславянская мова
sl Medslovanščina
mk Меѓусловенски јазик
hsb Mjezysłowjanšćina
rue Міджіславяньскый язык
dsb Mjazysłowjańšćina
cu Мєждоусловѣньскъ ѩꙁꙑкъ
Comment 16 Eike Rathke 2021-12-18 11:59:47 UTC
Translations are out of scope of this RFE, they are done by the translation teams at https://translations.documentfoundation.org/
The UI string freeze for 7-3 next week due to translation handling was actually the reason I committed the change *now*.

I can change the underlying language tag for Latin to art-Latn-x-interslv.
Comment 17 Yaroslav Serhieiev 2021-12-18 12:11:11 UTC
Okay, I see! Thanks a lot for thinking ahead!

So, yeah, whenever you can, please update the tag with that -Latn script subtag.

In a few days, when the beta is out, we'll test our spellchecker add-on for Interslavic again and let you know whether everything works as intended.
Comment 18 Commit Notification 2021-12-18 14:42:13 UTC
Eike Rathke committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/6e91ff7b57a231ca34f619a40297cf6ef1904ea2

Change Interslavic Latin tag to {art-Latn-x-interslv}, tdf#145853 follow-up

It will be available in 7.4.0.

Comment 19 Commit Notification 2021-12-18 16:46:01 UTC
Eike Rathke committed a patch related to this issue.
It has been pushed to "libreoffice-7-3":

https://git.libreoffice.org/core/commit/aab20b3cc8c2379d609531b40b21cb4faece70d0

Change Interslavic Latin tag to {art-Latn-x-interslv}, tdf#145853 follow-up

It will be available in 7.3.0.0.beta2.

Comment 20 Yaroslav Serhieiev 2021-12-22 15:38:25 UTC
Hi again!

So, I've checked how it looks now:

* art (Cyrillic, Private-Use=interslv) {art-Cyrl-x-interslv}
* art (Latin, Private-Use=interslv) {art-Latn-x-interslv}

Is it supposed to look like this, or do we need to wait for the translations?

Thanks in advance!
Comment 21 Eike Rathke 2021-12-27 16:56:21 UTC
You checked in a release (or another build that doesn't have the change). You need to check in a recent master build or upcoming 7.3.0.0.beta2 build as indicated in the commit notifications above.