Bug 92015 - wrong language detection if style:rfc-language-tag is malused
Status: VERIFIED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 5.0.0.0.beta1 (earliest affected)
Hardware: Other All
Importance: medium minor
Assignee: Eike Rathke
URL:
Whiteboard: target:7.3.0
Keywords:
Depends on:
Blocks: Language-Detection
Reported: 2015-06-11 19:47 UTC by Pablo Rodríguez
Modified: 2021-10-22 10:04 UTC

See Also:
Crash report or crash signature:


Attachments
ODT file with style:rfc-language-tag="grc" (8.19 KB, application/vnd.oasis.opendocument.text)
2015-06-11 19:47 UTC, Pablo Rodríguez

Description Pablo Rodríguez 2015-06-11 19:47:18 UTC
Created attachment 116471 [details]
ODT file with style:rfc-language-tag="grc"

Hi there,

I have just realized that style:rfc-language-tag values that don’t include a country code aren’t properly recognized by Writer in LibreOffice 5.0beta1.

The attached file includes style:rfc-language-tag="grc" (language is detected as "grc" instead of "Greek, Ancient").

There are a bunch of other languages that shouldn’t require a country to be properly recognized. In fact, these are all the languages listed in the language dialog without a country in parentheses.

A very partial list of these languages: Albanian, Armenian, Basque, Bulgarian, Catalan, Coptic, Croatian, Czech, Danish, Finnish, Greek, Irish and Hungarian.

As an additional comment, I think it would be useful to map language-only codes to a value with a country code.

This would assign de-DE to de, es-ES to es, fr-FR to fr, it-IT to it, pt-PT to pt, ro-RO to ro, sv-SE to sv. (The previous list is by no means complete.)
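
A minimal sketch of the fallback suggested above, assuming a plain lookup table. The names LangDefault, kLangDefaults and withDefaultCountry are hypothetical and not part of LibreOffice, and only the pairs listed above are included.

```cpp
#include <string_view>

// Hypothetical fallback entry: language-only code -> default language-country tag.
struct LangDefault
{
    std::string_view lang;
    std::string_view tag;
};

// Only the pairs named in the comment above; by no means a complete list.
constexpr LangDefault kLangDefaults[] = {
    { "de", "de-DE" }, { "es", "es-ES" }, { "fr", "fr-FR" }, { "it", "it-IT" },
    { "pt", "pt-PT" }, { "ro", "ro-RO" }, { "sv", "sv-SE" },
};

// Returns the default language-country tag for a language-only code,
// or the input unchanged when the table has no entry for it.
constexpr std::string_view withDefaultCountry(std::string_view lang)
{
    for (const LangDefault& entry : kLangDefaults)
        if (entry.lang == lang)
            return entry.tag;
    return lang;
}
```

With this table, "sv" resolves to "sv-SE", while a code without an entry, such as "grc", comes back unchanged.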

Many thanks for your help,


Pablo
Comment 1 Julien Nabet 2015-06-12 08:40:08 UTC
On which LO version are you + which env (Linux, MacOs, Windows)?
Comment 2 Pablo Rodríguez 2015-06-12 13:59:39 UTC
(In reply to Julien Nabet from comment #1)
> On which LO version are you + which env (Linux, MacOs, Windows)?

Sorry, version is 5.0beta 1 and I’m on Linux (Fedora 20).
Comment 3 Pablo Rodríguez 2015-06-12 14:17:06 UTC
Julien,

the issue still happens with version 5.0 beta 3 on 32-bit Linux (I haven’t tested any other OS).
Comment 4 Julien Nabet 2015-06-12 14:54:24 UTC
Thank you for your feedback, I put it back to UNCONFIRMED.
Comment 5 Buovjaga 2015-06-13 15:37:56 UTC
(In reply to Pablo Rodríguez from comment #0)
> Created attachment 116471 [details]
> ODT file with style:rfc-language-tag="grc"

Confirmed language is detected as grc.

Win 7 Pro 64-bit Version: 5.1.0.0.alpha1+
Build ID: d56b125f6c6c18ac40712cefc3cec06530750e15
TinderBox: Win-x86@39, Branch:master, Time: 2015-06-13_07:08:43
Locale: fi-FI (fi_FI)
Comment 6 QA Administrators 2016-09-20 10:02:09 UTC Comment hidden (obsolete)
Comment 7 Pablo Rodríguez 2016-09-20 19:41:30 UTC Comment hidden (obsolete)
Comment 8 Xisco Faulí 2017-09-29 08:53:49 UTC Comment hidden (obsolete)
Comment 9 Urmas 2017-10-22 11:40:17 UTC Comment hidden (obsolete)
Comment 10 QA Administrators 2018-10-23 02:48:37 UTC Comment hidden (obsolete)
Comment 11 QA Administrators 2020-10-23 04:15:26 UTC Comment hidden (obsolete)
Comment 12 Julien Nabet 2021-10-21 15:50:28 UTC
On pc Debian x86-64 with master sources updated today, I could reproduce this.

I noticed this log several times on console:
warn:svtools.misc:1258302:1258302:svtools/source/misc/langtab.cxx:239: Language: 0x249 with unknown name, so returning lang-tag of: qlt (Greece) {grc}
Comment 13 Julien Nabet 2021-10-21 15:55:26 UTC
Part of the backtrace for the first console log quoted in my previous comment:
#0  (anonymous namespace)::SvtLanguageTableImpl::GetString(o3tl::strong_int<unsigned short, LanguageTypeTag>) const
    (this=0x7f702b279d40 <rtl::Static<(anonymous namespace)::SvtLanguageTableImpl, (anonymous namespace)::theLanguageTable>::get()::instance>, eType=...) at svtools/source/misc/langtab.cxx:236
#1  0x00007f702af36d95 in SvtLanguageTable::GetLanguageString(o3tl::strong_int<unsigned short, LanguageTypeTag>) (eType=...) at svtools/source/misc/langtab.cxx:251
#2  0x00007f7014472760 in SwTextShell::GetState(SfxItemSet&) (this=0x701de20, rSet=SfxItemSet of pool 0x76706b0 with parent 0x0 and Which ranges: [(11209, 11209)] = {...})
    at sw/source/uibase/shells/textsh1.cxx:1663
#3  0x00007f7014460340 in SfxStubSwTextShellGetState(SfxShell*, SfxItemSet&) (pShell=0x701de20, rSet=SfxItemSet of pool 0x76706b0 with parent 0x0 and Which ranges: [(11209, 11209)] = {...})
    at workdir/SdiTarget/sw/sdi/swslots.hxx:3086
#4  0x00007f702d98d12a in SfxShell::GetSlotState(unsigned short, SfxInterface const*, SfxItemSet*) (this=0x701de20, nSlotId=11209, pIF=0x3c3ae10, pStateSet=0x0) at sfx2/source/control/shell.cxx:493
#5  0x00007f702d92ce53 in SfxDispatcher::QueryState(unsigned short, com::sun::star::uno::Any&) (this=0x926adb0, nSID=11209, rAny=uno::Any(void)) at sfx2/source/control/dispatch.cxx:1862
#6  0x00007f702da176d6 in SfxDispatchController_Impl::addStatusListener(com::sun::star::uno::Reference<com::sun::star::frame::XStatusListener> const&, com::sun::star::util::URL const&) (this=
    0x8d90300, aListener=uno::Reference to (framework::LanguageSelectionMenuController *) 0x7857758, aURL=...) at sfx2/source/control/unoctitm.cxx:760
#7  0x00007f702da17561 in SfxOfficeDispatch::addStatusListener(com::sun::star::uno::Reference<com::sun::star::frame::XStatusListener> const&, com::sun::star::util::URL const&)
    (this=0x8d90260, aListener=uno::Reference to (framework::LanguageSelectionMenuController *) 0x7857758, aURL=...) at sfx2/source/control/unoctitm.cxx:279
#8  0x00007f702eea1766 in framework::LanguageSelectionMenuController::updatePopupMenu() (this=0x78576f0) at framework/source/uielement/langselectionmenucontroller.cxx:254
#9  0x00007f702aff5333 in svt::PopupMenuControllerBase::setPopupMenu(com::sun::star::uno::Reference<com::sun::star::awt::XPopupMenu> const&) (this=0x78576f0, xPopupMenu=
    uno::Reference to (VCLXPopupMenu *) 0x85e64f8) at svtools/source/uno/popupmenucontrollerbase.cxx:356
Comment 14 Julien Nabet 2021-10-21 16:29:41 UTC
Eike: I noticed this line:
include/i18nlangtag/lang.h:630:#define LANGUAGE_USER_ANCIENT_GREEK         LanguageType(0x0649)

But I found no reference confirming that Ancient Greek corresponds to 0x0649.
Debugging in gdb shows the value in decimal as 585, i.e. 0x0249.

I wanted to try this patch:
diff --git a/include/i18nlangtag/lang.h b/include/i18nlangtag/lang.h
index ae434fd0c06a..c6a69e59fb68 100644
--- a/include/i18nlangtag/lang.h
+++ b/include/i18nlangtag/lang.h
@@ -627,7 +627,7 @@ namespace o3tl
 #define LANGUAGE_USER_ARABIC_PALESTINE      LanguageType(0x9801)  /* makeLangID( 0x26, getPrimaryLanguage( LANGUAGE_ARABIC_SAUDI_ARABIA)) */
 #define LANGUAGE_USER_ARABIC_SOMALIA        LanguageType(0x9C01)  /* makeLangID( 0x27, getPrimaryLanguage( LANGUAGE_ARABIC_SAUDI_ARABIA)) */
 #define LANGUAGE_USER_ARABIC_SUDAN          LanguageType(0xA001)  /* makeLangID( 0x28, getPrimaryLanguage( LANGUAGE_ARABIC_SAUDI_ARABIA)) */
-#define LANGUAGE_USER_ANCIENT_GREEK         LanguageType(0x0649)
+#define LANGUAGE_USER_ANCIENT_GREEK         LanguageType(0x0249)
 #define LANGUAGE_USER_ASTURIAN              LanguageType(0x064A)
 #define LANGUAGE_USER_LATGALIAN             LanguageType(0x064B)
 #define LANGUAGE_USER_MAORE                 LanguageType(0x064C)

but it rebuilds a lot of things

Or perhaps should we declare LANGUAGE_USER_ANCIENT_GREEK_2?

Also, I just wonder whether "649" might be a typo for "249", and so whether any real documents exist that use 649 for Ancient Greek.
Comment 15 Eike Rathke 2021-10-21 17:42:52 UTC
As the USER in LANGUAGE_USER_ANCIENT_GREEK denotes, it is a user-defined LCID, hence 0x0649, which comes from the set reserved for user-defined LCIDs. The 0x0249 is an on-the-fly generated LCID for a language tag {grc} that is not present in our mappings. That value decomposes into primary 0x249 and sub 0x00, whereas the {grc-GR} value 0x0649 decomposes into primary 0x249 and sub 0x01, because on-the-fly LCIDs for a language-only tag try to derive from a known language-country LCID. Changing the value in lang.h would be wrong.
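
The decomposition can be checked with a small sketch. It assumes the Windows LANGID layout (low 10 bits for the primary language, high 6 bits for the sublanguage); primaryOf, subOf and makeLcid are hypothetical stand-alone helpers, not the i18nlangtag API.

```cpp
#include <cstdint>

// Hypothetical helpers. Assumed layout, as in a Windows LANGID:
// low 10 bits = primary language, high 6 bits = sublanguage.
constexpr std::uint16_t primaryOf(std::uint16_t lcid)
{
    return static_cast<std::uint16_t>(lcid & 0x03FF);
}

constexpr std::uint16_t subOf(std::uint16_t lcid)
{
    return static_cast<std::uint16_t>(lcid >> 10);
}

constexpr std::uint16_t makeLcid(std::uint16_t sub, std::uint16_t primary)
{
    return static_cast<std::uint16_t>((sub << 10) | primary);
}

// On-the-fly {grc}: primary 0x249, sub 0x00 (585 in decimal, as gdb shows it).
static_assert(0x0249 == 585, "hex/decimal check");
static_assert(primaryOf(0x0249) == 0x249 && subOf(0x0249) == 0x00, "{grc}");

// User-defined {grc-GR}: same primary, sub 0x01.
static_assert(primaryOf(0x0649) == 0x249 && subOf(0x0649) == 0x01, "{grc-GR}");
static_assert(makeLcid(0x01, 0x249) == 0x0649, "recompose {grc-GR}");
```

Both values share the primary 0x249 and differ only in the sublanguage bits, so 0x0249 and 0x0649 are two distinct, consistent LCIDs rather than a typo of one another.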

There's also nothing wrong with generating an on-the-fly LCID for {grc} if there is no predefined 1:1 mapping. The resulting language listbox entry "qlt (Greece) {grc}" looks odd though: qlt is our reserved internal-use code for an extended language tag when expressed in a lang::Locale, but {grc} wouldn't need one.

The question is rather why the document has stored style:rfc-language-tag="grc" at all. The fo:language="grc" fo:country="GR" attributes that are also present would be sufficient (and would actually result in the known {grc-GR} 0x0649 mapping), so the style:rfc-language-tag is not needed. We probably create an extended tag whenever it is present, on purpose, because normally fo:language and fo:country would only be a subset of it; in this case, however, the style:rfc-language-tag is the subset. That is even a violation of ODF 19.516 style:rfc-language-tag: "It shall only be used if its value cannot be expressed as a valid combination of the fo:language 19.871, fo:script 19.242 and fo:country 19.234 attributes".
https://docs.oasis-open.org/office/OpenDocument/v1.3/os/part3-schema/OpenDocument-v1.3-os-part3-schema.html#__RefHeading__1418006_253892949
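
A rough sketch of how such a redundant tag could be detected. This is not the committed patch; rfcTagAddsNothing is a hypothetical helper that assumes case-sensitive comparison and does not check subtag order. It reports whether every subtag of the rfc-language-tag is already given by fo:language, fo:script or fo:country.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical check: true when the rfc-language-tag starts with fo:language
// and every further subtag equals fo:script or fo:country, i.e. the tag
// carries nothing beyond the fo:* attributes.
constexpr bool rfcTagAddsNothing(std::string_view rfcTag, std::string_view language,
                                 std::string_view script, std::string_view country)
{
    if (rfcTag.empty() || language.empty())
        return false;
    bool first = true;
    while (true)
    {
        const std::size_t dash = rfcTag.find('-');
        const std::string_view subtag = rfcTag.substr(0, dash);
        const bool known = first ? (subtag == language)
                                 : (subtag == script || subtag == country);
        if (subtag.empty() || !known)
            return false;
        if (dash == std::string_view::npos)
            return true; // last subtag reached, all were covered
        rfcTag.remove_prefix(dash + 1);
        first = false;
    }
}
```

For the attached document, rfcTagAddsNothing("grc", "grc", "", "GR") is true, so a loader could fall back to fo:language and fo:country. A tag with a variant subtag, such as "ca-ES-valencia" next to fo:language="ca" fo:country="ES", yields false and still needs the extended tag.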

Seeing the document was stored using LibreOfficeDev/5.0.0.0.beta1, I don't really care, but I'll take a look at what actually happens when loading.
Comment 16 Commit Notification 2021-10-22 02:03:28 UTC
Eike Rathke committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/cb56ec6aa8f03fbc70c808bd4f519ce9d3c21f7d

Resolves: tdf#92015 Handle malused *:rfc-language-tag ODF violation

It will be available in 7.3.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 17 Julien Nabet 2021-10-22 10:04:55 UTC
On pc Debian x86-64 with master sources updated today, I confirm that I can no longer reproduce the problem and get "Greek, Ancient" at the bottom of the Writer window.

Thank you Eike for the very quick fix! :-)