Bug 137742 - Google Docs exports only (ambiguous) "en" language tag text attribute
Summary: Google Docs exports only (ambiguous) "en" language tag text attribute
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 7.0.2.2 release
Hardware: All; OS: All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: bibisected, bisected
Duplicates: 132396 136808 136809 137635 137743
Depends on:
Blocks:
 
Reported: 2020-10-25 17:21 UTC by Larry Tate
Modified: 2021-04-28 16:19 UTC (History)
CC List: 8 users

See Also:
Crash report or crash signature:


Attachments
Doc that produces the issue I described with language settings. (6.99 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2020-10-25 18:37 UTC, Larry Tate
LoremIpsum created using Google Docs shows issue (6.76 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2020-10-26 10:56 UTC, Uwe Auer
Recovered document issuing "Missing hyphenation data" (8.92 KB, application/vnd.oasis.opendocument.text)
2021-03-31 09:29 UTC, ajlittoz

Description Larry Tate 2020-10-25 17:21:39 UTC
Description:
Documents I receive from students show {en} as the text language. I have gone into Tools > Options > Language Settings > Languages and checked that English (USA) is the selection in User Interface, Locale Setting, And Default Languages of Documents. However, the setting does not take effect. Documents I receive from students instead appear as having {en} as the language. 

Steps to Reproduce:
1. Go to Tools > Options > Language Settings > Languages and check that English (USA) is the selection in User Interface, Locale Setting, And Default Languages of Documents. 

2. Open a document, discover in sadness that the setting is not taking effect.


Actual Results:
The setting is not forcing documents to use the desired language selection for documents. This requires you to select the entire document, apply the setting, and then continue with your work. Annoying. 

Expected Results:
This should force all documents to use the selected language.


Reproducible: Always


User Profile Reset: Yes



Additional Info:
This should force all documents to use the selected language.
Comment 1 Julien Nabet 2020-10-25 17:34:17 UTC
*** Bug 137743 has been marked as a duplicate of this bug. ***
Comment 2 Mike Kaganski 2020-10-25 18:19:29 UTC
(In reply to Larry Tate from comment #0)
> Description:
> Documents I receive from students show {en} as the text language. I have
> gone into Tools > Options > Language Settings > Languages and checked that
> English (USA) is the selection in User Interface, Locale Setting, And
> Default Languages of Documents. However, the setting does not take effect.
> Documents I receive from students instead appear as having {en} as the
> language. 
> ...
> This should force all documents to use the selected language.

No.
This setting has no relation to what you see when you open *existing* documents; it applies only to new documents, or to new text that you enter in documents that don't define a language themselves.

Normally all current complex document formats, like ODF or OOXML (or older binary formats of MSO), include the information about language used when creating them. And *that* information must be used when you open those documents, not something that you define as default for new documents.

So this is not a bug. However, the question is why you see {en}, and not something like "English (UK)" or "English (South Africa)" and so on. The question is whether this is something wrong in LibreOffice at the opening/import stage, or something that happens when the files are saved (what format are the documents? which application was used to create them?).

You may want to attach a sample document that shows the problem. Then possibly the report may be confirmed, and converted to address the actual bug.
Comment 3 Larry Tate 2020-10-25 18:37:02 UTC
Created attachment 166703 [details]
Doc that produces the issue I described with language settings.
Comment 4 Larry Tate 2020-10-25 18:39:49 UTC
Thank you for that clarification. I am attaching a representative document where the issue appears. These are student essays, so I had to remove virtually all of the text in the document. However, after making the edits and saving the file, upon reopening the same issue I describe is still present.
Comment 5 Mike Kaganski 2020-10-26 05:51:25 UTC
(In reply to Larry Tate from comment #4)

Opening the document, the language indeed is shown as {en}. The document's word/styles.xml contains

> <w:lang w:val="en" ... />

However, I cannot reproduce generating such a value, either with LibreOffice 7.0.3.1 or with Word 2016. There's no way to select generic "English" in Word's list of languages...

So the question remains: "which application was used to create them?"

And the fact that

> I had to remove virtually all of the text in the document

i.e., that you edited it and re-saved it in your LibreOffice, changed the document and makes it impossible to tell what was used to generate it.

You could ask one of the students to send you a document with a dummy text - just for inspection...?
Comment 6 Uwe Auer 2020-10-26 10:48:04 UTC
Just a hint: downloading a .docx file from Google Docs creates such a <w:lang w:val="en" /> entry in styles.xml.
Comment 7 Uwe Auer 2020-10-26 10:56:02 UTC
Created attachment 166731 [details]
LoremIpsum created using Google Docs shows issue

Attached a Google Docs dummy text, which has been set to English and shows "{en}" in status bar for the language in use.
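
For anyone who wants to check a received .docx without opening it in Writer, here is a minimal Python sketch that lists the w:lang values found in word/styles.xml and picks out the bare ones such as "en". The function names are made up for illustration, and the in-memory sample only stands in for a real Google Docs export; it is not one of the attachments.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in styles.xml).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def lang_values(docx_file):
    """Return the w:val attribute of every w:lang element in word/styles.xml."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/styles.xml"))
    values = (el.get(f"{{{W_NS}}}val") for el in root.iter(f"{{{W_NS}}}lang"))
    # w:lang may carry only w:eastAsia or w:bidi; skip those entries.
    return [v for v in values if v is not None]


def bare_language_tags(docx_file):
    """Return only the tags without any subtag, e.g. 'en' but not 'en-US'."""
    return [v for v in lang_values(docx_file) if "-" not in v]


# Hypothetical in-memory sample standing in for a Google Docs export.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr(
        "word/styles.xml",
        f'<w:styles xmlns:w="{W_NS}"><w:docDefaults><w:rPrDefault><w:rPr>'
        '<w:lang w:val="en"/></w:rPr></w:rPrDefault></w:docDefaults></w:styles>',
    )
sample.seek(0)
print(bare_language_tags(sample))
```

Passing a file path instead of the in-memory buffer should work the same way, since zipfile.ZipFile accepts both.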
Comment 8 Mike Kaganski 2020-10-26 11:49:38 UTC
(In reply to Uwe Auer from comment #6)

Thanks!
So what should be the issue here?
Should it be a NAB?
Or maybe should it be an enhancement to enable a fallback for generic en case, to use some dictionary (which? en-GB? en-US?)?
Comment 9 Larry Tate 2020-10-26 11:56:09 UTC
I'm getting this problem on 30 out of 30 submissions, so I don't think this is coming exclusively from Google Docs. 

I am working now on securing a dummy document from a student that I can share with you and surveying the class about their OS and software for composing. Standby! And thanks.
Comment 10 Uwe Auer 2020-10-26 12:10:27 UTC
(In reply to Larry Tate from comment #9)
> I'm getting this problem on 30 out of 30 submissions, so I don't think this
> is coming exclusively from Google Docs. 
> 

This makes me feel that there is a template, already having this setting and which has been distributed to all students.
Comment 11 Uwe Auer 2020-10-26 12:27:33 UTC
(In reply to Mike Kaganski from comment #8)

> Or maybe should it be an enhancement to enable a fallback for generic en
> case, to use some dictionary (which? en-GB? en-US?)?

Hmm, from a purely user's perspective: pop up a selection list stating that the document's language setting is ambiguous, and have the user select the variant of their language (offered languages restricted to the settings '{en}-*'). Finally, to not touch/change the existing document, force a save to a new document (just my thoughts, you may immediately forget that).

More pragmatic: On open inform user about 
i)  ambiguity/incompleteness in language setting
ii) forced a change to en-US (or e.g de-DE, if it was '{de}'-whatever it could be)


From a more technical perspective: not a LibreOffice bug but a bug in the creating application (though in fact I'm not aware of any standard here).
Comment 12 Mike Kaganski 2020-10-26 12:34:16 UTC
(In reply to Uwe Auer from comment #11)
> More pragmatic: On open inform user about 
> i)  ambiguity/incompleteness in language setting
> ii) forced a change to en-US (or e.g de-DE, if it was '{de}'-whatever it
> could be)

This seems consistent with currently existing infobar when a hyphenation dictionary is missing. It would feel logical from my PoV.
Comment 13 Larry Tate 2020-10-26 12:36:31 UTC
(In reply to Uwe Auer from comment #10)
> (In reply to Larry Tate from comment #9)
> > I'm getting this problem on 30 out of 30 submissions, so I don't think this
> > is coming exclusively from Google Docs. 
> > 
> 
> This makes me feel that there is a template, already having this setting and
> which has been distributed to all students.

One thought I'd considered is that these docs are all downloaded from our college's LMS (Canvas). I suppose it is possible that these documents are altered in some way before they are offered to the professor for download...
Comment 14 Heiko Tietze 2020-10-26 15:10:06 UTC
Not much that UX people can contribute here. At least GDocs export function is broken and due to false language information the spellchecking won't work as expected.

Adding a warning / infobar sounds not really actionable to me. The user would be supposed to understand what {en} means, what consequences this setting has, and how to solve the issue.

My take: either we fix it silently and convert ISO 639-1 codes (or whatever it is) into proper language tags or just blame others (=> NOB).

(needsUX needs UX-advice at CC)
Comment 15 Mike Kaganski 2020-10-26 15:38:41 UTC
(In reply to Heiko Tietze from comment #14)
> Not much that UX people can contribute here. At least GDocs export function
> is broken and due to false language information the spellchecking won't work
> as expected.

Well ... GDocs export function is not broken (unfortunately). The "en" is a valid BCP-47 tag ... and there's no requirement that language there include also country part of the tag. GDocs apparently don't discriminate their English dictionaries for countries, or have one of those dictionaries set as "generic English" ... and they are formally correct.

We add supported languages as required: people ask Eike to add this locale, or that locale ... and then provide locale data and translations and dictionaries ... and it appears in our list. Nothing prevents us - or any other OOXML- or ODF-conformant application - from using an "en" locale (as opposed to e.g. "en-US").
Comment 16 Mike Kaganski 2020-10-26 15:57:56 UTC
(In reply to Heiko Tietze from comment #14)
> My take: either we fix it silently and convert ISO 639-1 codes (or whatever
> it is) into proper language tags or just blame others (=> NOB).

But hardcode-mapping "en" to "en-US" (as MS Word does) seems a sane way to solve this.
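
To make the idea concrete, here is a rough Python sketch of such a hardcoded fallback. It is not LibreOffice code: the table and the function name are hypothetical, and the choice of region for each language (en-US here, where en-GB would be just as defensible) is exactly the open question in this report.

```python
# Hypothetical fallback table: which region to assume for a bare language tag.
# The entries are illustrative assumptions, not LibreOffice's actual behaviour.
DEFAULT_REGION = {"en": "US", "de": "DE"}


def resolve_bare_tag(tag, defaults=DEFAULT_REGION):
    """Map a bare language tag to a language-region tag, if a default is known.

    Tags that already carry a subtag, and bare tags with no table entry,
    are returned unchanged so that no information is invented for them.
    """
    if "-" in tag:
        return tag
    region = defaults.get(tag.lower())
    return f"{tag.lower()}-{region}" if region else tag


for tag in ("en", "en-GB", "ka"):
    print(tag, "->", resolve_bare_tag(tag))
```

Leaving unknown bare tags untouched keeps any syntactically valid tag intact; only the listed languages get a guessed region.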
Comment 17 Eike Rathke 2020-10-26 19:11:58 UTC
Other than displaying the language as "{en}" (which means "en" is a valid language tag but there is no "English (generic)" or some such language/locale list entry), is there an actual problem with those documents? Of course spell-checking doesn't work with that because there's no indication which English dictionary is to be used, unless the system provides one for only "en" (which AFAIK no system has for good reasons).

Yes, we could map a bare "en" to "en-US" but that could be equally wrong if instead it should have been "en-GB" (or some other). In Google's great US-centric manner the en-US might be desired here, but...
Asking the user or popping up infobars or making up other fallbacks is not an option because we accept and preserve *any* syntactically valid BCP47 language tag, also unknown to us, on purpose.

Btw, I could not see a language attribution in the GDocs UI, spell-checking seems to use some language recognition, e.g. using German words in an English UI checks fine and downloading as .docx or .odt also results in "en" attribute so that crap is useless anyway.
Comment 18 Mike Kaganski 2020-10-29 06:09:34 UTC
Not to contradict Eike (personally I totally agree with his assessment), just wanted to mention two related discussions, which incidentally show how ambiguous this is - so this basically is expected to support Eike's PoV:

1. In this one, a random person argues that for "Guessing the missing parts" problem, *the rule is to select the "original country" of the language. The exceptions are mostly based on population* (mentioning en, pt, and zh as those exception cases).

https://stackoverflow.com/questions/2500066/if-you-have-an-application-localized-in-pt-br-and-pt-pt-what-language-you-shoul

2. This one shows that the first fallback for English locales is chosen by Google to be "International English variant", which is en-GB: "After opening a bug report on Google, defaulting to en_GB and not default strings.xml, they mentioned that this in the intended behaviour for Android N above".

https://stackoverflow.com/questions/45511769/localization-for-canada-defaults-to-uk-should-default-to-us
Comment 19 Heiko Tietze 2020-10-29 10:10:01 UTC
So let's do the silent conversion. Since we default to en_US (and ship the localization) I would rather use this (surprised that Google defaults to en_GB).
Comment 20 Mike Kaganski 2021-03-26 09:57:01 UTC
Note that the problem is not limited to "en". For example, "fr", "zh", "it" are also affected:

https://ask.libreoffice.org/en/question/286107
https://ask.libreoffice.org/en/question/291826
https://ask.libreoffice.org/en/question/289004
Comment 21 Mike Kaganski 2021-03-26 10:04:07 UTC
"el", "ja", "ka" ... :

https://ask.libreoffice.org/en/question/293486
https://ask.libreoffice.org/en/question/284727
https://ask.libreoffice.org/en/question/280102
https://ask.libreoffice.org/en/question/298632
https://ask.libreoffice.org/en/question/287776
...

It's pretty annoying to people, and many are affected. I make it "normal" instead of "trivial".
Comment 22 Mike Kaganski 2021-03-26 10:23:53 UTC
*** Bug 137635 has been marked as a duplicate of this bug. ***
Comment 23 Mike Kaganski 2021-03-26 10:24:17 UTC
*** Bug 136808 has been marked as a duplicate of this bug. ***
Comment 24 Mike Kaganski 2021-03-26 10:24:26 UTC
*** Bug 136809 has been marked as a duplicate of this bug. ***
Comment 25 Mike Kaganski 2021-03-26 10:29:20 UTC
*** Bug 132396 has been marked as a duplicate of this bug. ***
Comment 26 ajlittoz 2021-03-31 09:27:56 UTC
This might help to find a solution.

While working on AskLO question 286107, I made an experiment with Writer 7.0.5.2 Linux 5.11 VCL kf5 to try and understand when the message was issued. Some users complain that the misbehaviour happens on standard .odt files having never seen Google Docs.

A suggested solution was to uncheck the auto-hyphenate box in the paragraph style Text Flow tab.

I could not create the misbehaviour at first. But after having tampered a lot with Tools>Options, I made Writer crash. After the crash, I let the auto-recovery rebuild the document. When done, the message was present although the language pack is installed and hunspell modules too.

I looked at the .fodt version but did not see anything obvious as I am not familiar with the details of encoding. However if I change fo:language="fr" to fo:language="fr_FR", there is no longer any message on open. The trick works also for uninstalled language.

What puzzles me is that I then created a fresh document from scratch, saved it and made sure it opens without the "missing hyphenation data" message. When I looked at its XML, the fo:language attributes were simply set to "fr" without country code (which is in attribute fo:country=…).

So the cause of the problem may be somewhere else.

I'm attaching the faulty file for analysis. Its paragraph styles have been a bit modified after the crash but the mishap is still there.
Comment 27 ajlittoz 2021-03-31 09:29:08 UTC
Created attachment 170851 [details]
Recovered document issuing "Missing hyphenation data"
Comment 28 Eike Rathke 2021-03-31 09:59:22 UTC
(In reply to ajlittoz from comment #26)
> I looked at the .fodt version but did not see anything obvious as I am not
> familiar with the details of encoding. However if I change fo:language="fr"
> to fo:language="fr_FR", there is no longer any message on open. The trick
> works also for uninstalled language.
fo:language="fr_FR" is wrong though, the fo:language attribute is to contain *only* the language.

> When I looked at its XML, the fo:language attributes were simply
> set to "fr" without country code (which is in attribute fo:country=…).
Which is correct.
fo:language="fr" fo:country="FR" denotes the fr-FR language tag.
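
As an illustration of that pairing, here is a small Python sketch that derives the language tag from a <style:text-properties> element. The helper name and the sample element are made up for the example, and it deliberately ignores other attributes such as fo:script.

```python
import xml.etree.ElementTree as ET

# ODF namespaces for the "fo:" and "style:" prefixes.
FO_NS = "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"


def language_tag(text_props):
    """Combine fo:language and fo:country into one tag, e.g. 'fr' + 'FR' -> 'fr-FR'.

    Returns the bare language if fo:country is absent, or None if the
    element has no fo:language at all.
    """
    language = text_props.get(f"{{{FO_NS}}}language")
    country = text_props.get(f"{{{FO_NS}}}country")
    if not language:
        return None
    return f"{language}-{country}" if country else language


# Hypothetical element, modelled on the attributes quoted in this comment.
element = ET.fromstring(
    f'<style:text-properties xmlns:style="{STYLE_NS}" xmlns:fo="{FO_NS}" '
    'fo:language="fr" fo:country="FR"/>'
)
print(language_tag(element))
```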


> I'm attaching the faulty file for analysis. Its paragraph styles have been a
> bit modified after the crash but the mishap is still there.
All four <style:text-properties> have both fo:language="fr" fo:country="FR" attributes. There is no fo:language="fr" alone.

The missing hyphenation data message when opening the document is reproducible for me, though understandably because I don't have any French spell-checking or hyphenation installed.

If it happens also if French hyphenation data is installed then it's likely not because the document would contain wrong language attribution but something else.
Comment 29 ajlittoz 2021-03-31 10:31:55 UTC
(In reply to Eike Rathke from comment #28)
> (In reply to ajlittoz from comment #26)
> All four <style:text-properties> have both fo:language="fr" fo:country="FR"
> attributes. There is no fo:language="fr" alone.

I was short in my description. The fo:country attribute was there of course.

I tried fr_FR in the .fodt of it just to see, even if I knew this was redundant and contradictory with fo:country.

> The missing hyphenation data message when opening the document is
> reproducible for me, though understandably because I don't have any French
> spell-checking or hyphenation installed.

Initially, I "tagged" the first paragraph it_IT because this hyphenation package is not installed. After the crash, to my surprise, the message requested fr, not it. When I changed fr to fr_FR (which is wrong), then the message requested it.

> If it happens also if French hyphenation data is installed then it's likely
> not because the document would contain wrong language attribution but
> something else.

This is why I attached it, being unable to interpret the rest of it.

Make a .fodt from the attached document and change the fo:language for one you have installed. Open it in Writer. Does the message still display?
Comment 30 Eike Rathke 2021-03-31 10:55:57 UTC
Fwiw, code pointers for that:

Check for hyphenator to output the info message happens in sw/source/core/text/inftxt.cxx line 1496 of SwTextFormatInfo::IsHyphenate() with

  if (!xHyph->hasLocale(g_pBreakIt->GetLocale(eTmp)))

where eTmp is LCID 0x040C (1036) for fr-FR and GetLocale() call correctly results in lang::Locale("fr","FR","") and then in my debug build where I do have French hyphenation available xHyph->hasLocale() returns true and does not complain.

The info message in case the locale is not found also only outputs the language, not the full language tag, so that single "fr" there is explained.
Comment 31 Eike Rathke 2021-03-31 10:59:20 UTC
However, I doubt this is even related to the original problem of Google documents being broken by specifying only a language, could you please submit another bug for that? Thanks.
Comment 32 ajlittoz 2021-03-31 12:41:22 UTC
(In reply to Eike Rathke from comment #31)
> However, I doubt this is even related to the original problem of Google
> documents being broken by specifying only a language, could you please
> submit another bug for that? Thanks.

See bug 141384.
Comment 33 Xisco Faulí 2021-04-28 16:19:10 UTC
For the record, the brackets started to be displayed after

https://cgit.freedesktop.org/libreoffice/core/commit/?id=bde834ee6b0cb43cebece47cac55cc9b80aadc24

author	Eike Rathke <erack@redhat.com>	2017-03-14 11:52:52 +0100
committer	Eike Rathke <erack@redhat.com>	2017-03-14 12:48:22 +0100
commit	bde834ee6b0cb43cebece47cac55cc9b80aadc24 (patch)
tree	ef3f5ebe8340d0e0392905ec3cfab033953a6425
parent	bf63e5a3a6ae458ffe10061c1bcf969a534760c5 (diff)
display raw language tags in curly brackets