Bug 137742 - Google Docs exports only (ambiguous) "en" language tag text attribute
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 7.0.2.2 release (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Eike Rathke
URL:
Whiteboard: target:7.3.0 target:7.2.1
Keywords: bibisected, bisected
Duplicates: 132396 137635 137743
Depends on:
Blocks:
 
Reported: 2020-10-25 17:21 UTC by Larry Tate
Modified: 2022-02-05 15:01 UTC (History)
9 users

See Also:
Crash report or crash signature:


Attachments
Doc that produces the issue I described with language settings. (6.99 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2020-10-25 18:37 UTC, Larry Tate
LoremIpsum created using Google Docs shows issue (6.76 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2020-10-26 10:56 UTC, [REDACTED]
Recovered document issuing "Missing hyphenation data" (8.92 KB, application/vnd.oasis.opendocument.text)
2021-03-31 09:29 UTC, ajlittoz

Description Larry Tate 2020-10-25 17:21:39 UTC
Description:
Documents I receive from students show {en} as the text language. I have gone into Tools > Options > Language Settings > Languages and checked that English (USA) is the selection in User Interface, Locale Setting, and Default Languages for Documents. However, the setting does not take effect. Documents I receive from students instead appear as having {en} as the language.

Steps to Reproduce:
1. Go to Tools > Options > Language Settings > Languages and check that English (USA) is the selection in User Interface, Locale Setting, and Default Languages for Documents.

2. Open a document, discover in sadness that the setting is not taking effect.


Actual Results:
The setting is not forcing documents to use the desired language selection for documents. This requires you to select the entire document, apply the setting, and then continue with your work. Annoying. 

Expected Results:
This should force all documents to use the selected language.


Reproducible: Always


User Profile Reset: Yes



Additional Info:
This should force all documents to use the selected language.
Comment 1 Julien Nabet 2020-10-25 17:34:17 UTC
*** Bug 137743 has been marked as a duplicate of this bug. ***
Comment 2 Mike Kaganski 2020-10-25 18:19:29 UTC
(In reply to Larry Tate from comment #0)
> Description:
> Documents I receive from students show {en} as the text language. I have
> gone into Tools > Options > Language Settings > Languages and checked that
> English (USA) is the selection in User Interface, Locale Setting, And
> Default Languages of Documents. However, the setting does not take effect.
> Documents I receive from students instead appear as having {en} as the
> language. 
> ...
> This should force all documents to use the selected language.

No.
This setting has no relation to what you see when you open *existing* documents; it applies only to new documents, or to new text that you enter into documents that don't define a language themselves.

Normally all current complex document formats, like ODF or OOXML (or older binary formats of MSO), include the information about language used when creating them. And *that* information must be used when you open those documents, not something that you define as default for new documents.

So this is not a bug. However, the question is why you see {en}, and not something like "English (UK)" or "English (South Africa)" and so on: is something wrong in LibreOffice at the opening/import stage, or is something happening when the files are saved (what format are the documents? which application was used to create them?).

You may want to attach a sample document that shows the problem. Then possibly the report may be confirmed, and converted to address the actual bug.
Comment 3 Larry Tate 2020-10-25 18:37:02 UTC
Created attachment 166703 [details]
Doc that produces the issue I described with language settings.
Comment 4 Larry Tate 2020-10-25 18:39:49 UTC
Thank you for that clarification. I am attaching a representative document where the issue appears. These are student essays, so I had to remove virtually all of the text in the document. However, after making the edits and saving the file, the same issue I describe is still present upon reopening.
Comment 5 Mike Kaganski 2020-10-26 05:51:25 UTC
(In reply to Larry Tate from comment #4)

Opening the document, the language indeed is shown as {en}. The document's word/styles.xml contains

> <w:lang w:val="en" ... />

However, I cannot generate such a value with either LibreOffice 7.0.3.1 or Word 2016. There's no way to select generic "English" in Word's list of languages...
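
A minimal sketch of the styles.xml inspection described above, assuming the file is an ordinary OOXML package with a word/styles.xml part. The function names and the in-memory sample document are hypothetical and exist only for illustration; this is not LibreOffice's import code.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w:" prefix in styles.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def lang_values(docx):
    """Return the w:val of every <w:lang> element in word/styles.xml.

    `docx` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/styles.xml"))
    values = []
    for element in root.iter(f"{{{W_NS}}}lang"):
        value = element.get(f"{{{W_NS}}}val")
        if value is not None:
            values.append(value)
    return values


def bare_language_tags(values):
    """Keep only tags without a subtag separator, e.g. "en" but not "en-US"."""
    return [value for value in values if "-" not in value]


def make_sample_docx():
    """Build a made-up, minimal in-memory package so the sketch runs without an attachment."""
    styles = (
        f'<w:styles xmlns:w="{W_NS}"><w:docDefaults><w:rPrDefault><w:rPr>'
        '<w:lang w:val="en"/>'
        "</w:rPr></w:rPrDefault></w:docDefaults></w:styles>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/styles.xml", styles)
    buffer.seek(0)
    return buffer
```

For a real file, something like `bare_language_tags(lang_values("essay.docx"))` (the file name is a placeholder) would list any language-only tags found in the document defaults and styles.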

So the question remains: "which application was used to create them?"

And the fact that

> I had to remove virtually all of the text in the document

i.e., that you edited it and re-saved it in your LibreOffice, changed the document and made it impossible to tell what was used to generate it.

You could ask one of the students to send you a document with a dummy text - just for inspection...?
Comment 6 [REDACTED] 2020-10-26 10:48:04 UTC
Just a hint: downloading a .docx file from Google Docs creates such a <w:lang w:val="en" /> entry in styles.xml.
Comment 7 [REDACTED] 2020-10-26 10:56:02 UTC
Created attachment 166731 [details]
LoremIpsum created using Google Docs shows issue

Attached a Google Docs dummy text, which has been set to English and shows "{en}" in status bar for the language in use.
Comment 8 Mike Kaganski 2020-10-26 11:49:38 UTC
(In reply to Uwe Auer from comment #6)

Thanks!
So what should be the issue here?
Should it be a NAB?
Or should it maybe be an enhancement to enable a fallback for the generic "en" case, to use some dictionary (which? en-GB? en-US?)?
Comment 9 Larry Tate 2020-10-26 11:56:09 UTC
I'm getting this problem on 30 out of 30 submissions, so I don't think this is coming exclusively from Google Docs. 

I am now working on securing a dummy document from a student that I can share with you, and on surveying the class about the OS and software they use for composing. Stand by! And thanks.
Comment 10 [REDACTED] 2020-10-26 12:10:27 UTC
(In reply to Larry Tate from comment #9)
> I'm getting this problem on 30 out of 30 submissions, so I don't think this
> is coming exclusively from Google Docs. 
> 

This makes me feel that there is a template, already having this setting and which has been distributed to all students.
Comment 11 [REDACTED] 2020-10-26 12:27:33 UTC
(In reply to Mike Kaganski from comment #8)

> Or maybe should it be an enhancement to enable a fallback for generic en
> case, to use some dictionary (which? en-GB? en-US?)?

Hmm, from a pure user's perspective: pop up a selection list stating that the document's language setting is ambiguous, and have the user select the variant of their language (offered languages restricted to the settings '{en}-*'). Finally, to not touch/change the existing document, force a save to a new document (just my thoughts, you may immediately forget that).

More pragmatic: On open inform user about 
i)  ambiguity/incompleteness in language setting
ii) forced a change to en-US (or e.g de-DE, if it was '{de}'-whatever it could be)


From a more technical perspective: not a bug in LibreOffice but a bug in the creating application (but in fact I'm not aware of any standard here).
Comment 12 Mike Kaganski 2020-10-26 12:34:16 UTC
(In reply to Uwe Auer from comment #11)
> More pragmatic: On open inform user about 
> i)  ambiguity/incompleteness in language setting
> ii) forced a change to en-US (or e.g de-DE, if it was '{de}'-whatever it
> could be)

This seems consistent with currently existing infobar when a hyphenation dictionary is missing. It would feel logical from my PoV.
Comment 13 Larry Tate 2020-10-26 12:36:31 UTC
(In reply to Uwe Auer from comment #10)
> (In reply to Larry Tate from comment #9)
> > I'm getting this problem on 30 out of 30 submissions, so I don't think this
> > is coming exclusively from Google Docs. 
> > 
> 
> This makes me feel that there is a template, already having this setting and
> which has been distributed to all students.

One thought I'd considered is that these docs are all downloaded from our college's LMS (Canvas). I suppose it is possible that these documents are altered in some way before they are offered to the professor for download...
Comment 14 Heiko Tietze 2020-10-26 15:10:06 UTC
Not much that UX people can contribute here. At least GDocs export function is broken and due to false language information the spellchecking won't work as expected.

Adding a warning / infobar sounds not really actionable to me. The user would be supposed to understand what {en} means, what consequences this setting has, and how to solve the issue.

My take: either we fix it silently and convert ISO 639-1 codes (or whatever it is) into proper language tags or just blame others (=> NOB).

(needsUX needs UX-advice at CC)
Comment 15 Mike Kaganski 2020-10-26 15:38:41 UTC
(In reply to Heiko Tietze from comment #14)
> Not much that UX people can contribute here. At least GDocs export function
> is broken and due to false language information the spellchecking won't work
> as expected.

Well ... GDocs export function is not broken (unfortunately). The "en" is a valid BCP-47 tag ... and there's no requirement that language there include also country part of the tag. GDocs apparently don't discriminate their English dictionaries for countries, or have one of those dictionaries set as "generic English" ... and they are formally correct.

We add supported languages as required: people ask Eike to add this locale, or that locale ... and then provide locale data and translations and dictionaries ... and it appears in our list. Nothing prevents us - or any other OOXML- or ODF-conformant application - from using the "en" locale (as opposed to e.g. "en-US").
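
To illustrate the point that a tag such as "en" is well-formed without a country part, here is a rough sketch that splits a tag into language, script and region subtags. The pattern is a deliberate simplification and an assumption of this example: real BCP 47 tags also allow variants, extensions and private-use subtags, which this sketch rejects. The function names are hypothetical.

```python
import re

# Simplified shape: 2-3 letter language, optional 4-letter script,
# optional region (2 letters or 3 digits). Not a full BCP 47 parser.
_SIMPLE_TAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)


def split_tag(tag):
    """Return a dict of subtags, or None if the tag does not fit the simplified shape."""
    match = _SIMPLE_TAG.fullmatch(tag)
    return match.groupdict() if match else None


def is_language_only(tag):
    """True for tags such as "en" that carry neither a script nor a region."""
    parts = split_tag(tag)
    return parts is not None and parts["script"] is None and parts["region"] is None
```

Under these assumptions, `is_language_only("en")` is true while `is_language_only("en-US")` is false, which is the distinction the status bar surfaces as {en}.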
Comment 16 Mike Kaganski 2020-10-26 15:57:56 UTC
(In reply to Heiko Tietze from comment #14)
> My take: either we fix it silently and convert ISO 639-1 codes (or whatever
> it is) into proper language tags or just blame others (=> NOB).

But hardcode-mapping "en" to "en-US" (as MS Word does) seems a sane way to solve this.
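
A hedged sketch of what such a hard-coded mapping could look like. The table and function name are hypothetical, and only the "en" entry comes from this discussion; this is not the code of the eventual fix. Tags that already carry subtags, or that have no table entry, are passed through unchanged, in line with preserving any syntactically valid tag.

```python
# Hypothetical fallback table; only "en" -> "en-US" is discussed in this report.
FALLBACK_LOCALES = {"en": "en-US"}


def resolve_language_tag(tag, fallbacks=FALLBACK_LOCALES):
    """Map a bare language tag to a full locale if a fallback is configured.

    Tags that already have subtags (e.g. "en-GB"), and bare tags without
    a configured fallback (e.g. "fr"), are returned unchanged.
    """
    if "-" in tag:
        return tag
    return fallbacks.get(tag.lower(), tag)
```

The choice of "en-US" over "en-GB" is the ambiguous part; keeping it in a table rather than in the logic at least makes that decision easy to find and change.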
Comment 17 Eike Rathke 2020-10-26 19:11:58 UTC
Other than displaying the language as "{en}" (which means "en" is a valid language tag but there is no "English (generic)" or some such language/locale list entry), is there an actual problem with those documents? Of course spell-checking doesn't work with that because there's no indication which English dictionary is to be used, unless the system provides one for only "en" (which AFAIK no system has for good reasons).

Yes, we could map a bare "en" to "en-US" but that could be equally wrong if instead it should have been "en-GB" (or some other). In Google's great US-centric manner the en-US might be desired here, but..
Asking the user or popping up infobars or making up other fallbacks is not an option because we accept and preserve *any* syntactically valid BCP47 language tag, also unknown to us, on purpose.

Btw, I could not see a language attribution in the GDocs UI, spell-checking seems to use some language recognition, e.g. using German words in an English UI checks fine and downloading as .docx or .odt also results in "en" attribute so that crap is useless anyway.
Comment 18 Mike Kaganski 2020-10-29 06:09:34 UTC
Not to contradict Eike (personally I totally agree with his assessment), just wanted to mention two related discussions, which incidentally show how ambiguous this is - so this basically is expected to support Eike's PoV:

1. In this one, a random person argues that for "Guessing the missing parts" problem, *the rule is to select the "original country" of the language. The exceptions are mostly based on population* (mentioning en, pt, and zh as those exception cases).

https://stackoverflow.com/questions/2500066/if-you-have-an-application-localized-in-pt-br-and-pt-pt-what-language-you-shoul

2. This one shows that the first fallback for English locales is chosen by Google to be "International English variant", which is en-GB: "After opening a bug report on Google, defaulting to en_GB and not default strings.xml, they mentioned that this in the intended behaviour for Android N above".

https://stackoverflow.com/questions/45511769/localization-for-canada-defaults-to-uk-should-default-to-us
Comment 19 Heiko Tietze 2020-10-29 10:10:01 UTC
So let's do the silent conversion. Since we default to en_US (and ship the localization), I would rather use this (surprised that Google defaults to en_GB).
Comment 20 Mike Kaganski 2021-03-26 09:57:01 UTC
Note that the problem is not limited to "en". For example, "fr", "zh", "it" are also affected:

https://ask.libreoffice.org/en/question/286107
https://ask.libreoffice.org/en/question/291826
https://ask.libreoffice.org/en/question/289004
Comment 21 Mike Kaganski 2021-03-26 10:04:07 UTC
"el", "ja", "ka" ... :

https://ask.libreoffice.org/en/question/293486
https://ask.libreoffice.org/en/question/284727
https://ask.libreoffice.org/en/question/280102
https://ask.libreoffice.org/en/question/298632
https://ask.libreoffice.org/en/question/287776
...

It's pretty annoying to people, and many are affected. I make it "normal" instead of "trivial".
Comment 22 Mike Kaganski 2021-03-26 10:23:53 UTC
*** Bug 137635 has been marked as a duplicate of this bug. ***
Comment 23 Mike Kaganski 2021-03-26 10:24:17 UTC
*** Bug 136808 has been marked as a duplicate of this bug. ***
Comment 24 Mike Kaganski 2021-03-26 10:24:26 UTC
*** Bug 136809 has been marked as a duplicate of this bug. ***
Comment 25 Mike Kaganski 2021-03-26 10:29:20 UTC
*** Bug 132396 has been marked as a duplicate of this bug. ***
Comment 26 ajlittoz 2021-03-31 09:27:56 UTC Comment hidden (off-topic)
Comment 27 ajlittoz 2021-03-31 09:29:08 UTC Comment hidden (off-topic)
Comment 28 Eike Rathke 2021-03-31 09:59:22 UTC Comment hidden (off-topic)
Comment 29 ajlittoz 2021-03-31 10:31:55 UTC Comment hidden (off-topic)
Comment 30 Eike Rathke 2021-03-31 10:55:57 UTC Comment hidden (off-topic)
Comment 31 Eike Rathke 2021-03-31 10:59:20 UTC Comment hidden (off-topic)
Comment 32 ajlittoz 2021-03-31 12:41:22 UTC Comment hidden (off-topic)
Comment 33 Xisco Faulí 2021-04-28 16:19:10 UTC
For the record, the brackets started to be displayed after

https://cgit.freedesktop.org/libreoffice/core/commit/?id=bde834ee6b0cb43cebece47cac55cc9b80aadc24

author	Eike Rathke <erack@redhat.com>	2017-03-14 11:52:52 +0100
committer	Eike Rathke <erack@redhat.com>	2017-03-14 12:48:22 +0100
commit bde834ee6b0cb43cebece47cac55cc9b80aadc24 (patch)
tree ef3f5ebe8340d0e0392905ec3cfab033953a6425
parent bf63e5a3a6ae458ffe10061c1bcf969a534760c5 (diff)
display raw language tags in curly brackets
Comment 34 Eike Rathke 2021-08-10 17:35:29 UTC
(In reply to Mike Kaganski from comment #20)
> Note that the problem is not limited to "en". For example, "fr", "zh", "it"
> are also affected:
(In reply to Mike Kaganski from comment #21)
> "el", "ja", "ka" ... :
Those are quite useless though because none of them provided a sample document or answered the question whether their document has been processed by GoogleDocs (except one who claims they didn't but also did not provide a sample document), and the message displayed only the language, not the full language tag; which I fixed with https://gerrit.libreoffice.org/c/core/+/119020 (also for 7-2 and 7-1)
Comment 35 Eike Rathke 2021-08-10 17:51:44 UTC
I'll see if I can make something out of the known GDocs 'en' case at least..
Comment 36 Commit Notification 2021-08-14 23:47:27 UTC
Eike Rathke committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/23f17b7ea6fbd2f422c7e40192ae60e4df25224c

Resolves: tdf#137742 Workaround cheesy Google Docs writing language-only tags

It will be available in 7.3.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 37 Eike Rathke 2021-08-14 23:49:00 UTC
Pending review https://gerrit.libreoffice.org/c/core/+/120438 for 7-2
Comment 38 Commit Notification 2021-08-16 10:18:54 UTC
Eike Rathke committed a patch related to this issue.
It has been pushed to "libreoffice-7-2":

https://git.libreoffice.org/core/commit/118eb9e426fe729324347685f986ff9e78d49483

Resolves: tdf#137742 Workaround cheesy Google Docs writing language-only tags

It will be available in 7.2.1.
Comment 39 Commit Notification 2021-08-18 16:55:05 UTC
Xisco Fauli committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/f6a04457b8aa227deb9402e6406ea843fabfcbb0

tdf#137742: sw_ooxmlexport16: Add unittest

It will be available in 7.3.0.
Comment 40 eisa01 2022-02-05 15:01:23 UTC
*** Bug 144273 has been marked as a duplicate of this bug. ***