Bug 139185 - "creatine" is detected as a Romanian word
Summary: "creatine" is detected as a Romanian word
Status: NEEDINFO
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Linguistic
Version (earliest affected): 7.0.4.2 release
Hardware: All All
Importance: medium minor
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: Language-Detection
Reported: 2020-12-23 12:03 UTC by Dan Dascalescu
Modified: 2024-05-28 06:19 UTC
CC List: 6 users

See Also:
Crash report or crash signature:


Attachments
"creatine" is actually a US English word (43.54 KB, image/png)
2020-12-23 12:03 UTC, Dan Dascalescu
Description Dan Dascalescu 2020-12-23 12:03:33 UTC
Created attachment 168450
"creatine" is actually a US English word

Not sure if this should be filed against a dictionary component, please re-file accordingly.
Comment 1 Mike Kaganski 2020-12-23 12:29:36 UTC
What is specifically wrong with the screenshot, and why do you say in the title that "creatine" is detected as a Romanian word? At least the image does not make it obvious.

What I see is that it detects a spelling error in "creatine", written in an unknown language (the status bar, which would show the language, did not fit in the screenshot), and that there is a "Word is Romanian" suggestion. Again, it is unclear why, given that the report provides no OS or LO configuration information. I would guess that it simply suggests the user's locale, or perhaps picks from the list of installed dictionaries or some such, without any relation to whether it thinks the word is Romanian.

Only if the word is not underlined when the language is set to Romanian, or if there is reason to believe that the suggestion is shown precisely because of a guess rather than because of the installed components, can we consider the premise in the title correct ...
Comment 2 Ming Hua 2020-12-23 12:36:35 UTC
I'm not sure including all amino acid names in the general-purpose English dictionary is a good idea.

I don't know anything about Romanian, but "creatine" is probably indeed a common Romanian word, which is why you see the suggestion. Does it only appear if you have a Romanian dictionary installed (and maybe enabled)?

You can always solve your problem locally by adding "creatine" to your user's dictionary using the "Add to Dictionary" menu item, but you probably already know that.
Comment 3 Ming Hua 2020-12-23 16:14:14 UTC
(In reply to Ming Hua from comment #2)
> It only appears if you have Romanian dictionary installed (and maybe enabled)
I take this back.

I was testing in Writer and didn't see the "Word is Romanian (Romania)" menu item that Dan's screenshot showed. Now that I've tested in Calc, I can see the same menu even though I don't have a Romanian dictionary installed.

Version: 7.1.0.0.beta1 (x64)
Build ID: 828a45a14a0b954e0e539f5a9a10ca31c81d8f53
CPU threads: 2; OS: Windows 10.0 Build 18363; UI render: default; VCL: win
Locale: zh-CN (zh_CN); UI: zh-CN
Calc: threaded

My locale and UI are Chinese, and the default Western text language in Tools > Options > Language Settings > Languages is set to "English (USA)". The text "Cellucor creatine" in a cell is detected as English according to the status bar, yet the context menu when right-clicking on "creatine" still offers the "Word is Romanian..." and "Paragraph is Romanian..." items.
Comment 4 Mike Kaganski 2020-12-23 16:39:34 UTC
Looking into the code, OP seems to have guessed right.

The menu items are created in EditView::ExecuteSpellPopup (editeng/source/editeng/editview.cxx). It uses a language guesser, implemented in lingucomponent/source/languageguessing/guesslang.cxx.

When used for a single word, EditView::CheckLanguage tries four languages:
* The default document language from "Tools/Options - Language Settings - Languages: Western";
* The one from "Tools/Options - Language Settings - Languages: User interface";
* The one from "Tools/Options - Language Settings - Languages: Locale setting";
* en-US.
If they have active dictionaries, then the first of them is used from there on.
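
The single-word fallback order described above can be sketched roughly as follows. This is only an illustration of the description in this comment, not the actual editeng code: `pick_word_language`, `has_dictionary`, and the language tags in the example are hypothetical names chosen for the sketch.

```python
from typing import Callable, Optional, Sequence


def pick_word_language(
    candidates: Sequence[Optional[str]],
    has_dictionary: Callable[[str], bool],
) -> Optional[str]:
    """Return the first candidate language that has an active dictionary.

    `candidates` is the ordered list from the comment: Western default,
    UI language, locale setting, en-US. Unset entries are given as None.
    """
    for lang in candidates:
        if lang is None:
            continue  # setting not configured, skip it
        if has_dictionary(lang):
            return lang
    return None  # no candidate has an active dictionary


# Hypothetical setup resembling comment 3: Western default en-US,
# UI and locale zh-CN, and only an English dictionary active.
active_dictionaries = {"en-US"}
chosen = pick_word_language(
    ["en-US", "zh-CN", "zh-CN", "en-US"],
    lambda lang: lang in active_dictionaries,
)
```

Under this description, with the settings from comment 3, Romanian would never be one of the candidates for the single-word check.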

When checking paragraph text, the language guesser uses libexttextcat [1] to perform a "fingerprint-based" guessing. It looks highly unreliable, based on the evidence...
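
For illustration, a fingerprint-based guesser in the classic character n-gram style can be sketched as below. This assumes libexttextcat works broadly along these lines; the function names, the `max_n` and `top` parameters, and the toy profiles are made up for the sketch and are not the library's API or data.

```python
from __future__ import annotations

from collections import Counter


def fingerprint(text: str, max_n: int = 3, top: int = 300) -> list[str]:
    """Rank the most frequent character n-grams (1..max_n) of the text."""
    padded = f"_{text.lower()}_"  # mark word boundaries
    counts = Counter(
        padded[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(padded) - n + 1)
    )
    # Most frequent first; ties broken alphabetically for determinism.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top]


def out_of_place(doc: list[str], profile: list[str]) -> int:
    """Sum of rank differences; n-grams missing from the profile
    receive the maximum penalty (the profile length)."""
    rank = {gram: i for i, gram in enumerate(profile)}
    penalty = len(profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else penalty
        for i, gram in enumerate(doc)
    )


def guess(text: str, profiles: dict[str, list[str]]) -> str:
    """Return the profile name with the smallest out-of-place distance."""
    doc = fingerprint(text)
    return min(profiles, key=lambda name: out_of_place(doc, profiles[name]))


# Toy profiles built from artificial strings, purely to exercise the code.
toy_profiles = {"a-lang": fingerprint("aaaa"), "b-lang": fingerprint("bbbb")}
toy_guess = guess("aaa", toy_profiles)
```

A single word yields only a handful of n-grams, so with an approach like this small rank differences may be enough to tip the result toward an unrelated language.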

I suppose it is the same as (part of) tdf#66051. Personally I would just drop it.

[1] https://wiki.documentfoundation.org/Libexttextcat
Comment 5 Mike Kaganski 2020-12-23 16:40:48 UTC
(In reply to Mike Kaganski from comment #4)
> Personally I would just drop it.

... I mean, just drop the language guesser. I don't see it doing anything useful.
Comment 6 Dan Dascalescu 2020-12-23 19:25:14 UTC
Agreed. I would like to disable the language guesser altogether for the performance gain (is there a way to do that?), because I only use English in my documents. That is part of an effort to advocate for using English universally, since the global costs of translation exceed those of eliminating hunger (http://bit.ly/translation-vs-world-hunger), but that's a totally separate story.

FWIW, I don't have any locales installed either. Coincidentally, I am Romanian, and "creatine" is not actually a Romanian word (https://dexonline.ro/definitie/creatine).

I would advocate for including it in the English dictionary because it is more than just another amino acid; it's probably the second most popular supplement in the fitness industry.
Comment 7 Mike Kaganski 2020-12-24 07:38:40 UTC
See also: "Language Guessing" at https://www.openoffice.org/development/releases/2.3.0.html
Comment 8 Michael Bauer 2021-12-08 13:00:04 UTC
I agree the language guesser is not working well, but I don't think removing it is the solution. It is suggesting Romanian to me too, even though I have nothing Romanian installed on my PC ANYWHERE.

In any case, as a simple solution, use Marco Pinto's English dictionary? It's not on the LO extensions site but on the OO one: https://extensions.openoffice.org/en/project/english-dictionaries-apache-openoffice
It's much more up to date than what LO seems to bundle, and it certainly accepts "creatine" as an English word for me.
Comment 9 Mike Kaganski 2021-12-08 13:12:58 UTC
(In reply to Michael Bauer from comment #8)

Marco Pinto is a great LO contributor: https://gerrit.libreoffice.org/q/owner:marcoagpinto%2540sapo.pt

So it's definitely not the case that "It's much more up to date than what LO seems to bundle". Besides, adding a single word to the dictionary would not solve the underlying issue: the "random" guessing of applicable languages based on what is focused (which is also what your bug 95274 is about).

And anyone is of course welcome to provide contributions to our dictionaries :-) - see https://wiki.documentfoundation.org/Development/Dictionaries
Comment 10 Stéphane Guillou (stragu) 2024-05-28 06:19:38 UTC
I suggest closing this as a duplicate of bug 95274, as it boils down to the same issue: libexttextcat not doing a great job, at least in how we currently use it.
Any objection?