Description:
Commit 2f0ddaeeb4323ac99afd35d2c4fda643c9ee8bcf "tdf#97393 Update English Dictionaries to 2016.05.01 release" completely removed the morphological data from the English spelling dictionaries, which broke stemming and affixation in the English thesaurus.

Steps to Reproduce:
1. Ask for synonyms of "mice" (the plural form of "mouse").

Actual Results:
No synonyms are offered for the word "mice".

Expected Results:
"mice" is stemmed to "mouse", and affixed synonyms are suggested, e.g. "rodents".

Reproducible: Always

User Profile Reset: No

Additional Info:
See Bug 97393, the origin of the regression.
I suggest reverting the dictionary changes and adding the new words and their word forms to the end of the original .dic file. This keeps the results of both the past and the new dictionary development, and fixes the serious regression.
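A minimal sketch of the "append to the end" step, assuming the restored .dic file keeps the usual Hunspell layout (an entry count on the first line, one entry per line, optional /FLAGS and morphological fields after the word). The helper names `bare_word` and `append_words` are hypothetical and not part of any existing dictionary tooling.

```python
def bare_word(entry: str) -> str:
    """Return the word of a .dic entry without /FLAGS or morphological fields."""
    head = entry.split("/", 1)[0].split()
    return head[0] if head else ""


def append_words(dic_text: str, new_words: list[str]) -> str:
    """Append new_words as plain, flagless entries and refresh the count line.

    Existing entries (with their flags and morphological data) are kept
    untouched; words that are already present are skipped.
    """
    lines = dic_text.splitlines()
    entries = [line for line in lines[1:] if line.strip()]
    known = {bare_word(entry) for entry in entries}
    for word in new_words:
        word = word.strip()
        if word and word not in known:
            known.add(word)
            entries.append(word)
    # The first line of a .dic file holds the (approximate) number of entries.
    return "\n".join([str(len(entries))] + entries) + "\n"


if __name__ == "__main__":
    old = "2\nmouse po:n al:mice ts:nom\nmice po:n st:mouse is:plur\n"
    print(append_words(old, ["selfie", "mouse"]), end="")
```

The sketch ignores escaped slashes and multi-word entries, so it would need checking against the real en_*.dic files before use.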
(Note: an old, but still relevant mail about developing thesauri with stemming and affixation in English and in other languages:

---------- Forwarded message ---------
From: Németh László <xxx>
Date: Tue, Sep 28, 2010, 2:33
Subject: Re: [lingu-dev] Adding affixation to a thesaurus
To: <dev@lingucomponent.openoffice.org>

Hi,

[From my previous letters, with new links]:

The new stemming in the OpenOffice.org thesaurus works in most languages without spelling dictionary modification (for example, the word form "cats" has synonyms in English now), but morphological generation (for example, listing the synonym "kitties" instead of "kitty" for "cats" in English) and word forms without (real) stems need some new dictionary data. See issue 19563 (http://www.openoffice.org/issues/show_bug.cgi?id=19563), the Hunspell manual (https://sourceforge.net/projects/hunspell/files/Hunspell/Documentation/hunspell4.pdf, morphological analysis section), the morphological regression tests, the analyze tool and the new -s/-m options of the hunspell executable in the Hunspell distribution.

The standalone OpenOffice.org MyThes thesaurus has a configuration option to test your thesaurus with stemming and affixation:

https://sourceforge.net/projects/hunspell/files/MyThes/1.2.1/mythes-1.2.1.tar.gz

See README.NEW and README for compiling.
Test example

Make an input.txt file with two lines, "rodents" and "consumed", and run MyThes with the test dictionary:

./example morph.idx morph.dat input.txt morph.aff morph.dic
Thesaurus uses encoding ISO8859-1

stem: rodent
rodent has 1 meanings
   meaning 0: (n)
       mouse mice

stem: consume
consume has 1 meanings
   meaning 0: (v)
       eat eaten, ate ingested

The example Hunspell dictionary (meanings of the morphological fields:
po: part of speech category
ts: terminal suffix
al: allomorph
st: stem
is: inflectional suffix,
see http://sourceforge.net/docman/display_doc.php?docid=29374&group_id=143754#Morphological%20analysis):

$ cat morph.dic
8
rodent/S po:n ts:nom
mouse po:n al:mice ts:nom
mice po:n st:mouse is:plur
consume/TQD po:v ts:present
ingest/TQD po:v ts:present
eat/QT po:v al:ate al:eaten ts:present
ate po:v st:eat is:past_1
eaten po:v st:eat is:past_2

$ cat morph.aff
# example for morphological analysis, stemming and generation
SFX D Y 4
SFX D 0 ed [^e] is:past_1
SFX D 0 d e is:past_1
SFX D 0 ed [^e] is:past_2
SFX D 0 d e is:past_2

SFX S Y 1
SFX S 0 s . is:plur

SFX Q Y 1
SFX Q 0 s . is:sg_3

SFX T Y 2
SFX T 0 ing [^e] is:pr_part
SFX T e ing e is:pr_part

and the thesaurus (without any extra morphological information):

$ cat morph.dat
ISO8859-1
mouse|1
(n)|rodent
rodent|1
(n)|mouse
eat|1
(v)|consume|ingest
consume|1
(v)|eat|ingest
ingest|1
(v)|eat|consume

Regards,
László

2010/9/27 Andrea Pescetti <xxx>:
> Reading http://www.openoffice.org/issues/show_bug.cgi?id=114774 I
> understood that the OOo thesaurus support affixation, i.e., that if
> "river" admits "stream" as a synonym, then looking for a synonym of
> "rivers" will bring up "streams".
>
> Now, this never worked in the Italian thesaurus. Only the base form is
> proposed. I mean, if "piccolo" (Italian for "small") admits
> "limitato" (Italian for "limited") as a synonym, looking for synonyms of
> the plural form "piccoli" does not show the plural "limitati", but the
> base form "limitato".
> And this happens for all words, in OOo 3.2.1 too,
> where the English thesaurus has the affixation working and is unaffected
> by the issue mentioned above.
>
> It should thus be possible to improve the Italian thesaurus so that it
> supports affixation like the English one. Can anybody point me to some
> resources on how to do it? I had a look at
> http://lingucomponent.openoffice.org/thesaurus.html but I wasn't able to
> find an answer there.
>
> Thanks,
> Andrea Pescetti - Italian N-L Project Lead.)

For thesaurus development, the latest MyThes distribution with stemming and affixation:

https://sourceforge.net/projects/hunspell/files/MyThes/1.2.4/mythes-1.2.4.tar.gz
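To make the role of the st: field in the forwarded mail concrete, here is a rough Python sketch that reads the morph.dic and morph.dat data from the mail and looks up synonyms through the explicit stem. It is only an illustration under simplifying assumptions: it does not apply the affix rules from morph.aff (so it cannot stem "rodents") and it does not generate inflected synonyms, both of which Hunspell and MyThes handle in the real setup. The function names are hypothetical.

```python
MORPH_DIC = """\
8
rodent/S po:n ts:nom
mouse po:n al:mice ts:nom
mice po:n st:mouse is:plur
consume/TQD po:v ts:present
ingest/TQD po:v ts:present
eat/QT po:v al:ate al:eaten ts:present
ate po:v st:eat is:past_1
eaten po:v st:eat is:past_2
"""

MORPH_DAT = """\
ISO8859-1
mouse|1
(n)|rodent
rodent|1
(n)|mouse
eat|1
(v)|consume|ingest
consume|1
(v)|eat|ingest
ingest|1
(v)|eat|consume
"""


def parse_stems(dic_text):
    """Map each dictionary word to its stem (the st: field, else the word itself)."""
    stems = {}
    for line in dic_text.splitlines()[1:]:  # skip the entry-count line
        parts = line.split()
        if not parts:
            continue
        word = parts[0].split("/", 1)[0]
        stem = word
        for field in parts[1:]:
            if field.startswith("st:"):
                stem = field[3:]
        stems[word] = stem
    return stems


def parse_thesaurus(dat_text):
    """Parse 'entry|count' lines, each followed by count '(pos)|syn|...' lines."""
    lines = dat_text.splitlines()[1:]  # skip the encoding line
    thesaurus = {}
    i = 0
    while i < len(lines):
        entry, count = lines[i].split("|")
        meanings = []
        for line in lines[i + 1 : i + 1 + int(count)]:
            pos, *synonyms_ = line.split("|")
            meanings.append((pos, synonyms_))
        thesaurus[entry] = meanings
        i += 1 + int(count)
    return thesaurus


def synonyms(word, stems, thesaurus):
    """Look the word up through its stem; unknown words yield an empty list."""
    stem = stems.get(word, word)
    return [s for _pos, syns in thesaurus.get(stem, []) for s in syns]


STEMS = parse_stems(MORPH_DIC)
THESAURUS = parse_thesaurus(MORPH_DAT)

if __name__ == "__main__":
    print(synonyms("mice", STEMS, THESAURUS))  # ['rodent']
```

Without the `mice po:n st:mouse is:plur` line the lookup for "mice" finds nothing, which is the behaviour reported in this bug.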
@Németh

I had a quick look at your e-mail and at this report, and I don't understand much of it.

It is me who maintains the GB and ZA dictionaries.

On 1-JAN-2026, I will also maintain US+CA+AU (after one year preparing it).

Do I have to do anything special on them?

I just work on the .DIC and .AFF files.

Thanks!
(In reply to Marco A.G.Pinto from comment #3)
> @Németh
>
> Giving a quick look at your e-mail and here I don't understand it much.
>
> It is me who maintains the GB and ZA dictionaries.
>
> On 1-JAN-2026, I will also maintain US+CA+AU (after one year preparing it).
>
> Do I have to do anything special on them?
>
> I just work on the .DIC and .AFF files.
>
> Thanks!

Hi Marco,

Instead of adding new words to LibreOffice's English dictionaries, they were replaced with their old, incomplete versions (but with new words). I just realized that not only the thesaurus, but also the metaphone algorithm, i.e. the competitive English suggestion during spell checking, was disabled by mistake.

Reverting the following dictionary commits:

$ git log --oneline -- en_US.dic
4fa9419 Updated the English dictionaries: GB+US+CA+AU+ZA
4fb0103 (tag: libreoffice-6-4-branch-point) Updated the English dictionaries: GB+US+CA+AU
605e1d1 Updated the English dictionaries: GB+AU+CA+US+Extension logo
dbcea2a English dictionaries: add ref to package-description.txt
6feecdc (hu_update) Updated the English dictionaries: GB + US + CA + AU
66a5dd1 Update English dictionaries
c875ba1 tdf#97393, tdf#100019: updated EN (CA, GB, US, ZA) dictionaries
2f0ddae tdf#97393 Update English Dictionaries to 2016.05.01 release

and adding the new words to the end of the .dic files (without flags, i.e. in uncompressed format, simply words) will fix these regressions immediately, keeping your great work, too!

I'd love to do it too, if only to save my thesaurus work for the community (I also integrated Bjoern Jacke's metaphone code with Hunspell), so I'll try to get support for it.

Best regards,
László
(In reply to László Németh from comment #4)

László,

Please, I am perplexed, should I continue to release updates for the dictionaries (and commit them to Gerrit)?

Notice that both GB and ZA have the same .AFF, with just the language changed.
My next commit to Gerrit will be in October, to give me time to work on the US+CA+AU for 2026.

On 1-JAN-2026, it will be one year since I started working on my version of the U.S. dictionary and also adding words for my versions of CA and AU.

Thanks!
> should I continue to release updates for the dictionaries (and commit them to Gerrit)?

We shouldn't create regressions again after fixing this issue, so we definitely must change the process. My original idea was automation (see the scripts attached to ooo#19563, http://www.openoffice.org/issues/show_bug.cgi?id=19563), but now I suggest only extending the en_US etc. .dic files instead of replacing them. This would be much faster: a git revert, then appending a plain text file to the end of the .dic files. I can also imagine automating this latter solution, so you don't need to change anything.
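One way the changed process could guard against the same regression is a small check run before each dictionary update is committed. The following is only a sketch under assumptions: it looks for the five field tags named in the forwarded mail (po, st, al, ts, is), and the function names are hypothetical rather than existing LibreOffice tooling.

```python
import re

# Morphological field tags named in the forwarded mail; a field follows the
# word (and optional /FLAGS) after whitespace, e.g. "mice po:n st:mouse".
MORPH_FIELD = re.compile(r"\s(?:po|st|al|ts|is):\S")


def morph_entry_count(dic_text: str) -> int:
    """Count .dic entries that carry at least one morphological field."""
    return sum(
        1 for line in dic_text.splitlines()[1:] if MORPH_FIELD.search(line)
    )


def loses_morphology(old_dic: str, new_dic: str) -> bool:
    """True if the updated file has fewer annotated entries than the old one."""
    return morph_entry_count(new_dic) < morph_entry_count(old_dic)


if __name__ == "__main__":
    old = "2\nmouse po:n al:mice ts:nom\nmice po:n st:mouse is:plur\n"
    replaced = "3\nmouse\nmice\nselfie\n"
    extended = old + "selfie\n"
    print(loses_morphology(old, replaced))  # True
    print(loses_morphology(old, extended))  # False
```

An update that replaces the file with plain words is flagged, while one that only appends words to the existing entries passes.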