167649 – Broken stemming and affixing in English thesaurus by losing morphological data of English spelling dictionaries


Status:	NEW

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Linguistic
Version: (earliest affected)	5.2.0.4 release
Hardware:	All All

Importance:	medium major
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:	regression

Depends on:
Blocks:

Reported:	2025-07-23 12:43 UTC by László Németh
Modified:	2025-07-24 12:58 UTC
CC List:	3 users

See Also:
Crash report or crash signature:


Description László Németh 2025-07-23 12:43:55 UTC

Description:
Commit 2f0ddaeeb4323ac99afd35d2c4fda643c9ee8bcf
"tdf#97393 Update English Dictionaries to 2016.05.01 release" completely removed morphological data from the English spelling dictionaries, resulting in broken stemming and affixing in the English thesaurus.


Steps to Reproduce:
Ask for the synonyms of "mice" (plural form of "mouse").

Actual Results:
No synonyms for the word "mice".

Expected Results:
Stemming "mice" to "mouse", suggesting affixed synonyms, e.g. "rodents".


Reproducible: Always


User Profile Reset: No

Additional Info:
See Bug 97393, the origin of the regression.
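
A quick way to see whether a given .dic file still carries morphological data is to count the entries that have fields such as st:, al: or is:. The following is only a minimal sketch: the function name and the sample strings are made up for illustration, and the regular expression assumes the two-letter "xx:value" field form used in the example dictionary of comment 2.

```python
import re

# Assumption: morphological fields look like "st:mouse", "al:mice", "is:plur",
# i.e. two lowercase letters, a colon and a value, separated by whitespace.
MORPH_FIELD = re.compile(r"(?:^|\s)[a-z]{2}:\S+")

def count_morph_entries(dic_text):
    """Count dictionary entries that carry at least one morphological field."""
    entries = dic_text.splitlines()[1:]  # skip the entry-count header line
    return sum(1 for line in entries if MORPH_FIELD.search(line))

# Hypothetical miniature dictionaries, with and without morphological data.
sample_with = "3\nmouse po:n al:mice\nmice po:n st:mouse is:plur\nrodent/S\n"
sample_without = "3\nmouse\nmice\nrodent/S\n"

print(count_morph_entries(sample_with))     # 2
print(count_morph_entries(sample_without))  # 0
```

A count of zero on a real dictionary would be consistent with the data loss described above, although it does not by itself prove which commit removed the fields.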

Comment 1 László Németh 2025-07-23 15:03:44 UTC

I suggest reverting the dictionary changes and appending the new words and their word forms to the end of the original .dic file. This would keep the results of both the past and the new dictionary development while fixing the serious regression.

Comment 2 László Németh 2025-07-23 15:17:15 UTC

(Note: an old but still relevant mail about developing thesauri with stemming and affixation in English and in other languages:

---------- Forwarded message ---------
From: Németh László <xxx>
Date: Tue, Sep 28, 2010, 2:33
Subject: Re: [lingu-dev] Adding affixation to a thesaurus
To: <dev@lingucomponent.openoffice.org>


Hi,

[From my previous letters, with new links]:

The new stemming in OpenOffice.org thesaurus works in most languages
without spelling dictionary modification (for example, the word form
"cats" has synonyms in English now), but for morphological generation
(for example, listing "kitties" synonym instead of "kitty" for "cats"
in English) and word forms without (real) stems need some new
dictionary data. See the issue 19563
(http://www.openoffice.org/issues/show_bug.cgi?id=19563), the Hunspell
manual (https://sourceforge.net/projects/hunspell/files/Hunspell/Documentation/hunspell4.pdf,
morphological analysis section), and the
morphological regression tests, analyze tool and new -s/-m options of
the hunspell executable in the Hunspell distribution.

The standalone OpenOffice.org MyThes thesaurus
has a configuration option to test your thesaurus with stemming and affixation:
https://sourceforge.net/projects/hunspell/files/MyThes/1.2.1/mythes-1.2.1.tar.gz

See README.NEW and README for compiling.

Test example

Make an input.txt file with two lines, "rodents" and "consumed", and
run MyThes with the
test dictionary:
./example morph.idx morph.dat input.txt morph.aff morph.dic

Thesaurus uses encoding ISO8859-1

stem: rodent
rodent has 1 meanings
   meaning 0: (n) mouse
       mice

stem: consume
consume has 1 meanings
   meaning 0: (v) eat
       eaten, ate
       ingested

The example Hunspell dictionary (meanings of the morphological fields:
po: part of speech category
ts: terminal suffix
al: allomorph
st: stem
is: inflectional suffix, see
http://sourceforge.net/docman/display_doc.php?docid=29374&group_id=143754#Morphological%20analysis):

$ cat morph.dic
8
rodent/S        po:n        ts:nom
mouse   po:n    al:mice ts:nom
mice    po:n st:mouse        is:plur
consume/TQD     po:v ts:present
ingest/TQD      po:v ts:present
eat/QT  po:v    al:ate  al:eaten        ts:present
ate     po:v    st:eat  is:past_1
eaten   po:v    st:eat  is:past_2

$ cat morph.aff
# example for morphological analysis, stemming and generation
SFX D Y 4
SFX D   0 ed [^e] is:past_1
SFX D   0 d e     is:past_1
SFX D   0 ed [^e] is:past_2
SFX D   0 d e     is:past_2

SFX S Y 1
SFX S   0 s . is:plur

SFX Q Y 1
SFX Q   0 s . is:sg_3

SFX T Y 2
SFX T   0 ing [^e] is:pr_part
SFX T   e ing e    is:pr_part

and the thesaurus (without any extra morphological information):

$ cat morph.dat
ISO8859-1
mouse|1
(n)|rodent
rodent|1
(n)|mouse
eat|1
(v)|consume|ingest
consume|1
(v)|eat|ingest
ingest|1
(v)|eat|consume

Regards,
László


2010/9/27 Andrea Pescetti <xxx>:
> Reading http://www.openoffice.org/issues/show_bug.cgi?id=114774 I
> understood that the OOo thesaurus support affixation, i.e., that if
> "river" admits "stream" as a synonym, then looking for a synonym of
> "rivers" will bring up "streams".
>
> Now, this never worked in the Italian thesaurus. Only the base form is
> proposed. I mean, if "piccolo" (Italian for "small") admits
> "limitato" (Italian for "limited") as a synonym, looking for synonyms of
> the plural form "piccoli" does not show the plural "limitati", but the
> base form "limitato". And this happens for all words, in OOo 3.2.1 too,
> where the English thesaurus has the affixation working and is unaffected
> by the issue mentioned above.
>
> It should thus be possible to improve the Italian thesaurus so that it
> supports affixation like the English one. Can anybody point me to some
> resources on how to do it? I had a look at
> http://lingucomponent.openoffice.org/thesaurus.html but I wasn't able to
> find an answer there.
>
> Thanks,
>  Andrea Pescetti - Italian N-L Project Lead.)

For thesaurus development, the latest MyThes distribution with stemming and affixation: https://sourceforge.net/projects/hunspell/files/MyThes/1.2.4/mythes-1.2.4.tar.gz
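
The lookup described in the forwarded mail (stem the word, look up the stem in the thesaurus, then re-inflect each synonym) can be sketched roughly as follows. This is a toy model, not the MyThes or Hunspell API: the names IRREGULAR, THESAURUS, analyze, generate and synonyms are hypothetical, and only the "mouse"/"rodent" entries and the plural rule "SFX S 0 s ." from the example files are modelled.

```python
# Toy model only; none of these names exist in MyThes or Hunspell.
# Irregular form from morph.dic: surface form -> (stem, inflection tag),
# i.e. the information stored in the "st:" and "is:" fields.
IRREGULAR = {"mice": ("mouse", "plur")}

# Thesaurus entries from morph.dat (stems only, no morphological data).
THESAURUS = {"mouse": ["rodent"], "rodent": ["mouse"]}

def analyze(word):
    """Return (stem, tag) for an irregular form or a regular plural in -s."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("s") and word[:-1] in THESAURUS:
        return word[:-1], "plur"
    return word, None

def generate(stem, tag):
    """Inflect stem with the given tag, preferring a listed irregular form."""
    if tag is None:
        return stem
    for form, (form_stem, form_tag) in IRREGULAR.items():
        if (form_stem, form_tag) == (stem, tag):
            return form
    return stem + "s" if tag == "plur" else stem

def synonyms(word):
    """Stem the word, look up the stem, and re-inflect every synonym."""
    stem, tag = analyze(word)
    return [generate(s, tag) for s in THESAURUS.get(stem, [])]

print(synonyms("mice"))     # ['rodents']
print(synonyms("rodents"))  # ['mice']
```

In this model, emptying IRREGULAR (that is, dropping the st:/is: data) makes synonyms("mice") return an empty list, which mirrors the behaviour reported in this bug.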

Comment 3 Marco A.G.Pinto 2025-07-24 03:20:12 UTC

@Németh

Having taken a quick look at your e-mail and at this report, I don't understand much of it.

It is me who maintains the GB and ZA dictionaries.

On 1-JAN-2026, I will also maintain US+CA+AU (after one year preparing it).

Do I have to do anything special on them?

I just work on the .DIC and .AFF files.

Thanks!

Comment 4 László Németh 2025-07-24 10:09:01 UTC

(In reply to Marco A.G.Pinto from comment #3)

Hi Marco,

Instead of new words being added to LibreOffice's English dictionaries, the dictionaries were replaced with old, incomplete versions (albeit with new words). I just realized that not only the thesaurus is affected: the metaphone algorithm, i.e. the competitive English suggestion during spell checking, was also disabled by mistake.

Reverting the following dictionary commits:

$ git log --oneline -- en_US.dic
4fa9419 Updated the English dictionaries: GB+US+CA+AU+ZA
4fb0103 (tag: libreoffice-6-4-branch-point) Updated the English dictionaries: GB+US+CA+AU
605e1d1 Updated the English dictionaries: GB+AU+CA+US+Extension logo
dbcea2a English dictionaries: add ref to package-description.txt
6feecdc (hu_update) Updated the English dictionaries: GB + US + CA + AU
66a5dd1 Update English dictionaries
c875ba1 tdf#97393, tdf#100019: updated EN (CA, GB, US, ZA) dictionaries
2f0ddae tdf#97393 Update English Dictionaries to 2016.05.01 release

And appending the new words to the end of the .dic files (without flags, i.e. as plain words in uncompressed format) will fix these regressions immediately, while keeping your great work, too!

I'd love to do it too, if only to save my thesaurus work for the community (also I integrated Bjoern Jacke's metaphone code with Hunspell), so I'll try to get support for it.

Best regards,
László

Comment 5 Marco A.G.Pinto 2025-07-24 10:31:47 UTC

(In reply to László Németh from comment #4)

László,

I am perplexed: should I continue to release updates for the dictionaries (and commit them to Gerrit)?

Notice that both GB and ZA have the same .AFF, with just the language changed.

My next commit to Gerrit will be in October to give me time to work on the US+CA+AU for 2026.

On 1-JAN-2026, it will be one year since I started working on my version of the U.S. dictionary and also adding words for my versions of CA and AU.

Thanks!

Comment 6 László Németh 2025-07-24 12:58:15 UTC

> should I continue to release updates for the dictionaries (and commit them to Gerrit)?

We shouldn't create regressions again after fixing this issue, so we definitely must change the process. My original idea was automation (see the scripts attached to ooo#19563, http://www.openoffice.org/issues/show_bug.cgi?id=19563), but now I suggest only extending the en_US etc. .dic files instead of replacing them (this would be much faster: a git revert, then appending a plain text file to the end of the .dic files). I can also imagine automating this latter solution, so you wouldn't need to change anything.
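
The appending step could be automated along these lines. This is a minimal sketch under stated assumptions, not an agreed process: the function name append_words and the sample data are hypothetical, entries are assumed to start with the word followed by "/" or whitespace, and rewriting the count on the first line is a choice of this sketch (Hunspell treats that number as approximate).

```python
import re

def append_words(dic_text, new_words):
    """Append plain (flagless) words to a .dic body, skipping known words."""
    lines = dic_text.splitlines()
    entries = [line for line in lines[1:] if line.strip()]
    # Headword = text before the first "/" (flags) or whitespace (morph fields).
    known = {re.split(r"[/\s]", entry, maxsplit=1)[0] for entry in entries}
    for word in new_words:
        if word not in known:
            entries.append(word)
            known.add(word)
    # Assumption of this sketch: refresh the entry count on the first line.
    return "\n".join([str(len(entries))] + entries) + "\n"

# Hypothetical reverted dictionary with morphological data intact.
old = "2\nmouse po:n al:mice\nmice po:n st:mouse is:plur\n"
print(append_words(old, ["mice", "selfie"]), end="")
# 3
# mouse po:n al:mice
# mice po:n st:mouse is:plur
# selfie
```

Existing entries are left untouched, so their flags and morphological fields survive; only words that are not yet present are added at the end.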