Bug 167649 - Broken stemming and affixing in English thesaurus by losing morphological data of English spelling dictionaries (also broken spelling suggestions by removing phonetic rules)
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Linguistic
Version: 5.2.0.4 release (earliest affected)
Hardware: All All
Importance: medium major
Assignee: Not Assigned
URL: https://lists.freedesktop.org/archive...
Whiteboard:
Keywords: bibisectRequest, regression
Depends on:
Blocks: Spell-Checking Thesaurus Dictionaries
Reported: 2025-07-23 12:43 UTC by László Németh
Modified: 2025-12-18 08:32 UTC (History)
CC List: 8 users

See Also:
Crash report or crash signature:


Attachments
The dictionary and affix file format for Hunspell support (41.71 KB, application/pdf)
2025-08-01 21:47 UTC, V Stuart Foote
Nemeth's script from ooo - 2008 (5.63 KB, application/x-zip-compressed)
2025-09-30 19:24 UTC, Marco A.G.Pinto
.AFF for five variants on 1-JAN-2026 (47.48 KB, application/octet-stream)
2025-11-15 18:03 UTC, Marco A.G.Pinto

Description László Németh 2025-07-23 12:43:55 UTC
Description:
Commit 2f0ddaeeb4323ac99afd35d2c4fda643c9ee8bcf
"tdf#97393 Update English Dictionaries to 2016.05.01 release" completely removed morphological data from the English spelling dictionaries, resulting in broken stemming and affixing in the English thesaurus.


Steps to Reproduce:
Ask for the synonyms of "mice" (plural form of "mouse").

Actual Results:
No synonyms for the word "mice".

Expected Results:
Stemming "mice" to "mouse", suggesting affixed synonyms, e.g. "rodents".


Reproducible: Always


User Profile Reset: No

Additional Info:
See Bug 97393, the origin of the regression.
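
A quick way to confirm whether a given .dic file still carries morphological data is to count the entries that have fields such as `st:` (stem) or `is:` (inflectional suffix). The following is only a minimal sketch: the helper names and sample lines are hypothetical, and it assumes fields are whitespace-separated tokens after the word/flags column.

```python
import re

# A morphological field is a two-letter ID, a colon, and a value (e.g. "st:mouse").
# Assumption: fields are whitespace-separated and follow the word/flags column.
MORPH_FIELD = re.compile(r"^[a-z]{2}:\S+$")

def morph_fields(dic_line):
    """Return the morphological fields found on one .dic entry line."""
    return [part for part in dic_line.split()[1:] if MORPH_FIELD.match(part)]

def count_annotated(dic_lines):
    """Count entries that carry at least one morphological field."""
    entries = dic_lines[1:]  # the first line of a .dic file holds the entry count
    return sum(1 for line in entries if morph_fields(line))

# Hypothetical samples: one annotated list and one plain list with flags only
# (the "MS" flags are made up for illustration).
annotated = ["2", "mouse po:n al:mice ts:nom", "mice po:n st:mouse is:plur"]
plain = ["2", "mouse/MS", "mice"]
print(count_annotated(annotated), count_annotated(plain))
```

A count of zero on a real en_US.dic would be consistent with the regression described here, although multi-word entries and escaped slashes are not handled by this sketch.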
Comment 1 László Németh 2025-07-23 15:03:44 UTC
I suggest reverting the dictionary changes and appending the new words and their word forms to the end of the original .dic file. This keeps the results of both the past and the new dictionary development, and fixes the serious regression.
Comment 2 László Németh 2025-07-23 15:17:15 UTC
(Note: an old but still relevant mail about developing thesauri with stemming and affixation in English and in other languages:

---------- Forwarded message ---------
From: Németh László <xxx>
Date: Tue, Sep 28, 2010, 2:33
Subject: Re: [lingu-dev] Adding affixation to a thesaurus
To: <dev@lingucomponent.openoffice.org>


Hi,

[From my previous letters, with new links]:

The new stemming in OpenOffice.org thesaurus works in most languages
without spelling dictionary modification (for example, the word form
"cats" has synonyms in English now), but for morphological generation
(for example, listing "kitties" synonym instead of "kitty" for "cats"
in English) and word forms without (real) stems need some new
dictionary data. See issue 19563
(http://www.openoffice.org/issues/show_bug.cgi?id=19563), the Hunspell
manual (https://sourceforge.net/projects/hunspell/files/Hunspell/Documentation/hunspell4.pdf,
morphological analysis section), and the
morphological regression tests, analyze tool and new -s/-m options of
the hunspell executable in the Hunspell distribution.

The standalone OpenOffice.org MyThes thesaurus
has a configuration option to test your thesaurus with stemming and affixation:
https://sourceforge.net/projects/hunspell/files/MyThes/1.2.1/mythes-1.2.1.tar.gz

See README.NEW and README for compiling.

Test example

Make an input.txt file with two lines, "rodents" and "consumed", and
run MyThes with the
test dictionary:
./example morph.idx morph.dat input.txt morph.aff morph.dic

Thesaurus uses encoding ISO8859-1

stem: rodent
rodent has 1 meanings
   meaning 0: (n) mouse
       mice

stem: consume
consume has 1 meanings
   meaning 0: (v) eat
       eaten, ate
       ingested

The example Hunspell dictionary (meanings of the morphological fields:
po: part of speech category
ts: terminal suffix
al: allomorph
st: stem
is: inflectional suffix, see
http://sourceforge.net/docman/display_doc.php?docid=29374&group_id=143754#Morphological%20analysis):

$ cat morph.dic
8
rodent/S        po:n        ts:nom
mouse   po:n    al:mice ts:nom
mice    po:n st:mouse        is:plur
consume/TQD     po:v ts:present
ingest/TQD      po:v ts:present
eat/QT  po:v    al:ate  al:eaten        ts:present
ate     po:v    st:eat  is:past_1
eaten   po:v    st:eat  is:past_2

$ cat morph.aff
# example for morphological analysis, stemming and generation
SFX D Y 4
SFX D   0 ed [^e] is:past_1
SFX D   0 d e     is:past_1
SFX D   0 ed [^e] is:past_2
SFX D   0 d e     is:past_2

SFX S Y 1
SFX S   0 s . is:plur

SFX Q Y 1
SFX Q   0 s . is:sg_3

SFX T Y 2
SFX T   0 ing [^e] is:pr_part
SFX T   e ing e    is:pr_part

and the thesaurus (without any extra morphological information):

$ cat morph.dat
ISO8859-1
mouse|1
(n)|rodent
rodent|1
(n)|mouse
eat|1
(v)|consume|ingest
consume|1
(v)|eat|ingest
ingest|1
(v)|eat|consume

Regards,
László


2010/9/27 Andrea Pescetti <xxx>:
> Reading http://www.openoffice.org/issues/show_bug.cgi?id=114774 I
> understood that the OOo thesaurus support affixation, i.e., that if
> "river" admits "stream" as a synonym, then looking for a synonym of
> "rivers" will bring up "streams".
>
> Now, this never worked in the Italian thesaurus. Only the base form is
> proposed. I mean, if "piccolo" (Italian for "small") admits
> "limitato" (Italian for "limited") as a synonym, looking for synonyms of
> the plural form "piccoli" does not show the plural "limitati", but the
> base form "limitato". And this happens for all words, in OOo 3.2.1 too,
> where the English thesaurus has the affixation working and is unaffected
> by the issue mentioned above.
>
> It should thus be possible to improve the Italian thesaurus so that it
> supports affixation like the English one. Can anybody point me to some
> resources on how to do it? I had a look at
> http://lingucomponent.openoffice.org/thesaurus.html but I wasn't able to
> find an answer there.
>
> Thanks,
>  Andrea Pescetti - Italian N-L Project Lead.)

For thesaurus development, the latest MyThes distribution with stemming and affixation: https://sourceforge.net/projects/hunspell/files/MyThes/1.2.4/mythes-1.2.4.tar.gz
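
The lookup described in the mail above (stem the input word, find synonyms of the stem, then re-inflect each synonym) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Hunspell or MyThes are implemented: it reuses the morph.dic, morph.aff and morph.dat data from the mail, drops the SFX header lines, supports only single-character flags and suffix rules, and all function names are hypothetical.

```python
import re

DIC = """\
rodent/S po:n ts:nom
mouse po:n al:mice ts:nom
mice po:n st:mouse is:plur
consume/TQD po:v ts:present
ingest/TQD po:v ts:present
eat/QT po:v al:ate al:eaten ts:present
ate po:v st:eat is:past_1
eaten po:v st:eat is:past_2
"""
# Suffix rules only; header lines such as "SFX D Y 4" are omitted for brevity.
AFF = """\
SFX D 0 ed [^e] is:past_1
SFX D 0 d e is:past_1
SFX D 0 ed [^e] is:past_2
SFX D 0 d e is:past_2
SFX S 0 s . is:plur
SFX Q 0 s . is:sg_3
SFX T 0 ing [^e] is:pr_part
SFX T e ing e is:pr_part
"""
THES = {"mouse": ["rodent"], "rodent": ["mouse"], "eat": ["consume", "ingest"],
        "consume": ["eat", "ingest"], "ingest": ["eat", "consume"]}

def field_value(fields, key):
    """Return the value of the first field starting with key, or None."""
    return next((f[len(key):] for f in fields if f.startswith(key)), None)

def parse_dic(text):
    entries = {}
    for line in text.splitlines():
        head, *fields = line.split()
        word, _, flags = head.partition("/")
        entries[word] = (flags, fields)
    return entries

def parse_aff(text):
    rules = []
    for line in text.splitlines():
        _, flag, strip, add, cond, field = line.split()
        strip = "" if strip == "0" else strip
        rules.append((flag, strip, add, re.compile(cond + "$"), field[len("is:"):]))
    return rules

ENTRIES, RULES = parse_dic(DIC), parse_aff(AFF)

def analyze(word):
    """Return (stem, inflection) pairs; inflection is None for a base form."""
    found = []
    if word in ENTRIES:
        fields = ENTRIES[word][1]
        found.append((field_value(fields, "st:") or word, field_value(fields, "is:")))
    for flag, strip, add, cond, infl in RULES:
        if word.endswith(add):
            base = word[: len(word) - len(add)] + strip
            if base in ENTRIES and flag in ENTRIES[base][0] and cond.search(base):
                found.append((base, infl))
    return found

def generate(stem, infl):
    """Inflect stem; irregular forms (st: + is: entries) win over suffix rules."""
    if infl is None:
        return [stem]
    irregular = [w for w, (_, fields) in ENTRIES.items()
                 if field_value(fields, "st:") == stem
                 and field_value(fields, "is:") == infl]
    if irregular:
        return irregular
    flags = ENTRIES[stem][0] if stem in ENTRIES else ""
    forms = []
    for flag, strip, add, cond, rule_infl in RULES:
        if flag in flags and rule_infl == infl and cond.search(stem):
            base = stem[: len(stem) - len(strip)] if strip else stem
            forms.append(base + add)
    return forms

def synonyms(word):
    """Map each synonym stem to its forms inflected like the input word."""
    result = {}
    for stem, infl in analyze(word):
        for syn in THES.get(stem, []):
            forms = result.setdefault(syn, [])
            forms.extend(f for f in generate(syn, infl) if f not in forms)
    return result

print(synonyms("mice"), synonyms("rodents"), synonyms("consumed"))
```

In this model, the `st:` and `is:` fields on "mice" are what let the lookup reach "mouse" and then produce "rodents"; without them, "mice" has no analysis that leads to a thesaurus entry.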
Comment 3 Marco A.G.Pinto 2025-07-24 03:20:12 UTC
@Németh

Having taken a quick look at your e-mail and this report, I don't understand much of it.

I am the one who maintains the GB and ZA dictionaries.

On 1-JAN-2026, I will also maintain US+CA+AU (after one year of preparing them).

Do I have to do anything special on them?

I just work on the .DIC and .AFF files.

Thanks!
Comment 4 László Németh 2025-07-24 10:09:01 UTC
(In reply to Marco A.G.Pinto from comment #3)

Hi Marco,

Instead of adding new words to LibreOffice's English dictionaries, they were replaced with their old, incomplete versions (but with new words). I just realized that not only the thesaurus was affected: the metaphone algorithm, i.e. the competitive English suggestions during spell checking, was also disabled by mistake.

Reverting the following dictionary commits:

$ git log --oneline -- en_US.dic
4fa9419 Updated the English dictionaries: GB+US+CA+AU+ZA
4fb0103 (tag: libreoffice-6-4-branch-point) Updated the English dictionaries: GB+US+CA+AU
605e1d1 Updated the English dictionaries: GB+AU+CA+US+Extension logo
dbcea2a English dictionaries: add ref to package-description.txt
6feecdc (hu_update) Updated the English dictionaries: GB + US + CA + AU
66a5dd1 Update English dictionaries
c875ba1 tdf#97393, tdf#100019: updated EN (CA, GB, US, ZA) dictionaries
2f0ddae tdf#97393 Update English Dictionaries to 2016.05.01 release

And adding the new words to the end of the .dic files (without flags, i.e. in uncompressed format, simply as words) will fix these regressions immediately, while keeping your great work, too!

I'd love to do it too, if only to save my thesaurus work for the community (also I integrated Bjoern Jacke's metaphone code with Hunspell), so I'll try to get support for it.

Best regards,
László
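
The "append the new words" step proposed above could look roughly like the following. This is a sketch under assumptions, not the actual procedure used for LibreOffice: the function name and sample data are hypothetical, and it only compares new words against the word column of existing entries, so a form already generated by an affix flag would still be appended as a redundant plain word.

```python
def append_words(dic_lines, new_words):
    """Append plain words missing from a .dic and refresh its count line.

    dic_lines: lines of the restored .dic, the first line being the entry count.
    new_words: flagless words to add at the end (duplicates are skipped).
    """
    entries = dic_lines[1:]
    # Word column only: drop "/FLAGS" and any morphological fields.
    known = {line.split("/")[0].split()[0] for line in entries if line.strip()}
    added = []
    for word in new_words:
        if word not in known:
            known.add(word)
            added.append(word)
    merged = entries + added
    return [str(len(merged))] + merged

# Hypothetical example: the old annotated entries stay untouched at the top.
old = ["3", "rodent/S po:n ts:nom", "mouse po:n al:mice ts:nom",
       "mice po:n st:mouse is:plur"]
print("\n".join(append_words(old, ["mice", "hamster", "hamster"])))
```

Because the original entries are kept verbatim, their morphological fields survive, and only the count on the first line and the tail of the file change.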
Comment 5 Marco A.G.Pinto 2025-07-24 10:31:47 UTC
(In reply to László Németh from comment #4)

László,

I am perplexed: should I continue to release updates for the dictionaries (and commit them to Gerrit)?

Notice that both GB and ZA have the same .AFF, with just the language changed.

My next commit to Gerrit will be in October to give me time to work on the US+CA+AU for 2026.

On 1-JAN-2026, it will be one year since I started working on my version of the U.S. dictionary and adding words to my versions of CA and AU.

Thanks!
Comment 6 László Németh 2025-07-24 12:58:15 UTC
> should I continue to release updates for the dictionaries (and commit them to Gerrit)?

We shouldn't create regressions again after fixing this issue, so we definitely must change the process. My original idea was automation (see the attached scripts of ooo#19563: http://www.openoffice.org/issues/show_bug.cgi?id=19563), but now I suggest only extending the en_US etc. .dic files instead of replacing them (this would be much faster: a git revert, then appending a plain text file to the end of the .dic files). I can also imagine automating this last solution, so you don't need to change anything.
Comment 7 V Stuart Foote 2025-08-01 21:47:23 UTC
Created attachment 202142 [details]
The dictionary and affix file format for Hunspell support

The attachment documents the dictionary file format and the annotations needed for the integrated Hunspell dictionary and thesaurus support.
Comment 8 Buovjaga 2025-09-30 19:12:36 UTC
(In reply to László Németh from comment #6)
> > should I continue to release updates for the dictionaries (and commit them to Gerrit)?
> 
> We shouldn't create regressions again after fixing this issue, so we must
> change the process definitely. My original idea was the automation (see the
> attached scripts of ooo#19563 –
> http://www.openoffice.org/issues/show_bug.cgi?id=19563)

AOO Bugzilla is no longer accessible to unregistered users and registration is disabled. Can you please attach the scripts to this report?
Comment 9 Marco A.G.Pinto 2025-09-30 19:24:49 UTC
Created attachment 203057 [details]
Nemeth's script from ooo - 2008

Here is Nemeth's script from 2008.
Comment 10 Shantanu 2025-10-16 11:44:51 UTC
Please postpone your plans of "taking over" the en_US word list and add the morphological data to your current list.

For example, you have the word "best" in your list:
best/SGD

Just append the additional information about the word to the entry. The root word of "best" is "good". It helps if you establish that link using the st: tag, like this:
best/SGD st:good

By adding the metadata about the word, it also helps in other areas, such as the thesaurus, as explained by László Németh.
Comment 11 Marco A.G.Pinto 2025-10-16 20:55:31 UTC
Not even Kevin's dictionaries have that morphological information in them:
https://github.com/en-wl/wordlist
http://wordlist.aspell.net

Kevin is the “Grandfather” of the English dictionaries.

If that is what it takes, I will pay Collabora to implement what I suggested in the e-mail to the admins, but I can only pay in December, which is when I receive the double Christmas pension (Christmas complement).
Comment 12 Shantanu 2025-10-17 07:59:41 UTC
Kevin's wordlist is not designed exclusively for Hunspell. It is also used by Aspell, which does not support morphological tags such as ph: or st:.
These tags can be found even in low-resource languages like Marathi or Thai. Using the ph: tag helps generate more accurate suggestions in cases where Hunspell's logic falls short. This metadata would also be valuable for those studying the .dic and .aff files to understand a language.

If you still have access to the older wordlist, it would be straightforward to compare it with your current list and incorporate the additional lexical information. Use this AWK script:
https://gist.github.com/shantanuo/4383fe10ac218e1d57ae4082db2840c3

Paying Collabora is a very expensive option that I will not even consider. I do not think that your Christmas pension would be sufficient to cover that.
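
For readers who prefer Python, the comparison described above (carrying morphological fields from the older word list over to the current one) might be sketched as follows. This is not the linked AWK script, whose contents are not reproduced here; it is an independent illustration with hypothetical names and data. It matches entries by word only, keeps the current list's flags, and expects entry lines without the leading count line.

```python
def split_entry(line):
    """Split a .dic entry into (word, word-with-flags, morphological fields)."""
    head, *fields = line.split()
    return head.partition("/")[0], head, fields

def carry_over_fields(old_lines, new_lines):
    """Copy fields such as st: or ph: from old entries onto matching new ones."""
    old_fields = {}
    for line in old_lines:
        if line.strip():
            word, _, fields = split_entry(line)
            if fields:
                # Assumption: the first annotated entry for a word wins.
                old_fields.setdefault(word, fields)
    merged = []
    for line in new_lines:
        if not line.strip():
            merged.append(line)
            continue
        word, head, fields = split_entry(line)
        extra = [f for f in old_fields.get(word, []) if f not in fields]
        merged.append(" ".join([head] + fields + extra))
    return merged

# Hypothetical data: the flags ("MS", "SGD") are illustrative only.
old = ["mouse/MS po:n al:mice", "mice po:n st:mouse is:plur", "best/SGD st:good"]
new = ["mouse/MS", "mice", "best/SGD", "hamster/MS"]
print("\n".join(carry_over_fields(old, new)))
```

Words that exist only in the current list pass through unchanged, so the merge never removes entries; homonyms that need different fields per entry would require a finer key than the bare word.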
Comment 13 Marco A.G.Pinto 2025-10-26 09:21:39 UTC
Heya, all,

I was going to check Shantanu's script, so I fetched the English dictionaries from Gerrit using Cygwin:
   cd ~/lode/dev/core/dictionaries
   git checkout master
   git pull -r

The dictionaries Németh reverted have no morphological information in them; they look like the clear-text ones I committed many months ago.

Does this mean that on 1-JAN, I can go back to committing normally (clear text ones)?

I did notice some LightProof files and the like in the folder.


Thanks!
Comment 14 Marco A.G.Pinto 2025-11-15 18:03:47 UTC
Created attachment 204009 [details]
.AFF for five variants on 1-JAN-2026

@Németh,

Regarding the missing phonetics issue, all the five .AFFs will have it.

See the attached .AFF, which is how the new .AFF files will look on 1-JAN-2026.

Only the name of the variant, the version and the -ise/-ize/both setting will be changed in the .AFFs.
Comment 15 László Németh 2025-11-15 18:42:53 UTC
(In reply to Marco A.G.Pinto from comment #14)

@Marco: Thanks! Because of the Unicode encoding of the dictionaries, it's possible that putting back the phonetic code is not enough to fix the regression in spelling suggestions, but I'm going to check it.
Comment 16 Kevin Atkinson 2025-12-12 10:19:03 UTC
Hi,

Sorry if this is off-topic, but I am the current maintainer of the US/CA/AU Hunspell dictionaries, and there were a few things I thought needed to be cleared up.

I want to clarify first that the project is not abandoned, nor have I stepped down.  I realize that the lack of updates may make it look that way, but I have been maintaining SCOWL for over 25 years now.  Even after long periods of inactivity, I always manage to swing back around and create new releases.  I also realize that many users are frustrated by my lack of responsiveness when requesting new words; however, I now have plans to address those issues.  I hope to create a new release sometime in early 2026.  For details on my future plans, see: https://github.com/en-wl/wordlist/issues/394

The next release will move the project to a new format.  This new format stores words as lemmas with POS and all derived forms (including irregular ones).  This will take some work, but the new information can likely help provide better-quality morphological data.  I am actively looking for someone to help with this: https://github.com/en-wl/wordlist/discussions/432

The new release will also make it possible to provide better-quality word lists with hyphenated compounds and proper abbreviations (ending with a period, rather than having it stripped off), if that is something you could use.

I hope we can work more closely together in the future.  The best way to reach out to me is via the GitHub project: https://github.com/en-wl/wordlist

Thank you,
Kevin Atkinson
Comment 17 Marco A.G.Pinto 2025-12-14 07:13:54 UTC
(In reply to Kevin Atkinson from comment #16)

Heya, Kevin,

I have been working on my versions of US+CA+AU for one year.

I have created/added thousands of patterns/words in my own tool to properly handle them.

GB+US+CA+AU will each have around 280 000 words.

The ZA one will have 300 000+.

I would like to ask if you can release a script in 2026 that will create a static file with all the morphological information, which could be hardcoded into LibreOffice and work with the five English dictionaries.

Having non-plaintext data in the dictionary files will probably only work with LibreOffice and OpenOffice, and it will break all (or almost all) other apps that use the files.

This is the suggestion I made in this ticket or in others (I can't remember where I wrote it).

Thanks.
Comment 18 Kevin Atkinson 2025-12-15 08:36:35 UTC
I want to make one point unambiguous for LibreOffice maintainers and reviewers:

I am the upstream maintainer for the en_US/en_CA/en_AU sources and I have not stepped down.  I do not approve LibreOffice switching those variants to a fork or an alternate upstream (explicitly or implicitly), including as part of any Jan 1, 2026 plan.

I expect to publish the next upstream release by the end of February 2026.  Please DO NOT MERGE any change that replaces or re-roots LibreOffice's en_US/en_CA/en_AU sources away from the upstream I maintain until that upstream release is published and evaluated.

Thanks,
Kevin
Comment 19 Marco A.G.Pinto 2025-12-17 21:22:08 UTC
Heya,

Just to clarify: I’m no longer proposing changes to the English dictionaries in LibreOffice core, and I won’t be committing dictionary updates to Gerrit.

I’ll continue maintaining my alternative English dictionaries via the LibreOffice extensions site, independently of what LibreOffice ships by default.

I remain part of the LibreOffice contributor community, just not through core dictionary commits.

Thanks.