167730 – English dictionaries: future maintenance

Status:	NEW

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Linguistic
Version: (earliest affected)	unspecified
Hardware:	All All

Importance:	medium enhancement
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	Dictionaries

Reported:	2025-07-30 08:54 UTC by Marco A.G.Pinto
Modified:	2025-08-06 03:39 UTC
CC List:	6 users

See Also:	167649
Crash report or crash signature:

Attachments
new flags WIP (42.85 KB, application/octet-stream) 2025-07-30 08:54 UTC, Marco A.G.Pinto
default text for .aff 2026+ (1.23 KB, application/octet-stream) 2025-07-30 08:54 UTC, Marco A.G.Pinto
README file for 2026 (14.59 KB, text/plain) 2025-07-30 08:55 UTC, Marco A.G.Pinto
GitHub text 2026+ (ODT) (36.53 KB, application/vnd.oasis.opendocument.text) 2025-07-30 08:55 UTC, Marco A.G.Pinto
GitHub text 2026+ (txt) (5.60 KB, text/plain) 2025-07-30 08:56 UTC, Marco A.G.Pinto
Nemeth's script from ooo (5.63 KB, application/x-zip-compressed) 2025-08-06 03:34 UTC, Marco A.G.Pinto
All fixes I have done suggested by ChatGPT 4.1 (6.29 KB, text/plain) 2025-08-06 03:35 UTC, Marco A.G.Pinto

Description Marco A.G.Pinto 2025-07-30 08:54:26 UTC

Created attachment 202073
new flags WIP

Heya, everyone,

This ticket is related to:
https://bugs.documentfoundation.org/show_bug.cgi?id=167649

It appears that my commits (releases) “ruined” the original .AFF files.

I asked on the Libera channel for more information, and my impression is that the original .AFF files carried morphological information about words (link provided by Cloph):
https://gerrit.libreoffice.org/c/dictionaries/+/25348

I cannot go back to the old .AFF files: over these 14 or 15 years of development I have fixed tons of flag bugs and added new features and new flags.

Could I suggest that the “AM” flags be kept independent of the .AFF files?

What I mean is that, instead of each .AFF file carrying its own “AM” lines, they could live in a single file in the dictionaries folder and apply to all five variants of English.

On 1-JAN-2026 I will “take over” U.S. + Canada + Australia, and I have been heavily patching flags to deal with the U.S. verbs and the like.

One year later, on 1-JAN-2027, I have the 5th generation of English dictionaries planned.

I am attaching the 2026 WIP files here for you to see.

Please help me find a solution; I think the best way would be a single separate file shared by all five variants.

Thanks!

Your friend,
      >Marco A.G.Pinto
       ---------------

Comment 1 Marco A.G.Pinto 2025-07-30 08:54:56 UTC

Created attachment 202074
default text for .aff 2026+

Comment 2 Marco A.G.Pinto 2025-07-30 08:55:24 UTC

Created attachment 202075
README file for 2026

Comment 3 Marco A.G.Pinto 2025-07-30 08:55:52 UTC

Created attachment 202076
GitHub text 2026+ (ODT)

Comment 4 Marco A.G.Pinto 2025-07-30 08:56:12 UTC

Created attachment 202077
GitHub text 2026+ (txt)

Comment 5 Marco A.G.Pinto 2025-07-30 09:54:53 UTC

Forgot László Németh.

Comment 6 Julien Nabet 2025-07-30 17:02:53 UTC

Andras: thought you might be interested in this one since it concerns dictionaries.

Comment 7 Marco A.G.Pinto 2025-08-04 11:22:02 UTC

Guys,

I was looking at the 2016 files, and the number of AMs differs from one English variant to another.

May I merge the AMs of the five variants, remove duplicates, and sort them alphabetically?

For example:
AM 1834
AM ts:0
AM st:abatis ts:Ns
AM ts:0 al:abode
AM ts:0 st:abide
AM st:ax ts:Ns
AM st:addendum ts:Ns


I would keep them all, without duplicates and sorted alphabetically, and then just replace the AM 1834 (in this example) with the total number of AM flags.

If this resolves the whole issue, on 1-JAN-2026 I will commit to Gerrit, with me already taking over U.S. + Canada + Australia.

U.S. is difficult to take over, which is why I am dedicating one full year to it.
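
The merge proposed in this comment could be sketched roughly as follows. This is only an illustration under stated assumptions: `merge_am_tables` and the sample variant data are hypothetical, and each variant's AM entries are assumed to have been read from its .aff file already, in file order, without the leading count line. The sketch also returns a map from old to new position for each variant, since a merged and sorted list no longer lines up with the positions used in the original files.

```python
def merge_am_tables(variants):
    """Merge per-variant AM entry lists into one sorted, de-duplicated list.

    variants: dict mapping a variant name to its AM entries in file order
              (the 1-based position is the index used by that variant).
    Returns (merged, remaps), where remaps[name][old_index] = new_index.
    """
    merged = sorted({entry for entries in variants.values() for entry in entries})
    new_index = {entry: i for i, entry in enumerate(merged, start=1)}
    remaps = {
        name: {old: new_index[entry] for old, entry in enumerate(entries, start=1)}
        for name, entries in variants.items()
    }
    return merged, remaps


# Hypothetical sample data, loosely based on the AM lines quoted above.
variants = {
    "en_GB": ["ts:0", "st:abatis ts:Ns", "ts:0 al:abode"],
    "en_US": ["ts:0", "st:ax ts:Ns", "st:abatis ts:Ns"],
}
merged, remaps = merge_am_tables(variants)

# The first AM line holds the total count, followed by one AM line per entry.
aff_lines = [f"AM {len(merged)}"] + [f"AM {entry}" for entry in merged]
```

The `remaps` result is the part that matters in practice: any file that refers to an AM entry by number would need its numbers rewritten with the map for its own variant.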

Comment 8 László Németh 2025-08-05 07:53:51 UTC

(In reply to Marco A.G.Pinto from comment #7)

Hi Marco,

AM/AF (Alias Morphology/Alias Flag vector) only replace flag vectors and morphological descriptions with an index in the .dic file, to compress the dictionary; see the hunspell(5) man page and makealias:

$ makealias -h
makealias: make alias compressed dic and aff files
Usage: makealias [--minimize-diff old_file_without_file_extension] file.dic file.aff

> AM 1834
> AM ts:0 #1
> AM st:abatis ts:Ns #2

In the example above, "1" in the .dic file means "ts:0", "2" means "st:abatis ts:Ns", etc. It's not possible to reorder AM lines without also changing the indices in the .dic file, otherwise we lose the information about which word in the .dic file has the stem "abatis". Fortunately, we don't need AM/AF at all.
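
To make the index dependency concrete, here is a minimal sketch in Python. `parse_am_table` and `resolve` are hypothetical helpers written for this illustration (the real lookup happens inside Hunspell), and the three-entry table is a shortened version of the example above. It assumes the first AM line holds the count and that a trailing "#n" is only an annotation.

```python
def parse_am_table(aff_lines):
    """Collect AM entries in file order; the 1-based position is the alias index."""
    table = []
    count_seen = False
    for line in aff_lines:
        line = line.split("#", 1)[0].strip()  # drop "#1"-style annotations
        if not line.startswith("AM "):
            continue
        if not count_seen:
            count_seen = True  # the first AM line only holds the entry count
            continue
        table.append(line[3:].strip())
    return table


def resolve(table, index):
    """Return the morphological description that a .dic alias index points to."""
    return table[index - 1]


aff = ["AM 3", "AM ts:0 #1", "AM st:abatis ts:Ns #2", "AM ts:0 al:abode #3"]
table = parse_am_table(aff)
original = resolve(table, 2)           # "st:abatis ts:Ns"

# Sorting the AM lines alphabetically moves the entries to new positions,
# so the same index 2 in an unchanged .dic file now points elsewhere.
sorted_table = sorted(table)
after_sort = resolve(sorted_table, 2)  # "ts:0"
```

After sorting, index 2 no longer carries the "abatis" stem, which is exactly the information loss described above unless every index in the .dic file is rewritten as well.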

The working strategies to get back the lost functionality:

1) use my original script attached to the OpenOffice.org issue, which extends the dictionaries with morphological descriptions: real stems ("st:") and the other affixed forms ("am:" ~allomorphs) (and use the result directly or its smaller version compressed with makealias).

or

2) add new words to the original .dic file with alias indices. The new words cannot contain flags, so you must create an "unmunched" version of the new words, listing all of their affixed forms. To create this word list, you can use Kevin Hendricks's original "unmunch", or my script "wordforms" (part of the Hunspell tools).

hunspell/src/tools$ ./wordforms 
Usage: wordforms [-s | -p] dictionary.aff dictionary.dic word
-s: print only suffixed forms
-p: print only prefixed forms
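
As a rough illustration of what "unmunching" means in strategy 2, the toy sketch below expands one word into its suffixed forms. It is not the algorithm used by unmunch or wordforms: `expand_suffixes` is a hypothetical helper that handles only suffix rules, assumes single-character flags, and ignores prefixes and rule combinations. The rules are written in the shape of Hunspell SFX lines (strip, add, condition) but are illustrative, not copied from any of the attached .aff files.

```python
import re


def expand_suffixes(word, flags, rules):
    """Return the word followed by every form produced by its suffix flags.

    rules maps a flag to a list of (strip, add, condition) tuples, where "0"
    means "nothing" and the condition is matched against the end of the word.
    """
    forms = [word]
    for flag in flags:
        for strip, add, condition in rules.get(flag, []):
            if not re.search(condition + "$", word):
                continue  # this rule does not apply to the word
            stem = word if strip == "0" else word[: len(word) - len(strip)]
            forms.append(stem + ("" if add == "0" else add))
    return forms


# Illustrative plural rules in SFX shape: (strip, add, condition).
rules = {
    "S": [
        ("y", "ies", "[^aeiou]y"),
        ("0", "s", "[aeiou]y"),
        ("0", "es", "[sxzh]"),
        ("0", "s", "[^sxzhy]"),
    ]
}

pony_forms = expand_suffixes("pony", "S", rules)
box_forms = expand_suffixes("box", "S", rules)
```

Each resulting form would then be listed as its own flagless entry, which is what the real tools produce from an actual .aff/.dic pair.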

Comment 9 Marco A.G.Pinto 2025-08-06 03:34:36 UTC

Created attachment 202202
Nemeth's script from ooo

The script I downloaded from ooo.

Comment 10 Marco A.G.Pinto 2025-08-06 03:35:24 UTC

Created attachment 202203
All fixes I have done suggested by ChatGPT 4.1

Comment 11 Marco A.G.Pinto 2025-08-06 03:39:36 UTC

(In reply to László Németh from comment #8)

Németh or any other developers,

I have applied all the fixes I could from those GPT suggested, and I have also installed the two Hunspell-related packages on my VM with Ubuntu 24.04.

I still get errors, even after reducing the .DIC to just two or three entries for testing:

parsing line: #  Z --> S
parsed in 13 prefixes and 53 suffixes
.awk: line 1: improper use of next
cat: /home/marco-pinto/Desktop/nemeth/pos/part-of-speech.txt: No such file or directory
.cat: /home/marco-pinto/Desktop/nemeth/agid/infl.txt: No such file or directory
.awk: line 1: regular expression compile failed (syntax error ^* or ^+)
^*
.cat: /tmp/z.aff: No such file or directory
awk: line 1: improper use of next
.......
Verifying. Different words (if not 0, check /tmp/diff.log): 0
Alias compression...
52 
0/201,204 
0th/205,203 
1/201,202 
1st/205 
1th/203,300 
2/201,204 
2nd/205 
2th/203,300 
3/201,204 
3rd/205 
3th/203,300 
4/201,204 
4th/205,203 
5/201,204 
5th/205,203 
6/201,204 
6th/205,203 
7/201,204 
7th/205,203 
8/201,204 
8th/205,203 
9/201,204 
9th/205,203 
10s/205,203 
20s/205,203 
30s/205,203 
40s/205,203 
50s/205,203 
60s/205,203 
70s/205,203 
80s/205,203 
90s/205,203 
100s/205,203 
200s/205,203 
300s/205,203 
400s/205,203 
500s/205,203 
600s/205,203 
700s/205,203 
800s/205,203 
900s/205,203 
1000s/205,203 
2000s/205,203 
'10s 
'20s 
'30s 
'40s 
'50s 
'60s 
'70s 
'80s 
'90s 
.marco-pinto@marco-pinto-VirtualBox:~/Desktop/nemeth$