Bug 100019 - Upstreaming to English Dictionaries extension
Summary: Upstreaming to English Dictionaries extension
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Extensions
Version (earliest affected): 5.2.0.0.alpha1
Hardware: All All
Importance: medium minor
Assignee: Not Assigned
URL: http://extensions.libreoffice.org/ext...
Whiteboard: target:5.3.0 target:5.2.0
Keywords:
Depends on:
Blocks: Dictionaries
Reported: 2016-05-23 22:37 UTC by Aron Budea
Modified: 2017-06-24 11:37 UTC

See Also:
Crash report or crash signature:


Attachments
Diff of en_ZA.dic (1.14 MB, text/plain)
2016-05-23 22:37 UTC, Aron Budea
Current GB .AFF (28.29 KB, text/plain)
2016-05-31 13:09 UTC, Marco A.G.Pinto
Hyphenation patterns for US+GB, v2011.10.07 (91.70 KB, application/zip)
2016-06-01 04:23 UTC, Aron Budea

Description Aron Budea 2016-05-23 22:37:46 UTC
Created attachment 125252 [details]
Diff of en_ZA.dic

These are some differences between the dictionaries previously shipped with LibreOffice and English Dictionaries (2016.05.01) that should be discussed and resolved.

1. hyph_en_GB/hyph_en_US seem to be a bit newer in LO than in English Dictionaries; I suggest using the newer ones.

LO has version 2011-10-07, English Dictionaries has version 2010-03-16.
Files: hyph_en_GB.dic, hyph_en_US.dic, README_hyph_en_GB.txt, README_hyph_en_US.txt


2. en_GB.aff in LO had the following comments, which are missing from the English Dictionaries version. I'm not sure what these rules are (I can see that there is no NOSUGGEST or COMPOUNDRULE there); please verify.

# 2008-12-18 - NOSUGGEST, NUMBER/COMPOUNDRULE patches (nemeth AT OOo)
# 2010-03-09 (nemeth AT OOo)
#  - UTF-8 encoded dictionary:
#       - fix em-dash problem of OOo 3.2 by BREAK
#       - suggesting words with typographical apostrophes
#       - recognizing words with Unicode f ligatures
#  - add phonetic suggestion (Copyright (C) 2000 Björn Jacke, see the end of the file)


3. There were some changes to the AU, GB and ZA dictionaries in bug 61660. Please verify them, and make those changes at least in GB (I'm not sure what to do with AU and ZA).
See bug 61660 and this commit: https://cgit.freedesktop.org/libreoffice/dictionaries/commit/?id=7e4239060266bf238b5e6692ed10d548c37572d5


4. en_ZA had a significant number of differences in both directions. I'm attaching the result of a diff; all the entries marked with "-" are missing from the newer dictionary.
I'm not the one to evaluate whether they should be included, but I noticed it and wanted to point it out.
Comment 1 Aron Budea 2016-05-23 22:50:49 UTC
Okay, 4. is not an issue: once the diff is sorted from the 2nd character (that is, ignoring the leading "+"/"-" marker), it turns out the words aren't missing from the newer dictionary, just placed somewhere else. The file does not seem to be sorted as a whole, but consists of several sorted parts.
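
The check described above (comparing the two word lists regardless of where each entry sits in the file) can be sketched in Python. This is only an illustration under stated assumptions: the helper names and the sample data are hypothetical, entries are compared together with their flags, and the first line of each .dic is treated as the entry count when it is purely numeric.

```python
def load_entries(dic_text):
    """Return the set of entries found in the text of a Hunspell .dic file.

    Assumption: a purely numeric first line is the entry count and is skipped.
    """
    lines = [line.strip() for line in dic_text.splitlines()]
    lines = [line for line in lines if line]
    if lines and lines[0].isdigit():
        lines = lines[1:]
    return set(lines)


def compare_dics(old_text, new_text):
    """Return (missing_from_new, added_in_new) as sorted lists."""
    old, new = load_entries(old_text), load_entries(new_text)
    return sorted(old - new), sorted(new - old)


# Hypothetical sample data: the new file holds the same entries as the old
# one, but as two separately sorted runs, plus one extra word.
old_dic = "4\napple\nbanana\ncherry/S\ndate\n"
new_dic = "5\ncherry/S\ndate\napple\nbanana\nelder\n"

missing, added = compare_dics(old_dic, new_dic)
print("missing from new:", missing)
print("added in new:", added)
```

Because the comparison is set-based, a line-by-line diff that reports thousands of moved entries collapses to the entries that are really absent on one side.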
Comment 2 Marco A.G.Pinto 2016-05-31 13:09:14 UTC
Created attachment 125417 [details]
Current GB .AFF
Comment 3 Marco A.G.Pinto 2016-05-31 13:14:29 UTC
@Aron:

I was just editing the GB README and, at line 34, we have:
---

This is a locally hosted copy of the English dictionaries with fixed dash handling and new ligature and phonetic suggestion support extension:
http://extensions.openoffice.org/en/node/3785

Original version of the en_GB dictionary:
http://www.openoffice.org/issues/show_bug.cgi/id=72145

OpenOffice.org patch and morphological extension.

The morphological extension based on Wordlist POS and AGID data
created by Kevin Atkinson and released on http://wordlist.sourceforge.net.

Other fixes:

OOo Issue 48060 - add numbers with affixes by COMPOUNDRULE (1st, 111th, 1990s etc.)
OOo Issue 29112, 55498 - add NOSUGGEST flags to taboo words
New REP items (better suggestions for accented words and a few mistakes)
OOo Issue 63541 - remove *dessicated

2008-12-18 nemeth AT OOo

---

On closer inspection, the text from your comment could be added at the "2008-12-18" entry, but I need someone to create the compound rule for me because I don't know how to do it (the current GB .AFF is attached), and likewise the NOSUGGEST. Németh, please tell us whether NOSUGGEST works automatically once an "!" is added to the words in the .DIC, or whether it needs something else to work.

Remember that this .AFF isn't automatically generated from a wordlist like the US/CA ones, so I don't think I can copy the compound rules, because I believe they use codes that are already used by the affix rules.

PS->Could someone attach the most recent US+GB hyphenation files in a compressed archive here so that I can update them in the OXT? When I used Git to create a folder on my desktop, the date of the downloaded files became the current date. Maybe the date is not important, though... :-)

Thanks!

Kind regards,
      >Marco A.G.Pinto
       ---------------
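
On the NOSUGGEST question above: in the Hunspell format the "!" has no special meaning by itself. The .aff file declares which flag means NOSUGGEST (for example a line `NOSUGGEST !`), and entries in the .dic carry that flag after a "/". The Python sketch below only illustrates that relationship; the function names and sample data are hypothetical, and it assumes the default single-character flag type and entries without escaped slashes.

```python
def nosuggest_flag(aff_text):
    """Return the flag declared by a 'NOSUGGEST <flag>' line, or None."""
    for line in aff_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "NOSUGGEST":
            return parts[1]
    return None


def flagged_words(dic_text, flag):
    """Return the words in the .dic text whose flag field contains `flag`.

    Assumption: single-character flags, so the text after '/' is a plain
    string of one-character flags.
    """
    words = []
    for line in dic_text.splitlines()[1:]:  # first line is the entry count
        fields = line.split()
        if not fields:
            continue
        word, _, flags = fields[0].partition("/")
        if word and flag in flags:
            words.append(word)
    return words


# Hypothetical sample data, not taken from the real en_GB files.
aff = "SET UTF-8\nNOSUGGEST !\n"
dic = "3\nhello\nrudeword/!\nwalk/DGS\n"

flag = nosuggest_flag(aff)
print("NOSUGGEST flag:", flag)
print("never suggested:", flagged_words(dic, flag))
```

If the .aff has no NOSUGGEST line, `nosuggest_flag` returns None, which matches the situation where adding "!" to the .dic alone would have no effect.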
Comment 4 Marco A.G.Pinto 2016-05-31 13:15:43 UTC
I have just added Németh to the Cc.
Comment 5 Marco A.G.Pinto 2016-05-31 13:37:29 UTC
[14:31] <marcoagpinto> I don't know how to add compounding or whatever it is called to the .AFF (1st, 2nd, 3rd, blah blah)
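
For readers unfamiliar with COMPOUNDRULE, the idea for ordinals such as 1st, 2nd, 3rd and 11th can be sketched with a small Python emulation. This is not Hunspell's implementation and not the rule set that ended up in en_GB. The flags (n, m, 1, t, p, c), the two rules and the entries are modelled on the well-known ordinal-number example for COMPOUNDRULE and should be read as assumptions. Only single-character flags with the `*` and `?` quantifiers are handled.

```python
def parse_rule(rule):
    """Split a rule such as 'n*1t' into (flag, quantifier) pairs."""
    tokens = []
    for ch in rule:
        if ch in "*?" and tokens:
            tokens[-1] = (tokens[-1][0], ch)
        else:
            tokens.append((ch, ""))
    return tokens


def matches_rule(word, tokens, entries):
    """True if `word` splits into entries whose flags fit `tokens` in order."""

    def step(pos, idx):
        if idx == len(tokens):
            return pos == len(word)
        flag, quant = tokens[idx]
        # '*' and '?' both allow skipping this token entirely.
        if quant and step(pos, idx + 1):
            return True
        for end in range(pos + 1, len(word) + 1):
            if flag in entries.get(word[pos:end], ""):
                # '*' may repeat the same token; otherwise move to the next.
                if step(end, idx if quant == "*" else idx + 1):
                    return True
        return False

    return step(0, 0)


def accepts(word, entries, rules, only_in_compound="c"):
    """Approximate check: a plain dictionary word or a rule-based compound."""
    flags = entries.get(word)
    if flags is not None and only_in_compound not in flags:
        return True
    return any(matches_rule(word, parse_rule(r), entries) for r in rules)


# Illustrative entries: word -> flags. 'c' marks parts that are only valid
# inside a compound (such as "1th", needed to build "11th").
entries = {
    "0": "nm", "0th": "pt", "1": "n1", "1st": "p", "1th": "tc",
    "2": "nm", "2nd": "p", "2th": "tc", "3": "nm", "3rd": "p", "3th": "tc",
}
for digit in "456789":
    entries[digit] = "nm"
    entries[digit + "th"] = "pt"

rules = ["n*1t", "n*mp"]  # teens (11th..19th) and all other ordinals

for candidate in ["1st", "11th", "21st", "111th", "11st", "12nd", "1th"]:
    print(candidate, accepts(candidate, entries, rules))
```

The first rule covers numbers whose tens digit is 1, which always take "th"; the second covers the rest, where the last digit decides the suffix.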
Comment 6 Aron Budea 2016-06-01 04:23:02 UTC
Created attachment 125431 [details]
Hyphenation patterns for US+GB, v2011.10.07

Thank you for the update, Marco. Here's the zip with the current hyphenation patterns. FYI, the line endings are in Unix format.
Comment 7 Marco A.G.Pinto 2016-06-24 19:47:52 UTC
Hello!

I have just uploaded/updated the OXT to V2016-07-01:
http://extensions.libreoffice.org/extension-center/english-dictionaries/

Please note that I will go to the North of Portugal on Tuesday on vacation and will have limited Internet access for one week.


Here are the changes in the OXT:
MAGP 2016-07-01

Updated the hyphenation patterns to 2011-10-07 (from LibreOffice):
- US + GB

Updated the Dictionaries:
- British (Marco A.G.Pinto)*
  * British has 1107 new words (2016-06-01) + 738 new words (2016-07-01).
    It now uses NOSUGGEST keyword for offensive words.
    It now uses COMPOUNDING (Áron Budea)
Comment 9 Adolfo Jayme Barrientos 2016-07-23 09:13:56 UTC
Cherry-picked again for the final 5.2.0 release candidate as 258bf15aac7975e1202558b6d922be8a9a072b37