Bug 117408 - Clean up dictionary file headers from licenses and whitespace
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Linguistic
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
Whiteboard: reviewed:2022
Keywords: topicCleanup
Depends on:
Blocks: Dictionaries
Reported: 2018-05-03 12:36 UTC by Pander
Modified: 2022-08-09 13:31 UTC (History)
Description Pander 2018-05-03 12:36:43 UTC
Please remove license information from dictionary files. The .dic files should be as clean as possible. License information should be stored in the appropriate README, LICENSE or COPYRIGHT files. There is more room for it there, and this avoids maintaining license information in multiple places.

Additionally, and most importantly, encoding problems can arise from characters with diacritics in license information, especially names of authors. On top of that, this information is added in different ways: by using whitespace, # or /.

1) For Danish, on the first line remove everything after the number, including the whitespace

161315 # (c) Stavekontrolden.dk

- https://cgit.freedesktop.org/libreoffice/dictionaries/tree/da_DK/da_DK.dic
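
The first-line cleanup could be sketched roughly as follows. This is only an illustration, not one of the existing QA scripts: the function name `clean_count_line` is hypothetical, and it assumes the first line starts with the decimal word count. It works on raw bytes so that the file's encoding never has to be guessed.

```python
import re


def clean_count_line(first_line: bytes) -> bytes:
    """Return only the leading word count from the first line of a .dic file.

    Hypothetical helper: assumes the line starts with a decimal number and
    that everything after it (whitespace, '#', credits) can be dropped.
    """
    match = re.match(rb"\s*(\d+)", first_line)
    if match is None:
        raise ValueError("first line does not start with a number")
    return match.group(1)


# The Danish first line quoted above:
print(clean_count_line(b"161315 # (c) Stavekontrolden.dk"))
```

The same helper would presumably also cover a first line that carries an extra word after the count, since it keeps nothing but the digits.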

2) For German, remove line numbers 2 to 18, where line 18 is an empty line and the rest start with #

- https://cgit.freedesktop.org/libreoffice/dictionaries/tree/de/de_AT_frami.dic
- https://cgit.freedesktop.org/libreoffice/dictionaries/tree/de/de_CH_frami.dic
- https://cgit.freedesktop.org/libreoffice/dictionaries/tree/de/de_DE_frami.dic

(Something similar has been found in the non-frami German dictionaries. If possible, address those too.)
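
A minimal sketch of removing such a header block, assuming the header consists only of lines starting with # and empty lines directly after the count line. The name `strip_header_block` is hypothetical. Lines further down the file are deliberately left alone, so a later entry that happens to start with # is not touched.

```python
def strip_header_block(lines: list[bytes]) -> list[bytes]:
    """Drop '#' lines and blank lines that directly follow the first line.

    Assumption: the license/credits header is one contiguous block right
    after the count line; scanning stops at the first real entry.
    """
    if not lines:
        return lines
    i = 1
    while i < len(lines) and (lines[i].startswith(b"#") or not lines[i].strip()):
        i += 1
    return lines[:1] + lines[i:]


sample = [b"2", b"# credits", b"# license", b"", b"Haus", b"Baum"]
print(strip_header_block(sample))
```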

3) For Italian, remove line numbers 2 to 34 that start with #

- https://cgit.freedesktop.org/libreoffice/dictionaries/tree/it_IT/it_IT.dic

4) For Guarani, remove the whitespace and the word "wordlist" from the first line, and remove the empty second line

- https://cgit.freedesktop.org/libreoffice/dictionaries/tree/gug/gug.dic

5) For Dutch, remove the last empty line

- https://cgit.freedesktop.org/libreoffice/dictionaries/tree/nl_NL/nl_NL.dic#n142520

6) For Arabic, remove empty line number 13553

- https://cgit.freedesktop.org/libreoffice/dictionaries/tree/ar/ar.dic#n13553
- https://bugs.documentfoundation.org/show_bug.cgi?id=117389

7) For Nepali, remove empty line number 38029. Note that this is easier to see in the plain file (second URL).

- https://cgit.freedesktop.org/libreoffice/dictionaries/tree/ne_NP/ne_NP.dic#n38029
- https://cgit.freedesktop.org/libreoffice/dictionaries/plain/ne_NP/ne_NP.dic
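
Stray empty lines like the ones in items 5 to 7 could be located with a small sketch such as the one below. The helper name `blank_line_numbers` is hypothetical, and the input is read as raw bytes so that the encoding of the dictionary does not matter.

```python
def blank_line_numbers(data: bytes) -> list[int]:
    """Return 1-based numbers of lines that are empty or whitespace-only."""
    return [
        number
        for number, line in enumerate(data.splitlines(), start=1)
        if not line.strip()
    ]


# Tiny made-up sample: line 3 is empty, line 5 holds only a space.
print(blank_line_numbers(b"2\nfoo\n\nbar\n \n"))
```

For a real file the bytes would come from something like `Path("ar/ar.dic").read_bytes()`.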

8) After cleaning up these files, please also check that the line count in the first line is correct. I.e. the count should (if I'm not mistaken) be the total number of lines in the file, excluding:
- the first line
- any line starting with comment
- any line starting with slash
- any empty lines
- any lines with only whitespace
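
The check in item 8 could look roughly like this. It is a sketch under the counting rules listed above, which are themselves stated with some uncertainty, and `count_entries` is a hypothetical name rather than one of the existing scripts.

```python
import re


def count_entries(data: bytes) -> tuple[int, int]:
    """Return (declared, counted) for the raw bytes of a .dic file.

    declared: the number at the start of the first line.
    counted:  remaining lines that are not blank and do not start with
              '#' or '/', following the exclusion list above.
    """
    lines = data.splitlines()
    if not lines:
        raise ValueError("empty file")
    match = re.match(rb"\s*(\d+)", lines[0])
    if match is None:
        raise ValueError("first line does not start with a number")
    declared = int(match.group(1))
    counted = sum(
        1
        for line in lines[1:]
        if line.strip() and not line.startswith((b"#", b"/"))
    )
    return declared, counted


declared, counted = count_entries(b"2 # header\n# comment\n\nfoo/A\n  \nbar\n/skip\n")
print(declared, counted, "OK" if declared == counted else "MISMATCH")
```

A mismatch is only a hint that the first line needs updating; whether the exclusion rules match what Hunspell actually counts would need to be confirmed.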

This could be a general QA check for the dictionary files. I noticed these minor improvements while developing for Hunspell/Nuspell and have scripts available for QA or reporting on this. I'm willing to contribute them, but I am completely unfamiliar with the LibreOffice development environment.
Comment 1 Pander 2018-05-03 12:57:00 UTC

9) Convert .aff and .dic files from DOS format line terminators to UNIX format line terminators with e.g. `flip -u` or `flip -b -u`. This concerns:

- hu_HU/hu_HU.aff: Non-ISO extended-ASCII text, with very long lines, with LF, NEL line terminators
- pt_BR/pt_BR.dic: Non-ISO extended-ASCII text, with CRLF line terminators
- pt_BR/pt_BR.aff: ISO-8859 text, with CRLF line terminators
- ru_RU/ru_RU.dic: ISO-8859 text, with CRLF, LF line terminators
- ne_NP/ne_NP.dic: UTF-8 Unicode text, with CRLF, LF line terminators
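
Where `flip` is not available, the detection and conversion could be approximated as below. Both function names are hypothetical. The sketch assumes 8-bit or UTF-8 encoded files, works on raw bytes, and deliberately leaves NEL alone because byte 0x85 may be a regular character in some encodings.

```python
def terminator_kinds(data: bytes) -> set[str]:
    """Report which of CRLF, LF and lone CR occur in the data."""
    kinds = set()
    crlf = data.count(b"\r\n")
    if crlf:
        kinds.add("CRLF")
    if data.count(b"\n") > crlf:   # LF not preceded by CR
        kinds.add("LF")
    if data.count(b"\r") > crlf:   # CR not followed by LF
        kinds.add("CR")
    return kinds


def to_unix_newlines(data: bytes) -> bytes:
    """Convert CRLF and lone CR terminators to LF without decoding."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


mixed = b"a\r\nb\nc\n"
print(terminator_kinds(mixed))
print(to_unix_newlines(mixed))
```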

Some extra inspection regarding long lines should be done for:
- da_DK/da_DK.aff: UTF-8 Unicode text, with very long lines
- si_LK/si_LK.dic: UTF-8 Unicode text, with very long lines

See also:
- for i in `find dictionaries -type f|grep -v hyph`; do file $i; done|grep 'long lines'
- for i in `find dictionaries -type f|grep -v hyph`; do file $i; done|grep 'line terminators'
Comment 2 Xisco Faulí 2019-02-11 17:17:52 UTC
Adolfo, any opinion here?
Comment 3 Xisco Faulí 2019-03-21 12:01:54 UTC
@Sophi, do you think we could turn this issue into an easyhack?
Comment 4 sophie 2019-03-25 12:13:35 UTC
(In reply to Xisco Faulí from comment #3)
> @Sophi, do you think we could turn this issue into an easyhack ?

I guess yes, it seems Pander has already documented the issue well.
Comment 5 Xisco Faulí 2019-03-28 19:17:42 UTC
Let's turn this into an easy hack then...
Comment 6 Aron Budea 2021-02-13 06:33:09 UTC
Dictionaries in LO are usually downstream, so these changes should be made where the originals are maintained. E.g. the Italian dictionaries are now maintained by LibreItalia.

Finding out whether the other dictionaries are still maintained, and whether by the same person/group named in the current readmes, does not look like a lot of fun, but that is probably what should be done.
Comment 7 Hossein 2022-08-09 13:28:39 UTC
Re-evaluating the EasyHack in 2022

This task is still relevant, and it is not finished yet. The credit lines are still there in the dictionary files, and the other cleanups are yet to be done.

But asking someone to find the source of each dictionary and update it upstream is not a straightforward, well-defined project that would be useful for EasyHackers. Therefore, I am removing the EasyHack keyword from this issue.

Although some of the files are not updated regularly (not even yearly), finding the links to the upstream projects and mentioning them here can help. I should also state that cleaning up the files here in the dictionaries repository can also be helpful, at least for some of the rarely updated files.