Please remove license information from the dictionary files. The .dic files should be as clean as possible; license information should be stored in the appropriate README, LICENSE or COPYRIGHT files. There is more room for it there, and it avoids having to maintain license info in multiple places.
Additionally, and most importantly, characters with diacritics in license information, especially author names, can cause encoding problems. On top of that, this information is added in inconsistent ways: with leading whitespace, # or /
1) For Danish, remove everything after the number on the first line, including the preceding whitespace:
161315 # (c) Stavekontrolden.dk
2) For German, remove lines 2 to 18, where line 18 is empty and the others start with #
(Something similar has been found in the non-frami German dictionaries. If possible, address those too.)
3) For Italian, remove lines 2 to 34, which start with #
4) For Guarani, remove the whitespace and the word "wordlist" from the first line, and remove the empty second line
5) For Dutch, remove the last empty line
6) For Arabic, remove the empty line 13553
7) For Nepali, remove the empty line 38029. Note that this is easier to see in the plain file (second url).
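The comment/empty-line parts of the steps above could be sketched roughly like this, assuming the conventions described (# and / comment lines, blank lines) and that `clean_dic` and its argument are hypothetical names, not anything in the repository:

```shell
#!/bin/sh
# Sketch: strip comment lines (#), slash lines and blank/whitespace-only lines
# from the body of a .dic file, keeping the first line (the word count) as-is.
# clean_dic is a hypothetical helper, not an existing LibreOffice script.
clean_dic() {
  { head -n1 "$1"; tail -n +2 "$1" | grep -vE '^[[:space:]]*(#|/|$)'; } > "$1.clean" \
    && mv "$1.clean" "$1"
}
```

Note that this does not cover the language-specific first-line fixes (Danish, Guarani); those need their own small edits.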
8) After cleaning up these files, please also check that the line count given on the first line is correct. That is, the count should equal the total number of lines in the file excluding (if I'm not mistaken):
- the first line
- any line starting with a comment character (#)
- any line starting with a slash (/)
- any empty line
- any line containing only whitespace
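Such a QA check could be sketched as follows, under the exclusion rules listed above; `check_dic_count` is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch of the QA check: compare the count declared on line 1 of a .dic file
# with the number of real entries (excluding line 1, #-lines, /-lines and
# blank/whitespace-only lines). Returns non-zero on a mismatch.
check_dic_count() {
  # take only the leading digits, in case the first line has trailing text
  declared=$(head -n1 "$1" | grep -oE '^[0-9]+')
  actual=$(tail -n +2 "$1" | grep -cvE '^[[:space:]]*(#|/|$)')
  [ "$declared" = "$actual" ]
}
```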
This could be a general QA check for the dictionary files. I noticed these minor issues while developing for Hunspell/Nuspell, and I have scripts available for QA and reporting on this. I'm willing to contribute these; however, I am completely unfamiliar with the LibreOffice development workflow.
9) Convert the .aff and .dic files from DOS format line terminators to UNIX format line terminators, e.g. with `flip -u` or `flip -b -u`. This concerns:
- hu_HU/hu_HU.aff: Non-ISO extended-ASCII text, with very long lines, with LF, NEL line terminators
- pt_BR/pt_BR.dic: Non-ISO extended-ASCII text, with CRLF line terminators
- pt_BR/pt_BR.aff: ISO-8859 text, with CRLF line terminators
- ru_RU/ru_RU.dic: ISO-8859 text, with CRLF, LF line terminators
- ne_NP/ne_NP.dic: UTF-8 Unicode text, with CRLF, LF line terminators
Some extra inspection of the long lines should be done for:
- da_DK/da_DK.aff: UTF-8 Unicode text, with very long lines
- si_LK/si_LK.dic: UTF-8 Unicode text, with very long lines
The affected files can be listed with:
- for i in `find dictionaries -type f|grep -v hyph`; do file $i; done|grep 'long lines'
- for i in `find dictionaries -type f|grep -v hyph`; do file $i; done|grep 'line terminators'
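If `flip` is not available, an equivalent CRLF-to-LF conversion can be sketched with GNU sed (an assumption; BSD sed needs `-i ''` instead). This handles the CRLF files listed above; the hu_HU NEL terminators would need separate treatment:

```shell
#!/bin/sh
# Sketch: convert CRLF (DOS) line terminators to LF (UNIX) in place, as an
# alternative to flip. Assumes GNU sed; to_unix is a hypothetical helper name.
to_unix() {
  # strip a trailing carriage return from every line of the given file
  sed -i 's/\r$//' "$1"
}
```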
Adolfo, any opinion here ?
@Sophi, do you think we could turn this issue into an easyhack ?
(In reply to Xisco Faulí from comment #3)
> @Sophi, do you think we could turn this issue into an easyhack ?
I guess yes, it seems Pander has well documented the issue already.
Let's turn this into an easy hack then...