Please fix the following for the Arabic dictionary file https://cgit.freedesktop.org/libreoffice/dictionaries/tree/ar/ar.dic

1) Remove the left-to-right (LTR) mark in line 13870:
ﺐﻳﺭﻮﺗ<200e>/60
and in line 48332:
ﻢﺗﺩﺎﻨﻳ<200e>/169
The copy-pastes here are a bit mangled. Search e.g. with vim for Ctrl+U 200e. Please also trace any (upstream) scripts used to generate this dic file for these characters and fix it there as well.

2) Remove the right-to-left (RTL) mark in line 23883:
ﺇ<200f>ﺘﺑﺎﻋ/65
and in line 52995:
ﺃﻮﻨﺗﺍﺮﻳﻭ<200f>/228 11
and in line 53323:
ﻮﻴﻟﺯ<200f>/228 11
and in line 53338:
ﻱﻮﻨﺴﻛﻭ<200f>/228 18
The copy-pastes here are a bit mangled. Search e.g. with vim for Ctrl+U 200f. Please also trace any (upstream) scripts used to generate this dic file for these characters and fix it there as well.

3) Around line number 54767, remove these lines:
54767 ::::::::::::::
54768 verb.huns.dic
54769 ::::::::::::::
If needed, replace them with
#################
# verb.huns.dic #
#################
(Note the # at the end as well, to be robust and safe for LTR processing.) Please also check any (upstream) scripts that might have injected this.

4) Around line number 52828, remove these lines:
52828 ::::::::::::::
52829 Condidate3.4.dic
52830 ::::::::::::::
If needed, replace them with
####################
# Condidate3.4.dic #
####################
(Note the # at the end as well, to be robust and safe for LTR processing.) Please also check any (upstream) scripts that might have injected this.

5) Around line number 13554, remove these lines:
13553 <empty line>
13554 ::::::::::::::
13555 names.dic
13556 ::::::::::::::
13557 50000
If needed, replace them with
#############
# names.dic #
#############
(Note the # at the end as well, to be robust and safe for LTR processing.) Please also check any (upstream) scripts that might have injected this.

6) Around line number 13011, remove these lines:
13011 ::::::::::::::
13012 tools.dic
13013 ::::::::::::::
13014 ##### 2
If needed, replace them with
#############
# tools.dic #
#############
(Note the # at the end as well, to be robust and safe for LTR processing.) Please also check any (upstream) scripts that might have injected this.

7) Around line number 1, remove these lines:
1 465929 1
2 ::::::::::::::
3 stopwords.dic
4 ::::::::::::::
If needed, replace them with
#################
# stopwords.dic #
#################
(Note the # at the end as well, to be robust and safe for LTR processing.) Please also check any (upstream) scripts that might have injected this.

8) Any line with a # at only one end should also get a # at the other end. Examples are these lines:
13558 ###أسماء 3
13614 #القارات
13628 #البلدان
13847 #العواصم
52819 ##اﻷسماء 4
52823 #تأليف 5
There are almost 30 lines with (balanced and unbalanced) comments. Perhaps check upstream which comments can be resolved (if they are temporarily disabling dictionary words) and which can be removed completely, such as #####. Other balanced comments are welcome.

9) After fixing 7), the first line, before any lines with Arabic words, should contain the total number of lines in the file. Lines starting with # and the first line itself may be left out when calculating this number, but a count that is a few lines too high is not a problem for a file of almost 500,000 lines. A count that is a few lines too low costs a little at initialization of the spell checker, because the number in the first line is used to allocate at least enough memory. Whatever is lacking is allocated dynamically later, at some cost in processing and memory.
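Points 8) and 9) could be checked with something like the following minimal Python sketch. The function names are hypothetical, and it assumes the convention described in this report (lines starting with # are comments and need not be counted); it is not taken from any upstream script.

```python
def expected_count(lines):
    """Count entry lines for the first-line total (point 9).

    Skips the first line (the count itself), blank lines,
    and lines starting with '#', as suggested in this report.
    """
    return sum(
        1
        for line in lines[1:]
        if line.strip() and not line.startswith("#")
    )


def balance_comment(line):
    """Mirror the leading run of '#' onto the end of a comment line (point 8)."""
    if not line.startswith("#"):
        return line  # not a comment, leave untouched
    body = line.strip("#")
    if not body:
        return line  # a bare run such as '#####' has nothing to balance
    lead = len(line) - len(line.lstrip("#"))
    return "#" * lead + body + "#" * lead
```

The count only needs to be approximately right, so erring on the high side (for example by not skipping blank lines) would also be acceptable according to point 9).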
Hello Pander, Thanks for the detailed description. Would you mind creating a patch yourself and submitting it to gerrit? More info -> https://wiki.documentfoundation.org/Development/gerrit/SubmitPatch
Hi Xisco,

Unfortunately, I do not have the time for that. This is already stretching my scope at the moment within the Nuspell project (the development of a pure C++ replacement for Hunspell).

10) For https://cgit.freedesktop.org/libreoffice/dictionaries/tree/pt_PT/pt_PT.dic the file should not start with whitespace.

11) The use of brackets in https://cgit.freedesktop.org/libreoffice/dictionaries/tree/ne_NP/ne_NP.dic for
श्रीसिया (नउ.टा.जा.)
and for
(क्रियो)
will have no effect, as words fed into the spell checker will be tokenized in such a way that they are stripped of brackets and spaces. The author can remove these two words or change them into
श्रीसिया नउ.टा.जा.
and
क्रियो
Note that this has to be checked with an expert on Nepalese spelling.
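A rough way to flag the problems from points 10) and 11) in other dictionaries is sketched below in Python. The function name and the messages are hypothetical, and the check simply treats everything before the first "/" as the word, which is an assumption about the entry layout.

```python
def check_entries(lines):
    """Report leading whitespace on the first line and brackets in entries."""
    problems = []
    # Point 10: the file should not start with whitespace.
    if lines and lines[0][:1].isspace():
        problems.append((1, "file starts with whitespace"))
    # Point 11: brackets in the word part are stripped by tokenization anyway.
    for number, line in enumerate(lines, start=1):
        word = line.split("/", 1)[0]
        if "(" in word or ")" in word:
            problems.append((number, "brackets in entry"))
    return problems
```

Any hits still need review by someone who knows the language before entries are changed or removed.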
@Sophi, do you know who is in charge of the Arabic dictionary?
1. I suggest people use the link to the raw file: https://cgit.freedesktop.org/libreoffice/dictionaries/plain/ar/ar.dic rather than to the annotated source view, which takes much longer to load. Plus, don't open it in your browser - it has > 450k lines :-)
2. I wonder if we don't have these marks somewhere in the Hebrew or Farsi dictionaries.
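To check other dictionaries for the same marks, a small Python sketch along these lines might help. It assumes the files can be read as UTF-8 (some dictionaries may use a different encoding, in which case the `encoding` argument needs adjusting), and the function names are hypothetical.

```python
import sys

# U+200E LEFT-TO-RIGHT MARK and U+200F RIGHT-TO-LEFT MARK
BIDI_MARKS = {"\u200e": "LRM", "\u200f": "RLM"}


def find_bidi_marks(lines):
    """Return (line number, mark name) for every bidi mark found."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for ch in line:
            if ch in BIDI_MARKS:
                hits.append((number, BIDI_MARKS[ch]))
    return hits


def strip_bidi_marks(text):
    """Remove all LRM and RLM characters from the text."""
    return text.translate({ord(ch): None for ch in BIDI_MARKS})


if __name__ == "__main__":
    # Usage: pass one or more .dic paths on the command line.
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for number, name in find_bidi_marks(handle):
                print(f"{path}:{number}: {name}")
```

Run against the raw ar.dic, this should report the lines listed in points 1) and 2) if the marks are still present.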
Upstream is http://ayaspell.sourceforge.net/ Getting this fixed is important for further development and verification of the spell checker. Thanks if this can be moved forward.
https://git.libreoffice.org/dictionaries/+/fea4ea689cd27d4c0bd981fdc01225d3bfacfc2d
*facepalm*, the correct commit is https://git.libreoffice.org/dictionaries/+/c5a06ed0bf1d10fcbc160e590d01ebc22271ba23 Anyway, please ensure this is solved upstream as well :)
Super! Does this need double checking? What about point 2. from Eyal? Can someone check that please? Perhaps that needs creation of another issue.