Bug 117389 - Remove unneeded RTL and LTR marks in Arabic (ar) dictionary file and fix header and comments
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Linguistic
Version (earliest affected): 6.1.0.0.alpha1+
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
Whiteboard: target:6.3.0
Blocks: RTL-CTL RTL-Arabic-and-Farsi
Reported: 2018-05-02 12:48 UTC by Pander
Modified: 2019-05-01 09:56 UTC
Description Pander 2018-05-02 12:48:55 UTC
Please fix the following for the Arabic dictionary file https://cgit.freedesktop.org/libreoffice/dictionaries/tree/ar/ar.dic


1) remove left-to-right (LTR) mark in line 13870:

   ﺐﻳﺭﻮﺗ<200e>/60

and in line 48332:

   ﻢﺗﺩﺎﻨﻳ<200e>/169

The copy-pastes here are a bit mangled. Search for the character U+200E, e.g. in vim with the pattern \%u200e . Please also trace any (upstream) scripts used to generate this dic file for these characters and fix them there as well.


2) remove right-to-left (RTL) mark in line 23883

   ﺇ<200f>ﺘﺑﺎﻋ/65

and in line 52995

   ﺃﻮﻨﺗﺍﺮﻳﻭ<200f>/228      11

and in line 53323

   ﻮﻴﻟﺯ<200f>/228  11

and in line 53338

   ﻱﻮﻨﺴﻛﻭ<200f>/228        18

The copy-pastes here are a bit mangled. Search for the character U+200F, e.g. in vim with the pattern \%u200f . Please also trace any (upstream) scripts used to generate this dic file for these characters and fix them there as well.
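
For points 1) and 2), the check can also be scripted instead of searching by hand. The following is only a minimal sketch, assuming the dictionary lines have already been read as UTF-8 text; the function names find_marks and strip_marks are made up for illustration and are not part of any upstream tooling.

```python
# Sketch: locate and strip U+200E (LRM) and U+200F (RLM) in dictionary lines.
# The function names are hypothetical, not taken from any upstream script.
MARKS = {"\u200e": "LRM", "\u200f": "RLM"}


def find_marks(lines):
    """Yield (line_number, mark_name) for every directional mark found."""
    for number, line in enumerate(lines, start=1):
        for char in line:
            if char in MARKS:
                yield number, MARKS[char]


def strip_marks(line):
    """Return the line with all LRM and RLM characters removed."""
    return line.translate({0x200E: None, 0x200F: None})
```

Running find_marks over the lines of ar.dic should list the line numbers mentioned above if the file still contains the marks; strip_marks can then be applied to each line before writing the file back.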


3) Around line number 54767, remove these lines:

   54767 ::::::::::::::
   54768 verb.huns.dic
   54769 ::::::::::::::

If needed, replace it with

   #################
   # verb.huns.dic #
   #################

(Note the # also on the end to be robust and safe for LTR processing.)

Please, also check any (upstream) scripts that might have injected this.


4) Around line number 52828, remove these lines:

   52828 ::::::::::::::
   52829 Condidate3.4.dic
   52830 ::::::::::::::

If needed, replace it with

   ####################
   # Condidate3.4.dic #
   ####################

(Note the # also on the end to be robust and safe for LTR processing.)

Please, also check any (upstream) scripts that might have injected this.


5) Around line number 13554, remove these lines:

   13553 <empty line>
   13554 ::::::::::::::
   13555 names.dic
   13556 ::::::::::::::
   13557 50000

If needed, replace it with

   #############
   # names.dic #
   #############

(Note the # also on the end to be robust and safe for LTR processing.)

Please, also check any (upstream) scripts that might have injected this.


6) Around line number 13011, remove these lines:

   13011 ::::::::::::::
   13012 tools.dic
   13013 ::::::::::::::
   13014 #####	2

If needed, replace it with

   #############
   # tools.dic #
   #############

(Note the # also on the end to be robust and safe for LTR processing.)

Please, also check any (upstream) scripts that might have injected this.


7) Around line number 1, remove these lines:


   1 465929	1
   2 ::::::::::::::
   3 stopwords.dic
   4 ::::::::::::::

If needed, replace it with

   #################
   # stopwords.dic #
   #################

(Note the # also on the end to be robust and safe for LTR processing.)

Please, also check any (upstream) scripts that might have injected this.
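
The replacement proposed in points 3) to 7) could be automated roughly as follows. This is a sketch under the assumption that each injected header is exactly three consecutive lines (a row of colons, a file name, a row of colons); the helper names are hypothetical. It does not touch the stray number lines (such as 50000 or 465929) next to some headers, which still need to be removed separately.

```python
# Sketch: turn '::::' / name / '::::' header triples into balanced # boxes.
# Helper names are hypothetical; stray count lines are NOT handled here.

def is_separator(line):
    """True for a line consisting only of colons (at least three)."""
    text = line.strip()
    return len(text) >= 3 and set(text) == {":"}


def boxed_comment(name):
    """Return the three lines of a comment box with # on both ends."""
    bar = "#" * (len(name) + 4)
    return [bar, f"# {name} #", bar]


def replace_headers(lines):
    """Replace every separator/name/separator triple with a comment box."""
    out = []
    i = 0
    while i < len(lines):
        if (i + 2 < len(lines)
                and is_separator(lines[i])
                and is_separator(lines[i + 2])):
            out.extend(boxed_comment(lines[i + 1].strip()))
            i += 3
        else:
            out.append(lines[i])
            i += 1
    return out
```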


8) Any line with a # at only one end should also get a # at the other end. Examples are these lines:

   13558 ###أسماء	3

   13614 #القارات

   13628 #البلدان

   13847 #العواصم

   52819 ##اﻷسماء	4

   52823 #تأليف	5

There are almost 30 lines with (balanced and unbalanced) comments. Perhaps check upstream which comments can be resolved (if they are temporarily disabling dictionary words) and which can be removed completely, such as #####. Other balanced comments are welcome.
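
Balancing the comments from point 8) might look like the sketch below. It assumes a comment is any line starting with #, mirrors the leading run of # at the end of the line, and leaves lines consisting only of # untouched; the function name balance_comment is hypothetical. Any trailing tab-separated field on a comment line is not treated specially here.

```python
# Sketch: give a comment line the same run of # on both ends.
# balance_comment is a hypothetical helper, not an upstream function.

def balance_comment(line):
    """Return the line with its leading # run mirrored at the end."""
    text = line.strip()
    if not text.startswith("#"):
        return line  # not a comment, leave untouched
    hashes = len(text) - len(text.lstrip("#"))
    core = text.strip("#").strip()
    if not core:
        return text  # a line of only # is already balanced
    marker = "#" * hashes
    return f"{marker} {core} {marker}"
```

The function is idempotent, so running it again over an already balanced file changes nothing.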


9) After fixing 7), the first line, before any lines with Arabic words, should contain the total number of lines of the file.

Lines starting with # and this first line itself may be omitted when calculating this number, but a few lines extra in a file of almost 500,000 lines is not a problem. A count that is a few lines too low costs a little at initialization of the spell checker, as the number in the first line is used to allocate at least enough memory. Whatever is lacking is allocated dynamically later, at some cost in processing and memory.
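
The count for point 9) could be recomputed with something like the following sketch. It assumes the first line of the list is the (possibly wrong) count line, and it skips comment lines as described above; skipping blank lines as well is an extra assumption of this sketch. The function names are hypothetical.

```python
# Sketch: recompute the entry count stored in the first line of a .dic file.
# Assumes lines[0] is the count line; skipping blank lines is an assumption.

def entry_count(lines):
    """Count entries after the first line, ignoring blanks and # comments."""
    return sum(
        1
        for line in lines[1:]
        if line.strip() and not line.lstrip().startswith("#")
    )


def with_updated_count(lines):
    """Return a copy of the lines whose first line holds the new count."""
    return [str(entry_count(lines))] + lines[1:]
```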
Comment 1 Xisco Faulí 2018-06-04 10:00:47 UTC
Hello Pander,
Thanks for the detailed description.

Would you mind creating a patch yourself and submitting it to gerrit? More info -> https://wiki.documentfoundation.org/Development/gerrit/SubmitPatch
Comment 2 Pander 2018-06-24 10:27:03 UTC
Hi Xisco,

Unfortunately, I do not have the time for that. This is already stretching the scope at the moment for me within the Nuspell project. (The development of a pure C++ replacement of Hunspell.)



10) for https://cgit.freedesktop.org/libreoffice/dictionaries/tree/pt_PT/pt_PT.dic the file should not start with whitespace.



11) The use of brackets in https://cgit.freedesktop.org/libreoffice/dictionaries/tree/ne_NP/ne_NP.dic for

श्रीसिया (नउ.टा.जा.)

and for 

(क्रियो)

will have no effect, as words fed into the spell checker will be tokenized in such a way that they are stripped of brackets and spaces. The author can remove these two words or change them into

श्रीसिया
नउ.टा.जा.

and

क्रियो

Note that this has to be checked with an expert on Nepalese spelling.
Comment 3 Xisco Faulí 2018-12-27 09:28:17 UTC
@Sophi, do you know who is in charge of the Arabic dictionary?
Comment 4 Eyal Rozenberg 2018-12-27 10:56:37 UTC
1. I suggest people use the link to the raw file: https://cgit.freedesktop.org/libreoffice/dictionaries/plain/ar/ar.dic rather than to the annotated source view, which takes much longer to load. Plus, don't open it in your browser - it has > 450k lines :-)
2. I wonder whether we also have these marks somewhere in the Hebrew or Farsi dictionaries.
Comment 5 Pander 2019-03-31 19:51:39 UTC
Upstream is http://ayaspell.sourceforge.net/

Getting this fixed is important for further development and verification of the spell checker. Thanks if this can be moved forward.
Comment 6 Adolfo Jayme Barrientos 2019-05-01 06:56:41 UTC Comment hidden (spam)
Comment 7 Adolfo Jayme Barrientos 2019-05-01 06:57:46 UTC
*facepalm*, the correct commit is https://git.libreoffice.org/dictionaries/+/c5a06ed0bf1d10fcbc160e590d01ebc22271ba23

Anyway, please ensure this is solved upstream as well :)
Comment 8 Pander 2019-05-01 09:56:32 UTC
Super! Does this need double checking?

What about point 2. from Eyal? Can someone check that please? Perhaps that needs creation of another issue.