Description:
The LightProof grammar checker is useful, but there are other grammar engines worth enabling inside LibreOffice. LightProof could serve as prototype code for new checkers, or it could be extended to support multiple engines.

These are the good engines I've come across:
1. Link Grammar: https://www.abisource.com/projects/link-grammar/
   Here is the AbiWord code that calls it: http://svn.abisource.com/abiword/trunk/plugins/grammar/linkgrammarwrap/LinkGrammarWrap.cpp
2. spaCy: https://spacy.io/
3. NLTK: https://www.nltk.org/
4. PyTorch-NLP: https://github.com/PetrochukM/PyTorch-NLP

LinkGrammar is C++ but has Python wrappers. The others are Python-based. There could be more worth considering. I'm not sure which one is the best or whether they have an easy CheckGrammar() API ;-) Link Grammar could be an Easy Hack.

Steps to Reproduce:
Fire up Writer

Actual Results:
No great grammar checkers

Expected Results:
Sophisticated suggestions

Reproducible: Always
User Profile Reset: Yes
Additional Info:
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0
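To make the "support multiple engines" idea concrete, here is a minimal sketch of what a pluggable dispatcher could look like. Everything in it is an assumption for illustration: the `register` decorator, the `ENGINES` registry, the `Issue` type and `check_grammar()` are hypothetical names, not part of LightProof or LibreOffice, and the two toy checkers only stand in for real backends such as Link Grammar or spaCy.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# One finding, tagged with the engine that produced it (hypothetical type).
@dataclass(frozen=True)
class Issue:
    start: int
    end: int
    message: str
    engine: str

# An engine is any callable returning (start, end, message) tuples.
Engine = Callable[[str], List[Tuple[int, int, str]]]
ENGINES: Dict[str, Engine] = {}  # hypothetical registry of backends


def register(name: str) -> Callable[[Engine], Engine]:
    """Decorator that adds an engine to the registry under `name`."""
    def wrap(func: Engine) -> Engine:
        ENGINES[name] = func
        return func
    return wrap


@register("doubled-word")
def doubled_word(text: str) -> List[Tuple[int, int, str]]:
    # Toy stand-in for a real engine: flags "is is", "the the", ...
    return [(m.start(), m.end(), "Repeated word: " + m.group(1))
            for m in re.finditer(r"\b(\w+)\s+\1\b", text, flags=re.IGNORECASE)]


@register("double-space")
def double_space(text: str) -> List[Tuple[int, int, str]]:
    # Toy stand-in for a second engine: flags runs of two or more spaces.
    return [(m.start(), m.end(), "Multiple spaces")
            for m in re.finditer(r" {2,}", text)]


def check_grammar(text: str) -> List[Issue]:
    """Run every registered engine and merge the findings by position."""
    issues = [Issue(start, end, message, name)
              for name, engine in ENGINES.items()
              for (start, end, message) in engine(text)]
    return sorted(issues, key=lambda i: (i.start, i.end))


if __name__ == "__main__":
    for issue in check_grammar("This is is a  test."):
        print(issue)
```

A real backend would be wrapped the same way: one function that takes the paragraph text and returns character ranges with messages, so the caller never needs to know which engine found what.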
For reference, Lightproof is here: https://cgit.freedesktop.org/libreoffice/lightproof It's a great baseline for prototyping at least. There is also LanguageTool. I found that its UI was not HiDPI-aware yet and that it caused LibreOffice to stutter. I think that could be due to complications related to the Java VM. Those things could get fixed one day, and for some people LanguageTool is good enough, and that's great and we should recommend it as well. This bug report is about tracking other options.
I received info today about spaCy, with a link: https://spacy.io/usage/processing-pipelines#custom-components And some sample code that calls Hunspell from it: https://github.com/tokestermw/spacy_hunspell
@Keith, many thanks for the nice collection of interesting projects. Current Lightproof usage in LibreOffice is based on separate instances of the English, Hungarian, Brazilian Portuguese and Russian Lightproof modules, sometimes with language-specific Python code, so using Lightproof as a prototype is a natural choice, too, whether for a new language or for an improved/alternative version of a language module. My next plan here is to create a simple API with easily accessible multilingual data (for example, in the extras/ module of the LibreOffice source tree) to provide a minimal checker for punctuation and typical unambiguous grammar mistakes for every supported language. (Or to separate Lightproof into a library, like the recent libnumbertext integration: http://www.numbertext.org, https://bugs.documentfoundation.org/show_bug.cgi?id=117171). This could handle most of the worst/ugliest/avoided/taboo mistakes while avoiding unintentional false alarms and https://en.wikipedia.org/wiki/Linguistic_discrimination. Offering deep learning grammar checkers or extensions/options is a nice idea. Spell checking could also be improved this way. My only fear is that deep learning cannot learn professional proofreading, because of the combination of incomplete/inaccurate training data and often incredibly complex rules. [A funny story: I just consulted several orthography books and grammar teachers to fix the bad evaluation of my son's elementary-level dictation. The teacher and I (as the author of the Hungarian spelling dictionary) couldn't immediately identify which orthography rule applied to the special geographical proper names. In fact, the problem here was the unusual dictation text chosen by the untrained teacher.] For deep learning, we must select the automation that works and skip what doesn't, and unfortunately this is not an easy task (see http://libreoffice.hu/grammar-checking-in-libreoffice/). But you are right, the current English grammar checker has very limited features.
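As a rough illustration of the "simple API with multilingual data" plan, the sketch below keeps the rules as plain data keyed by language and applies them with one generic function. The rule table, the `"*"` key for language-independent rules, and the names `Rule`, `Suggestion` and `check()` are all assumptions made up for this example; this is not Lightproof's actual rule format, and the three rules are only samples.

```python
import re
from typing import Dict, List, NamedTuple


class Rule(NamedTuple):
    pattern: str      # regular expression that matches the mistake
    replacement: str  # replacement template (may use \1 style groups)
    message: str      # explanation shown to the user


class Suggestion(NamedTuple):
    start: int
    end: int
    suggestion: str
    message: str


# Hypothetical rule data; "*" holds rules assumed to apply to every language.
RULES: Dict[str, List[Rule]] = {
    "*": [
        Rule(r" +,", ",", "No space before a comma"),
        Rule(r",(?=[^\W\d_])", ", ", "Missing space after a comma"),
    ],
    "en": [
        Rule(r"\b([Cc]ould|[Ss]hould|[Ww]ould) of\b", r"\1 have",
             "Use 'have' after a modal verb"),
    ],
}


def check(text: str, lang: str) -> List[Suggestion]:
    """Apply the shared rules plus the rules for `lang`, sorted by position."""
    found = []
    for rule in RULES["*"] + RULES.get(lang, []):
        for m in re.finditer(rule.pattern, text):
            found.append(Suggestion(m.start(), m.end(),
                                    m.expand(rule.replacement), rule.message))
    return sorted(found)


if __name__ == "__main__":
    for s in check("I should of known,really.", "en"):
        print(s)
```

Keeping the rules as data means a new language only needs a new entry in the table, and a language without its own entry still gets the shared punctuation checks.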
Maybe the fastest way to improve English grammar checking is to optionally use an online API, like http://www.afterthedeadline.com/api.slp or other freely available services.
Hi Laszlo, Glad those links are helpful. It could be useful to factor out some common rules (punctuation, etc.) that work across languages. I wouldn't move LightProof to a library until it gets to 50K lines :-) Increasingly, people realize AI is useless without data, so we can hope more useful open datasets like ImageNet will become popular. Grammar is a complicated problem, but that is what deep learning works best at. The challenge today is that lots of people are building NL research toolkits, and aren't trying to package up NL processing and great suggestions into one component. An online service could be a cool hackathon project, but it's a privacy, performance, etc. challenge. We need to make the free desktop smart to stay relevant. Inference should be able to run on low-end CPUs. A modern processor can do 30 ImageNet inferences per second. I don't know what that translates to for grammar checking, but it is probably more than enough. There are people exploring ways to compress models, like SqueezeNet: https://arxiv.org/abs/1602.07360 The nice thing about the Link Grammar checker is that it seems to have an easy API already. Do you think that could be an easy hack given the AbiWord wrapper code and LightProof? It's not deep learning, but it should be a good way to explore the possibilities of integrating a good engine.
I found an interesting Python library for text processing that builds on top of NLTK. https://textblob.readthedocs.io/en/dev/