Description:
The option “Tools” – “AutoCorrect” – “Apply” does not return correct results for some words when using the Marathi language pack downloaded from https://extensions.libreoffice.org/en/extensions/show/marathi-spellchecker

पुनरावलोकन is changed to पुनरावलोकण, which is wrong, and no such entry exists in the AutoCorrect list. However, an entry ‘कन’ to ‘कण’ does exist in the source. That entry should apply only to the standalone 2-letter word and not to a 7-letter word that happens to end with it.

English words are corrected properly. The bug applies only to Marathi (and possibly some other languages).

Let me explain with an example. The word “adn” is correctly changed to “and” using the same “Tools – AutoCorrect – Apply” option. But should the word “madn” (or any other random word that contains the term) be changed to “mand”? No, it should not. In the case of Marathi, AutoCorrect replaces the sub-string instead of matching the entire word.

Steps to Reproduce:
1. Install the Marathi dictionary
2. Type the word पुनरावलोकन in Writer
3. From Tools select AutoCorrect and then Apply

Actual Results:
The word changes to पुनरावलोकण due to a sub-string match, which is wrong.

Expected Results:
The word should not be changed.

Reproducible: Always

User Profile Reset: No

Additional Info:
Tools – AutoCorrect – While Typing works as expected. But I am not able to apply the AutoCorrect list after typing because of the strange behavior described above. I changed the AutoCorrect options but got the same results.
Created attachment 172256 [details] sub-string replaced while applying autocorrect list
Shantanu, thank you for reporting the bug. I'm not sure whether the problem is caused by LibreOffice. Have you also asked the developer of the extension? => NEEDINFO
I can reproduce the bug without the extension. Type these 4 lines in Writer:

adn
madn
adnिadn
adnतadn

When you apply AutoCorrect, only the first one should change, right? Instead the result is:

and
madn
adnिand
adnतadn

The third line has changed but not the fourth, and the first half of the third line is unchanged. Interestingly, when I type the words, it works as expected. The bug can be reproduced only if I use Tools - AutoCorrect - Apply. Devanagari characters like "ि" should not be treated as spaces. These characters occur in almost all Hindi/Marathi words.
use the dispatcher instead of gotoEndOfWord method as suggested here... https://stackoverflow.com/questions/67947672/can-you-print-the-wavy-lines-generated-by-spell-check-in-writer
Thank you for reporting the bug. It seems you're using an old version of LibreOffice. Could you please try to reproduce it with the latest version of LibreOffice from https://www.libreoffice.org/download/libreoffice-fresh/ ? I have set the bug's status to 'NEEDINFO'. Please change it back to 'UNCONFIRMED' if the bug is still present in the latest version.
Reproduced using: Version: 7.1.4.2 (x64) / LibreOffice Community Build ID: a529a4fab45b75fefc5b6226684193eb000654f6 CPU threads: 1; OS: Windows 10.0 Build 17763; UI render: Skia/Raster; VCL: win Locale: en-US (en_US); UI: en-US Calc: threaded Can you please post your output when you apply autocorrect to the list mentioned in my post (comment 3)?
Shantanu, it seems that nobody could confirm your bug report. A new major release is now available. Could you please retest with the latest version of LO (LO 7.5) and give feedback? => NEEDINFO
It is confirmed that the problem still exists:

adn
madn
adnिadn
adnतadn

will be changed into:

and
madn
adnिand
adnतadn

Version: 24.2.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 5a86dd3a5008d13a5ca1f687e4602311f0a7be45
CPU threads: 12; OS: Linux 5.15; UI render: default; VCL: gtk3
Locale: en-US (en_US.UTF-8); UI: en-US
Calc: threaded
It seems to me that the problem is related to "Use replacement table" option in Tools/AutoCorrect. I guess there are two issues here: 1. "Use replacement table" doesn't work while typing, because if you type "adn", it is not changed into "and". (a new ticket maybe?) 2. "Use replacement table" doesn't behave correctly when it comes to some non-English characters. Maybe someone can confirm what is expected in this ticket.
Expected: The second half of the third line should not be changed to 'and'. It should remain 'adn' because the substring 'adn' is part of a word and not a separate word, unlike on the first line. As a result of this bug, I am unable to use AutoCorrect for Marathi, as it changes parts of words in an unpredictable manner. The issue may also be present in other languages, but I am unable to verify that. AutoCorrect - Apply is an awesome feature that is too good to miss, and Microsoft Word does not support it. English, however, does not have any problems.
>> Have you also asked the developer of the extension? No. Because I am the developer! :)
[Automated Action] NeedInfo-To-Unconfirmed
The issue arises because u_charType classifies the character "ि" as U_COMBINING_SPACING_MARK, for which cclass_Unicode::getCharType returns BASE_FORM|PRINTABLE [1]. That type is not considered LetterNumeric by [2], so "ि" is treated as a word separator by [3].

[1] https://cgit.freedesktop.org/libreoffice/core/tree/i18npool/source/characterclassification/cclass_unicode.cxx#:~:text=return%20BASE_FORM%7CPRINTABLE%3B
[2] https://cgit.freedesktop.org/libreoffice/core/tree/unotools/source/i18n/charclass.cxx#:~:text=bool%20CharClass%3A%3AisLetterNumeric(%20const%20OUString%26%20rStr%2C%20sal_Int32%20nPos%20)%20const
[3] https://cgit.freedesktop.org/libreoffice/core/tree/sw/source/core/edit/autofmt.cxx#:~:text=if%20(!(rAppCC.isLetterNumeric(*pText%2C%20sal_Int32(nPos))
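To illustrate the diagnosis, here is a minimal Python sketch that uses the standard unicodedata module rather than ICU or the LibreOffice classes. The helpers is_letter_numeric and split_words are hypothetical stand-ins for a letters-and-digits-only test, not the actual CharClass::isLetterNumeric logic, and the model does not try to reproduce Writer's exact output. It only shows why a spacing mark (general category Mc) can end up acting as a word boundary while a Devanagari letter (Lo) does not.

```python
import unicodedata


def is_letter_numeric(ch: str) -> bool:
    """Hypothetical stand-in: accept only letters (L*) and numbers (N*)."""
    return unicodedata.category(ch)[0] in ("L", "N")


def split_words(text: str) -> list[str]:
    """Split text wherever a character fails is_letter_numeric()."""
    words, current = [], []
    for ch in text:
        if is_letter_numeric(ch):
            current.append(ch)
        else:
            if current:
                words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


VOWEL_SIGN_I = "\u093f"  # "ि", DEVANAGARI VOWEL SIGN I
LETTER_TA = "\u0924"     # "त", DEVANAGARI LETTER TA

print(unicodedata.category(VOWEL_SIGN_I))         # Mc (spacing combining mark)
print(unicodedata.category(LETTER_TA))            # Lo (other letter)
print(split_words("adn" + VOWEL_SIGN_I + "adn"))  # ['adn', 'adn']
print(split_words("adn" + LETTER_TA + "adn"))     # one word, not split
```

In this simplified model the mark splits the third test line into two "adn" words, which matches the observation that line three is affected while line four is not.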
I can work on this, but I need more information about how to make the changes. Should I modify isLetterNumeric function [1] to take into account BASE_FORM [2]? Or should I modify the function here [3] to take into account BASE_FORM [2]? [1] https://cgit.freedesktop.org/libreoffice/core/tree/unotools/source/i18n/charclass.cxx#:~:text=bool%20CharClass%3A%3AisLetterNumeric(%20const%20OUString%26%20rStr%2C%20sal_Int32%20nPos%20)%20const [2] https://cgit.freedesktop.org/libreoffice/core/tree/i18npool/source/characterclassification/cclass_unicode.cxx#:~:text=case%20U_COMBINING_SPACING_MARK%3A%0A%20%20%20%20%20%20%20%20return-,BASE_FORM,-%7CPRINTABLE%3B%0A%0A%20%20%20%20//%20Print%0A%20%20%20%20case [3] https://cgit.freedesktop.org/libreoffice/core/tree/sw/source/core/edit/autofmt.cxx#:~:text=if%20(!(rAppCC.isLetterNumeric(*pText%2C%20sal_Int32(nPos))
Or maybe it is something related to Unicode? "ि" is part of the U_COMBINING_SPACING_MARK category [1]. I'm not familiar with it, but judging from the name, it is not a letter or a character but a mark or spacing.

[1] https://www.fileformat.info/info/unicode/category/Mc/list.htm
The name "U_COMBINING_SPACING_MARK" is misleading. They are ligatures heavily used in Devanagari (Hindi, Marathi, etc.). For example:

ऀ ँ ं ः ऺ ऻ ़ ा ि ी ु ू ृ ॄ ॅ ॆ े ै ॉ ॊ ो ौ ् ॎ ॏ ॑ ॒ ॕ ॖ ॗ ॢ ॣ

In other words, all the characters in the Devanagari group that have a circle in them are incorrectly treated as spaces in LibreOffice.

https://en.wikipedia.org/wiki/Devanagari_(Unicode_block)

Including ligatures in isLetterNumeric should solve this problem.
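For anyone who wants to check how the signs listed above are classified, the following sketch lists every combining mark in the Devanagari block (U+0900 to U+097F) with its general category, using Python's standard unicodedata module. The helper name devanagari_marks is made up for this example. As far as the Unicode character database goes, the signs fall into more than one mark category (spacing marks, Mc, and non-spacing marks, Mn), so a fix probably needs to cover both.

```python
import unicodedata


def devanagari_marks():
    """Yield (char, category, name) for combining marks in U+0900..U+097F."""
    for cp in range(0x0900, 0x0980):
        ch = chr(cp)
        cat = unicodedata.category(ch)
        if cat.startswith("M"):
            yield ch, cat, unicodedata.name(ch, "<unnamed>")


# Group the marks by general category so Mc and Mn can be compared.
marks_by_category: dict[str, list[str]] = {}
for ch, cat, name in devanagari_marks():
    marks_by_category.setdefault(cat, []).append(f"U+{ord(ch):04X} {name}")

for cat, entries in sorted(marks_by_category.items()):
    print(f"{cat}: {len(entries)} characters")
    for entry in entries:
        print("   ", entry)
```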
The easy way to solve this issue is to consider U_COMBINING_SPACING_MARK as characters. However, I'm not sure whether the rest apart from Devanagari should also be considered as characters. Any idea on this?
I am sure U_COMBINING_SPACING_MARKs are not real spacing marks (in any script). By the way, I have removed certain AutoCorrect entries that may trigger the bug. For example, I removed 'कन' > 'कण' and a few others.
(In reply to Baole Fang from comment #14)
> I can work on this, but I need more information about how to make the
> changes.
>
> Should I modify isLetterNumeric function [1] to take into account BASE_FORM
> [2]?
> Or should I modify the function here [3] to take into account BASE_FORM [2]?

Modifying the code in autofmt.cxx is the safest bet. isLetterNumeric() is used in many other places, and it is not clear whether marks (spacing or non-spacing) can safely be considered letters in all of those contexts, but in the autofmt.cxx case we are sure they can't be considered word separators.

(Ideally that code should use break iterators to detect word boundaries, but it seems to have too many special cases for this to be practical.)
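The idea of handling marks locally in the word-boundary check can be sketched as follows. This is a Python illustration under stated assumptions, not the actual change to autofmt.cxx: is_word_char, apply_autocorrect, and the one-entry REPLACEMENTS table are hypothetical names modelled on the "adn" example in this report. Marks (categories M*) are kept inside a word only in this local check, leaving any general letter test untouched.

```python
import unicodedata

# Hypothetical replacement table, modelled on the "adn" -> "and" example.
REPLACEMENTS = {"adn": "and"}


def is_word_char(ch: str) -> bool:
    """Letters (L*), numbers (N*), and combining marks (M*) stay inside a word."""
    return unicodedata.category(ch)[0] in ("L", "N", "M")


def apply_autocorrect(text: str, table: dict[str, str] = REPLACEMENTS) -> str:
    """Replace only whole words found in the table; keep separators as they are."""
    out: list[str] = []
    word: list[str] = []

    def flush() -> None:
        # Emit the pending word, replaced only if the whole word is in the table.
        if word:
            w = "".join(word)
            out.append(table.get(w, w))
            word.clear()

    for ch in text:
        if is_word_char(ch):
            word.append(ch)
        else:
            flush()
            out.append(ch)
    flush()
    return "".join(out)


for line in ["adn", "madn", "adn\u093fadn", "adn\u0924adn"]:
    print(apply_autocorrect(line))
```

With marks counted as word characters, only the first of the four test lines changes, and a table entry such as 'कन' to 'कण' no longer touches पुनरावलोकन.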
Is it possible to create a branch with this patch and make it available for testing?
It is under review: https://gerrit.libreoffice.org/c/core/+/153509
Baole Fang committed a patch related to this issue. It has been pushed to "master": https://git.libreoffice.org/core/commit/caab94a3e0387bde05538cff91ff13446f330785 tdf#142437: Fix word boundary detection in autocorrect It will be available in 24.2.0. The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
Baole Fang committed a patch related to this issue. It has been pushed to "libreoffice-7-6": https://git.libreoffice.org/core/commit/a6d35a7940a2c72594b470aec341c867e6faf82c tdf#142437: Fix word boundary detection in autocorrect It will be available in 7.6.0.0.beta2. The patch should be included in the daily builds available at https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: https://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.