142437 – Applying autocorrect wordlist changes sub-strings for Marathi

Summary: Applying autocorrect wordlist changes sub-strings for Marathi

Status:	RESOLVED FIXED

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Writer
Version: (earliest affected)	3.3.0 release
Hardware:	All All

Importance:	medium normal
Assignee:	Baole Fang

URL:
Whiteboard:	target:24.2.0 target:7.6.0.0.beta2
Keywords:

Depends on:
Blocks:	AutoCorrect-Complete

Reported:	2021-05-23 01:04 UTC by Shantanu
Modified:	2023-10-04 09:04 UTC
CC List:	2 users

See Also:	117651 128192 https://github.com/hunspell/hunspell/issues/927 157258
Crash report or crash signature:

Attachments
sub-string replaced while applying autocorrect list (22.05 KB, image/png) 2021-05-23 01:05 UTC, Shantanu

Description Shantanu 2021-05-23 01:04:03 UTC

Description:
The option “Tools” – “AutoCorrect” – “Apply” does not return the correct results for some words when using the Marathi language pack that was downloaded from...
https://extensions.libreoffice.org/en/extensions/show/marathi-spellchecker

पुनरावलोकन is changed to पुनरावलोकण, which is wrong; no such entry exists in the autocorrect list. However, an entry ‘कन’ to ‘कण’ exists in the source. That entry should apply only to the two-letter word, not to a seven-letter word that happens to contain it.

English words are handled correctly. The bug applies only to Marathi (and maybe some other languages). Let me explain with an example. The word “adn” is correctly changed to “and” using the same “Tools – AutoCorrect – Apply” option. But should the word “madn” (or any other random word that contains the term) be changed to “mand”? No, it should not. In the case of Marathi, it changes the “sub-string” instead of matching the “entire” word.
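
The difference between the two matching strategies described above can be shown in a few lines. This is a minimal Python sketch for illustration only, not LibreOffice code; the function names and the two-entry table are hypothetical.

```python
# Hypothetical two-entry replacement table, taken from the examples in this report.
TABLE = {"कन": "कण", "adn": "and"}

def replace_substrings(word, table=TABLE):
    """Wrong behaviour: replace every occurrence, even inside a longer word."""
    for wrong, right in table.items():
        word = word.replace(wrong, right)
    return word

def replace_whole_word(word, table=TABLE):
    """Expected behaviour: replace only when the entire word matches an entry."""
    return table.get(word, word)

for w in ["adn", "madn", "कन", "पुनरावलोकन"]:
    print(w, "| substring:", replace_substrings(w), "| whole word:", replace_whole_word(w))
```

With whole-word matching, only “adn” and ‘कन’ are changed; “madn” and पुनरावलोकन are left alone.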

Steps to Reproduce:
1. Install Marathi dictionary
2. Type the word पुनरावलोकन in writer
3. From Tools select AutoCorrect and then Apply

Actual Results:
The word changes to पुनरावलोकण due to sub-string match which is wrong.

Expected Results:
The word should not be changed.

Reproducible: Always

User Profile Reset: No

Additional Info:
Tools – AutoCorrect – While Typing works as expected. But I am not able to apply the autocorrect list after typing because of the strange behavior mentioned above. I changed autocorrect options, but got the same results.

Comment 1 Shantanu 2021-05-23 01:05:57 UTC

Created attachment 172256
sub-string replaced while applying autocorrect list

Comment 2 Dieter 2021-06-08 06:30:25 UTC

Shantanu, thank you for reporting the bug. I'm not sure whether the problem is caused by LibreOffice. Have you also asked the developer of the extension?
=> NEEDINFO

Comment 3 Shantanu 2021-06-09 04:36:15 UTC

I can reproduce the bug without the extension. Type these 4 lines in Writer:

adn
madn
adnिadn 
adnतadn

When you apply auto-correct, only the first one should change. Right?

and
madn
adnिand 
adnतadn

The third line has changed but not the fourth. And the first half of the third line is unchanged. Interestingly, when I type the words, it works as expected. The bug can be reproduced only if I use Tools - AutoCorrect - Apply.

Devanagari characters like "ि" should not be treated as spaces. These characters occur in almost all Hindi/Marathi words.

Comment 4 Shantanu 2021-07-05 02:15:28 UTC

Use the dispatcher instead of the gotoEndOfWord method, as suggested here...

https://stackoverflow.com/questions/67947672/can-you-print-the-wavy-lines-generated-by-spell-check-in-writer

Comment 5 Xisco Faulí 2021-11-08 16:11:36 UTC

Thank you for reporting the bug.
It seems you're using an old version of LibreOffice.
Could you please try to reproduce it with the latest version of LibreOffice from https://www.libreoffice.org/download/libreoffice-fresh/ ?
I have set the bug's status to 'NEEDINFO'. Please change it back to 'UNCONFIRMED' if the bug is still present in the latest version.

Comment 6 Shantanu 2021-12-01 04:16:53 UTC

Reproduced using:

Version: 7.1.4.2 (x64) / LibreOffice Community
Build ID: a529a4fab45b75fefc5b6226684193eb000654f6
CPU threads: 1; OS: Windows 10.0 Build 17763; UI render: Skia/Raster; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: threaded

Can you please post your output when you apply autocorrect to the list mentioned in my post (comment 3)?

Comment 7 Dieter 2023-02-06 18:26:40 UTC

Shantanu, it seems that nobody could confirm your bug report. A new major release is now available. Could you please retest with the latest version of LO (LO 7.5) and give feedback?

=> NEEDINFO

Comment 8 Baole Fang 2023-06-19 16:51:03 UTC

It is confirmed that the problem still exists:

adn
madn
adnिadn 
adnतadn

will be changed into:

and
madn
adnिand 
adnतadn

Version: 24.2.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 5a86dd3a5008d13a5ca1f687e4602311f0a7be45
CPU threads: 12; OS: Linux 5.15; UI render: default; VCL: gtk3
Locale: en-US (en_US.UTF-8); UI: en-US
Calc: threaded

Comment 9 Baole Fang 2023-06-19 17:01:02 UTC

It seems to me that the problem is related to "Use replacement table" option in Tools/AutoCorrect.

I guess there are two issues here:
1. "Use replacement table" doesn't work while typing, because if you type "adn", it is not changed into "and". (a new ticket maybe?)
2. "Use replacement table" doesn't behave correctly when it comes to some non-English characters.

Maybe someone can confirm what is expected in this ticket.

Comment 10 Shantanu 2023-06-20 03:18:22 UTC

Expected: The second half of the third line should not be changed to 'and'. It should remain as 'adn' because the substring 'adn' is part of a word and not a new word, unlike what is shown on the first line.

As a result of this bug, I am unable to use autocorrect for Marathi, as it changes certain parts of words in an unpredictable manner. This issue may also be present for other languages, but I am unable to verify that. AutoCorrect - Apply is an awesome feature that is too good to miss. Microsoft Word does not support this.

However, English does not have any problems.

Comment 11 Shantanu 2023-06-20 03:27:30 UTC

>> Have you also asked the developer of the extension?
No. Because I am the developer! :)

Comment 12 QA Administrators 2023-06-21 03:14:06 UTC

[Automated Action] NeedInfo-To-Unconfirmed

Comment 13 Baole Fang 2023-06-21 20:53:11 UTC

The issue arises because u_charType classifies the character "ि" as U_COMBINING_SPACING_MARK, for which cclass_Unicode::getCharType returns BASE_FORM|PRINTABLE [1]. That is not considered LetterNumeric by [2], so "ि" is treated as a word separator by [3].


[1] https://cgit.freedesktop.org/libreoffice/core/tree/i18npool/source/characterclassification/cclass_unicode.cxx#:~:text=return%20BASE_FORM%7CPRINTABLE%3B
[2] https://cgit.freedesktop.org/libreoffice/core/tree/unotools/source/i18n/charclass.cxx#:~:text=bool%20CharClass%3A%3AisLetterNumeric(%20const%20OUString%26%20rStr%2C%20sal_Int32%20nPos%20)%20const
[3] https://cgit.freedesktop.org/libreoffice/core/tree/sw/source/core/edit/autofmt.cxx#:~:text=if%20(!(rAppCC.isLetterNumeric(*pText%2C%20sal_Int32(nPos))
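
The classification chain described in this comment can be illustrated outside LibreOffice. The following is a minimal Python sketch, not the code in autofmt.cxx: it uses Python's unicodedata module in place of ICU's u_charType, and str.isalnum() as a stand-in for CharClass::isLetterNumeric (both are assumptions made for illustration). The helper name split_on_non_alnum is hypothetical.

```python
import unicodedata

def split_on_non_alnum(text):
    """Split text into words, treating every non-alphanumeric character as a separator.

    str.isalnum() is only a stand-in for CharClass::isLetterNumeric (an assumption).
    """
    words, current = [], []
    for ch in text:
        if ch.isalnum():
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

VOWEL_SIGN_I = "\u093f"  # "ि", DEVANAGARI VOWEL SIGN I
LETTER_TA = "\u0924"     # "त", DEVANAGARI LETTER TA

print(unicodedata.category(VOWEL_SIGN_I))  # Mc (spacing combining mark)
print(unicodedata.category(LETTER_TA))     # Lo (other letter)
print(split_on_non_alnum("adn" + VOWEL_SIGN_I + "adn"))  # split into two words
print(split_on_non_alnum("adn" + LETTER_TA + "adn"))     # stays one word
```

Under these assumptions the vowel sign splits the third test line into two separate "adn" words, while the letter त keeps the fourth line together as one word, which is consistent with only the third line being touched.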

Comment 14 Baole Fang 2023-06-21 20:59:44 UTC

I can work on this, but I need more information about how to make the changes.

Should I modify isLetterNumeric function [1] to take into account BASE_FORM [2]?
Or should I modify the function here [3] to take into account BASE_FORM [2]?

[1] https://cgit.freedesktop.org/libreoffice/core/tree/unotools/source/i18n/charclass.cxx#:~:text=bool%20CharClass%3A%3AisLetterNumeric(%20const%20OUString%26%20rStr%2C%20sal_Int32%20nPos%20)%20const
[2] https://cgit.freedesktop.org/libreoffice/core/tree/i18npool/source/characterclassification/cclass_unicode.cxx#:~:text=case%20U_COMBINING_SPACING_MARK%3A%0A%20%20%20%20%20%20%20%20return-,BASE_FORM,-%7CPRINTABLE%3B%0A%0A%20%20%20%20//%20Print%0A%20%20%20%20case
[3] https://cgit.freedesktop.org/libreoffice/core/tree/sw/source/core/edit/autofmt.cxx#:~:text=if%20(!(rAppCC.isLetterNumeric(*pText%2C%20sal_Int32(nPos))

Comment 15 Baole Fang 2023-06-21 21:04:37 UTC

Or maybe it is something related to Unicode? "ि" is part of the U_COMBINING_SPACING_MARK category [1]. I'm not familiar with that, but from its name, it is not a letter or a character, but a mark or spacing.

[1] https://www.fileformat.info/info/unicode/category/Mc/list.htm

Comment 16 Shantanu 2023-06-22 03:41:18 UTC

The name "U_COMBINING_SPACING_MARK" is misleading. They are ligatures heavily used in Devanagari (Hindi, Marathi, etc.). For example:

 ऀ   ँ   ं   ः   ऺ   ऻ  ़   ा   ि   ी   ु   ू   ृ   ॄ   ॅ   ॆ   े   ै   ॉ   ॊ   ो   ौ   ्   ॎ   ॏ  ॑  ॒   ॕ   ॖ   ॗ   ॢ   ॣ

In other words, all the characters in the Devanagari group that have a circle in them are incorrectly treated as spaces in LibreOffice.

https://en.wikipedia.org/wiki/Devanagari_(Unicode_block)

Including ligatures in isLetterNumeric should solve this problem.
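
For reference, the Unicode general category of a few of the signs listed above can be checked with Python's unicodedata module. This is only a sketch: Python is used for illustration, the six-character sample is a subset chosen here, and mark_categories is a hypothetical helper. Some of the signs are spacing marks (Mc) and others are non-spacing marks (Mn); neither category is reported as a letter by str.isalpha().

```python
import unicodedata

def mark_categories(chars):
    """Map each character to its Unicode general category."""
    return {ch: unicodedata.category(ch) for ch in chars}

# A small sample of Devanagari signs, written as escapes so they stay unambiguous.
SAMPLE = ["\u093e", "\u093f", "\u0940", "\u0941", "\u094d", "\u0902"]

for ch, cat in mark_categories(SAMPLE).items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cat}, isalpha={ch.isalpha()}")
```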

Comment 17 Baole Fang 2023-06-22 04:05:46 UTC

The easy way to solve this issue is to consider U_COMBINING_SPACING_MARK as characters. However, I'm not sure whether the rest apart from Devanagari should also be considered as characters. Any idea on this?

Comment 18 Shantanu 2023-06-22 05:18:21 UTC

I am sure U_COMBINING_SPACING_MARKs are not real spacing marks (in any script).
By the way, I have removed certain autocorrect entries that may trigger the bug. For example, I removed 'कन' > 'कण' and a few others.

Comment 19 Khaled Hosny 2023-06-22 09:29:34 UTC

(In reply to Baole Fang from comment #14)
> I can work on this, but I need more information about how to make the
> changes.
> 
> Should I modify isLetterNumeric function [1] to take into account BASE_FORM
> [2]?
> Or should I modify the function here [3] to take into account BASE_FORM [2]?

Modifying the code in autofmt.cxx is the safest bet. isLetterNumeric() is used in many other places and it is not clear if marks (spacing or non-spacing) can be safely considered letters in all these contexts, but for autofmt.cxx case we are sure they can’t be considered word separators.

(ideally that code should be using break iterators to detect word boundaries, but it seems to have too many special cases for this to be practical).
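
The direction suggested here, changing only the word-boundary test used when applying AutoCorrect, might look like the following. This is a Python sketch of the idea, not the C++ patch that was later committed: is_word_char, apply_replacements and the one-entry table are hypothetical, and treating every mark category (Mn, Mc, Me) as word-internal is an assumption.

```python
import unicodedata

REPLACEMENTS = {"adn": "and"}  # hypothetical one-entry replacement table

def is_word_char(ch):
    """Letters, digits and combining marks (Mn, Mc, Me) all belong to a word."""
    return ch.isalnum() or unicodedata.category(ch).startswith("M")

def apply_replacements(text, table=REPLACEMENTS):
    """Replace whole words only; separator characters are copied through unchanged."""
    out, word = [], []
    for ch in text:
        if is_word_char(ch):
            word.append(ch)
            continue
        if word:
            w = "".join(word)
            out.append(table.get(w, w))
            word = []
        out.append(ch)
    if word:
        w = "".join(word)
        out.append(table.get(w, w))
    return "".join(out)

# The four test lines from comment 3; only the first should change.
for line in ["adn", "madn", "adn\u093fadn", "adn\u0924adn"]:
    print(apply_replacements(line))
```

Keeping the change local to the boundary test mirrors the reasoning above: the general letter/number classification is left alone, and only the decision about what ends a word is widened to include marks.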

Comment 20 Shantanu 2023-06-23 08:37:34 UTC

Is it possible to create a branch with this patch and make it available for testing?

Comment 21 Baole Fang 2023-06-23 16:16:45 UTC

It is under review:
https://gerrit.libreoffice.org/c/core/+/153509

Comment 22 Commit Notification 2023-06-23 20:22:02 UTC

Baole Fang committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/caab94a3e0387bde05538cff91ff13446f330785

tdf#142437: Fix word boundary detection in autocorrect

It will be available in 24.2.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.

Comment 23 Commit Notification 2023-06-23 23:44:58 UTC

Baole Fang committed a patch related to this issue.
It has been pushed to "libreoffice-7-6":

https://git.libreoffice.org/core/commit/a6d35a7940a2c72594b470aec341c867e6faf82c

tdf#142437: Fix word boundary detection in autocorrect

It will be available in 7.6.0.0.beta2.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.