44580 – share autocorrect replacement table for misc. language subgroups

Status:	NEW

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Linguistic
Version: (earliest affected)	Inherited From OOo
Hardware:	All All

Importance:	medium normal
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:	difficultyMedium, easyHack, skillCpp, topicCleanup

Depends on:
Blocks:	AutoCorrect-Complete

Reported:	2012-01-08 08:43 UTC by tommy27
Modified:	2020-03-12 15:40 UTC
CC List:	9 users

See Also:	48729 79276
Crash report or crash signature:

Attachments
autocorrect testkit (1.81 KB, application/zip) 2014-08-20 06:57 UTC, tommy27

Description tommy27 2012-01-08 08:43:29 UTC

this bug is the "LibO twin" of OOo issue 101224, which I opened in April 2009:
https://issues.apache.org/ooo/show_bug.cgi?id=101224

I'm replicating here to propose it as an easy-hack:
http://wiki.documentfoundation.org/Development/Easy_Hacks#Progress

DESCRIPTION

LibO autocorrect replacement tables are stored as .dat files in this path under Windows: "....User\LibreOffice 3\user\autocorr"

there is one “universal replacement table” called "acor_.dat" whose entries are
applied in any language you are writing in.

there are also separate replacement tables for all language variants:
- UK English --> acor_en-GB.dat
- USA English --> acor_en-US.dat

the same applies to all Spanish, German and even Italian subvariants
(i.e. Italian --> acor_it-IT.dat; Swiss Italian --> acor_it-CH.dat)

however, those .dat files do not share entries with each other... this separate per-variant policy must be kept because of the minority of words that are spelled differently among language variants

For example I could set a:
- “colour -> color” entry in the acor_en-US.dat file and a
- “color -> colour” entry in the acor_en-GB.dat file

there's however the vast majority of words that have exactly the same
spelling... let's take an example: “yellow” which is the same in England, USA,
South Africa, Australia, Canada etc. etc.

if you want to fix a typing error like “yrllow” you have to set an autocorrect
entry in each of the localized English .dat files... that would be too time
consuming...

It would be much more user friendly and time saving to have a “non localized”
"acor_en.dat" file whose entries are shared by all English subtypes.

it would be great to have something similar to the “universal replacement table” acor_.dat but restricted to certain language groups.
something like:

- acor_en.dat working on UK, US, AUS etc. English variants
- acor_it-ALL.dat working on both Italian and Swiss Italian
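
To make the proposal concrete, here is a minimal sketch of which files would be consulted for a given document language, most specific first. It is not LibreOffice code: candidateTables is a hypothetical helper, and it assumes the shared file is named after the bare language code (acor_en.dat, acor_it.dat) and that acor_.dat stays the universal table.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper, not part of LibreOffice: lists the replacement tables
// that would be consulted for a document language, most specific first.
std::vector<std::string> candidateTables(const std::string& langTag)
{
    // This sketch expects tags written with a hyphen, e.g. "en-US".
    assert(langTag.find('_') == std::string::npos);

    std::vector<std::string> files;
    if (!langTag.empty())
        files.push_back("acor_" + langTag + ".dat");                  // acor_en-US.dat

    const std::string::size_type dash = langTag.find('-');
    if (dash != std::string::npos)
        files.push_back("acor_" + langTag.substr(0, dash) + ".dat");  // proposed acor_en.dat

    files.push_back("acor_.dat");                                     // universal table
    return files;
}
```

With this ordering a “yrllow -> yellow” entry stored once in acor_en.dat would be reachable from en-GB, en-US, en-AU and so on, while “colour -> color” can stay in acor_en-US.dat only.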

TECHNICAL INFO

LibO developer Jan Holesovsky AKA Kendy gave me some interesting hints on how to fix the problem on the developer mailing list. here's what he said:

the code you want to play with is editeng/source/misc/svxacorr.cxx .

http://docs.libreoffice.org/editeng/html/svxacorr_8cxx_source.html

You probably want to tweak SvxAutoCorrect::SearchWordsInList() so that
it falls back to 'en' in case the word is not found in 'en_US', or
something like that; but you will probably have to tweak some code around
that too, in order to load the shared acor_XY.dat in addition to
acor_XY_AB.dat, etc.

I don't think it is hard; but some constructs used in that piece of code
are not too obvious, my favorite is this condition:

else if( ( FStatHelper::IsDocument( sUserDirFile ) ||
FStatHelper::IsDocument( sShareDirFile =
GetAutoCorrFileName( eLang, sal_False, sal_False ) ) ) ||
( sShareDirFile = sUserDirFile, bNewFile ))
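
For anyone puzzling over that condition: read left to right, it seems to check the user file first, then the share file, and otherwise to fall through to bNewFile, assigning sShareDirFile along the way. Below is a minimal sketch of the same control flow in plain C++. resolveTable, Resolved and the isDocument callback are hypothetical stand-ins (isDocument plays the role of FStatHelper::IsDocument), and the initial value of sShareDirFile is passed in because the excerpt does not show it.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical result type: what the condition evaluates to, plus the value
// left in sShareDirFile by its side effects.
struct Resolved
{
    bool usable;
    std::string shareFile;
};

Resolved resolveTable(const std::string& userFile,
                      const std::string& shareFileName,
                      std::string shareFile,  // current value of sShareDirFile
                      bool newFile,           // bNewFile
                      const std::function<bool(const std::string&)>& isDocument)
{
    assert(isDocument && "a file-existence predicate is required");

    if (isDocument(userFile))
        return {true, shareFile};   // user copy exists; sShareDirFile untouched

    shareFile = shareFileName;      // assignment hidden in the 2nd IsDocument() call
    if (isDocument(shareFile))
        return {true, shareFile};   // only the share copy exists

    shareFile = userFile;           // comma operator: assign, then yield bNewFile
    return {newFile, shareFile};
}
```

Written this way it becomes visible where a fallback to a shared acor_XY.dat could be added: after the localized share file is found to be missing and before the bNewFile branch. That placement is only a guess from the excerpt.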

Comment 1 Michael Meeks 2012-01-09 04:31:53 UTC

easy-hack-ising :-) I think there is enough here to go on - I'm happy to mentor.

Comment 2 tommy27 2012-03-11 23:34:11 UTC

whoever tries hacking this should probably not alter the current behaviour of adding autocorrect entries via the right-click menu, which adds entries to the table for the current language of the document

I mean, if you write a document in American English, and you accept a "right click" autocorrect suggestion, this should go (like it does right now) in the acor_en-US.dat file.

I think of the common acor_en.dat for all English variants as a database accessible only through the replacement table, just like the acor_.dat file (the common autocorrect database for all languages).

Comment 3 jam 2012-04-30 20:11:57 UTC

I've taken a while looking at this and don't feel confident enough in what I know about the codebase to feel like I can commit a fix.

Comment 4 tommy27 2012-05-01 00:55:08 UTC

@jam@jamandbees.net

sorry to hear that, but at least you tried so I appreciate your efforts anyway.

Comment 5 Florian Reisinger 2012-05-18 09:13:29 UTC

Deleted "Easyhack" from summary

Comment 6 Björn Michaelsen 2013-10-04 18:46:13 UTC

adding LibreOffice developer list as CC to unresolved EasyHacks for better visibility.

see e.g. http://nabble.documentfoundation.org/minutes-of-ESC-call-td4076214.html for details

Comment 7 Robinson Tryon (qubit) 2013-10-19 00:25:22 UTC

Removing comma from whiteboard (please use a space to delimit values in this field)
https://wiki.documentfoundation.org/QA/Bugzilla/Fields/Whiteboard#Getting_Started

Comment 8 Julien Nabet 2014-08-17 15:52:53 UTC

Tommy27: I tried to use a generic solution for fdo#79276. Would you have some time to give it a try? (need 4.2.6/4.3.1)
For example, I don't know if it's ok with an old profile.

Comment 9 tommy27 2014-08-18 13:14:23 UTC

WOW!!! Well done Julien, your fix for Bug 79276 could represent a solution for the current bug as well.

If I copy one of those autocorrect .dat files, manually remove the final sublocalization tag (i.e. acor_it-IT.dat → acor_it.dat) and place it in the autocorr subfolder of the user profile, it will work as an unlocalized autocorrect version for that language.

That means that autocorrect entries in the acor_it.dat file will be applied both in documents written in Italian (Italy) and in Italian (Switzerland).

The same will apply to an acor_en.dat file, which could be a universal autocorrect replacement table for all English variants as well.

The only thing missing is that those unlocalized acor.dat files are currently not shown in the UI under Tools/autocorrect options/Replace, so you have no way to edit, add or remove those entries.

If you find a way to make those unlocalized acor.dat files editable in the UI the fix will be complete.

We also have to decide how those unlocalized autocorrect lists should look in the language list....

I mean, we have Italian (Italy) for acor_it-IT.dat and Italian (Switzerland) for acor_it-CH.dat; what should we display for acor_it.dat?

Maybe we should keep it simple and display it only as Italian rather than Italian (unlocalized) or Italian (common) or Italian (General) etc.

Comment 10 Julien Nabet 2014-08-18 18:19:20 UTC

Tommy27: I imagined generic language files more as a base for standard or user dictionaries, not as a generic dictionary per se. However, I'm not an i18n expert at all, so I'll let Andras speak.

For example, I put a selection in en-US and another in fr-FR then I added 1 word for each.
I found the result in wordbook/standard.dic (from a brand new profile with master sources updated some days ago):
OOoUserDict1
lang: <none>
type: positive
---
stiro
stari

Is it ok or not, I don't know (I hadn't made this test before).

Andras: put you in cc of this one because I'm not sure what we should do now.

Comment 11 tommy27 2014-08-18 19:38:46 UTC

Sorry, but I do not understand what you are talking about. These are lists for automatic correction of typing errors, not dictionaries. See the "yellow" example in the original description.

Comment 12 Julien Nabet 2014-08-18 20:13:28 UTC

Oops, forget what I said, of course you're right :-)

Comment 13 Julien Nabet 2014-08-18 21:14:58 UTC

(sorry again for my previous comment, I was focused on dictionaries)

A second issue with editing a generic unlocalized autocorrect list: what should happen to the localized ones (if they've been generated) once the unlocalized list is changed? Should we try to propagate the change to the localized autocorrect lists? If yes, what do we do if there's a conflict?

Comment 14 tommy27 2014-08-18 21:27:49 UTC

The unlocalized file should have its own list and should not be mixed with the localized files.
The reason was explained here:

(In reply to comment #0)
>...
> 
> there are also separate replacement tables for all language variants:

> however those .dat files are not mutual... this separate subtype policy must
> be kept because of the minority of words that have different spelling among
> language variants
>  
> For example i could set a:
> - “colour -> color”  entry in the acor_en-US.dat file and a
> - “color -> colour” entry in the acor_en-GB.dat file
>  
> there's however the vast majority of words that have exactly the same
> spelling... let's take an example: “yellow” which is the same in England,
> USA,
> South Africa, Australia, Canada etc. etc.
>  
> if you come with a typing error like “yrllow” you should set an autocorrect
> entry in each of the localized english .dat files... it would be too time
> consuming...
>  
> It would be much more user friendly and time saving to have a “non localized”
> "acor_en.dat" file whose entries are shared by all English subtypes.
> 
> it would be great to have something similar to the “universal
> replacement table” acor_.dat but restricted to certain language groups.
> something like: 
> 
> - acor_en.dat working on UK, US, AUS etc. English variants 
> - acor_it.dat   working both on italian and swiss language

Comment 15 Julien Nabet 2014-08-18 21:41:59 UTC

Tommy27: 
Just to be sure to understand, it would mean:
- a first file for initial unlocalized file
- a second file for unlocalized autocorrect if you edit the unlocalized list
- a third file for your localized autocorrect if you edit localized list
=> So autocorrect process should search in second and third file first (in which order? A user could have made a mistake and put a same word to replace but a different replacement) and if there's none of these files, should search in first file only
Is it correct?

Comment 16 tommy27 2014-08-19 13:44:50 UTC

first of all we have to define the exact position of those autocorrect files.

default replacements are under ...\LibreOffice 4\share\autocorr

these are used the first time the autocorrect engine runs and are copied into the user profile, which should be under ...LibreOffice 4\user\autocorr

further edits (addition of new entries, removal or modification of existing ones) will affect the files in the "user" profile, not those under "share"

so in a French scenario, since you have an unlocalized version under "share" which is acor_fr.dat, when you use it for the first time in a French (France) document it will be copied under "user" as acor_fr-FR.dat and will apply just to French (France) documents and not to other variants like French (Canada).

if you want an unlocalized version of the French autocorrect list, you have to manually copy acor_fr.dat from "share" and place it under "user"

this will work and apply replacements in both French (France) and French (Canada) documents.

the problem is that currently you don't see the unlocalized French list in the autocorrect options dropdown menus, so further edits are not possible.

the code should be tweaked to display unlocalized language lists as well.

currently you see:

French (France)      --> acor_fr-FR.dat
French (Canada)      --> acor_fr-CA.dat
etc. etc.

while you should be able to see:

French               --> acor_fr.dat
French (France)      --> acor_fr-FR.dat
French (Canada)      --> acor_fr-CA.dat
etc. etc.
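
As a rough illustration of what the dropdown would need: a file name without a region part has to be recognised and given its own entry. The sketch below is not LibreOffice code. parseTableName, TableLocale and dropdownLabel are hypothetical names, and the labels use raw codes ("fr", "fr (CA)") where the real UI would show translated language names.

```cpp
#include <cassert>
#include <string>

// Hypothetical description of one acor_*.dat file.
struct TableLocale
{
    std::string language;  // "fr"
    std::string region;    // "CA", or empty for an unlocalized table
    bool valid;            // false if the name is not of the form acor_<tag>.dat
};

TableLocale parseTableName(const std::string& fileName)
{
    const std::string prefix = "acor_";
    const std::string suffix = ".dat";
    TableLocale result{"", "", false};

    if (fileName.size() < prefix.size() + suffix.size())
        return result;
    if (fileName.compare(0, prefix.size(), prefix) != 0)
        return result;
    if (fileName.compare(fileName.size() - suffix.size(), suffix.size(), suffix) != 0)
        return result;

    const std::string tag = fileName.substr(
        prefix.size(), fileName.size() - prefix.size() - suffix.size());
    const std::string::size_type dash = tag.find('-');
    result.language = tag.substr(0, dash);   // whole tag when there is no '-'
    if (dash != std::string::npos)
        result.region = tag.substr(dash + 1);
    result.valid = true;
    return result;
}

// An unlocalized table gets a plain language entry, listed like any other.
std::string dropdownLabel(const TableLocale& table)
{
    assert(table.valid);
    if (table.region.empty())
        return table.language;
    return table.language + " (" + table.region + ")";
}
```

So acor_fr.dat would yield the plain entry "fr" next to "fr (FR)" and "fr (CA)", matching the list above.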

Comment 17 tommy27 2014-08-19 13:46:54 UTC

(In reply to comment #15)
> Tommy27: 
> Just to be sure to understand, it would mean:
> - a first file for initial unlocalized file
> - a second file for unlocalized autocorrect if you edit the unlocalized list
> - a third file for your localized autocorrect if you edit localized list
> => So autocorrect process should search in second and third file first (in
> which order? A user could have made a mistake and put a same word to replace
> but a different replacement) and if there's none of these files, should
> search in first file only
> Is it correct?

I made a test to see how the code behaves in case of conflicts.

let's say you have:
color → colour in acor_en-GB.dat
colour → color in acor_en-US.dat

each one will apply only in British English and American English documents respectively, with no conflicts.

If you instead have a:
color → colour in acor_en.dat 
it will apply to American English documents as well

so it means that currently the autocorrect engine looks first in the unlocalized version (acor_en.dat) rather than the localized version (acor_en-US.dat), which doesn't look good to me.

In my opinion, when you have conflicts the autocorrect engine should look first in the autocorrect list which is specific to the document language, in this case acor_en-US.dat, and only afterwards in the unlocalized version (acor_en.dat) if there's no replacement in the previous file.

Comment 18 Julien Nabet 2014-08-19 16:32:05 UTC

(In reply to comment #17)
With a fresh build of master sources and the French UI by default, here are my tests.

Open autocorrect French France, change "afirmer => affirmer" to "afirmer => afffirmer".
I get "afffirmer" (3 f) when I type "afirmer". I close LO and reopen and the change is still the localized one.

I'm quite lost here :-(

Comment 19 tommy27 2014-08-20 06:57:31 UTC

Created attachment 104939 [details]
autocorrect testkit

Hi Julien, probably my test in comment 17 was not 100% accurate.

try replicating this new experiment.

a- download the attached .zip file, which contains 3 minimal autocorrect .dat files

1- acor_und.dat
it has a single entry: test → test1
it will apply in any document regardless of the language, since the acor_und.dat file is the global autocorrect list (you can find it at the top of the language dropdown list in the autocorrect replacement table under [All]; I don't know how it's localized in French)

2- acor_en-GB.dat
it has a single entry: test → test2
it will apply only in documents where the language is English(UK)

3- acor_en.dat
it has a single entry: test → test3
it should apply in any document written in any of the English variants (UK, US, Australia etc.)

b- place these 3 dat files in the autocorr subfolder of the user profile

c- load a blank new Writer document and set the language as English(UK)

d- type test and see how it gets autocorrected

e- compare with my results with LibO 4.2.6.2 under Win7x64:

- all 3 dat files present → test corrected into test2, so the localized variant acor_en-GB.dat rules over the acor_und.dat and  acor_en.dat

- remove the und.dat file → again you get test2, so the  en-GB.dat wins over the en.dat

- remove only the  en-GB.dat file → you get test3 so the en.dat wins over the und.dat

- remove both en-GB.dat and und.dat → again you get test3 since only en.dat is left

- remove both  en-GB.dat and en.dat → you get test1 since only und.dat is left

so basically in case of autocorrect conflicts the “en-GB” list wins over the “en” list and over the “und” list.

I agree this is the correct behavior since the language of the document should tell which is the first autocorrect list to look inside.

f- same results if I rename those files to match the Italian locale (i.e. acor_it.dat and acor_it-IT.dat) and write an Italian (Italy) document.

g- different results if you do the same trick with German or French locales. In those cases the renamed acor_de.dat and acor_fr.dat files will have no effect even if you remove the und.dat and the de-DE.dat and the fr-FR.dat files. So it seems that the unlocalized variant .dat files don't work in some language subgroups... this is strange and inconsistent with the results for Italian and English, where it worked with no issue.

Do you have any thoughts about that?
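
The precedence seen in step e (en-GB, then en, then und) can be written down as a small lookup sketch. This only models the observed English results; it is not how svxacorr.cxx is actually implemented, and lookup, testkit, without and replayStepE are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using Table  = std::map<std::string, std::string>;  // typed word -> replacement
using Tables = std::map<std::string, Table>;        // "en-GB", "en", "und" -> table

// Document language first, then the bare language, then the global "und"
// list. Returns an empty string when no table has an entry for the word.
std::string lookup(const Tables& tables, const std::string& docLang,
                   const std::string& word)
{
    std::vector<std::string> chain;
    chain.push_back(docLang);
    const std::string::size_type dash = docLang.find('-');
    if (dash != std::string::npos)
        chain.push_back(docLang.substr(0, dash));
    chain.push_back("und");

    for (const std::string& tag : chain)
    {
        const Tables::const_iterator table = tables.find(tag);
        if (table == tables.end())
            continue;  // that .dat file is not present
        const Table::const_iterator entry = table->second.find(word);
        if (entry != table->second.end())
            return entry->second;
    }
    return "";
}

// The three single-entry files from the attached testkit.
Tables testkit()
{
    Tables t;
    t["und"]["test"]   = "test1";
    t["en-GB"]["test"] = "test2";
    t["en"]["test"]    = "test3";
    return t;
}

// Returns a copy with one file "removed", as in the manual test.
Tables without(Tables tables, const std::string& tag)
{
    tables.erase(tag);
    return tables;
}

// Replays the five cases of step e for an English (UK) document.
void replayStepE()
{
    const Tables all = testkit();
    assert(lookup(all, "en-GB", "test") == "test2");
    assert(lookup(without(all, "und"), "en-GB", "test") == "test2");
    assert(lookup(without(all, "en-GB"), "en-GB", "test") == "test3");
    assert(lookup(without(without(all, "en-GB"), "und"), "en-GB", "test") == "test3");
    assert(lookup(without(without(all, "en-GB"), "en"), "en-GB", "test") == "test1");
}
```

The sketch says nothing about step g; why the French and German files behave differently is not explained by this model.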

Comment 20 tommy27 2014-08-20 06:59:24 UTC

P.S. when you remove some .dat files as described in the tests above always remember to close LibO first and then restart the program again.

Comment 21 Julien Nabet 2014-08-20 07:50:27 UTC

First, I don't think you should manually copy files in user for the tests

Then, I don't have LO at work so can't check but there can't be any unlocalized files in user\autocorrect.
With the brand new profile all autocorrect files (unlocalized and localized) are in share\autocorrect, adding an entry creates a localized autocorrect file in user\autocorrect. This last one is used in priority.


Again, perhaps I miss something or am wrong since I'm not at home to check.

Comment 22 tommy27 2014-08-20 08:11:20 UTC

(In reply to comment #21)
> First, I don't think you should manually copy files in user for the tests

trust me, it's harmless. I've done it multiple times.

> Then, I don't have LO at work so can't check but there can't be any
> unlocalized files in user\autocorrect.
> With the brand new profile all autocorrect files (unlocalized and localized)
> are in share\autocorrect, adding an entry creates a localized autocorrect
> file in user\autocorrect. This last one is used in priority.

it can't be there since you cannot create a brand new one. the workaround is to add an entry in a language you don't use (let's say Icelandic) and then rename the .dat file which is created inside the user profile. this is what I did to create the acor_en.dat file and it worked.

what we miss at the moment is the ability to directly create an unlocalized variant of an autocorrect list, since in the language dropdown menu of the replacement table you can only select localized versions of languages like English(UK), English(US), English(Australia) etc. and you don't have the chance to select a plain English language item with no indicated variant.

I think that we should tweak that UI and allow support for variantless languages.

> Again, perhaps I miss something or am wrong since I'm not at home to check.

why don't you download the portable LibreOffice version from WinPenPack?
link is here: http://sourceforge.net/projects/winpenpack/files/X-LibreOffice/releases/

then you can put it on a USB key and bring it with you everywhere.

Comment 23 Julien Nabet 2014-08-20 08:33:40 UTC

There are 2 points to distinguish:
1) In the basic process, there can't be any unlocalized file in user/autocorrect. It seems you may have weird result only if you bypass the process by copying an unlocalized file in it.

2) I understand you'd like to edit the unlocalized file and I didn't try to implement it. Being able to edit it would indeed mean there could be unlocalized files in user/autocorrect (without manually copying). I don't think I'd be able to do it since it means:
- to be able to list unlocalized languages as you said
- prevent conflicts you described when there are localized and unlocalized files (again as you said)
I'm sorry to say I can't help more on this last point :-(

Andras/Michael: any thoughts?

Comment 24 tommy27 2014-08-20 12:46:35 UTC

(In reply to comment #23)
> There are 2 points to distinguish:
> 1) In the basic process, there can't be any unlocalized file in
> user/autocorrect. It seems you may have weird result only if you bypass the
> process by copying an unlocalized file in it.

yes, your fix for Bug 79276 had this side effect and allowed me to manually "hack" the user profile and use an unlocalized autocorrect file inside it.

the weird thing is that this works with some languages (Italian, English) and not with others (French, German). I don't understand why...

> 2) I understand you'd like to edit unlocalized file and I didn't try to
> implement it. Being able to edit it would indeed mean there could be
> unlocalized files in user/autocorrect (without manually copying). I don't
> think I'd be able to do it since it means:
> - to be able to list unlocalized languages as you said,

exactly, that is what I wanted to be implemented when I opened this Bug 44580

> - prevent conflicts you described when there are localized and unlocalized
> files (again as you said)

I think it's up to the user to avoid autocorrect conflicts.
You can already have conflicts using the acor_UND.dat file but it's your fault if you set discordant autocorrect replacements among different lists.

> I'm sorry to tell I can't help more on this last point :-(

you already did a lot.

Comment 25 Alex Thurgood 2015-01-03 17:39:02 UTC

Adding self to CC if not already on

Comment 26 Robinson Tryon (qubit) 2015-12-10 11:40:55 UTC

Migrating Whiteboard tags to Keywords: (easyHack, difficultyBeginner, skillCpp, topicCleanup)

Comment 27 Robinson Tryon (qubit) 2016-02-18 14:52:07 UTC

JanI is default CC for Easy Hacks (Add Jan; remove LibreOffice Dev List from CC)
[NinjaEdit]