------------------ INTRODUCTION ------------------ This bug report is the twin of OOo issue 87672 ( https://issues.apache.org/ooo/show_bug.cgi?id=87672 ) which I reported in March 2008 in OOo 2.4 (gosh!!! 5 years!!! times runs fast!!!). for this reason I suggest not to change the bug title which is the same. ------------------ SUMMARY ------------------ autocorrect databases can store only 65534 per language if you reach 65535 limit the file crashes and the database is erased. so there are 2 problems 1- autocorrect entries number is limited 2- loss of data when that limit is reached ------------------ BACKGROUND ------------------ Some informations to get into it: 1- OOo/LibO autocorrect function is based on database of couple of entries (bad spelling --> correct spelling) collected in an .xml file called DocumentList.xml which is stored inside an acor_.dat file (basically it's a .zip with .dat extension) under this Windows path ...User\LibreOffice 3\user\autocorr 2- there's one global .dat file (acor_.dat) which applies autocorrect items in any languge and a single .dat file for each single language and variant (i.e. acor_en-GB.dat for british english, acor_en-US.dat foramerican english, acor_it-IT.dat for italian etc. etc.) which applies autocorrect items only if you are writing in that particular language 3- the DocumentList.xml file containing all the entries is based on a 16-bit structure so it has an upper limit of 2^16 = 65 536 4- for some reasons if you get close to that limit, with entry number 65 535, the .xml file “implodes” and all the database is erased with loss of data. ------------------ POSSIBLE SOLUTIONS ------------------ 1- rise the upper limit from 2^16 to higher value... if I remember correctly once Calc had the same 2^16 limit for cell numbers which was upgraded to 2^20 = 1 048 576. Could something similar be done for autocorrect as well? 2- allow use of multiple .dat files for the same language (i.e. acor_en-GB1.dat, acor_en-GB2.dat, acor_en-GB3.dat etc. etc.) each one with the 65K limit. Once one is full, you can move to another just as it already happens with custom dictionaries... you can have more than one of those indeed. ------------------ SOURCE CODE ------------------ the autocorrect function is coded into the svxcorr.cxx file http://docs.libreoffice.org/editeng/html/svxacorr_8cxx_source.html ------------------ TEST FILES ------------------ the .zip contains the acor_it-IT.dat and documentlist.xml files that crash in OOo/LibO once you add another autocorrect entry https://issues.apache.org/ooo/attachment.cgi?id=52416
setting this as NEW since this issue was already confirmed in the OOo era and it's still reproducible in LibO (tested either on 3.5.2 and master~2012-04-26_21.37.10_LibO-Dev_3.6.0alpha0_Win_x86_install_en-US)
Hopefully reasonably easy to fix; needs some testing though; the code to poke at is in: cui/source/tabpages/autocdlg.cxx - for the editing UI + quite possibly there are problems with the list widget here wrt. longer lists. editeng/inc/editeng/svxacorr.hxx - is a good place to start reading for the underlying structures Of course bumping this list up will potentially create some nasty performance issues around the list code & so on, but - we'll have to deal with those too I imagine.
I'm not a developer, and I have no C++ knowledge, so I could have misunderstood some things, however as far as I see the issue is due to the fact that the documentlist.xml containg all entries is a 16-bit structure, hence the limit is 65535. if it could be adapted to 32-bit the limit would be much higher... looking at LibO 4 development page ( http://wiki.documentfoundation.org/Development/LibreOffice4 ) I see this: >>> consider moving all 16-Bit integer types to 32-Bit at least >>> this cant be done in bulk as the are still a lot of SAL_MAX_INT16-2 style >>> magic numbers in the applications, so needs a very careful evaluation >>> -Bjoern-michaelsen 2012-07-29T17:09:34 (CEST) and looking inside svxacorr.cxx I see a lot of "sal_uInt16 nLang" does this make any sense to you?
Since I'm no developer but I'm very interest in a fix for this bug, I decided to donate money to support the devs. I have just sent a 50 euros wire bank transfer to the TDF bank account (IBAN: DE12666900000003497390, BIC: VBPFDE66) check your account and you will see a 50 euros transfer with this infos: FUNDING FOR LIBO BUG 48729 - HTTPS://BUGS.FREEDESKTOP.ORG/SHOW_BUG.CGI?ID=48729 I will also donate another 50 euros to the personal bank account of the hacker who will finally fix this. The bug has 2 main part: 1- autocorrect entries is limit to 16^2 --> 65535 2- once the limit is reached the database gets erased the most important thing for me is actually to raise the number of entries (from 16^2 to 20^2 or 32^2) since I need to add more autocorrect terms in my LibreOffice. just fixing the database erase without raising that number of available entries would not help me, since my priority is to expand the autocorrect database capacity thank you, TOMMY
Hi Tommy; thanks for your donation. TDF does not itself do development work for money, although many individuals and companies in our ecosystem do work that way. We appreciate your support & interest, and hopefully someone will pick up this easy hack at some stage ! Thanks you for your generosity.
Hi, I like the fact that you made a donation and offered money to get this resolved instead of bitching about it. This problem doesn't sound like it is hard to solve and it sound interesting so I will take a look at it today. I don't need the money tough. Regards, Tomaž
@Tomaz thanks for taking care of it. as I said in the first comment, froma a non-developer point of view I actually see 2 possible ways to fix it. 1- rise the upper limit from 2^16 to higher value. if I remember correctly once Calc had the same 2^16 limit for cell numbers which was upgraded to 2^20 = 1 048 576. 2- allow use of multiple .dat files for the same language (i.e. acor_en-GB1.dat, acor_en-GB2.dat, acor_en-GB3.dat etc. etc.) each one with the 65K limit. Once one is full, you can move to another just as it already happens with custom dictionaries... you can have more than one of those indeed. so, I expect that from a developer point of view you could find out which is the easiest way to hack it up, and which one will have less impact on performance
Tomaž Vajngerl committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=78a39502de36c32d7aaf6e5bbde1f8df80fdd21f fdo#48729 Change int16->int32 in auto replace dialog. The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
Hi, I think I found the problem. Everything was already using dynamic lists/sets and all the nice things but the code in the dialog still used sal_uInt16 variables for position and iteration. I have commited a fix into the master so it will be available in the next daily build. Tommy please test if it is working now. Another problem is that it is slow as hell to add/edit entries in the autocorrect table - this could still be optimized as I don't see a reason it is so slow when there are only 65k entries. Best regards, Tomaž
(In reply to comment #9) > Hi, > > I think I found the problem. > ... snip ... >I have commited a fix into the master so it will be available > in the next daily build. Tommy please test if it is working now. WOW!!! how fast have you been!!! so if I get it right, being a Windows user, I shall download a new master build in the next couple of days from here: http://dev-builds.libreoffice.org/daily/Win-x86@6/ I'm eager to test the new 32-bit enabled autocorrect function and report if it works with larger entries database and if such higher capacity has any effect on performance. > Another problem is that it is slow as hell to add/edit entries in the > autocorrect table - this could still be optimized as I don't see a reason it is > so slow when there are only 65k entries. this is a know problem, already reported by me as Bug 49350. a backtrace for this bug is available and Micheal Meeks labeled it as Easy Hacks suggesting a way to obtain a 10x speedup for it > Best regards, Tomaž thank you again very much, Tommy
(In reply to comment #9) > I think I found the problem. Everything was already using dynamic lists/sets > and all the nice things but the code in the dialog still used sal_uInt16 > variables for position and iteration. Because the underlying classes use xub_StrLen everywhere, which is in fact sal_uInt16. IMHO you need to change those, too.
@Tomaz - first thanks for doing this quickly :) Now my concern....I think this sets a terrible precedent, someone offers cash and even though it was declined, a patch is made right away. Might something we need to discuss going forward :) @reporter - thanks for reporting, donating, and being a champion for this issue, I'm glad that it looks resolved for you :)
(In reply to comment #11) > Because the underlying classes use xub_StrLen everywhere, which is in fact > sal_uInt16. IMHO you need to change those, too. Hm... OK I will have to go through the code again today.
Hi Joel, > Now my concern....I think this sets a terrible precedent, someone offers cash > and even though it was declined, a patch is made right away. Might something > we need to discuss going forward :) Not sure I think it's a terrible precedent; it's just a micro-payments version of what larger consultancies are doing I think; of course bugzilla is not a terribly elegant tool for that - but lets discuss at the ESC later. ATB, Michael.
> so if I get it right, being a Windows user, I shall download a new master build > in the next couple of days from here: > http://dev-builds.libreoffice.org/daily/Win-x86@6/ > > I'm eager to test the new 32-bit enabled autocorrect function and report if it > works with larger entries database and if such higher capacity has any effect > on performance. The fix is included in: http://dev-builds.libreoffice.org/daily/Win-x86@6/master/2012-09-19_20.37.35/master~2012-09-19_20.37.35_LibO-Dev_3.7.0.0.alpha0_Win_x86_install_en-US.msi (and any later daily build)
@Tomaz positively tested your fix on master-2012-09-19_20.37.35/ I merged two 65K autocorrect database to create a 130K database. LibO can now handle this enlarged acor.dat files without crashing and loosing all data as happened before with the 16bit limit. adding new entries by "right click autocorrect suggestions" is still fast and unaffected, while adding new entries in the "autocorrect replacement table" is even slower than before because of the higher number of entries. as I said before that is Bug 49350, which I happily see you have already tried to fix with a new patch. Once a new master build is available I'll test it as well and give my feedback. great job!!! thank you very much!!!
setting this as RESOLVED FIXED in master (future LibO 3.7). I'll take the time to do some more testing about performance and then I'll ask cherry-pick in LibO 3.6.x thanks again Tomaz for your wonderful work.
ok, my test say that Tomaz's patch has no serious side effects. so I'm asking him to undergo review on the dev-list for cherry-picking in LibO 3.6.3
LibO 4.0 is imminent (february 2013) however I think this easy-hack fix is worth backporting to the 3.6.x branch which still has 2 programmed micro-releases (.5 and .6 which are scheduled in february and april 2013). the expanded capacity of autocorrect database (from 65K to 4M) IMHO is a great added value in LibO towards OOo/AOO which still suffer the 16bit limit.
Migrating Whiteboard tags to Keywords: (EasyHack,DifficultyBeginner,SkillCpp,TopicCleanup) [NinjaEdit]