We should parse our auto-correct XML files using the nice tokenizing XFastParser instead of the significantly slower SAX parser interface. Code is here: editeng/source/misc/SvXMLAutoCorrectImport.cxx The XFastParser interface is in offapi, and needs a hash function to tokenise its input; you'd want to look at sax/source/fastparser/fastparser.cxx to see how that works I guess. This would move the parsing into its own thread. Then again - perhaps that is too complicated; we could get at least some win by moving to xmlreader/ perhaps. Thanks !
Thanks David - it'd be great to get good at XFastParsing =) we need a ninja there to convert the ODF import filters as well ... The file I got saved was here: ~/.config/libreoffice/4/user/autocorr/acor_en-GB.dat with things like: <block-list:block block-list:abbreviated-name="abotu" block-list:name="about"/><block-list:block block-list:abbreviated-name="abouta" block-list:name="about a"/> in it etc. :-) Hopefully it's not too bad to convert. the XFastParser implementation is in: sax/source/fastparser/fastparser.cxx its just one module I think, thought the fast attributes class is in: sax/source/tools/fastattrib.cxx I suspect we'll need a (very) trivial hash function to make this work - and here we try to punch through the UNO later - with a native interface. class SAX_DLLPUBLIC FastTokenHandlerBase Since the UNO version of this was so slow it destroyed any benefit of tokenizing =)
Hi David; currently there is a student from Munich working on this easy hack =) ( before you go fix it in a heart-beat ;-). Matus knows who that is ...
(In reply to comment #2) > Hi David; currently there is a student from Munich working on this easy hack > =) ( before you go fix it in a heart-beat ;-). Matus knows who that is ... Nice ;-) IOW, i shouldn't wok on it?
(In reply to comment #2) > Hi David; currently there is a student from Munich working on this easy hack > =) ( before you go fix it in a heart-beat ;-). Matus knows who that is ... so I think that he should put his name under the ASSIGNED TO field.
I'm working on an other easy hack right now, but afterwards I want to start with this as an entry to all the fastparser stuff.
Daniel Sikeler committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=c0a5d390e519603dbc19a38c610d0a114b80cfa1 fdo#80403: Import baseclasses implement FastParser interfaces It will be available in 4.4.0. The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
Daniel Sikeler committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=328f861dfbe7086c85dbd0d791c5f18b0714ca75 fdo#80403: AutoCorrect uses XFastParser It will be available in 4.4.0. The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
Wow - thanks Daniel - it looks lovely ... A couple of comments: +sal_Int32 SAL_CALL SvXMLAutoCorrectTokenHandler::getTokenFromUTF8( const Sequence< sal_Int8 >& Identifier ) + throw (::css::uno::RuntimeException, std::exception) +{ + switch( Identifier.getLength() ) + { + case 4: return XML_NAME; + case 5: return XML_BLOCK; + case 10: return XML_BLOCK_LIST; + case 16: return XML_ABBREVIATED_NAME; + default: return css::xml::sax::FastToken::DONTKNOW; + } +} Really looks as if it could use some string comparisons as well as the (fast) name checks: which are great =) so if (Identifier == "name") return XML_NAME; else return DONTKNOW etc. Otherwise looks good - did you write a unit test for this (or is there one already ?) =) Thanks !
(In reply to Michael Meeks from comment #8) > Really looks as if it could use some string comparisons as well as the > (fast) name checks: which are great =) so if (Identifier == "name") return > XML_NAME; else return DONTKNOW etc. Sorry, but I don't know what you mean. I thought the improvement is to avoid string comparisons, because they are slower than integer comparisons.
> > Really looks as if it could use some string comparisons as well as the > > (fast) name checks: which are great =) so if (Identifier == "name") return > > XML_NAME; else return DONTKNOW etc. > > Sorry, but I don't know what you mean. I thought the improvement is to avoid > string comparisons, because they are slower than integer comparisons. The primary win comes from doing the XML Parsing and tokenization in another thread =) so all of that work is "for free" owing to the magic of un-used CPUs on your machine. But we should not sacrifice correctness for performance: +sal_Int32 SAL_CALL SvXMLAutoCorrectTokenHandler::getTokenFromUTF8( const Sequence< sal_Int8 >& Identifier ) + throw (::css::uno::RuntimeException, std::exception) +{ + switch( Identifier.getLength() ) + { + case 4: return XML_NAME; + case 5: return XML_BLOCK; + case 10: return XML_BLOCK_LIST; + case 16: return XML_ABBREVIATED_NAME; + default: return css::xml::sax::FastToken::DONTKNOW; + } So - we do tokenization once inside the XFastParser
> > Really looks as if it could use some string comparisons as well as the > > (fast) name checks: which are great =) so if (Identifier == "name") return > > XML_NAME; else return DONTKNOW etc. > > Sorry, but I don't know what you mean. I thought the improvement is to avoid > string comparisons, because they are slower than integer comparisons. The primary win comes from doing the XML Parsing and tokenization in another thread =) so all of that work is "for free" owing to the magic of un-used CPUs on your machine. But we should not sacrifice correctness for performance: +sal_Int32 SAL_CALL SvXMLAutoCorrectTokenHandler::getTokenFromUTF8( const Sequence< sal_Int8 >& Identifier ) + throw (::css::uno::RuntimeException, std::exception) +{ + switch( Identifier.getLength() ) + { + case 4: return XML_NAME; + case 5: return XML_BLOCK; + case 10: return XML_BLOCK_LIST; + case 16: return XML_ABBREVIATED_NAME; + default: return css::xml::sax::FastToken::DONTKNOW; + } Is just incorrect =) To make it somewhat faster we also have a direct C++ punch-through like this: /// A native C++ interface to tokenisation class SAX_DLLPUBLIC FastTokenHandlerBase { public: virtual ~FastTokenHandlerBase(); virtual sal_Int32 getTokenDirect( const char *pToken, sal_Int32 nLength ) const = 0; So we don't have to duplicate the strings; if you implement that interface on the token lookup interface then we can do a 'strncmp()' on the const char *pToken without some horrible duplicate / copy going on which might also help =) Thanks !
Daniel Sikeler committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=2b442c1df48fce7a265e983a902f4a30e3edabf1 fdo#80403: TokenHandler impl. FastTokenHandlerBase It will be available in 4.4.0. The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
Hi Daniel, is this supposed to give an additional speed-up?
(In reply to tommy27 from comment #13) > Hi Daniel, is this supposed to give an additional speed-up? No, I don't think so. This were some changes to fix the things, Michael mentioned in comment 8.
I retested with 4.4.0.0.alpha2+ Build ID: b021b5983c62e266b82d9f0c5c6d8d8900553827 TinderBox: Win-x86@42, Branch:master, Time: 2014-11-12_00:19:18 you are right. speed is the same. anyway I'm already very happy about the improvement as said in bug 79761 current times are: starting autocorrect options LO 4.3.2.2 --> 5 seconds LO 4.0.0.0 --> 3 seconds typing first word and press space LO 4.3.2.2 --> 4.3 seconds LO 4.0.0.0 --> 1.5 seconds there's respectively a 40% and 64% speed up obtained thanks to the XFastParser which is an amazing result. consider that those tests have been performed on an old laptop and that the residual lag will probably be not significative on newer computers. great job Daniel. I wonder if this report can be considered as FIXED or you still have to do some code polishing about it.
Hi Tommy - any chance you can add your numbers to https://wiki.documentfoundation.org/ReleaseNotes/4.4 for Daniel's line-item there ? - and Daniel - it'd be great if you can close the bug yourself (?) =) then crediting is right there. Thanks guys !
In comment #15 Isn't it 4.3.2.2 vs 4.4.0.0 and not 4.0.0.0 >starting autocorrect options >LO 4.3.2.2 --> 5 seconds >LO 4.0.0.0 --> 3 seconds >typing first word and press space >LO 4.3.2.2 --> 4.3 seconds >LO 4.0.0.0 --> 1.5 seconds
yes you are right it was a "typing" error
Daniel Sikeler committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=c1e90457d2ecc6e1171b7a296ab8bd05821e39e6 fdo#80403: Writer specific AutoCorr use FastParser It will be available in 4.5.0. The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
Hi Daniel, would you please explain me what's the effect of this committ. please use "non-developer" language so even a 5 years old can understand :-)
Stephan Bergmann committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=7ef4457b9fc06b5c25c607d7e0149f45c7015b10 Revert "fdo#80403: Writer specific AutoCorr use FastParser" It will be available in 4.5.0. The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
So - re-opening; Daniel - I guess we need 'make check' to pass =) and to merge in the other pieces here. I wonder what the (horrible Java / Junit) sw_unoapi test is catching (if anything) that is not in our unit test. Can you chase ?
Daniel Sikeler committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=c2c0989794d5a4724da8b8880559764bffd059ec fdo#80403: Writer specific AutoCorr use FastParser It will be available in 4.5.0. The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
Hi Tommy, There is a basic auto correction, which works for all applications, and an extension for writer. In writer it is possible to replace some text with a table for example. And therefore we have an other parse, which is now converted to a fastparser.
Ok, I understand. thanks for pointing this out. you did an awesome job with autocorrection. the performance is now drastically improved. I hope the latest change could be backported to 4.4.x as well.
Migrating Whiteboard tags to Keywords: (EasyHack DifficultyInteresting SkillCpp TopicCleanup ) [NinjaEdit]