80403 – AutoCorrect - parse XML files using XFastParser ...

Bug 80403 - AutoCorrect - parse XML files using XFastParser ...

Summary: AutoCorrect - parse XML files using XFastParser ...

Status:	RESOLVED FIXED

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Linguistic (show other bugs)
Version: (earliest affected)	4.4.0.0.alpha0+ Master
Hardware:	Other All

Importance:	medium normal
Assignee:	d.sikeler94

URL:
Whiteboard:	target:4.4.0 target:4.5.0
Keywords:	difficultyInteresting, easyHack, skillCpp, topicCleanup

Depends on:
Blocks:

Reported:	2014-06-23 14:22 UTC by Michael Meeks
Modified:	2015-12-16 00:37 UTC (History)
CC List:	6 users (show)

See Also:	79761
Crash report or crash signature:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Michael Meeks 2014-06-23 14:22:35 UTC

We should parse our auto-correct XML files using the nice tokenizing XFastParser instead of the significantly slower SAX parser interface. Code is here:

editeng/source/misc/SvXMLAutoCorrectImport.cxx

The XFastParser interface is in offapi, and needs a hash function to tokenise its input; you'd want to look at sax/source/fastparser/fastparser.cxx to see how that works I guess.

This would move the parsing into its own thread.

Then again - perhaps that is too complicated; we could get at least some win by moving to xmlreader/ perhaps.

Thanks !

Comment 1 Michael Meeks 2014-06-27 15:37:06 UTC

Thanks David - it'd be great to get good at XFastParsing =) we need a ninja there to convert the ODF import filters as well ...

The file I got saved was here:

~/.config/libreoffice/4/user/autocorr/acor_en-GB.dat 

with things like:

<block-list:block block-list:abbreviated-name="abotu" block-list:name="about"/><block-list:block block-list:abbreviated-name="abouta" block-list:name="about a"/>

in it etc. :-)

Hopefully it's not too bad to convert.

the XFastParser implementation is in:

sax/source/fastparser/fastparser.cxx

its just one module I think, thought the fast attributes class is in:

sax/source/tools/fastattrib.cxx

I suspect we'll need a (very) trivial hash function to make this work - and here we try to punch through the UNO later - with a native interface.

class SAX_DLLPUBLIC FastTokenHandlerBase

Since the UNO version of this was so slow it destroyed any benefit of tokenizing =)

Comment 2 Michael Meeks 2014-09-25 08:49:08 UTC

Hi David; currently there is a student from Munich working on this easy hack =) ( before you go fix it in a heart-beat ;-). Matus knows who that is ...

Comment 3 DavidO 2014-09-25 20:22:50 UTC

(In reply to comment #2)
> Hi David; currently there is a student from Munich working on this easy hack
> =) ( before you go fix it in a heart-beat ;-). Matus knows who that is ...

Nice ;-) IOW, i shouldn't wok on it?

Comment 4 tommy27 2014-09-26 08:17:13 UTC

(In reply to comment #2)
> Hi David; currently there is a student from Munich working on this easy hack
> =) ( before you go fix it in a heart-beat ;-). Matus knows who that is ...

so I think that he should put his name under the ASSIGNED TO field.

Comment 5 d.sikeler94 2014-09-26 08:21:41 UTC

I'm working on an other easy hack right now, but afterwards I want to start with this as an entry to all the fastparser stuff.

Comment 6 Commit Notification 2014-10-31 08:19:38 UTC

Daniel Sikeler committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=c0a5d390e519603dbc19a38c610d0a114b80cfa1

fdo#80403: Import baseclasses implement FastParser interfaces

It will be available in 4.4.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.

Comment 7 Commit Notification 2014-10-31 08:21:52 UTC

Daniel Sikeler committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=328f861dfbe7086c85dbd0d791c5f18b0714ca75

fdo#80403: AutoCorrect uses XFastParser

It will be available in 4.4.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.

Comment 8 Michael Meeks 2014-10-31 11:40:53 UTC

Wow - thanks Daniel - it looks lovely ...

A couple of comments:

+sal_Int32 SAL_CALL SvXMLAutoCorrectTokenHandler::getTokenFromUTF8( const Sequence< sal_Int8 >& Identifier )
+     throw (::css::uno::RuntimeException, std::exception)
+{
+    switch( Identifier.getLength() )
+    {
+        case 4:     return XML_NAME;
+        case 5:     return XML_BLOCK;
+        case 10:    return XML_BLOCK_LIST;
+        case 16:    return XML_ABBREVIATED_NAME;
+        default:    return css::xml::sax::FastToken::DONTKNOW;
+    }
+}

Really looks as if it could use some string comparisons as well as the (fast) name checks: which are great =) so if (Identifier == "name") return XML_NAME; else return DONTKNOW etc.

Otherwise looks good - did you write a unit test for this (or is there one already ?) =)

Thanks !

Comment 9 d.sikeler94 2014-11-03 06:39:09 UTC

(In reply to Michael Meeks from comment #8)

> Really looks as if it could use some string comparisons as well as the
> (fast) name checks: which are great =) so if (Identifier == "name") return
> XML_NAME; else return DONTKNOW etc.

Sorry, but I don't know what you mean. I thought the improvement is to avoid string comparisons, because they are slower than integer comparisons.

Comment 10 Michael Meeks 2014-11-03 09:34:30 UTC

> > Really looks as if it could use some string comparisons as well as the
> > (fast) name checks: which are great =) so if (Identifier == "name") return
> > XML_NAME; else return DONTKNOW etc.
>
> Sorry, but I don't know what you mean. I thought the improvement is to avoid
> string comparisons, because they are slower than integer comparisons.

The primary win comes from doing the XML Parsing and tokenization in another thread =) so all of that work is "for free" owing to the magic of un-used CPUs on your machine.

But we should not sacrifice correctness for performance:

+sal_Int32 SAL_CALL SvXMLAutoCorrectTokenHandler::getTokenFromUTF8( const Sequence< sal_Int8 >& Identifier )
+     throw (::css::uno::RuntimeException, std::exception)
+{
+    switch( Identifier.getLength() )
+    {
+        case 4:     return XML_NAME;
+        case 5:     return XML_BLOCK;
+        case 10:    return XML_BLOCK_LIST;
+        case 16:    return XML_ABBREVIATED_NAME;
+        default:    return css::xml::sax::FastToken::DONTKNOW;
+    }



So - we do tokenization once inside the XFastParser

Comment 11 Michael Meeks 2014-11-03 09:37:18 UTC

> > Really looks as if it could use some string comparisons as well as the
> > (fast) name checks: which are great =) so if (Identifier == "name") return
> > XML_NAME; else return DONTKNOW etc.
>
> Sorry, but I don't know what you mean. I thought the improvement is to avoid
> string comparisons, because they are slower than integer comparisons.

The primary win comes from doing the XML Parsing and tokenization in another thread =) so all of that work is "for free" owing to the magic of un-used CPUs on your machine.

But we should not sacrifice correctness for performance:

+sal_Int32 SAL_CALL SvXMLAutoCorrectTokenHandler::getTokenFromUTF8( const Sequence< sal_Int8 >& Identifier )
+     throw (::css::uno::RuntimeException, std::exception)
+{
+    switch( Identifier.getLength() )
+    {
+        case 4:     return XML_NAME;
+        case 5:     return XML_BLOCK;
+        case 10:    return XML_BLOCK_LIST;
+        case 16:    return XML_ABBREVIATED_NAME;
+        default:    return css::xml::sax::FastToken::DONTKNOW;
+    }
 
Is just incorrect =) To make it somewhat faster we also have a direct C++ punch-through like this:

/// A native C++ interface to tokenisation
class SAX_DLLPUBLIC FastTokenHandlerBase
{
 public:
    virtual ~FastTokenHandlerBase();
    virtual sal_Int32 getTokenDirect( const char *pToken, sal_Int32 nLength ) const = 0;

So we don't have to duplicate the strings; if you implement that interface on the token lookup interface then we can do a 'strncmp()' on the const char *pToken without some horrible duplicate / copy going on which might also help =)

Thanks !

Comment 12 Commit Notification 2014-11-11 09:37:52 UTC

Daniel Sikeler committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=2b442c1df48fce7a265e983a902f4a30e3edabf1

fdo#80403: TokenHandler impl. FastTokenHandlerBase

It will be available in 4.4.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.

Comment 13 tommy27 2014-11-11 11:59:46 UTC

Hi Daniel, is this supposed to give an additional speed-up?

Comment 14 d.sikeler94 2014-11-11 12:02:41 UTC

(In reply to tommy27 from comment #13)
> Hi Daniel, is this supposed to give an additional speed-up?

No, I don't think so. This were some changes to fix the things, Michael mentioned in comment 8.

Comment 15 tommy27 2014-11-15 07:37:05 UTC

I retested with 4.4.0.0.alpha2+
Build ID: b021b5983c62e266b82d9f0c5c6d8d8900553827
TinderBox: Win-x86@42, Branch:master, Time: 2014-11-12_00:19:18

you are right. speed is the same.
anyway I'm already very happy about the improvement
as said in bug 79761 current times are:

starting autocorrect options
LO 4.3.2.2 --> 5 seconds
LO 4.0.0.0 --> 3 seconds

typing first word and press space

LO 4.3.2.2 --> 4.3 seconds
LO 4.0.0.0 --> 1.5 seconds

there's respectively a 40% and 64% speed up obtained thanks to the XFastParser which is an amazing result.

consider that those tests have been performed on an old laptop and that the residual lag will probably be not significative on newer computers.

great job Daniel.
I wonder if this report can be considered as FIXED or you still have to do some code polishing about it.

Comment 16 Michael Meeks 2014-11-15 08:08:20 UTC

Hi Tommy - any chance you can add your numbers to https://wiki.documentfoundation.org/ReleaseNotes/4.4 for Daniel's line-item there ? - and Daniel - it'd be great if you can close the bug yourself (?) =) then crediting is right there.

Thanks guys !

Comment 17 Pierre C 2014-11-18 20:12:20 UTC

In comment #15 Isn't it 4.3.2.2 vs 4.4.0.0 and not 4.0.0.0

>starting autocorrect options
>LO 4.3.2.2 --> 5 seconds
>LO 4.0.0.0 --> 3 seconds

>typing first word and press space

>LO 4.3.2.2 --> 4.3 seconds
>LO 4.0.0.0 --> 1.5 seconds

Comment 18 tommy27 2014-11-18 20:20:32 UTC

yes you are right
it was a "typing" error

Comment 19 Commit Notification 2014-12-04 09:39:46 UTC

Daniel Sikeler committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=c1e90457d2ecc6e1171b7a296ab8bd05821e39e6

fdo#80403: Writer specific AutoCorr use FastParser

It will be available in 4.5.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.

Comment 20 tommy27 2014-12-04 12:35:40 UTC

Hi Daniel, would you please explain me what's the effect of this committ.

please use "non-developer" language so even a 5 years old can understand :-)

Comment 21 Commit Notification 2014-12-05 13:25:45 UTC

Stephan Bergmann committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=7ef4457b9fc06b5c25c607d7e0149f45c7015b10

Revert "fdo#80403: Writer specific AutoCorr use FastParser"

It will be available in 4.5.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.

Comment 22 Michael Meeks 2014-12-05 16:16:52 UTC

So - re-opening; Daniel - I guess we need 'make check' to pass =) and to merge in the other pieces here. I wonder what the (horrible Java / Junit) sw_unoapi test is catching (if anything) that is not in our unit test.
Can you chase ?

Comment 23 Commit Notification 2014-12-10 18:33:07 UTC

Daniel Sikeler committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=c2c0989794d5a4724da8b8880559764bffd059ec

fdo#80403: Writer specific AutoCorr use FastParser

It will be available in 4.5.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.

Comment 24 d.sikeler94 2014-12-12 11:16:53 UTC

Hi Tommy,

There is a basic auto correction, which works for all applications, and an extension for writer. In writer it is possible to replace some text with a table for example. And therefore we have an other parse, which is now converted to a fastparser.

Comment 25 tommy27 2014-12-12 12:48:30 UTC

Ok, I understand. thanks for pointing this out.
you did an awesome job with autocorrection.
the performance is now drastically improved.

I hope the latest change could be backported to 4.4.x as well.

Comment 26 Robinson Tryon (qubit) 2015-12-16 00:37:32 UTC

Migrating Whiteboard tags to Keywords: (EasyHack DifficultyInteresting SkillCpp TopicCleanup )
[NinjaEdit]