Bug 44736 - FILEOPEN: Loading RTF files is slow in comparison with LibO 3.4.4 (part of closed bug #44157)
Summary: FILEOPEN: Loading RTF files is slow in comparison with LibO 3.4.4 (part of cl...
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer (show other bugs)
Version:
(earliest affected)
3.5.0 Beta2
Hardware: Other All
: medium normal
Assignee: Miklos Vajna
QA Contact:
URL:
Whiteboard: BSA target:3.6.0 target:4.0.0 target:...
Keywords: filter:rtf, regression
: 45826 47396 50691 (view as bug list)
Depends on:
Blocks:
 
Reported: 2012-01-12 10:15 UTC by s-joyemusequna
Modified: 2015-12-17 11:53 UTC (History)
7 users (show)

See Also:


Attachments
zipped file (7Zip) with 3 files with the same content, but different formatting (3.49 KB, application/x-7z-compressed)
2012-01-12 10:15 UTC, s-joyemusequna
Details
Next testcase. 50MB rtf document. Libre Office allocate 450MB RAM. (973.88 KB, application/bzip2)
2012-01-24 11:29 UTC, Pavel Kysilka
Details
Testcase (7zipped). 1.45Mo RTF (15.44 KB, application/x-7z-compressed)
2012-10-24 11:00 UTC, mav
Details

Note You need to log in before you can comment on or make changes to this bug.
Description s-joyemusequna 2012-01-12 10:15:44 UTC
Created attachment 55510 [details]
zipped file (7Zip) with 3 files with the same content, but different formatting

Problem description: Loading RTF files is slow in comparison with LibO 3.4.4.

I attach 3 example files zipped with 7Zip, the content is the same (1000 paragraphs, 421 pages):

1. lorem421pages_unformated.rtf (unformatted)
2. lorem421pages_format_parag.rtf  (paragraphs formatted)
3. lorem421pages_format_parag_and_text.rtf (paragraph and text formatted)

Steps to reproduce:
1. load the files with LibO 3.4.4
2. load the same files with LibO 3.5 Beta 2
3. compare the time needed to load the files

Current behavior:

Average load time until the file is visible in editor
(Word 2007, Lib O3.4.4, LibO 3.5)

File 1: <2 seconds, <4 seconds, <6 seconds 
File 2: <3 seconds, <4 seconds, <10 seconds 
File 3: <3 seconds, <5 seconds, <37 seconds

If the file is heavily formatted, the difference grows bigger and bigger.

Expected behavior:
Load speed nearer to the LibO 3.4.4 speed

Platform (if different from the browser):               
Browser: Mozilla/5.0 (Windows NT 6.0; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1
Comment 1 Miklos Vajna 2012-01-12 10:37:00 UTC
Assign, some profiling will be necessary first to determine where is the bottleneck.

Thanks for the sample, though!
Comment 2 Pavel Kysilka 2012-01-24 11:29:43 UTC
Created attachment 56109 [details]
Next testcase. 50MB rtf document. Libre Office allocate 450MB RAM.
Comment 3 Michael Meeks 2012-02-22 03:39:03 UTC
*** Bug 45826 has been marked as a duplicate of this bug. ***
Comment 4 Miklos Vajna 2012-03-11 14:05:56 UTC
Pavel,

Slow import != import that uses too much memory. Anyway, memory usage has been improved a lot with these commits:

http://cgit.freedesktop.org/libreoffice/core/commit/?id=f32fe9f5012e3ee184e1a1fca6814bee9105d8fb (master)
http://cgit.freedesktop.org/libreoffice/core/commit/?h=libreoffice-3-5&id=9972f86a01969535139bf5a02ea10714d94b51a3 (libreoffice-3-5)

It won't cause any real speedup though, so I'm leaving this bug open.
Comment 5 Miklos Vajna 2012-03-30 06:34:19 UTC
*** Bug 47396 has been marked as a duplicate of this bug. ***
Comment 6 Miklos Vajna 2012-04-21 10:12:24 UTC
I spent a little time on this today, I see two areas where the filter can be optimized:

1) RTF requires the exporters to dump the contents of the style after using the \sN keyword to support readers created before introducing styles. Right not the importer sends this duplicated info to Writer. If one builds the writerfilter module with dbglevel=2, the output of the tokenizer is dumped in xml files under /tmp, and for a test document the size of the xml file is 6,8M for docx, and 36M for RTF. Of course if the algorithm to filter out these duplicated paragraph / character keywords isn't cheap enough, then we don't earn anything.

2) Right now each \foo string is mapped to and int (RTF_FOO in an enum), and this is done with a naive algorithm (the strings are stored in an array and searched sequentially), this could be improved by sorting the array in the tokenizer constructor and then doing a binary search for each keyword.

I implemented the second now, here are the timings I get for a non-debug build using lorem421pages_unformated.rtf (all values are in ms):

- before: 4405, 4402, 4394
- after: 3763, 3496, 3419

If I count properly, that's about 20% win! :)

I won't close the bug yet, though - I want to experiment with the first as well. (I'll push the second in a bit to master.)
Comment 7 Not Assigned 2012-04-21 10:21:50 UTC
Miklos Vajna committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=22eb78b6eee38e11aec32909b6983becb309ce13

fdo#44736 speed up RTF import a bit by sorting keywords
Comment 8 Miklos Vajna 2012-10-15 11:36:32 UTC
*** Bug 50691 has been marked as a duplicate of this bug. ***
Comment 9 mav 2012-10-24 10:56:42 UTC
I can confirm that this bug still remainsin  the 3.5.x and 3.6.1 releases.

I attach an exemple file (kind of calculation report) :
it takes more than 2'30" to open with 3.6.1.2
but it got opened in a few seconds with 3.4.x.

I hope it will help, I can provide other exemple files if needeed.
Comment 10 mav 2012-10-24 11:00:28 UTC
Created attachment 68990 [details]
Testcase (7zipped). 1.45Mo RTF

This kind of RTF report opened in **a few seconds** in 3.4.x
And it get opened in 3.5;x eand 3.6.1x in **a few minutes**
Comment 11 Miklos Vajna 2012-11-24 20:45:24 UTC
So, some progress finally here:

http://cgit.freedesktop.org/libreoffice/core/commit/?id=292422a7dc4fb4b8b3d9d9b90107fd829ff18100

The style idea is still not implemented yet.
Comment 12 Not Assigned 2012-11-25 00:04:09 UTC
Miklos Vajna committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=9f5263c477b82fef5aa9c3e79fb6af92aa049e24

fdo#44736 RTF import: ignore direct formatting which equals to style



The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 13 Miklos Vajna 2012-11-25 00:22:32 UTC
Hi s-joyemusequna,

I now implemented this "ignore direct formatting of styles" idea in master (see above), it seems it causes a quite nice speedup.

Your lorem421pages_format_parag_and_text.rtf was loaded in 20686 ms on my machine yesterday, now it takes 12021 ms. (lorem421pages_format_parag.rtf is 3733 ms, lorem421pages_unformated.rtf is 2701 ms.)

To sum up, marking as resolved, all this will be in 4.0, but nothing easy to backport.

If you have a _specific_ document (that is special in some way) which is still annoyingly slow to import, then feel free to open a separate bug about it, of course.

Thanks,

Miklos
Comment 14 s-joyemusequna 2012-11-30 15:57:43 UTC
Sorry, but it doesn't work for me under Windows. There is progress compared to LO 3.6, but the tested files open *significantly* faster with LO 3.4.5 (or with LO 3.4.4). I tested under Windows XP on several machines and on Windows Vista 64 - same results. I have to reopen this bug "Loading RTF files is slow in comparison with LibO 3.4.4".

Tested files: 
1. lorem421pages_unformated.rtf (unformatted)
2. lorem421pages_format_parag.rtf (paragraphs formatted)
3. lorem421pages_format_parag_and_text.rtf (paragraph and text formatted)
4. MCB1-MCB2_02-04b_cbl.rtf (from this bug, comment 10)
5. 1059_СД_ССР_общ.rtf (from bug 47396, marked as duplicate of this bug)
6. file.rtf (from bug 45826, as duplicate of this bug)
7. Slow_opening_file_1.rtf (from bug 50691, marked as duplicate of this bug)

Average load time until the file is visible in editor, tested with Windows XP: 
LOdev 4.0: Version 4.0.0.0.alpha1+ (Build ID: 679480f3d766afe80e55410ab76b46d48dc7bef) from 2012-11-29)

Results for LO 3.4.5, LO 3.6.3, LOdev 4.0 in seconds:

File 1:   4, 25, 22  (LO 3.4.5 is more then 5 times faster)
File 2:   5, 28, 25  (LO 3.4.5 is 5 times faster)
File 3:   6, 71, 47  (LO 3.4.5 is nearly 8 times faster)
File 4:   8, 114, 114 (unchanged value for LOdev 4; LO 3.4.5 more then 14 times faster)
File 5:   hangs, 7 minutes 42 seconds, 7 minutes 27 seconds (LO 3.4.5 hangs, value for LO 3.3.4 is 1 minute 10 seconds)
File 6:   2, 28, 20 (significantly faster now, but LO 3.4.5 is 10 times faster)
File 7:   2, 54, 14 (much faster now, but LO 3.4.5 is 7 times faster)
Comment 15 Michael Meeks 2012-12-07 10:59:43 UTC
For me File4 takes 18secs or so to load - I built a callgrind profile of that which is quite interesting. It shows that the state management is consuming ~all the time, things like popState's:

    RTFParserState aState(m_aStates.top());

consumes 30bn of the 90bn instructions. It seems we spend a ton of time in std::deque shuffling big chunks of nested state around. Anyhow - having a poke :-)
Comment 16 Michael Meeks 2012-12-07 12:16:56 UTC
Between pushState and popState - we have 67bn of 94bn cycles for file4; and most of that is STL copying funkiness :-) Using a reference for:

    RTFParserState &aState = m_aStates.top();

saves a chunk of load time, but busts a unit test loading a CVE document
Comment 17 Not Assigned 2012-12-07 13:39:29 UTC
Michael Meeks committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=a48e2fd9049797110b3b2505c363557284987ca8

fdo#44736 - convert RTFSprms to a copy-on-write structure.



The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 18 Michael Meeks 2012-12-07 13:42:11 UTC
    This is some rather nice code Miklos hacked up here - a pleasure to work on it.
    I switched the RTFSprms class to use an intrusive pointer, and a copy-on-write pattern; hopefully that makes duplicating the bus-load of state we have in our stack trivially cheap there. I also saved a wasted duplicate of RTFParserState.

    Passes make slowcheck in sw (for me) and looks reasonable => pushed to master, would appreciate review for -4-0 (and perhaps -3-6?) since I'm no expert there.

    Timings afterwards (incidentally this includes app startup):

    No document:
    user	0m1.772s
    sys	0m0.388s
    With document file 4.
    user	0m5.748s
    sys	0m0.308s

    So - it takes around 4 secs to load and render the 1st page of that RTF file down from 16 or so - which seems (to me) reasonable. Better the callgrind trace now looks (mostly) sane - lots of writer internals getting banged on as you'd expect.

    So resolving fixed again: thanks for the report ! :-)
Comment 19 Not Assigned 2012-12-07 20:19:16 UTC
Michael Meeks committed a patch related to this issue.
It has been pushed to "libreoffice-4-0":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=816279ecf7d768110a51accda11ce0037e04068c&g=libreoffice-4-0

fdo#44736 - convert RTFSprms to a copy-on-write structure.


It will be available in LibreOffice 4.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 20 s-joyemusequna 2012-12-09 13:03:17 UTC
OK, it is much better now. File 4 is fantastic, faster than with LO 3.4.5.
File 1, 2, and 3: LO seems to hang for rather long time without visible progress bar.

Tested files: 
1. lorem421pages_unformated.rtf
2. lorem421pages_format_parag.rtf
3. lorem421pages_format_parag_and_text.rtf
4. MCB1-MCB2_02-04b_cbl.rtf (from this bug, comment 10)
5. 1059_СД_ССР_общ.rtf (from bug 47396, marked as duplicate of this bug)
6. file.rtf (from bug 45826, marked as duplicate of this bug)
7. Slow_opening_file_1.rtf (from bug 50691, marked as duplicate of this bug)

Average load time until the file is visible in editor, tested with Windows XP: 
LOdev 4.0: buildname: Win-x86@6, tree: libreoffice-4-0, pull time 2012-12-08 05:20:21)

Results for LO 3.4.5, LO 3.6.3, LOdev 4.0 in seconds:

File 1:   4, 25, 15 [3]*
File 2:   5, 28, 17 [3]* 
File 3:   6, 71, 24 [10]*
File 4:   8, 114, 6 [5-6]* (fantastic !!! faster than LO 3.4.5)
File 5:   hangs, 7min 42 sec, 1min 44sec [1min 38sec]* (LO 3.4.5 hangs, value for LO 3.3.4 is 1min 10sec)
File 6:   2, 28, 12 [10]*
File 7:   2, 54, 8 [7]*

*) progress bar is visible (in seconds)
Comment 21 Michael Meeks 2012-12-20 23:09:20 UTC
The word-count nasties that explain the large delays after RTF loading are tracked in bug#58590
Comment 22 s-joyemusequna 2012-12-22 13:16:52 UTC
Retested with LOdev 4.0 Beta2: buildname: Win-x86@6, tree: libreoffice-4-0, pull time 2012-12-22 02:16:51

Tested files: 
1. lorem421pages_unformated.rtf
2. lorem421pages_format_parag.rtf
3. lorem421pages_format_parag_and_text.rtf
4. MCB1-MCB2_02-04b_cbl.rtf (from this bug, comment 10)
5. 1059_СД_ССР_общ.rtf (from bug 47396, marked as duplicate of this bug)
6. file.rtf (from bug 45826, marked as duplicate of this bug)
7. Slow_opening_file_1.rtf (from bug 50691, marked as duplicate of this bug)

Average load time until the file is visible in editor, tested with Windows XP. 

Results for LO 3.4.5, LO 3.6.3, LOdev 4.0 Beta1, LOdev Beta2 in seconds:

File 1:   4, 25, 15 [3], 6 [3]
File 2:   5, 28, 17 [3], 7 [3] 
File 3:   6, 71, 24 [10], 14 [10]
File 4:   8, 114, 6 [5-6], 6 [5-6]  (same values as beta1)
File 5:   70*, 462, 104 [98], 95 [92] 
File 6:   2, 28, 12 [10], 10 [9]
File 7:   2, 54, 8 [7], 7 [6]

*) LO 3.3.4 (LO 3.4.5 hangs)
[nn] progress bar visibility in seconds
Comment 23 Not Assigned 2013-01-08 22:13:17 UTC
Michael Meeks committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=ee0bf5d58bc59052923c4ced928a989956e71456

fdo#44736 - set and fetch multiple properties concurrently



The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 24 Commit Notification 2013-10-25 15:22:59 UTC
Matuš Kukan committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=986fa38eb23a397546061c3ce0df9077ba334a07

fdo#44736 - set and fetch multiple properties concurrently 2



The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 25 Robinson Tryon (qubit) 2015-12-17 11:53:35 UTC
Migrating Whiteboard tags to Keywords: (filter:rtf)
Replace rtf_filter -> filter:rtf.
[NinjaEdit]