Created attachment 55510 [details] zipped file (7-Zip) with 3 files with the same content, but different formatting
Problem description: Loading RTF files is slow in comparison with LibO 3.4.4. I attach 3 example files zipped with 7-Zip; the content is the same in each (1000 paragraphs, 421 pages):
1. lorem421pages_unformated.rtf (unformatted)
2. lorem421pages_format_parag.rtf (paragraphs formatted)
3. lorem421pages_format_parag_and_text.rtf (paragraphs and text formatted)
Steps to reproduce:
1. Load the files with LibO 3.4.4.
2. Load the same files with LibO 3.5 Beta 2.
3. Compare the time needed to load the files.
Current behavior: Average load time until the file is visible in the editor (Word 2007, LibO 3.4.4, LibO 3.5):
File 1: <2 seconds, <4 seconds, <6 seconds
File 2: <3 seconds, <4 seconds, <10 seconds
File 3: <3 seconds, <5 seconds, <37 seconds
The more heavily a file is formatted, the bigger the difference becomes.
Expected behavior: Load speed closer to that of LibO 3.4.4.
Platform (if different from the browser):
Browser: Mozilla/5.0 (Windows NT 6.0; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1
Assigned; some profiling will be necessary first to determine where the bottleneck is. Thanks for the sample, though!
Created attachment 56109 [details] Next testcase. 50 MB RTF document. LibreOffice allocates 450 MB of RAM.
*** Bug 45826 has been marked as a duplicate of this bug. ***
Pavel, Slow import != import that uses too much memory. Anyway, memory usage has been improved a lot with these commits: http://cgit.freedesktop.org/libreoffice/core/commit/?id=f32fe9f5012e3ee184e1a1fca6814bee9105d8fb (master) http://cgit.freedesktop.org/libreoffice/core/commit/?h=libreoffice-3-5&id=9972f86a01969535139bf5a02ea10714d94b51a3 (libreoffice-3-5) It won't cause any real speedup though, so I'm leaving this bug open.
*** Bug 47396 has been marked as a duplicate of this bug. ***
I spent a little time on this today. I see two areas where the filter can be optimized:
1) RTF requires exporters to dump the contents of the style after using the \sN keyword, to support readers created before styles were introduced. Right now the importer sends this duplicated info to Writer. If one builds the writerfilter module with dbglevel=2, the output of the tokenizer is dumped in XML files under /tmp, and for a test document the size of the XML file is 6.8M for docx and 36M for RTF. Of course, if the algorithm to filter out these duplicated paragraph / character keywords isn't cheap enough, then we don't gain anything.
2) Right now each \foo string is mapped to an int (RTF_FOO in an enum), and this is done with a naive algorithm (the strings are stored in an array and searched sequentially). This could be improved by sorting the array in the tokenizer constructor and then doing a binary search for each keyword.
I implemented the second now; here are the timings I get for a non-debug build using lorem421pages_unformated.rtf (all values are in ms):
- before: 4405, 4402, 4394
- after: 3763, 3496, 3419
If I count properly, that's about a 20% win! :) I won't close the bug yet, though - I want to experiment with the first as well. (I'll push the second to master in a bit.)
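For readers following along, here is a minimal sketch of idea 2) above: sort the keyword table once in the constructor, then look each keyword up with a binary search. This is not the code from the actual commit; the names `KeywordTable`, `RTFSymbol`, `symbolLess` and the trimmed-down enum are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical, trimmed-down keyword ids; the real enum is far larger.
enum RTFKeyword { RTF_UNKNOWN, RTF_B, RTF_I, RTF_PAR, RTF_PARD, RTF_S };

// One table entry: the keyword text (without the backslash) and its id.
struct RTFSymbol
{
    const char* sKeyword;
    RTFKeyword nIndex;
};

// Strict weak ordering by keyword text, shared by sort and lookup.
static bool symbolLess(const RTFSymbol& rLeft, const RTFSymbol& rRight)
{
    return std::strcmp(rLeft.sKeyword, rRight.sKeyword) < 0;
}

class KeywordTable
{
public:
    KeywordTable()
        : m_aSymbols{ { "par", RTF_PAR }, { "b", RTF_B }, { "s", RTF_S },
                      { "pard", RTF_PARD }, { "i", RTF_I } }
    {
        // Sort once up front so that every lookup can use binary search.
        std::sort(m_aSymbols.begin(), m_aSymbols.end(), symbolLess);
    }

    // O(log n) lookup instead of a sequential scan over the whole table.
    RTFKeyword lookup(const char* sKeyword) const
    {
        RTFSymbol aKey{ sKeyword, RTF_UNKNOWN };
        auto it = std::lower_bound(m_aSymbols.begin(), m_aSymbols.end(),
                                   aKey, symbolLess);
        if (it != m_aSymbols.end() && std::strcmp(it->sKeyword, sKeyword) == 0)
            return it->nIndex;
        return RTF_UNKNOWN;
    }

private:
    std::vector<RTFSymbol> m_aSymbols;
};
```

With a table `t`, `t.lookup("pard")` yields `RTF_PARD` and an unknown keyword yields `RTF_UNKNOWN`. The sort cost is paid once per tokenizer, while the lookup runs for every keyword in the document.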
Miklos Vajna committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=22eb78b6eee38e11aec32909b6983becb309ce13 fdo#44736 speed up RTF import a bit by sorting keywords
*** Bug 50691 has been marked as a duplicate of this bug. ***
I can confirm that this bug still remains in the 3.5.x and 3.6.1 releases. I attach an example file (a kind of calculation report): it takes more than 2 min 30 s to open with 3.6.1.2, but it opened in a few seconds with 3.4.x. I hope it will help; I can provide other example files if needed.
Created attachment 68990 [details] Testcase (7-zipped). 1.45 MB RTF. This kind of RTF report opened in **a few seconds** in 3.4.x, but it takes **a few minutes** to open in 3.5.x and 3.6.1.x.
So, some progress finally here: http://cgit.freedesktop.org/libreoffice/core/commit/?id=292422a7dc4fb4b8b3d9d9b90107fd829ff18100 The style idea is still not implemented yet.
Miklos Vajna committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=9f5263c477b82fef5aa9c3e79fb6af92aa049e24 fdo#44736 RTF import: ignore direct formatting which equals to style The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
Hi s-joyemusequna, I now implemented this "ignore direct formatting of styles" idea in master (see above), it seems it causes a quite nice speedup. Your lorem421pages_format_parag_and_text.rtf was loaded in 20686 ms on my machine yesterday, now it takes 12021 ms. (lorem421pages_format_parag.rtf is 3733 ms, lorem421pages_unformated.rtf is 2701 ms.) To sum up, marking as resolved, all this will be in 4.0, but nothing easy to backport. If you have a _specific_ document (that is special in some way) which is still annoyingly slow to import, then feel free to open a separate bug about it, of course. Thanks, Miklos
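To illustrate the "ignore direct formatting which equals the style" idea, here is a minimal sketch under simplifying assumptions: properties are modelled as a plain name-to-integer map, and `Props` and `dropStyleDuplicates` are hypothetical names. The real importer works on its own attribute structures, so treat this only as an outline of the filtering step.

```cpp
#include <cassert>
#include <map>
#include <string>

// Simplified model (assumption): a property set maps a keyword to an int value.
using Props = std::map<std::string, int>;

// Keep only the direct-formatting entries that differ from the style, so the
// duplicated values an RTF exporter dumps after \sN are not forwarded again.
Props dropStyleDuplicates(const Props& rStyle, const Props& rDirect)
{
    Props aResult;
    for (const auto& rProp : rDirect)
    {
        auto it = rStyle.find(rProp.first);
        // Forward the property if the style lacks it or has another value.
        if (it == rStyle.end() || it->second != rProp.second)
            aResult.insert(rProp);
    }
    return aResult;
}
```

For a style `{fs: 24, li: 720}` and direct formatting `{b: 1, fs: 28, li: 720}`, only `b` and `fs` survive. Each check is a single map lookup, which matters because the filtering only pays off if it is cheaper than sending the duplicates on.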
Sorry, but it doesn't work for me under Windows. There is progress compared to LO 3.6, but the tested files open *significantly* faster with LO 3.4.5 (or with LO 3.4.4). I tested under Windows XP on several machines and on Windows Vista 64 - same results. I have to reopen this bug "Loading RTF files is slow in comparison with LibO 3.4.4".
Tested files:
1. lorem421pages_unformated.rtf (unformatted)
2. lorem421pages_format_parag.rtf (paragraphs formatted)
3. lorem421pages_format_parag_and_text.rtf (paragraphs and text formatted)
4. MCB1-MCB2_02-04b_cbl.rtf (from this bug, comment 10)
5. 1059_СД_ССР_общ.rtf (from bug 47396, marked as duplicate of this bug)
6. file.rtf (from bug 45826, marked as duplicate of this bug)
7. Slow_opening_file_1.rtf (from bug 50691, marked as duplicate of this bug)
Average load time until the file is visible in the editor, tested with Windows XP.
LOdev 4.0: Version 4.0.0.0.alpha1+ (Build ID: 679480f3d766afe80e55410ab76b46d48dc7bef, from 2012-11-29)
Results for LO 3.4.5, LO 3.6.3, LOdev 4.0 in seconds:
File 1: 4, 25, 22 (LO 3.4.5 is more than 5 times faster)
File 2: 5, 28, 25 (LO 3.4.5 is 5 times faster)
File 3: 6, 71, 47 (LO 3.4.5 is nearly 8 times faster)
File 4: 8, 114, 114 (unchanged value for LOdev 4; LO 3.4.5 is more than 14 times faster)
File 5: hangs, 7 minutes 42 seconds, 7 minutes 27 seconds (LO 3.4.5 hangs; value for LO 3.3.4 is 1 minute 10 seconds)
File 6: 2, 28, 20 (significantly faster now, but LO 3.4.5 is 10 times faster)
File 7: 2, 54, 14 (much faster now, but LO 3.4.5 is 7 times faster)
For me File4 takes 18secs or so to load - I built a callgrind profile of that which is quite interesting. It shows that the state management is consuming ~all the time, things like popState's: RTFParserState aState(m_aStates.top()); consumes 30bn of the 90bn instructions. It seems we spend a ton of time in std::deque shuffling big chunks of nested state around. Anyhow - having a poke :-)
Between pushState and popState - we have 67bn of 94bn cycles for file4; and most of that is STL copying funkiness :-) Using a reference for: RTFParserState &aState = m_aStates.top(); saves a chunk of load time, but busts a unit test loading a CVE document
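To make the cost difference concrete, here is a small self-contained sketch that counts copy-constructor calls when the top of a state stack is taken by value versus by reference. `ParserState`, `popByCopy` and `popByReference` are illustrative stand-ins, not the real RTFParserState code. The sketch also shows a general C++ constraint on the reference variant: the reference dangles once the element is popped. Whether that is what broke the unit test mentioned above is not something this sketch can say.

```cpp
#include <cassert>
#include <cstddef>
#include <stack>
#include <vector>

// Stand-in for a heavyweight parser state; counts its own deep copies.
struct ParserState
{
    std::vector<int> aAttributes;
    static int nCopies;

    ParserState() = default;
    explicit ParserState(std::size_t nSize) : aAttributes(nSize) {}
    ParserState(const ParserState& rOther) : aAttributes(rOther.aAttributes)
    {
        ++nCopies; // every copy duplicates the whole attribute vector
    }
    ParserState& operator=(const ParserState&) = default;
};
int ParserState::nCopies = 0;

// Copying the top element: safe to use after pop(), but costs a deep copy.
std::size_t popByCopy(std::stack<ParserState>& rStates)
{
    ParserState aState(rStates.top());
    rStates.pop();
    return aState.aAttributes.size();
}

// Referencing the top element: no copy, but the reference dangles after
// pop(), so everything needed must be read beforehand.
std::size_t popByReference(std::stack<ParserState>& rStates)
{
    ParserState& rState = rStates.top();
    std::size_t nSize = rState.aAttributes.size();
    rStates.pop(); // rState must not be touched from here on
    return nSize;
}
```

If the states are created in place with `emplace`, `popByCopy` adds one deep copy per call while `popByReference` adds none.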
Michael Meeks committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=a48e2fd9049797110b3b2505c363557284987ca8 fdo#44736 - convert RTFSprms to a copy-on-write structure. The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
This is some rather nice code Miklos hacked up here - a pleasure to work on it. I switched the RTFSprms class to use an intrusive pointer, and a copy-on-write pattern; hopefully that makes duplicating the bus-load of state we have in our stack trivially cheap there. I also saved a wasted duplicate of RTFParserState. Passes make slowcheck in sw (for me) and looks reasonable => pushed to master, would appreciate review for -4-0 (and perhaps -3-6?) since I'm no expert there. Timings afterwards (incidentally this includes app startup): No document: user 0m1.772s sys 0m0.388s With document file 4. user 0m5.748s sys 0m0.308s So - it takes around 4 secs to load and render the 1st page of that RTF file down from 16 or so - which seems (to me) reasonable. Better the callgrind trace now looks (mostly) sane - lots of writer internals getting banged on as you'd expect. So resolving fixed again: thanks for the report ! :-)
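As a rough illustration of the copy-on-write pattern described here, the sketch below shares one buffer between copies and clones it only when a shared copy is written to. It deliberately swaps in `std::shared_ptr` for the intrusive pointer used in the actual patch, and the class name `Sprms` with its `set`/`get` interface is an assumption for illustration, not the real RTFSprms API. It is also written for single-threaded use only.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

class Sprms
{
    using Data = std::vector<std::pair<int, int>>;

public:
    Sprms() : m_pData(std::make_shared<Data>()) {}

    // Writers detach from shared storage first, then modify their own copy.
    void set(int nId, int nValue)
    {
        ensureUnique();
        for (auto& rEntry : *m_pData)
        {
            if (rEntry.first == nId)
            {
                rEntry.second = nValue;
                return;
            }
        }
        m_pData->emplace_back(nId, nValue);
    }

    // Readers never trigger a copy.
    int get(int nId, int nDefault) const
    {
        for (const auto& rEntry : *m_pData)
            if (rEntry.first == nId)
                return rEntry.second;
        return nDefault;
    }

    bool sharesStorageWith(const Sprms& rOther) const
    {
        return m_pData == rOther.m_pData;
    }

private:
    // Clone the buffer only if another Sprms still points at it.
    void ensureUnique()
    {
        if (m_pData.use_count() > 1)
            m_pData = std::make_shared<Data>(*m_pData);
    }

    std::shared_ptr<Data> m_pData;
};
```

Copying an `Sprms` is then just a pointer copy plus a reference-count increment, which is what makes duplicating a whole stack of states cheap; the deep copy is deferred until one of the copies is modified.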
Michael Meeks committed a patch related to this issue. It has been pushed to "libreoffice-4-0": http://cgit.freedesktop.org/libreoffice/core/commit/?id=816279ecf7d768110a51accda11ce0037e04068c&g=libreoffice-4-0 fdo#44736 - convert RTFSprms to a copy-on-write structure. It will be available in LibreOffice 4.0. The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
OK, it is much better now. File 4 is fantastic, faster than with LO 3.4.5. Files 1, 2, and 3: LO seems to hang for a rather long time without a visible progress bar.
Tested files:
1. lorem421pages_unformated.rtf
2. lorem421pages_format_parag.rtf
3. lorem421pages_format_parag_and_text.rtf
4. MCB1-MCB2_02-04b_cbl.rtf (from this bug, comment 10)
5. 1059_СД_ССР_общ.rtf (from bug 47396, marked as duplicate of this bug)
6. file.rtf (from bug 45826, marked as duplicate of this bug)
7. Slow_opening_file_1.rtf (from bug 50691, marked as duplicate of this bug)
Average load time until the file is visible in the editor, tested with Windows XP.
LOdev 4.0: buildname: Win-x86@6, tree: libreoffice-4-0, pull time 2012-12-08 05:20:21
Results for LO 3.4.5, LO 3.6.3, LOdev 4.0 in seconds:
File 1: 4, 25, 15 [3]*
File 2: 5, 28, 17 [3]*
File 3: 6, 71, 24 [10]*
File 4: 8, 114, 6 [5-6]* (fantastic !!! faster than LO 3.4.5)
File 5: hangs, 7min 42sec, 1min 44sec [1min 38sec]* (LO 3.4.5 hangs; value for LO 3.3.4 is 1min 10sec)
File 6: 2, 28, 12 [10]*
File 7: 2, 54, 8 [7]*
*) time during which the progress bar is visible (in seconds)
The word-count nasties that explain the large delays after RTF loading are tracked in bug#58590
Retested with LOdev 4.0 Beta2: buildname: Win-x86@6, tree: libreoffice-4-0, pull time 2012-12-22 02:16:51
Tested files:
1. lorem421pages_unformated.rtf
2. lorem421pages_format_parag.rtf
3. lorem421pages_format_parag_and_text.rtf
4. MCB1-MCB2_02-04b_cbl.rtf (from this bug, comment 10)
5. 1059_СД_ССР_общ.rtf (from bug 47396, marked as duplicate of this bug)
6. file.rtf (from bug 45826, marked as duplicate of this bug)
7. Slow_opening_file_1.rtf (from bug 50691, marked as duplicate of this bug)
Average load time until the file is visible in the editor, tested with Windows XP.
Results for LO 3.4.5, LO 3.6.3, LOdev 4.0 Beta1, LOdev 4.0 Beta2 in seconds:
File 1: 4, 25, 15 [3], 6 [3]
File 2: 5, 28, 17 [3], 7 [3]
File 3: 6, 71, 24 [10], 14 [10]
File 4: 8, 114, 6 [5-6], 6 [5-6] (same values as Beta1)
File 5: 70*, 462, 104 [98], 95 [92]
File 6: 2, 28, 12 [10], 10 [9]
File 7: 2, 54, 8 [7], 7 [6]
*) LO 3.3.4 (LO 3.4.5 hangs)
[nn] time during which the progress bar is visible, in seconds
Michael Meeks committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=ee0bf5d58bc59052923c4ced928a989956e71456 fdo#44736 - set and fetch multiple properties concurrently The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
Matuš Kukan committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=986fa38eb23a397546061c3ce0df9077ba334a07 fdo#44736 - set and fetch multiple properties concurrently 2 The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
Migrating Whiteboard tags to Keywords: (filter:rtf) Replace rtf_filter -> filter:rtf. [NinjaEdit]