The mail merge wizard currently creates a merged document, even when the user wants to save split files in the end. To know where to split the document, the wizard remembers the start and end page of each part.
This is both slow and fragile:
- To get the page numbers right, the whole merged document is rerendered after each part is inserted (issue i#72820 or commit 62ebbb006) again and again
- This is futile anyway as the wizard allow the merged document to be "personalized", that is: edited. With that, page breaks might be introduced that mess up the remembered page numbers.
It might be worth a try to use UNO_BOOKMARKs instead of page numbers to remember the position:
- with that possibly the merged document only has to be rendered once, at the end of the merge and not once per added document part, which should help performance (quite a lot actually)
- UNO_BOOKMARKs should move along even as pages are inserted/removed. While not perfect, they are certainly less fragile than hard page indices.
Created attachment 102189 [details]
hot spot function tree of a complex, 3000 documents mail merge via UNO.
This is from my initial callgrind of a mail merge of 3000 documents with our real, primary document template, where the examples documents from #79067 are based on.
The callgrind has 4.5 MB and I don't have a place to store it. Probably someone can offer storage ;-) So the callgrind would "just" be a reference, I can't supply the documents, but it shows the hot spots. The callgrind ran almost a whole weekend.
FYI: the document has 19 "special frames", that's why mail merge will stop after 3450 documents (16bit index used for the vector). Me / a collegue will work on a patch for that, but it's currently not our primary concern, as MM produces broken results for various examples documents. Fixes for most are in private/jmux/mailmerge-fixes.
There are various patches related to MM perforamnce in private/jmux/sorted-pagedesc+spzfrmfmts
= The hot spots =
This algorithm was probably written for a linked list, where removing the first element is cheap. For vectors it's ridiculously expensive, especially for vectors with 65000 objects...
Working fix in private/jmux/sorted-pagedesc+spzfrmfmts
These code has two problems:
2.1 Creates a lot of temporary OUStrings (billions!) when doing atoi.
- Actually most time is spend allocation and freeing the substrings!
- Fix 1: directly try to convert the tail.
2.2 Has to check all entries in SpzFrmFmts. Most are invalid.
- Fix 2: Sorting this array will help it a lot.
- Fix 3: Storing the first hole when deleteing would even speed it up.
Broken fix 1 + 2 are in private/jmux/sorted-pagedesc+spzfrmfmts
With thsese patches, both hot spots are gone.
= TODO =
Undo for (vector) lists is implemented by storing the current position. For sorted lists undo has to store the index of the new object. Not really hard to fix, but I need more knowledge of the undo code to finally fix it.
Even for my 3000 pages doc, CalcLayout is far from any hot spot (MergeNew = 55%, CalcLayout 2.5%).
I didn't yet look at the CalcLayout, but I hope it's possible to change the code to add an SwNode index, to tell it to just "render" the appended SwNodes (nodes after the index).
But yeah - actually the number of pages could just be taken from the current working copy, but I guess this would confuse the copy of page anchored frames.
I had to add an additional CalcLayout to the new SwDoc::Append code as a workaround for my "delete first node" problem, as I was visually loosing a page anchored node in the displayed document, while the saved document was correct. This is just a workaround in my branch, until I find the real problem in the drawing layer.
Created attachment 102475 [details]
3000 documents MM callgrind, part 1
Created attachment 102476 [details]
3000 documents MM callgrind, part 2
(In reply to comment #2)
> Even for my 3000 pages doc, CalcLayout is far from any hot spot (MergeNew =
> 55%, CalcLayout 2.5%).
Well, I callgrinded too, but with an artificial doc (just some plain one with a few enumerations and numberings), and saw CalcLayout on top there.
Ultimately though, doing _any_ layout work with each insertion is broken by design and has to be very slow for large merges. So killing dependencies on that is needed anyway, for to have a mail merge at some point that can do the layout _once_ at the end of the merge. As such, not using hardcoded page numbers is a good starting point IMHO.
I think your callgrind is very likely for simple documents without headers, footers and tables. I really would like to prevent layouting the whole document, when we just append some pages.
We need the page numbers for page-anchored flys. The internal structure uses the absolute page number as the reference. So if you have a MM document with 3 pages and a page anhored Fly on the second page, the anchor is bound to page 6 (3 pages + double side hidden empty page + 2). And to know, if the layout has still 3 pages after replacing the page numbers, you have to run the layouting.
Luboš Luňák committed a patch related to this issue.
It has been pushed to "master":
use bookmarks to mark mailmerge parts in a mailmerge document (fdo#80823)
It will be available in 4.4.0.
The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
Affected users are encouraged to test the fix and report feedback.
I've implemented using UNO bookmarks instead of page numbers in master, so let's close this.