subsequentcheck sometimes crashes in xmloff.Impress.XMLContentImporter::com::sun::star::document::XImporter
steps to reproduce:
echo "-o xmloff.Impress.XMLContentImporter" > qa/unoapi/xmloff.sce
echo > qa/unoapi/knownissues.xcl << EOF
R=T; while test "$R" = "T"; do make subsequentcheck || R=F; done
test passes without a crash
attaching backtrace and log
Created attachment 63332 [details]
Created attachment 63334 [details]
@Thorsten: Backtrace looks like something messing up during document teardown. Do you have any suspicion/hint which one of the many destructors might be going wrong there?
Created attachment 63336 [details]
stacktrace with all threads
in case it helps: all the other Impress testcases in xmloff trigger this too, so the bug description is just there to get a reproducible minimal testcase.
looks suspicious: erase() invalidates iterators, yet it is used in a loop
I'm not sure, but:
"erase" does invalidate the iterator in the for loop.
It then breaks, so we leave the inner for loop but stay in the outer do loop (since bLinkRemoved = true), where the iLink iterator is recreated and reinitialized with begin().
In brief: yes, "erase" invalidates iLink, but iLink is valid again before its next use.
Perhaps I'm missing something obvious, though.
236 bLinkRemoved = false;
237 LinkMap::iterator iLink;
238 for (iLink=mpLinks->begin(); iLink!=mpLinks->end(); ++iLink)
240 if (iLink->second.mpTargetWindow == pWindow)
244 bLinkRemoved = true;
249 while (bLinkRemoved);
Julien: no, you are right.
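The loop discussed above can be sketched as follows. This is a minimal, self-contained illustration and not the LibreOffice source: `Link`, `LinkMap`, `removeLinksRestart` and `removeLinksSinglePass` are hypothetical names, `std::unordered_multimap` is assumed as the container, and an `int` stands in for the window pointer. It shows why restarting from `begin()` after each `erase()` never uses an invalidated iterator, next to the more common form that continues from the iterator `erase()` returns.

```cpp
#include <cstddef>
#include <unordered_map>

// Hypothetical stand-ins for the types in the excerpt above.
struct Link { int targetWindow; };
using LinkMap = std::unordered_multimap<int, Link>;

// Erase-and-restart: after erase() the old iterator is dead, so we break
// out of the for loop and the do loop starts again from begin().
std::size_t removeLinksRestart(LinkMap& links, int window) {
    std::size_t removed = 0;
    bool linkRemoved;
    do {
        linkRemoved = false;
        for (auto it = links.begin(); it != links.end(); ++it) {
            if (it->second.targetWindow == window) {
                links.erase(it);
                linkRemoved = true;
                ++removed;
                break;  // 'it' is invalid now; it must not be incremented
            }
        }
    } while (linkRemoved);
    return removed;
}

// Single pass: erase() returns the iterator following the erased element,
// so the loop can continue without restarting.
std::size_t removeLinksSinglePass(LinkMap& links, int window) {
    std::size_t removed = 0;
    for (auto it = links.begin(); it != links.end(); ) {
        if (it->second.targetWindow == window) {
            it = links.erase(it);
            ++removed;
        } else {
            ++it;
        }
    }
    return removed;
}
```

Both functions leave the container in the same state; the restart form is just quadratic in the worst case.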
Created attachment 63364 [details]
debug (vcl,sd) stacktrace
nasty: this one seems to be there only on gcc-4.7. Might still be our bug though.
Created attachment 63401 [details]
On a PC running Debian x86-64, with master sources updated today, I reproduced the bug. In my case it failed every time, not just sometimes.
I followed this link to try to debug :
but I didn't understand how it worked :
- I couldn't switch off TUI with "C-x a" (or perhaps I misread it; I tried "Ctrl+x" then "a")
- the LO Start Center launched, but nothing happened after that
So I stopped everything with Ctrl-c
For information, here are some config elements :
Linux kernel : 3.2.0-2-amd64
gcc (Debian 4.6.3-1) 4.6.3
ldd (Debian EGLIBC 2.13-33) 2.13
java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.1) (6b24-1.11.1-6)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
I can *not* reproduce this with Ubuntu/Linaro 4.6.3-5ubuntu2 on Ubuntu Quantal, but I can with Ubuntu/Linaro 4.7.0-13ubuntu1 on Ubuntu Quantal.
Created attachment 63427 [details]
I cannot reproduce it here, but the attached patch might help
Created attachment 63452 [details]
stacktrace after applying patch
patch does not seem to help, see attached stacktrace.
oh, great. Just had this on a pure gcc 4.6 build.
Since I have not ever seen this on LibreOffice 3.5, I am assuming this to be a regression.
we have similar crashes in rhbz for LO 3.5 Fedora packages,
though i think we closed them as they were not reproducible.
the 3.5 packages are for Fedora 17, which uses GCC 4.7.
so it seems this doesn't just affect tests, but real users also.
yes, this will affect endusers as it hits on closing an impress doc.
as for reproducibility: yes, this is a heisenbug to make things interesting. with looping the unoapi test, it can be reproduced after a few iterations. Maybe a race condition, or something other funky?
As said above I can reproduce this on both 4.7.1-2ubuntu1 (SVN 20120623/r188906) and 4.6.3-8ubuntu1 (SVN 20120624/r188916) on Ubuntu quantal, but so far have not reproduced this on 4.6.3-1ubuntu5 (only a few selected backports).
So it is either a) a change in gcc between 4.6.3 and SVN r188916, b) boost: 1.48 on precise (not reproducible) vs. 1.49 on quantal (reproducible), or c) something else changing in the toolchain.
Hints for candidates of the "something else" kind are most welcome.
Hmmm, seems I'm too stupid to read .spec files:
has a BuildRequires boost-devel but no --with-system-libs or --with-system-boost, so I don't know if you are building with internal boost 1.44 or F17's boost 1.48.
But either way, this seems to suggest our boost update (1.48->1.49) is innocent.
Also note that I didn't recompile all build deps with 4.6 on quantal.
Yep, that seems to be the root cause.
Brutally forcing CXX0X off with:
helps a lot. xmloff unoapi subsequentcheck surviving >20 iterations and counting ...
So I guess we need to get rid of "autodetecting" CXX0X and make it an explicit option at least (to be activated once the distros have moved all system libs over to CXX0X -- likely in one big incompatible ABI step).
meh, saw the crasher again. But that might not be our (LibreOffice) error, but that of one of the system packages (or of our packages having their own build, like libwp*) using the crappy and useless CXX0X ABI (useless as it is even incompatible with itself between 4.6 and 4.7, and might continue to be so).
So we need to make sure all those do *not* use --std=..cxx0x or use our internal version.
just rechecked on Ubuntu 12.04 LTS precise: 1282 full runs of xmloff_unoapi without one hiccup. So yeah, something compiled with cxx0x did creep into one or more of our deps in Ubuntu 12.10 quantal. The fun is to find out what.
Seems we are dodging the bullet for now:
So - IMHO this is not a libreoffice bug :-) and should be closed NOTOURBUG ...
Of course, if we can add a configure check to catch systems that are compiled with an older version of libstdc++ or somesuch that'd be great - we could prolly compile a small file that did some sizeof() checks in configure.
But hopefully the issue has gone away...
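A probe of the kind suggested above might look like the following sketch. It is an assumption about what such a check could be, not an existing configure test: the hypothetical function `containerSizeReport` lists the `sizeof` of a few standard containers, so that building it under two configurations (for example with and without `-std=c++0x`) and comparing the reports would reveal a layout difference. The choice of containers is arbitrary, and the code is kept to C++98 so that it compiles either way.

```cpp
#include <list>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical probe: one "type size" line per container. A differing
// line between two builds hints at an incompatible object layout.
std::string containerSizeReport() {
    std::ostringstream out;
    out << "std::list<int> " << sizeof(std::list<int>) << '\n'
        << "std::vector<int> " << sizeof(std::vector<int>) << '\n'
        << "std::map<int,int> " << sizeof(std::map<int, int>) << '\n'
        << "std::string " << sizeof(std::string) << '\n';
    return out.str();
}
```

A size check like this can only catch layout changes that alter the object size; it says nothing about members that were reordered or reinterpreted.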
So, I did some painful research on this:
On Ubuntu precise (build and run) the bug is not there.
On Ubuntu quantal with gcc 4.7 the bug is there even after fixing the ABI incompatibility.
On Ubuntu precise with LibreOffice packages built on quantal (sticking to quantal versions of non-LibreOffice packages), the bug is still there. So whatever the root cause is, it is introducing the bug already at build time.
So I recompiled the packages on quantal with gcc 4.6 and retested on Ubuntu precise. Bug is still there, so it also is not the compiler update.
To make sure, I recompiled the exact compiler package from precise on quantal and then recompiled LibreOffice with that on quantal and tested the result on precise. The bug is _still_ there.
So, I am getting humbled by these results not to make any bold claims, but it seems to me that the bug is introduced by something changing _at_buildtime_ between precise and quantal (e.g. some dependencies) and it is _not_ gcc.
Reopening, finally found the root cause of this it seems and LibreOffice is not really innocent. At:
we are generating an equal_range on an unordered container, just to delete those elements in the next line. As erasing elements from that container invalidates iterators, that is clearly illegal and one has to wonder how that ever worked at all.
Replacing lines 229-230 with "mpLinks->erase(pWindow)" is not only simpler, cleaner and easier to read, it might actually be legal. There are some other abuses in that file that need a close look too.
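For comparison, the two forms can be sketched side by side. This is an illustration under assumptions, not the code in question: it uses `std::unordered_multimap` rather than the boost container, `int` keys instead of window pointers, and the hypothetical names `Links`, `eraseViaRange` and `eraseViaKey`. With the standard container, at least, both forms remove every entry stored under the key.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

using Links = std::unordered_multimap<int, std::string>;

// Two-step form: look up the range for the key, then erase that range.
std::size_t eraseViaRange(Links& links, int key) {
    const std::size_t before = links.size();
    auto range = links.equal_range(key);
    links.erase(range.first, range.second);
    return before - links.size();
}

// One-step form: erase(key) removes all entries with that key and
// returns how many were removed.
std::size_t eraseViaKey(Links& links, int key) {
    return links.erase(key);
}
```

Whether the boost container of that time behaved the same way is exactly what the thread leaves open; the sketch only shows the intended equivalence.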
committed to master, waiting for review at:
tested again with internal boost 1.44 on quantal: 167 iterations without a problem so far. So closing as "not our bug" again, but I would still welcome the patch below being integrated into 3.6, as it might help and can't hurt.
Bjoern Michaelsen committed a patch related to this issue.
It has been pushed to "libreoffice-3-6":
fdo#51324 lp#1017125 rhbz#806236 rhbz#823272: erase on invalid iterators
It will be available in LibreOffice 3.6.1.
(In reply to comment #26)
> Reopening, finally found the root cause of this it seems and LibreOffice is not
> really innocent.
Just for the record, it looks more like a problem of boost than of LibreOffice to me (though the commit that happens to fix it is fine in and of itself anyway, of course); quoting recent #libreoffice-dev:
Aug 07 11:56:18 <Sweetshark> caolan: could you please review https://gerrit.libreoffice.org/#/c/373/ for libreoffice-3-6 ?
Aug 07 11:57:28 <sberg> Sweetshark, but aCandidates.first/second are not used after erase, so the original code should be fine?
Aug 07 12:01:21 <Sweetshark> sberg: afaik the iterator are not guaranteed to be stable _inside_ an erase (at least a few stl pages warned about that).
Aug 07 12:02:15 <sberg> Sweetshark, and "Remove the links [plural!] from the given window" suggests that there can indeed be multiple entries for pWindow (after all, its an unordered_multimap)
Aug 07 12:02:47 <sberg> Sweetshark, "not guaranteed to be stable": that would render erase(iterator,iterator) completely useless
Aug 07 12:07:38 <Sweetshark> sberg: fact is: without that I crash after ~10 iterations, with the change it crashes after >100 iterations on a different issue here. So either we are doing something illegal (which -- as you say is unlikely), or boost-1.49/gcc4.7 is broken wrt that.
Aug 07 12:15:04 <sberg> Sweetshark, or, only removing a single entry per pWindow instead of all of them happens to mask some other error
Aug 07 12:17:05 <Sweetshark> sberg: huh? according to boost docs erase(key&) also kills _all_ pWindows
Aug 07 12:18:39 <sberg> Sweetshark, ah, right; odd, then
Aug 07 12:20:19 <Sweetshark> sberg: note I also replaced some of the std::lists with std::deques before to evade ABI breakage. however that did not fix the issue.