Bug 51324 - crash in xmloff.Impress.XMLContentImporter::com::sun::star::document::XImporter
Summary: crash in xmloff.Impress.XMLContentImporter::com::sun::star::document::XImporter
Status: RESOLVED NOTOURBUG
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Impress (show other bugs)
Version:
(earliest affected)
3.5.1 release
Hardware: Other Linux (All)
: medium normal
Assignee: Not Assigned
URL:
Whiteboard: unoapitest crash target:3.6.1
Keywords: regression
Depends on:
Blocks: mab3.6
  Show dependency treegraph
 
Reported: 2012-06-22 00:18 UTC by Björn Michaelsen
Modified: 2015-12-22 01:33 UTC (History)
7 users (show)

See Also:
Crash report or crash signature:


Attachments
stacktrace (12.71 KB, text/plain)
2012-06-22 00:19 UTC, Björn Michaelsen
Details
log (22.64 KB, text/x-log)
2012-06-22 00:22 UTC, Björn Michaelsen
Details
stacktrace with all threads (20.46 KB, text/plain)
2012-06-22 00:52 UTC, Björn Michaelsen
Details
debug (vcl,sd) stacktrace (21.44 KB, text/plain)
2012-06-22 14:13 UTC, Björn Michaelsen
Details
done.log (18.56 KB, text/x-log)
2012-06-24 05:50 UTC, Julien Nabet
Details
possible fix (1.18 KB, patch)
2012-06-25 02:59 UTC, David Tardon
Details
stacktrace after applying patch (12.91 KB, text/plain)
2012-06-25 07:51 UTC, Björn Michaelsen
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Björn Michaelsen 2012-06-22 00:18:07 UTC
subsequentcheck sometimes crashes in xmloff.Impress.XMLContentImporter::com::sun::star::document::XImporter

Testcode:
http://opengrok.libreoffice.org/xref/core/qadevOOo/tests/java/ifc/document/_XImporter.java
against service:
http://opengrok.libreoffice.org/xref/core/qadevOOo/tests/java/mod/_xmloff/Impress/XMLContentImporter.java

steps to reproduce:
cd xmloff
echo "-o xmloff.Impress.XMLContentImporter" > qa/unoapi/xmloff.sce
echo > qa/unoapi/knownissues.xcl << EOF
xmloff.Impress.XMLContentImporter::com::sun::star::lang::XInitialization
xmloff.Impress.XMLContentImporter::com::sun::star::document::XFilter
xmloff.Impress.XMLContentImporter::com::sun::star::container::XNamed
EOF
R=T; while test "$R" = "T"; do make subsequentcheck || R=F; done

expected result:
test passes without a crash

actual result:
crash

attaching backtrace and log
Comment 1 Björn Michaelsen 2012-06-22 00:19:42 UTC
Created attachment 63332 [details]
stacktrace
Comment 2 Björn Michaelsen 2012-06-22 00:22:11 UTC
Created attachment 63334 [details]
log
Comment 3 Björn Michaelsen 2012-06-22 00:24:52 UTC
@Thorsten: Backtrace looks like something messing up during document teardown. Do you have any suspicion/hint which one of the many destructors might be going wrong there?
Comment 4 Björn Michaelsen 2012-06-22 00:52:20 UTC
Created attachment 63336 [details]
stacktrace with all threads
Comment 5 Björn Michaelsen 2012-06-22 01:05:11 UTC
in case it helps: all the other impress testcases in xmloff trigger this too, so the bug description is just there to get a reproducable minimal testcase.
Comment 6 Björn Michaelsen 2012-06-22 08:31:18 UTC
http://opengrok.libreoffice.org/xref/core/sd/source/ui/toolpanel/TaskPaneFocusManager.cxx#243
looks suspicious erase() invalidates iterators, yet it is used in a loop
Comment 7 Julien Nabet 2012-06-22 11:34:54 UTC
I'm not sure but :
"erase" invalidates the iterator in the for loop.
Then it breaks so we exit the inner/for loop but we keep on the outer/do loop (since bLinkRemoved =true), then iLink iterator var is recreated and reinitialized with begin.
In brief, yep "erase" invalidates, but then iLink is valid again.
Now perhaps I miss something obvious.
  
    234     do
    235     {
    236         bLinkRemoved = false;
    237         LinkMap::iterator iLink;
    238         for (iLink=mpLinks->begin(); iLink!=mpLinks->end(); ++iLink)
    239         {
    240             if (iLink->second.mpTargetWindow == pWindow)
    241             {
    242                 RemoveUnusedEventListener(iLink->first);
    243                 mpLinks->erase(iLink);
    244                 bLinkRemoved = true;
    245                 break;
    246             }
    247         }
    248     }
    249     while (bLinkRemoved);
Comment 8 Björn Michaelsen 2012-06-22 11:53:35 UTC
Julien: no, you are right.
Comment 9 Björn Michaelsen 2012-06-22 14:13:47 UTC
Created attachment 63364 [details]
debug (vcl,sd) stacktrace
Comment 10 Björn Michaelsen 2012-06-24 04:46:00 UTC
nasty: this one seems to be there only on gcc-4.7. Might still be our bug though.
Comment 11 Julien Nabet 2012-06-24 05:50:53 UTC
Created attachment 63401 [details]
done.log

On pc Debian x86-64, with master sources udpated today, I reproduced the bug. In my case, it failed each time, not sometimes only.

I followed this link to try to debug :
http://wiki.documentfoundation.org/Development/How_to_debug#Debugging_the_subsequent_testslow
but I didn't understand how it worked :
- I couldn't switch off TUI with "C-x a" (or perhaps I badly interpreted, I tried "Ctrl+x" then "a")
- LO StartCenter launched but nothing then
So I stopped everything with Ctrl-c

For information, here are some config elements :
Linux kernel : 3.2.0-2-amd64
gcc (Debian 4.6.3-1) 4.6.3
ldd (Debian EGLIBC 2.13-33) 2.13
java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.1) (6b24-1.11.1-6)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

autogen.lastrun :
--with-system-odbc
--enable-ext-mysql-connector
--with-system-mysql
--enable-symbols
--enable-ext-barcode
--enable-ext-diagram
--enable-ext-google-docs
--enable-ext-hunart
--enable-ext-nlpsolver
--enable-ext-ct2n
--enable-ext-numbertext
--enable-ext-oooblogger
--enable-ext-pdfimport
--enable-postgresql-sdbc
--enable-ext-presenter-console
--enable-ext-presenter-minimizer
--enable-ext-report-builder
--enable-ext-scripting-beanshell
--enable-ext-scripting-javascript
--enable-ext-typo
--enable-ext-validator
--enable-ext-watch-window
--enable-ext-wiki-publisher
--enable-dbus
--enable-graphite
--enable-evolution2
--enable-werror
--enable-debug
--enable-dbgutil
--enable-crashdump
--enable-kde4
--enable-dependency-tracking
--enable-online-update
Comment 12 Björn Michaelsen 2012-06-24 08:02:13 UTC
I can *not* reporduce this with Ubuntu/Linaro 4.6.3-5ubuntu2 on Ubuntu Quantal, but with Ubuntu/Linaro 4.7.0-13ubuntu1 on Ubuntu Quantal
Comment 13 David Tardon 2012-06-25 02:59:06 UTC
Created attachment 63427 [details]
possible fix

I cannot reproduce it here, but the attached patch might help
Comment 14 Björn Michaelsen 2012-06-25 07:51:44 UTC
Created attachment 63452 [details]
stacktrace after applying patch

patch does not seem to help, see attached stacktrace.
Comment 15 Björn Michaelsen 2012-06-26 13:50:48 UTC
oh, great. Just had this on a pure gcc 4.6 build.

Since I have not ever seen this on LibreOffice 3.5, I am assuming this to be a regression.
Comment 16 Michael Stahl (allotropia) 2012-06-27 02:32:37 UTC
we have similar crashes in rhbz for LO 3.5 Fedora packages,
though i think we closed them as they were not reproducible.

the 3.5 packages are for Fedora 17, which uses GCC 4.7.

so it seems this doesn't just affect tests, but real users also.
Comment 17 Björn Michaelsen 2012-06-27 03:10:19 UTC
yes, this will affect endusers as it hits on closing an impress doc.
as for reproducibility: yes, this is a heisenbug to make things interesting. with looping the unoapi test, it can be reproduced after a few iterations. Maybe a race condition, or something other funky?

As said above I can reproduce this on both 4.7.1-2ubuntu1 (SVN 20120623/r188906) and 4.6.3-8ubuntu1 (SVN 20120624/r188916) on Ubuntu quantal, but so far have not reproduced this on 4.6.3-1ubuntu5 (only a few selected backports).

So either this, a) in gcc between 4.6.3 and SVN r188916 b) boost: 1.48.0.2 on precise (not reproducable) vs. 1.49.0.1 on quantal (reproducable) c) something else changing in the toolchain.

Hints for candidates of the "something else" kind are most welcome.
Comment 18 Björn Michaelsen 2012-06-27 03:27:28 UTC
Hmmm, seeems Im too stupid to read .spec files:
http://pkgs.fedoraproject.org/gitweb/?p=libreoffice.git;a=blob;f=libreoffice.spec;h=e53252babd085b6621b37fbc0d6a51825069427f;hb=refs/heads/f17
has a BuildRequires boost-devel but no --with-system-libs or --with-system-boost, so I dont know if you are building with internal boost 1.44 or f17s boost 1.48.

But both ways, this seems to suggest our boost update (1.48->1.49) is innocent.
Comment 19 Björn Michaelsen 2012-06-27 03:48:17 UTC
Possibly related:

https://lists.ubuntu.com/archives/ubuntu-devel/2012-June/035310.html

also note that I didnt recompile all build deps with 4.6 on quantal.
Comment 20 Björn Michaelsen 2012-06-27 06:25:49 UTC
\o/
Yep, that seems to be the root cause.

Brutally forcing CXX0X off with:
http://anonscm.debian.org/gitweb/?p=pkg-openoffice/libreoffice.git;a=blob;f=patches/trying-to-force-CXX0X-off-for-ABI-incompatibility.diff;h=1fa52e078613a9125ba43823e0a68c2d6085aab5;hb=4d4f4d6035b67dc42d21bb705d3da6570f9f2c05

helps a lot. xmloff unoapi subsequentcheck surviving >20 iterations and counting ...

So I guess we need to get rid of "autodetecting" CXX0X and make it an explicit option at least (to be activated once the distro have all system libs moved over to CXX0X -- likely in one big incompatible ABI step).
Comment 21 Björn Michaelsen 2012-06-27 06:48:24 UTC
meh, saw the crasher again. But that might not be our (LibreOffice) error, but one of the system packages (or our packages having their own build like libwp*) using the crappy and useless CXX0X ABI (useless as it is even incompatible with itself between 4.6 and 4.7 might continue to do so).
So we need to make sure all those do *not* use --std=..cxx0x or use our internal version.
Comment 22 Björn Michaelsen 2012-06-28 00:02:47 UTC
just rechecked on Ubuntu 12.04 LTS precise: 1282 full runs of xmloff_unoapi without one hickup. So yeah, something did creep in cxx0x compiled stuff in Ubuntu 12.10 quantal in one/any of our deps. The fun is to find out what.
Comment 23 Björn Michaelsen 2012-07-03 03:31:43 UTC
Seems we are dodging the bullet for now:

 http://gcc.gnu.org/ml/gcc/2012-07/msg00024.html
Comment 24 Michael Meeks 2012-07-06 03:04:16 UTC
So - IMHO this is not a libreoffice bug :-) and should be closed NOTOURBUG ...

Of course, if we can add a configure check to catch systems that are compiled with an older version of libstdc++ or somesuch that'd be great - we could prolly compile a small file that did some sizeof() checks in configure.

But hopefully the issue has gone away...
Comment 25 Björn Michaelsen 2012-08-02 08:32:45 UTC
So, I did some painful research on this:

On Ubuntu precise (build and run) the bug is not there.

On Ubuntu quantal with gcc 4.7 the bug is there even after fixing the ABI incompatibility.



On Ubuntu precise with LibreOffice packages build on quantal (sticking to quantal versions of non-LibreOffice packages), the bug is still there. So whatever is the root cause it is introducing the bug already at build time.

So I recompiled the packages on quantal with gcc 4.6 and retested on Ubuntu precise. Bug is still there, so it also is not the compiler update.

To make sure, I recompiled the exact compiler package from precise on quantal and then recompiled LibreOffice with that on quantal and tested the result on precise. The bug is _still_ there.

So, I am getting humbled by these results not to make any bold claims, but it seems to me that the bug is introduced by something changing _at_buildtime_ between precise and quantal (e.g. some dependencies) and it is _not_ gcc.
Comment 26 Björn Michaelsen 2012-08-07 09:34:09 UTC
Reopening, finally found the root cause of this it seems and LibreOffice is not really innocent. At:

http://opengrok.libreoffice.org/xref/core/sd/source/ui/toolpanel/TaskPaneFocusManager.cxx#229

we are generating an equal_range on an unsorted container, just to delete those in the next line. As erasing elements from that container invalidates iterators that is clearly illegal and one has to wonder how that ever worked at all.

Replacing line 229-230 with "mpLinks->erase(pWindow)" is not only simpler, cleaner and easier to read, it might actually be legal. There are some other abuses in that file that need a close look too.
Comment 27 Björn Michaelsen 2012-08-07 09:53:10 UTC
commited to master, waiting for review at:
https://gerrit.libreoffice.org/#/c/373/
Comment 28 Björn Michaelsen 2012-08-07 22:25:22 UTC
tested again with internal boost 1.44 on quantal: 167 interations without a problem so far. so closing as "not our bug" again, but still would welcome the patch below to be integrated to 3.6 as it might help and cant hurt.
Comment 29 Not Assigned 2012-08-08 13:53:22 UTC
Bjoern Michaelsen committed a patch related to this issue.
It has been pushed to "libreoffice-3-6":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=861e55bd889d9f5f5b37724b3615e9355e2d5c15&g=libreoffice-3-6

fdo#51324 lp#1017125 rhbz#806236 rhbz#823272: erase on invalid iterators


It will be available in LibreOffice 3.6.1.
Comment 30 Stephan Bergmann 2012-08-16 07:31:26 UTC
(In reply to comment #26)
> Reopening, finally found the root cause of this it seems and LibreOffice is not
> really innocent.

Just for the record, it more looks like a problem of boost than of LibreOffice to me (though the commit that happens to fix it is fine in and of itself anyway, of course); quoting recent #libreoffice-dev:

Aug 07 11:56:18 <Sweetshark> caolan: could you please review https://gerrit.libreoffice.org/#/c/373/ for libreoffice-3-6 ?
Aug 07 11:57:28 <sberg> Sweetshark, but aCandidates.first/second are not used after erase, so the original code should be fine?
Aug 07 12:01:21 <Sweetshark> sberg: afaik the iterator are not guaranteed to be stable _inside_ an erase (at least a few stl pages warned about that).
Aug 07 12:02:15 <sberg> Sweetshark, and "Remove the links [plural!] from the given window" suggests that there can indeed be multiple entries for pWindow (after all, its an unordered_multimap)
Aug 07 12:02:47 <sberg> Sweetshark, "not guaranteed to be stable": that would render erase(iterator,iterator) completely useless
Aug 07 12:07:38 <Sweetshark> sberg: fact is: without that I crash after ~10 iterations, with the change it crashes after >100 iterations on a different issue here. So either we are doing something illegal (which -- as you say is unlikely), or boost-1.49/gcc4.7 is broken wrt that.
Aug 07 12:15:04 <sberg> Sweetshark, or, only removing a single entry per pWindow instead of all of them happens to mask some other error
Aug 07 12:17:05 <Sweetshark> sberg: huh? according to boost docs erase(key&) also kills _all_ pWindows
Aug 07 12:18:39 <sberg> Sweetshark, ah, right;  odd, then
Aug 07 12:20:19 <Sweetshark> sberg: note I also replaced some of the std::lists with std::deques before to evade ABI breakage. however that did not fix the issue.
Comment 31 Robinson Tryon (qubit) 2015-12-22 01:33:01 UTC
Removing comma from Whiteboard (please use a space to delimit values in this field)
https://wiki.documentfoundation.org/QA/Bugzilla/Fields/Whiteboard#Getting_Started
[NinjaEdit]