Most interesting; vs. master of: 66a82c248ea099d43f289e0737cd7e8cb808a591 I get a cppunittester hang: user32.dll!_NtUserGetMessage@16() Unknown user32.dll!_GetMessageW@16() Unknown vcllo.dll!ImplSalYieldMutexAcquireWithWait() Line 256 C++ vcllo.dll!WinSalInstance::DoYield(bool bWait, bool bHandleAllCurrentEvents, unsigned long nReleased) Line 703 C++ vcllo.dll!ImplYield(bool i_bWait, bool i_bAllEvents, const unsigned long nReleased) Line 514 C++ vcllo.dll!Application::Yield() Line 540 C++ > scfiltlo.dll!oox::xls::`anonymous namespace'::importSheetFragments(oox::xls::WorkbookFragment & rWorkbookHandler, std::vector<std::pair<std::shared_ptr<oox::xls::WorksheetGlobals>,rtl::Reference<oox::core::FragmentHandler> >,std::allocator<std::pair<std::shared_ptr<oox::xls::WorksheetGlobals>,rtl::Reference<oox::core::FragmentHandler> > > > & rSheets) Line 346 C++ scfiltlo.dll!oox::xls::WorkbookFragment::finalizeImport() Line 488 C++ ooxlo.dll!oox::core::FragmentHandler2::endDocument() Line 51 C++ expwraplo.dll!sax_fastparser::FastSaxParserImpl::parseStream(const com::sun::star::xml::sax::InputSource & maStructSource) Line 791 C++ expwraplo.dll!sax_fastparser::FastSaxParser::parseStream(const com::sun::star::xml::sax::InputSource & aInputSource) Line 1297 C++ ooxlo.dll!oox::core::FastParser::parseStream(const com::sun::star::xml::sax::InputSource & rInputSource, bool bCloseStream) Line 118 C++ ooxlo.dll!oox::core::FastParser::parseStream(const com::sun::star::uno::Reference<com::sun::star::io::XInputStream> & rxInStream, const rtl::OUString & rStreamName, bool bCloseStream) Line 127 C++ ooxlo.dll!oox::core::XmlFilterBase::importFragment(const rtl::Reference<oox::core::FragmentHandler> & rxHandler, oox::core::FastParser & rParser) Line 353 C++ ooxlo.dll!oox::core::XmlFilterBase::importFragment(const rtl::Reference<oox::core::FragmentHandler> & rxHandler) Line 294 C++ scfiltlo.dll!oox::xls::ExcelFilter::importDocument() Line 102 C++ ooxlo.dll!oox::core::FilterBase::filter(const com::sun::star::uno::Sequence<com::sun::star::beans::PropertyValue> & rMediaDescSeq) Line 484 C++ scfiltlo.dll!oox::xls::ExcelFilter::filter(const com::sun::star::uno::Sequence<com::sun::star::beans::PropertyValue> & rDescriptor) Line 156 C++ sfxlo.dll!SfxObjectShell::ImportFrom(SfxMedium & rMedium, const com::sun::star::uno::Reference<com::sun::star::text::XTextRange> & xInsertPosition) Line 2273 C++ sfxlo.dll!SfxObjectShell::DoLoad(SfxMedium * pMed) Line 770 C++ scqahelper.dll!ScBootstrapFixture::load(bool bReadWrite, const rtl::OUString & rURL, const rtl::OUString & rFilter, const rtl::OUString & rUserData, const rtl::OUString & rTypeName, SfxFilterFlags nFilterFlags, SotClipboardFormatId nClipboardID, unsigned long nFilterVersion, const rtl::OUString * pPassword) Line 557 C++ scqahelper.dll!ScBootstrapFixture::load(const rtl::OUString & rURL, const rtl::OUString & rFilter, const rtl::OUString & rUserData, const rtl::OUString & rTypeName, SfxFilterFlags nFilterFlags, SotClipboardFormatId nClipboardID, unsigned long nFilterVersion, const rtl::OUString * pPassword) Line 572 C++ scqahelper.dll!ScBootstrapFixture::saveAndReload(ScDocShell * pShell, const rtl::OUString & rFilter, const rtl::OUString & rUserData, const rtl::OUString & rTypeName, SfxFilterFlags nFormatType) Line 646 C++ scqahelper.dll!ScBootstrapFixture::saveAndReload(ScDocShell * pShell, long nFormat) Line 659 C++ test_sc_subsequent_export_test.dll!ScExportTest::testLinkedGraphicRT() Line 2476 C++ test_sc_subsequent_export_test.dll!CppUnit::TestCaller<ScExportTest>::runTest() Line 173 C++ Very curious. It appears that Application::EndYield doesn't end an Application::Yield correctly in this case; mrSheetsLeft == 0.
Seems we are blocked in: GetMessageW( &aTmpMsg, pInst->mhComWnd, SAL_MSG_RELEASEWAITYIELD, SAL_MSG_RELEASEWAITYIELD ); Interesting.
Rather odd; some unpleasant race somewhere here no doubt; pInst->mnYieldWaitCount == 1 and we're on the GetMessageW - looks like a reasonably simple case; all other threads quiescent. Michael - any ideas ? =) the theading code looks somewhat tortured =) no doubt for good reason. I'll need to drop the debugger on this one though I suspect it'll not recur.
sc_subsquent_export_test was the bad-boy in this case.
this looks very much like bug 92257, which is old enough that the nReleased parameter to ImplYield didn't exist yet...
*** Bug 92257 has been marked as a duplicate of this bug. ***
Good point; let me mark as duplicates in the other direction to keep the newer trace on top; thanks ! =) The code badly needs simplifying here I think ...
okay, with ALG's help i could reproduce this, though not very often. it happens on CppunitTest_sc_subsequent_export_test, with something else running (make check) to add more load. there are several aspects to this problem, which is fundamentally that the main thread's message queue fills up. 1. main thread retrieves some message and dispatches it, gets stuck in ImplSalYieldMutexAcquireWithWait() on GetMessage() holding mpSalWaitMutex 2. some timers go wild and post thousands of these messages 1186 = SAL_MSG_TIMER_CALLBACK (WM_USER+162) 3. pool threads in SalYieldMutexRelease() tries to PostMessage(SAL_MSG_RELEASEWAITYIELD) but the queue is full already and the result of PostMessage isn't checked so no message is posted; thread waits for main thread to release mpSalWaitMutex 4. deadlock ALG has a plan to replace this SAL_MSG_RELEASEWAITYIELD with something better, it turns out that there's a MsgWaitForMultipleObjects() function that can wait for either a mutex (i.e. SalYieldMutex's m_mutex) to be available or (with QS_SENDMESSAGE) a SendMessage() to come in, or a timeout, so that should cover all cases and we should be able to get rid of mpSalWaitMutex. that would fix this deadlock. however the fact that the message queue can fill up in these parallel file import tests is concerning; other PostMessages could get lost and we basically never check that PostMessage() succeeds.
Created attachment 124168 [details] stacks of where the PostMessage fails this is with a patch that added some error checking on PostMessageW(mpInstData->mhComWnd, SAL_MSG_RELEASEWAITYIELD, 0, 0)
Created attachment 124169 [details] dump of pending messages on main thread this is with a tricky patch to replace the GetMessageW(SAL_MSG_RELEASEWAITYIELD) with a PeekMessage / Sleep(1) and then a debugger-enableable loop to print results of GetMessage(0, 0) once it's deadlocked it's truncated in the end because i put the fflush() in the wrong place :-/
With my suggested change the wakeup would be more safe. Nonetheless, still messages might be lost what is bad. MS only allows 10000 in queue. Reason is identified, thanks MST. SAL_MSG_TIMER_CALLBACK is firing like crazy, but only one message used (see salinst.cxx ln 786). Question is if this is needed at all, could be okay to send no more message until the last one is processed?
Should/may be fixed with https://gerrit.libreoffice.org/#/c/23921/, to whom it may concern, have a look there, please...
Armin Le Grand committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=9c9970952b0adec4a8c6de9a4cd54d0980cd47ec tdf#96887 enhance SolarMutex AcquireWithWait for Windows It will be available in 5.2.0. The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
Michael Stahl committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=8a0e6b25219e59b12034348b8b264117059755ec tdf#96887 vcl: stop using periodic timers on WNT It will be available in 5.2.0. The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
ok i think we can call this fixed now on master
Would this problem exist in the 5.1 branch too? How hard would it be to fix it there, too?
At least the two commits above are cherry-pickable with minimal adaptions for changed directory structure. Will check if the intermittent problem I am seeing goes away with that... although, of course it is impossible to prove by experimentation alone that the patches help when the problem is intermittent and quite hard to reproduce in the first place. But if you guys who actually understand what is going on confirm that this indeed would be relevant for 5.1, too, then surely we should cherry-pick the fixes there, too?
Tor: I'm convinced this is useful for 5.1 too, indeed - your backtrace from 5.1 looks exactly the same as this.
Armin Le Grand committed a patch related to this issue. It has been pushed to "libreoffice-5-1": http://cgit.freedesktop.org/libreoffice/core/commit/?id=69fa1e78febb4991e8e8b8b53ddf5b2d4f7e9f00&h=libreoffice-5-1 tdf#96887 enhance SolarMutex AcquireWithWait for Windows It will be available in 5.1.5. The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
Michael Stahl committed a patch related to this issue. It has been pushed to "libreoffice-5-1": http://cgit.freedesktop.org/libreoffice/core/commit/?id=25ba37abcc8ad381be6038ddb332dd60d3dcf4b1&h=libreoffice-5-1 tdf#96887 vcl: stop using periodic timers on WNT It will be available in 5.1.5. The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
Armin Le Grand committed a patch related to this issue. It has been pushed to "libreoffice-5-1-4": http://cgit.freedesktop.org/libreoffice/core/commit/?id=7e33db31290357c82a70b8dbf932ae85967337d6&h=libreoffice-5-1-4 tdf#96887 enhance SolarMutex AcquireWithWait for Windows It will be available in 5.1.4. The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
Michael Stahl committed a patch related to this issue. It has been pushed to "libreoffice-5-1-4": http://cgit.freedesktop.org/libreoffice/core/commit/?id=4d8924806102681f8d807fd82cdd3db786ddfac0&h=libreoffice-5-1-4 tdf#96887 vcl: stop using periodic timers on WNT It will be available in 5.1.4. The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.