Bug 96887 - windows / main-loop lock-up
Summary: windows / main-loop lock-up
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc (show other bugs)
Version:
(earliest affected)
5.0.2.1 rc
Hardware: All All
: medium normal
Assignee: Not Assigned
URL:
Whiteboard: target:5.2.0 target:5.1.4
Keywords:
: 92257 (view as bug list)
Depends on:
Blocks:
 
Reported: 2016-01-04 15:08 UTC by Michael Meeks
Modified: 2016-10-25 19:02 UTC (History)
6 users (show)

See Also:
Crash report or crash signature:


Attachments
stacks of where the PostMessage fails (9.81 KB, text/plain)
2016-04-07 19:24 UTC, Michael Stahl (allotropia)
Details
dump of pending messages on main thread (1.79 MB, text/plain)
2016-04-07 19:27 UTC, Michael Stahl (allotropia)
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Meeks 2016-01-04 15:08:55 UTC
Most interesting; vs. master of: 66a82c248ea099d43f289e0737cd7e8cb808a591

I get a cppunittester hang:

 	user32.dll!_NtUserGetMessage@16()	Unknown
 	user32.dll!_GetMessageW@16()	Unknown
 	vcllo.dll!ImplSalYieldMutexAcquireWithWait() Line 256	C++
 	vcllo.dll!WinSalInstance::DoYield(bool bWait, bool bHandleAllCurrentEvents, unsigned long nReleased) Line 703	C++
 	vcllo.dll!ImplYield(bool i_bWait, bool i_bAllEvents, const unsigned long nReleased) Line 514	C++
 	vcllo.dll!Application::Yield() Line 540	C++
>	scfiltlo.dll!oox::xls::`anonymous namespace'::importSheetFragments(oox::xls::WorkbookFragment & rWorkbookHandler, std::vector<std::pair<std::shared_ptr<oox::xls::WorksheetGlobals>,rtl::Reference<oox::core::FragmentHandler> >,std::allocator<std::pair<std::shared_ptr<oox::xls::WorksheetGlobals>,rtl::Reference<oox::core::FragmentHandler> > > > & rSheets) Line 346	C++
 	scfiltlo.dll!oox::xls::WorkbookFragment::finalizeImport() Line 488	C++
 	ooxlo.dll!oox::core::FragmentHandler2::endDocument() Line 51	C++
 	expwraplo.dll!sax_fastparser::FastSaxParserImpl::parseStream(const com::sun::star::xml::sax::InputSource & maStructSource) Line 791	C++
 	expwraplo.dll!sax_fastparser::FastSaxParser::parseStream(const com::sun::star::xml::sax::InputSource & aInputSource) Line 1297	C++
 	ooxlo.dll!oox::core::FastParser::parseStream(const com::sun::star::xml::sax::InputSource & rInputSource, bool bCloseStream) Line 118	C++
 	ooxlo.dll!oox::core::FastParser::parseStream(const com::sun::star::uno::Reference<com::sun::star::io::XInputStream> & rxInStream, const rtl::OUString & rStreamName, bool bCloseStream) Line 127	C++
 	ooxlo.dll!oox::core::XmlFilterBase::importFragment(const rtl::Reference<oox::core::FragmentHandler> & rxHandler, oox::core::FastParser & rParser) Line 353	C++
 	ooxlo.dll!oox::core::XmlFilterBase::importFragment(const rtl::Reference<oox::core::FragmentHandler> & rxHandler) Line 294	C++
 	scfiltlo.dll!oox::xls::ExcelFilter::importDocument() Line 102	C++
 	ooxlo.dll!oox::core::FilterBase::filter(const com::sun::star::uno::Sequence<com::sun::star::beans::PropertyValue> & rMediaDescSeq) Line 484	C++
 	scfiltlo.dll!oox::xls::ExcelFilter::filter(const com::sun::star::uno::Sequence<com::sun::star::beans::PropertyValue> & rDescriptor) Line 156	C++
 	sfxlo.dll!SfxObjectShell::ImportFrom(SfxMedium & rMedium, const com::sun::star::uno::Reference<com::sun::star::text::XTextRange> & xInsertPosition) Line 2273	C++
 	sfxlo.dll!SfxObjectShell::DoLoad(SfxMedium * pMed) Line 770	C++
 	scqahelper.dll!ScBootstrapFixture::load(bool bReadWrite, const rtl::OUString & rURL, const rtl::OUString & rFilter, const rtl::OUString & rUserData, const rtl::OUString & rTypeName, SfxFilterFlags nFilterFlags, SotClipboardFormatId nClipboardID, unsigned long nFilterVersion, const rtl::OUString * pPassword) Line 557	C++
 	scqahelper.dll!ScBootstrapFixture::load(const rtl::OUString & rURL, const rtl::OUString & rFilter, const rtl::OUString & rUserData, const rtl::OUString & rTypeName, SfxFilterFlags nFilterFlags, SotClipboardFormatId nClipboardID, unsigned long nFilterVersion, const rtl::OUString * pPassword) Line 572	C++
 	scqahelper.dll!ScBootstrapFixture::saveAndReload(ScDocShell * pShell, const rtl::OUString & rFilter, const rtl::OUString & rUserData, const rtl::OUString & rTypeName, SfxFilterFlags nFormatType) Line 646	C++
 	scqahelper.dll!ScBootstrapFixture::saveAndReload(ScDocShell * pShell, long nFormat) Line 659	C++
 	test_sc_subsequent_export_test.dll!ScExportTest::testLinkedGraphicRT() Line 2476	C++
 	test_sc_subsequent_export_test.dll!CppUnit::TestCaller<ScExportTest>::runTest() Line 173	C++

Very curious. It appears that Application::EndYield doesn't end an Application::Yield correctly in this case; mrSheetsLeft == 0.
Comment 1 Michael Meeks 2016-01-04 15:11:04 UTC
Seems we are blocked in:

GetMessageW( &aTmpMsg, pInst->mhComWnd, SAL_MSG_RELEASEWAITYIELD, SAL_MSG_RELEASEWAITYIELD );

Interesting.
Comment 2 Michael Meeks 2016-01-04 15:22:49 UTC
Rather odd; some unpleasant race somewhere here no doubt; pInst->mnYieldWaitCount == 1 and we're on the GetMessageW - looks like a reasonably simple case; all other threads quiescent.

Michael - any ideas ? =) the theading code looks somewhat tortured =) no doubt for good reason. I'll need to drop the debugger on this one though I suspect it'll not recur.
Comment 3 Michael Meeks 2016-01-04 15:23:27 UTC
sc_subsquent_export_test was the bad-boy in this case.
Comment 4 Michael Stahl (allotropia) 2016-01-04 15:41:59 UTC
this looks very much like bug 92257, which is old enough that the nReleased parameter to ImplYield didn't exist yet...
Comment 5 Michael Meeks 2016-01-05 09:14:55 UTC
*** Bug 92257 has been marked as a duplicate of this bug. ***
Comment 6 Michael Meeks 2016-01-05 09:16:02 UTC
Good point; let me mark as duplicates in the other direction to keep the newer trace on top; thanks ! =) The code badly needs simplifying here I think ...
Comment 7 Michael Stahl (allotropia) 2016-04-07 19:21:15 UTC
okay, with ALG's help i could reproduce this, though not very often.

it happens on CppunitTest_sc_subsequent_export_test, with
something else running (make check) to add more load.

there are several aspects to this problem, which is fundamentally
that the main thread's message queue fills up.

1. main thread retrieves some message and dispatches it,
   gets stuck in ImplSalYieldMutexAcquireWithWait() on GetMessage()
   holding mpSalWaitMutex

2. some timers go wild and post thousands of these messages
   1186 = SAL_MSG_TIMER_CALLBACK      (WM_USER+162)

3. pool threads in SalYieldMutexRelease() tries to
   PostMessage(SAL_MSG_RELEASEWAITYIELD)
   but the queue is full already and the result of PostMessage
   isn't checked so no message is posted;
   thread waits for main thread to release mpSalWaitMutex

4. deadlock


ALG has a plan to replace this SAL_MSG_RELEASEWAITYIELD with something better,
it turns out that there's a MsgWaitForMultipleObjects() function
that can wait for either a mutex (i.e. SalYieldMutex's m_mutex) to
be available or  (with QS_SENDMESSAGE) a SendMessage() to come in,
or a timeout, so that should cover all cases and we should be able
to get rid of mpSalWaitMutex.

that would fix this deadlock.

however the fact that the message queue can fill up in these parallel
file import tests is concerning; other PostMessages could get lost
and we basically never check that PostMessage() succeeds.
Comment 8 Michael Stahl (allotropia) 2016-04-07 19:24:15 UTC
Created attachment 124168 [details]
stacks of where the PostMessage fails

this is with a patch that added some error checking on
PostMessageW(mpInstData->mhComWnd, SAL_MSG_RELEASEWAITYIELD, 0, 0)
Comment 9 Michael Stahl (allotropia) 2016-04-07 19:27:44 UTC
Created attachment 124169 [details]
dump of pending messages on main thread

this is with a tricky patch to replace the GetMessageW(SAL_MSG_RELEASEWAITYIELD)
with a PeekMessage / Sleep(1) and then a debugger-enableable loop to print results of GetMessage(0, 0) once it's deadlocked

it's truncated in the end because i put the fflush() in the wrong place :-/
Comment 10 Armin Le Grand (allotropia) 2016-04-08 08:37:40 UTC
With my suggested change the wakeup would be more safe. Nonetheless, still messages might be lost what is bad. MS only allows 10000 in queue. Reason is identified, thanks MST.
SAL_MSG_TIMER_CALLBACK is firing like crazy, but only one message used (see salinst.cxx ln 786). Question is if this is needed at all, could be okay to send no more message until the last one is processed?
Comment 11 Armin Le Grand (allotropia) 2016-04-12 08:14:43 UTC
Should/may be fixed with https://gerrit.libreoffice.org/#/c/23921/, to whom it may concern, have a look there, please...
Comment 12 Commit Notification 2016-04-12 08:59:12 UTC
Armin Le Grand committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=9c9970952b0adec4a8c6de9a4cd54d0980cd47ec

tdf#96887 enhance SolarMutex AcquireWithWait for Windows

It will be available in 5.2.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 13 Commit Notification 2016-04-18 11:26:31 UTC
Michael Stahl committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=8a0e6b25219e59b12034348b8b264117059755ec

tdf#96887 vcl: stop using periodic timers on WNT

It will be available in 5.2.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 14 Michael Stahl (allotropia) 2016-05-03 15:42:53 UTC
ok i think we can call this fixed now on master
Comment 15 How can I remove my account? 2016-06-14 08:27:24 UTC
Would this problem exist in the 5.1 branch too? How hard would it be to fix it there, too?
Comment 16 How can I remove my account? 2016-06-14 08:45:43 UTC
At least the two commits above are cherry-pickable with minimal adaptions for changed directory structure. Will check if the intermittent problem I am seeing goes away with that... although, of course it is impossible to prove by experimentation alone that the patches help when the problem is intermittent and quite hard to reproduce in the first place.

But if you guys who actually understand what is going on confirm that this indeed would be relevant for 5.1, too, then surely we should cherry-pick the fixes there, too?
Comment 17 Jan Holesovsky 2016-06-14 08:54:20 UTC
Tor: I'm convinced this is useful for 5.1 too, indeed - your backtrace from 5.1 looks exactly the same as this.
Comment 18 Commit Notification 2016-06-14 12:54:58 UTC
Armin Le Grand committed a patch related to this issue.
It has been pushed to "libreoffice-5-1":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=69fa1e78febb4991e8e8b8b53ddf5b2d4f7e9f00&h=libreoffice-5-1

tdf#96887 enhance SolarMutex AcquireWithWait for Windows

It will be available in 5.1.5.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 19 Commit Notification 2016-06-14 13:08:55 UTC
Michael Stahl committed a patch related to this issue.
It has been pushed to "libreoffice-5-1":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=25ba37abcc8ad381be6038ddb332dd60d3dcf4b1&h=libreoffice-5-1

tdf#96887 vcl: stop using periodic timers on WNT

It will be available in 5.1.5.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 20 Commit Notification 2016-06-14 13:11:24 UTC
Armin Le Grand committed a patch related to this issue.
It has been pushed to "libreoffice-5-1-4":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=7e33db31290357c82a70b8dbf932ae85967337d6&h=libreoffice-5-1-4

tdf#96887 enhance SolarMutex AcquireWithWait for Windows

It will be available in 5.1.4.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 21 Commit Notification 2016-06-14 13:13:54 UTC
Michael Stahl committed a patch related to this issue.
It has been pushed to "libreoffice-5-1-4":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=4d8924806102681f8d807fd82cdd3db786ddfac0&h=libreoffice-5-1-4

tdf#96887 vcl: stop using periodic timers on WNT

It will be available in 5.1.4.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.