Bug 165202 - Database - Mutex lock / race condition requiring force kill and restart
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Base
Version: 25.2.0.3 release (earliest affected)
Hardware: All macOS (All)
Importance: medium normal
Assignee: Michael Weghorn
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2025-02-11 16:15 UTC by Alex Thurgood
Modified: 2025-02-15 09:50 UTC

See Also:
Crash report or crash signature:


Attachments
Apple trace at moment of hang/kill (2.99 MB, text/plain)
2025-02-11 16:18 UTC, Alex Thurgood
Description Alex Thurgood 2025-02-11 16:15:58 UTC
Description:
Ever since moving to 25.2.0.3 (AARCH64), I can reproducibly get a mutex lock in Base requiring a forced kill exit and restart.

1) The database connection is to a MySQL backend using the direct (native) connector.

2) On screen, I display the main Base window, and then open a table for direct editing of data.

3) I set a filter on the table to select a subset of records.

4) I edit one or two fields of a given record, and then validate those entries.

5) I then try to remove the filter, i.e. set it back to displaying all of the records.

6) At this point, LO goes into spinning beachball mode, and I'm required to force kill the application.

7) The Apple trace that is produced at the moment the LO process is terminated is enclosed.

It appears to show a concurrent mutex lock, or race condition, caused by an XAccessible listener:

comphelper::OInterfaceContainerHelper3<com::sun::star::form::XLoadListener>::forEach<comphelper::OInterfaceContainerHelper3<com::sun::star::form::XLoadListener>::NotifySingleListener<com::sun::star::lang::EventObject>>(comphelper::OInterfaceContainerHelper3<com::sun::star::form::XLoadListener>::NotifySingleListener<com::sun::star::lang::EventObject> const&) + 224 (libmergedlo.dylib + 10137008) [0x10bc3edb0]
  22  FmXGridPeer::reloading(com::sun::star::lang::EventObject const&) + 104 (libmergedlo.dylib + 25632824) [0x10cb06038]
  22  DbGridControl::setDataSource(com::sun::star::uno::Reference<com::sun::star::sdbc::XRowSet> const&, DbGridControlOptions) + 976 (libmergedlo.dylib + 25819700) [0x10cb33a34]
  22  DbGridControl::RemoveRows() + 580 (libmergedlo.dylib + 25816064) [0x10cb32c00]
  22  svt::EditBrowseBox::RemoveRows() + 20 (libmergedlo.dylib + 22718388) [0x10c83e7b4]
  22  BrowseBox::Clear() + 436 (libmergedlo.dylib + 22619892) [0x10c8266f4]
  22  accessibility::AccessibleBrowseBoxAccess::commitEvent(short, com::sun::star::uno::Any const&, com::sun::star::uno::Any const&) + 72 (libmergedlo.dylib + 42695428) [0x10db4bb04]
  22  accessibility::AccessibleBrowseBoxBase::commitEvent(short, com::sun::star::uno::Any const&, com::sun::star::uno::Any const&) + 212 (libmergedlo.dylib + 42702980) [0x10db4d884]
  22  comphelper::AccessibleEventNotifier::addEvent(unsigned int, com::sun::star::accessibility::AccessibleEventObject const&) + 400 (libmergedlo.dylib + 4619828) [0x10b6fbe34]
  22  DocumentFocusListener::notifyEvent(com::sun::star::accessibility::AccessibleEventObject const&) + 248 (libvclplug_osxlo.dylib + 79324) [0x106a475dc]
  22  DocumentFocusListener::detachRecursive(com::sun::star::uno::Reference<com::sun::star::accessibility::XAccessible> const&) + 76 (libvclplug_osxlo.dylib + 80692) [0x106a47b34]
  22  accessibility::AccessibleBrowseBoxBase::getAccessibleStateSet() + 100 (libmergedlo.dylib + 42700056) [0x10db4cd18]
  22  accessibility::AccessibleBrowseBoxBase::implCreateStateSet() + 116 (libmergedlo.dylib + 42705636) [0x10db4e2e4]
  22  accessibility::AccessibleBrowseBoxBase::implIsShowing() + 60 (libmergedlo.dylib + 42705100) [0x10db4e0cc]
  22  accessibility::AccessibleBrowseBoxAccess::getAccessibleContext() + 56 (libmergedlo.dylib + 42692372) [0x10db4af14]
  22  std::__1::mutex::lock() + 16 (libc++.1.dylib + 141144) [0x198d93758]
  22  _pthread_mutex_firstfit_lock_slow + 220 (libsystem_pthread.dylib + 7612) [0x198e56dbc]
  22  __psynch_mutexwait + 8 (libsystem_kernel.dylib + 15292) [0x198e1dbbc]
 *22  psynch_mtxcontinue + 0 (com.apple.kec.pthread + 9756) [0xfffffe000baba84c]


This does not occur with LO 24.8. ==> regression





Steps to Reproduce:
See description above

Actual Results:
Concurrent mutex locks causing hang, and requiring forced kill

Expected Results:
Shouldn't hang, the filter removal should return gracefully to the default grid control table resultset display.


Reproducible: Always


User Profile Reset: Yes

Additional Info:
Version: 25.2.0.3 (AARCH64) / LibreOffice Community
Build ID: e1cf4a87eb02d755bce1a01209907ea5ddc8f069
CPU threads: 8; OS: macOS 15.2; UI render: Skia/Raster; VCL: osx
Locale: fr-FR (fr_FR.UTF-8); UI: fr-FR
Calc: threaded
Comment 1 Alex Thurgood 2025-02-11 16:18:04 UTC
Created attachment 199147
Apple trace at moment of hang/kill

The Apple trace appears to show two concurrent mutex locks, which lead to a deadlock (spinning beachball) of the LO process.
Comment 2 Robert Großkopf 2025-02-12 14:41:54 UTC
Tried this one but couldn't confirm under Linux (OpenSUSE 15.6 64bit rpm)
Comment 3 Alex Thurgood 2025-02-12 21:59:31 UTC
Hmm, trawling through bugzilla hints that bug 148435 has resurfaced.
Comment 4 Alex Thurgood 2025-02-12 22:00:19 UTC
(In reply to Robert Großkopf from comment #2)
> Tried this one but couldn't confirm under Linux (OpenSUSE 15.6 64bit rpm)

Thanks Robert, pretty certain this is macOS specific.
Comment 5 Patrick (volunteer) 2025-02-12 22:41:37 UTC
(In reply to Alex Thurgood from comment #3)
> Hmm, trawling through bugzilla hints that bug 148435 has resurfaced.

Your Activity Monitor sample looks similar to the last sample in tdf#148435. Both samples have the grammar checker and other non-main threads waiting for something to unblock them. But the big difference is that on the main thread, this bug is in the accessibility code whereas tdf#148435 was the "idle task scheduler" code.

So adding @Michael Weghorn to see if he has any insights into the locking code in accessibility::AccessibleBrowseBoxAccess::getAccessibleContext(). My memory is hazy, but I remember that tdf#148435 was caused by trying to lock a non-recursive mutex when it is already locked. In other words, tdf#148435 was due to using a non-recursive lock as a recursive lock. On macOS, trying to lock a non-recursive lock that is already locked results in a hang, so possibly we need to replace the lock with a recursive lock?

I can't tell from your sample whether the mutex that accessibility::AccessibleBrowseBoxAccess::getAccessibleContext() is trying to lock is a non-recursive lock or not as, surprisingly, I couldn't find the code with a "grep -r AccessibleBrowseBoxAccess". :/
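
For illustration only, here is a minimal sketch of the pattern described above. It is not LibreOffice code: the class `Access`, its methods and `runReentrantPath` are hypothetical names. One method takes a lock and calls a listener while holding it, and the listener calls back into a second method that takes the same lock. With `std::recursive_mutex` the nested lock on the same thread succeeds. With a plain `std::mutex` the same nested lock is undefined behaviour in C++ and would typically hang, as in the sample.

```cpp
#include <cassert>
#include <functional>
#include <mutex>

// Hypothetical stand-in for an accessible wrapper guarded by a single mutex.
class Access {
public:
    // Takes the lock, then calls the listener while still holding it.
    void commitEvent(const std::function<void(Access&)>& listener) {
        std::lock_guard<std::recursive_mutex> guard(m_mutex);
        ++m_events;
        listener(*this); // the listener may call back into this object
    }

    // Takes the same lock again when called from inside the listener.
    int getContext() {
        std::lock_guard<std::recursive_mutex> guard(m_mutex);
        return m_events;
    }

private:
    // A recursive mutex lets the owning thread lock it again.
    // Swapping this for std::mutex makes the nested lock undefined behaviour.
    std::recursive_mutex m_mutex;
    int m_events = 0;
};

// Runs the re-entrant path once and returns what the listener observed.
int runReentrantPath() {
    Access access;
    int seen = -1;
    access.commitEvent([&seen](Access& a) { seen = a.getContext(); });
    return seen;
}
```

A recursive mutex is the smallest change that keeps such a call chain working; it does not remove the re-entrancy itself.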
Comment 6 Michael Weghorn 2025-02-13 07:42:16 UTC
(In reply to Patrick (volunteer) from comment #5)
> So adding @Michael Weghorn to see if he has any insights into the locking
> code in accessibility::AccessibleBrowseBoxAccess::getAccessibleContext(). My
> memory is hazy, but I remember that tdf#148435 was caused by trying to lock
> a non-recursive mutex when it is already locked. In other words, tdf#148435
> was due to using a non-recursive lock as a recursive lock. On macOS, trying
> to lock a non-recursive lock that is already locked results in a hang so
> possibly we need to replace the lock with a recursive lock?

I agree, the backtrace suggests that's the problem. I actually ran into the same issue while refactoring the AccessibleBrowseBox* classes and submitted

    commit 71d1432714b7aba0a10ca5d072c87e46ec325271
    Author: Michael Weghorn <m.weghorn@posteo.de>
    Date:   Wed Feb 5 11:26:43 2025 +0100

        browsebox a11y: Use recursive mutex

on master. Backports for 25-2 and 25-2-1 now pending in Gerrit:

https://gerrit.libreoffice.org/c/core/+/181542
https://gerrit.libreoffice.org/c/core/+/181543 

> I can't tell from your sample if the mutex that
> accessibility::AccessibleBrowseBoxAccess::getAccessibleContext() is trying
> to lock is a non-recursive lock or not as, surprisingly, I couldn't find
> the code with a "grep -r AccessibleBrowseBoxAccess". :/

That class no longer exists on master. I dropped it while refactoring the BrowseBox a11y code, see `git log --grep="browsebox a11y"`, in particular

    commit cc8d3dac879ce8be66b785efdfe62be4ca6676bd
    Author: Michael Weghorn <m.weghorn@posteo.de>
    Date:   Thu Feb 6 10:17:47 2025 +0100

        browsebox a11y: Drop AccessibleBrowseBoxAccess
Comment 7 Michael Weghorn 2025-02-13 07:45:51 UTC
(In reply to Alex Thurgood from comment #0)
> This does not occur with LO 24.8. ==> regression

For the record, quoting from the commit message of 71d1432714b7aba0a10ca5d072c87e46ec325271:

>    Locking in AccessibleBrowseBoxAccess::commitEvent was added in
>    
>        commit 67158da00e965c90495bb4f339ea25bbec898c60
>        Date:   Fri Oct 4 14:22:22 2024 +0100
>    
>            cid#1608061 Data race condition
>    
>            and
>    
>            cid#1607995 Data race condition

which could explain why it regressed in 25.2.
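
As a rough model of why adding a lock in one place can turn a previously harmless nested call into a hang, the sketch below uses a hypothetical debugging wrapper, `CheckedMutex`, that throws instead of blocking when the owning thread locks it a second time. The names `CheckedMutex` and `nestedCallSucceeds` are made up for this example, and the assumption that the outer call took no lock before the change is inferred only from the commit message quoted above.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <stdexcept>
#include <thread>

// Hypothetical debugging wrapper: a non-recursive mutex that remembers its
// owner and throws, rather than hanging, when the owner locks it again.
class CheckedMutex {
public:
    void lock() {
        if (m_owner.load() == std::this_thread::get_id())
            throw std::logic_error("re-entrant lock on non-recursive mutex");
        m_mutex.lock();
        m_owner.store(std::this_thread::get_id());
    }

    void unlock() {
        m_owner.store(std::thread::id()); // default id means "no owner"
        m_mutex.unlock();
    }

private:
    std::mutex m_mutex;
    std::atomic<std::thread::id> m_owner{std::thread::id()};
};

// Models the nested call; lockInOuterCall toggles the lock in the outer call.
bool nestedCallSucceeds(bool lockInOuterCall) {
    CheckedMutex mutex;
    auto innerCall = [&mutex] {
        std::lock_guard<CheckedMutex> guard(mutex); // inner call always locks
    };
    try {
        if (lockInOuterCall) {
            std::lock_guard<CheckedMutex> guard(mutex);
            innerCall(); // throws: this thread already owns the mutex
        } else {
            innerCall(); // no outer lock, so the inner lock is uncontended
        }
        return true;
    } catch (const std::logic_error&) {
        return false;
    }
}
```

In this model `nestedCallSucceeds(false)` completes while `nestedCallSucceeds(true)` reports the re-entrant lock, which mirrors a nested call that only started to block once the outer call began taking the same mutex.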
Comment 8 Michael Weghorn 2025-02-15 09:50:35 UTC
> I actually ran into the
> same issue while refactoring the AccessibleBrowseBox* classes and submitted
> 
>     commit 71d1432714b7aba0a10ca5d072c87e46ec325271
>     Author: Michael Weghorn <m.weghorn@posteo.de>
>     Date:   Wed Feb 5 11:26:43 2025 +0100
> 
>         browsebox a11y: Use recursive mutex
> 
> on master. Backports for 25-2 and 25-2-1 now pending in Gerrit:
> 
> https://gerrit.libreoffice.org/c/core/+/181542
> https://gerrit.libreoffice.org/c/core/+/181543

All of these are merged now, so once you update to 25.2.1, you shouldn't see this issue any more.