Bug 37579 - oosplash.bin crash
Summary: oosplash.bin crash
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version: 3.4.0 release (earliest affected)
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Michael Meeks
Reported: 2011-05-25 05:19 UTC by Jean-Baptiste Faure
Modified: 2011-10-29 07:38 UTC

Attachments
strace log file (974.46 KB, application/x-bzip)
2011-06-29 22:44 UTC, Rafael Daud

Description Jean-Baptiste Faure 2011-05-25 05:19:31 UTC
With both the LibO 3.4 betas and my own builds on Ubuntu 10.04 x86-64, oosplash.bin does not quit; the system monitor shows it waiting in futex_wait_queue_me.

After a period of inactivity, oosplash.bin crashes. With previous beta versions it was eating 100% of the CPU before it crashed, but that is no longer the case with RC1 and my own builds.

Under gdb I get the following information:

Program received signal SIGABRT, Aborted.
0x00007f008462ea75 in raise () from /lib/libc.so.6
(gdb) thread apply all backtrace

Thread 3 (Thread 0x7f008340b700 (LWP 5236)):
#0  0x00007f008514bbc9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x00007f0084be20c6 in ?? () from /home/jbf/LibO/libreoffice-3-4/install/program/../basis-link/ure-link/lib/libuno_sal.so.3
#2  0x00007f00851469ca in start_thread () from /lib/libpthread.so.0
#3  0x00007f00846e170d in clone () from /lib/libc.so.6
#4  0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7f0082409700 (LWP 5240)):
#0  0x00007f008514f48d in waitpid () from /lib/libpthread.so.0
#1  0x00007f0084bbce65 in ?? () from /home/jbf/LibO/libreoffice-3-4/install/program/../basis-link/ure-link/lib/libuno_sal.so.3
#2  0x00007f0084bbb2ec in ?? () from /home/jbf/LibO/libreoffice-3-4/install/program/../basis-link/ure-link/lib/libuno_sal.so.3
#3  0x00007f00851469ca in start_thread () from /lib/libpthread.so.0
#4  0x00007f00846e170d in clone () from /lib/libc.so.6
#5  0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f008554e720 (LWP 5229)):
#0  0x00007f008462ea75 in raise () from /lib/libc.so.6
#1  0x00007f00846325c0 in abort () from /lib/libc.so.6
#2  0x00007f00846725e0 in ?? () from /lib/libc.so.6
#3  0x00007f0084e542a2 in ?? () from /usr/lib/libX11.so.6
#4  0x00007f0084e54c07 in _XEventsQueued () from /usr/lib/libX11.so.6
#5  0x00007f0084e2c2da in XFlush () from /usr/lib/libX11.so.6
#6  0x000000000040368c in splash_draw_progress ()
#7  0x0000000000405dcc in ?? ()
#8  0x0000000000406992 in main ()
(gdb) 

I hope this helps to fix the problem.
Best regards. JBF
Comment 1 vitriol 2011-05-25 05:25:46 UTC
Maybe a duplicate of Bug 35693
Comment 2 Jean-Baptiste Faure 2011-05-25 12:56:24 UTC
If I build LibreOffice 3.4 with debugging symbols, I get the following trace:

Program received signal SIGABRT, Aborted.
0x00007fc63a9dca75 in raise () from /lib/libc.so.6
(gdb) thread apply all backtrace

Thread 3 (Thread 0x7fc6397b9700 (LWP 7504)):
#0  0x00007fc63b4f9bc9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x00007fc63af900c6 in rtl_cache_wsupdate_wait (arg=<value optimized out>) at alloc_cache.c:1417
#2  rtl_cache_wsupdate_all (arg=<value optimized out>) at alloc_cache.c:1561
#3  0x00007fc63b4f49ca in start_thread () from /lib/libpthread.so.0
#4  0x00007fc63aa8f70d in clone () from /lib/libc.so.6
#5  0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7fc6387b7700 (LWP 7508)):
#0  0x00007fc63b4fd48d in waitpid () from /lib/libpthread.so.0
#1  0x00007fc63af6ae65 in ChildStatusProc (pData=0x7fffeb3749e0) at process.c:612
#2  0x00007fc63af692ec in osl_thread_start_Impl (pData=<value optimized out>) at thread.c:276
#3  0x00007fc63b4f49ca in start_thread () from /lib/libpthread.so.0
#4  0x00007fc63aa8f70d in clone () from /lib/libc.so.6
#5  0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7fc63b8fc720 (LWP 7497)):
#0  0x00007fc63a9dca75 in raise () from /lib/libc.so.6
#1  0x00007fc63a9e05c0 in abort () from /lib/libc.so.6
#2  0x00007fc63aa205e0 in ?? () from /lib/libc.so.6
#3  0x00007fc63b2022a2 in ?? () from /usr/lib/libX11.so.6
#4  0x00007fc63b202c07 in _XEventsQueued () from /usr/lib/libX11.so.6
#5  0x00007fc63b1da2da in XFlush () from /usr/lib/libX11.so.6
#6  0x000000000040368c in process_events (progress=<value optimized out>) at splashx.c:579
#7  splash_draw_progress (progress=<value optimized out>) at splashx.c:616
#8  0x0000000000405dcc in sal_main_with_args (argc=<value optimized out>, argv=<value optimized out>) at start.c:981
#9  0x0000000000406992 in main (argc=2, argv=0x7fffeb378f08) at start.c:891
(gdb) 

Best regards. JBF
Comment 3 Michael Meeks 2011-05-26 02:59:23 UTC
Interesting :-) if I had to guess, I would say we are missing a CLOEXEC bit on our X socket - such that it is cloned into the soffice.bin sub-process, which then causes trouble later.

Unfortunately, it's hard to tell. Any chance you can install some debuginfo packages for X and also glibc, and re-run inside gdb? That would give us trace information for deeper inside X (i.e. this bit):

#2  0x00007fc63aa205e0 in ?? () from /lib/libc.so.6
#3  0x00007fc63b2022a2 in ?? () from /usr/lib/libX11.so.6
#4  0x00007fc63b202c07 in _XEventsQueued () from /usr/lib/libX11.so.6
#5  0x00007fc63b1da2da in XFlush () from /usr/lib/libX11.so.6

I guess it is just an XIOerror cf.
   http://cgit.freedesktop.org/xorg/lib/libX11/tree/src/xcb_io.c#n344

Another thing that would -really- help, particularly if it crashes nice and quickly like this would be to do:

strace -f -o /tmp/log soffice -writer # or whatever you run

And when it has failed: gzip /tmp/log - and attach it here.

Anyhow - interesting bug, thanks for the help !
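
To make the CLOEXEC hypothesis above concrete: a minimal sketch, assuming a POSIX system, of how a close-on-exec flag is set and checked on a file descriptor so that it is not inherited by a program started with exec(). The helper names set_cloexec, has_cloexec and cloexec_demo are hypothetical, and a pipe stands in for the X connection socket; nothing here is taken from the oosplash sources.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Mark fd close-on-exec so that it is closed in any program started
 * with exec(). Returns 0 on success, -1 on error (errno is set). */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1 ? -1 : 0;
}

/* Returns 1 if FD_CLOEXEC is set on fd, 0 if not, -1 on error. */
int has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) != 0;
}

/* Demonstration on a pipe (an assumption: it stands in for the X socket). */
void cloexec_demo(void)
{
    int fds[2];
    int rc = pipe(fds);
    assert(rc == 0);
    assert(has_cloexec(fds[0]) == 0);   /* pipe() leaves the flag clear */
    rc = set_cloexec(fds[0]);
    assert(rc == 0);
    assert(has_cloexec(fds[0]) == 1);   /* now closed across exec() */
    close(fds[0]);
    close(fds[1]);
}
```

In an Xlib program the descriptor would come from ConnectionNumber(display), and the flag would have to be set before the child process is started. Whether oosplash.bin actually lacks the flag is not established in this thread.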
Comment 4 Jean-Baptiste Faure 2011-06-04 10:17:12 UTC
Well, I do not know what happened: I have been waiting for several days for oosplash.bin to crash, but he does not want to do that. He does not close cleanly either.

Long live LibreOffice. JBF
Comment 5 Michael Meeks 2011-06-06 08:27:10 UTC
lol - sorry about the lack of crash when we want it: that sucks.
Anyhow - if you can find him please do update the bug ! (and thanks for your report & support).
Comment 6 Baptiste Jonglez 2011-06-08 15:26:51 UTC
Using the LibreOffice 3.4.0 release on Archlinux, I get a similar behaviour.

However, oosplash.bin does not crash, it only eats 100% of the CPU (in fact, I usually don't let it run longer than a dozen seconds before killing it, I hate overheating).

The weird thing is that it usually happens something like 15 minutes after launching LibreOffice, which is a bit late for a splash screen ;)

Steps to reproduce (the bug does not seem to show up every time):
 - open an odt document;
 - forget LibreOffice on a spare desktop and start doing something else;
 - after a certain amount of time (typically more than 10 minutes), the oosplash.bin process starts eating all the CPU.
Comment 7 Rafael Daud 2011-06-29 20:18:11 UTC
I confirm this bug on a similar system (amd64, archlinux, LO 3.4). However, I am unable to recompile LO with debugging symbols on this system. I tried running with strace as Michael suggests: it didn't crash, but strace produced a very big log file after half an hour (3.9 GB!) and spat a PANIC message to the console (I inadvertently closed the console before taking notes, sorry).
I'm trying to reproduce this PANIC or the oosplash.bin crash, but it just comes and goes. Once I manage either, I'll post the result here.
Comment 8 Rafael Daud 2011-06-29 22:44:12 UTC
Created attachment 48577
strace log file

Only the first 11MB of the actual log, which was 3.9GB originally.
Comment 9 Rafael Daud 2011-06-29 22:47:35 UTC
Yes, I managed to reproduce it.
oosplash.bin was using not 100% CPU, but a mere 15% (which still seems buggy and was overheating the computer). It continued like this until it crashed (normally I kill it, but I let it run to see what'd happen).
When it crashed (not sure if at the exact same time), a message was printed to the console: PANIC: handle_group_exit: 26869 leader 26859. This is the same message I mentioned earlier.
The log is 3.9 GB. It only stopped growing because my root partition filled up completely. I extracted the first 11MB of the file to upload here; it's attached above.
If I should have taken the last 11MB, just let me know. I figured the last parts would be repeated garbage, since the file was growing indefinitely, but I could be wrong.
Comment 10 Baptiste Jonglez 2011-06-30 06:23:13 UTC
A very weird behavior, indeed... I tried running LO with --nologo (which should disable the splash screen).

LibreOffice shows up as expected, and blocks the terminal (sounds natural). However, some time later, I noticed I got my prompt back, with a segfault... But LO is still running (I guess it had forked before). So the segfault might well come from oosplash.bin...

Running with time shows a very consistent timing for this segfault:

zorun@tuxmachine ~$ time libreoffice --nologo
/usr/share/themes/Shiki-Brave/gtk-2.0/gtkrc:126: Murrine configuration option "gradients" is no longer supported and will be ignored.
Erreur de segmentation

real	17m4.360s
user	0m0.037s
sys	0m0.063s
zorun@tuxmachine ~$ time libreoffice --nologo
/usr/share/themes/Shiki-Brave/gtk-2.0/gtkrc:126: Murrine configuration option "gradients" is no longer supported and will be ignored.
Erreur de segmentation

real	17m4.358s
user	0m0.050s
sys	0m0.053s
zorun@tuxmachine ~$ 



zorun@tuxmachine ~$ libreoffice --version
LibreOffice 3.4  340m1(Build:12)

I'll attach a debug trace.
Comment 11 Baptiste Jonglez 2011-06-30 06:54:20 UTC
And 17 minutes later...

Reading symbols from /usr/lib/libreoffice/program/oosplash.bin...(no debugging symbols found)...done.
(gdb) r
Starting program: /usr/lib/libreoffice/program/oosplash.bin --nologo
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff5ea8700 (LWP 12364)]
[New Thread 0x7ffff5467700 (LWP 12365)]
[New Thread 0x7ffff4c66700 (LWP 12368)]
[Thread 0x7ffff5467700 (LWP 12365) exited]
/usr/share/themes/Shiki-Brave/gtk-2.0/gtkrc:126: Murrine configuration option "gradients" is no longer supported and will be ignored.

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff78c0056 in XSetForeground () from /usr/lib/libX11.so.6
(gdb) 


Not very helpful...
Comment 12 Francisco Pina Martins 2011-07-06 15:55:25 UTC
I confirm the bug, in the same form as described in comment 6.
Using Archlinux, x86_64.
I first noticed the bug on version 3.4.0, but it is still present in 3.4.1.
What can I do to help debug this?
I seem to be able to reproduce it fairly easily (I just need to be patient; sometimes it takes about 20 minutes before oosplash.bin starts eating all the CPU cycles).
This bug has also been reported on the Arch bugtracker, here is the link for reference:
https://bugs.archlinux.org/task/24617

Thanks!
Comment 13 Michael Meeks 2011-07-13 03:01:39 UTC
Ah - great catch :-) it seems this is just a free-memory read/write issue that eventually clobbers us. I've pushed a fix to master & am getting it reviewed for -3-4-2.

Thanks ! :-)
Comment 14 jw.hendy 2011-08-01 08:46:59 UTC
(In reply to comment #13)
> Ah - great catch :-) it seems this is just a free-memory read/write issue that
> eventually clobbers us. I've pushed a fix to master & am getting it reviewed
> for -3-4-2.
> 
> Thanks ! :-)


Any updates on this? This is still not fixed for me.
LibreOffice 3.4.1 
OOO340m1 (Build:103)

Has the fix been accepted and/or is there an ETA on 3.4.2?
Comment 16 Björn Michaelsen 2011-09-13 10:33:01 UTC
Reopening, as setting display to NULL seems to open up a race condition:

 https://bugs.launchpad.net/ubuntu/+source/libreoffice/+bug/835153

Do we need a lock around splash_draw_progress() maybe?

Just looking at the number of dupes this collected in a few days on a beta release makes me assume the condition fires way too often.
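
The locking idea raised above can be sketched as follows. This is a minimal illustration, assuming POSIX threads and assuming the race really is between a drawing path and a teardown path that clears the display handle, which this thread does not confirm. The struct and the functions splash_init, splash_close and splash_draw are hypothetical stand-ins, not the code in splashx.c.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical splash state; not LibreOffice's actual data structure. */
struct splash {
    pthread_mutex_t lock;
    void *display;   /* would be a Display * in an Xlib program */
    int draws;       /* number of draw calls that reached the display */
};

void splash_init(struct splash *s, void *display)
{
    assert(s != NULL);
    pthread_mutex_init(&s->lock, NULL);
    s->display = display;
    s->draws = 0;
}

/* Teardown: clear the handle under the lock so that no drawer can
 * observe a pointer to a display that has already been closed. */
void splash_close(struct splash *s)
{
    pthread_mutex_lock(&s->lock);
    /* A real program would call XCloseDisplay() here, before clearing. */
    s->display = NULL;
    pthread_mutex_unlock(&s->lock);
}

/* Returns 1 if drawing happened, 0 if the display is already gone. */
int splash_draw(struct splash *s)
{
    int drawn = 0;
    pthread_mutex_lock(&s->lock);
    if (s->display != NULL) {
        /* Xlib drawing calls would go here, still under the lock. */
        s->draws++;
        drawn = 1;
    }
    pthread_mutex_unlock(&s->lock);
    return drawn;
}
```

The NULL check and the use of the handle happen under the same lock as the teardown, so a draw either completes before the display is cleared or sees NULL and does nothing.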
Comment 17 Jean-Baptiste Faure 2011-09-14 21:34:38 UTC
I confirm that this bug is not fixed for me (LibO 3.4.3 on Ubuntu 10.04 x86_64).
But as I never install the libreoffice-debian-menu package, because it is incompatible with LibreOffice 3.3.2 from the Ubuntu PPA, I am not sure whether the problem is in LibO 3.4.x or in my installation.

Kind regards. JBF
Comment 19 Björn Michaelsen 2011-09-15 10:31:03 UTC
cherrypicked as:
http://cgit.freedesktop.org/libreoffice/libs-core/commit/?h=libreoffice-3-4&id=253ff23c3a93b5ea45a2451a8bc97fca19856a75
on libreoffice-3-4 to be in 3.4.4 and following releases.
Comment 20 Michael Meeks 2011-09-19 04:48:50 UTC
resolving fixed then :-) thanks guys !