Bug 162526 - Oocalc fails on startup
Summary: Oocalc fails on startup
Status: UNCONFIRMED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc (show other bugs)
Version:
(earliest affected)
unspecified
Hardware: x86-64 (AMD64) Linux (All)
: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-08-19 23:50 UTC by Geoff Kuenning
Modified: 2025-01-04 08:48 UTC (History)
2 users (show)

See Also:
Crash report or crash signature:


Attachments
Trace of oocalc --strace that fails (4.59 MB, text/x-log)
2024-08-19 23:50 UTC, Geoff Kuenning
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Geoff Kuenning 2024-08-19 23:50:54 UTC
Created attachment 195909 [details]
Trace of oocalc --strace that fails

Libreoffice 24.2 running on OpenSuSE Leap 15.5.

Oocalc fails on launch, giving a "LibreOffice 24.2 Document Recovery" window that lists no files.  Clicking "OK" gives a standard LibreOffice "pick something to do" window; single-clicking "Calc Spreadsheet" has no effect while double-clicking "Calc Spreadsheet" simply terminates.

Other LibreOffice applications (at least Writer and Impress) work fine.

I am fairly convinced that it's a configuration problem because oocalc launches correctly on a scratch VM running the same OS.  However, I haven't been able to figure out where the difference lies.  Removing ~/.config/libreoffice/4 doesn't help, nor does safe mode.  I upgraded to the latest OpenSuSE releases of LibreOffice and Java, without effect.

I have attached an strace log from running simply "oocalc --strace".  FWIW the failure window popped up before 16:01:22.
Comment 1 Buovjaga 2024-11-05 17:06:51 UTC
If you run strace on the system where it works, do you also get those

26586 16:01:20.237555 openat(AT_FDCWD, "/usr/local/lib64/tls/haswell/x86_64/libXinerama.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)

What if you run a TDF-packaged LibreOffice, does it also fail? One way is to use an appimage: https://www.libreoffice.org/download/appimage/

Of course, you might also try building LibreOffice from source: https://wiki.documentfoundation.org/Development/BuildingOnLinux

I'm not sure, if our Bugzilla is the right place for this question, though.
Comment 2 Geoff Kuenning 2024-11-26 09:18:26 UTC
Well, I guess I should have started with gdb...although the session took about 6 hours before I found the underlying cause, but at least it was more focused than looking through strace output with over 100K lines or randomly updating shared libraries.

The immediate cause of the crash is a segfault.  That's because in ScDocShell::InitNew, in sc/source/ui/docshell/docsh2.cxx, there is this innocent-looking bit of code:

        ScOrcusFilters* pOrcus = ScFormatFilter::Get().GetOrcusFilters();

The problem is that ScFormatFilter::Get() returns a reference to a dynamically loaded object, and in my case that reference is a null pointer.  So C++ merrily follows that null pointer, trying to call the virtual function GetOrcusFilters, and crashes.

So: bug #1 is that ScFormatFilter::Get has no protection against a failed dynamic load.  I'd guess that the same thing happens in other Libreoffice code, although I'm certainly not the one to go hunting.

Bug #2 is a derivative: when there's a failed dynamic load, dlopen generates a nice error message (which I'll get to in a moment), but that message is unavailable to the user.  Instead there's just a confusing and very generic "I crashed" message caused by the segfault.  The code for Get() in sc/source/ui/docshell/impex.cxx contains an assert(false), but that's hardly better (and I guess in my distro assertions must have been disabled, boo).  I'll agree that most users wouldn't be able to understand a failed-dlopen message, but at least they'd have something they could put into a bug report or read off to an expert.

Now, as to the error message: decoding the error structure from dlopen, the offending file is /usr/lib64/libreoffice/program/libscfiltlo.so and the problem is an undefined symbol, _ZN5orcus13create_filterENS_8format_tEPNS_11spreadsheet5iface14import_factoryE, which demangles to orcus::create_filter(orcus::format_t, orcus::spreadsheet::iface::import_factory*).  That symbol doesn't exist in any version of liborcus that I have installed on the failing machines (16, 17, and 18--don't ask me why I have three variants because I don't know).  But sure enough, the scratch VM that works has it defined in /usr/lib64/liborcus-0.18.so.0.0.0.

And armed with that knowledge, I quickly figured out that a simple update was needed and oocalc works again!

But: it's really not OK for C++ functions that return references to return null pointers.  That's an absolute no-no.  So please get rid of this code in ScFormatFilter::Get():

        return static_cast<ScFormatFilterPlugin*>(nullptr);

and replace it with proper error handling.
Comment 3 Mike Kaganski 2024-11-26 09:43:55 UTC
(In reply to Geoff Kuenning from comment #2)
> Well, I guess I should have started with gdb...although the session took
> about 6 hours before I found the underlying cause, but at least it was more
> focused than looking through strace output with over 100K lines or randomly
> updating shared libraries.
> 
> The immediate cause of the crash is a segfault.  That's because in
> ScDocShell::InitNew, in sc/source/ui/docshell/docsh2.cxx, there is this
> innocent-looking bit of code:
> 
>         ScOrcusFilters* pOrcus = ScFormatFilter::Get().GetOrcusFilters();
> 
> The problem is that ScFormatFilter::Get() returns a reference to a
> dynamically loaded object, and in my case that reference is a null pointer. 
> So C++ merrily follows that null pointer, trying to call the virtual
> function GetOrcusFilters, and crashes.

Thank you! That is very useful.

> So: bug #1 is that ScFormatFilter::Get has no protection against a failed
> dynamic load.  I'd guess that the same thing happens in other Libreoffice
> code, although I'm certainly not the one to go hunting.
> 
> Bug #2 is a derivative: when there's a failed dynamic load, dlopen generates
> a nice error message (which I'll get to in a moment), but that message is
> unavailable to the user.  Instead there's just a confusing and very generic
> "I crashed" message caused by the segfault.  The code for Get() in
> sc/source/ui/docshell/impex.cxx contains an assert(false), but that's hardly
> better (and I guess in my distro assertions must have been disabled, boo). 
> I'll agree that most users wouldn't be able to understand a failed-dlopen
> message, but at least they'd have something they could put into a bug report
> or read off to an expert.
> 
> Now, as to the error message: decoding the error structure from dlopen, the
> offending file is /usr/lib64/libreoffice/program/libscfiltlo.so and the
> problem is an undefined symbol,
> _ZN5orcus13create_filterENS_8format_tEPNS_11spreadsheet5iface14import_factory
> E, which demangles to orcus::create_filter(orcus::format_t,
> orcus::spreadsheet::iface::import_factory*).  That symbol doesn't exist in
> any version of liborcus that I have installed on the failing machines (16,
> 17, and 18--don't ask me why I have three variants because I don't know). 
> But sure enough, the scratch VM that works has it defined in
> /usr/lib64/liborcus-0.18.so.0.0.0.
> 
> And armed with that knowledge, I quickly figured out that a simple update
> was needed and oocalc works again!
> 
> But: it's really not OK for C++ functions that return references to return
> null pointers.  That's an absolute no-no.  So please get rid of this code in
> ScFormatFilter::Get():
> 
>         return static_cast<ScFormatFilterPlugin*>(nullptr);
> 
> and replace it with proper error handling.

Please avoid teaching here. We assert the validity of the pointer; the assert(false) there in the code is for a reason. We consider that situation a hard failure and a *programming* error, to the extent that a crash is an appropriate outcome. We need to find out why the pointer is null, not to handle that.
Comment 4 Geoff Kuenning 2024-11-26 10:00:38 UTC
(a) That's exactly the point.  Doing assert(false) doesn't help reveal the bug or help find out why the pointer is null.  The right thing to do is to catch the bug, display a correct and informative message, and only then crash the program.

(b) Failing to load a dynamic library isn't a programming error; as in this case it can also be a configuration error.  Crashing without a message doesn't help find configuration errors.  Get() *knows* what went wrong; it just throws that information away and takes the easy way out.

(c) Teaching is my job. :-D  And in this case it clearly needs to happen.  Crashing is never a correct solution in a user-facing program.  It is only because I am unusually expert at low-level debugging that I was able to find the cause of this crash, and even then it took a LONG time investment over a period of months.  An ordinary user would have given up--is that how the developers want people to respond to all the hard work they've put into developing this suite?  (And in fact, for the better part of six months I've been using Microsoft Office to open spreadsheets; it's only my dedication to open source and my decades of debugging experience that motivated me to put in the effort needed figure this out.  Well, that and my stubbornness, which you can see here...)
Comment 5 Noel Grandin 2024-11-26 10:47:49 UTC
This is not a configuration error. This is a "somebody messed up the distro dependencies" problem, and as such is effectively a programming error.

We don't bother trying to report the source of those problems to the user, the user cannot in any case do anything about it.

You are, however, welcome to submit a patch to improve the situation. We have a pretty liberal patch acceptance policy.
Comment 6 Geoff Kuenning 2025-01-04 08:04:25 UTC
> We don't bother trying to report the source of those problems to the user, the user cannot in any case do anything about it.

This is an incorrect statement; I stand as a glaring counterexample.

It is true that some users wouldn't be able to correct the problem. It is also true that many others, like me, would.  But what is 100% clear is that users can only correct the problem if they know what it is; simply crashing at startup is utterly uninformative.  That's true regardless of whether the crash is due to an assert(false) or to a violation of C++ standards.

> You are, however, welcome to submit a patch to improve the situation. We have a pretty liberal patch acceptance policy.

I took a look at trying to add code to identify the failed library and pop up a dialog box saying what's wrong.  But the level of abstraction in the C++ code is sufficiently complex that it's not a simple task for an unfamiliar developer to even pop up a "hello world" dialog box.  Unfortunately I don't have the time to learn the deep details of the scalc code.

However, what I can do is to suggest a message that a knowledgeable developer could display:

scalc could not start because one of its libraries (libfoobar.so.1) was incompatible or unavailable.  Please make sure that both LibreOffice and libfoobar.so.1 are up to date and try again.

I would guess that somebody familiar with the code could make and test that change in well under an hour.
Comment 7 Buovjaga 2025-01-04 08:48:09 UTC
(In reply to Geoff Kuenning from comment #6)
> I took a look at trying to add code to identify the failed library and pop
> up a dialog box saying what's wrong.  But the level of abstraction in the
> C++ code is sufficiently complex that it's not a simple task for an
> unfamiliar developer to even pop up a "hello world" dialog box. 
> Unfortunately I don't have the time to learn the deep details of the scalc
> code.

Instead of a dialog box, one could use debug macros to print to the console: https://wiki.documentfoundation.org/Development/How_to_debug#Macros_Controlling_Debug_Code