Bug 152943 - GetErrCode() doesn't work as expected on RISC-V due to floating-point NaN payload not being propagated
Summary: GetErrCode() doesn't work as expected on RISC-V due to floating-point NaN pay...
Status: ASSIGNED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc (show other bugs)
Version:
(earliest affected)
Inherited From OOo
Hardware: Other All
: medium normal
Assignee: Not Assigned
URL:
Whiteboard: target:7.6.0 target:25.2.0
Keywords:
Depends on:
Blocks: Calc-Function
  Show dependency treegraph
 
Reported: 2023-01-09 09:49 UTC by Sakura286
Modified: 2024-10-23 13:23 UTC (History)
3 users (show)

See Also:
Crash report or crash signature:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Sakura286 2023-01-09 09:49:10 UTC
Description:
I ran ucalc_formula_test and testExternalRefFunctions() always failed with 503 error (538 expected) on RISC-V. I noticed that the error code was put in a NaN floating-point variable (e.g Error 538 is 0x7ff800000000021a) wrapped in a KahanSum class in interpr6.cxx:466~922. But these NaN values were added up in kahan.hxx:add().

On x86, when NaN floating-point added, the significand bits are reserved but on RISC-V they are cleared. Maybe it is just a coincidence that the test passes on x86?

Perhaps the error value should be directly reserved in a double variable rather than wrapped in a KahanSum class to avoid being operated?

Steps to Reproduce:
(On any arch)
In source code folder, run `make CppunitTest_sc_ucalc_formula CPPUNITTRACE="gdb --args"`
Then in gdb, run
1. break ucalc_formula.cxx:7180 (FormulaError nErr = pFC->GetErrCode();)
2. run
3. break kahan.hxx:59 (double t = m_fSum + m_fMem;)
4. continue
Error code(0x7ff800000000021a) was reserved in m_fMem. m_fSum and m_fError are calculated after add() func.

Actual Results:
In testExternalRefFunctions(), the result is 503 - invalid floating point operation

Expected Results:
Error 538 - MatrixSizeError


Reproducible: Always


User Profile Reset: No

Additional Info:
Nothing.
Comment 1 Mike Kaganski 2023-01-09 11:55:17 UTC
https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf

> The standard specifies that an operation with a quiet NaN input should produce a
> quiet NaN result with the same payload.

This means that for a "0 + nan", executed in KahanSum::get -> KahanSum::add, the result *must* be identical to the nan payload.

> Support for payload propagation is optional on RISC-V processors.

So likely, we just need to enable the propagation to force conformance to the standard?
Comment 2 Mike Kaganski 2023-01-09 13:07:04 UTC
https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/g79dHlV4B_k?pli=1

meaning that the "optional support" means that a given CPU may or may not implement such propagation.
Comment 3 Eike Rathke 2023-01-09 13:40:17 UTC
https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/g79dHlV4B_k/m/dtlPNaq8AwAJ
from 2016:
"We considered supporting payloads but frankly could not find any example of this being used portably in practice."
"we’d like to see some compelling portable use cases before mandating extra hardware on every RISC-V FPU implementation"

/me wonders if meanwhile they'd heard of LibreOffice.

So this just rules out RISC-V as a supported platform I guess?

And we maybe should add a test to sal/qa/rtl/math/test-rtl-math.cxx or even a static_assert() in sal/rtl/math.cxx to fail early in the build process.
Comment 4 Commit Notification 2023-03-15 19:56:16 UTC
Eike Rathke committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/2fd915b7d2d2a9e25600a07587ab01d23bffbc58

Related: tdf#152943 Test that a quiet NaN payload is propagated

It will be available in 7.6.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 5 Eike Rathke 2023-03-15 20:54:24 UTC
From
https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/g79dHlV4B_k/m/dtlPNaq8AwAJ
"Implementors are free to provide a NaN payload propagation scheme as a nonstandard extension enabled by a nonstandard operating mode."
I deduce that there may be RISC-V systems that support propagating NaN payloads. So with bad luck a build, even if it did make check on such system, would not run properly on systems that do not support the extension.

We might have to have a runtime check in Calc as well and bail out if NaN payloads are not supported.
Comment 6 Sakura286 2024-01-10 10:20:24 UTC
Currently NaN payload are not supported by most RISC-V machines. It might be necessary for RISC-V porters to know which component is affected by NaN payload.

Are there any documents that describe how NaN payload work in LibreOffice?
Comment 7 Eike Rathke 2024-01-10 13:25:33 UTC
No explicit documents, but Calc heavily uses quiet NaNs to transport and propagate any runtime error in the calculation interpreter engine and relies on NaN payload propagation in all math operators.
See CreateDoubleError() and GetDoubleErrorValue() at
https://opengrok.libreoffice.org/xref/core/include/formula/errorcodes.hxx?r=807f238f#96
and their use
https://opengrok.libreoffice.org/s?refs=CreateDoubleError&project=core
https://opengrok.libreoffice.org/s?refs=GetDoubleErrorValue&project=core
Comment 8 Sakura286 2024-01-29 03:20:51 UTC
(In reply to Eike Rathke from comment #5)

> We might have to have a runtime check in Calc as well and bail out if NaN
> payloads are not supported.

Yes, a compile time check cannot guarantee that this functionality works well at runtime. So what behaviour should Calc take when detects the lack of NaN payload support at runtime?

As I understand it, there might be three solutions:

1. Giving a warning and shutting down calc.
2. Giving a warning and disclaimer. Calc continues running.
3. Silently switch to a non-NaN payload mode.

For #1, maybe it is a bit distructive to directly rule out RISC-V as a supported platform. #2 is a transitional solution. For #3, is it possible to add this feature?

According to my observation, on riscv64, Calc does return error code when a formula error occurs, but sometimes it returns wrong error code '#NUM'.
Comment 9 Rene Engelhard 2024-01-29 06:09:10 UTC
I'd personally prefer 2.

1. has the problem that one still ships libreoffice-calc. Shipping something which doesn't work (fails on start, even with a message) is not good.

Or 3, but I don't think there's such a mode (yet). And if we had such a mode we wouldn't need that NaN payload thingy at all.
Comment 10 Sakura286 2024-02-01 02:41:33 UTC
I prefer #2, too.

I am wondering whether #2 is acceptable by community. If so, I am willing to do a bit of work on this.
Comment 11 Eike Rathke 2024-02-01 12:30:30 UTC
#3 is not possible as there is no non-NaN-payload mode implemented, and implementing it would require not only different error handling but also different transport of errors between intermediary results up to final results of a formula expression; while technically possible I wouldn't call it feasible or viable, as the effort would be huge, add even more runtime penalty, and possibly overlooking things adds uncertainty.

#2 might work to some degree, someone would have to actually try on such an inferior architecture what happens. My (hopeful) guess is that _some_ NaN is transported so there will be _some_ error (e.g. indicating numeric failure) but any detail about what specific error will be lost. But that also means that code reacting on specific error information will fail, so not everything will work.
Comment 12 Sakura286 2024-03-29 03:21:39 UTC
Here is the patch which add NaN payload check at app startup.

https://gerrit.libreoffice.org/c/core/+/165391

The content of the patch:

1. check nan payload on app startup
2. emit signal to call 'CreateNanPayloadWarningDialog()'
3. the factory creates dialog
4. record property 'ShowNanPayloadWarning' persistent
Comment 13 Sakura286 2024-10-22 03:32:57 UTC
I have add some commits related to this bug.

For builder:

- Add autogen.sh option to skip NaN payload tests https://gerrit.libreoffice.org/c/core/+/174801
- Extract NaN payload test code to make it reusable https://gerrit.libreoffice.org/c/core/+/174964
- Add check at the start of some test sequence https://gerrit.libreoffice.org/c/core/+/175293  

For user:

- Add check at app startup and inform the risk to the user if NaN payload is not supported. https://gerrit.libreoffice.org/c/core/+/165391

I would be appreciated If someone gives me advice or helps to review the patches.
Comment 14 Commit Notification 2024-10-23 13:23:03 UTC
Sakura286 committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/6a7dd6fcd95cb6086e447d1897bed750bceee8c2

tdf#152943: Add NaN payload check at the start of sc_ucalc test sequence

It will be available in 25.2.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.