142307 – Upgrade SSE2 sum to AVX512 sum with Neumaier (precise fp-sum)

Status:	RESOLVED FIXED

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Calc
Version: (earliest affected)	unspecified
Hardware:	All All

Importance:	medium enhancement
Assignee:	Not Assigned

URL:
Whiteboard:	target:7.3.0 inReleaseNotes
Keywords:

Depends on:
Blocks:	Function-Sum

Reported:	2021-05-16 00:08 UTC by dante19031999
Modified:	2022-01-23 18:54 UTC
CC List:	7 users

See Also:	137679 144386
Crash report or crash signature:

Description dante19031999 2021-05-16 00:08:36 UTC

Description:
The current implementation of the fast sum uses SSE2 with basic Kahan summation.
However, it could be upgraded to AVX512 using Neumaier summation.
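
For readers unfamiliar with the two algorithms, here is a minimal scalar sketch of the difference. It is an illustration only, not the vectorized code from the patch, and the function names `kahanSum` and `neumaierSum` are made up for this example. Neumaier's variant also handles the case where the next term is larger in magnitude than the running sum, which is where plain Kahan summation loses digits.

```cpp
#include <cmath>
#include <cstddef>

// Kahan summation: carries a running compensation for lost low-order bits.
// (Hypothetical helper name, for illustration only.)
double kahanSum(const double* p, std::size_t n)
{
    double sum = 0.0;
    double c = 0.0; // compensation for the error of the previous addition
    for (std::size_t i = 0; i < n; ++i)
    {
        double y = p[i] - c;
        double t = sum + y;
        c = (t - sum) - y; // what was lost when y was added to sum
        sum = t;
    }
    return sum;
}

// Neumaier summation: like Kahan, but picks the error formula depending on
// which operand is larger in magnitude, and applies the correction once at
// the end. (Hypothetical helper name, for illustration only.)
double neumaierSum(const double* p, std::size_t n)
{
    double sum = 0.0;
    double err = 0.0; // accumulated rounding error
    for (std::size_t i = 0; i < n; ++i)
    {
        double x = p[i];
        double t = sum + x;
        if (std::fabs(sum) >= std::fabs(x))
            err += (sum - t) + x; // low-order bits of x were lost
        else
            err += (x - t) + sum; // low-order bits of sum were lost
        sum = t;
    }
    return sum + err;
}
```

With the input `{1.0, 1e100, 1.0, -1e100}`, `kahanSum` returns 0.0 while `neumaierSum` returns the exact result 2.0. Both functions assume the compiler is not allowed to reorder floating-point operations (for example, no `-ffast-math`).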

Actual Results:
.

Expected Results:
.


Reproducible: Always


User Profile Reset: No



Additional Info:
Version: 7.1.0.3 / LibreOffice Community
Build ID: f6099ecf3d29644b5008cc8f48f42f4a40986e4c
CPU threads: 8; OS: Linux 5.11; UI render: default; VCL: gtk3
Locale: es-ES (en_US.UTF-8); UI: en-US
Calc: threaded

Comment 1 dante19031999 2021-05-16 00:20:10 UTC

In the Kahan sum patch, b. pointed me here:
http://blog.zachbjornson.com/2019/08/11/fast-float-summation.html

With that new information I believe I should be able to pull this off.
If the info is correct, it should be both faster and more precise.
It may also be possible to use it in the scmatrix summation code.
And if some conditions are met, it could bring large speed improvements to statistical functions in 7.3.

Comment 2 b. 2021-05-16 10:43:40 UTC

Setting to NEW, as IMHO this is a necessary enhancement (or a bug in the SSE2 module that blocks precision).

Changing the subject to reflect the correct name, 'Neumaier'.

@Dante, you may also want to have a look at:
https://www.tuhh.de/ti3/paper/rump/Ru08b.pdf
(too scientific for me)  :-( 
Would you like to 'take' this bug, i.e. assign it to yourself?

Comment 3 dante19031999 2021-05-16 16:18:22 UTC

Right now I'm working over here: https://gerrit.libreoffice.org/c/core/+/115675
I was able to implement Neumaier for SSE2.
Now your test sheet gives the correct output.
AVX512 crashes for now, but it is on its way.

We are using the methods in this order:
If AVX512 is available, use it.
If not, try SSE2.
If that is not available either, fall back to a plain unrolled loop.
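
The fallback order above can be sketched as follows. This is only an illustration under stated assumptions: `CpuFeatures`, `SumBackend`, `chooseBackend` and `unrolledSum` are hypothetical names, not the identifiers used in the patch, the feature flags are passed in by hand instead of being detected at runtime, and the unrolled loop shown here is an uncompensated sum kept short for readability.

```cpp
#include <cstddef>

// Hypothetical description of what the CPU supports; a real build would
// fill this in from runtime CPU detection.
struct CpuFeatures
{
    bool hasAVX512F;
    bool hasSSE2;
};

// Hypothetical tags for the three code paths mentioned in the comment.
enum class SumBackend
{
    AVX512,
    SSE2,
    Unrolled
};

// Pick the widest available path first, then fall back step by step.
SumBackend chooseBackend(const CpuFeatures& cpu)
{
    if (cpu.hasAVX512F)
        return SumBackend::AVX512;
    if (cpu.hasSSE2)
        return SumBackend::SSE2;
    return SumBackend::Unrolled;
}

// Last-resort path: a 4-way unrolled loop with independent accumulators,
// followed by a scalar loop for the remaining elements.
double unrolledSum(const double* p, std::size_t n)
{
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        a0 += p[i];
        a1 += p[i + 1];
        a2 += p[i + 2];
        a3 += p[i + 3];
    }
    double sum = (a0 + a1) + (a2 + a3);
    for (; i < n; ++i)
        sum += p[i];
    return sum;
}
```

Keeping the choice in one small function means the summation callers do not need to know which instruction set is in use.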

Comment 4 Commit Notification 2021-08-26 06:49:03 UTC

dante committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/5b9cf5881ef53fac5f1d8376f687dbadf9d3cf2b

tdf#142307 - Upgrade SSE2 sum to AVX512 sum with Neumaier 1

It will be available in 7.3.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.

Comment 5 Roman Kuznetsov 2021-08-26 07:37:42 UTC

Dante, can you run some tests with and without your patch on the same computer? I think it would be interesting to know how much the calculation has been accelerated by using AVX.

Comment 6 dante19031999 2021-08-26 13:13:59 UTC

(In reply to Roman Kuznetsov from comment #5)
> Dante, can you run some tests with and without your patch on the same
> computer? I think it would be interesting to know how much the
> calculation has been accelerated by using AVX.

The test can be found in this file:
/core/sc/qa/unit/functions_statistical.cxx
(It is not yet on opengrok, give it 48 hours)

The original code is the SSE2 version.
So the new and old versions are tested.
But I don't believe this is what you are asking for.

However, here is a (very basic) performance test:
https://gerrit.libreoffice.org/c/core/+/121095/2
printf does not seem to work there, though.
The test may also fail due to statistical fluctuations, particularly on server hardware.
So it cannot be merged.

Comment 7 Roman Kuznetsov 2021-08-26 14:16:34 UTC

(In reply to dante19031999 from comment #6)
> (In reply to Roman Kuznetsov from comment #5)
> > Dante, can you run some tests with and without your patch on the same
> > computer? I think it would be interesting to know how much the
> > calculation has been accelerated by using AVX.

> But I don't believe this is what you are asking for.
 
Yeah, I just want to know info like:

"I have a spreadsheet with 1 million cells with data and 100 formulas

It took 1 min to recalculate before
It takes 10 sec to recalculate after"

Comment 8 dante19031999 2021-08-26 14:42:06 UTC

> Yeah, I just want to know info like:
> 
> "I have a spreadsheet with 1 million cells with data and 100 formulas
> 
> It took 1 min to recalculate before
> It takes 10 sec to recalculate after"

That depends on the computer. On mine (3*10^6 terms):
Time for sum with NONE: 0.002667 s (default on ARM)
Time for sum with AVX: 0.001426 s (new)
Time for sum with SSE2: 0.001914 s (original)
I can't tell you about AVX512.

I can't give you much more info either. This kind of thing will pay off on auto-generated Calc sheets with huge amounts of data.

So expect a speedup of roughly 1.4x on the sum itself; the time spent in the interpreter comes on top of that.

Comment 9 Stéphane Guillou (stragu) 2022-01-01 10:56:19 UTC

Checking 7.3 release notes.

I assume we can mark this one as fixed?

Comment 10 Stéphane Guillou (stragu) 2022-01-22 22:50:07 UTC

Marking as fixed by commit mentioned in Comment 4.

Comment 11 dante19031999 2022-01-23 18:54:42 UTC

(In reply to stragu from comment #9)
> Checking 7.3 release notes.
> 
> I assume we can mark this one as fixed?

Yes, it is fixed.