Bug 117041 - Cumulative hypergeometric probability calculation error when discrete probabilities have k>n in C(n,k)
Summary: Cumulative hypergeometric probability calculation error when discrete probabi...
Status: VERIFIED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc (show other bugs)
Version:
(earliest affected)
4.1 all versions
Hardware: All All
: medium normal
Assignee: Winfried Donkers (retired)
URL:
Whiteboard: target:6.1.0
Keywords:
Depends on:
Blocks:
 
Reported: 2018-04-16 15:01 UTC by Charles Valente
Modified: 2018-05-07 19:23 UTC (History)
4 users (show)

See Also:
Crash report or crash signature:


Attachments
Error example with probable cause (81.15 KB, application/vnd.oasis.opendocument.spreadsheet)
2018-04-16 15:12 UTC, Charles Valente
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Charles Valente 2018-04-16 15:01:16 UTC
Description:
HYPGEOM.DIST(x,NSample,Successes,Npopulation,Cumulative) function fails to calculate cumulative probability (Cumulative=1) for situations where part of the individual mass functions are too small and gives out "Err:502". What the function should do is to equate those values to zero or do not consider them when adding Pr(X=0) + Pr(X=1) + ... + Pr(X=x) to calculate Pr(X<=x).

That is what, for instance, MS Excel 2013 seems to do when calculating cumulative hypergeometric probabilities. Its corresponding function (HYPGEOM.DIST) does not give out wrong answers in situations where Calc does.

Example:
HYPGEOM.DIST(34,36,89,100,1) is 1.78862E+20 in Calc version 5.4.5.1 (x64). That is absurd, as probabilities are limited to 1.

The correct answer is 0.9562476872.

There are countless input values that will give out erros too. For instance, HYPGEOM.DIST(53,55,269,300,1) in Calc is 0.991282957, but the more correct answer is 0.988500551. Such kind of erro is worse, because it is not patently absurd.

Excel, R and hypergeometric calculators found in the Internet (for instance, http://stattrek.com/online-calculator/hypergeometric.aspx) give the correct answers.
It seems just discarding the individual mass probabilities for values of Pr(X=x) where the results are so small they fall beyond (or bellow) machine precision is enough to prevent the error.

I did that using the individual hypergeometric mass probability function from Calc (either HYPGEOMDIST or HYPGEOM.DIST with "Cumulative=0") and the error is avoided (see the "Aux" sheet in the .ods file I am sending attached).

Steps to Reproduce:
In any Calc cell enter, for instance, "=HYPGEOM.DIST(34,36,89,100,1)" or "HYPGEOM.DIST(53,55,269,300,1)".

Actual Results:  
The incorrect results will be, respectively, "1.78862483604230E+20" and "0.991282957036719".

Expected Results:
The correct results should be, respectively, "0.95624768719555" and "0.988500551352757".


Reproducible: Always


User Profile Reset: No



Additional Info:
All I did seems to point to an easy solution to have the bug fixed: just equate to zero or do not take into consideration the values of Pr(X=x) in error (value too small, bellow machine precision).
I only tested in Windows 10 environment. I do not know if the error in HYPGEOM.DIST also occurs in Linux environments.


User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36
Comment 1 Charles Valente 2018-04-16 15:12:08 UTC
Created attachment 141396 [details]
Error example with probable cause

Line 2 of "Sheet 1" has an error example (wrong result in cell E2).
Cell E3 has the correct value, calculated using the "Aux" sheet, which seems to show the problem will be solved simply discarding from the cumulative probability calculation the error values (mass probabilities smaller than machine precision).
Comment 2 Jacques Guilleron 2018-04-18 11:58:33 UTC
Hi  Charles,
I reproduce with
LO 6.1.0.0.alpha0+ Build ID: f80029445e2b558f0d0e0a25c2c1bbcbe5254120
CPU threads: 2; OS: Windows 6.1; UI render: default; 
TinderBox: Win-x86@42, Branch:master, Time: 2018-04-13_23:07:11
Locale: fr-FR (fr_FR); Calc: CL
and
LO  5.4.5.1 Build ID: 79c9829dd5d8054ec39a82dc51cd9eff340dbee8
Threads CPU : 2; OS : Windows 6.1; UI Render : par défaut; 
Locale : fr-FR (fr_FR); Calc: CL
Comment 3 Xisco Faulí 2018-04-19 09:47:03 UTC
Also reproduced in

Version 4.1.0.0.alpha0+ (Build ID: efca6f15609322f62a35619619a6d5fe5c9bd5a)

@Winfried Donkers, I thought you might be interested in this one...
Comment 4 Charles Valente 2018-04-19 21:55:43 UTC
Thank you all for testing in other systems.
I am doing further tests with HYPERGEOM.DIST and I'd like to correct what I said in my initial post: it is not a question of error related to machine precision, it is a problem in the implementation of the combination (binomial coefficient) C(n,k) (number of k-combinations in a set with n elements).
It should equate C(n,k) to zero when n<k, but it gives out an error.
Comment 5 Winfried Donkers (retired) 2018-04-20 08:16:32 UTC
I will dig into it, but am pressed for time at the moment. Hope to squeeze in a time slot for this bug.
Comment 6 Commit Notification 2018-04-25 20:00:09 UTC
Winfried Donkers committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=e58b3f987681d0034f692db82345af06de217836

tdf#117041 implement note at end of ODFF1.2 par. 6.18.37

It will be available in 6.1.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 7 Winfried Donkers (retired) 2018-04-26 05:25:51 UTC
@Charles Valente : if you have the opportunity to verfy the fix with a daily build of version 6.1, please set the status to RESOLVED/VERIFIED.
Comment 8 Jacques Guilleron 2018-04-26 05:38:41 UTC
Hi all,

Works as expected with
LO 6.1.0.0.alpha1+
Build ID: 775d0f26beecffccf3ed27a6a011aff20d91f842
CPU threads: 2; OS: Windows 6.1; UI render: default; 
TinderBox: Win-x86@42, Branch:master, Time: 2018-04-25_23:51:48
Locale: fr-FR (fr_FR); Calc: CL
Thank you Winfried.
Waiting for Charles'verification.
Comment 9 Charles Valente 2018-04-26 19:37:41 UTC
I downloaded and tested the following version:

https://dev-builds.libreoffice.org/pre-releases/win/x86_64/LibreOfficeDev_6.1.0.0.alpha1_Win_x64.msi

The cumulative probability values for the two examples in my original post are now correct:
HYPGEOM.DIST(34,36,89,100,1) = 0.9562476871955 and
HYPGEOM.DIST(53,55,269,300,1) = 0.988500551352757

Nevertheless, it doesn't seem the problem was really fixed.
I believe there is still something wrong.
I guess it is an incorret computation of combinations C(n,k) when k>n (which should be zero).

That can be seem with another set of examples:
let's take n ("NSample" in the Calc Help) equal to 4, M ("Successes") equal to 1 and N ("Npopulation") equal to 10.

For x=0 and x=1, HYPGEOM.DIST(x,1,4,10,0) should be 0.6 and 0.4, respectively.
And that is the correct answer in LO 5 and the alpha version.

But for x>1 (x=2, 3 or 4), the answers should be zero: you cannot have more than 1 success in a sample of size 4 if the population has only 1 desired item.
So, the correct answers should be:

HYPGEOM.DIST(2,1,4,10,0) = 0
HYPGEOM.DIST(3,1,4,10,0) = 0
HYPGEOM.DIST(4,1,4,10,0) = 0

The cumulative probabilities should be, correspondingly:
HYPGEOM.DIST(2,1,4,10,0) = 1
HYPGEOM.DIST(3,1,4,10,0) = 1
HYPGEOM.DIST(4,1,4,10,0) = 1

Those are the results in, for instance, Excel 2013. 

In the current LO version, the answers are all "Err:502".

In the alpha version, they are:

HYPGEOM.DIST(2,1,4,10,0) = 1.2
HYPGEOM.DIST(3,1,4,10,0) = 2.4
HYPGEOM.DIST(4,1,4,10,0) = 2.4

And the cumulative probabilities reflect those incorrect values:

HYPGEOM.DIST(2,1,4,10,1) = 2.2
HYPGEOM.DIST(3,1,4,10,1) = 4.6
HYPGEOM.DIST(4,1,4,10,1) = 7

It must be noted that both HYPGEOM.DIST and HYPGEOMDIST must be fixed.
HYPGEOMDIST is just HYPGEOM.DIST with cumulative=0.

Please let me know if I tested an incorret alpha version.
If I did, could you please give me the link for the correct interim version?
Comment 10 Charles Valente 2018-04-26 19:47:25 UTC
I made a mistake in my previous post.
Instead of inputs (x,1,4,10), the correct is (x,4,1,10), that means, n=4, M=1 and N=10.
So, the example functions to be tested, with the corresponding correct answers, are:
HYPGEOM.DIST(2,4,1,10,0) = 0
HYPGEOM.DIST(3,4,1,10,0) = 0
HYPGEOM.DIST(4,4,1,10,0) = 0

HYPGEOM.DIST(2,4,1,10,1) = 1
HYPGEOM.DIST(3,4,1,10,1) = 1
HYPGEOM.DIST(4,4,1,10,1) = 1.
Comment 11 Jacques Guilleron 2018-04-26 22:18:20 UTC
I confirm your results, either in
LO 6.1.0.0.alpha1+ Build ID: 775d0f26beecffccf3ed27a6a011aff20d91f842
CPU threads: 2; OS: Windows 6.1; UI render: default; 
TinderBox: Win-x86@42, Branch:master, Time: 2018-04-25_23:51:48
Locale: fr-FR (fr_FR); Calc: CL
or
Excel Starter 2010.
Comment 12 Winfried Donkers (retired) 2018-04-29 05:57:12 UTC
A silly reversal of argument in calculations, which annoyingly didn't show up in tests. 
To avoid errors in the future I intend to expand the unit test for HYPGEOMDIST/HYPGEOM.DIST (which internally are identical) with every combination of a (<,<=,>=,>) b, where a and b are arguments x, n, M, N of the function.
Comment 13 Commit Notification 2018-05-03 11:51:16 UTC
Winfried Donkers committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=5cee94308b8dbceb11de4ac02e1d7c9808ccdb02

tdf#117041 use correct expression in if statement.

It will be available in 6.1.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 14 Charles Valente 2018-05-04 16:05:45 UTC
I tested the following daily build:
https://dev-builds.libreoffice.org/daily/master/Win-x86_64@42/current/libo-master64~2018-05-03_23.47.22_LibreOfficeDev_6.1.0.0.alpha1_Win_x64.msi
Version: 6.1.0.0.alpha1+ (x64)
Build ID: 409b7636f0e519f9ab14bac7884789b2323557c7

Everything worked fine, both to HYPGEOM.DIST and HYPGEOMDIST.

I have only a comment regarding the output when x > n.

In LO the outputs, for instance, of HYPGEOM.DIST(5,4,1,10,0) and HYPGEOM.DIST(5,4,1,10,1) are both "Err:502". Thats an acceptable output in my opinion.
In MS Excel 2013, the results are 0 and 1 respectively.
That will be a difference between the behavior of those functions in LO versus Excel. Maybe it is interesting to have that clearly stated in LO help.

Thank you very much, Mr. Donkers, for fixing the bug and most of all for dedicating your time and talent to LibreOffice and their large community of users.
Comment 15 Winfried Donkers (retired) 2018-05-07 06:01:04 UTC
(In reply to Charles Valente from comment #14)
> Everything worked fine, both to HYPGEOM.DIST and HYPGEOMDIST.

Setting the status to VERFIED.

> In LO the outputs, for instance, of HYPGEOM.DIST(5,4,1,10,0) and
> HYPGEOM.DIST(5,4,1,10,1) are both "Err:502". Thats an acceptable output in
> my opinion.
> In MS Excel 2013, the results are 0 and 1 respectively.
> That will be a difference between the behavior of those functions in LO
> versus Excel. Maybe it is interesting to have that clearly stated in LO help.

HYPGEOMDIST and HYPGEOM.DIST use identical code. The ODFF1.2 specification of HYPGEOMDIST and the Excel specification of HYPGEOM.DIST are not identical. And annoyingly for users, Excel doesn't always behave as specified.
https://support.office.com/en-us/article/hypgeom-dist-function-6dbd547f-1d12-4b1f-8ae5-b0d9e3d22fbf says under 'Remarks' : 'If sample_s [x] < 0 or sample_s [x] is greater than the lesser of number_sample [n] or population_s, HYPGEOM.DIST returns the #NUM! error value.' This conflicts with your findings.
Should this not be the case, I could change the code such that HYPGEOMDIST (ODFF spec.) and HYPGEOM.DIST (Excel spec.) behave different when x > n.
 
> Thank you very much, Mr. Donkers, for fixing the bug and most of all for
> dedicating your time and talent to LibreOffice and their large community of
> users.

Youe welcome, no problem. Your feedback was very detailed and made it easy to find the cause of the problem.
Comment 16 Charles Valente 2018-05-07 19:23:22 UTC
> https://support.office.com/en-us/article/hypgeom-dist-function-6dbd547f-1d12-
> 4b1f-8ae5-b0d9e3d22fbf says under 'Remarks' : 'If sample_s [x] < 0 or
> sample_s [x] is greater than the lesser of number_sample [n] or
> population_s, HYPGEOM.DIST returns the #NUM! error value.' This conflicts
> with your findings.

Yes, it seems Excel 2013 itself does not conform to Microsoft spec!


> Should this not be the case, I could change the code such that HYPGEOMDIST
> (ODFF spec.) and HYPGEOM.DIST (Excel spec.) behave different when x > n.

That's not a problem in my application. I think the bug was satisfactorily fixed.
Thank you very much again.