Description: HYPGEOM.DIST(x,NSample,Successes,Npopulation,Cumulative) function fails to calculate cumulative probability (Cumulative=1) for situations where part of the individual mass functions are too small and gives out "Err:502". What the function should do is to equate those values to zero or do not consider them when adding Pr(X=0) + Pr(X=1) + ... + Pr(X=x) to calculate Pr(X<=x). That is what, for instance, MS Excel 2013 seems to do when calculating cumulative hypergeometric probabilities. Its corresponding function (HYPGEOM.DIST) does not give out wrong answers in situations where Calc does. Example: HYPGEOM.DIST(34,36,89,100,1) is 1.78862E+20 in Calc version 5.4.5.1 (x64). That is absurd, as probabilities are limited to 1. The correct answer is 0.9562476872. There are countless input values that will give out erros too. For instance, HYPGEOM.DIST(53,55,269,300,1) in Calc is 0.991282957, but the more correct answer is 0.988500551. Such kind of erro is worse, because it is not patently absurd. Excel, R and hypergeometric calculators found in the Internet (for instance, http://stattrek.com/online-calculator/hypergeometric.aspx) give the correct answers. It seems just discarding the individual mass probabilities for values of Pr(X=x) where the results are so small they fall beyond (or bellow) machine precision is enough to prevent the error. I did that using the individual hypergeometric mass probability function from Calc (either HYPGEOMDIST or HYPGEOM.DIST with "Cumulative=0") and the error is avoided (see the "Aux" sheet in the .ods file I am sending attached). Steps to Reproduce: In any Calc cell enter, for instance, "=HYPGEOM.DIST(34,36,89,100,1)" or "HYPGEOM.DIST(53,55,269,300,1)". Actual Results: The incorrect results will be, respectively, "1.78862483604230E+20" and "0.991282957036719". Expected Results: The correct results should be, respectively, "0.95624768719555" and "0.988500551352757". Reproducible: Always User Profile Reset: No Additional Info: All I did seems to point to an easy solution to have the bug fixed: just equate to zero or do not take into consideration the values of Pr(X=x) in error (value too small, bellow machine precision). I only tested in Windows 10 environment. I do not know if the error in HYPGEOM.DIST also occurs in Linux environments. User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36
Created attachment 141396 [details] Error example with probable cause Line 2 of "Sheet 1" has an error example (wrong result in cell E2). Cell E3 has the correct value, calculated using the "Aux" sheet, which seems to show the problem will be solved simply discarding from the cumulative probability calculation the error values (mass probabilities smaller than machine precision).
Hi Charles, I reproduce with LO 6.1.0.0.alpha0+ Build ID: f80029445e2b558f0d0e0a25c2c1bbcbe5254120 CPU threads: 2; OS: Windows 6.1; UI render: default; TinderBox: Win-x86@42, Branch:master, Time: 2018-04-13_23:07:11 Locale: fr-FR (fr_FR); Calc: CL and LO 5.4.5.1 Build ID: 79c9829dd5d8054ec39a82dc51cd9eff340dbee8 Threads CPU : 2; OS : Windows 6.1; UI Render : par défaut; Locale : fr-FR (fr_FR); Calc: CL
Also reproduced in Version 4.1.0.0.alpha0+ (Build ID: efca6f15609322f62a35619619a6d5fe5c9bd5a) @Winfried Donkers, I thought you might be interested in this one...
Thank you all for testing in other systems. I am doing further tests with HYPERGEOM.DIST and I'd like to correct what I said in my initial post: it is not a question of error related to machine precision, it is a problem in the implementation of the combination (binomial coefficient) C(n,k) (number of k-combinations in a set with n elements). It should equate C(n,k) to zero when n<k, but it gives out an error.
I will dig into it, but am pressed for time at the moment. Hope to squeeze in a time slot for this bug.
Winfried Donkers committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=e58b3f987681d0034f692db82345af06de217836 tdf#117041 implement note at end of ODFF1.2 par. 6.18.37 It will be available in 6.1.0. The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
@Charles Valente : if you have the opportunity to verfy the fix with a daily build of version 6.1, please set the status to RESOLVED/VERIFIED.
Hi all, Works as expected with LO 6.1.0.0.alpha1+ Build ID: 775d0f26beecffccf3ed27a6a011aff20d91f842 CPU threads: 2; OS: Windows 6.1; UI render: default; TinderBox: Win-x86@42, Branch:master, Time: 2018-04-25_23:51:48 Locale: fr-FR (fr_FR); Calc: CL Thank you Winfried. Waiting for Charles'verification.
I downloaded and tested the following version: https://dev-builds.libreoffice.org/pre-releases/win/x86_64/LibreOfficeDev_6.1.0.0.alpha1_Win_x64.msi The cumulative probability values for the two examples in my original post are now correct: HYPGEOM.DIST(34,36,89,100,1) = 0.9562476871955 and HYPGEOM.DIST(53,55,269,300,1) = 0.988500551352757 Nevertheless, it doesn't seem the problem was really fixed. I believe there is still something wrong. I guess it is an incorret computation of combinations C(n,k) when k>n (which should be zero). That can be seem with another set of examples: let's take n ("NSample" in the Calc Help) equal to 4, M ("Successes") equal to 1 and N ("Npopulation") equal to 10. For x=0 and x=1, HYPGEOM.DIST(x,1,4,10,0) should be 0.6 and 0.4, respectively. And that is the correct answer in LO 5 and the alpha version. But for x>1 (x=2, 3 or 4), the answers should be zero: you cannot have more than 1 success in a sample of size 4 if the population has only 1 desired item. So, the correct answers should be: HYPGEOM.DIST(2,1,4,10,0) = 0 HYPGEOM.DIST(3,1,4,10,0) = 0 HYPGEOM.DIST(4,1,4,10,0) = 0 The cumulative probabilities should be, correspondingly: HYPGEOM.DIST(2,1,4,10,0) = 1 HYPGEOM.DIST(3,1,4,10,0) = 1 HYPGEOM.DIST(4,1,4,10,0) = 1 Those are the results in, for instance, Excel 2013. In the current LO version, the answers are all "Err:502". In the alpha version, they are: HYPGEOM.DIST(2,1,4,10,0) = 1.2 HYPGEOM.DIST(3,1,4,10,0) = 2.4 HYPGEOM.DIST(4,1,4,10,0) = 2.4 And the cumulative probabilities reflect those incorrect values: HYPGEOM.DIST(2,1,4,10,1) = 2.2 HYPGEOM.DIST(3,1,4,10,1) = 4.6 HYPGEOM.DIST(4,1,4,10,1) = 7 It must be noted that both HYPGEOM.DIST and HYPGEOMDIST must be fixed. HYPGEOMDIST is just HYPGEOM.DIST with cumulative=0. Please let me know if I tested an incorret alpha version. If I did, could you please give me the link for the correct interim version?
I made a mistake in my previous post. Instead of inputs (x,1,4,10), the correct is (x,4,1,10), that means, n=4, M=1 and N=10. So, the example functions to be tested, with the corresponding correct answers, are: HYPGEOM.DIST(2,4,1,10,0) = 0 HYPGEOM.DIST(3,4,1,10,0) = 0 HYPGEOM.DIST(4,4,1,10,0) = 0 HYPGEOM.DIST(2,4,1,10,1) = 1 HYPGEOM.DIST(3,4,1,10,1) = 1 HYPGEOM.DIST(4,4,1,10,1) = 1.
I confirm your results, either in LO 6.1.0.0.alpha1+ Build ID: 775d0f26beecffccf3ed27a6a011aff20d91f842 CPU threads: 2; OS: Windows 6.1; UI render: default; TinderBox: Win-x86@42, Branch:master, Time: 2018-04-25_23:51:48 Locale: fr-FR (fr_FR); Calc: CL or Excel Starter 2010.
A silly reversal of argument in calculations, which annoyingly didn't show up in tests. To avoid errors in the future I intend to expand the unit test for HYPGEOMDIST/HYPGEOM.DIST (which internally are identical) with every combination of a (<,<=,>=,>) b, where a and b are arguments x, n, M, N of the function.
Winfried Donkers committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=5cee94308b8dbceb11de4ac02e1d7c9808ccdb02 tdf#117041 use correct expression in if statement. It will be available in 6.1.0. The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
I tested the following daily build: https://dev-builds.libreoffice.org/daily/master/Win-x86_64@42/current/libo-master64~2018-05-03_23.47.22_LibreOfficeDev_6.1.0.0.alpha1_Win_x64.msi Version: 6.1.0.0.alpha1+ (x64) Build ID: 409b7636f0e519f9ab14bac7884789b2323557c7 Everything worked fine, both to HYPGEOM.DIST and HYPGEOMDIST. I have only a comment regarding the output when x > n. In LO the outputs, for instance, of HYPGEOM.DIST(5,4,1,10,0) and HYPGEOM.DIST(5,4,1,10,1) are both "Err:502". Thats an acceptable output in my opinion. In MS Excel 2013, the results are 0 and 1 respectively. That will be a difference between the behavior of those functions in LO versus Excel. Maybe it is interesting to have that clearly stated in LO help. Thank you very much, Mr. Donkers, for fixing the bug and most of all for dedicating your time and talent to LibreOffice and their large community of users.
(In reply to Charles Valente from comment #14) > Everything worked fine, both to HYPGEOM.DIST and HYPGEOMDIST. Setting the status to VERFIED. > In LO the outputs, for instance, of HYPGEOM.DIST(5,4,1,10,0) and > HYPGEOM.DIST(5,4,1,10,1) are both "Err:502". Thats an acceptable output in > my opinion. > In MS Excel 2013, the results are 0 and 1 respectively. > That will be a difference between the behavior of those functions in LO > versus Excel. Maybe it is interesting to have that clearly stated in LO help. HYPGEOMDIST and HYPGEOM.DIST use identical code. The ODFF1.2 specification of HYPGEOMDIST and the Excel specification of HYPGEOM.DIST are not identical. And annoyingly for users, Excel doesn't always behave as specified. https://support.office.com/en-us/article/hypgeom-dist-function-6dbd547f-1d12-4b1f-8ae5-b0d9e3d22fbf says under 'Remarks' : 'If sample_s [x] < 0 or sample_s [x] is greater than the lesser of number_sample [n] or population_s, HYPGEOM.DIST returns the #NUM! error value.' This conflicts with your findings. Should this not be the case, I could change the code such that HYPGEOMDIST (ODFF spec.) and HYPGEOM.DIST (Excel spec.) behave different when x > n. > Thank you very much, Mr. Donkers, for fixing the bug and most of all for > dedicating your time and talent to LibreOffice and their large community of > users. Youe welcome, no problem. Your feedback was very detailed and made it easy to find the cause of the problem.
> https://support.office.com/en-us/article/hypgeom-dist-function-6dbd547f-1d12- > 4b1f-8ae5-b0d9e3d22fbf says under 'Remarks' : 'If sample_s [x] < 0 or > sample_s [x] is greater than the lesser of number_sample [n] or > population_s, HYPGEOM.DIST returns the #NUM! error value.' This conflicts > with your findings. Yes, it seems Excel 2013 itself does not conform to Microsoft spec! > Should this not be the case, I could change the code such that HYPGEOMDIST > (ODFF spec.) and HYPGEOM.DIST (Excel spec.) behave different when x > n. That's not a problem in my application. I think the bug was satisfactorily fixed. Thank you very much again.