Bug 117041 - Cumulative hypergeometric probability calculation error when discrete probabilities have k>n in C(n,k)
Cumulative hypergeometric probability calculation error when discrete probabi...
 Status: VERIFIED FIXED None LibreOffice Unclassified Calc (show other bugs) 4.1 all versions All All medium normal Winfried Donkers (retired) target:6.1.0

 Reported: 2018-04-16 15:01 UTC by Charles Valente 2018-05-07 19:23 UTC (History) 4 users (show) charlesrv7 jalojo winfrieddonkers xiscofauli

Attachments
Error example with probable cause (81.15 KB, application/vnd.oasis.opendocument.spreadsheet)
2018-04-16 15:12 UTC, Charles Valente
Details

 Note You need to log in before you can comment on or make changes to this bug.
 Charles Valente 2018-04-16 15:01:16 UTC ```Description: HYPGEOM.DIST(x,NSample,Successes,Npopulation,Cumulative) function fails to calculate cumulative probability (Cumulative=1) for situations where part of the individual mass functions are too small and gives out "Err:502". What the function should do is to equate those values to zero or do not consider them when adding Pr(X=0) + Pr(X=1) + ... + Pr(X=x) to calculate Pr(X<=x). That is what, for instance, MS Excel 2013 seems to do when calculating cumulative hypergeometric probabilities. Its corresponding function (HYPGEOM.DIST) does not give out wrong answers in situations where Calc does. Example: HYPGEOM.DIST(34,36,89,100,1) is 1.78862E+20 in Calc version 5.4.5.1 (x64). That is absurd, as probabilities are limited to 1. The correct answer is 0.9562476872. There are countless input values that will give out erros too. For instance, HYPGEOM.DIST(53,55,269,300,1) in Calc is 0.991282957, but the more correct answer is 0.988500551. Such kind of erro is worse, because it is not patently absurd. Excel, R and hypergeometric calculators found in the Internet (for instance, http://stattrek.com/online-calculator/hypergeometric.aspx) give the correct answers. It seems just discarding the individual mass probabilities for values of Pr(X=x) where the results are so small they fall beyond (or bellow) machine precision is enough to prevent the error. I did that using the individual hypergeometric mass probability function from Calc (either HYPGEOMDIST or HYPGEOM.DIST with "Cumulative=0") and the error is avoided (see the "Aux" sheet in the .ods file I am sending attached). Steps to Reproduce: In any Calc cell enter, for instance, "=HYPGEOM.DIST(34,36,89,100,1)" or "HYPGEOM.DIST(53,55,269,300,1)". Actual Results: The incorrect results will be, respectively, "1.78862483604230E+20" and "0.991282957036719". Expected Results: The correct results should be, respectively, "0.95624768719555" and "0.988500551352757". Reproducible: Always User Profile Reset: No Additional Info: All I did seems to point to an easy solution to have the bug fixed: just equate to zero or do not take into consideration the values of Pr(X=x) in error (value too small, bellow machine precision). I only tested in Windows 10 environment. I do not know if the error in HYPGEOM.DIST also occurs in Linux environments. User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36``` Charles Valente 2018-04-16 15:12:08 UTC ```Created attachment 141396 [details] Error example with probable cause Line 2 of "Sheet 1" has an error example (wrong result in cell E2). Cell E3 has the correct value, calculated using the "Aux" sheet, which seems to show the problem will be solved simply discarding from the cumulative probability calculation the error values (mass probabilities smaller than machine precision).``` Jacques Guilleron 2018-04-18 11:58:33 UTC ```Hi Charles, I reproduce with LO 6.1.0.0.alpha0+ Build ID: f80029445e2b558f0d0e0a25c2c1bbcbe5254120 CPU threads: 2; OS: Windows 6.1; UI render: default; TinderBox: Win-x86@42, Branch:master, Time: 2018-04-13_23:07:11 Locale: fr-FR (fr_FR); Calc: CL and LO 5.4.5.1 Build ID: 79c9829dd5d8054ec39a82dc51cd9eff340dbee8 Threads CPU : 2; OS : Windows 6.1; UI Render : par défaut; Locale : fr-FR (fr_FR); Calc: CL``` Xisco Faulí 2018-04-19 09:47:03 UTC ```Also reproduced in Version 4.1.0.0.alpha0+ (Build ID: efca6f15609322f62a35619619a6d5fe5c9bd5a) @Winfried Donkers, I thought you might be interested in this one...``` Charles Valente 2018-04-19 21:55:43 UTC ```Thank you all for testing in other systems. I am doing further tests with HYPERGEOM.DIST and I'd like to correct what I said in my initial post: it is not a question of error related to machine precision, it is a problem in the implementation of the combination (binomial coefficient) C(n,k) (number of k-combinations in a set with n elements). It should equate C(n,k) to zero when nn (which should be zero). That can be seem with another set of examples: let's take n ("NSample" in the Calc Help) equal to 4, M ("Successes") equal to 1 and N ("Npopulation") equal to 10. For x=0 and x=1, HYPGEOM.DIST(x,1,4,10,0) should be 0.6 and 0.4, respectively. And that is the correct answer in LO 5 and the alpha version. But for x>1 (x=2, 3 or 4), the answers should be zero: you cannot have more than 1 success in a sample of size 4 if the population has only 1 desired item. So, the correct answers should be: HYPGEOM.DIST(2,1,4,10,0) = 0 HYPGEOM.DIST(3,1,4,10,0) = 0 HYPGEOM.DIST(4,1,4,10,0) = 0 The cumulative probabilities should be, correspondingly: HYPGEOM.DIST(2,1,4,10,0) = 1 HYPGEOM.DIST(3,1,4,10,0) = 1 HYPGEOM.DIST(4,1,4,10,0) = 1 Those are the results in, for instance, Excel 2013. In the current LO version, the answers are all "Err:502". In the alpha version, they are: HYPGEOM.DIST(2,1,4,10,0) = 1.2 HYPGEOM.DIST(3,1,4,10,0) = 2.4 HYPGEOM.DIST(4,1,4,10,0) = 2.4 And the cumulative probabilities reflect those incorrect values: HYPGEOM.DIST(2,1,4,10,1) = 2.2 HYPGEOM.DIST(3,1,4,10,1) = 4.6 HYPGEOM.DIST(4,1,4,10,1) = 7 It must be noted that both HYPGEOM.DIST and HYPGEOMDIST must be fixed. HYPGEOMDIST is just HYPGEOM.DIST with cumulative=0. Please let me know if I tested an incorret alpha version. If I did, could you please give me the link for the correct interim version?``` Charles Valente 2018-04-26 19:47:25 UTC ```I made a mistake in my previous post. Instead of inputs (x,1,4,10), the correct is (x,4,1,10), that means, n=4, M=1 and N=10. So, the example functions to be tested, with the corresponding correct answers, are: HYPGEOM.DIST(2,4,1,10,0) = 0 HYPGEOM.DIST(3,4,1,10,0) = 0 HYPGEOM.DIST(4,4,1,10,0) = 0 HYPGEOM.DIST(2,4,1,10,1) = 1 HYPGEOM.DIST(3,4,1,10,1) = 1 HYPGEOM.DIST(4,4,1,10,1) = 1.``` Jacques Guilleron 2018-04-26 22:18:20 UTC ```I confirm your results, either in LO 6.1.0.0.alpha1+ Build ID: 775d0f26beecffccf3ed27a6a011aff20d91f842 CPU threads: 2; OS: Windows 6.1; UI render: default; TinderBox: Win-x86@42, Branch:master, Time: 2018-04-25_23:51:48 Locale: fr-FR (fr_FR); Calc: CL or Excel Starter 2010.``` Winfried Donkers (retired) 2018-04-29 05:57:12 UTC ```A silly reversal of argument in calculations, which annoyingly didn't show up in tests. To avoid errors in the future I intend to expand the unit test for HYPGEOMDIST/HYPGEOM.DIST (which internally are identical) with every combination of a (<,<=,>=,>) b, where a and b are arguments x, n, M, N of the function.``` Commit Notification 2018-05-03 11:51:16 UTC ```Winfried Donkers committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=5cee94308b8dbceb11de4ac02e1d7c9808ccdb02 tdf#117041 use correct expression in if statement. It will be available in 6.1.0. The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.``` Charles Valente 2018-05-04 16:05:45 UTC ```I tested the following daily build: https://dev-builds.libreoffice.org/daily/master/Win-x86_64@42/current/libo-master64~2018-05-03_23.47.22_LibreOfficeDev_6.1.0.0.alpha1_Win_x64.msi Version: 6.1.0.0.alpha1+ (x64) Build ID: 409b7636f0e519f9ab14bac7884789b2323557c7 Everything worked fine, both to HYPGEOM.DIST and HYPGEOMDIST. I have only a comment regarding the output when x > n. In LO the outputs, for instance, of HYPGEOM.DIST(5,4,1,10,0) and HYPGEOM.DIST(5,4,1,10,1) are both "Err:502". Thats an acceptable output in my opinion. In MS Excel 2013, the results are 0 and 1 respectively. That will be a difference between the behavior of those functions in LO versus Excel. Maybe it is interesting to have that clearly stated in LO help. Thank you very much, Mr. Donkers, for fixing the bug and most of all for dedicating your time and talent to LibreOffice and their large community of users.``` Winfried Donkers (retired) 2018-05-07 06:01:04 UTC ```(In reply to Charles Valente from comment #14) > Everything worked fine, both to HYPGEOM.DIST and HYPGEOMDIST. Setting the status to VERFIED. > In LO the outputs, for instance, of HYPGEOM.DIST(5,4,1,10,0) and > HYPGEOM.DIST(5,4,1,10,1) are both "Err:502". Thats an acceptable output in > my opinion. > In MS Excel 2013, the results are 0 and 1 respectively. > That will be a difference between the behavior of those functions in LO > versus Excel. Maybe it is interesting to have that clearly stated in LO help. HYPGEOMDIST and HYPGEOM.DIST use identical code. The ODFF1.2 specification of HYPGEOMDIST and the Excel specification of HYPGEOM.DIST are not identical. And annoyingly for users, Excel doesn't always behave as specified. https://support.office.com/en-us/article/hypgeom-dist-function-6dbd547f-1d12-4b1f-8ae5-b0d9e3d22fbf says under 'Remarks' : 'If sample_s [x] < 0 or sample_s [x] is greater than the lesser of number_sample [n] or population_s, HYPGEOM.DIST returns the #NUM! error value.' This conflicts with your findings. Should this not be the case, I could change the code such that HYPGEOMDIST (ODFF spec.) and HYPGEOM.DIST (Excel spec.) behave different when x > n. > Thank you very much, Mr. Donkers, for fixing the bug and most of all for > dedicating your time and talent to LibreOffice and their large community of > users. Youe welcome, no problem. Your feedback was very detailed and made it easy to find the cause of the problem.``` Charles Valente 2018-05-07 19:23:22 UTC ```> https://support.office.com/en-us/article/hypgeom-dist-function-6dbd547f-1d12- > 4b1f-8ae5-b0d9e3d22fbf says under 'Remarks' : 'If sample_s [x] < 0 or > sample_s [x] is greater than the lesser of number_sample [n] or > population_s, HYPGEOM.DIST returns the #NUM! error value.' This conflicts > with your findings. Yes, it seems Excel 2013 itself does not conform to Microsoft spec! > Should this not be the case, I could change the code such that HYPGEOMDIST > (ODFF spec.) and HYPGEOM.DIST (Excel spec.) behave different when x > n. That's not a problem in my application. I think the bug was satisfactorily fixed. Thank you very much again.```