Bug 87790 - Unexpected values in evaluating R2 using regr.lin (linear fit)
Summary: Unexpected values in evaluating R2 using regr.lin (linear fit)
Status: RESOLVED NOTABUG
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Calc (show other bugs)
Version:
(earliest affected)
unspecified
Hardware: x86-64 (AMD64) Linux (All)
: medium minor
Assignee: Not Assigned
URL: http://www.quepublishing.com/articles...
Whiteboard: target:25.2.0
Keywords:
Depends on:
Blocks:
 
Reported: 2014-12-28 12:20 UTC by mencaraglia
Modified: 2024-10-16 18:54 UTC (History)
8 users (show)

See Also:
Crash report or crash signature:


Attachments
spreadsheet with data (sheet base) and two fits (1 and 2 parameters) (650.34 KB, application/vnd.oasis.opendocument.spreadsheet)
2014-12-28 14:55 UTC, mencaraglia
Details
short sample of spradsheet (169.96 KB, application/vnd.oasis.opendocument.spreadsheet)
2014-12-30 13:27 UTC, mencaraglia
Details

Note You need to log in before you can comment on or make changes to this bug.
Description mencaraglia 2014-12-28 12:20:35 UTC
When fitting a series of (x,y) data with a function a * x + b, everything (slope, intercept, correlation) seems ok   (check done by comparison with excel and by evaluaton using the standatd formulas in any book). Changing axis so that the first (x,y) is in the origin and fitting with  a * x' the slope is ok, but the correlation is wrong, It is neither the Pearson correlation nor the explained variance vs the total variance.  The result is in disagreement with what found in excel or by evlautaion with standard formula

I do not understand if there is a bug in the function or if the meaning of r2 is nn clearly eplaine in the on line manual
Comment 1 Julien Nabet 2014-12-28 13:54:17 UTC
Which LO version do you use?
Could you attach an example file by using this link? https://bugs.freedesktop.org/attachment.cgi?bugid=87790&action=enter
(have in mind that any attachment is automatically made public, so remove any confidential/private part).
Comment 2 mencaraglia 2014-12-28 14:55:02 UTC
Created attachment 111428 [details]
spreadsheet with data (sheet base) and two fits (1 and 2 parameters)

Using LIBREOFFICE 4.3.5.2 build 3a87456aaa6a95c63eea1c1b3201acedf0751bd5 with Kubuntu 14.04

Original data in sheet 'base'; I need to fit a power law with y data in col C, x data in col A. Fit done using linearization log(y) log(x)

Case 1 two parameters log(y) = a * log(x) + b, sheet 2 param

The data are in range d6/e55 (logy -> col d, logx col. E) Results from function regr.lin are in range D68/e72)   Result agree with excel, with what obtained using functions slope(), intercept(), correlation() etc and with what obtained using standard textbook

Case 2 one parameter log(y') = A * log(x)
y' is obtaine by simple shift ( y' = y / y0).  Data ad resultf from regr lin same locations.  
The problem in in the R2 value D70 which is 0,9875 while I expect 0,7364 (as fraction of explained y variance with  respect to total variance cell C70) or 0,8764 standar pearson correlation)

Excel returns 0.7364
Comment 3 Julien Nabet 2014-12-29 22:41:25 UTC
On pc Debian x86-64 with LO Debian package 4.3.3.2 and master sources updated today (so future 4.5.0), D70 has 0,8764 (so what you expected according to your last comment)
I suppose I missed something, so could you attach a simple file with a minimum step by step process to reproduce the problem?
Comment 4 mencaraglia 2014-12-30 13:27:02 UTC
Created attachment 111516 [details]
short sample of spradsheet

I am resending the spradsheet as a short version: Data are in col A and B; same data (but for a constant shift in y) are in col F and G.  Input of these data is TEP 1

Below each series (A & B ; F & G) the linear best fit using  A = a B + b (col A & B)  or  F = a' G ( col F & G); choice on how to fit from the third parameter of regr.lin

regr.lin(A;B; 1 ; 1)
regr.lin(F;G; 0 ; 1)

Evaluation of regr-lin  is STEP 2

As a check for col A & B also evaluation using slope(A;B) intercept(A;B) correlation(A;B) RQ(A:B) have been made

These checks are STEP 3

I have shown corresponding results using background color

Now, passing to col F and G the result from regr.lin(...) are ok for the slope (not shown here the check);   correlation(F;G) and RQ(F:G) correspond with what evaluated in A & B (as expected, these values are invariant) 

What I do not understand is the term on cell F56 (cell background in red)which - as far as I understand should be a correlation - but is neither correlation() nor RQ() and - as I to before is not equal to the value suggesteb by excel.

To be sure that what I find is clear I have also copied the results with 'paste special - only numbers - in the area with gray background.
Comment 5 Julien Nabet 2015-01-04 07:43:56 UTC
Thank you for your feedback and sorry for this late response.
Just to know, what number do you expect instead of "0,987462335840797"?
Comment 6 mencaraglia 2015-01-04 10:48:44 UTC
What I expect is 0,7364; In these days ha have done some other checks; using the spreadsheet GNUMERIC for instance I find for r2 the value 0,98.... (same as libreoffice); same value using the Google sheet. Using the spreadsheet SCIDAVIS the R2 is 0,73...,
Using the program GRETL the results are 0.98... if I use the ordinary least square methd (fit function a *x ) and I find 0,73.... if I use the non linear fit (fit funztion a * x).

Looking for other occurrences of the R2 problem (google may help) I found that several years ago ( 2006 ) the same problem (wrong values of r2) had been found for excel and the origin was in the linest() function.   see for instance http://support.microsoft.com/kb/829249

I have not worked any more on the subject since (a) i had a work around and (b) it's the first time I fit with Y = a *x  and probably next time will be in 10 years (;-) ) 


Here below please find the two results obtained from GRETL and the one from SCIDAVIS

Modello 1: OLS, usando le osservazioni 1-50
Variabile dipendente: v2

             coefficiente   errore std.   rapporto t   p-value 
  -------------------------------------------------------------
  v1           −1,21416      0,0195447      −62,12     2,91e-48 ***

Media var. dipendente  −3,693268   SQM var. dipendente     0,833732
Somma quadr. residui    8,977973   E.S. della regressione  0,428047
R-quadro                0,987462   R-quadro corretto       0,987462
F(1, 49)                3859,178   P-value(F)              2,91e-48
Log-verosimiglianza    −28,01571   Criterio di Akaike      58,03141
Criterio di Schwarz     59,94343   Hannan-Quinn            58,75952
Note: SQM = scarto quadratico medio; E.S. = errore standard


----------------------------------


Sono state usate derivate analitiche
Tolleranza = 1,81899e-12

Convergenza raggiunta dopo 4 iterazioni

Modello 2: NLS, usando le osservazioni 1-50
v2 = alpha * v1

              stima     errore std.   rapporto t   p-value 
  ---------------------------------------------------------
  alpha      −1,21416    0,0195447      −62,12     2,91e-48 ***

Media var. dipendente  −3,693268   SQM var. dipendente     0,833732
Somma quadr. residui    8,977973   E.S. della regressione  0,428047
R-quadro                0,736410   R-quadro corretto       0,736410
Log-verosimiglianza    −28,01571   Criterio di Akaike      58,03141
Criterio di Schwarz     59,94343   Hannan-Quinn            58,75952
Note: SQM = scarto quadratico medio; E.S. = errore standard


----------------------------------------------------

[02/01/15 14:31	Plot: ''Graph1'']
Non-linear fit of dataset: Table1_2, using function: a*x
Y standard errors: Unknown
Scaled Levenberg-Marquardt algorithm with tolerance = 0,0001
From x = 0 to x = 3,912
a  = -1,21415873230445 +/- 0,0195446571501494
--------------------------------------------------------------------------------------
Chi^2/doF = 0,183223940915315
R^2 = 0,736410071654573
---------------------------------------------------------------------------------------
Iterations = 1
Status = success
---------------------------------------------------------------------------------------
Comment 7 Julien Nabet 2015-01-04 10:53:33 UTC
Thank you again for your detailed feedback.

I put it at NEW.

Winfried/Eike/Markus/Kohei: one for you?

(http://opengrok.libreoffice.org/xref/core/sc/source/core/tool/interpr5.cxx#2340)
Comment 8 Julien Nabet 2015-01-04 13:05:47 UTC
I put an url which may help (for those who are "fluent" at statistics)
Comment 9 Winfried Donkers 2015-01-06 15:41:40 UTC
(In reply to Julien Nabet from comment #7)
> Winfried/Eike/Markus/Kohei: one for you?

I will have a look at it (mathematically at first). Don't know yet if I will be able to find the cause and fix it...
Luckily, mencaraglia gives us 10 years to fix this ;-)
Comment 10 GerardF 2015-02-22 09:02:17 UTC
I don't think this is a bug.
See https://bugs.documentfoundation.org/show_bug.cgi?id=73374
Comment 11 tommy27 2016-04-16 07:23:45 UTC
** Please read this message in its entirety before responding **

To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.

There have been thousands of bug fixes and commits since anyone checked on this bug report. During that time, it's possible that the bug has been fixed, or the details of the problem have changed. We'd really appreciate your help in getting confirmation that the bug is still present.

If you have time, please do the following:

Test to see if the bug is still present on a currently supported version of LibreOffice (5.0.5 or 5.1.2 https://www.libreoffice.org/download/

If the bug is present, please leave a comment that includes the version of LibreOffice and your operating system, and any changes you see in the bug behavior
 
If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave a short comment that includes your version of LibreOffice and Operating System

Please DO NOT

- Update the version field
- Reply via email (please reply directly on the bug tracker)
- Set the bug's Status field to RESOLVED - FIXED (this status has a particular meaning that is not appropriate in this case)


If you want to do more to help you can test to see if your issue is a REGRESSION. To do so: 

1. Download and install oldest version of LibreOffice (usually 3.3 unless your bug pertains to a feature added after 3.3)

http://downloadarchive.documentfoundation.org/libreoffice/old/

2. Test your bug 

3. Leave a comment with your results. 

4a. If the bug was present with 3.3 - set version to "inherited from OOo"; 
4b. If the bug was not present in 3.3 - add "regression" to keyword


Feel free to come ask questions or to say hello in our QA chat: http://webchat.freenode.net/?channels=libreoffice-qa

Thank you for your help!

-- The LibreOffice QA Team This NEW Message was generated on: 2016-04-16
Comment 12 Laurent Balland 2016-05-02 20:12:09 UTC
As GerardF said, it is not a bug. R² must not be calculated in the same in case intercept is forced.
See bug 73374 and document attachment 91622 [details] for details.

Excel may have bug, and may also correct its error. With Excel 2010 and your data, I get R²=0,987462189 and R²=0,9874623358 with LibO 5.1.3.1.
Comment 13 Commit Notification 2024-10-16 18:54:47 UTC
Caolán McNamara committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/c3c9efc94f737e5f6c42db02b261d8fdb47f3d51

crashtesting: assert on fdo87790-1.ods

It will be available in 25.2.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.