A linear regression line is of the form w1x+w2=y and it is the line that minimizes the sum of the squares of the distance from each data point to the line. So, given n pairs of data (xi, yi), the parameters that we are looking for are w1 and w2 which minimize the error
and we can compute the parameter vector w = (w1 , w2)T as the least-squares solution of the following over-determined system
Let's use numpy to compute the regression line:
from numpy import arange,array,ones,linalg from pylab import plot,show xi = arange(0,9) A = array([ xi, ones(9)]) # linearly generated sequence y = [19, 20, 20.5, 21.5, 22, 23, 23, 25.5, 24] w = linalg.lstsq(A.T,y)[0] # obtaining the parameters # plotting the line line = w[0]*xi+w[1] # regression line plot(xi,line,'r-',xi,y,'o') show()We can see the result in the plot below.
You can find more about data fitting using numpy in the following posts: Update, the same result could be achieve using the function scipy.stats.linregress (thanks ianalis!):
from numpy import arange,array,ones#,random,linalg from pylab import plot,show from scipy import stats xi = arange(0,9) A = array([ xi, ones(9)]) # linearly generated sequence y = [19, 20, 20.5, 21.5, 22, 23, 23, 25.5, 24] slope, intercept, r_value, p_value, std_err = stats.linregress(xi,y) print 'r value', r_value print 'p_value', p_value print 'standard deviation', std_err line = slope*xi+intercept plot(xi,line,'r-',xi,y,'o') show()


Possible Bugs: x_lst is unused and w[] is undefined?
ReplyDeleteThanks Steve, I fixed it. I changed the code at the end to make it consisted with the notation.
ReplyDeleteAnother method is to use scipy.stats.linregress()
ReplyDeleteIn the particular case when y=w1*x+w2 you can use both linregress and lstsq as shown above.
DeleteWhen you want to regress to multiple x-s, e.g.
y=w1*x1+w2*x2+w3*x3 you need to use lstsq.
What is the r_value and the p_value in the second program? What do they represent?
ReplyDeleter_value is the correlation coefficient and p_value is the p-value for a hypothesis test whose null hypothesis is that the slope is zero.
ReplyDeleteFor more information about correlation you can fin my last post:
http://glowingpython.blogspot.com/2012/10/visualizing-correlation-matrices.html
And you can find more about p-value here:
http://en.wikipedia.org/wiki/P-value
how can i get the sum squared error of the regression from this function ??
ReplyDeleteYou can compute the mean square error as follows:
Deleteerr=sqrt(sum((line-xi)**2)/n)
thank you, and what's n in this case? the number of rows in the 2D list ??
Deleten is the number of samples that you have: n = len(y).
DeleteBtw, there's a mistake in my last comment, the squared error is err=sqrt(sum((line-y)**2)/n)
Is there a way to calculate the maximum and minimum gradient, given multiple pairs of (x,y) measurements at each point e.g. repeated trials? Thanks!!
ReplyDeleteThe following function is quite nice: scipy.stats.linregress
ReplyDeletehttp://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html
It provides the p-value and r-value without extra work.
Awesome! just what I was looking for.
ReplyDelete