Saturday, March 24, 2012

Linear regression with Numpy

Few post ago, we have seen how to use the function numpy.linalg.lstsq(...) to solve an over-determined system. This time, we'll use it to estimate the parameters of a regression line.
A linear regression line is of the form w1x+w2=y and it is the line that minimizes the sum of the squares of the distance from each data point to the line. So, given n pairs of data (xi, yi), the parameters that we are looking for are w1 and w2 which minimize the error


and we can compute the parameter vector w = (w1 , w2)T as the least-squares solution of the following over-determined system


Let's use numpy to compute the regression line:
from numpy import arange,array,ones,linalg
from pylab import plot,show

xi = arange(0,9)
A = array([ xi, ones(9)])
# linearly generated sequence
y = [19, 20, 20.5, 21.5, 22, 23, 23, 25.5, 24]
w = linalg.lstsq(A.T,y)[0] # obtaining the parameters

# plotting the line
line = w[0]*xi+w[1] # regression line
plot(xi,line,'r-',xi,y,'o')
show()
We can see the result in the plot below.



You can find more about data fitting using numpy in the following posts: Update, the same result could be achieve using the function scipy.stats.linregress (thanks ianalis!):
from numpy import arange,array,ones#,random,linalg
from pylab import plot,show
from scipy import stats

xi = arange(0,9)
A = array([ xi, ones(9)])
# linearly generated sequence
y = [19, 20, 20.5, 21.5, 22, 23, 23, 25.5, 24]

slope, intercept, r_value, p_value, std_err = stats.linregress(xi,y)

print 'r value', r_value
print  'p_value', p_value
print 'standard deviation', std_err

line = slope*xi+intercept
plot(xi,line,'r-',xi,y,'o')
show()

13 comments:

  1. Possible Bugs: x_lst is unused and w[] is undefined?

    ReplyDelete
  2. Thanks Steve, I fixed it. I changed the code at the end to make it consisted with the notation.

    ReplyDelete
  3. Another method is to use scipy.stats.linregress()

    ReplyDelete
    Replies
    1. In the particular case when y=w1*x+w2 you can use both linregress and lstsq as shown above.
      When you want to regress to multiple x-s, e.g.
      y=w1*x1+w2*x2+w3*x3 you need to use lstsq.

      Delete
  4. What is the r_value and the p_value in the second program? What do they represent?

    ReplyDelete
  5. r_value is the correlation coefficient and p_value is the p-value for a hypothesis test whose null hypothesis is that the slope is zero.

    For more information about correlation you can fin my last post:
    http://glowingpython.blogspot.com/2012/10/visualizing-correlation-matrices.html

    And you can find more about p-value here:
    http://en.wikipedia.org/wiki/P-value

    ReplyDelete
  6. how can i get the sum squared error of the regression from this function ??

    ReplyDelete
    Replies
    1. You can compute the mean square error as follows:
      err=sqrt(sum((line-xi)**2)/n)

      Delete
    2. thank you, and what's n in this case? the number of rows in the 2D list ??

      Delete
    3. n is the number of samples that you have: n = len(y).
      Btw, there's a mistake in my last comment, the squared error is err=sqrt(sum((line-y)**2)/n)

      Delete
  7. Is there a way to calculate the maximum and minimum gradient, given multiple pairs of (x,y) measurements at each point e.g. repeated trials? Thanks!!

    ReplyDelete
  8. The following function is quite nice: scipy.stats.linregress

    http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html

    It provides the p-value and r-value without extra work.

    ReplyDelete
  9. Awesome! just what I was looking for.

    ReplyDelete