
Linear Regression with Python

Linear regression is a method used to find the relation between two variables. It tries to find the line that best fits all the points, and with that line we can make predictions in a continuous set (regression predicts a value from a continuous set, for example, the price of a house depending on its size). How are we going to accomplish that? Well, we need to measure the error between each point and the line, then find the slope and the alpha value to get the linear regression equation.

We first need to understand a couple of concepts: the covariance, the variance, and the correlation coefficient.

Covariance

The covariance is a value that tells us the magnitude of the correlation between two or more sets of random variates. It tells us whether there is a dependency between two variables, and it allows us to estimate the linear correlation coefficient and the regression line.

Given a set of data points (x1, y1), (x2, y2), …, (xn, yn), we can calculate the covariance:

  • We take the mean of all the x-values and all the y-values: μx = (Σ xi) / n, μy = (Σ yi) / n.
  • Next, we find the difference between each x-value and the x-mean, multiply it by the difference between the corresponding y-value and the y-mean, add up all the products, and divide by the number of data points: cov(x, y) = Σ (xi − μx)(yi − μy) / n.
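The two steps above can be sketched in plain Python (the function name covariance is our own choice for this sketch, not a library function):

```python
def covariance(xs, ys):
    """Population covariance: the average product of deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# A perfectly linear toy data set: y grows twice as fast as x.
print(covariance([1, 2, 3], [2, 4, 6]))  # → 1.3333...
```

A positive result means the variables tend to move together; a negative one means they move in opposite directions.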

 

Variance

The variance measures the spread of the data around the mean: it is simply the average of the squared differences from the mean.

var(x) = Σ (xi − μx)² / n

  • In the first step, we take the mean.
  • In the second, we find the differences from the mean.
  • In the third, we square the differences. Why? Because we want every value to be positive, and also because we want outliers to weigh more heavily.
  • Then we average the squared differences to get the variance.
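These steps translate directly to a small sketch (again, variance here is our own helper, not a library call); with the x-values 5 through 14 used in the exercise later in this post, it gives 8.25:

```python
def variance(xs):
    """Population variance: the average squared deviation from the mean."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n

print(variance([5, 6, 7, 8, 9, 10, 11, 12, 13, 14]))  # → 8.25
```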
Least squares

The method of least squares is a standard approach in regression analysis. The best fit in the least-squares sense minimizes the sum of squared residuals.

The aim is to obtain a line y = βx + α.

To find the slope β that gives us the line that best fits all the data points, we need the covariance and the variance:

β = cov(x, y) / var(x)

To obtain the coefficient α, we can use the fact that the regression line passes through the mean point (μx, μy). Since these means have already been calculated, it is relatively easy to substitute them into the linear equation

μy = β·μx + α

which can then be solved for α.
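Putting the pieces together, the slope and intercept can be computed in a short plain-Python sketch (the helper name least_squares is ours, not from any library):

```python
def least_squares(xs, ys):
    """Return (beta, alpha) for the least-squares line y = beta*x + alpha."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var = sum((x - mean_x) ** 2 for x in xs) / n
    beta = cov / var
    alpha = mean_y - beta * mean_x  # the line passes through (mean_x, mean_y)
    return beta, alpha

print(least_squares([1, 2, 3], [2, 4, 6]))  # → (2.0, 0.0)
```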

Now we know how to get the equation of the regression line from the data points. Let's work through an exercise to demonstrate what we just learned. We'll take an example from the Mathspace page and put it into Python code.

The exercise 

A very small data set of 10 points is given in the columns below. We wish to calculate the means, the covariance, and the variance. These results will be used to find the least-squares regression line and the correlation coefficient. If you want to compare the results, you can go to the web page and check that we are doing things right.

 

x      y
$5     $12
$6     $12
$7     $12
$8     $14
$9     $16
$10    $16
$11    $18
$12    $20
$13    $19
$14    $20

 


%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# The ten data points from the exercise
x = np.arange(5, 15)
y = np.array([12, 12, 12, 14, 16, 16, 18, 20, 19, 20])
print(x)
print(y)
plt.scatter(x, y)

# We take the mean of each variable
ux = np.mean(x)
uy = np.mean(y)
print(ux)
print(uy)

# Covariance (bias=True divides by n, the population covariance)
cova = np.cov(x, y, bias=True)[0][1]
print(cova)

# Beta is the slope of the least-squares regression line
beta = cova / x.var()
print(x.var())  # variance of the independent variable
print(beta)

# The line passes through (ux, uy), i.e. uy = beta*ux + alpha, so solve for alpha
alpha = uy - beta * ux
print(alpha)

# Plot the fitted line over the data
def predict(x):
    return beta * x + alpha

fitLine = predict(x)

plt.scatter(x, y)
plt.plot(x, fitLine)
plt.show()
print(x)
print(fitLine)

[Figure: scatter plot of the data points]

[Figure: scatter plot with the fitted regression line]
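As a quick sanity check (this cross-check is an addition, not part of the original exercise), NumPy's polyfit fits the same degree-1 polynomial and should return the same slope and intercept as the covariance/variance formulas:

```python
import numpy as np

x = np.arange(5, 15)
y = np.array([12, 12, 12, 14, 16, 16, 18, 20, 19, 20])

# polyfit returns the highest-degree coefficient first: [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)  # slope ≈ 1.0485, intercept ≈ 5.9394
```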

 

Now that we have learned how to do linear regression manually, we are going to use a statistics Python library called SciPy.

What we are going to do is import stats from the SciPy library. We will use linregress to get values like the slope of the linear function, the intercept, and the r_value, which we will use to evaluate how well the linear function fits our data.


from scipy import stats

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

Now with this information, we can plot the linear function that we just got from SciPy.


import matplotlib.pyplot as plt
import numpy as np

def predict(x):
    return slope * x + intercept

# ravel flattens x in case it is stored as a 2-D row vector
fitLine = predict(np.ravel(x))

plt.scatter(x, y)
plt.plot(np.ravel(x), fitLine, c='r')
plt.show()

[Figure: scatter plot with the SciPy regression line]

As we can see, the linear function fits our data very well, and we can quantify the fit by calculating the R-squared.


print(r_value ** 2)

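R-squared can also be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares (a sketch; the variable names are ours):

```python
import numpy as np

x = np.arange(5, 15)
y = np.array([12, 12, 12, 14, 16, 16, 18, 20, 19, 20])

slope, intercept = np.polyfit(x, y, 1)
predicted = slope * x + intercept

ss_res = np.sum((y - predicted) ** 2)   # residual sum of squares
ss_tot = np.sum((y - np.mean(y)) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(r_squared)  # ≈ 0.936
```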

 

What's next? Well, you can play with what you have just learned. You can generate random values with a normal distribution, create a linear relation between those points, and try to find the linear function that best fits the random data.


%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2)

# 1000 x-values from a normal distribution, plus a noisy linear response
X = np.random.normal(3.0, 1.0, 1000)
y = 100 - (X + np.random.normal(0, 0.1, 1000)) * 3

plt.scatter(X, y)

[Figure: scatter plot of the random data]
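To close the loop (this extension is ours, not part of the original post), we can fit the random data with linregress and check that the recovered slope is close to the true value of −3 used to generate it:

```python
import numpy as np
from scipy import stats

np.random.seed(2)
X = np.random.normal(3.0, 1.0, 1000)
y = 100 - (X + np.random.normal(0, 0.1, 1000)) * 3

slope, intercept, r_value, p_value, std_err = stats.linregress(X, y)
print(slope, intercept)  # slope should be close to -3, intercept close to 100
print(r_value ** 2)      # close to 1: the relation is almost perfectly linear
```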

 
