Linear regression is a method used to find the relationship between two variables. It tries to find the line that best fits all the points, and with that line we can make predictions over a continuous set (regression predicts a value from a continuous range, for example, the price of a house depending on its size). How are we going to accomplish that? Well, we need to measure the error between each point and the line, then find the slope and the alpha value (the intercept) to get the linear regression equation.
We need to understand a couple of concepts first: the covariance, the variance, and the correlation coefficient.
Covariance
The covariance is a value that tells us the magnitude of the correlation between two or more sets of random variates. It tells us whether a dependency exists between two variables, and it allows us to estimate the linear correlation coefficient and the regression line.
Given a set of data points, we can calculate the covariance:
- We take the mean of all the x-values and the mean of all the y-values.
- Next, for each point, we multiply the difference between its x-value and the x-mean by the difference between its y-value and the y-mean. We add up all the results and divide by the number of data points.
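The two steps above can be sketched in a few lines of NumPy (a minimal sketch; the data here is arbitrary and just for illustration):

```python
import numpy as np

x = np.array([5, 6, 7, 8, 9, 10])
y = np.array([12, 12, 12, 14, 16, 16])

# step 1: the means of x and y
ux = x.mean()
uy = y.mean()

# step 2: average of the products of the deviations from each mean
cov_xy = ((x - ux) * (y - uy)).mean()

# np.cov with bias=True uses the same divide-by-n formula
assert np.isclose(cov_xy, np.cov(x, y, bias=True)[0, 1])
```

The `bias=True` argument matters: by default `np.cov` divides by n − 1 (the sample estimator) rather than by n.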
Variance
The variance measures the spread of the data around the mean: it is simply the average of the squared differences from the mean.
- First, we take the mean.
- Second, we find the differences from the mean.
- Third, we square the differences. Why? Because we want every value to be positive, and also because squaring makes outliers stand out.
- Finally, we average the squared differences to get the variance.
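The four steps above can be written out directly (a sketch with arbitrary data; `data.var()` applies the same formula):

```python
import numpy as np

data = np.array([5, 6, 7, 8, 9, 10])

mean = data.mean()          # step 1: the mean
diffs = data - mean         # step 2: differences from the mean
squared = diffs ** 2        # step 3: square them, so every term is positive
variance = squared.mean()   # step 4: average the squared differences

# NumPy's var() computes the same population variance (divides by n)
assert np.isclose(variance, data.var())
```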
To obtain the coefficient α, we can use the fact that the regression line passes through the point of averages (μx, μy). Since these means have already been calculated, it is relatively easy to substitute them into the linear equation μy = βμx + α, which can then be solved for α: α = μy − βμx.
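Putting the pieces together, the slope β is the covariance divided by the variance of x, and α then follows from the mean point (a sketch with made-up data; the variable names are illustrative):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# slope of the least-squares line: cov(x, y) / var(x)
beta = np.cov(x, y, bias=True)[0, 1] / x.var()

# solve mu_y = beta * mu_x + alpha for alpha
alpha = y.mean() - beta * x.mean()

# sanity check: the fitted line passes through the point of averages
assert np.isclose(beta * x.mean() + alpha, y.mean())
```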
Now we know how to get the equation of the regression line from the data points. Let's work through an exercise to demonstrate what we just learned. We'll take an example from the Mathspace page and put it into Python code.
The exercise
A very small data set of 10 points is given in the columns below. We wish to calculate the means and the necessary covariance and variance. These results will be used to find the least-squares regression line and the correlation coefficient. If you want to compare the results, you can go to the web page and check that we are doing things right.
x | y |
---|---|
$5 | $12 |
$6 | $12 |
$7 | $12 |
$8 | $14 |
$9 | $16 |
$10 | $16 |
$11 | $18 |
$12 | $20 |
$13 | $19 |
$14 | $20 |
```python
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(5, 15)
y = np.array([12, 12, 12, 14, 16, 16, 18, 20, 19, 20])
print(x)
print(y)
plt.scatter(x, y)

# we take the means of the variables
ux = np.mean(x)
uy = np.mean(y)
print(ux)
print(uy)

# covariance
cova = np.cov(x, y, bias=True)[0][1]
print(cova)

# beta is the gradient (slope) of the least-squares regression line
beta = cova / x.var()
print(x.var())  # variance of the independent variable
print(beta)

# the line satisfies uy = beta*ux + alpha, so:
alpha = uy - beta * ux
print(alpha)

def predict(x):
    return beta * x + alpha

fitLine = predict(x)
plt.scatter(x, y)
plt.plot(x, fitLine)
plt.show()
print(x)
print(fitLine)
```
Now that we have learned how to do linear regression manually, we are going to use a statistics Python library called SciPy.
What we are going to do is import stats from the SciPy library and use linregress to get values such as the slope of the linear function, the intercept, and the r_value, which we will use to evaluate how well the linear function fits our data.
```python
from scipy import stats

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
```
Now, with this information, we can plot the linear function that we just got from SciPy.
```python
import matplotlib.pyplot as plt

def predict(x):
    return slope * x + intercept

fitLine = predict(x)
plt.scatter(x, y)
plt.plot(x, fitLine, c='r')
plt.show()
```
And as we can see, the linear function fits our data very well; we can confirm this by calculating the R-squared value.
```python
print(r_value ** 2)
```
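As a cross-check, r itself is just the covariance normalized by the two standard deviations, so R-squared can also be computed by hand and compared against linregress (a sketch, repeating the exercise data so the block is self-contained):

```python
import numpy as np
from scipy import stats

x = np.arange(5, 15)
y = np.array([12, 12, 12, 14, 16, 16, 18, 20, 19, 20])

# r = cov(x, y) / (std(x) * std(y))
r_manual = np.cov(x, y, bias=True)[0, 1] / (x.std() * y.std())

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
assert np.isclose(r_manual, r_value)
print(r_manual ** 2)
```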
What's next? Well, you can play with what you have just learned. You can generate random values with a normal distribution, create a linear relationship between the points, and try to find the linear function that best fits the random data.
```python
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2)
X = np.random.normal(3.0, 1.0, 1000)
y = 100 - (X + np.random.normal(0, 0.1, 1000)) * 3
plt.scatter(X, y)
```
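One way to finish this exercise is to run linregress on the synthetic data (a sketch; since the relationship built in above is roughly y = 100 − 3x plus a little noise, the fit should recover a slope near −3 and an intercept near 100):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

np.random.seed(2)
X = np.random.normal(3.0, 1.0, 1000)
y = 100 - (X + np.random.normal(0, 0.1, 1000)) * 3

slope, intercept, r_value, p_value, std_err = stats.linregress(X, y)

plt.scatter(X, y)
plt.plot(X, slope * X + intercept, c='r')
plt.show()

print(slope, intercept, r_value ** 2)  # slope near -3, intercept near 100
```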