Linear Regression: the history, the theory and the maths

Dhirendraddsingh · Published in Nerd For Tech · Sep 27, 2021


Linear Regression (LR): the algorithm everyone says they understand, but few actually do in its totality.

While most articles on LR focus on the bare-minimum theory and equations needed to pass an ML (Machine Learning) interview round, my aim here is to start from the beginning and take you on a journey through the most common algorithm in statistics and basic ML.

A bit of History

LR comes from a family of statistical processes known as Regression Analysis, which dates back to at least 1805. Regression Analysis is simply a method to model the relationship between a dependent variable and one or more independent variables. The most common form of Regression Analysis is, of course, Linear Regression, but it was not the first; the method of least squares holds that place.

Isn't least squares a part of Linear Regression? Yes and no.

Least squares commonly comes in two types, linear and non-linear. Linear least squares, better known as Ordinary Least Squares (OLS), is the one used for estimating the parameters in Linear Regression.

What is Linear Regression?

A Linear Regression algorithm attempts to model the relationship between a dependent variable and one or more independent variables by fitting a straight line. In the simple one-variable case, this line is represented by the following equation:

y = β0 + β1x + u

Where y is the dependent variable (Ex:- house prices),

x is the independent variable (Ex:- no. of bedrooms),

β0 is the intercept,

β1 is the slope,

and u is the random error component.

For example, if β1 (the slope) were 2, then with every unit change in x, y would increase by 2 units.

BUT how do we get this line? How do we know it is the best line that captures the relationship between x and y (a.k.a. the best-fit line)?

Let’s start with an example:

Imagine you are a student and you want to find out what GPA you will get this semester based on the no. of hours/day you are studying. What you would want to do is establish some relationship between the two variables.

First, you go and collect this data from your seniors. Here y is the GPA and x is the average no. of hours studied/day.

Data Sample for 10 students

Let’s plot it on a scatter plot
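Since the data table itself does not reproduce well here, below is a minimal sketch of the same plot with made-up hours/GPA values (purely illustrative, not the original numbers):

```python
# Scatter plot of hypothetical study-hours vs GPA data (illustrative values only)
import matplotlib.pyplot as plt

hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]   # avg. no. of hours studied/day
gpa   = [2.1, 2.3, 2.5, 2.4, 2.8, 3.0, 3.2, 3.3, 3.6, 3.8]   # GPA obtained

plt.scatter(hours, gpa)
plt.xlabel("Average hours studied per day")
plt.ylabel("GPA")
plt.title("GPA vs study hours (hypothetical sample of 10 students)")
plt.show()
```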

As you can clearly see, there is some sort of linear relationship between the two variables. Our job is to find the best-fit line that explains this relationship and can be used to predict future GPAs based on the average number of hours studied per day.

For that we need to define a cost function and minimize it. In Linear Regression, our cost function is the Mean Squared Error (MSE).
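Written out for n data points, with β0 + β1xᵢ as the prediction for the i-th point, the cost is:

```latex
J(\beta_0, \beta_1) = \frac{1}{n} \sum_{i=1}^{n} \big( y_i - (\beta_0 + \beta_1 x_i) \big)^2
```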

(Figure: a candidate line with the errors shown as vertical distances from the data points; image source: https://tinyurl.com/es8xcdaj)

So as we can see, the line we have chosen has errors (the vertical distances of the original points from the line/predicted points). Our job is to choose the line that minimizes the sum of squared errors. (Why square? Squaring removes any negative signs and also gives more weight to larger differences.)

A very naïve approach could be trying all permutations and combinations of β0 and β1, but that could take a lot of time and is not a computationally sound method.

We will try the two most widely used approaches:

  1. Ordinary Least Squares (OLS) Method

To calculate the values of β0 (intercept) and β1 (slope) that minimize our cost function and give us the best-fit line, we use the following formulas:

Intercept: β0 = ȳ − β1·x̄, i.e. the mean of the dependent variable minus the slope times the mean of the independent variable.

Slope: β1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)², i.e. the sum of the products of the deviations of x and y from their means, divided by the sum of squared deviations of x from its mean.

An Excel sheet where I calculated the intercept and slope using the above formulae

β1 (slope) will be 69.3 / 208 ≈ 0.3

β0 (intercept) will be 2.4 − 0.3 = 2.1

So the equation we have is y = 0.3x + 2.1
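As a sanity check, here is a minimal sketch that applies the same two formulas in code. The hours/GPA values are again made up for illustration, so the resulting slope and intercept will differ from the 0.3 and 2.1 above:

```python
# OLS slope and intercept via the closed-form formulas (hypothetical data)
import numpy as np

hours = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
gpa   = np.array([2.1, 2.3, 2.5, 2.4, 2.8, 3.0, 3.2, 3.3, 3.6, 3.8])

x_bar, y_bar = hours.mean(), gpa.mean()

# slope = sum((x - x̄)(y - ȳ)) / sum((x - x̄)²)
beta_1 = np.sum((hours - x_bar) * (gpa - y_bar)) / np.sum((hours - x_bar) ** 2)
# intercept = ȳ - slope * x̄
beta_0 = y_bar - beta_1 * x_bar

print(f"slope = {beta_1:.3f}, intercept = {beta_0:.3f}")
# np.polyfit(hours, gpa, 1) should return (approximately) the same pair
```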

The most important question: how did we derive the formulae for the intercept and slope? That is what we will do right now. Let's get into it.

Our goal is to minimize our cost function, the sum of squared errors Σᵢ (yᵢ − β0 − β1xᵢ)². (The 1/n in front of the MSE only scales the cost and does not change where the minimum is, so we can drop it for the derivation.)

So what we will do is the following:

  1. Calculate the partial derivatives of the above expression w.r.t. β1 and β0
  2. Set the partial derivatives to 0 to find the minimum (we will not cover the second-order condition that confirms it is a minimum)
  3. Solve for β1 and β0

Partial derivative with respect to β0:

Step 1: Move the derivative inside the summation (the derivative of a sum is the sum of the derivatives).

Step 2: Using the power rule, bring the 2 down.

Step 3: Using the chain rule, take the derivative of whatever is inside the bracket with respect to β0; that derivative is −1.

Partial derivative with respect to β1:

Step 1: Again, move the derivative inside the summation.

Step 2: Using the power rule, bring the 2 down.

Step 3: Using the chain rule, take the derivative of whatever is inside the bracket with respect to β1; that derivative is −xᵢ.

Finally, setting the partial derivatives (equation 1 and equation 2) equal to 0:
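Written out, the two conditions are:

```latex
\text{(equation 1)}\quad \frac{\partial J}{\partial \beta_0} = -2\sum_{i=1}^{n}\big(y_i - \beta_0 - \beta_1 x_i\big) = 0

\text{(equation 2)}\quad \frac{\partial J}{\partial \beta_1} = -2\sum_{i=1}^{n} x_i\big(y_i - \beta_0 - \beta_1 x_i\big) = 0
```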

Dividing both equations by -2 and then using equation 1:

Step 1: Distribute the summation over each individual term.

Step 2: Because β0 and β1 are constants, pull them out of the summation (summing β0 over n sample values gives n·β0).

Step 3: Isolate β0; dividing the sum of all values of y by n gives the mean ȳ, and doing the same for x gives x̄.
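Which gives the intercept formula:

```latex
\beta_0 = \bar{y} - \beta_1 \bar{x}
```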

We have just derived the formula for the intercept.

Now solving for β1:

Step 1: We have already calculated β0, so substitute it into equation 2.

Step 2: Group the x and y terms together.

Step 3: Isolate β1 and solve for it.
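Which gives the slope formula:

```latex
\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
```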

Voila! We have just derived the equations for the intercept and slope that are used all the time in Ordinary Least Squares.

2. Gradient Descent

Gradient descent is by far the most popular optimization strategy used in machine learning and deep learning. It is an iterative optimization algorithm used when training a machine learning model: it tweaks the parameters step by step to drive a given function (usually a loss/cost function) towards a local minimum. When that function is convex, the local minimum found is also the global one.

Let's discuss the major steps in gradient descent using an example first; then we will move on to the MSE function.

Single Variable Function example

Derivative

Let's assume our function is J(θ) = θ², and we want to find the value of θ that minimizes J(θ). The first thing we have to do is find the function's derivative.

Why the derivative? Geometrically, the derivative of a function can be interpreted as the slope of its graph or, more precisely, as the slope of the tangent line at a point. Why do we need the slope? Firstly, its sign tells us in which direction the minimum lies, and secondly, a big slope means we should move more quickly, while a small slope means we should take smaller steps.

The function J(θ) = θ² represented as a graph

We can clearly see that the minimum is at θ = 0, and as we get closer to the minimum the slope keeps getting smaller and smaller. Say we are at θ = −3 (where J(θ) = 9): the slope there, 2θ = −6, is large and negative, which means we have to move to the right, towards 0, and take big steps.

The Learning Rate — Alpha

Let's say the slope is big, like in our example, and we have to move quickly; how do we control that? We use a learning rate. The learning rate scales the slope and so determines how big or small a step we take towards the local minimum.

Convergence

Depending upon the function and the learning rate, gradient descent might or might not converge. What is usually done is to define a maximum number of iterations for which gradient descent moves towards the local minimum before it stops.

The maths

The derivative of θ² is simply 2θ. Using a as the learning rate, gradient descent keeps updating θ after every iteration as θ := θ − a·(2θ).

Initial values of θ and a are 2 and 0.1 respectively

A sample run of 13 iterations
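A minimal sketch that reproduces this kind of run (same starting point θ = 2 and learning rate a = 0.1 as above):

```python
# Gradient descent on J(θ) = θ², starting at θ = 2 with learning rate a = 0.1
theta = 2.0
a = 0.1

for i in range(1, 14):                 # 13 iterations, as in the sample run above
    grad = 2 * theta                   # derivative of θ² is 2θ
    theta = theta - a * grad           # update rule: θ := θ - a * dJ/dθ
    print(f"iteration {i:2d}: theta = {theta:.5f}")
```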

As you can see, the algorithm keeps moving towards 0; it never lands exactly on 0, but it gets arbitrarily close to the minimum.

Gradient Descent on the MSE

The MSE defined earlier is our cost function, and we want to find the parameters β1 (slope) and β0 (intercept) that minimize it. Given that the function we want to minimize has two unknowns, we calculate partial derivatives with respect to both β1 and β0, just like we did in the OLS part. Here are the partial derivatives we calculated previously:
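With the 1/n from the MSE included (it only scales the gradient and does not change where the minimum is), the two partial derivatives are:

```latex
\frac{\partial J}{\partial \beta_0} = -\frac{2}{n}\sum_{i=1}^{n}\big(y_i - (\beta_0 + \beta_1 x_i)\big)

\frac{\partial J}{\partial \beta_1} = -\frac{2}{n}\sum_{i=1}^{n} x_i\big(y_i - (\beta_0 + \beta_1 x_i)\big)
```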

Final Update rules:
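With learning rate a, both parameters are updated at every iteration using these gradients; written out, the standard update rules are:

```latex
\beta_0 := \beta_0 - a\,\frac{\partial J}{\partial \beta_0} = \beta_0 + \frac{2a}{n}\sum_{i=1}^{n}\big(y_i - (\beta_0 + \beta_1 x_i)\big)

\beta_1 := \beta_1 - a\,\frac{\partial J}{\partial \beta_1} = \beta_1 + \frac{2a}{n}\sum_{i=1}^{n} x_i\big(y_i - (\beta_0 + \beta_1 x_i)\big)
```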

So, just like in the earlier single-variable example, we compute both partial derivatives at the current point, update β0 and β1 accordingly, and keep iterating until the cost stops decreasing and we have reached a (local) minimum.

How gradient descent works with two unknowns
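Putting the two update rules together, a minimal sketch of gradient descent for simple linear regression (again on made-up hours/GPA data, with an arbitrarily chosen learning rate and iteration count) could look like this:

```python
# Gradient descent for simple linear regression on hypothetical data
import numpy as np

x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])  # hours studied/day
y = np.array([2.1, 2.3, 2.5, 2.4, 2.8, 3.0, 3.2, 3.3, 3.6, 3.8])  # GPA

beta_0, beta_1 = 0.0, 0.0      # arbitrary starting point
a = 0.05                       # learning rate, chosen by hand
n = len(x)

for _ in range(5000):                          # fixed maximum number of iterations
    y_pred = beta_0 + beta_1 * x
    error = y - y_pred
    grad_b0 = -(2 / n) * np.sum(error)         # ∂J/∂β0
    grad_b1 = -(2 / n) * np.sum(x * error)     # ∂J/∂β1
    beta_0 -= a * grad_b0                      # update both parameters each iteration
    beta_1 -= a * grad_b1

print(f"gradient descent: intercept = {beta_0:.3f}, slope = {beta_1:.3f}")
# These should land very close to the closed-form OLS values for the same data
```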

Why do we need gradient descent when we have a closed-form solution, in this case the Ordinary Least Squares method?

I will quote the best answer I have read on this

“The main reason why gradient descent is used for linear regression is the computational complexity: it’s computationally cheaper (faster) to find the solution using the gradient descent in some cases.

The formula(OLS) looks very simple, even computationally, because it only works for univariate case, i.e. when you have only one variable. In the multivariate case, when you have many variables, the formulae is slightly more complicated on paper and requires much more calculations when you implement it in software” Aksakal

Types of Linear Regression

  1. Simple Linear Regression: The simplest case of linear regression, with one explanatory variable and one dependent variable. Ex:- weight and fitness
  2. Multiple Linear Regression: In real life there is usually more than one explanatory variable explaining a dependent variable. Ex:- weight and height for fitness
  3. Multivariate Linear Regression: This is what people most often confuse with Multiple Linear Regression; MVR is a special case of LR with multiple explanatory variables and multiple dependent variables. Ex:- predicting the GPAs in two subjects using factors like the no. of hours studied overall, past results in those subjects, etc.

Assumptions of Linear Regression:

  1. Linearity: The dependent variable y has a linear relationship with each of the independent variables. Why? This is exactly the assumption that lets us fit a straight line; how would you fit a straight line if the relationship were non-linear?
  2. Homoscedasticity: The assumption that the residuals (actual minus fitted values) have a constant variance.
(Figure: residuals plotted against the fitted values; image source: https://tinyurl.com/ks4a983s)

So if you plot the residuals against the fitted values, their spread around the mean should be roughly constant. Why should we bother if this assumption is not met? Classic LR predicts well for the usual cases that sit somewhere around the mean of the data, but its performance degrades significantly as the data becomes more dispersed.

3. Independence: The linear regression model assumes that the error terms are independent, i.e. the error term of one observation is not influenced by the error term of another observation. When this does not hold, the errors are said to be autocorrelated.

Why should we bother? The estimated standard errors tend to underestimate the true standard errors, so the associated p-values are lower than they should be. This leads to the conclusion that a variable is significant even when it is not, and the confidence and prediction intervals end up narrower than they should be.

4. No multicollinearity: Multicollinearity occurs in a multiple regression model when two or more explanatory variables are closely related to each other. It then becomes difficult to attribute the effect on the dependent variable to any individual variable, and we end up with variables explaining the same variance in the dependent variable.

5. Normality: The residuals should follow a normal distribution with a mean of 0. The calculation of confidence intervals and variable significance is based on this assumption.
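As a rough sketch of how two of these checks are usually eyeballed in practice (a residuals-vs-fitted plot for homoscedasticity and a residual histogram for normality), using the same kind of made-up hours/GPA data as before:

```python
# Quick visual checks of the residual assumptions (illustrative data only)
import numpy as np
import matplotlib.pyplot as plt

x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
y = np.array([2.1, 2.3, 2.5, 2.4, 2.8, 3.0, 3.2, 3.3, 3.6, 3.8])

slope, intercept = np.polyfit(x, y, 1)     # ordinary least squares fit
fitted = intercept + slope * x
residuals = y - fitted                     # actual minus fitted values

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Homoscedasticity: the spread of residuals should look roughly constant across fitted values
ax1.scatter(fitted, residuals)
ax1.axhline(0, linestyle="--")
ax1.set_xlabel("Fitted values")
ax1.set_ylabel("Residuals")

# Normality: the residual histogram should look roughly bell-shaped around 0
ax2.hist(residuals, bins=5)
ax2.set_xlabel("Residuals")

plt.tight_layout()
plt.show()
```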

When Linear Regression doesn't work:

Of course, a very simple answer is: when these assumptions don't hold true. But there is one reason why many Linear Regressions do not work in practice, and that is called the interaction effect.

In regression, when the influence of an independent variable on the dependent variable keeps varying based on the values of other independent variables, we say that there is an interaction effect. Let's say you are working for Uber and you are trying to predict which places are geographically better for consumers, using demand and supply as variables. Evidently the two variables interact strongly, and a plain LR model (with only the main effects, and no explicitly added interaction terms) cannot capture that interaction. That is when we go for tree-based models.
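To make the interaction idea concrete, here is a small simulation (all numbers invented for illustration) where the effect of demand depends on supply. A linear model with only the two main effects misses most of the signal, while adding an explicit demand × supply column recovers it:

```python
# Illustrating an interaction effect with a hand-made product feature
import numpy as np

rng = np.random.default_rng(0)
demand = rng.uniform(0, 10, 200)
supply = rng.uniform(0, 10, 200)
# Target whose dependence on demand changes with supply (the 0.5*demand*supply term)
y = 1.0 + 2.0 * demand - 1.5 * supply + 0.5 * demand * supply + rng.normal(0, 1, 200)

# Design matrix with main effects only, and one with an explicit interaction column
X_plain = np.column_stack([np.ones_like(demand), demand, supply])
X_inter = np.column_stack([np.ones_like(demand), demand, supply, demand * supply])

for name, X in [("main effects only", X_plain), ("with interaction term", X_inter)]:
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    print(f"{name}: residual std = {resid.std():.2f}")
# The interaction column reduces the residual error dramatically
```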

More Resources and Citations:

  1. https://towardsdatascience.com/linear-regression-simplified-ordinary-least-square-vs-gradient-descent-48145de2cf76
  2. https://medium.com/analytics-vidhya/gradient-decent-in-linear-regression-ec2308439478
  3. https://mccormickml.com/2014/03/04/gradient-descent-derivation/
  4. https://www.researchgate.net/post/The-difference-between-least-square-fitting-and-gradient-descent
  5. https://www.youtube.com/watch?v=ewnc1cXJmGA
  6. https://www.coursera.org/learn/machine-learning

Summary

What I have tried to do here is cover each and every aspect of a model so basic that we all underrate it sometimes. Linear Regression is a tremendously powerful model and is still widely used in the biotech sector. While this article could be more detailed in some aspects, I will leave it to the reader to go more in-depth if they feel like it.

Let me know in the comments if you want me to emphasize anything in particular in the next article, and also feel free to reach out to me on LinkedIn if you have any more questions; I will do my best to reply.

Read my other articles:

  1. https://medium.com/nerd-for-tech/a-primer-to-time-series-forecasting-58bbd91cb3bd
  2. https://medium.com/nerd-for-tech/from-a-business-analyst-to-a-data-scientist-4720f536888d
  3. https://faun.pub/a-primer-to-blockchain-and-the-crypto-world-964e48ed96af
