Logistic Regression: The History, the Theory and the Maths

Dhirendraddsingh · Published in Nerd For Tech · 9 min read · Jan 22, 2022


Logistic Regression is a special case of the Generalized Linear Model, the same family that Linear Regression belongs to. Logistic Regression is mostly used to model the probability of an outcome rather than the outcome itself.

Because it can model probabilities, we can also use it to classify events such as pass/fail or cancer/not cancer by applying a probability threshold (usually 0.5 for binary classes), which is why it is also referred to as a classification algorithm.

A bit of History

One of the earliest relatives of Logistic Regression is Linear Discriminant Analysis (LDA), introduced by Ronald Fisher.

LDA is a method used in statistics and other fields to find a linear combination of features that characterizes or separates two or more classes of objects or events.

The main difference between the two is that LDA fundamentally assumes the independent variables are normally distributed, whereas Logistic Regression makes no such assumption.

What is Logistic Regression?

A Logistic Regression model is used to estimate the probability of an outcome, i.e. how likely something is to happen. Will you pass or fail, does someone have cancer or not: questions like these can be given a probability value between 0 and 1 based on the data.

And this is also the major difference between the approaches of Logistic Regression and Linear Regression. In Linear Regression we model the outcome itself, which may be a continuous variable; in Logistic Regression we model the probability of the outcome, which is always between 0 and 1.

Why not use Linear Regression?

Problem statement: we have a binary output variable Y (cancer: yes or no), and we want to model the probability P(Y = 1) as a function of X (our input, i.e. tumor size).

The most obvious idea is to let the probability that Y = 1 (i.e. cancer) be a linear function of X, just like in linear regression. Every increment of a component of X would then add or subtract so much from the probability.
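In symbols, this linear-probability idea would be (writing β for the linear coefficients):

P(Y = 1 \mid x) = \beta_0 + \beta_1 x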

There are two reasons why this is a bad idea. One is that probabilities must be between 0 and 1, but a linear function will not necessarily respect that, even if all the observed Y values are either 0 or 1. The other is that we might be better off making more use of the fact that we are trying to estimate probabilities, by modelling the probability more explicitly. Moreover, in many situations we empirically see “diminishing returns”: changing the probability p by the same amount requires a bigger change in x when p is already large (or small). Linear models cannot do this.

What if we instead let the log of the probability be a linear function of x? This would not work either, because the log is not unbounded in both directions (log p can never exceed 0) while our linear function is. There is one solution to this:

Let's assume the log-odds of p is a linear function of x (the log-odds is unbounded in both directions):
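\log \frac{p}{1 - p} = \beta_0 + \beta_1 x

Solving this for p gives:

p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}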

This is the logistic function (the famous “S” curve), derived from the log-odds; it squashes every real value into the range 0 to 1, which is perfect for modelling probabilities.

(Figure: linear regression vs. the logistic “S” curve; source: https://www.javatpoint.com/linear-regression-vs-logistic-regression-in-machine-learning)

So what we basically do is take a straight line and pass it through the sigmoid activation function. How we get the parameters of this model is covered below.

How do we fit the curve, i.e. how do we find our parameters?

To answer that, there are some concepts we have to understand first:

a. Probability Distributions

Probability distributions are statistical functions that describe the likelihood of obtaining possible values that a random variable can take. In other words, the values of the variable vary based on the underlying probability distribution.

1. Normal Distribution / Gaussian Distribution

The distribution has two parameters: the mean (𝜇) and the standard deviation (𝜎).

Interpretation:

  • ~68% of the values drawn from a normal distribution lie within 1𝜎 of the mean
  • ~95% of the values drawn from a normal distribution lie within 2𝜎 of the mean
  • ~99.7% of the values drawn from a normal distribution lie within 3𝜎 of the mean

The probability density of observing a single data point x generated from a Gaussian distribution with mean 𝜇 and standard deviation 𝜎 is given by:
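f(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)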

2. Binomial Distribution

The binomial distribution is used when there are exactly two mutually exclusive outcomes of a trial (perfect for logistic regression). These outcomes are conventionally labeled “success” and “failure”. The binomial distribution gives the probability of observing x successes in N trials, where the probability of success on a single trial is denoted by p, and p is assumed fixed for all trials. The probability mass function of the binomial distribution is:
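P(X = x) = \binom{N}{x} p^x (1 - p)^{N - x}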

Ex. Say you toss 5 fair coins; what is the probability of observing 0 heads? ⁵C₀ × 0.5⁰ × 0.5⁵ = 1/32 ≈ 0.031.

b. Maximum Likelihood Estimate (https://tinyurl.com/yckt66bt)

“In statistics, maximum likelihood estimation is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable”

Let's continue with the Gaussian distribution.

Suppose we draw three random numbers, 9, 9.5 and 11, from a Gaussian distribution with standard deviation 𝜎 and mean 𝜇. Which values of 𝜎 and 𝜇 maximize the probability of the random numbers being 9, 9.5 and 11?

Let's use the probability density function to find the joint probability of the random numbers, assuming they are independent, i.e. that the probability of one random number does not affect the probability of the remaining numbers:
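P(9, 9.5, 11 \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(9 - \mu)^2}{2\sigma^2}} \times \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(9.5 - \mu)^2}{2\sigma^2}} \times \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(11 - \mu)^2}{2\sigma^2}}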

This is the joint probability of observing the three numbers under a Gaussian distribution.

How do we find the parameters that maximize this probability?

Differentiation. We can find the maxima (or minima) by setting the derivative to zero.

Let's take the log of both sides to make the differentiation easier:
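\ln P = -3 \ln(\sigma \sqrt{2\pi}) - \frac{(9 - \mu)^2 + (9.5 - \mu)^2 + (11 - \mu)^2}{2\sigma^2}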

This expression can be differentiated to find the maximum. In this example we’ll find the MLE of the mean, μ. To do this we take the partial derivative of the function with respect to μ, giving
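\frac{\partial \ln P}{\partial \mu} = \frac{(9 - \mu) + (9.5 - \mu) + (11 - \mu)}{\sigma^2}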

We can simply set this derivative to 0 and solve for the mean: 0 = (9 − μ) + (9.5 − μ) + (11 − μ) gives μ = (9 + 9.5 + 11)/3 ≈ 9.83, the sample mean. We can do the same exercise for the SD.

How do we learn the parameters of the generalized linear model in the case of logistic regression?

This is normally done by means of maximum likelihood estimation, which we conduct through gradient ascent.

The probability that Y = 1, given a particular X as the input and model parameters θ, is:
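P(Y = 1 \mid X; \theta) = h_\theta(X) = \frac{1}{1 + e^{-\theta^T X}}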

The probability that Y = 0, given X and θ (we just subtract from 1, as the sum of the two probabilities is 1):
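P(Y = 0 \mid X; \theta) = 1 - h_\theta(X)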

Let's combine both of the probabilities. The equation below is an outcome of the Bernoulli distribution (the Bernoulli distribution is a special case of the binomial distribution in which a single trial is conducted, so n = 1 for such a binomial distribution).

The probability of Y given a particular X as the input and model parameters θ:
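P(Y \mid X; \theta) = h_\theta(X)^{Y} \, (1 - h_\theta(X))^{1 - Y}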

As you can see, if Y = 1 the second factor of the equation becomes 1 and we get the probability that Y = 1, i.e. h_θ(x); and if Y = 0 the first factor becomes 1 and we get the probability that Y = 0, i.e. 1 − h_θ(x).

So let's call our independent variables X, our dependent binary variable Y, the total number of observations M, and the model parameters θ. Assuming the observations are independent, we can say:
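L(\theta) = \prod_{i=1}^{M} h_\theta(x^{(i)})^{y^{(i)}} \, (1 - h_\theta(x^{(i)}))^{1 - y^{(i)}}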

This is the famous likelihood function, L(θ). Why? Because it says how plausible our model with parameters θ is, given all of our data points.

Now we will maximize the likelihood function. But there is a problem: the likelihood, a product of sigmoid terms, is not a concave function of θ, which makes it awkward to optimize directly. The solution is the log: the log-likelihood is concave (equivalently, the negative log-likelihood is convex), and since the logarithm is monotonic, maximizing it also maximizes the likelihood:
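\ell(\theta) = \log L(\theta) = \sum_{i=1}^{M} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]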

So what we can do is use this function and perform gradient ascent to find the Maximum Likelihood Estimate. Why? The likelihood function measures the plausibility of the model; if we maximize it, intuitively we get the most plausible model.

So let's calculate the partial derivative of the log-likelihood ℓ(θ) with respect to a parameter θ_j:
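\frac{\partial \ell(\theta)}{\partial \theta_j} = \sum_{i=1}^{M} \left( \frac{y^{(i)}}{h_\theta(x^{(i)})} - \frac{1 - y^{(i)}}{1 - h_\theta(x^{(i)})} \right) \frac{\partial h_\theta(x^{(i)})}{\partial \theta_j}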

Now we have a common term, the derivative of the hypothesis (the sigmoid) with respect to θ_j:
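\frac{\partial h_\theta(x)}{\partial \theta_j} = \frac{\partial \, \sigma(\theta^T x)}{\partial \theta_j}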

By the chain rule, we can write this as:
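\frac{\partial \, \sigma(\theta^T x)}{\partial \theta_j} = \sigma'(\theta^T x) \cdot \frac{\partial (\theta^T x)}{\partial \theta_j}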

The first term equates to:
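\sigma'(z) = \sigma(z) \, (1 - \sigma(z))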

(The derivative of a sigmoid with respect to its argument is the product of the sigmoid and 1 − sigmoid.)

and the second term obviously equates to x_j, the j-th component of x:
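\frac{\partial (\theta^T x)}{\partial \theta_j} = x_j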

If you carry on with the calculation you will finally arrive at:
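\frac{\partial \ell(\theta)}{\partial \theta_j} = \sum_{i=1}^{M} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}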

We will use this equation in our gradient-ascent update to find the parameters θ, giving us our best logistic regression model:
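\theta_j := \theta_j + \alpha \sum_{i=1}^{M} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}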

The + sign represents the ascent (we are maximizing), and α is the learning rate that scales each step.
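To make this concrete, here is a minimal NumPy sketch of the gradient-ascent loop above. The function and variable names are my own, and dividing the step by M is a common scaling choice not specified in the derivation:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, n_iters=5000):
    """Fit logistic regression by gradient ascent on the log-likelihood.

    X is an (M, n) feature matrix (first column all ones for the
    intercept), y is an (M,) vector of binary labels in {0, 1}.
    """
    M, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)         # h_theta(x) for every observation
        gradient = X.T @ (y - h)       # sum_i (y_i - h_i) * x_i
        theta += alpha * gradient / M  # ascent step: + sign, alpha = step size
    return theta

# Toy usage: one feature (e.g. tumor size) plus an intercept column
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0, 0, 1, 1])
theta = fit_logistic(X, y)
print(theta, sigmoid(X @ theta))  # fitted parameters and probabilities
```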

Phew! A lot of maths, but we are done for now :)

Assumptions of Logistic Regression:

1: The Response Variable is Binary

2: The Observations are Independent

3: There is a Linear Relationship Between Explanatory Variables and the Logit of the Response Variable

In contrast to linear regression, logistic regression does not require:

  • A linear relationship between the explanatory variable(s) and the response variable.
  • The residuals of the model to be normally distributed.
  • The residuals to have constant variance, also known as homoscedasticity.

More Resources and Citations:

  1. https://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf
  2. https://www.baeldung.com/cs/cost-function-logistic-regression-logarithmic-expr
  3. https://www.youtube.com/watch?v=TM1lijyQnaI
  4. https://www.geo.fu-berlin.de/en/v/soga/Basics-of-statistics/Logistic-Regression/The-Logit-Function/index.html

Closing note:

Let me know in the comments if you want me to emphasize anything in particular in the next article, and please feel free to reach out to me on LinkedIn if you have any more questions; I will do my best to reply.

Read my other articles:

  1. https://medium.com/nerd-for-tech/a-primer-to-time-series-forecasting-58bbd91cb3bd
  2. https://medium.com/nerd-for-tech/from-a-business-analyst-to-a-data-scientist-4720f536888d
  3. https://faun.pub/a-primer-to-blockchain-and-the-crypto-world-964e48ed96af
  4. https://medium.com/nerd-for-tech/linear-regression-the-history-the-theory-and-the-maths-e8333924b8a2
