Logistic Regression

Published in Analytics Vidhya, Jan 5, 2021

In day-to-day life we often face problems that come down to choosing one of two categories: pass/fail, win/lose, alive/dead, healthy/sick, yes/no, and so on. It is like picking a single option out of two. Decision making plays an important role in our lives, and each choice has its own consequences.

Having read this far, you may be wondering whether to continue with this blog or skip it.

Make a Choice.

Come on, let’s dive in, assuming you have chosen the YES category. It was a good choice. It was an easy task for you, but what if I had asked you whether a random person of your age is likely to read my blog or not? You could have given either answer, but how would a machine solve the same problem?

To answer this question, a machine needs lots of data with labels (the outcome variable, i.e. Yes/No) and learns from it. This type of machine learning is called Supervised Learning. In our case the data consists of each person’s age and whether they read the blog or not (Yes/No). Let’s dive inside the machine’s mind.

Logistic Regression is a supervised machine learning technique used for classification problems such as:

Binary classification (two categories): Will you read my blog or not? (1: will read my blog, 0: will not read my blog)
Multi-class classification (more than two categories): What would be your rating for my blog out of five? (1: very bad, 2: poor, 3: nice, 4: good, 5: very good)

In this blog let’s work with a binary classification problem to learn how Logistic Regression works.

The data points are shown below. Let’s see how Logistic Regression predicts.

Before predicting, let’s understand how the algorithm works.

Sigmoid Curve

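The logistic function, also called the sigmoid function, is an S-shaped curve that maps any real value to a value between 0 and 1. (The logit function is its inverse.) For a single input x it is written 1 / (1 + e^-(mx + c)). This is what the sigmoid curve looks like in 2-dimensional space.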

The shape of the sigmoid curve depends on two parameters:

m: the slope of the curve.
c: the intercept.

We can see that there are different sigmoid curves for different values of m and c.
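As a rough sketch of how m and c shape the curve, the snippet below evaluates a sigmoid for a few made-up (m, c) pairs. The function name `sigmoid` and the example values are assumptions chosen for illustration; they are not taken from the original diagrams.

```python
import math

def sigmoid(x, m=1.0, c=0.0):
    """Logistic function 1 / (1 + e^-(m*x + c)); the result lies between 0 and 1."""
    z = m * x + c
    # Two algebraically equal forms, chosen so math.exp never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical (m, c) pairs: a larger m gives a steeper curve,
# a different c shifts the curve along the x-axis.
for m, c in [(1.0, 0.0), (3.0, 0.0), (1.0, -2.0)]:
    values = [round(sigmoid(x, m, c), 3) for x in (-2, 0, 2)]
    print(f"m={m}, c={c}: {values}")
```

Each (m, c) pair prints a different set of values for the same inputs, which is the code counterpart of "different sigmoid curves for different values of m and c".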

Which sigmoid curve is the best for our Data points?

To answer that question we have to understand two terms: the cost function and gradient descent.

Cross Entropy Loss:

This loss tells us how well our model is performing: the lower the cross-entropy loss, the closer the predicted probabilities are to the actual labels. The following diagram represents gradient descent.
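To make the loss itself concrete, here is a minimal sketch of binary cross-entropy in plain Python. The function name `cross_entropy`, the clipping constant `eps` and the example numbers are assumptions for illustration only.

```python
import math

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy: -(1/N) * sum(y*log(p) + (1-y)*log(1-p))."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from exactly 0
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

labels = [1, 0, 1, 0]  # made-up labels
good = cross_entropy(labels, [0.9, 0.1, 0.8, 0.2])  # confident and mostly right
bad = cross_entropy(labels, [0.2, 0.8, 0.3, 0.9])   # confident and mostly wrong
print(good < bad)
```

Probabilities that agree with the labels give a smaller loss than probabilities that contradict them, which is why a lower loss means a better-fitting curve.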

Gradient Descent:

The Gradient Descent diagram above shows the path along which the cross-entropy loss is minimized until it reaches the global minimum (where our best model is found).

Blue dot: the cross-entropy loss of our current model (here the loss is at its maximum).
Red dot: the global minimum (the best model the machine can find, where the cross-entropy loss is lowest).

How to reach the Global Minimum?

In general, a loss curve can be non-convex, like the one in the diagram below, with many local minima; the best model is found at the global minimum, where the cross-entropy loss is lowest. (For logistic regression specifically, the cross-entropy loss is convex in m and c, so it has no local minima other than the global one.)

First the cross-entropy loss of our model is calculated.
Then the parameter values of the model are changed so that the loss decreases, which starts the path towards the global minimum.

By repeating this process the cross-entropy loss is minimized and the global minimum is reached, as shown in the Gradient Descent diagram above. A local minimum does not represent the best model; the best model is found only at the global minimum.

Now let’s see how the sigmoid curve changes based on these two parameters.

1) Changing the slope (m) while keeping the intercept (c) constant.

As the diagram shows, changing the slope (m) while keeping the intercept (c) constant changes the steepness of the S-curve.

Mathematically, the partial derivative of the loss with respect to m is computed (treating c as a constant), multiplied by the learning rate, and subtracted from the current value of m.

n (learning rate): a parameter that defines how big or small the steps taken towards the global minimum should be.

2) Changing the intercept (c) while keeping the slope (m) constant.

As the diagram shows, changing the intercept (c) while keeping the slope (m) constant shifts the S-curve along the x-axis.

Mathematically, the partial derivative of the loss with respect to c is computed (treating m as a constant), multiplied by the learning rate, and subtracted from the current value of c.

By repeating these updates we finally reach the global minimum, where the best model, or best S-curve, for our dataset is found.
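The two update rules can be put together in a short gradient descent loop. This is only a sketch under stated assumptions: the data is made up and already scaled, the starting point m = c = 0, the learning rate of 0.1 and the 2000 steps are arbitrary choices, and the names `fit` and `loss` are hypothetical. The gradients used are the standard partial derivatives of the mean cross-entropy loss for a sigmoid model.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def loss(xs, ys, m, c, eps=1e-12):
    """Mean cross-entropy loss of the curve sigmoid(m*x + c) on the data."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(m * x + c), eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(xs)

def fit(xs, ys, learning_rate=0.1, steps=2000):
    """Plain batch gradient descent on m and c, starting from m = c = 0."""
    m, c = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        probs = [sigmoid(m * x + c) for x in xs]
        # Partial derivatives of the mean cross-entropy loss.
        grad_m = sum((p - y) * x for p, y, x in zip(probs, ys, xs)) / n
        grad_c = sum(p - y for p, y in zip(probs, ys)) / n
        # Step against the gradient, scaled by the learning rate.
        m -= learning_rate * grad_m
        c -= learning_rate * grad_c
    return m, c

# Made-up, already-scaled inputs and labels (not the blog's dataset).
xs = [-3.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

m, c = fit(xs, ys)
print("loss before:", round(loss(xs, ys, 0.0, 0.0), 4))
print("loss after: ", round(loss(xs, ys, m, c), 4))
```

Here both parameters are updated in the same step; each update still uses the partial derivative for that parameter alone. A learning rate that is too large can overshoot the minimum, while one that is too small needs many more steps.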

Finally we have found the best S-curve. Now What?

What is a Threshold and how to find the Right threshold?

Green region: positive region (all data points in this region are classified as the positive class).

Red region: negative region (all data points in this region are classified as the negative class).

A threshold is set on the output of the sigmoid curve, which corresponds to a vertical line on the x-axis at the point where the curve crosses the threshold, as shown in the diagram above. The algorithm classifies the data points to the right of that line as the positive class and the data points to the left of it as the negative class.
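The link between the threshold and the vertical line can be sketched in a few lines. The helper names `predict_class` and `boundary_x` and the values of m and c are hypothetical; the sketch also assumes m is positive, so that points to the right of the line are the positive class.

```python
import math

def predict_class(prob, threshold=0.5):
    """Positive class (1) if the predicted probability reaches the threshold."""
    return 1 if prob >= threshold else 0

def boundary_x(m, c, threshold=0.5):
    """x position where sigmoid(m*x + c) equals the threshold.

    Requires m != 0 and 0 < threshold < 1.
    """
    return (math.log(threshold / (1.0 - threshold)) - c) / m

# Hypothetical fitted parameters.
m, c = 2.0, -1.0
for t in (0.1, 0.5, 0.9):
    print(f"threshold={t}: vertical line at x={boundary_x(m, c, t):.3f}")
```

Raising the threshold moves the vertical line to the right, so fewer points fall in the positive region.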

Should the threshold be set to 0.1, 0.5, 0.9 or some other number? To find the right threshold, we first need to know what a confusion matrix is.

Confusion Matrix & classification metrics

A confusion matrix is a table that shows the performance of a classification model on test data for which the true values are known.

TN (True Negative): the actual value and the predicted value are both negative.
FP (False Positive): the actual value is negative but the predicted value is positive. This is also known as a Type I error.
FN (False Negative): the actual value is positive but the predicted value is negative. This is also known as a Type II error.
TP (True Positive): the actual value and the predicted value are both positive.
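The four cells can be counted directly from actual and predicted labels. The sketch below uses a hypothetical helper `confusion_counts` and made-up labels, with 1 standing for the positive class and 0 for the negative class.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN and TP for binary labels (1 = positive, 0 = negative)."""
    counts = {"TN": 0, "FP": 0, "FN": 0, "TP": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            counts["TP"] += 1
        elif actual == 0 and predicted == 0:
            counts["TN"] += 1
        elif actual == 0 and predicted == 1:
            counts["FP"] += 1  # Type I error
        else:
            counts["FN"] += 1  # Type II error
    return counts

# Made-up labels for illustration.
actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
print(confusion_counts(actual, predicted))
```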

Based on these four values several classification metrics are calculated, but we will consider only two of them:

Recall (sensitivity): when the actual value is positive, how often is the predicted value also positive? Recall = TP / (TP + FN).

Precision: when a positive value is predicted, how often is the prediction correct? Precision = TP / (TP + FP).
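Both metrics are simple ratios of confusion matrix cells. A minimal sketch follows; returning 0.0 when the denominator is zero is a convention assumed here, not something the original specifies, and the counts are made up.

```python
def recall(tp, fn):
    """TP / (TP + FN): share of actual positives that were predicted positive."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

def precision(tp, fp):
    """TP / (TP + FP): share of positive predictions that were correct."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

# Hypothetical counts for illustration.
tp, fp, fn = 40, 10, 20
print("recall:   ", round(recall(tp, fn), 3))
print("precision:", round(precision(tp, fp), 3))
```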

Now let’s go back and try to figure out the Right Threshold.

The following diagram shows different thresholds with their confusion matrices and classification metrics.

From the above diagram you can see that whatever threshold you set, you will almost certainly end up with either Type I errors or Type II errors. It is usually not possible to classify every positive and negative data point without any error, and this happens in almost all datasets. This is generally called the Precision/Recall Tradeoff, because a change in these error counts directly affects the precision and recall values.

Precision/Recall Tradeoff

Usually, when you increase precision the recall decreases, and vice versa: raising the threshold tends to raise precision and lower recall, while lowering it does the opposite.

We need to select a threshold such that the precision and recall values are both as close to 1 as possible.
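A small sweep over thresholds shows the tradeoff in numbers. The probabilities and labels below are made up for illustration, and `precision_recall_at` is a hypothetical helper; the thresholds 0.1, 0.5 and 0.9 are the ones mentioned earlier.

```python
def precision_recall_at(probs, labels, threshold):
    """Precision and recall when probabilities >= threshold count as positive."""
    tp = fp = fn = 0
    for p, y in zip(probs, labels):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up predicted probabilities and true labels, for illustration only.
probs = [0.05, 0.20, 0.35, 0.45, 0.55, 0.65, 0.80, 0.95]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

for threshold in (0.1, 0.5, 0.9):
    precision, recall = precision_recall_at(probs, labels, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data precision rises from 0.57 to 0.75 to 1.00 as the threshold goes up, while recall falls from 1.00 to 0.75 to 0.25.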

In some cases we set the threshold to a very low or very high value, because one of the classification metrics is more important than the other.

Case 1 : Type II error is more important than Type I error

Let’s assume that 12 people went to the Wuhan market on a particular day one year ago, all of them developed a cough, and all of them visited a doctor. The doctor says a new virus has been found in 8 of the 12 people who visited the hospital. The doctor’s predictions are shown in the diagram. He says it is a contagious virus and all 8 of them need to be quarantined for some days. Of those 8 people, the doctor falsely predicted that 3 have the disease (Type I error). Keeping those 3 people in quarantine does no harm to society. But the 1 person whom the doctor wrongly diagnosed as negative turned out to be the most dangerous one: because the doctor said he was negative, he carried on with his regular work, took public transport and mingled with his family as usual. Because of that one false negative (Type II error), that person triggered the spread of the virus, and since the virus was so contagious it spread even faster in our globalized world, leading to a pandemic that disrupted economies and lives all over the globe.

In this case you can see that Type II Error seems to be more dangerous than Type I Error.

Case 2 : Type I error is more important than Type II error

Let’s assume that 12 people are waiting to get a loan from a bank. The bank gives loans to only 6 of them, using a machine learning algorithm that predicts, based on individual details, whether a person will repay the money or not. Its predictions are shown in the diagram above. The algorithm falsely predicts that 2 people will not repay the money, although based on their property and salaries they are capable of repaying the loan (Type II error). This does not affect the bank too much. On the other side, the algorithm falsely predicts that 2 people will repay the money, and the bank gives them loans (Type I error). These 2 people do not repay their loans, and the bank loses its money.

In this case you can see that Type I Error seems to be more dangerous than Type II Error.
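The two cases can also be worked through as arithmetic. The counts below are derived from the numbers given in the text (12 people per case, 8 and 6 predicted positive, and the stated false positives and false negatives); the true negative counts are inferred by subtraction, so they are an assumption and the original diagrams may present them differently.

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# Counts derived from the text; TN is inferred as 12 minus the other three cells.
cases = {
    "Case 1 (virus)": {"TP": 5, "FP": 3, "FN": 1, "TN": 3},
    "Case 2 (loan)": {"TP": 4, "FP": 2, "FN": 2, "TN": 4},
}

for name, k in cases.items():
    assert sum(k.values()) == 12  # every person falls in exactly one cell
    p = precision(k["TP"], k["FP"])
    r = recall(k["TP"], k["FN"])
    print(f"{name}: precision={p:.3f}, recall={r:.3f}")
```

In Case 1 the single false negative is the costly error, so recall is the metric to protect; in Case 2 the false positives cost the bank money, so precision matters more.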

So, depending on the context of the problem, one can decide which error is more tolerable than the other and set the threshold for predictions accordingly.

THE CODE FOR ABOVE VISUALIZATION

That’s it. I think that because you ended up picking the right choice, you have spent your precious time well studying my blog.

“Life is a matter of choices, and every choice you make makes you.” - John C. Maxwell

Finally, Make a Choice.