Linear Regression
In this blog I will be writing about linear regression: what linear regression is, how to find the best fit regression line, how to check goodness of fit, and so on.
Introduction
1. Supervised learning methods: The past data comes with labels, which are then used for building the model.
- Regression: The output variable to be predicted is continuous in nature, e.g. scores of a student, diamond prices, etc.
- Classification: The output variable to be predicted is categorical in nature, e.g. classifying incoming emails as spam or ham, Yes or No, True or False, 0 or 1.
2. Unsupervised learning methods: It contains no predefined labels assigned to the past data.
- Clustering: No predefined labels are assigned to the groups/clusters formed, e.g. customer segmentation.
What is Regression?
Regression is a statistical method used in finance, investing, and other disciplines that attempts to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as independent variables).
Uses of Regression
Well, there are plenty of use cases for regression, but I will mention three major applications.
Three major uses for regression analysis are:
- Determining the strength of predictors
- Forecasting an effect
- Trend forecasting
Simple Linear Regression
Linear regression is one of the simplest statistical regression methods used for predictive analysis in machine learning. It models the linear relationship between the independent (predictor) variable on the X-axis and the dependent (output) variable on the Y-axis. If there is a single input variable X (independent variable), such linear regression is called simple linear regression.
A scatter plot of the output (y) variable against the predictor (X) variable presents this linear relationship, and the straight line drawn through the points is referred to as the best fit line. Based on the given data points, we attempt to plot the line that fits the points best.
To calculate the best fit line, linear regression uses the traditional slope-intercept form, which is given below:

y = B0 + B1*x

where y is the predicted output, x is the input variable, B0 is the intercept, and B1 is the slope of the line.
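As a minimal sketch of what prediction with this line looks like (the B0 and B1 values here are made up for illustration, not fitted ones):

```python
import numpy as np

# Hypothetical parameters: intercept B0 and slope B1 (made up for illustration).
B0, B1 = 2.0, 0.5

x = np.array([1.0, 2.0, 3.0, 4.0])  # predictor values
y_pred = B0 + B1 * x                # y = B0 + B1*x applied to every x
print(y_pred)                       # [2.5 3.  3.5 4. ]
```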
But how does linear regression find out which line is the best fit line?
Random Error (Residuals)
In regression, the difference between the observed value of the dependent variable (yi) and the predicted value (ŷi) is called the residual:

εi = yi (actual) – ŷi (predicted)
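A quick sketch of residuals on toy numbers (the arrays here are made up):

```python
import numpy as np

y_actual = np.array([3.0, 4.5, 5.0, 7.0])  # observed y_i (toy data)
y_pred   = np.array([3.2, 4.0, 5.5, 6.8])  # model predictions ŷ_i

residuals = y_actual - y_pred              # ε_i = y_i − ŷ_i
print(residuals)                           # ≈ [-0.2  0.5 -0.5  0.2]
```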
What is the best fit line?
In simple terms, the best fit line is the line that fits the given scatter plot best. Mathematically, the best fit line is obtained by minimizing the Residual Sum of Squares (RSS):

RSS = Σ εi² = Σ (yi – ŷi)²
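For simple linear regression, the RSS-minimizing parameters also have a well-known closed-form (least-squares) solution, which is worth seeing alongside the gradient descent approach used later in this post. The data below is made up:

```python
import numpy as np

# Toy data; any equal-length (x, y) arrays work here.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Closed-form least-squares estimates that minimize RSS:
#   B1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²,   B0 = ȳ − B1·x̄
B1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
B0 = y.mean() - B1 * x.mean()

rss = np.sum((y - (B0 + B1 * x)) ** 2)  # Residual Sum of Squares at the optimum
print(B0, B1, rss)                      # ≈ 1.05, 0.99, 0.107
```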
Cost Function for Linear Regression
The cost function helps to work out the optimal values for B0 and B1, which provide the best fit line for the data points.
In linear regression, the Mean Squared Error (MSE) cost function is generally used; it is the average of the squared errors between the predicted values (ŷi) and the actual values (yi).
We calculate MSE using:

MSE = (1/n) Σ (yi – ŷi)²

where n is the number of data points.
Using the MSE function, we’ll update the values of B0 and B1 such that the MSE value settles at the minimum. These parameters can be determined using the gradient descent method, which finds the values for which the cost function is minimal.
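As a small sketch, MSE can be computed directly from the formula above (the function name is my own):

```python
import numpy as np

def mse(y_actual, y_pred):
    # Mean Squared Error: the average of the squared residuals.
    return np.mean((y_actual - y_pred) ** 2)

print(mse(np.array([3.0, 4.5, 5.0]), np.array([3.2, 4.0, 5.5])))  # 0.18
```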
Some common cost functions are MSE, MAE, RMSE, and Huber loss.
MSE:
- The MSE cost function is convex, so it has one global minimum.
- A drawback of MSE is that the unit of the metric is also squared, so if the model tries to predict a price in US dollars, the MSE is reported in squared dollars, which is hard to interpret.
MAE:
MAE evaluates the absolute distance of the observations (the entries of the dataset) to the predictions on a regression, taking the average over all observations. We use the absolute value of the distances so that negative errors are accounted for properly.
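A sketch of the remaining metrics from the list above (the function names and the delta default are my own; the Huber formula is the standard one: quadratic for small errors, linear for large ones):

```python
import numpy as np

def mae(y, y_hat):
    # Mean Absolute Error: average absolute distance to the predictions.
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    # Root Mean Squared Error: puts the error back in the units of y.
    return np.sqrt(np.mean((y - y_hat) ** 2))

def huber(y, y_hat, delta=1.0):
    # Huber loss: quadratic for |error| <= delta, linear beyond it.
    err = y - y_hat
    small = np.abs(err) <= delta
    return np.mean(np.where(small, 0.5 * err ** 2,
                            delta * (np.abs(err) - 0.5 * delta)))
```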
Gradient Descent for Linear Regression
Gradient descent is used to optimize the B0 and B1 values: it updates them iteratively so as to minimize the cost function.
Let’s take an example to understand this. Imagine a U-shaped pit, and you are standing at the uppermost point of the pit; your motive is to reach the bottom. Suppose there is a treasure at the bottom, and you can only take a discrete number of steps to get there. If you opt to take one small step at a time, you will eventually get to the bottom, but this will take a long time. If you decide to take larger steps, you may reach the bottom sooner, but there is a chance that you overshoot it and end up nowhere near the bottom. In the gradient descent algorithm, the size of the steps you take is the learning rate, and this decides how fast the algorithm converges to the minimum.
To update B0 and B1, we take gradients from the cost function. To find these gradients, we take partial derivatives of the MSE with respect to B0 and B1:

∂J/∂B0 = (–2/n) Σ (yi – ŷi)
∂J/∂B1 = (–2/n) Σ xi (yi – ŷi)

Each update step moves B0 and B1 against the gradient, scaled by the learning rate.
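Putting it together, a minimal gradient descent sketch for simple linear regression (the learning rate and epoch count are arbitrary choices that happen to work for this toy data):

```python
import numpy as np

def gradient_descent(x, y, lr=0.05, epochs=2000):
    # Fit y ≈ B0 + B1*x by minimizing MSE with gradient descent.
    B0, B1 = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        y_pred = B0 + B1 * x
        # Partial derivatives of MSE with respect to B0 and B1:
        dB0 = (-2.0 / n) * np.sum(y - y_pred)
        dB1 = (-2.0 / n) * np.sum(x * (y - y_pred))
        B0 -= lr * dB0  # step against the gradient
        B1 -= lr * dB1
    return B0, B1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
print(gradient_descent(x, y))  # ≈ (1.05, 0.99), matching the closed form
```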
Performance Metrics:
R square (R²) measures the proportion of variance in the dependent variable that the model explains. The catch is that R² increases whenever a feature is added, even when that feature (for example, gender) has no linear relationship to the dependent feature.
Adjusted R² fixes this: its value increases only if the added feature is important, i.e. actually linearly related to the dependent feature, and decreases otherwise.
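A sketch of both metrics from their standard formulas (the function names are my own; n is the number of observations and p the number of predictors):

```python
import numpy as np

def r_squared(y, y_hat):
    rss = np.sum((y - y_hat) ** 2)     # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)  # total sum of squares
    return 1.0 - rss / tss

def adjusted_r_squared(r2, n, p):
    # Penalizes R² for each extra predictor p given n observations.
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
```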
Overfitting And Underfitting:
Overfitting:
A statistical model is said to be overfitted when it makes accurate predictions on the training data but not on the testing data. When a model trains too much on the data, it starts learning from the noise and inaccurate entries in the data set. As a result, testing with the test data shows high variance, while testing with the training data shows low bias.
Underfitting:
Your model is underfitting the training data when it performs poorly even on the training data. This is because the model is unable to capture the relationship between the input examples (often called X) and the target values (often called Y).
Testing with the training data shows high bias, and the model typically also has low variance, since it is too simple to react to fluctuations in the data.
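The contrast is easy to see empirically. A sketch using scikit-learn (the data, seed, and polynomial degrees are arbitrary choices): fitting a degree-15 polynomial to noisy linear data typically scores much better on the training split than on the test split (overfitting), while the degree-1 fit scores similarly on both:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(60, 1))
y = 1.0 + 2.0 * X.ravel() + rng.normal(0, 1.0, 60)  # noisy linear data

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 15):  # degree 15 tends to overfit this data
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    # .score() returns R² — compare train vs test to spot overfitting.
    print(degree, model.score(X_tr, y_tr), model.score(X_te, y_te))
```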