Linear Regression Basic Interview Questions

1. What is Linear Regression?

Linear Regression is a machine learning algorithm based on supervised learning. It performs a regression task: regression models a target prediction value based on independent variables. It is mostly used for finding the relationship between variables and for forecasting.

Linear regression is one of the simplest statistical regression methods used for predictive analysis in machine learning. It shows the linear relationship between the independent (predictor) variable, i.e. the X-axis, and the dependent (output) variable, i.e. the Y-axis. If there is a single input variable X (independent variable), it is called simple linear regression.


To calculate the best-fit line, linear regression uses the traditional slope-intercept form, given below:

                                               Y = β0 + β1X




x: input training data (univariate – one input variable)
y: labels for the data (supervised learning)

When training, the model fits the best line to predict the value of y for a given value of x. The model finds the best regression fit line by finding the best β0 and β1 values.
β0: intercept
β1: coefficient of x

Once we find the best β0 and β1 values, we get the best-fit line. So when we finally use our model for prediction, it will predict the value of y for an input value of x.
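As a minimal sketch of the above (the data here is a made-up toy example), numpy's least-squares polynomial fit can recover the intercept and slope of the best-fit line:

```python
import numpy as np

# Toy univariate training data generated from y = 2x + 1 with no noise,
# so the fitted coefficients should recover intercept 1 and slope 2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Degree-1 polyfit is ordinary least squares; it returns [slope, intercept]
b1, b0 = np.polyfit(x, y, deg=1)

# Prediction for a new input uses the slope-intercept form Y = b0 + b1*X
y_new = b0 + b1 * 5.0
```

Once `b0` and `b1` are found, prediction for any new x is just a line evaluation.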

2. How can we calculate the error in Linear Regression?

Linear regression most often uses mean-square error (MSE) to calculate the error of the model.

MSE is calculated by:

  1. measuring the distance of the observed y-values from the predicted y-values at each value of x;
  2. squaring each of these distances;
  3. calculating the mean of the squared distances.

Linear regression fits a line to the data by finding the regression coefficients that result in the smallest MSE.
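The three steps above can be sketched directly in numpy (the observed and predicted values here are made-up toy numbers):

```python
import numpy as np

y_obs  = np.array([3.0, 5.0, 7.0])   # observed y-values
y_pred = np.array([2.5, 5.0, 8.0])   # predicted y-values at the same x's

# Step 1: distance of observed from predicted at each x
errors = y_obs - y_pred      # [0.5, 0.0, -1.0]
# Step 2: square each distance
squared = errors ** 2        # [0.25, 0.0, 1.0]
# Step 3: mean of the squared distances
mse = squared.mean()
```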

3. What is the difference between a Loss and a Cost function?

There is no major difference.

  1. When calculating the error for a single data point, we use the loss function.

  2. Whereas, when calculating the aggregate error over multiple data points, we use the cost function.
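A minimal sketch of this distinction, using squared error as the loss (the numbers are toy examples):

```python
def loss(y_true, y_pred):
    """Squared-error loss for a single data point."""
    return (y_true - y_pred) ** 2

def cost(ys_true, ys_pred):
    """Cost: the loss aggregated (here averaged) over the whole dataset."""
    n = len(ys_true)
    return sum(loss(t, p) for t, p in zip(ys_true, ys_pred)) / n

single = loss(3.0, 2.0)                  # one data point
total  = cost([3.0, 5.0], [2.0, 5.0])    # whole dataset
```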

4. What are MAE, MSE and RMSE?

MAE :
MAE (mean absolute error) evaluates the absolute distance of the observations (the entries of the dataset) from the predictions of a regression, taking the average over all observations. We use the absolute value of the distances so that negative errors are accounted for properly.

                 MAE = (1/n) Σ |yᵢ − ŷᵢ|
MSE :
Another way to keep the errors positive is to square the distances. This is what MSE (mean squared error) does; higher errors (or distances) weigh more in the metric than lower ones, due to the nature of the power function.

                 MSE = (1/n) Σ (yᵢ − ŷᵢ)²
RMSE :
A drawback of MSE is that the unit of the metric is also squared: if the model predicts a price in US$, the MSE will yield a number with unit (US$)², which does not make sense. RMSE (root mean squared error) returns the error to the original unit by taking the square root of the MSE, while maintaining the property of penalizing higher errors.
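All three metrics can be computed in a few lines of numpy (the prices below are made-up toy values; note how the single large $50 error inflates RMSE relative to MAE):

```python
import numpy as np

y_true = np.array([100.0, 200.0, 300.0])   # e.g. prices in US$
y_pred = np.array([110.0, 190.0, 250.0])   # errors of $10, $10, $50

mae  = np.mean(np.abs(y_true - y_pred))    # same unit as y (US$)
mse  = np.mean((y_true - y_pred) ** 2)     # squared unit, (US$)²
rmse = np.sqrt(mse)                        # back to the original unit (US$)
```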



5. Explain how Gradient Descent works in Linear Regression.

Gradient descent is an iterative algorithm that approaches the least-squares regression line by minimizing the sum of squared errors over multiple iterations. In simple linear regression, where there is only one independent variable (i.e. one set of x values), each iteration computes the gradient of the error with respect to the intercept and the slope, and moves both parameters a small step in the direction that reduces the error.
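A minimal sketch of gradient descent for simple linear regression, minimizing MSE (the learning rate, iteration count, and toy data are illustrative choices, not tuned values):

```python
import numpy as np

def gradient_descent(x, y, lr=0.01, n_iters=5000):
    """Fit y ≈ b0 + b1*x by minimizing MSE with gradient descent."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iters):
        y_pred = b0 + b1 * x
        # Partial derivatives of MSE with respect to b0 and b1
        d_b0 = (-2.0 / n) * np.sum(y - y_pred)
        d_b1 = (-2.0 / n) * np.sum((y - y_pred) * x)
        # Step each parameter opposite to its gradient
        b0 -= lr * d_b0
        b1 -= lr * d_b1
    return b0, b1

# Toy data from y = 2x + 1; the fit should converge near those values
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
b0, b1 = gradient_descent(x, y)
```

The closed-form least-squares solution gives the same line; gradient descent is preferred when the dataset or feature count makes the closed form expensive.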





6. What does the intercept term mean?

The intercept (often labeled the constant) is the point where the function crosses the y-axis. In some analyses, the regression model only becomes significant when we remove the intercept, and the regression line reduces to Y = bX + error.


7. What are the assumptions of Linear Regression?

Linear regression is an analysis that assesses whether one or more predictor variables explain the dependent (criterion) variable. The regression has five key assumptions:

  1. Linear relationship
  2. Multivariate normality
  3. No or little multicollinearity
  4. No auto-correlation
  5. Homoscedasticity
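Two of these assumptions can be sanity-checked from the residuals of a fitted model. Below is a rough sketch on synthetic data (the Durbin–Watson statistic near 2 suggests no auto-correlation; comparing residual spread across halves of the x-range is a crude homoscedasticity check, not a formal test):

```python
import numpy as np

# Synthetic data: linear signal plus constant-variance noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 100)
y = 3.0 * x + 2.0 + rng.normal(0.0, 1.0, size=100)

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# Durbin-Watson statistic: values near 2 indicate no auto-correlation
dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# Crude homoscedasticity check: residual spread in low-x vs high-x half
spread_lo = residuals[:50].std()
spread_hi = residuals[50:].std()
```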

8. How is Hypothesis testing used in Linear Regression?

Hypothesis testing is used to confirm if our beta coefficients are significant in a linear regression model. Every time we run the linear regression model, we test if the line is significant or not by checking if the coefficient is significant.

Let us understand it in Simple Linear Regression first.

When we fit a straight line through the data, we get two parameters i.e., the intercept (β₀) and the slope (β₁). 

Now, β₀ is not of much importance right now, but there are a few aspects of β₁ which need to be checked and verified. Suppose we have a dataset whose scatter plot shows the points randomly scattered, with no apparent linear trend.

When we run a linear regression on this dataset in Python, Python will still fit a line to the data using the least-squares method, but the fitted line is of no use in this case. Hence, every time we perform linear regression, we need to test whether the fitted line is a significant one or not (in other words, test whether β₁ is significant or not). We use Hypothesis Testing on β₁ for this.

Steps to perform Hypothesis testing:
1. Set the hypotheses (H₀: β₁ = 0, H₁: β₁ ≠ 0)
2. Set the significance level, the criteria for a decision
3. Compute the test statistic
4. Make a decision
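These steps can be sketched with `scipy.stats.linregress`, which reports the slope together with the two-sided p-value for the test of H₀: slope = 0 (the data below is a made-up toy example that roughly follows y = 2x):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])   # roughly y = 2x

# linregress returns the slope, its standard error, and the p-value
# of the two-sided t-test on the slope
result = stats.linregress(x, y)

alpha = 0.05                          # step 2: significance level
significant = result.pvalue < alpha   # step 4: make a decision
```

Because the toy data follows a clear linear trend, the test rejects H₀ and the slope is declared significant.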

9. How would you decide the importance of variables for multivariate regression?

  1. Variables that are already proven in the literature to be related to the outcome.
  2. Variables that can either be considered the cause of the exposure, the outcome, or both.
  3. Interaction terms of variables that have large main effects.

10. What is the difference between R² and adjusted R²?

R² measures the proportion of variance in the target explained by the model, but it never decreases when more features are added, even useless ones. Adjusted R² penalizes the number of features, so it only increases when a new feature improves the model more than would be expected by chance.

                 R² = 1 − (SS_res / SS_tot)

                 Adjusted R² = 1 − (1 − R²) × (N − 1) / (N − P − 1)

N = number of data points
P = number of independent features
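A minimal sketch computing R² and adjusted R² from their standard definitions (the predictions below are made-up toy values; with N = 5 points and P = 1 feature, adjusted R² comes out slightly below R²):

```python
import numpy as np

def r2_scores(y_true, y_pred, p):
    """Return (R², adjusted R²); p = number of independent features."""
    n = len(y_true)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2) # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return r2, adj_r2

y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.9, 11.0])
r2, adj = r2_scores(y_true, y_pred, p=1)
```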



