Data Science Interview Questions Blog 2
Q1. What is Logistic Regression?
Answer:
Logistic regression is used when the dependent variable can be represented as binary values (0 or 1, true or false, yes or no), meaning the outcome can take only one of two forms. For example, it can be used when we need to find the probability of a success or failure event.
Logistic Regression is used when the dependent variable (target) is categorical.
Model:
Output = 0 or 1
Z = WX + B
hΘ(x) = sigmoid(Z) = 1 / (1 + e^(-Z))
log(P(X) / (1 - P(X))) = WX + B
If ‘Z’ goes to infinity,
Y(predicted) will become 1, and if ‘Z’ goes to negative infinity,
Y(predicted) will become 0.
The output of the hypothesis is the estimated probability. It is used to infer how confident we are that the predicted value is the actual value for a given input X.
Cost Function
Cost(hΘ(x), Y(Actual)) = -log(hΘ(x))      if y = 1
                       = -log(1 - hΘ(x))  if y = 0
This
implementation is for binary logistic regression. For data with more than 2 classes, softmax regression has to be used.
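As a quick illustration, here is a minimal NumPy sketch of the hypothesis and cost described above; the weights (W, B) and the toy data are made up purely for this example.

# Minimal sketch of the binary logistic regression hypothesis and cost
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(X, W, B):
    # Z = WX + B, h(x) = sigmoid(Z)
    return sigmoid(X @ W + B)

def cost(h, y):
    # -log(h) when y = 1, -log(1 - h) when y = 0, averaged over samples
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

X = np.array([[0.5], [1.5], [2.5], [3.5]])   # toy feature
y = np.array([0, 0, 1, 1])                   # binary target
W, B = np.array([1.2]), -2.0                 # example weights
print(cost(hypothesis(X, W, B), y))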
Q2. Difference between logistic and linear regression?
Answer:
Linear and logistic regression are the most basic forms of regression and are commonly used. The essential difference between the two is that logistic regression is used when the dependent variable is binary, whereas linear regression is used when the dependent variable is continuous and the nature of the regression line is linear.
Key Differences between Linear and Logistic Regression
Linear regression models the data with a continuous numeric output. By contrast, logistic regression models the data with binary output values.
Linear regression requires a linear relationship between the dependent and independent variables, whereas this is not necessary for logistic regression.
In linear regression, the independent variables can be correlated with each other. In logistic regression, by contrast, the independent variables should not be correlated with each other.
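A small scikit-learn sketch contrasting the two models; the data below is invented purely for illustration.

# LinearRegression for a continuous target, LogisticRegression for a binary target
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.arange(10).reshape(-1, 1)
y_cont = 3.0 * X.ravel() + 5.0          # continuous target -> linear regression
y_bin = (X.ravel() > 4).astype(int)     # binary target -> logistic regression

lin = LinearRegression().fit(X, y_cont)
log = LogisticRegression().fit(X, y_bin)

print(lin.predict([[6]]))                            # a continuous value
print(log.predict([[6]]), log.predict_proba([[6]]))  # a class label and its probability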
Q3. Why can't we do a classification problem using regression?
Answer:
With linear regression, you fit a polynomial through the data. Say, as in the example below, we fit a straight line through the {tumor size, tumor type} sample set:
Above, malignant tumors get 1 and non-malignant ones get 0, and the green line is our hypothesis h(x). To make predictions, we may say that for any given tumor size x, if h(x) is bigger than 0.5, we predict a malignant tumor; otherwise, we predict benign.
It looks like this way we could correctly predict every single training set sample, but now let's change the task a bit. Intuitively it's clear that all tumors larger than a certain threshold are malignant. So let's add another sample with a huge tumor size and run linear regression again:
Now our h(x) > 0.5 → malignant rule doesn't work anymore. To keep making correct predictions, we would need to change it to h(x) > 0.2 or something similar, but that is not how the algorithm should work.
We cannot change the hypothesis each time a new sample arrives. Instead, we should learn it from the training set data and then (using the hypothesis we've learned) make correct predictions for data we haven't seen before. Moreover, linear regression is unbounded, whereas a classification output needs to stay between 0 and 1, which is exactly what logistic regression provides.
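A rough numerical illustration of this argument, with made-up tumor sizes: fit linear regression, add one very large tumor, and watch the 0.5 threshold stop working.

# Thresholding a linear regression output breaks when an outlier arrives
import numpy as np
from sklearn.linear_model import LinearRegression

sizes = np.array([[1], [2], [3], [6], [7], [8]])
labels = np.array([0, 0, 0, 1, 1, 1])          # 1 = malignant

h = LinearRegression().fit(sizes, labels)
print((h.predict(sizes) > 0.5).astype(int))    # matches the labels

# Add one huge malignant tumor and refit
sizes2 = np.vstack([sizes, [[60]]])
labels2 = np.append(labels, 1)
h2 = LinearRegression().fit(sizes2, labels2)
print((h2.predict(sizes2) > 0.5).astype(int))  # a smaller malignant tumor now falls below 0.5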
Q4. What is a Decision Tree?
Answer:
A decision tree is a type of supervised learning algorithm that can be used for classification as well as regression problems. The input to a decision tree can be both continuous and categorical. A decision tree works on if-then statements: it tries to solve the problem using a tree representation (nodes and leaves).
Assumptions while creating a decision tree:
1) Initially, the whole training set is considered as the root.
2) Feature values are preferred to be categorical; if continuous, they are discretized.
3) Records are distributed recursively on the basis of attribute values.
4) Which attribute is placed at the root or at an internal node is decided using a statistical approach.
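A minimal scikit-learn sketch of a decision tree; the Iris dataset and the depth limit are chosen here only for illustration.

# Fit a small decision tree and print the if-then rules it learned
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))   # the learned if-then splits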
Q5. Entropy, Information Gain, Gini Index, Reducing Impurity?
Answer:
There are different criteria that define the split of nodes in a decision tree, and there are a few algorithms to find the optimal split.
1) ID3 (Iterative Dichotomiser 3): This algorithm uses entropy and information gain as metrics to form a better decision tree. The attribute with the highest information gain is used as the root node, and a similar approach is followed after that.
Entropy is a measure that characterizes the impurity of an arbitrary collection of examples. Entropy varies from 0 to 1: 0 if all the data belong to a single class and 1 if the class distribution is equal. In this way, entropy gives a measure of impurity in the dataset.
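Formally, for a set S with classes i occurring with proportions p_i, entropy is computed as:
Entropy(S) = - ∑ p_i * log2(p_i)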
Steps to decide which attribute to split on:
1. Calculate the entropy of the whole dataset (the target).
2. For every attribute:
   2.1 Calculate entropy for all categorical values.
   2.2 Take the average information entropy for the attribute.
3. Pick the attribute with the highest information gain.
4. Repeat until we get the desired tree.
Information Gain = Entropy(S) - ∑ (|Sb| / |S|) * Entropy(Sb)
Sb - subset, S - entire dataset
2) CART (Classification and Regression Trees): In CART, we use the Gini index as the metric. The Gini index is used as a cost function to evaluate splits in the dataset.
Steps to calculate Gini for a split:
1. Calculate Gini for the sub-nodes using the formula: sum of the squares of the probabilities of success and failure (p² + q²).
2. Calculate Gini for the split using the weighted Gini score of each node of that split.
Choose the split with the higher Gini value.
Split on Gender:
Gini for sub-node Male = (0.65)*(0.65) + (0.35)*(0.35) = 0.55
Gini for sub-node Female = (0.2)*(0.2) + (0.8)*(0.8) = 0.68
Weighted Gini for Split Gender = (10/30)*0.68 + (20/30)*0.55 = 0.59
Similarly for Split on Class:
Gini for sub-node Class IX = (0.43)*(0.43) + (0.57)*(0.57) = 0.51
Gini for sub-node Class X = (0.56)*(0.56) + (0.44)*(0.44) = 0.51
Weighted Gini for Split Class = (14/30)*0.51 + (16/30)*0.51 = 0.51
Here the weighted Gini is higher for Gender, so we split on Gender.
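A short Python sketch reproducing the calculation above; the node counts (20 males, 10 females, 14 in Class IX, 16 in Class X) are taken from the weights shown, and the success proportions are the ones used in each sub-node formula.

# Gini score of a node as used above: p^2 + q^2 (higher = purer)
def gini(p):
    return p * p + (1 - p) * (1 - p)

# Split on Gender: 20 males (p = 0.65), 10 females (p = 0.2) out of 30
weighted_gender = (20 / 30) * gini(0.65) + (10 / 30) * gini(0.2)

# Split on Class: 14 in Class IX (p = 0.43), 16 in Class X (p = 0.56)
weighted_class = (14 / 30) * gini(0.43) + (16 / 30) * gini(0.56)

print(round(weighted_gender, 2), round(weighted_class, 2))  # ~0.59 vs ~0.51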
Q6. How to control leaf size and pruning?
Answer:
To control the leaf size, we can set the following parameters:
1. Maximum depth :
Maximum tree depth is a limit to stop the further splitting
of nodes when the specified
tree depth has been reached during
the building of the initial
decision tree.
NEVER use maximum depth to limit the further splitting of nodes. In other words: use the largest possible value.
2. Minimum split size:
Minimum split size is a limit to stop the further
splitting of nodes when the number of observations in the node
is lower than the minimum split size.
This is a good way to limit the growth
of the tree. When a leaf contains
too few observations, further splitting will result in overfitting (modeling of noise in the data).
3. Minimum leaf size:
Minimum leaf size is a limit to stop splitting a node when the number of observations in one of the child nodes would be lower than the minimum leaf size.
Pruning is mostly done to reduce the chances of overfitting the tree to the training data and reduce the overall complexity of the tree.
There are two types of pruning: Pre-pruning and Post-pruning.
1. Pre-pruning is also known as early stopping criteria. As the name suggests, the criteria are set as parameter values while building the model. The tree stops growing when it meets any of these pre-pruning criteria, or when it discovers pure classes.
2. In post-pruning, the idea is to allow the decision tree to grow fully and observe the CP value. Next, we prune/cut the tree with the optimal CP (complexity parameter) value as the parameter.
The CP (complexity parameter) is used to control tree growth. If the cost of adding a variable is higher than the value of CP, then tree growth stops.
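A hedged scikit-learn sketch of these controls: tree depth, minimum split size, minimum leaf size, and post-pruning via ccp_alpha, which plays a role analogous to the CP value mentioned above. The parameter values are arbitrary examples, not recommendations.

# Pre-pruning limits and cost-complexity (post-)pruning on a decision tree
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(
    max_depth=5,            # maximum tree depth
    min_samples_split=20,   # minimum split size
    min_samples_leaf=10,    # minimum leaf size
    ccp_alpha=0.01,         # cost-complexity pruning strength
    random_state=0,
).fit(X, y)

print(tree.get_depth(), tree.get_n_leaves())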
Q7. How to handle a decision tree for numerical and categorical data?
Answer:
Decision trees can handle both categorical and numerical variables as features at the same time. There is no problem in doing that.
At every split, the decision tree will take the best variable at that moment, according to an impurity measure over the split branches. Whether the variable used for the split is categorical or continuous is irrelevant (in fact, decision trees handle continuous variables by creating binary regions with a threshold).
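A small sketch of feeding mixed categorical and numerical features to a tree; note that scikit-learn expects categoricals to be encoded first, so a one-hot step is shown. The column names and data are made up.

# Mixed numerical and categorical features, with one-hot encoding for the tree
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "age": [25, 40, 33, 52, 46],          # numerical feature
    "city": ["A", "B", "A", "C", "B"],    # categorical feature
    "bought": [0, 1, 0, 1, 1],            # target
})

X = pd.get_dummies(df[["age", "city"]])   # one-hot encode the categorical column
y = df["bought"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.predict(X))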
Q8. What is the Random Forest Algorithm?
Answer:
Random Forest is an ensemble machine learning algorithm that follows the bagging technique. The base estimators in a random forest are decision trees. A random forest randomly selects a subset of features that are used to decide the best split at each node of the decision tree.
Looking at it step-by-step, this is what a random forest model does:
1. Random subsets are created from the original dataset (bootstrapping).
2. At each node in the decision tree, only a random set of features are considered to decide the best split.
3. A decision tree model is fitted on each of the subsets.
4. The final prediction is obtained by aggregating the predictions from all the trees (majority vote for classification, average for regression).
To sum up, the random forest randomly selects data points and features and builds multiple trees (a forest).
Some important parameters:
1. n_estimators: the number of decision trees to be created in the random forest.
2. criterion: "gini" or "entropy".
3. min_samples_split: the minimum number of samples required in a node before a split is attempted.
4. max_features: the maximum number of features considered for the split in each decision tree.
5. n_jobs: the number of jobs to run in parallel for both fit and predict. Set it to -1 to use all the cores for parallel processing.
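A short sketch wiring up the parameters listed above in scikit-learn; the dataset and the specific values are arbitrary examples.

# Random forest with the parameters discussed above
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=100,        # number of decision trees
    criterion="gini",        # or "entropy"
    min_samples_split=10,    # minimum samples needed to attempt a split
    max_features="sqrt",     # features considered at each split
    n_jobs=-1,               # use all cores
    random_state=0,
).fit(X, y)

print(rf.score(X, y))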
Q9. What is the Bias and Variance tradeoff?
Answer:
In predictive models, the prediction error is composed of two different errors:
1. Bias
2. Variance
It is important to understand the bias-variance trade-off, which is about minimizing both bias and variance in the predictions and avoiding overfitting and underfitting of the model.
Bias: It is the difference between the expected or average prediction of the model and the correct value we are trying to predict. Imagine we build more than one model by collecting different data sets and later evaluate the predictions; we may end up with different predictions for each model. Bias measures how far these model predictions are from the correct prediction. High bias always leads to a high error on both training and test data.
Variance:
Variability of a model prediction for a given data point. We can build the
model multiple times, so the
variance is how much the predictions for a given point vary between
different realizations of the model.
Bias scenarios: using a phonebook to select participants in our survey is one of our sources of bias. By only surveying certain classes of people, it skews the results in a way that will be consistent if we repeat the entire model-building exercise. Similarly, not following up with respondents is another source of bias, as it consistently changes the mixture of responses we get. On our bulls-eye diagram, these move us away from the center of the target, but they would not result in increased scatter of estimates.
Variance scenarios: a small sample size is a source of variance. If we increased our sample size, the results would be more consistent each time we repeated the survey and prediction. The results might still be highly inaccurate due to our large sources of bias, but the variance of the predictions would be reduced.
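For squared-error loss, this trade-off is often summarized by the standard decomposition of the expected prediction error:
Expected error = Bias² + Variance + Irreducible error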
Q10. What are Bagging and Boosting?
Answer:
Decision trees have been around for a long time and are known to suffer from bias and variance. You will have a large bias with simple trees and a large variance with complex trees.
Ensemble methods combine several decision trees to produce better predictive performance than a single decision tree. The main principle behind an ensemble model is that a group of weak learners come together to form a strong learner.
There are two techniques to build ensembles of decision trees:
1. Bagging
2. Boosting
Bagging (Bootstrap Aggregation) is used when our goal is to reduce the variance of a decision tree. The idea is to create several subsets of data from the training sample, chosen randomly with replacement. Each subset of data is then used to train its own decision tree. As a result, we end up with an ensemble of different models. The average of all the predictions from the different trees is used, which is more robust than a single decision tree.
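A minimal sketch of bagging along these lines with scikit-learn; the dataset and the number of estimators are illustrative choices, and the base learner defaults to a decision tree.

# Bagging decision trees: bootstrapped subsets, aggregated predictions
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier

X, y = load_breast_cancer(return_X_y=True)

bag = BaggingClassifier(
    n_estimators=50,     # number of bootstrapped trees
    bootstrap=True,      # sample with replacement
    random_state=0,
).fit(X, y)

print(bag.score(X, y))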
Boosting is another ensemble technique for creating a collection of predictors. In this technique, learners are trained sequentially, with early learners fitting simple models to the data and then analyzing the data for errors. In other words, we fit consecutive trees (on random samples), and at every step the goal is to reduce the net error from the prior tree. When a hypothesis misclassifies an input, its weight is increased so that the next hypothesis is more likely to classify it correctly. Combining the whole set at the end converts the weak learners into a better-performing model.
The different types of boosting algorithms include AdaBoost, Gradient Boosting (GBM), and XGBoost.
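A short sketch of sequential boosting with scikit-learn's AdaBoost, which reweights misclassified samples between rounds as described above; the dataset and parameter values are arbitrary examples.

# AdaBoost: weak learners trained sequentially, misclassified samples up-weighted
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)

boost = AdaBoostClassifier(
    n_estimators=100,     # number of sequential weak learners
    learning_rate=0.5,    # shrinks each learner's contribution
    random_state=0,
).fit(X, y)

print(boost.score(X, y))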