Data Scientist



“Frequently asked questions from Data Scientist job interviews. These professional questions are here to ensure that you can offer strong answers to the questions posed to you. So get prepared for your new job hunt.”



55 Data Scientist Questions And Answers

21⟩ Explain what logistic regression is, or give an example of when you have used logistic regression recently.

Logistic regression, often referred to as the logit model, is a technique for predicting a binary outcome from a linear combination of predictor variables. For example, suppose you want to predict whether a particular political leader will win an election. In this case, the outcome of the prediction is binary, i.e. 0 or 1 (Lose/Win). The predictor variables here would be the amount of money spent on the candidate’s election campaign, the amount of time spent campaigning, etc.
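A minimal sketch of the election example in plain Python. The campaign figures are invented for illustration, and the learning rate and number of iterations are arbitrary assumptions; in practice you would normally reach for a library implementation.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.05, epochs=5000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi                      # gradient of log-loss w.r.t. z
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(x, w, b):
    """Probability of the positive class: sigmoid of the linear combination."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical data: [money spent (millions), months campaigning]; 1 = win, 0 = lose
X = [[1.0, 2.0], [2.0, 1.0], [2.0, 3.0], [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]]
y = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(X, y)
p_low = predict_proba([1.5, 2.0], w, b)    # small campaign
p_high = predict_proba([7.5, 7.0], w, b)   # large campaign
print(round(p_low, 3), round(p_high, 3))
```

The model outputs a probability; thresholding it at 0.5 gives the 0/1 prediction.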


22⟩ Is rotation necessary in PCA? If yes, why? What will happen if you don’t rotate the components?

Yes, rotation (orthogonal) is necessary because it maximizes the difference between the variance captured by each component, which makes the components easier to interpret. That is, after all, the motive for doing PCA: we aim to select fewer components than features while still explaining the maximum variance in the data set. Rotation doesn’t change the relative locations; it only changes the actual coordinates of the points.

If we don’t rotate the components, the effect of PCA will diminish and we’ll have to select more components to explain the variance in the data set.
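The point about relative location versus coordinates can be checked with a small sketch. It assumes a plain 2-D rotation by an arbitrary angle on made-up points, not an actual PCA fit: the coordinates change, but every pairwise distance stays the same.

```python
import math

def rotate(points, angle):
    """Apply a 2-D orthogonal rotation by `angle` radians to every point."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def pairwise_distances(points):
    """Euclidean distance between every pair of points."""
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]

points = [(1.0, 2.0), (3.0, 1.0), (-2.0, 0.5), (0.0, -1.5)]  # made-up data
rotated = rotate(points, math.pi / 6)                         # arbitrary angle

before = pairwise_distances(points)
after = pairwise_distances(rotated)

distances_preserved = all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(before, after))
coordinates_changed = any(not math.isclose(p[0], r[0]) for p, r in zip(points, rotated))
print(distances_preserved, coordinates_changed)
```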


23⟩ Why do you want to work at this company as a data scientist?

The purpose of this question is to determine the motivation behind the candidate's choice of applying and interviewing for the position. Their answer should reveal their inspiration for working for the company and their drive for being a data scientist. It should show the candidate is pursuing the position because they are passionate about data and believe in the company, two elements that can determine the candidate's performance. Answers to look for include:

☛ Interest in data mining

☛ Respect for the company's innovative practices

☛ Desire to apply analytical skills to solve real-world issues with data


24⟩ How would you go about doing an Exploratory Data Analysis (EDA)?

The goal of an EDA is to gather some insights from the data before applying your predictive model, i.e. to gain some information. Basically, you want to do your EDA in a coarse-to-fine manner.

We start by gaining some high-level global insights. Check for imbalanced classes. Look at the mean and variance of each class. Check out the first few rows to see what it’s all about. Run a pandas df.info() to see which features are continuous or categorical, and their type (int, float, string).
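These first checks can be sketched without any library, assuming a tiny made-up table of (feature value, class label) rows. With pandas the same checks would typically go through the DataFrame itself.

```python
from collections import Counter, defaultdict
from statistics import mean, pvariance

# Hypothetical rows: (feature value, class label)
rows = [(5.1, "a"), (4.9, "a"), (6.3, "b"), (5.0, "a"), (6.7, "b"), (5.2, "a")]

# Class balance: a heavily skewed count hints at an imbalanced problem.
class_counts = Counter(label for _, label in rows)

# Mean and (population) variance of the feature within each class.
by_class = defaultdict(list)
for value, label in rows:
    by_class[label].append(value)
summary = {label: (mean(vals), pvariance(vals)) for label, vals in by_class.items()}

print(rows[:3])        # first few rows
print(class_counts)    # rows per class
print(summary)         # per-class (mean, variance)
```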

Next, drop unnecessary columns that won’t be useful in analysis and prediction. These can simply be columns that look useless, ones where many rows have the same value (i.e. they don’t give us much information), or ones that are missing a lot of values. We can also fill in missing values with the most common value in that column, or the median. Now we can start making some basic visualizations. Start with high-level stuff. Do some bar plots for features that are categorical and have a small number of groups. Do bar plots of the final classes. Look at the most “general features”.
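The fill-in step for missing values mentioned above might look like the sketch below. The two columns are invented, and missing entries are assumed to be represented as None.

```python
from collections import Counter
from statistics import median

def fill_with_median(values):
    """Replace None entries in a numeric column with the column median."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def fill_with_most_common(values):
    """Replace None entries in a categorical column with the most frequent value."""
    fill = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

ages = [22, None, 35, 58, None, 41]          # hypothetical numeric column
ports = ["S", "C", None, "S", "Q", "S"]      # hypothetical categorical column

print(fill_with_median(ages))
print(fill_with_most_common(ports))
```

The median is usually a safer default than the mean for skewed numeric columns, since a few extreme values do not move it much.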

Create some visualizations about these individual features to try and gain some basic insights. Now we can start to get more specific.

Create visualizations between features, two or three at a time. How are features related to each other? You can also do a PCA to see which features contain the most information. Group some features together as well to see their relationships. For example, what happens to the classes when A = 0 and B = 0? How about A = 1 and B = 0? Compare different features. For example, if feature A can be either “Female” or “Male” then we can plot feature A against which cabin they stayed in to see if Males and Females stay in different cabins.

Beyond bar, scatter, and other basic plots, we can do a PDF/CDF, overlaid plots, etc. Look at some statistics like distribution, p-value, etc. Finally it’s time to build the ML model. Start with easier stuff like Naive Bayes and Linear Regression. If you see that those perform poorly or the data is highly non-linear, go with polynomial regression, decision trees, or SVMs. The features can be selected based on their importance from the EDA. If you have lots of data you can use a Neural Network. Check the ROC curve, precision, and recall.


26⟩ Do you know the steps in making a decision tree?

☛ Take the entire data set as input.

☛ Look for a split that maximizes the separation of the classes. A split is any test that divides the data into two sets.

☛ Apply the split to the input data (divide step).

☛ Re-apply steps 1 to 2 to the divided data.

☛ Stop when you meet some stopping criteria.

☛ Clean up the tree if you went too far doing splits. This step is called pruning.
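The steps above can be sketched as follows. Several choices here are assumptions rather than part of the answer: Gini impurity as the measure of class separation, threshold tests on numeric features as the split type, and a maximum depth as the stopping criterion. Pruning is left out to keep the sketch short.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 when all labels agree, larger when classes are mixed."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature index, threshold, weighted impurity) of the best binary split."""
    best = (None, None, float("inf"))
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (f, threshold, score)
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively split until the node is pure or max_depth is reached."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or depth == max_depth:
        return majority                       # leaf: predict the majority class
    f, threshold, _ = best_split(rows, labels)
    if f is None:
        return majority                       # no valid split exists
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= threshold]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > threshold]
    return {
        "feature": f, "threshold": threshold,
        "left": build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }

def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(tree, dict):
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree

rows = [[2.0, 1.0], [3.0, 1.5], [6.0, 1.2], [7.0, 0.8]]   # made-up data
labels = ["no", "no", "yes", "yes"]
tree = build_tree(rows, labels)
print(tree)
print(predict(tree, [6.5, 1.0]))
```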


27⟩ What is the significance of Residual Networks?

The main thing that residual connections did was allow for direct feature access from previous layers. This makes information propagation throughout the network much easier. One very interesting paper about this shows how using local skip connections gives the network a type of ensemble multi-path structure, giving features multiple paths to propagate throughout the network.
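A toy sketch of a residual connection, assuming a block made of two small linear layers with a ReLU in between. Real residual blocks use convolutions and normalization, so this only illustrates the skip path.

```python
def relu(v):
    """Element-wise max(0, x)."""
    return [max(0.0, x) for x in v]

def linear(v, weights):
    """Multiply vector v by a square weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]

def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is two linear layers with a ReLU between."""
    fx = linear(relu(linear(x, w1)), w2)
    return relu([f + xi for f, xi in zip(fx, x)])   # skip connection adds x back

x = [1.0, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]

# With all-zero weights F(x) = 0, yet the input still reaches the output
# through the skip connection.
out = residual_block(x, zeros, zeros)
print(out)
```

Because the input is added back unchanged, later layers keep direct access to earlier features even when the learned transformation contributes little.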


28⟩ What is cross-validation?

It is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is prediction and one wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to define a data set to test the model in the training phase (i.e. a validation data set) in order to limit problems like overfitting and gain insight into how the model will generalize to an independent data set.
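One common variant is k-fold cross-validation. The sketch below only produces the train/validation index splits; the function name is hypothetical, and the folds are contiguous with no shuffling, which is a simplifying assumption.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train, val in folds:
    print("train:", train, "validate:", val)
```

Each sample is used for validation exactly once; the model is trained k times and the k validation scores are averaged.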


29⟩ What is root cause analysis?

Root cause analysis was initially developed to analyze industrial accidents but is now widely used in other areas. It is a problem-solving technique used for isolating the root causes of faults or problems. A factor is called a root cause if its removal from the problem-fault sequence prevents the final undesirable event from recurring.


31⟩ How do you work towards a random forest?

The underlying principle of this technique is that several weak learners combine to provide a strong learner. The steps involved are:

☛ Build several decision trees on bootstrapped training samples of data

☛ On each tree, each time a split is considered, a random sample of m predictors is chosen as split candidates, out of all p predictors

☛ Rule of thumb: At each split, m = √p

☛ Predictions: By majority rule (the class most trees vote for)
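The three ingredients above (bootstrapped samples, a random subset of predictors at each split, majority vote) can be sketched as below. To stay short, each "tree" here is a depth-one stump with a single split, and Gini impurity is assumed as the split criterion; a real random forest grows full trees. The data is made up.

```python
import math
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def fit_stump(rows, labels, feature_ids):
    """Best single split, searched only over the given candidate features."""
    majority = Counter(labels).most_common(1)[0][0]
    best = None  # (impurity, feature, threshold, left_label, right_label)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:                       # no valid split: predict the majority
        return lambda row: majority
    _, f, t, left_label, right_label = best
    return lambda row: left_label if row[f] <= t else right_label

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    p = len(rows[0])
    m = max(1, round(math.sqrt(p)))                      # rule of thumb: m = sqrt(p)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # bootstrap sample
        feature_ids = rng.sample(range(p), m)            # random split candidates
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feature_ids))
    return forest

def predict(forest, row):
    """Majority vote over the individual trees."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]

rows = [[1, 1, 1, 1], [2, 1, 2, 2], [1, 2, 2, 1], [2, 2, 1, 2],
        [8, 9, 8, 9], [9, 8, 9, 8], [8, 8, 9, 9], [9, 9, 8, 8]]   # made-up data
labels = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(rows, labels)
print(predict(forest, [1, 1, 1, 1]), predict(forest, [9, 9, 9, 9]))
```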


32⟩ When is Ridge regression favorable over Lasso regression?

You can quote ISLR’s authors Hastie and Tibshirani, who asserted that in the presence of a few variables with medium / large sized effects, you should use lasso regression. In the presence of many variables with small / medium sized effects, use ridge regression.

Conceptually, we can say that lasso regression (L1) does both variable selection and parameter shrinkage, whereas ridge regression (L2) only does parameter shrinkage and ends up including all the coefficients in the model. In the presence of correlated variables, ridge regression might be the preferred choice. Also, ridge regression works best in situations where the least squares estimates have higher variance. Therefore, it depends on our model objective.
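The difference between selection and shrinkage is easiest to see in a simplified special case. The sketch assumes each coefficient can be penalized independently (as with orthonormal predictors) and uses the one-dimensional objectives written in the docstrings; the least squares coefficients and the penalty value are made up.

```python
def ridge_shrink(beta_ols, lam):
    """Minimizer of 0.5*(b - beta_ols)**2 + 0.5*lam*b**2: proportional shrinkage."""
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, lam):
    """Minimizer of 0.5*(b - beta_ols)**2 + lam*abs(b): soft-thresholding."""
    if abs(beta_ols) <= lam:
        return 0.0                       # small effects are removed entirely
    return beta_ols - lam if beta_ols > 0 else beta_ols + lam

ols = [3.0, -0.4, 0.2, 1.5]   # hypothetical least squares coefficients
lam = 0.5                     # hypothetical penalty strength

ridge = [ridge_shrink(b, lam) for b in ols]
lasso = [lasso_shrink(b, lam) for b in ols]
print(ridge)   # every coefficient shrinks, none becomes exactly zero
print(lasso)   # the two small coefficients are set to zero
```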


33⟩ Are expected value and mean value different?

They are not different, but the terms are used in different contexts. Mean is generally referred to when talking about a probability distribution or sample population, whereas expected value is generally referred to in the context of a random variable.
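A small illustration with a fair six-sided die (the number of rolls and the seed are arbitrary): the expected value is computed from the probabilities of the random variable, while the mean is computed from observed data, and the two agree closely for a large sample.

```python
import random
from fractions import Fraction
from statistics import mean

# Expected value of a random variable: probability-weighted sum over outcomes.
faces = [1, 2, 3, 4, 5, 6]
expected_value = sum(Fraction(1, 6) * face for face in faces)   # fair die

# Sample mean: arithmetic average of observed data.
rng = random.Random(42)
rolls = [rng.choice(faces) for _ in range(10_000)]
sample_mean = mean(rolls)

print(float(expected_value), sample_mean)
```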


34⟩ What methods do you use to identify outliers within a data set?

Data scientists must be able to go beyond classroom theoretical applications to real-world applications. Your candidate's answer to this question will show how they allocate their time to finding the best way to detect outliers. This information is important to know because it demonstrates the candidate's analytical skills. Look for answers that include:

☛ Raw data analysis

☛ Models

☛ Approaches


39⟩ How do you overcome challenges to your findings?

The reason for asking this question is to discover how well the candidate approaches solving conflicts in a team environment. Their answer shows the candidate's problem-solving and interpersonal skills in stressful situations. Understanding these skills is significant because group dynamics and business conditions change. Consider answers that:

☛ Encourage discussion

☛ Demonstrate leadership

☛ Acknowledge and respect different opinions


40⟩ What makes CNNs translation invariant?

As explained above, each convolution kernel acts as its own filter/feature detector. So let’s say you’re doing object detection: it doesn’t matter where in the image the object is, since we’re going to apply the convolution in a sliding-window fashion across the entire image anyway.
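A 1-D sketch of the sliding-window idea, with a made-up kernel and signal. Strictly speaking, the sliding window alone makes the response move along with the object; taking a maximum over all positions, as a pooling layer would, is what makes the final value independent of location.

```python
def correlate1d(signal, kernel):
    """Slide the kernel across the signal (valid positions only)."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

kernel = [1, 2, 1]                         # hypothetical feature detector

signal_a = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]  # "object" near the left
signal_b = [0, 0, 0, 0, 0, 0, 1, 2, 1, 0]  # same object shifted right by 4

resp_a = correlate1d(signal_a, kernel)
resp_b = correlate1d(signal_b, kernel)

peak_a = resp_a.index(max(resp_a))
peak_b = resp_b.index(max(resp_b))
print(peak_a, peak_b)              # the peak position moves with the object
print(max(resp_a), max(resp_b))    # the maximum response is the same
```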
