Data Scientist Questions And Answers

21⟩ What is logistic regression? Give an example of when you have used logistic regression recently.

Logistic regression, often referred to as the logit model, is a technique for predicting a binary outcome from a linear combination of predictor variables. For example, suppose you want to predict whether a particular political leader will win an election. The outcome of the prediction is binary, i.e. 1 or 0 (win/lose). The predictor variables here would be the amount of money spent on the candidate's election campaign, the amount of time spent campaigning, etc.
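
As a rough illustration of the election example, here is a minimal sketch that fits a logistic model with plain gradient descent. The toy data, the learning rate, and the helper names (`fit_logistic`, `predict_proba`) are all assumptions made for illustration; in practice you would more likely reach for a library implementation.

```python
import math

def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability of the positive class (win) for one candidate."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

def fit_logistic(X, y, lr=0.05, epochs=20000):
    """Batch gradient descent on the average log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xij in enumerate(xi):
                grad_w[j] += err * xij
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Hypothetical toy data: [campaign spend (millions), campaign time (months)]
X = [[1.0, 2.0], [2.0, 1.0], [2.0, 3.0], [6.0, 5.0], [7.0, 6.0], [8.0, 5.0]]
y = [0, 0, 0, 1, 1, 1]  # 1 = win, 0 = lose

w, b = fit_logistic(X, y)
p_low = predict_proba(w, b, [1.5, 2.0])   # small campaign
p_high = predict_proba(w, b, [7.0, 5.5])  # large campaign
```

The model outputs a probability rather than a hard label; thresholding it at 0.5 turns it into the win/lose prediction.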

22⟩ Is rotation necessary in PCA? If yes, why? What will happen if you don’t rotate the components?

Rotation is not strictly required, but an orthogonal rotation (such as varimax) is usually recommended because it maximizes the differences between the loadings within each component, so that each component loads strongly on a few features and weakly on the rest. This makes the components easier to interpret. Not to forget, that’s the motive of doing PCA, where we aim to select fewer components (than features) that can explain the maximum variance in the data set. Rotation doesn’t change the relative location of the points; it only changes their actual coordinates.

If we don’t rotate the components, the retained components still explain the same total variance, but each one tends to load on many features at once, so the components are harder to interpret.
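
The geometric point about rotation can be checked numerically. The sketch below uses made-up 2-D points and a hand-picked 45 degree angle (both assumptions) to show that an orthogonal rotation leaves distances and total variance unchanged while changing how the variance is divided between the axes. It does not implement varimax or a full PCA.

```python
import math

def rotate(points, angle):
    """Apply an orthogonal rotation to 2-D points."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def axis_variances(points):
    return variance([p[0] for p in points]), variance([p[1] for p in points])

# Hypothetical points lying close to the diagonal y = x
points = [(2.0, 1.9), (1.0, 1.1), (-1.0, -0.9), (-2.0, -2.1)]
rotated = rotate(points, math.radians(-45))  # first axis now follows the diagonal

var_before = axis_variances(points)
var_after = axis_variances(rotated)
dist_before = math.dist(points[0], points[3])
dist_after = math.dist(rotated[0], rotated[3])
```

Before the rotation the variance is split almost evenly between the two axes; afterwards nearly all of it sits on the first axis, which is the effect that makes a small number of components sufficient.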

23⟩ Why do you want to work at this company as a data scientist?

The purpose of this question is to determine the motivation behind the candidate's choice of applying and interviewing for the position. Their answer should reveal their inspiration for working for the company and their drive for being a data scientist. It should show the candidate is pursuing the position because they are passionate about data and believe in the company, two elements that can determine the candidate's performance. Answers to look for include:

☛ Interest in data mining

☛ Respect for the company's innovative practices

☛ Desire to apply analytical skills to solve real-world issues with data

24⟩ How would you go about doing an Exploratory Data Analysis (EDA)?

The goal of an EDA is to gather insights from the data before applying your predictive model. In general, you want to do your EDA in a coarse-to-fine manner.

We start by gaining some high-level, global insights. Check for imbalanced classes. Look at the mean and variance of each class. Check out the first few rows to see what the data is all about. Run a pandas df.info() to see which features are continuous or categorical and what their types are (int, float, string).
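
The first checks above (class balance, per-class mean and variance) can be sketched as follows. The rows are invented toy data, and the example sticks to the standard library so it runs anywhere; with pandas you would typically use a DataFrame for the same summary.

```python
from collections import Counter, defaultdict
from statistics import mean, pvariance

# Hypothetical toy rows: (feature value, class label)
rows = [(5.1, "A"), (4.9, "A"), (5.0, "A"), (6.3, "B"), (5.8, "A"), (6.1, "B")]

# How many rows does each class have?
class_counts = Counter(label for _, label in rows)
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

# Mean and variance of the feature within each class
by_class = defaultdict(list)
for value, label in rows:
    by_class[label].append(value)

summary = {
    label: {"mean": mean(values), "variance": pvariance(values)}
    for label, values in by_class.items()
}
```

A large `imbalance_ratio` is an early warning that plain accuracy will be a misleading metric later on.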

Next, drop unnecessary columns that won’t be useful in analysis and prediction. These can simply be columns that look useless, ones where many rows have the same value (i.e. they don’t give us much information), or ones that are missing a lot of values. We can also fill in missing values with the most common value in that column, or the median. Now we can start making some basic visualizations. Start with high-level stuff. Do some bar plots for features that are categorical and have a small number of groups. Make bar plots of the final classes. Look at the most “general features”.
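
The cleaning rules in the paragraph above can be sketched like this. The table, the column names, and the 50% missing-value threshold (`MAX_MISSING`) are assumptions chosen for illustration, not recommended defaults.

```python
from collections import Counter
from statistics import median

# Hypothetical toy table: column name -> values, with None marking a missing value
table = {
    "age": [22, None, 35, 58, None, 41],
    "port": ["S", "C", None, "S", "S", "Q"],
    "country": ["X", "X", "X", "X", "X", "X"],    # same value everywhere
    "deck": [None, None, None, "B", None, None],  # mostly missing
}

MAX_MISSING = 0.5  # assumed threshold for dropping a column

def missing_fraction(values):
    return sum(v is None for v in values) / len(values)

def is_constant(values):
    return len({v for v in values if v is not None}) <= 1

def fill_missing(values):
    """Use the median for numeric columns and the most common value otherwise."""
    present = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = median(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

cleaned = {
    name: fill_missing(values)
    for name, values in table.items()
    if missing_fraction(values) <= MAX_MISSING and not is_constant(values)
}
```

Here `country` and `deck` are dropped, while the gaps in `age` and `port` are filled.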

Create some visualizations of these individual features to try to gain some basic insights. Now we can start to get more specific.

Create visualizations between features, two or three at a time. How are features related to each other? You can also do a PCA to see which combinations of features carry the most variance. Group some features together as well to see their relationships. For example, what happens to the classes when A = 0 and B = 0? How about A = 1 and B = 0? Compare different features. For example, if feature A can be either “Female” or “Male”, then we can plot feature A against which cabin they stayed in to see if males and females stay in different cabins.
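
The "what happens to the classes when A = 0 and B = 0?" question amounts to grouping rows by feature combination and comparing class rates. A minimal sketch, using invented rows with two binary features and a binary class label:

```python
from collections import defaultdict

# Hypothetical rows: (A, B, class label)
rows = [
    (0, 0, 0), (0, 0, 0), (0, 0, 1),
    (1, 0, 1), (1, 0, 1), (1, 0, 0),
    (1, 1, 1), (0, 1, 0),
]

# Collect the class labels seen for each (A, B) combination
groups = defaultdict(list)
for a, b, label in rows:
    groups[(a, b)].append(label)

# Share of positive-class rows within each combination
positive_rate = {key: sum(labels) / len(labels) for key, labels in sorted(groups.items())}
```

Plotting `positive_rate` as a grouped bar chart is one way to turn this table into the kind of visualization described above.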

Beyond bar, scatter, and other basic plots, we can do a PDF/CDF, overlaid plots, etc. Look at some statistics like distributions, p-values, etc. Finally, it’s time to build the ML model. Start with simpler models like Naive Bayes and linear regression. If those perform poorly or the data is highly non-linear, go with polynomial regression, decision trees, or SVMs. The features can be selected based on their importance from the EDA. If you have lots of data, you can use a neural network. Check the ROC curve, precision, and recall.

25⟩ Why do segmentation CNNs typically have an encoder-decoder style / structure?

The encoder CNN can be thought of as a feature extraction network, while the decoder uses that information to predict the image segments by “decoding” the features and upscaling to the original image size.
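
To make the shape bookkeeping concrete, here is a toy sketch in which max pooling stands in for the encoder and nearest-neighbour upsampling stands in for the decoder. Real segmentation networks use learned convolutions on both sides; the function names, the 4x4 image, and the threshold of 5 are assumptions for illustration only.

```python
def max_pool_2x2(grid):
    """Encoder stand-in: halve height and width, keeping the strongest value per 2x2 block."""
    return [
        [max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
         for c in range(0, len(grid[0]), 2)]
        for r in range(0, len(grid), 2)
    ]

def upsample_2x(grid):
    """Decoder stand-in: nearest-neighbour upsampling doubles height and width."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

# Hypothetical 4x4 image with a bright object in the top-left corner
image = [
    [9, 8, 0, 0],
    [7, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

features = max_pool_2x2(image)    # compact 2x2 feature map
restored = upsample_2x(features)  # back to the original 4x4 size
mask = [[1 if v > 5 else 0 for v in row] for row in restored]
```

The resulting `mask` has one label per input pixel, which is the property a segmentation output needs and the reason the decoder has to undo the encoder's downsampling.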

26⟩ What are the steps in making a decision tree?

☛ Take the entire data set as input.

☛ Look for a split that maximizes the separation of the classes. A split is any test that divides the data into two sets.

☛ Apply the split to the input data (divide step).

☛ Re-apply steps 1 to 3 to each of the divided subsets.

☛ Stop when you meet some stopping criteria.

☛ Clean up the tree if you went too far doing splits. This step is called pruning.
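
The steps above can be sketched for a single numeric feature as follows. Gini impurity as the split criterion, the `max_depth` stopping rule, and the toy data are assumptions; pruning is left out to keep the example short.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows):
    """rows: list of (feature_value, label). Return (threshold, weighted_gini) or None."""
    best = None
    values = sorted({x for x, _ in rows})
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in rows if x <= t]
        right = [y for x, y in rows if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or score < best[1]:
            best = (t, score)
    return best

def build_tree(rows, depth=0, max_depth=3):
    labels = [y for _, y in rows]
    majority = Counter(labels).most_common(1)[0][0]
    split = best_split(rows)
    # Stopping criteria: pure node, depth limit reached, or nothing left to split on
    if gini(labels) == 0.0 or depth >= max_depth or split is None:
        return majority
    t, _ = split
    left = [r for r in rows if r[0] <= t]    # divide step
    right = [r for r in rows if r[0] > t]
    return {
        "threshold": t,
        "left": build_tree(left, depth + 1, max_depth),    # re-apply to each subset
        "right": build_tree(right, depth + 1, max_depth),
    }

def predict(tree, x):
    while isinstance(tree, dict):
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree

# Hypothetical toy data: (feature value, class label)
data = [(1.0, "no"), (2.0, "no"), (3.0, "no"), (6.0, "yes"), (7.0, "yes"), (8.0, "yes")]
tree = build_tree(data)
```

On this data a single split is enough, so the recursion stops after one level because both children are pure.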

27⟩ What is the significance of Residual Networks?

The main thing that residual connections did was allow for direct feature access from previous layers. This makes information propagation throughout the network much easier. One very interesting paper about this shows how using local skip connections gives the network a type of ensemble multi-path structure, giving features multiple paths to propagate throughout the network.
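
A residual connection simply adds a block's input to its output, y = x + F(x). The sketch below uses plain Python lists and a small two-layer F with made-up weights (all assumptions) to show that the input always has a direct path through the block.

```python
def linear(weights, x):
    """Matrix-vector product with the weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def relu(x):
    return [max(0.0, v) for v in x]

def residual_block(x, w1, w2):
    """Return x + F(x), where F is linear -> ReLU -> linear."""
    fx = linear(w2, relu(linear(w1, x)))
    return [xi + fi for xi, fi in zip(x, fx)]

x = [1.0, -2.0, 3.0]

# With all-zero weights F(x) is zero, so the block passes x through unchanged
zeros = [[0.0] * 3 for _ in range(3)]
identity_out = residual_block(x, zeros, zeros)

# With hypothetical non-zero weights the block adds a correction on top of x
w = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]
out = residual_block(x, w, w)
```

Because the identity path is always present, each block only has to learn a correction to its input, and earlier features remain directly accessible to later layers.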

28⟩ What is cross-validation?

Cross-validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is prediction and one wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to set aside a data set to test the model during the training phase (i.e. a validation data set) in order to limit problems like overfitting and to gain insight into how the model will generalize to an independent data set.
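
One common variant is k-fold cross-validation, where every sample is used for validation exactly once. The sketch below only produces the index splits; the function name `k_fold_indices` is an assumption, and shuffling (which you would usually do first) is omitted for clarity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 3))
```

A model is then trained on each `train` part and scored on the matching `validation` part, and the k scores are averaged to estimate how well it generalizes.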

29⟩ What is root cause analysis?

Root cause analysis was initially developed to analyze industrial accidents but is now widely used in other areas. It is a problem-solving technique used for isolating the root causes of faults or problems. A factor is called a root cause if its removal from the problem-fault sequence prevents the final undesirable event from recurring.

30⟩ What are the drawbacks of the linear model?

Some drawbacks of the linear model are:

☛ It assumes a linear relationship between the predictors and the response, and it makes strong assumptions about the errors.

☛ It can’t be used directly for count outcomes or binary outcomes.

☛ There are overfitting problems that it can’t solve on its own.
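
The problem with binary outcomes is easy to see numerically: a straight line fitted to 0/1 labels produces "probabilities" below 0 and above 1. A minimal sketch with invented data and a hand-written least squares fit:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Hypothetical data with a binary outcome
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

a, b = fit_line(xs, ys)
pred_low = a + b * 0     # prediction at x = 0
pred_high = a + b * 10   # prediction at x = 10
```

Here `pred_low` is negative and `pred_high` exceeds 1, so neither can be read as a probability; this is the gap that models such as logistic regression are designed to fill.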

31⟩ How do you build a random forest?

The underlying principle of this technique is that several weak learners combine to provide a strong learner. The steps involved are:

☛ Build several decision trees on bootstrapped training samples of the data.

☛ On each tree, each time a split is considered, a random sample of m predictors is chosen as split candidates, out of all p predictors.

☛ Rule of thumb: at each split, m ≈ √p.

☛ Predictions: by majority rule (each tree votes and the most common class wins).
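
The steps above can be sketched in a few lines. To stay short, each "tree" here is a one-split stump, so the random feature subset is drawn once per tree rather than at every split of a deep tree; that simplification, the Gini criterion, the toy data, and the function names are all assumptions.

```python
import math
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y, feature_ids):
    """One-split tree: best (feature, threshold) among the candidate features."""
    best = None
    for j in feature_ids:
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t, majority(left), majority(right))
    if best is None:  # no split possible: always predict the majority class
        m = majority(y)
        return (0, float("inf"), m, m)
    return best[1:]

def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    p = len(X[0])
    m = max(1, round(math.sqrt(p)))  # rule of thumb: m is about sqrt(p)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # bootstrap sample, with replacement
        features = rng.sample(range(p), m)        # random subset of predictors
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], features))
    return forest

def predict(forest, row):
    votes = [left if row[j] <= t else right for j, t, left, right in forest]
    return majority(votes)  # majority rule

# Hypothetical toy data with p = 4 predictors
X = [[1, 1, 1, 1], [2, 1, 2, 1], [1, 2, 1, 2], [2, 2, 2, 2],
     [8, 9, 8, 9], [9, 8, 9, 8], [8, 8, 8, 8], [9, 9, 9, 9]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(X, y)
```

Using an odd number of trees avoids tied votes in a two-class problem.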

32⟩ When is Ridge regression favorable over Lasso regression?

You can quote ISLR authors Hastie and Tibshirani, who asserted that, in the presence of a few variables with medium / large sized effects, you should use lasso regression. In the presence of many variables with small / medium sized effects, use ridge regression.

Conceptually, we can say that lasso regression (L1) does both variable selection and parameter shrinkage, whereas ridge regression (L2) only does parameter shrinkage and ends up including all the coefficients in the model. In the presence of correlated variables, ridge regression might be the preferred choice. Also, ridge regression works best in situations where the least squares estimates have high variance. Therefore, it depends on our model objective.
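
The difference between shrinkage and selection shows up clearly in the simple special case where the design matrix is the identity, so each coefficient can be treated on its own. Under that assumption, ridge (penalty λ·β²) scales every least squares estimate by 1 / (1 + λ), while lasso (penalty λ·|β|) soft-thresholds it by λ / 2. The coefficient values below are made up for illustration.

```python
def ridge_shrink(beta_ols, lam):
    """Ridge in the identity-design case: proportional shrinkage, never exactly zero."""
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, lam):
    """Lasso in the identity-design case: soft-thresholding at lam / 2."""
    magnitude = max(abs(beta_ols) - lam / 2.0, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if beta_ols > 0 else -magnitude

# Hypothetical least squares estimates: two large effects, two small ones
ols = [3.0, -0.4, 0.2, -2.0]
lam = 1.0

ridge = [ridge_shrink(b, lam) for b in ols]
lasso = [lasso_shrink(b, lam) for b in ols]
```

Ridge keeps all four coefficients (each one halved), whereas lasso sets the two small ones to exactly zero, which is the variable selection behaviour described above.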

33⟩ Are expected value and mean value different?

They are not different, but the terms are used in different contexts. Mean is generally used when talking about a probability distribution or a sample population, whereas expected value is generally used in the context of a random variable.
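
A fair die makes the two usages concrete: the expected value is computed from the probabilities, while the mean is computed from observed rolls. The seed and the number of rolls below are arbitrary choices.

```python
import random
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]

# Expected value of the random variable: sum of x * P(x), with P(x) = 1/6
expected_value = sum(face * Fraction(1, 6) for face in faces)

# Mean of a sample drawn from that distribution
rng = random.Random(42)
rolls = [rng.choice(faces) for _ in range(10_000)]
sample_mean = sum(rolls) / len(rolls)
```

The expected value is exactly 7/2, and the sample mean gets closer to it as the number of rolls grows.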

34⟩ What methods do you use to identify outliers within a data set?

Data scientists must be able to go beyond classroom theoretical applications to real-world applications. Your candidate's answer to this question will show how they allocate their time to finding the best way to detect outliers. This information is important to know because it demonstrates the candidate's analytical skills. Look for answers that include:

☛ Raw data analysis

☛ Models

☛ Approaches

35⟩ How do you know which Machine Learning model you should use?

While one should always keep the “no free lunch theorem” in mind, there are some general guidelines.

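36⟩ What is power analysis?

Power analysis is an experimental design technique for determining the sample size required to detect an effect of a given size with a given level of confidence. Conversely, it can be used to estimate the probability of detecting that effect with a given sample size.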
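
As a worked example, here is a sketch of one standard calculation: the per-group sample size for a two-sided comparison of two means, using the normal approximation n = 2 · ((z₁₋α/₂ + z_power) / d)², where d is the standardized effect size. The choice of this particular test, the defaults (α = 0.05, power = 0.80), and the function name are assumptions.

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided two-sample test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

n_medium = sample_size_per_group(0.5)  # medium standardized effect
n_small = sample_size_per_group(0.2)   # small standardized effect
```

Smaller effects need far more data: halving the effect size roughly quadruples the required sample size.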

37⟩ How are True Positive Rate and Recall related?

True Positive Rate = Recall. Yes, they are equal, both having the formula TP / (TP + FN).
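
The formula can be checked on a small example. The labels and predictions below are invented, and precision is included only for contrast.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fn, fp, tn

# Hypothetical labels and predictions
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
recall = tp / (tp + fn)     # the same quantity as the true positive rate
precision = tp / (tp + fp)  # a different denominator: predicted positives
```

Recall divides by the actual positives (TP + FN), which is exactly the definition of the true positive rate.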

38⟩ Which technique is used to predict categorical responses?

Classification techniques are used to predict categorical responses; they are widely used in data mining for classifying data sets.

39⟩ How do you overcome challenges to your findings?

The reason for asking this question is to discover how well the candidate approaches solving conflicts in a team environment. Their answer shows the candidate's problem-solving and interpersonal skills in stressful situations. Understanding these skills is significant because group dynamics and business conditions change. Consider answers that:

☛ Encourage discussion

☛ Demonstrate leadership

☛ Acknowledge and respect different opinions

40⟩ What makes CNNs translation invariant?

As explained above, each convolution kernel acts as its own filter/feature detector. So let’s say you’re doing object detection: it doesn’t matter where in the image the object is, since we’re going to apply the convolution in a sliding-window fashion across the entire image anyway.
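
A 1-D toy version shows the sliding-window idea. The kernel and signals are made up; the same pattern is placed at two different positions and the detector's strongest response is compared.

```python
def correlate_1d(signal, kernel):
    """Slide the kernel across the signal and record the response at each position."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

kernel = [1, 2, 1]   # hypothetical feature detector
pattern = [1, 2, 1]  # the "object" it responds to

left = pattern + [0] * 5    # object near the start
right = [0] * 5 + pattern   # same object near the end

resp_left = correlate_1d(left, kernel)
resp_right = correlate_1d(right, kernel)
```

The peak response has the same strength in both cases and only its position moves with the object. Taking the maximum over positions, as pooling layers do, then gives an output that does not depend on where the object was.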
