Data Scientist



“Frequently asked questions posed by interviewers in various Data Scientist job interviews. These professional questions are here to ensure that you can offer perfect answers to the questions posed to you. So get prepared for your job hunt.”



55 Data Scientist Questions And Answers

1⟩ Explain me what is data normalization and why do we need it?

I felt this one would be important to highlight. Data normalization is a very important preprocessing step, used to rescale values into a specific range to ensure better convergence during backpropagation. In general, it boils down to subtracting the mean of each feature and dividing by its standard deviation. If we don’t do this, some of the features (those with high magnitude) will be weighted more heavily in the cost function: if a higher-magnitude feature changes by 1%, that change is big in absolute terms, while for smaller features it is quite insignificant. Normalization makes all features weighted equally.
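
A minimal sketch of this standardization in Python with NumPy (the toy matrix X is hypothetical):

```python
import numpy as np

# Hypothetical feature matrix: rows are samples, columns are features.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Standardize each feature: subtract its mean, divide by its standard deviation.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_norm.mean(axis=0))  # approximately 0 for every feature
print(X_norm.std(axis=0))   # exactly 1 for every feature
```

After this, both columns contribute on the same scale to the cost function.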


2⟩ Tell me how can outlier values be treated?

Outlier values can be identified using univariate or other graphical analysis methods. If there are only a few outliers, they can be assessed individually, but for a large number of outliers the values can be substituted with either the 99th or the 1st percentile values. Note that not all extreme values are outlier values. The most common ways to treat outlier values (a percentile-capping sketch follows the list) are:

☛ 1) To change the value and bring it within a range

☛ 2) To just remove the value.
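
A minimal sketch of the percentile-substitution approach in Python with NumPy (the generated data is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=1000)
values[:5] = [500, -400, 900, -300, 700]  # inject extreme values

# Cap everything at the 1st and 99th percentiles (winsorizing),
# the substitution strategy described above for many outliers.
low, high = np.percentile(values, [1, 99])
treated = np.clip(values, low, high)
```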


3⟩ Can you differentiate between univariate, bivariate and multivariate analysis?

These are descriptive statistical analysis techniques which can be differentiated based on the number of variables involved at a given point of time. For example, the pie charts of sales based on territory involve only one variable and can be referred to as univariate analysis.

If the analysis attempts to understand the difference between two variables at a time, as in a scatterplot, then it is referred to as bivariate analysis. For example, analysing sales volume and spending together can be considered an example of bivariate analysis.

Analysis that deals with the study of more than two variables to understand the effect of variables on the responses is referred to as multivariate analysis.
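
A minimal illustration of the three levels of analysis in Python with pandas (the sales data frame is hypothetical):

```python
import pandas as pd

# Hypothetical sales data.
df = pd.DataFrame({
    "territory": ["North", "South", "North", "East", "South"],
    "sales":     [120, 95, 140, 80, 110],
    "spending":  [30, 25, 35, 20, 28],
    "visits":    [12, 9, 15, 7, 11],
})

# Univariate: one variable at a time (e.g. the sales pie chart by territory).
print(df["sales"].describe())

# Bivariate: the relationship between two variables, as in a scatterplot.
print(df[["sales", "spending"]].corr())

# Multivariate: more than two variables and their joint relationships.
print(df[["sales", "spending", "visits"]].corr())
```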


4⟩ Tell me how is kNN different from kmeans clustering?

Don’t get misled by the ‘k’ in their names. You should know that the fundamental difference between these algorithms is that k-means is unsupervised in nature and kNN is supervised in nature: k-means is a clustering algorithm, while kNN is a classification (or regression) algorithm.

The k-means algorithm partitions a data set into clusters such that each cluster is homogeneous and the points within it are close to each other. The algorithm tries to maintain enough separation between these clusters. Because it is unsupervised, the resulting clusters have no labels.
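
A minimal side-by-side sketch in Python with scikit-learn (the toy points and labels are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 2], [1, 4], [8, 8], [9, 10]])
y = np.array([0, 0, 1, 1])  # labels are only needed for the supervised kNN

# k-means: unsupervised, ignores y and discovers cluster assignments itself.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)

# kNN: supervised, trains on labeled data and predicts a class for new points.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[2, 3]]))
```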


5⟩ Tell me why is resampling done?

Resampling is done in any of these cases (a bootstrap sketch follows the list):

☛ Estimating the accuracy of sample statistics by using subsets of accessible data or drawing randomly with replacement from a set of data points

☛ Substituting labels on data points when performing significance tests

☛ Validating models by using random subsets (bootstrapping, cross validation)
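
A minimal bootstrap sketch in Python with NumPy (the sample itself is hypothetical), estimating the accuracy of a sample statistic by drawing randomly with replacement:

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(loc=100, scale=15, size=200)  # hypothetical observed data

# Bootstrap: approximate the sampling distribution of the mean by
# repeatedly resampling the observed data with replacement.
boot_means = [rng.choice(sample, size=sample.size, replace=True).mean()
              for _ in range(5000)]

# A 95% confidence interval for the mean from the bootstrap distribution.
low, high = np.percentile(boot_means, [2.5, 97.5])
print(low, high)
```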


7⟩ Tell us how do you clean up and organize big data sets?

Data scientists frequently have to combine large amounts of information from various devices in several formats, such as data from a smartwatch or cellphone. Answers to this question will demonstrate your candidate's methods for organizing large data sets. This information is important to know because data scientists need clean data to analyze information accurately and offer recommendations that solve business problems. Possible answers may include (a small cleaning sketch follows the list):

☛ Automation tools

☛ Value correction methods

☛ Comprehension of data sets
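
A minimal pandas sketch of value correction and de-duplication on merged device data (the readings are hypothetical):

```python
import pandas as pd

# Hypothetical raw readings combined from a smartwatch and a phone.
raw = pd.DataFrame({
    "device":    ["watch", "phone", "watch", "phone", "phone"],
    "heartrate": ["72", "68", "n/a", "75", "75"],
    "timestamp": ["2023-01-01", "2023-01-01", "2023-01-02",
                  "2023-01-02", "2023-01-02"],
})

clean = (
    raw
    .assign(heartrate=pd.to_numeric(raw["heartrate"], errors="coerce"),
            timestamp=pd.to_datetime(raw["timestamp"]))
    .dropna(subset=["heartrate"])  # value correction: drop unparseable readings
    .drop_duplicates()             # remove duplicates created by the merge
)
print(clean)
```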


8⟩ Tell us how do you identify a barrier to performance?

This question will determine how the candidate approaches solving real-world issues they will face in their role as a data scientist. It will also determine how they approach problem-solving from an analytical standpoint. This information is vital to understand because data scientists must have strong analytical and problem-solving skills. Look for answers that reveal:

Examples of problem-solving methods

Steps to take to identify the barriers to performance

Benchmarks for assessing performance

"My approach to determining performance bottlenecks is to conduct a performance test. I then evaluate the performance based on criteria set by the lead data scientist or company and discuss my findings with my team lead and group."


9⟩ Tell us why do we have max-pooling in classification CNNs?

Again, as you would expect, this is for a role in Computer Vision. Max-pooling in a CNN allows you to reduce computation since your feature maps are smaller after the pooling. You don’t lose too much semantic information since you’re keeping the maximum activation. There’s also a theory that max-pooling contributes a bit to giving CNNs more translation invariance.
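
A minimal 2x2 max-pooling sketch in Python with NumPy (the feature map is hypothetical):

```python
import numpy as np

# A toy 4x4 feature map.
fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 2],
                 [7, 2, 9, 5],
                 [3, 1, 4, 8]])

# 2x2 max-pooling with stride 2: keep the maximum activation in each
# window, halving each spatial dimension (and downstream computation).
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[6 2]
               #  [7 9]]
```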


10⟩ Tell me how do you handle missing or corrupted data in a dataset?

You could find missing/corrupted data in a dataset and either drop those rows or columns, or decide to replace them with another value.

In Pandas, there are two very useful methods: isnull() and dropna(), which will help you find columns of data with missing or corrupted values and drop them. If you want to fill the invalid values with a placeholder value (for example, 0), you could use the fillna() method.
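
A minimal sketch of all three methods in Python with pandas (the data frame is hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31], "score": [88, 92, np.nan]})

print(df.isnull().sum())  # count missing values per column
print(df.dropna())        # drop rows that contain missing values
print(df.fillna(0))       # or replace missing values with a placeholder
```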


11⟩ What is dimensionality reduction, where is it used, and what are its benefits?

Dimensionality reduction is the process of reducing the number of feature variables under consideration by obtaining a set of principal variables, which are basically the important features. The importance of a feature depends on how much the feature variable contributes to the information representation of the data, and also on which technique you decide to use. Deciding which technique to use comes down to trial-and-error and preference. It’s common to start with a linear technique and move to non-linear techniques when results suggest an inadequate fit.

Benefits of dimensionality reduction for a data set may be (a PCA sketch follows the list):

(1) Reduce the storage space needed

(2) Speed up computation (for example in machine learning algorithms); fewer dimensions mean less computing, and fewer dimensions can allow the use of algorithms unfit for a large number of dimensions

(3) Remove redundant features, for example there is no point in storing a terrain’s size in both square meters and square miles (perhaps the data gathering was flawed)

(4) Reducing a data set’s dimensionality to 2D or 3D may allow us to plot and visualize it, perhaps observe patterns, and gain insights

(5) Too many features or too complex a model can lead to overfitting.
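
A minimal sketch of a linear technique, PCA, in Python with scikit-learn (the random 10-feature data set is hypothetical):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # hypothetical data set with 10 features

# Project onto the 2 principal components that capture the most variance,
# e.g. to enable 2D plotting or to shrink storage and computation.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                     # (100, 2)
print(pca.explained_variance_ratio_)  # variance retained per component
```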


12⟩ Do you know what is the goal of A/B Testing?

It is statistical hypothesis testing for a randomized experiment with two variants, A and B. The goal of A/B testing is to identify changes to a web page that maximize or increase an outcome of interest. An example of this could be identifying the click-through rate for a banner ad.
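
A minimal sketch of evaluating such an experiment with a two-proportion z-test in Python (the click and impression counts are hypothetical; statsmodels is assumed to be available):

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical banner-ad clicks and impressions for variants A and B.
clicks = [200, 260]          # clicks per variant
impressions = [5000, 5000]   # visitors shown each variant

# Two-sided z-test for a difference in click-through rates.
stat, p_value = proportions_ztest(clicks, impressions)
print(p_value)  # a small p-value suggests the variants really differ
```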


13⟩ Explain me what tools or devices help you succeed in your role as a data scientist?

This question's purpose is to learn the programming languages and applications the candidate knows and has experience using. The answer will show whether the candidate needs additional training in basic programming languages and platforms, or has transferable skills. This is vital to understand, as it can cost more time and money to train the candidate if they are not knowledgeable in all of the languages and applications required for the position. Answers to look for include:

☛ Experience in SAS and R programming

☛ Understanding of Python, PHP or Java programming languages

☛ Experience using data visualization tools

"I believe I can excel in this position with my R, Python, and SQL programming skill set. I enjoy working on the FUSE and Tableau platforms to mine data and draw inferences."


17⟩ Explain me why data cleaning plays a vital role in analysis?

Cleaning data from multiple sources to transform it into a format that data analysts or data scientists can work with is a cumbersome process because, as the number of data sources increases, the time taken to clean the data increases exponentially with the number of sources and the volume of data they generate. Cleaning can take up to 80% of the time, making it a critical part of the analysis task.


18⟩ Tell us why do we use convolutions for images rather than just FC layers?

This one was pretty interesting since it’s not something companies usually ask. As you would expect, I got this question from a company focused on Computer Vision. This answer has two parts to it. Firstly, convolutions preserve, encode, and actually use the spatial information from the image; if we used only FC layers we would have no relative spatial information. Secondly, Convolutional Neural Networks (CNNs) have a partially built-in translation invariance, since each convolution kernel acts as its own filter/feature detector.
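
A minimal back-of-the-envelope comparison in Python of parameter counts for a toy 32x32x3 input (the layer sizes are hypothetical), illustrating why weight sharing makes convolutions so much cheaper:

```python
# Conv layer: 16 filters of size 3x3 over 3 input channels; the weights
# are shared across all spatial positions (plus one bias per filter).
conv_params = 16 * (3 * 3 * 3) + 16   # 448 parameters

# FC layer: flatten the 32x32x3 image and connect every pixel to 16
# units; no weight sharing and no spatial structure preserved.
fc_params = (32 * 32 * 3) * 16 + 16   # 49,168 parameters

print(conv_params, fc_params)
```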


19⟩ Tell us how has your prior experience prepared you for a role in data science?

This question helps determine the candidate's experience from a holistic perspective and reveals experience in demonstrating interpersonal, communication and technical skills. It is important to understand this because data scientists must be able to communicate their findings, work in a team environment and have the skills to perform the task. Here are some possible answers to look for:

☛ Project management skills

☛ Examples of working in a team environment

☛ Ability to identify errors


20⟩ What is star schema?

It is a traditional database schema with a central table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve several layers of summarization to retrieve information faster.
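
A minimal illustration in Python with pandas, emulating a star-schema join of a fact table with its lookup tables (all tables and column names are hypothetical):

```python
import pandas as pd

# Central fact table: compact IDs plus the measures.
fact_sales = pd.DataFrame({
    "product_id": [1, 2, 1],
    "store_id":   [10, 10, 20],
    "amount":     [250, 120, 300],
})

# Satellite lookup tables: map IDs to names and descriptions.
dim_product = pd.DataFrame({"product_id": [1, 2],
                            "name": ["Widget", "Gadget"]})
dim_store = pd.DataFrame({"store_id": [10, 20],
                          "city": ["Austin", "Denver"]})

# Join on the ID fields to recover the readable descriptions.
report = (fact_sales
          .merge(dim_product, on="product_id")
          .merge(dim_store, on="store_id"))
print(report)
```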
