Data Scientist



“Frequently asked questions posed by interviewers in various Data Scientist job interviews. These professional questions are here to ensure that you can offer perfect answers to the questions posed to you. So get prepared for your job hunt.”



55 Data Scientist Questions And Answers

1⟩ Explain me what is data normalization and why do we need it?

I felt this one would be important to highlight. Data normalization is a very important preprocessing step, used to rescale values into a specific range to ensure better convergence during backpropagation. In general, it boils down to subtracting the mean of each feature and dividing by its standard deviation. If we don’t do this, some of the features (those with high magnitude) will be weighted more heavily in the cost function: if a higher-magnitude feature changes by 1%, that change is big in absolute terms, while for smaller features it is quite insignificant. Normalization makes all features weighted equally.
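
A minimal sketch of this standardization in Python with NumPy (the toy matrix X is hypothetical):

```python
import numpy as np

# Hypothetical feature matrix: rows are samples, columns are features.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Standardize each feature: subtract its mean, divide by its standard deviation.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_norm.mean(axis=0))  # approximately 0 for every feature
print(X_norm.std(axis=0))   # exactly 1 for every feature
```

After this, both columns contribute on the same scale to the cost function.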


2⟩ Tell me how can outlier values be treated?

Outlier values can be identified using univariate or other graphical analysis methods. If there are only a few outliers, they can be assessed individually, but for a large number of outliers the values can be substituted with either the 99th or the 1st percentile values. Note that not all extreme values are outlier values. The most common ways to treat outlier values (a percentile-capping sketch follows the list) are:

☛ 1) To change the value and bring it within a range

☛ 2) To just remove the value.
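
A minimal sketch of the percentile-substitution approach in Python with NumPy (the generated data is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=1000)
values[:5] = [500, -400, 900, -300, 700]  # inject extreme values

# Cap everything at the 1st and 99th percentiles (winsorizing),
# the substitution strategy described above for many outliers.
low, high = np.percentile(values, [1, 99])
treated = np.clip(values, low, high)
```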


3⟩ Can you differentiate between univariate, bivariate and multivariate analysis?

These are descriptive statistical analysis techniques which can be differentiated based on the number of variables involved at a given point of time. For example, the pie charts of sales based on territory involve only one variable and can be referred to as univariate analysis.

If the analysis attempts to understand the difference between two variables at a time, as in a scatterplot, then it is referred to as bivariate analysis. For example, analysing sales volume and spending together can be considered an example of bivariate analysis.

Analysis that deals with the study of more than two variables to understand the effect of variables on the responses is referred to as multivariate analysis.
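
A minimal illustration of the three levels of analysis in Python with pandas (the sales data frame is hypothetical):

```python
import pandas as pd

# Hypothetical sales data.
df = pd.DataFrame({
    "territory": ["North", "South", "North", "East", "South"],
    "sales":     [120, 95, 140, 80, 110],
    "spending":  [30, 25, 35, 20, 28],
    "visits":    [12, 9, 15, 7, 11],
})

# Univariate: one variable at a time (e.g. the sales pie chart by territory).
print(df["sales"].describe())

# Bivariate: the relationship between two variables, as in a scatterplot.
print(df[["sales", "spending"]].corr())

# Multivariate: more than two variables and their joint relationships.
print(df[["sales", "spending", "visits"]].corr())
```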


4⟩ Tell me how is kNN different from kmeans clustering?

Don’t get misled by the ‘k’ in their names. You should know that the fundamental difference between these algorithms is that k-means is unsupervised in nature and kNN is supervised in nature: k-means is a clustering algorithm, while kNN is a classification (or regression) algorithm.

The k-means algorithm partitions a data set into clusters such that each cluster is homogeneous and the points within it are close to each other. The algorithm tries to maintain enough separation between these clusters. Because it is unsupervised, the resulting clusters have no labels.
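
A minimal side-by-side sketch in Python with scikit-learn (the toy points and labels are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 2], [1, 4], [8, 8], [9, 10]])
y = np.array([0, 0, 1, 1])  # labels are only needed for the supervised kNN

# k-means: unsupervised, ignores y and discovers cluster assignments itself.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)

# kNN: supervised, trains on labeled data and predicts a class for new points.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[2, 3]]))
```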


5⟩ Tell me why is resampling done?

Resampling is done in any of these cases (a bootstrap sketch follows the list):

☛ Estimating the accuracy of sample statistics by using subsets of accessible data or drawing randomly with replacement from a set of data points

☛ Substituting labels on data points when performing significance tests

☛ Validating models by using random subsets (bootstrapping, cross validation)
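
A minimal bootstrap sketch in Python with NumPy (the sample itself is hypothetical), estimating the accuracy of a sample statistic by drawing randomly with replacement:

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(loc=100, scale=15, size=200)  # hypothetical observed data

# Bootstrap: approximate the sampling distribution of the mean by
# repeatedly resampling the observed data with replacement.
boot_means = [rng.choice(sample, size=sample.size, replace=True).mean()
              for _ in range(5000)]

# A 95% confidence interval for the mean from the bootstrap distribution.
low, high = np.percentile(boot_means, [2.5, 97.5])
print(low, high)
```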


7⟩ Tell us how do you clean up and organize big data sets?

Data scientists frequently have to combine large amounts of information from various devices in several formats, such as data from a smartwatch or cellphone. Answers to this question will demonstrate your candidate's methods for organizing large data sets. This information is important to know because data scientists need clean data to analyze information accurately and offer recommendations that solve business problems. Possible answers may include (a small cleaning sketch follows the list):

☛ Automation tools

☛ Value correction methods

☛ Comprehension of data sets
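
A minimal pandas sketch of value correction and de-duplication on merged device data (the readings are hypothetical):

```python
import pandas as pd

# Hypothetical raw readings combined from a smartwatch and a phone.
raw = pd.DataFrame({
    "device":    ["watch", "phone", "watch", "phone", "phone"],
    "heartrate": ["72", "68", "n/a", "75", "75"],
    "timestamp": ["2023-01-01", "2023-01-01", "2023-01-02",
                  "2023-01-02", "2023-01-02"],
})

clean = (
    raw
    .assign(heartrate=pd.to_numeric(raw["heartrate"], errors="coerce"),
            timestamp=pd.to_datetime(raw["timestamp"]))
    .dropna(subset=["heartrate"])  # value correction: drop unparseable readings
    .drop_duplicates()             # remove duplicates created by the merge
)
print(clean)
```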


8⟩ Tell us how do you identify a barrier to performance?

This question will determine how the candidate approaches solving real-world issues they will face in their role as a data scientist. It will also determine how they approach problem-solving from an analytical standpoint. This information is vital to understand because data scientists must have strong analytical and problem-solving skills. Look for answers that reveal:

Examples of problem-solving methods

Steps to take to identify the barriers to performance

Benchmarks for assessing performance

"My approach to determining performance bottlenecks is to conduct a performance test. I then evaluate the performance based on criteria set by the lead data scientist or company and discuss my findings with my team lead and group."


9⟩ Tell us why do we have max-pooling in classification CNNs?

Again, as you would expect, this is for a role in Computer Vision. Max-pooling in a CNN allows you to reduce computation since your feature maps are smaller after the pooling. You don’t lose too much semantic information since you’re keeping the maximum activation. There’s also a theory that max-pooling contributes a bit to giving CNNs more translation invariance.
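
A minimal 2x2 max-pooling sketch in Python with NumPy (the feature map is hypothetical):

```python
import numpy as np

# A toy 4x4 feature map.
fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 2],
                 [7, 2, 9, 5],
                 [3, 1, 4, 8]])

# 2x2 max-pooling with stride 2: keep the maximum activation in each
# window, halving each spatial dimension (and downstream computation).
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[6 2]
               #  [7 9]]
```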


10⟩ Tell me how do you handle missing or corrupted data in a dataset?

You could find missing/corrupted data in a dataset and either drop those rows or columns, or decide to replace them with another value.

In Pandas, there are two very useful methods: isnull() and dropna(), which will help you find columns of data with missing or corrupted values and drop them. If you want to fill the invalid values with a placeholder value (for example, 0), you could use the fillna() method.
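
A minimal sketch of all three methods in Python with pandas (the data frame is hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31], "score": [88, 92, np.nan]})

print(df.isnull().sum())  # count missing values per column
print(df.dropna())        # drop rows that contain missing values
print(df.fillna(0))       # or replace missing values with a placeholder
```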


11⟩ What is dimensionality reduction, where is it used, and what are its benefits?

Dimensionality reduction is the process of reducing the number of feature variables under consideration by obtaining a set of principal variables, which are basically the important features. The importance of a feature depends on how much the feature variable contributes to the information representation of the data, and also on which technique you decide to use. Deciding which technique to use comes down to trial-and-error and preference. It’s common to start with a linear technique and move to non-linear techniques when results suggest an inadequate fit.

Benefits of dimensionality reduction for a data set may be (a PCA sketch follows the list):

(1) Reduce the storage space needed

(2) Speed up computation (for example in machine learning algorithms); fewer dimensions mean less computing, and fewer dimensions can allow the use of algorithms unfit for a large number of dimensions

(3) Remove redundant features, for example there is no point in storing a terrain’s size in both square meters and square miles (perhaps the data gathering was flawed)

(4) Reducing a data set’s dimensionality to 2D or 3D may allow us to plot and visualize it, perhaps observe patterns, and gain insights

(5) Too many features or too complex a model can lead to overfitting.
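
A minimal sketch of a linear technique, PCA, in Python with scikit-learn (the random 10-feature data set is hypothetical):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # hypothetical data set with 10 features

# Project onto the 2 principal components that capture the most variance,
# e.g. to enable 2D plotting or to shrink storage and computation.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                     # (100, 2)
print(pca.explained_variance_ratio_)  # variance retained per component
```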


12⟩ Do you know what is the goal of A/B Testing?

It is statistical hypothesis testing for a randomized experiment with two variants, A and B. The goal of A/B testing is to identify changes to a web page that maximize or increase an outcome of interest. An example of this could be identifying the click-through rate for a banner ad.
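
A minimal sketch of evaluating such an experiment with a two-proportion z-test in Python (the click and impression counts are hypothetical; statsmodels is assumed to be available):

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical banner-ad clicks and impressions for variants A and B.
clicks = [200, 260]          # clicks per variant
impressions = [5000, 5000]   # visitors shown each variant

# Two-sided z-test for a difference in click-through rates.
stat, p_value = proportions_ztest(clicks, impressions)
print(p_value)  # a small p-value suggests the variants really differ
```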


13⟩ Explain me what tools or devices help you succeed in your role as a data scientist?

This question's purpose is to learn the programming languages and applications the candidate knows and has experience using. The answer will show whether the candidate needs additional training in basic programming languages and platforms, or has transferable skills. This is vital to understand, as it can cost more time and money to train the candidate if they are not knowledgeable in all of the languages and applications required for the position. Answers to look for include:

☛ Experience in SAS and R programming

☛ Understanding of Python, PHP or Java programming languages

☛ Experience using data visualization tools

"I believe I can excel in this position with my R, Python, and SQL programming skill set. I enjoy working on the FUSE and Tableau platforms to mine data and draw inferences."


17⟩ Explain me why data cleaning plays a vital role in analysis?

Cleaning data from multiple sources to transform it into a format that data analysts or data scientists can work with is a cumbersome process because, as the number of data sources increases, the time taken to clean the data increases exponentially with the number of sources and the volume of data they generate. Cleaning can take up to 80% of the time, making it a critical part of the analysis task.


18⟩ Tell us why do we use convolutions for images rather than just FC layers?

This one was pretty interesting since it’s not something companies usually ask. As you would expect, I got this question from a company focused on Computer Vision. This answer has two parts to it. Firstly, convolutions preserve, encode, and actually use the spatial information from the image; if we used only FC layers we would have no relative spatial information. Secondly, Convolutional Neural Networks (CNNs) have a partially built-in translation invariance, since each convolution kernel acts as its own filter/feature detector.
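
A minimal back-of-the-envelope comparison in Python of parameter counts for a toy 32x32x3 input (the layer sizes are hypothetical), illustrating why weight sharing makes convolutions so much cheaper:

```python
# Conv layer: 16 filters of size 3x3 over 3 input channels; the weights
# are shared across all spatial positions (plus one bias per filter).
conv_params = 16 * (3 * 3 * 3) + 16   # 448 parameters

# FC layer: flatten the 32x32x3 image and connect every pixel to 16
# units; no weight sharing and no spatial structure preserved.
fc_params = (32 * 32 * 3) * 16 + 16   # 49,168 parameters

print(conv_params, fc_params)
```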


19⟩ Tell us how has your prior experience prepared you for a role in data science?

This question helps determine the candidate's experience from a holistic perspective and reveals experience in demonstrating interpersonal, communication and technical skills. It is important to understand this because data scientists must be able to communicate their findings, work in a team environment and have the skills to perform the task. Here are some possible answers to look for:

☛ Project management skills

☛ Examples of working in a team environment

☛ Ability to identify errors


20⟩ What is star schema?

It is a traditional database schema with a central table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve several layers of summarization to retrieve information faster.
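
A minimal illustration in Python with pandas, emulating a star-schema join of a fact table with its lookup tables (all tables and column names are hypothetical):

```python
import pandas as pd

# Central fact table: compact IDs plus the measures.
fact_sales = pd.DataFrame({
    "product_id": [1, 2, 1],
    "store_id":   [10, 10, 20],
    "amount":     [250, 120, 300],
})

# Satellite lookup tables: map IDs to names and descriptions.
dim_product = pd.DataFrame({"product_id": [1, 2],
                            "name": ["Widget", "Gadget"]})
dim_store = pd.DataFrame({"store_id": [10, 20],
                          "city": ["Austin", "Denver"]})

# Join on the ID fields to recover the readable descriptions.
report = (fact_sales
          .merge(dim_product, on="product_id")
          .merge(dim_store, on="store_id"))
print(report)
```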
