
22 machine learning interview questions to hire the best candidates

With tools like ChatGPT trending, artificial intelligence (AI) is becoming popular across sectors such as finance, retail, healthcare, and marketing. Machine learning is a branch of artificial intelligence and computer science. It uses data to imitate the way humans learn, gradually improving its accuracy over time.

With higher demand for AI and machine learning engineers, recruiters and hiring managers are focused on finding the right talent to join the team. For that, it is crucial to deliver a great candidate experience and to properly assess the skills of the candidate.

If you are looking to hire the perfect machine learning engineer for your team and want to automate the pre-screening of candidates, you can always use job simulations. They use your specific job requirements to simulate the tech environment of the role. You can then combine them with interviews and cultural-fit assessments to get the whole picture of the candidate.

Candidates who perform well in these skill assessments understand the concepts of machine learning and computer science. Other skills are assessed as well, such as problem solving, deep learning, statistics, and neural networks.


If you are going to do a technical interview to evaluate a machine learning engineer, you should test their knowledge on the concepts that are necessary to perform well at the job.

If you are not considering using one of our job simulations, and want to do a traditional technical interview, we compiled a list of 22 questions and answers to find the perfect machine learning engineer. You can also understand the reasoning behind each answer, so you are even better equipped to do the technical interview.

For a fast and easy assessment of tech candidates, we divided the questions according to the difficulty level:

1. Questions to assess a junior machine learning engineer 

2. Questions to assess a mid-level machine learning engineer

3. Questions to assess a senior machine learning engineer

-> Tip: you can also use these machine learning questions as a skill assessment before the interview. Just remove the explanations before you share them with the candidates.


Question 1. Check the data points between X and Y in the graph below. If you train a linear regression model to the data, what would be the mean absolute error of this training data set?


Option A) ~0  Option B) Infinity
Option C) -1   Option D) ~1

Correct answer: (A)

Explanation: This question evaluates a junior machine learning engineer's linear regression skills in more depth. Since the data points appear to lie on a straight line, a linear regression model would fit the data almost perfectly, leaving almost no residual error.


Question 2. You are evaluating the performance of your k-Nearest Neighbor algorithm and need to choose the optimal value for k. According to the graph below, what k do you choose?

Option A) Option B) 10
Option C) 15   Option D) 25

Correct answer: (B)

Explanation: This question assesses knowledge of the k-nearest neighbors algorithm and model improvement, both important skills for a junior machine learning engineer. The optimal value of k lies where the error rate is smallest, so around k ≈ 10.


Question 3. Imagine you train two binary classification models. To compare their performance, you calculate the True Positive Rate and the False Positive Rate as a function of the cutoff threshold, see below. Which model performs better?


Option A) Random Forest  Option B) Logistic Regression
Option C) Both perform equally Option D) Can't tell based on the plotted data.

Correct answer: (A)

Explanation: With this question you are evaluating logistic regression and model evaluation. A plot of the TPR against the FPR across cutoff thresholds is known as the "receiver operating characteristic" (ROC) curve. The optimum point on the ROC curve is the one closest to the top left corner, where the TPR is 1 and the FPR is 0, because it maximizes the TPR while keeping the FPR at a minimal value. The closer the ROC curve runs to the top left corner, the better the model is at distinguishing between positive and negative samples.
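To make the comparison concrete, here is a minimal sketch of how an ROC curve and the area under it (AUC) can be computed from predicted scores. The labels and the two score lists are made-up values for illustration, not the data behind the plot in the question.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the cutoff threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step over the distinct scores.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical ground truth and scores from two hypothetical models.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores_a = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
scores_b = [0.9, 0.4, 0.7, 0.8, 0.3, 0.6, 0.55, 0.2]

auc_a = auc(roc_points(labels, scores_a))
auc_b = auc(roc_points(labels, scores_b))
print(auc_a, auc_b)  # the curve closer to the top left corner has the larger area
```

A larger AUC summarizes the same idea as "closer to the top left corner" in a single number, which is convenient when the curves are hard to tell apart by eye.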


Question 4. The three scatter plots show a binary classification problem in a 2-dimensional feature space (x1, x2). The lines mark the decision boundaries resulting from three different non-linear logistic regression models. Which of the three models shows the highest variance?


Option A) A Option B) B
Option C) C Option D) Can't say

Correct answer: (C)

Explanation: This question shows whether the applicant understands data classification and model evaluation.
The correct answer is C since high variance means your model is overfitting: it fits the training data (almost) perfectly but does not generalize well to new data. Out of the three examples, C fits the data basically perfectly, hence showing the highest variance. Example B underfits the data, therefore showing low variance, and example A sits in between.


Question 5. Which is the most appropriate activation function in the output layer of a deep learning model used for binary classification?

Option A) ReLU Option B) Sigmoid
Option C) Softmax Option D) Don't know

Correct answer: (B)

Explanation: If you ask this question to your junior machine learning engineer candidates, you are evaluating their knowledge about deep neural networks.
The sigmoid function maps any input to a value between 0 and 1, which can be interpreted as the probability of the positive class. The output of the sigmoid function can then be thresholded to obtain the final binary prediction.
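As a small illustration, the sketch below applies the sigmoid to a raw output value (logit) and thresholds the result. The 0.5 threshold and the function name `predict` are assumptions chosen for the example.

```python
import math


def sigmoid(z):
    """Map any real-valued input to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(logit, threshold=0.5):
    """Return the positive-class probability and the thresholded class label."""
    probability = sigmoid(logit)
    return probability, int(probability >= threshold)


for logit in (-2.0, 0.0, 2.0):
    print(logit, predict(logit))
```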

Question 6. Your graph has v vertices and e edges, is connected and has no cycles. Select the correct statement.

Option A) v=e Option B) v=e+1
Option C) v+1=e Option D) Can't say

Correct answer: (B)

Explanation: This helps you to assess an important skill for a mid-level machine learning engineer: graph theory.
In a connected acyclic graph (i.e. a tree), the number of vertices v is equal to the number of edges e plus one. You can use induction on the number of vertices in the tree. For a tree with one vertex, there are no edges, so e = v - 1 is true. Assume that the statement is true for all trees with up to k vertices, and consider a tree with k + 1 vertices. Choose any leaf (a vertex with degree 1) in the tree, and remove it along with its incident edge. The resulting graph is a tree with k vertices, and by the induction hypothesis it has k-1 edges. Adding the removed edge back in gives a tree with k+1 vertices and k edges, so e = v - 1 is true for this tree as well.
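The relation can also be checked on small examples. The sketch below is one possible way to test whether a graph is a tree (connected and acyclic) using a union-find structure; the helper name `is_tree` and the example graphs are made up for illustration.

```python
def is_tree(num_vertices, edges):
    """Return True if the undirected graph is connected and has no cycles."""
    parent = list(range(num_vertices))

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return False  # this edge would close a cycle
        parent[root_a] = root_b

    # Connected means every vertex ends up in the same component.
    return len({find(x) for x in range(num_vertices)}) == 1


path = [(0, 1), (1, 2), (2, 3)]                # v = 4, e = 3, so v = e + 1
print(is_tree(4, path))                        # a tree
print(is_tree(4, path + [(3, 0)]))             # e = v: one edge too many, a cycle appears
print(is_tree(4, [(0, 1), (2, 3)]))            # e = v - 2: too few edges, disconnected
```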


Question 7. Select the correct statement(s) about K-fold Cross-Validation. [multiple solutions possible]

Option A) By increasing the number K of groups to split up the data set, the run-time of the process is increased. Option B) Lower K values result in lower confidence on the overall cross-validation result.
Option C) If the number of groups K is chosen to be the same as observations N in the data set, you perform Leave-One-Out Cross-Validation. Option D) Can't say

Correct answer: (A), (B), and (C)

Explanation: Now you focus on important skills for a mid-level machine learning engineer such as model improvement and model training. Here's an explanation of why A), B), and C) are correct:

      • A) Increasing the number of folds (K) in K-fold cross-validation increases the run-time of the process, as the data is split into more parts and each part is used as a validation set once, so the model is trained K times.
      • B) With a lower number of folds, each fold represents a larger proportion of the data and fewer performance estimates are averaged, leading to a higher variance in the estimated model performance.
      • C) In Leave-One-Out Cross-Validation (LOOCV), each observation is used exactly once as the validation set, and the remaining data is used for training.
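The splitting scheme described above can be sketched in a few lines. This is a simplified version without shuffling, and the function name `k_fold_splits` is an assumption for the example, not a library API.

```python
def k_fold_splits(num_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(num_samples))
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [num_samples // k + (1 if i < num_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


splits = list(k_fold_splits(6, 3))  # 3 folds: 3 training runs, 2 validation samples each
loocv = list(k_fold_splits(6, 6))   # K = N: 6 training runs, 1 validation sample each
print(len(splits), len(loocv))
```

Each yielded pair corresponds to one training run, which is why the run-time grows with K.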


Question 8. Imagine you are training your linear regression model via gradient descent. In the graph below, the loss function is plotted as a function of the number of iterations for various learning rates. What could be a feasible relation of the learning rates (lr)?


Option A) lr_blue < lr_black < lr_orange < lr_green  Option B) lr_blue > lr_green > lr_orange > lr_black
Option C) lr_orange = lr_green = lr_blue = lr_black Option D) Can't say without knowing the data set.

Correct answer: (B)

Explanation: By asking this to your candidate, you are assessing their knowledge regarding model training, model improvement, and linear regression.
Finding the right learning rate takes trial and error, as it depends heavily on your problem and data set. A very high learning rate most probably prevents your model from predicting anything properly: the steps are too coarse to approach the optimum, so the loss can start spiking after a couple of iterations (blue). A very low learning rate usually moves toward a minimum, but so slowly that training becomes computationally costly (black). The best choice lies in between, where the optimum is approached quickly and the loss saturates after not too many iterations (orange). The green example shows a learning rate that is still too high to approach the minimum, so the loss stays stuck at a similar value.
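The four behaviours can be reproduced on a toy problem. The sketch below runs gradient descent on a one-parameter quadratic loss L(w) = w², a stand-in for the regression loss in the question; the learning rate values and the starting point are assumptions picked to produce the four regimes, not values read from the plot.

```python
def loss_curve(lr, steps=20, w=5.0):
    """Record L(w) = w**2 at each iteration of gradient descent (dL/dw = 2w)."""
    losses = []
    for _ in range(steps):
        losses.append(w ** 2)
        w -= lr * 2 * w  # gradient descent update
    return losses


diverging = loss_curve(lr=1.1)   # too high: the loss grows every step ("blue")
stuck = loss_curve(lr=1.0)       # still too high: w flips sign, loss never drops ("green")
good = loss_curve(lr=0.3)        # converges within a few iterations ("orange")
slow = loss_curve(lr=0.01)       # heads to the minimum, but very slowly ("black")

for name, curve in [("diverging", diverging), ("stuck", stuck), ("good", good), ("slow", slow)]:
    print(name, curve[0], curve[-1])
```

In this toy setting the ordering of the learning rates matches option B: the diverging rate is the largest and the slowly converging one is the smallest.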


Question 9. Below you see two data sets (graph A) and (graph B) on a linear upward trend with the data points randomly distributed within a defined range. Both data sets share the same regression line (blue) yet differ in population variance. What can you say about the sum of residuals in both cases?


Option A) The sum of residuals is much bigger in A than in B.  Option B) The sum of residuals is much bigger in B than in A.
Option C) Both have basically the same sum of residuals. Option D) Can't say.

Correct answer: (C)

Explanation: You are now evaluating the knowledge about regression.
C is the right answer since both data sets have positive and negative errors relative to the fit that cancel each other out overall, leaving a sum of residuals close to zero in both cases.
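A quick numerical check illustrates the cancellation. The sketch below fits a least-squares line to two synthetic data sets that differ only in their spread; the slope, intercept, noise ranges, and random seed are made-up values, and each data set is fitted separately here rather than sharing one line as in the graphs.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def residuals(xs, ys):
    """Signed differences between the observed values and the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


rng = random.Random(0)
xs = [float(i) for i in range(50)]
narrow = [2 * x + 1 + rng.uniform(-1, 1) for x in xs]    # small spread
wide = [2 * x + 1 + rng.uniform(-10, 10) for x in xs]    # large spread

for name, ys in [("narrow", narrow), ("wide", wide)]:
    res = residuals(xs, ys)
    print(name, "sum:", round(sum(res), 6), "sum of absolute values:", round(sum(abs(r) for r in res), 2))
```

The signed sums are both essentially zero, while the sums of absolute residuals differ clearly, which is why the spread alone does not change the answer.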


Question 10. The plot below shows two features A and B describing a binary classification problem (dots are class 1, crosses are class 0). Using a decision tree algorithm, you can split the data based on feature A (horizontal axis) or feature B (vertical axis). Here, a vertical split is applied where smaller values for A result in class 1, and larger values for A result in class 0. What is the accuracy of this decision's classification?


Option A) 100% Option B) 16.7%
Option C) 33.3% Option D) 83.3%

Correct answer: (D)

Explanation: The skills being evaluated are data classification, and decision trees.
The vertical split correctly classifies three dots (1) to the left of the line and two crosses (0) to the right of it; only one cross (0) is misclassified. Hence, the accuracy is (6-1)/6 = 5/6 = 83.3%.
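The same arithmetic can be written as a short check. The feature values and the split position below are hypothetical, chosen only so that the counts match the explanation (three dots and one cross to the left of the split, two crosses to the right).

```python
# Hypothetical values of feature A; 1 = dot, 0 = cross.
feature_a = [1.0, 2.0, 3.0, 2.5, 5.0, 6.0]
labels = [1, 1, 1, 0, 0, 0]

split = 4.0  # assumed position of the vertical split
predictions = [1 if a < split else 0 for a in feature_a]

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
print(f"{correct}/{len(labels)} = {accuracy:.1%}")
```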


Question 11. When training a deep neural network, what action can be applied to reduce overfitting? [multiple solutions possible]

Option A) Addition of regularization. Option B) Reduction of training data.
Option C) Reduction of network complexity. Option D) Utilization of data augmentation.

Correct answer: (A), (C), and (D)

Explanation: This question evaluates knowledge of deep neural network development. Below you can see why A), C), and D) are correct:

      • A) Regularization techniques, such as L1 and L2 regularization, add penalties to the loss function for large weights, which discourages the model from learning overly complex representations.
      • C) Reducing the number of neurons or layers in the network can help prevent overfitting by limiting the capacity of the model to memorize the training data.
      • D) Data augmentation involves creating new training examples by applying random transformations to the original training data, increasing the size of the training set and reducing overfitting.




Question 17. In the following picture, the red box marks a region of significant discontinuity. What type of discontinuity are we talking about?


Option A) Depth Option B) Surface color
Option C) Illumination Option D) Distinct

Correct answer: (A)

Explanation: In this first question to assess more senior candidates, you are checking their convolutional neural networks skills.
Depth discontinuity refers to a sharp change in the depth or distance of objects in a scene. In computer vision and image processing, depth discontinuity is often used to detect boundaries between foreground and background objects, or to distinguish between objects at different distances from the camera. 


Question 18. Your decision tree algorithm is applied to the problem of predicting whether a car is fast or slow, and it can split on different categorical features as shown in the plot below. Which feature shows the highest information gain?


Option A) Cylinder Option B) Car colour
Option C) Horsepower Option D) Manufacturer continent

Correct answer: (A)

Explanation: Here you are assessing the knowledge in model evaluation, and decision trees.
In a tree-based algorithm such as decision trees, the feature with the highest information gain is typically selected as the root node or the top splitting feature for each node in the tree. Information gain measures the reduction in entropy or the increase in the purity of the target variable after splitting the data on a particular feature. By selecting the feature with the highest information gain, the algorithm aims to maximize the reduction in entropy, leading to the best possible splits and more accurate predictions. In this example, only cylinder can separate at least one branch into purely one classification (slow for V6) whereas all branches of all other features have mixed classifications (both slow and fast).
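Information gain is easy to compute by hand for small data sets. The sketch below uses a tiny made-up car data set (not the one from the plot) in which the cylinder feature produces one pure branch and the colour feature produces none.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - weighted


# Hypothetical data: every V6 car is slow, the V8 branch is mixed.
speed = ["fast", "fast", "fast", "slow", "slow", "slow", "fast", "slow"]
cylinder = ["V8", "V8", "V8", "V6", "V6", "V6", "V8", "V8"]
colour = ["red", "blue", "red", "blue", "red", "blue", "blue", "red"]

gain_cylinder = information_gain(cylinder, speed)
gain_colour = information_gain(colour, speed)
print(round(gain_cylinder, 3), round(gain_colour, 3))
```

In this toy data the colour split leaves both branches as mixed as the parent node, so its gain is zero, while the pure V6 branch gives the cylinder feature a clearly positive gain.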


Question 19. When training decision trees, the information gain can be calculated. It measures how much information a feature provides about the dependent variable. Another measure is the gain ratio, the ratio between the information gain and the intrinsic value of a feature. For categorical features, when is it better to use information gain rather than the gain ratio?

Option A) When the categorical feature has a high cardinality. Option B) When the categorical feature has a low cardinality.
Option C) Cardinality of a categorical feature should not influence the choice of the right measure. Option D) Can't say

Correct answer: (B)

Explanation: With this question you check the knowledge regarding decision trees.
If the cardinality of a feature is low (e.g., a binary feature), then the information gain is likely to work well as a feature selection criterion, as the intrinsic information content of the feature is relatively low.
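The effect of cardinality can be seen in a small sketch. Here the intrinsic value is taken to be the entropy of the feature's own value distribution, and the data (a unique car ID versus a binary feature, here called `turbo`) is made up for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def gain_and_ratio(feature_values, labels):
    """Return (information gain, gain ratio) for splitting the labels on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    gain = entropy(labels) - sum(len(g) / total * entropy(g) for g in groups.values())
    intrinsic_value = entropy(feature_values)  # assumes the feature takes at least two values
    return gain, gain / intrinsic_value


# Hypothetical data: both features separate the classes perfectly.
speed = ["fast", "fast", "fast", "fast", "slow", "slow", "slow", "slow"]
car_id = list(range(8))                                           # high cardinality: one value per row
turbo = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]      # low cardinality: binary

gain_id, ratio_id = gain_and_ratio(car_id, speed)
gain_turbo, ratio_turbo = gain_and_ratio(turbo, speed)
print(gain_id, ratio_id)
print(gain_turbo, ratio_turbo)
```

Information gain rates both features equally, even though the ID column cannot generalize; the gain ratio penalizes it through its large intrinsic value. For the binary feature the intrinsic value is 1 bit here, so both measures agree and plain information gain is sufficient.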


Question 20. The three scatter plots show a binary classification problem in a 2-dimensional feature space (x1, x2). The lines mark the decision boundaries resulting from three different non-linear logistic regression models. Check the statements that are true! [multiple solutions possible]

Option A) The training accuracy is maximum in plot C. Option B) The best model for the classification problem at hand is C as it shows zero training error.
Option C) Model A is the most applicable to new, unseen data. Option D) Model C is overfitting compared to A & B.

Correct answer: (A), (C), and (D)

Explanation: You are now assessing model evaluation knowledge.
High variance means your model is overfitting, therefore fitting the training data (almost) perfectly but not generalizing well enough on new data. Out of the three examples, C fits the data basically perfectly, hence showing the highest variance. Example B underfits the data therefore showing low variance and example A sits in between showing the best generalization of the model to new data. 


Question 21. The scatter plot below shows a binary classification problem (circle-1, plus-0) in a 2-dimensional feature space (x1, x2) where both features can only take binary values as well. Can you create a decision boundary with a logistic regression model that leads to 100% accuracy?

Option A) Yes Option B) No
Option C) Need more information. Option D) Don't know.

Correct answer: (B)

Explanation: A logistic regression model provides no decision boundary that cleanly separates the classes here. Non-linear algorithms, such as tree-based models or SVMs, could be applied instead.
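The sketch below illustrates the limitation under a loudly stated assumption: the plot is not reproduced here, so the four points are assumed to follow an XOR-like layout, where opposite corners share a class. A logistic regression on the raw features has a linear decision boundary w1·x1 + w2·x2 + b = 0, so the code simply searches a grid of such boundaries for the best achievable accuracy.

```python
from itertools import product

# Assumed XOR-like layout (not taken from the original plot).
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]


def accuracy(w1, w2, b):
    """Accuracy of the linear decision rule w1*x1 + w2*x2 + b >= 0."""
    predictions = [int(w1 * x1 + w2 * x2 + b >= 0) for x1, x2 in points]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Coarse grid of candidate weights and biases between -2 and 2.
grid = [i / 4 for i in range(-8, 9)]
best = max(accuracy(w1, w2, b) for w1, w2, b in product(grid, repeat=3))
print(best)  # 0.75: at least one point is always misclassified
```

A grid search is not a proof, but it matches the known result that XOR-style data is not linearly separable.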

Question 22. Which statement is correct about AdaGrad, an optimization algorithm commonly applied in training machine learning models?

Option A) AdaGrad applies first order differentiation. Option B) AdaGrad applies second order differentiation.
Option C) AdaGrad applies dynamic order differentiation choosing the degree based on the problem at hand. Option D) Can't say

Correct answer: (A)

Explanation: Assess model training and algorithms with this question.
AdaGrad applies first order differentiation because it uses the gradient information to adjust the learning rates of individual parameters. The gradient is a first-order derivative of the loss function with respect to the model parameters, and it represents the direction of maximum increase of the loss function.
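A minimal sketch of the AdaGrad update shows that only first-order gradients are involved. The toy loss, the starting point, and the hyperparameters (`lr`, `steps`, `eps`) are assumptions chosen for illustration.

```python
import math


def adagrad(grad_fn, params, lr=0.5, steps=200, eps=1e-8):
    """Minimize a function with AdaGrad, using only its first-order gradient."""
    params = list(params)
    accumulated = [0.0] * len(params)  # running sum of squared gradients per parameter
    for _ in range(steps):
        grads = grad_fn(params)
        for i, g in enumerate(grads):
            accumulated[i] += g * g
            # Each parameter gets its own effective learning rate.
            params[i] -= lr * g / (math.sqrt(accumulated[i]) + eps)
    return params


def grad(w):
    """Gradient of the toy loss L(w) = w0**2 + 10 * w1**2."""
    return [2 * w[0], 20 * w[1]]


result = adagrad(grad, [3.0, -2.0])
print(result)  # both parameters end up close to the minimum at (0, 0)
```

No second derivatives (Hessian) appear anywhere in the update; the per-parameter scaling comes purely from the history of squared first-order gradients.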


Machine learning and artificial intelligence are tech skills that are increasingly in demand in the job market. If you are a recruiter or a talent specialist looking to find the right candidates, our job simulations help you deliver a great candidate experience while evaluating the tech skills of the applicants.

With our job simulations tailored to your job ad, you can automate the pre-screening process and reduce time, cost, and bias while assessing tech candidates. With this approach, you can replace CV screening and shift towards skill-based hiring.


Start your 14-day free trial today and identify the best tech talent to join your team.