The use of AI and tools like ChatGPT is surging, and recruiting is one of the most interesting use...
22 machine learning interview questions to hire the best candidates
What is machine learning?
Machine learning uses data to imitate the way humans learn, gradually improving its accuracy over time.
With trendy tools like ChatGPT, artificial intelligence (AI) is becoming popular across sectors such as finance, retail, healthcare, and marketing. Machine learning is a branch of artificial intelligence and computer science that uses data to imitate the way humans learn, gradually improving its accuracy over time.
With growing demand for AI and machine learning engineers, recruiters and hiring managers are focused on finding the right talent to join their teams. For that, it is crucial to deliver a great candidate experience and to properly assess each candidate's skills.
If you are looking to hire the perfect machine learning engineer for your team and want to automate the pre-screening of candidates, you can always use Skillfill.ai job simulations. They use your specific job requirements to simulate the tech environment of the role. You can then combine them with interviews and cultural-fit assessments to get the whole picture of the candidate.
Candidates who perform well in these skill assessments understand the concepts of machine learning and computer science. Other skills are also assessed, such as problem solving, deep learning, statistics, and neural networks.
How to assess a machine learning engineer?
If you are going to do a technical interview to evaluate a machine learning engineer, you should test their knowledge of the concepts that are necessary to perform well on the job.
If you are not considering one of our job simulations and want to do a traditional technical interview, we compiled a list of 22 questions and answers to help you find the perfect machine learning engineer. Each answer comes with the reasoning behind it, so you are even better equipped to run the technical interview.
For a fast and easy assessment of tech candidates, we divided the questions according to the difficulty level:
1. Questions to assess a junior machine learning engineer
2. Questions to assess a mid-level machine learning engineer
3. Questions to assess a senior machine learning engineer
-> Tip: you can also use these machine learning questions as a skill assessment before the interview; just remove the explanations before you share them with the candidates.
Questions to assess a junior machine learning engineer
Question 1. Check the data points between X and Y in the graph below. If you train a linear regression model to the data, what would be the mean absolute error of this training data set?
Option A) | ~0 | Option B) | Infinity |
Option C) | -1 | Option D) | ~1 |
Correct answer: (A)
Explanation: This question evaluates the junior machine learning engineer's linear regression skills in more depth. Since the data points appear to lie on a straight line, a linear regression model would fit the data almost perfectly, leaving almost no residual error.
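To make the reasoning tangible, here is a minimal scikit-learn sketch with made-up data (not the plotted points): a linear fit to points lying exactly on a line leaves an MAE of roughly zero.

```python
# A minimal sketch (not tied to the plotted data): fitting a linear regression
# to points that lie on a straight line yields a near-zero mean absolute error.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

X = np.arange(10).reshape(-1, 1)   # feature X
y = 2 * X.ravel() + 1              # points exactly on a straight line

model = LinearRegression().fit(X, y)
mae = mean_absolute_error(y, model.predict(X))
print(mae)  # ~0, since the fit leaves (almost) no residual error
```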
Question 2. You are evaluating the performance of your k-Nearest Neighbor algorithm and need to choose the optimal value for k. According to the graph below, what k do you choose?
Option A) | 5 | Option B) | 10 |
Option C) | 15 | Option D) | 25 |
Correct answer: (B)
Explanation: This question assesses knowledge of the k-nearest neighbors algorithm and model improvement, both important skills for a junior machine learning engineer. The optimal k lies where the error rate is smallest, so around k ≈ 10.
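A minimal sketch of how this choice is typically made in practice, using synthetic data rather than the plotted curve: evaluate several values of k and keep the one with the lowest validation error.

```python
# Sketch (synthetic data): compare the cross-validated error rate of k-NN
# for several k and pick the k with the smallest error.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

for k in [5, 10, 15, 25]:
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(k, 1 - acc)  # error rate per k; choose the k with the smallest error
```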
Question 3. Imagine you train two binary classification models. To compare their performance, you calculate the True Positive Rate and the False Positive Rate as a function of the cutoff threshold, see below. Which model performs better?
Option A) | Random Forest | Option B) | Logistic Regression |
Option C) | Both perform equally | Option D) | Can't tell based on the plotted data. |
Correct answer: (A)
Explanation: With this question you are evaluating logistic regression and model evaluation. Plotting the TPR against the FPR as the cutoff threshold varies gives the "receiver operating characteristic" (ROC) curve. The optimal operating point maximizes the TPR while keeping the FPR minimal; on the ROC curve it is the point closest to the top left corner, where the TPR is 1 and the FPR is 0. The closer the ROC curve is to the top left corner, the better the model is at distinguishing between positive and negative samples.
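If you want to see this comparison in code, here is a small sketch with synthetic data (not the data from the question) comparing two classifiers by the area under their ROC curves:

```python
# Sketch (synthetic data): compare two classifiers via the area under the ROC
# curve. The model whose curve sits closer to the top-left corner has the
# higher AUC and distinguishes the classes better.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (RandomForestClassifier(random_state=0), LogisticRegression(max_iter=1000)):
    scores = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(type(model).__name__, roc_auc_score(y_te, scores))
```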
Question 4. The three scatter plots show a binary classification problem in a 2-dimensional feature space (x1, x2). The lines mark the decision boundaries resulting from three different non-linear logistic regression models. Which of the three models shows the highest variance?
Option A) | A | Option B) | B |
Option C) | C | Option D) | Can't say |
Correct answer: (C)
Explanation: This question shows whether the applicant understands data classification and model evaluation.
The correct answer is C, since high variance means the model is overfitting: it fits the training data (almost) perfectly but does not generalize well to new data. Out of the three examples, C fits the data essentially perfectly, hence showing the highest variance. Example B underfits the data, showing low variance, and example A sits in between.
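A short, illustrative sketch of the underlying idea, using synthetic data and polynomial logistic regression of increasing degree: training accuracy typically climbs with model flexibility while test accuracy stops improving or drops, which is the signature of high variance.

```python
# Sketch (synthetic data): a very flexible non-linear logistic regression fits
# the training set (almost) perfectly but tends to generalize worse -- high variance.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 3, 12):  # increasing model flexibility
    clf = make_pipeline(PolynomialFeatures(degree),
                        LogisticRegression(C=1e4, max_iter=10000))
    clf.fit(X_tr, y_tr)
    # train accuracy vs. test accuracy per degree
    print(degree, clf.score(X_tr, y_tr), clf.score(X_te, y_te))
```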
Question 5. Which is the most appropriate activation function in the output layer of a deep learning model used for binary classification?
Option A) | ReLu | Option B) | Sigmoid |
Option C) | Softmax | Option D) | Don't know |
Correct answer: (B)
Explanation: If you ask this question to your junior machine learning engineer candidates, you are evaluating their knowledge about deep neural networks.
The sigmoid function maps any input to a value between 0 and 1, which can be interpreted as the probability of the positive class. This output can then be thresholded to obtain the final binary prediction.
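As a minimal sketch, assuming TensorFlow/Keras (the question does not name a framework), a binary classifier ends in a single sigmoid unit:

```python
# Sketch, assuming TensorFlow/Keras: a binary classifier ends in one sigmoid
# unit whose output is the probability of the positive class.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output in (0, 1)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# predictions > 0.5 are thresholded to class 1, otherwise class 0
```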
Questions to assess a mid-level machine learning engineer
Question 6. Your graph has v vertices and e edges, is connected and has no cycles. Select the correct statement.
Option A) | v=e | Option B) | v=e+1 |
Option C) | v+1=e | Option D) | Can't say |
Correct answer: (B)
Explanation: This helps you to assess an important skill for a mid-level machine learning engineer: graph theory.
In a connected acyclic graph (i.e. a tree), the number of vertices v is equal to the number of edges e plus one. You can use induction on the number of vertices in the tree. For a tree with one vertex, there are no edges, so e = v - 1 is true. Assume that the statement is true for all trees with up to k vertices, and consider a tree with k + 1 vertices. Choose any leaf (a vertex with degree 1) in the tree, and remove it along with its incident edge. The resulting graph is a tree with k vertices, and by the induction hypothesis it has k-1 edges. Adding the removed edge back in gives a tree with k+1 vertices and k edges, so e = v - 1 is true for this tree as well.
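A quick numerical check of the v = e + 1 property on a small hand-built tree:

```python
# Quick check of v = e + 1 on a small hand-built tree
# (adjacency list; each edge is stored once, from parent to child).
tree = {1: [2, 3], 2: [4, 5], 3: [], 4: [], 5: []}

v = len(tree)
e = sum(len(children) for children in tree.values())
print(v, e, v == e + 1)  # 5 4 True
```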
Question 7. Select the correct statement(s) about K-fold Cross-Validation. [multiple solutions possible]
Option A) | By increasing the number K of groups to split up the data set, the run-time of the process is increased. | Option B) | Lower K values result in lower confidence on the overall cross-validation result. |
Option C) | If the number of groups K is chosen to be the same as observations N in the data set, you perform Leave-One-Out Cross-Validation. | Option D) | Can't say |
Correct answer: (A), (B), and (C)
Explanation: Now you focus on important skills for a mid-level machine learning engineer, such as model improvement and model training. Here is why A, B, and C are correct (see also the sketch below):
- A) Increasing the number of folds K increases the run-time of the process, as the data is split into more parts and each part is used as a validation set once.
- B) With a lower number of folds, each fold represents a larger proportion of the data, leading to a higher variance in the estimated model performance and hence lower confidence in the overall result.
- C) In Leave-One-Out Cross-Validation (LOOCV), each observation is used exactly once as the validation set and the remaining data is used for training, which is exactly the case K = N.
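The points above can be illustrated with a short scikit-learn sketch on synthetic data:

```python
# Sketch (synthetic data): higher K means more model fits (longer run-time),
# and K = N is exactly Leave-One-Out Cross-Validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=100, random_state=0)
model = LogisticRegression(max_iter=1000)

for cv in (KFold(n_splits=5), KFold(n_splits=10), LeaveOneOut()):
    scores = cross_val_score(model, X, y, cv=cv)
    print(type(cv).__name__, len(scores), scores.mean())  # number of fits grows with K
```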
Question 8. Imagine you are training your linear regression model via gradient descent. In the graph below, the loss function is plotted as a number of iterations for various learning rates. What could be a feasible relation of the learning rates (lr)?
Option A) | lr_blue < lr_black < lr_orange < lr_green | Option B) | lr_blue > lr_green > lr_orange > lr_black |
Option C) | lr_orange = lr_green = lr_blue = lr_black | Option D) | Can't say without knowing the data set. |
Correct answer: (B)
Explanation: By asking this to your candidate, you are assessing their knowledge regarding model training, model improvement, and linear regression.
Finding the right learning rate takes trial and error, as it is very dependent on your problem and data set. A very high learning rate usually does not let the model learn properly: the steps are too coarse to get closer to the optimum, so the loss can start spiking after a couple of iterations (blue). A very low learning rate moves in the direction of a minimum, but so slowly that it becomes computationally costly (black). The optimum lies in the middle, where the minimum is approached quickly and the loss saturates after relatively few iterations (orange). The green curve corresponds to a learning rate that is still too high to reach the minimum, so the loss stays stuck at a similar level.
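A minimal sketch of the effect, using plain gradient descent on a 1-D quadratic loss (not the model or data from the question):

```python
# Sketch: gradient descent on the 1-D loss w^2 with different learning rates.
# Too small converges slowly, too large oscillates or diverges.
import numpy as np

def run(lr, steps=50):
    w = 5.0                      # start away from the optimum w* = 0
    losses = []
    for _ in range(steps):
        grad = 2 * w             # d/dw of loss = w^2
        w -= lr * grad
        losses.append(w ** 2)
    return losses

for lr in (0.001, 0.1, 0.9, 1.1):
    print(lr, run(lr)[-1])       # final loss: slow / good / oscillating / diverging
```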
Question 9. Below you see two data sets (graph A) and (graph B) on a linear upward trend with the data points randomly distributed within a defined range. Both data sets share the same regression line (blue) yet differ in population variance. What can you say about the sum of residuals in both cases?
Option A) | The sum of residuals is much bigger in A than in B. | Option B) | The sum of residuals is much bigger in B than in A. |
Option C) | Both have basically the same sum of residuals. | Option D) | Can't say. |
Correct answer: (C)
Explanation: You are now evaluating the knowledge about regression.
C is the right answer: both data sets have positive and negative errors relative to the fit that cancel each other out, leaving a sum of residuals close to zero in both cases.
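A small scikit-learn sketch with made-up data illustrates this: two data sets with different spread around the same trend both give a residual sum of about zero.

```python
# Sketch: for an ordinary least-squares fit with an intercept, positive and
# negative residuals cancel out, so their sum is ~0 regardless of noise level.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100).reshape(-1, 1)

for noise in (0.5, 3.0):                     # data set A (low spread) vs B (high spread)
    y = 2 * x.ravel() + rng.normal(0, noise, 100)
    model = LinearRegression().fit(x, y)
    residuals = y - model.predict(x)
    print(noise, residuals.sum())            # ~0 in both cases
```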
Question 10. The plot below shows two features A and B describing a binary classification problem (dots are class 1, crosses are class 0). Using a decision tree algorithm, you can split the data based on feature A (horizontal axis) or feature B (vertical axis). Here, a vertical split is applied where smaller values for A result in class 1, and larger values for A result in class 0. What is the accuracy of this decision's classification?
Option A) | 100% | Option B) | 16.7% |
Option C) | 33.3% | Option D) | 83.3% |
Correct answer: (D)
Explanation: The skills being evaluated are data classification, and decision trees.
Since the straight line classifies three dots (1) to the left of the vertical line and two crosses (0) to the right of the line correctly, only one cross (0) is misclassified. Hence, the accuracy is (6-1)/6 = 5/6 = 83.3%.
Question 11. When training a deep neural network, what action can be applied to reduce overfitting? [multiple solutions possible]
Option A) | Addition of regularization. | Option B) | Reduction of training data. |
Option C) | Reduction of network complexity. | Option D) | Utilization of data augmentation. |
Correct answer: (A), (C), and (D)
Explanation: Evaluate the knowledge about deep neural network development. Below you can see why A), C), and D) are correct (a code sketch follows the list):
- A) Regularization techniques, such as L1 and L2 regularization, add penalties to the loss function for large weights, which discourages the model from learning overly complex representations.
- C) Reducing the number of neurons or layers in the network can help prevent overfitting by limiting the capacity of the model to memorize the training data.
- D) Data augmentation creates new training examples by applying random transformations to the original training data, increasing the effective size of the training set and reducing overfitting.
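Assuming TensorFlow/Keras and image input (neither is specified in the question), a sketch combining the three correct remedies could look like this:

```python
# Sketch, assuming recent TensorFlow/Keras and image input: the three correct
# remedies combined -- data augmentation, L2 regularization, and a small network.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.RandomFlip("horizontal"),        # (D) data augmentation
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.Conv2D(8, 3, activation="relu",  # (C) deliberately small network
                           kernel_regularizer=tf.keras.regularizers.l2(1e-3)),  # (A) L2 penalty
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```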
Question 12. Which of the following properties are used in a bagging tree algorithm? [multiple solutions possible]
Option A) | Every individual learner/tree is applied to the whole data set. | Option B) | The model performance is calculated by taking the average performance of all learners/trees. |
Option C) | Each learner/tree has a low bias and a high variance. | Option D) | None of these. |
Correct answer: (B) and (C)
Explanation: With this question you can see if your machine learning candidate knows about decision trees.
Bagging is an ensemble method that can be applied to decision tree algorithms to reduce overfitting and improve performance. It involves creating multiple bootstrapped samples from the training data and training a base learning algorithm on each sample. The results of the individual learners are then combined through a majority vote or average to form a single prediction, resulting in a more robust and stable model. Hence, A is wrong as each learner is applied to a bootstrapped sample, while B and C apply.
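A minimal scikit-learn sketch of bagging with decision trees on synthetic data:

```python
# Sketch (synthetic data): bagging trains each tree on a bootstrapped sample
# and combines the individual predictions by majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

bagging = BaggingClassifier(
    DecisionTreeClassifier(),   # low-bias, high-variance base learner
    n_estimators=50,
    bootstrap=True,             # each tree sees a bootstrapped sample, not the full set
    random_state=0,
)
print(bagging.fit(X, y).score(X, y))
```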
Question 13. In order to train a multi-classification model using logistic regression, you could apply the One-Vs-All methodology. How does it work?
Option A) | You train n models to solve n different classification problems. | Option B) | You train n-1 models to solve n different classification problems. |
Option C) | You train one model to solve n different classification problems. | Option D) | You train n models to solve one classification problem. |
Correct answer: (A)
Explanation: By asking this to your candidate, you assess their knowledge regarding classification, logistic regression, and model training.
The One-Vs-All (also known as One-Vs-Rest) method trains multiple binary classifiers, one for each class in a multi-class problem, where each classifier distinguishes that class from all others. The classifier with the highest prediction score for a given sample is chosen as the final prediction. This simple technique transforms the multi-class problem into multiple binary classification problems, which can be solved using logistic regression.
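Here is a small scikit-learn sketch with a synthetic 3-class problem; One-Vs-Rest fits exactly one binary logistic regression per class:

```python
# Sketch (synthetic data with n = 3 classes): One-Vs-Rest trains one logistic
# regression per class, i.e. n binary models for an n-class problem.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = make_classification(n_samples=300, n_classes=3, n_informative=5, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_))  # 3 -- one binary classifier per class
```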
Question 14. What happens if you replace all ReLu functions with linear activation functions in the architecture below?
Option A) | The network will not be able to model non-linear functions. | Option B) | The network will perform slightly better. |
Option C) | The network will take slightly longer in training time. | Option D) | Can't say. |
Correct answer: (A)
Explanation: Here you are testing your machine learning candidate's deep neural network skills.
Replacing ReLU activation functions with linear activation functions reduces the capacity of the network. ReLU activations add non-linearity, allowing the network to model complex relationships in the data; with only linear activations, the stacked layers collapse into a single linear transformation, so the network can no longer model non-linear functions and its performance will be worse. The training time per iteration is not meaningfully affected by the change.
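The collapse of stacked linear layers into a single linear map can be verified directly with NumPy (a toy sketch, not the architecture from the question):

```python
# Sketch: stacking layers without a non-linearity collapses to a single linear
# map, so the network can no longer model non-linear functions.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(2, 8))

two_linear_layers = W2 @ (W1 @ x)
one_linear_layer = (W2 @ W1) @ x
print(np.allclose(two_linear_layers, one_linear_layer))  # True
```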
Question 15. What is a potential problem when using a sigmoid activation function compared to e.g. using a ReLU activation function when training a neural network?
Option A) | The sigmoid function is computationally more elaborate compared to the ReLU function. | Option B) | Large gradients get saturated in the sigmoid function. |
Option C) | The sigmoid function is not non-linear. | Option D) | Can't say |
Correct answer: (B)
Explanation: In this question, you are also assessing deep neural network skills.
Vanishing gradients: for large positive or negative inputs, the sigmoid saturates close to 0 or 1, so its gradient becomes very small. During backpropagation these small gradients can shrink further and effectively "vanish", making it difficult for the network to learn from the data. This is known as the vanishing gradient problem. The sigmoid function is non-linear and computationally similarly elaborate to the ReLU function.
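A tiny NumPy sketch makes the saturation visible: the sigmoid's gradient sigma'(x) = sigma(x)(1 - sigma(x)) peaks at 0.25 and practically vanishes for large |x|.

```python
# Sketch: the sigmoid's gradient is at most 0.25 and practically vanishes
# for large |x|, unlike the ReLU gradient, which is 1 for all positive inputs.
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

for x in (0.0, 5.0, 10.0):
    s = sigmoid(x)
    print(x, s * (1 - s))   # 0.25, ~6.6e-3, ~4.5e-5 -- the gradient saturates
```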
Question 16. A classification problem to predict good vs. bad wines is approached by training three different algorithms: random forest model, a support vector machine and a multilayer perceptron. The resulting confusion matrices on the test set are given below. Which model is the most accurate?
Option A) | Support vector machine | Option B) | Random forest |
Option C) | Multilayer perceptron | Option D) | Can't say. |
Correct answer: (B)
Explanation: Now you are evaluating important skills for a machine learning engineer like data classification, model evaluation, and scikit-learn.
The accuracy of a model can be calculated from a confusion matrix using the following equation: Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives. Doing the math on these: Random Forest = 289/320, Support Vector Machine = 280/320, Multilayer Perceptron = 286/320. Hence, the random forest is the most accurate.
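As a sketch, here is the calculation in NumPy with a made-up confusion matrix, chosen only so that it reproduces the 289/320 figure quoted above (the actual matrices are in the plot):

```python
# Sketch with a made-up 2x2 confusion matrix (rows: actual, columns: predicted):
# accuracy = (TP + TN) / total, i.e. the trace divided by the sum of all entries.
import numpy as np

cm = np.array([[150, 10],    # TN, FP
               [ 21, 139]])  # FN, TP
accuracy = np.trace(cm) / cm.sum()
print(accuracy)  # 289 / 320 ~ 0.90
```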
Questions to assess a senior machine learning engineer
Question 17. In the following picture, the red box marks a region of significant discontinuity. What type of discontinuity are we talking about?
Option A) | Depth | Option B) | Surface color |
Option C) | Illumination | Option D) | Distinct |
Correct answer: (A)
Explanation: In this first question to assess more senior candidates, you are checking their convolutional neural networks skills.
Depth discontinuity refers to a sharp change in the depth or distance of objects in a scene. In computer vision and image processing, depth discontinuity is often used to detect boundaries between foreground and background objects, or to distinguish between objects at different distances from the camera.
Question 18. Your decision tree algorithm splits different categorical features applied on a problem to predict whether a car is fast or slow according to the plot below. Which feature shows the highest information gain?
Option A) | Cylinder | Option B) | Car colour |
Option C) | Horsepower | Option D) | Manufacturer continent |
Correct answer: (A)
Explanation: Here you are assessing knowledge of model evaluation and decision trees.
In a tree-based algorithm such as decision trees, the feature with the highest information gain is typically selected as the root node or the top splitting feature for each node in the tree. Information gain measures the reduction in entropy or the increase in the purity of the target variable after splitting the data on a particular feature. By selecting the feature with the highest information gain, the algorithm aims to maximize the reduction in entropy, leading to the best possible splits and more accurate predictions. In this example, only cylinder can separate at least one branch into purely one classification (slow for V6) whereas all branches of all other features have mixed classifications (both slow and fast).
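A small NumPy sketch of the calculation, with hypothetical labels rather than the car data from the plot:

```python
# Sketch: information gain = entropy(parent) - weighted entropy of the children.
# A split that isolates a pure branch reduces entropy the most.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# Hypothetical labels: 1 = fast, 0 = slow, split into two branches by a feature
parent = [1, 1, 1, 0, 0, 0]
print(information_gain(parent, [[0, 0, 0], [1, 1, 1]]))  # pure split: gain = 1.0
print(information_gain(parent, [[1, 0, 0], [1, 1, 0]]))  # mixed split: lower gain
```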
Question 19. When training decision trees, the information gain can be calculated. This is a measure used to decide for the amount of information gained from the features towards the dependent variable. Another measure is the ratio between the information gain and the intrinsic value of a feature. For categorical features, when is it better to use information gain compared to the gain ratio?
Option A) | When the categorical feature has a high cardinality. | Option B) | When the categorical feature has a low cardinality. |
Option C) | Cardinality of a categorical feature should not influence the choice of the right measure. | Option D) | Can't say |
Correct answer: (B)
Explanation: With this question you check the knowledge regarding decision trees.
If the cardinality of a feature is low (e.g., a binary feature), then the information gain is likely to work well as a feature selection criterion, as the intrinsic information content of the feature is relatively low.
Question 20. The three scatter plots show a binary classification problem in a 2-dimensional feature space (x1, x2). The lines mark the decision boundaries resulting from three different non-linear logistic regression models. Check the statements that are true! [multiple solutions possible]
Option A) | The training accuracy is maximum in plot C. | Option B) | The best model for the classification problem at hand is C as it shows zero training error. |
Option C) | Model A is the most applicable to new, unseen data. | Option D) | Model C is overfitting compared to A & B. |
Correct answer: (A), (C), and (D)
Explanation: You are now assessing model evaluation knowledge.
High variance means your model is overfitting, therefore fitting the training data (almost) perfectly but not generalizing well enough on new data. Out of the three examples, C fits the data basically perfectly, hence showing the highest variance. Example B underfits the data therefore showing low variance and example A sits in between showing the best generalization of the model to new data.
Question 21. The scatter plot below shows a binary classification problem (circle-1, plus-0) in a 2-dimensional feature space (x1, x2) where both features can only take binary values as well. Can you create a decision boundary with a logistic regression model that leads to 100% accuracy?
Option A) | Yes | Option B) | No |
Option C) | Need more information. | Option D) | Don't know. |
Correct answer: (B)
Explanation: A logistic regression model provides no decision boundary that clearly separates the classes here. Other non-linear algorithms could be applied instead, like tree-based models or SVMs with non-linear kernels.
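A short scikit-learn sketch, assuming the plotted points follow an XOR-like pattern (which is what the description implies): logistic regression cannot reach 100% accuracy, while a decision tree can.

```python
# Sketch, assuming an XOR-like pattern in the plotted data: a (linear)
# logistic regression cannot separate the classes, while a decision tree can.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])   # class flips on the diagonal -- not linearly separable

print(LogisticRegression().fit(X, y).score(X, y))      # <= 0.75, no perfect boundary
print(DecisionTreeClassifier().fit(X, y).score(X, y))  # 1.0
```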
Question 22. What statement is correct about the optimization algorithm AdaGrad applied commonly in training machine learning algorithms?
Option A) | AdaGrad applies first order differentiation. | Option B) | AdaGrad applies second order differentiation. |
Option C) | AdaGrad applies dynamic order differentiation choosing the degree based on the problem at hand. | Option D) | Can't say |
Correct answer: (A)
Explanation: Assess model training and algorithms with this question.
AdaGrad applies first order differentiation because it uses the gradient information to adjust the learning rates of individual parameters. The gradient is a first-order derivative of the loss function with respect to the model parameters, and it represents the direction of maximum increase of the loss function.
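A minimal NumPy sketch of the AdaGrad update rule, showing that only first-order gradients are involved:

```python
# Sketch of the AdaGrad update: only first-order gradients are used; each
# parameter's learning rate is scaled by its accumulated squared gradients.
import numpy as np

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    accum += grad ** 2                          # running sum of squared gradients
    w -= lr * grad / (np.sqrt(accum) + eps)     # per-parameter adaptive step
    return w, accum

w, accum = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(100):
    grad = 2 * w                                # first-order gradient of loss = ||w||^2
    w, accum = adagrad_step(w, grad, accum)
print(w)                                        # moves towards the minimum at 0
```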
Deliver the best candidate experience and find the best machine learning engineers
Machine learning and artificial intelligence are tech skills that are increasingly in demand in the job market. If you are a recruiter or a talent specialist looking to find the right candidates, skillfill.ai helps you deliver a great candidate experience while evaluating the tech skills of the applicants.
With our job simulations tailored to your job ad, you can automate the pre-screening process and reduce time, cost, and bias while assessing tech candidates. With this approach, you can replace CV screening and shift towards skill-based hiring.
ASSESS TECH TALENT BEYOND CVs
Start your 14-day free trial today and identify the best tech talent to join your team.