Data Scientist is the sexiest job of the 21st century - since 2016, Data Scientist has been in Glassdoor’s top 3 “Best Jobs” in the United States, and even enjoyed the top spot for a few years.
Today, we will focus on data scientists and their role and contributions within a company. Data Science can be a very broad term - at some organizations, a data scientist can be the next career level for a senior analyst, but with no substantial change in responsibilities. In other organizations, a data scientist is someone who focuses on experimentation, especially around the product. One of the key distinctions we see in a data scientist is the use of advanced techniques to get insights from data and draw conclusions about similar data. This might be:
This last result of data science is one that the world has become increasingly familiar with and affects us in our daily lives. If we are logging on to our email from a new device - an algorithm will raise an alert that something in our usage has changed, and we are then asked to perform a check to prove we are humans and not some bot hacking into the app:
With more and more things happening online (especially since the pandemic) and the increasing amount and availability of data - there is an increasing need for roles that will make sense of it all. According to LinkedIn, there are more than half a million data scientists in the world - and this specifically excludes data analysts, data engineers and other folk in data! Topping that - there are ~1,200,000 open roles for data scientists worldwide. The US Bureau of Labor Statistics projects the role to grow by 22% in the USA alone - much faster than the average of all occupations.
While we have all gotten used to hearing “data science” used more and more in our jobs and in popular culture (think Moneyball, the movie about how the Oakland A’s baseball team uses mathematics to build a winning team), there has also been an increase in “machine learning” and “artificial intelligence” (think Jarvis/Vision in the Marvel movies) - and it is helpful to differentiate between them. One note, however, is that machine learning and artificial intelligence are not the same as data science - they include more techniques, and are applied and focused differently. Looking for someone with that expertise would mean creating a job description for an AI engineer or ML engineer, and not for a Data Scientist. A Data Scientist would translate a business problem into data and create an MVP, while AI and ML engineers would focus on integrating the data science model into the app or website.
As we mentioned above - data science is the practice of using advanced techniques to get insights from data and make conclusions around data. As an example, one of the ways in which we use data science here at Skillfill is creating custom skill tests for candidates based on the job description our clients provide us - we use natural language processing (NLP) techniques to analyze the job description, and then map to our knowledge graph based on the contents of that job description - with this knowledge graph, we can create Python, SQL, Math and Stats (and more!) questions for candidates to uniquely test the candidate on the dimensions our client is looking for.
Machine Learning can be thought of as the heavy lifting of data science - algorithms are used to draw conclusions or infer outcomes in an automated way, requiring less involvement from a team of analysts. In this way, the traffic light example is closer to machine learning. Data science was used to determine that certain types of behavior are suspicious and require someone to verify that they are human. However, when this is applied at scale and runs automatically - as on a website login page - it is a clearer case of machine learning.
Artificial Intelligence is the broader field that machine learning belongs to - it also covers more futuristic things, like robotics and computer vision. Within machine learning, supervised learning techniques usually try to classify or predict data based on previous examples - e.g. classifying an iris flower as one of the “Setosa”, “Versicolour”, or “Virginica” sub-species of iris, or identifying the object in a picture as a dog or a blueberry muffin:
whereas unsupervised learning techniques have no previous examples - so they find classes, labels, patterns and groups within a dataset with no human input required. A good example of this is a recommendation algorithm - from sites like Netflix or Amazon - that makes highly custom recommendations based on your viewing or shopping history.
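To make the supervised case a little more concrete, here is a minimal sketch of learning from labelled examples: a new flower simply gets the label of the most similar flower we have already seen. The measurements are made-up, roughly iris-like numbers, and `nearest_neighbour_label` is a hypothetical helper name used only for illustration.

```python
def nearest_neighbour_label(sample, labelled_examples):
    """Return the label of the labelled example closest to `sample`."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Each example is a (features, label) pair; pick the closest features
    _, closest_label = min(
        labelled_examples,
        key=lambda example: squared_distance(sample, example[0]),
    )
    return closest_label


# Made-up labelled data: (petal length in cm, petal width in cm) -> species
training_data = [
    ((1.4, 0.2), "Setosa"),
    ((1.5, 0.3), "Setosa"),
    ((4.5, 1.4), "Versicolour"),
    ((4.2, 1.3), "Versicolour"),
    ((5.8, 2.2), "Virginica"),
    ((6.0, 2.4), "Virginica"),
]

new_flower = (4.4, 1.4)
predicted_species = nearest_neighbour_label(new_flower, training_data)
print(predicted_species)
```

The key point is that the labels were supplied by a human up front; an unsupervised technique would have to discover the groups on its own.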
Now let’s get back to our main topic - Data Scientists - and what they do within a company. There are a lot of data science techniques and ways in which a data scientist can add value to a company - but we will focus on the three most common and practical applications: regression, classification and clustering.
Beyond the knowledge of all the techniques available, and how to apply them - a key skill for a data scientist is choosing the right technique based on the business problem.
Regression is probably the most familiar term among these top data science techniques - and one that most of us probably heard at some point in a high school or college math or statistics class - usually in the phrase “linear regression”. There are many examples of linear regression - house prices based on square meters (or feet), gas mileage based on the weight of a car, or income based on years of education. While linear regression is the phrase most of us have heard, regression can get a lot more complex. Linear regression means finding the line that best fits the data, as shown below:
However, the data isn’t always going to be best described by a line. In these cases, one might have to come up with a polynomial regression to better fit the data. There are more advanced regression algorithms however, such as random forest, support-vector machines, and others!
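As a minimal sketch of what "a line that best fits the data" means in practice, the snippet below computes an ordinary least-squares line in plain Python. The house sizes and prices are made-up illustrative numbers, and `fit_line` is a hypothetical helper name rather than a function from any particular library.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: house size in square meters vs. price in thousands
sizes = [50, 70, 90, 110, 130]
prices = [150, 200, 260, 310, 360]

slope, intercept = fit_line(sizes, prices)
predicted_100 = slope * 100 + intercept
print(f"price = {slope:.2f} * size + {intercept:.2f}")
print(f"predicted price for a 100 square meter house: {predicted_100:.1f}")
```

In a real project a data scientist would more likely reach for an existing library than write this by hand, but the idea is the same: pick the slope and intercept that keep the line as close as possible to the observed points.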
Classification is a branch of data science that focuses on exactly that - classifying data into categories. A common first project is around classifying cars into European, Asian, or North American cars based on certain features, or using image processing techniques to classify cancers as malignant or benign.
Logistic regression is an example of classification, even if it sounds like it should be in the previous “Regression” section. It predicts “yes or no” or “true or false” values, with a final output of a probability - a number between 0 and 1. A good example is - will a student pass their exam?
The graph above shows that the more hours a student spends studying (the horizontal, x-axis), the higher the probability that the student will pass the exam (the vertical, y-axis). If you’re thinking - could this be done using linear regression? - the answer is: kind of! You could predict the score (a numerical value) of a student based on the number of hours studied, and your model would most likely show higher scores correlated with more hours of studying. With classification, the model does not have to be so simple and based on only one feature, such as the number of hours studied. Real-world and applied examples would probably bring in many other features to come up with an accurate prediction - a classification model for cancer would probably bring in age, personal history with cancer, family history with cancer, any known genetic factors, and maybe other specific factors based on the type of cancer you’re trying to predict (e.g. lung cancer - does the person regularly smoke cigarettes?).
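The exam example can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the study hours and pass/fail outcomes are made up, `train_logistic` is a hypothetical helper name, and the model is fitted with simple gradient descent rather than whatever a production library would use.

```python
import math


def sigmoid(z):
    """Squash any number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit p(pass) = sigmoid(w * hours + b) with gradient descent on log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # predicted probability minus outcome
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b


# Made-up data: hours studied and whether the student passed (1) or not (0)
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

w, b = train_logistic(hours, passed)
p_one_hour = sigmoid(w * 1.0 + b)
p_five_hours = sigmoid(w * 5.0 + b)
print(f"P(pass | 1 hour)  = {p_one_hour:.2f}")
print(f"P(pass | 5 hours) = {p_five_hours:.2f}")
```

The output is a probability, not a score, which is what makes this classification: a threshold (commonly 0.5) turns the probability into a "pass" or "not pass" prediction.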
Another common type of technique used by data scientists is clustering. This might sound similar to classification, but the two shouldn’t be confused - clustering is an unsupervised learning technique. With clustering, we are not classifying a test result into “pass” or “not pass”, but are counting on the algorithm to find the interesting patterns or groupings in the data and report those back:
A good example of this is trying to find groupings among your customers if you sell sports equipment online. There are multiple ways of grouping your customers - for example by sport - your tennis customers, your baseball customers, etc, but you could also group them at a different level. For example, this could be regular customers, seasonal sports customers and specialised sports customers.
A regular customer might be someone who requires sports gear and apparel throughout the year - maybe someone who goes to a gym a few times a week, or goes running on a regular basis. Their needs aren’t very specialised, and include athletic shorts, a sports t-shirt and athletic shoes. Seasonal sports customers would be treated very differently - these are people who only purchase from your store at a specific time of the year, e.g. winter. You want to reach out to them with content that will interest them - for example, new skis or a sale on ice hockey gear. Specialised sports customers form yet another group that needs its own treatment. For example, a tennis player would have some of the same needs as a “regular customer” - athletic shorts and a sports t-shirt - but would also be interested in a new racket, or a sale on tennis balls.
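A minimal sketch of how such groups could be discovered without any labels is k-means clustering, shown below in plain Python. Everything here is an assumption for illustration: the three features, the customer numbers, and the helper names (`k_means`, `farthest_point_init`) are made up, and the starting centroids are picked with a simple deterministic farthest-point rule so the example is repeatable.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Pick k starting centroids: the first point, then repeatedly the point
    farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(farthest)
    return centroids


def k_means(points, k, max_iterations=100):
    """Return (labels, centroids) after alternating assign and update steps."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid
        new_labels = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Made-up customers: (orders per year, % of orders in winter, % of spend on one sport)
customers = [
    (24, 25, 20), (30, 30, 25), (26, 20, 15),   # shop all year, general gear
    (3, 95, 60), (2, 100, 70), (4, 90, 65),     # shop almost only in winter
    (12, 25, 90), (10, 30, 95), (14, 20, 85),   # spend mostly on one sport
]

labels, centroids = k_means(customers, k=3)
print(labels)
```

Note that the algorithm only returns group numbers. Naming the groups "regular", "seasonal" and "specialised", and deciding how to treat each of them, is still the data scientist's job.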
Finding a data scientist is complex because there is no universal definition of data scientist - this can vary drastically across companies and industries. This gets even trickier in that Machine Learning and Artificial Intelligence roles are themselves a part of the Data Science world! These roles can be very complex, and use a variety of techniques and packages - and should not be confused with data-pipeline related jobs, such as data engineers. One of the first tasks is deciding the core requirements, key skills and tools used in your company, and then tailoring an assessment to find the right candidate - a new university grad who knows the fundamentals and newest packages, or an experienced applicant who has years of experience applying data science techniques to business challenges.
Skillfill has the tool for HR to pre-screen Data Science and other tech applicants. The workflow is easy: simply upload a job ad, and the Skillfill AI engine extracts all the required skills and translates them into a highly specific skill test consisting of multiple-choice questions and coding quizzes. The recruiter can then review an applicant’s results and easily understand which candidates are the most suitable to move forward with. The test results are broken out by skill category and by individual question. This allows the hiring manager to get a better understanding of the top applicants’ strengths and weaknesses, and the results can be used as a basis for a technical interview if they wish.
Find more information on Skillfill EVALUATE here.