The “Analyst” title is widely used across different industries and functions (just think of finance...
Data Engineer - Role and Responsibilities
A Data Engineer can be thought of as an architect in the data world. An architect will look at an empty space, and start creating plans for a future building that people will work or inhabit. This will involve creating blueprints - deciding on the dimensions, how it is organized - and possibly more importantly - how people will interact with the house - where the doors and windows are placed, and how narrow or wide the corridors are.
A Data Engineer will similarly prepare and create datatables for consumption by other people in the organization. The data engineer will look at the raw data and think about the final tables that could be created from this, what they would be used for, how they would be used - and implement it. These decisions will have a big say on how end-customers use the data, and how easy/hard it may be. Data engineering can be one of the functions that business-roles have less experience with, since they can be “hidden” or separated from the organization by a team of data scientists or data analysts.
When an architect invests time in designing a house well - the residents of the house will enjoy the space and have a comfortable life. Similarly - a data engineer that creates a good database (a collection of tables) will facilitate the ability to find, explore and work with data, and more concretely - reduce the time spent on writing complex SQL queries and reduce costs from querying inefficient tables.
In this article - we’ll go through ETL and the connection with stakeholders and the downstream impact that data engineers can have in the organization.
Extract - this is extracting the data from the source file. One of the most common extractions people are familiar with is extracting a zip file on a computer to access the files inside. Similarly, a data engineer will extract data to work on the data within. This can be as simple as a csv, but with bigger datasets - this can sometimes mean file types that are optimized for lots of data - e.g. a parquet file.
A zip file and parquet file are similar in that they both work great for storing data, but are not very workable - it is impossible to work with zip files, and parquet files do not offer the same usability as an excel e.g. pivot tables, sorting or filtering. Extraction starts to get complicated when setting up a process that will do this automatically (e.g. you receive new data every day), and when you start to do some preliminary checks. Imagine you work for an e-commerce company and get daily data from the website around sales and website activity - you might check that you have the right data (sales and not website activity data), that you got the latest data (the date or timestamp inside of the dataset is all equal to yesterday), and that there wasn’t some sort of bug or outage (there are more than 0 rows).
Transform - This is one of the more technical parts of the process - and most alien to business users. The raw data can be a very large dataset, filled with sometimes irrelevant data, such as the number of rows, the time it was extracted, or other fields. If we continue from the example of website shopping data from above - the file might look as follows:
This type of data might be hard for an analyst or data scientist to work with - the Items are all organized together, and same for the prices. An analyst looking to report on the average price of purses in April would have a hard time pulling the data, first having to do his own unpacking (mini-extraction) of the price column.
The final transformation a data engineer implements might look something like this:
You might notice that an analyst wouldn’t be able to get the final transaction price - the shipping price is missing! The above transformation would require another table produced that looks as follows, in order for the analytics team to report on the total price by month.
With the above transformations (from the original raw dataset) - the analyst would be able to create reports that calculate e.g. daily revenue, as well as analyses around the individual products that are getting sold. But importantly - this wouldn’t be an efficient way of organizing the data - comparable to an architect building only one door to a house, but putting it at the back of the house - the resident has to walk an extra distance every time they enter or leave the house. Going back to our architecture analogy - an architect should think about features of the house and how they will be used in the grand scheme of things - e.g. putting a bathroom close to the bedrooms or living room (where people spend a lot of time), as opposed to in the garage, where people would have a harder time getting to it in their daily lives.
One thought might be to have one table that contains EVERYTHING - but this would not be efficient or cost-effective - and we’ll go further on this point later on.
Load - This final step means loading the data into the right place - comparable to adding an attachment to an email. This step means that you load your new tables into your systems - for example Google BigQuery, Redshift, Snowflake, or other systems. This part usually means setting up your script to have the permissions to access and modify your data. This can sometimes be creating a new table, but usually will mean adding the last day or week of data (just created in the Transform stage) to the existing data table.
# Downstream Data Usage - Connection with Analytics & Business Customers
Analytics Stakeholders 👨💻 👩💻
The ETL example also highlights a good point - the extracting and loading steps are maybe more “straightforward”, in that the work is between a data engineer and the systems that he or she is working on (e.g. S3 for file storage). The transform step can be a bit more complicated since the final output needs to keep in mind the business needs, and also in mind how the analyst or data science team will use the data. Depending on the transformations or how much simplification the data engineer does - the analysts or data scientists might have to do more work - pulling certain values out of arrays or json columns. Going back to the architect example - the house he or she designs should be easy to live in and make sense to human people - e.g. placing the dining room close to the kitchen to facilitate the resident’s ability to move food from the kitchen to the dining table is making their life easy - and comparable to giving an analytics team a table like the below:
As opposed to the JSON formatted version of the above:
The first table is much easier to read-through and explore - a simple glance at the top few lines will explain what the table is about - the second table will require a bit more investment.
Thinking more long-term, this will make the data-querying harder each and every single time the table is used - more time spent by analysts on writing more complex SQL queries to pull the data - as well as slightly more computation time - which will lead to increased costs from the data system provider as well as increased load on the system (possibly slower queries for everyone else).
Also- this is a good place to explain why a data engineer would not create one table with EVERYTHING inside it - whenever a query is run, the size of the table affects the run time and computation time - usually resulting in a more expensive query. An investment in good data engineering is a future $$$ savings.
**Internal Business Stakeholders** 👩💼👨💼
The close collaboration and relationship that a data engineering team should have with business customers can be a lot more black and white, especially with the analytics team usually coming in-between. However - when a data engineer is designing the table that will eventually feed reports used by business customers - it is important to understand the current and future data requirements.
A marketing director might ask for the above table (population by country) in order to direct marketing efforts in these countries at an early stage- but as the business goals evolve, he might ask for the information “population by city” in those countries. The above table would not be able to provide this information, and a data engineer would have to redesign or create another table to make this information accessible.
The population by city table will work for a while, but in time might also grow outdated- how soon before the marketing director asks for population by postcode?
There is no hard and fast rule to designing these databases- but it is always useful to consider the common-sense dimensions that would be useful for reporting- country, city, and possibly zip code are usually a good start. Depending on the vertical or product one is analyzing, specific dimensions will also be valuable - Facebook might break out daily active users by age group, while Linkedin might break out new users by education level, while a manufacturing firm might break out spend by supplier.
One easy solution might appear to be adding as many dimensions as possible to the table - if we stretch the data above, a datatable with the columns exact address, street, postcode, neighborhood name, city, country. The downside to this will be the costs to store all the additional data. If we assume there are ~ 50 houses on a street- this would add 50x more rows in our final table. If we have 50 streets per zip code- this would be 50x more rows on top of the 50x rows from having an exact address already 2500x more rows than if one had stayed with zip code as the lowest aggregation.
# Data Storage
The above point is a great segway into another important part of the job for data engineers - where to store data and why. Some of the more commonly known and usable storage formats are things like a CSV, Excel Worksheet or Google Sheets- these are great for their ability to store a certain amount of information and be able to extract some insights (e.g. pivot tables). However, the limit to an excel is currently around 1 million rows (it was ~65K rows in 1997) - hardly usable for bigger websites or products and the amount of data produced from them.
Data Engineers are usually working with storage platforms like Amazon’s S3 to store raw, large-data formatted files- these files, such as parquet, can hold large amounts of data much more efficiently than a CSV. From here, depending on how often the tables are used, the data can be extracted into e.g. Hive or Presto tables. These two sources work very differently- Hive tends to be more efficient for holding more data, and larger data sources, but slower for day-to-day data extraction (e.g. via SQL), while Presto will pull data more quickly. You can compare this to the difference between going into a gas station for a snack vs buying something from a vending machine. The gas station has more space inside, and a larger selection, but will typically take longer as you browse or walk to get your snack, whereas with a vending machine you press the button for what you want, pay and are done.
## Exponential Data Growth and Your Business
Companies and people are creating, storing and using data more than ever - according to Eric Schmidt, former CEO and chairman of Google, every two days we make more than was created from the dawn of man until 2003. In order to make more and more of this data usable, companies are turning to data engineers to design their database, create datatables and ensure internal stakeholders have up-to-date, clean, and usable data. The Dice Tech 2020 Job report predicted 50%+ job growth in data engineering jobs.\
To ensure your business hires the right people in the field of Data Engineering, Skillfill has developed a neat tool for you:
The HR department uploads the job description The Skillfill algorithm reads the requirements and translates these to a pre-screened test to send out to candidates You identify top talent in record time!
By making your hiring process faster, fairer, and more transparent to your applicants you hire the best candidates, quicker and avoid losing them to the competition in a tough job market. If you too are interested in smoothly growing the data side of your business, check out the Skillfill website.
ASSESS TECH TALENT BEYOND CVs
Start today your 14-day free-trial and identify the best tech talent to join your team.