Five traps to avoid when building Analytics and Data Science from scratch

by Valentin Leister - Head of Data Science and Business Intelligence at virtualQ

A hands-on guide using the example of an early-stage startup

Every data-driven startup eventually reaches a stage where the founding team can no longer fulfill its growing Analytics and Data Science needs. At that stage, a data professional is usually hired to build and lead these functions. Over the past 2.5 years, I have had the privilege of building the data team and analytics infrastructure at virtualQ, a SaaS startup with a mission to improve customer service through smart handling of incoming phone calls.

While I believe our company’s growth and the current state of the data team and infrastructure prove that we got many things right, not everything was smooth sailing. Being used to working in larger companies, I faced a steep learning curve while adapting to an environment of limited resources and rapid growth in headcount, product complexity and users.

This article aims to help you avoid late nights and frustration by describing the most common pitfalls when setting up Data Analytics and Data Science functions from scratch, be it at a startup or a mature company. Of course, your specific challenges will differ depending on the size of your data team, your maturity stage and your business model.

Trap #1: Fixing dirty data after it has been produced

[Image: Herculean data cleaning efforts, 1300 BC]

Pictured above is the tale of Hercules cleaning out the stables of Augeas, which had not been cleaned in over thirty years and housed 3,000 cattle. The task has since become a byword for a large and unpleasant job that has long called for attention: exactly what data cleaning was to the founders before they hired you. As you set up the data function from scratch, tidying and cleaning all existing data can therefore feel like a truly Augean task.

A full argument for why dirty data is bad is beyond the scope of this post, but the concept of GIGO (Garbage In, Garbage Out) sums it up nicely. As you approach the data producers in your company (teams, product owners) to “scrub the stable”, many will understand the value of your initiative and collaborate happily. However, you may also receive some pushback because your stakeholders do not view clean data as a priority. Furthermore, cleaning data involves some effort from your stakeholders, since they need to help you understand existing data and have to adapt to new processes to keep it tidy. After all, human error is the main source of dirty data.

When faced with resistance to cleaning up data at its origin, the dangerous trap is to use complex data engineering to do so instead. In this case, the people or apps continue to produce dirty data, but you bring it into clean shape at a later stage by performing a lot of data wrangling. Unfortunately, this approach creates a lot of additional complexity and maintenance effort for your data pipeline and should therefore be avoided if possible. A very simplified example of this would be to match free-form entries to classes via regular expressions, instead of defining a drop-down list in the first place.
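
To make the anti-pattern concrete, here is a minimal hypothetical sketch: suppose sales reps typed a product tier as free text, so the pipeline now has to guess the intended class after the fact. All names and patterns here are invented for illustration.

```python
import re

# Invented example: free-text tier entries that should have been a drop-down.
TIER_PATTERNS = {
    "enterprise": re.compile(r"ent(?:er|re)prise", re.IGNORECASE),
    "professional": re.compile(r"prof?essional|pro\b", re.IGNORECASE),
    "basic": re.compile(r"basic|starter", re.IGNORECASE),
}

def normalize_tier(free_text: str) -> str:
    """Map a free-form entry to a canonical tier -- brittle by design."""
    for tier, pattern in TIER_PATTERNS.items():
        if pattern.search(free_text):
            return tier
    return "unknown"  # every new misspelling needs yet another pattern

print(normalize_tier("Entreprise plan"))  # catches this misspelling -> 'enterprise'
print(normalize_tier("PRO (annual)"))     # -> 'professional'
print(normalize_tier("premium"))          # silently falls through -> 'unknown'
```

A drop-down list with the three canonical tiers at the point of entry would make this entire mapping layer, and its ongoing maintenance, unnecessary.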

Instead, you should educate all data producers about the importance of having clean data and adequate processes at an early stage. Explain the future cost of inaction to convince them and start with your clean-up before the stable becomes Augean.

Main Takeaway: Tackle dirty data by cleaning it at the source and setting up processes to keep it tidy. Be persistent, even if you face initial resistance from other functions.


Trap #2: Launching the feature without the data

[Image: Here’s your event logs to report those performance metrics to the board]

Data professionals’ nightmares are made of two things: dirty data (which we discussed previously) and having no data at all. Imagine your team is asked to help decide the fate of a new feature: should it receive more investment or should it be shut down? To do so, you will need to dig into your data and perform an analysis to answer key questions such as “which customers have looked at the new feature?” and “who uses the feature regularly?”. If the moment you are approached to support the decision is the first time you think about the data needed for your analysis, then you have essentially sleepwalked into Trap #2: Launching the feature without the data. It will be very tricky to put together the insights expected from you.

To prevent this situation, ensure all essential events are produced, saved, cleaned, and stored in a database, a process called data instrumentation. There is a tradeoff when instrumenting your feature: As a data user, the more data you get, the easier your job will be. On the other hand, increasing instrumentation always comes with additional development effort and complexity.

It is your job to communicate your data requirements to product management and development when a feature is planned. I recommend going through the following four steps:

  1. Ask your stakeholders what a successful rollout means for them (Example: What feature adoption rate would you hope to see after 3 months?)
  2. Agree on the essential KPIs and their exact definition (Example: Feature adoption is measured by the number of clients that use the feature at least once a week)
  3. Identify the event logs required to calculate the KPIs (Example: Get a log which tracks each feature use, timestamp and client id)
  4. Communicate your requirements, ideally during backlog grooming sessions.
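
As an illustration of steps 2 and 3, here is a minimal sketch (with invented client IDs and dates) that computes the example adoption KPI from such an event log:

```python
from collections import defaultdict
from datetime import date

# Hypothetical event log from step 3: one (client_id, event_date) row per
# feature use. Client names and dates are invented for the example.
events = [
    ("client_a", date(2024, 1, 1)), ("client_a", date(2024, 1, 8)),
    ("client_a", date(2024, 1, 15)),
    ("client_b", date(2024, 1, 2)),   # used the feature once, then stopped
    ("client_c", date(2024, 1, 3)), ("client_c", date(2024, 1, 10)),
    ("client_c", date(2024, 1, 17)),
]

def weekly_active_clients(events, n_weeks):
    """Clients that used the feature in at least n_weeks distinct ISO weeks."""
    weeks_seen = defaultdict(set)
    for client_id, day in events:
        year, week, _ = day.isocalendar()
        weeks_seen[client_id].add((year, week))
    return {c for c, weeks in weeks_seen.items() if len(weeks) >= n_weeks}

adopters = weekly_active_clients(events, n_weeks=3)
print(sorted(adopters))  # ['client_a', 'client_c']
```

Note how trivial the KPI becomes once the log from step 3 exists; without it, the same question would require the painful reconstruction work described below.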

Don’t rely solely on your developers to take care of good instrumentation, because they often only use a small fraction of the data their apps produce. They have different priorities such as core performance and cannot intuit what data you will need to answer business questions.

Lastly, don’t try to patch things up if you failed to properly instrument your feature in the first place. You may be able to recreate some of the missing data through extensive data engineering. For example, you may be tempted to compensate for a lack of proper usage logs by pulling in data from your third-party provider’s API, which stores some of your users’ touch points. I would like to discourage you from doing so, since you end up curing the symptoms of bad instrumentation rather than the underlying problem. On top of that, you get rampant data pipeline complexity and maintenance effort.

Main takeaway: Before you launch a feature, agree with your stakeholders on the KPIs and work closely with your product manager and developers to ensure you have the data to calculate them.


Trap #3: Starting with the solution - not the problem

[Image: Your team looking for a place to implement their Convolutional Neural Network]

This may sound obvious, but it can be hard to get right in practice. For many employees, the motivation to join a startup is to create the next big thing and to apply fascinating new technology. A junior data team can bring plenty of drive and enthusiasm but limited work experience. Assuming your team is smart and self-motivated, you should provide some space to innovate and experiment to unlock their full potential. However, with that freedom, a project can gradually drift away from the initial customer problem you set out to solve, because the team is magnetically drawn to whatever tech stack is en vogue.

For example, your team may choose to develop a Neural Network based solution, even though a rule-based algorithm can provide a perfectly acceptable solution to your customers’ problem, in a substantially shorter timeframe and with reduced complexity. A good practice to remain on course is to seek regular exchange with your product manager and your Customer Relations team: they speak to customers and know what their acute pain points are. In case you are building an internal tool, set up regular updates with your stakeholders to discuss how your envisaged final product will look and work; mockups and preliminary results can be very helpful for that purpose.
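
To make the tradeoff concrete, a rule-based baseline for a call-handling decision might look like the following sketch. Every name, input and threshold here is invented for illustration; the point is that such a baseline can ship in days and sets the bar that any model would have to beat.

```python
# Invented rule-based baseline: decide whether to offer a caller a callback
# instead of keeping them waiting on hold.
def should_offer_callback(expected_wait_min: float,
                          agents_available: int,
                          accepted_callback_before: bool) -> bool:
    if agents_available == 0:
        return True                     # nobody can pick up anyway
    if expected_wait_min >= 5 and accepted_callback_before:
        return True                     # long wait, and we know they like callbacks
    return expected_wait_min >= 10      # very long wait: offer it to everyone

print(should_offer_callback(12.0, 3, False))  # True
print(should_offer_callback(6.0, 3, True))    # True
print(should_offer_callback(2.0, 3, False))   # False
```

Three transparent rules like these are easy to explain to stakeholders and to debug in production; a neural network is only worth its added complexity once it measurably outperforms such a baseline on the customer’s actual problem.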

Main takeaway: Constantly remind yourself and your team of the acute customer problem you want to solve. Then, work your way backwards and find the technology that provides you with the best solution.


Trap #4: Limiting Data Science to product R&D

[Image: A normal day at work in your data team, as imagined by the HR department]

Startups also bring together people with greatly varying degrees of data literacy and technical understanding. Depending on their age and background, a data scientist’s role may seem quite alien to leads in other functions. That can make them less inclined to speak to your team about their pain points and challenges. In turn, you will have limited visibility and understanding of their problems, reducing your ability to create valuable solutions for them (remember Trap #3).

However, almost every company function can benefit from the skillset and technology available in the data team: huge improvements in efficiency and performance can be achieved through automation, scraping, predictive modeling and iterative improvement via A/B tests. Yet your stakeholders may only observe your team’s (very technical) work on the core product and not be aware of their ability to help out.

It is your task as data leader to be proactive and seek regular dialogue with each function. Strive to improve the data literacy of key stakeholders where needed. Depending on the degree of collaboration, it may also make sense to designate one of your team members as a “champion” for a specific function, increasing ownership and involvement on both sides.

Main takeaway: If you want to harness the skills in your team to become an efficient, data-driven organisation, you need to collaborate with all business functions in a proactive manner and educate key stakeholders in data literacy.


Trap #5: Overengineering your reporting

[Image: It’s just a simple report I had in my previous role, they said]

Senior employees are often hired to bring in valuable know-how from larger organisations. They are also used to the perks of corporate life, including a back office with plenty of manpower and funding, free lunch vouchers and pension plans. You, on the other hand, are running your data team with limited resources, heavily prioritizing tasks and bringing in stale sandwiches.

These two worlds collide when senior employees request the same standard of reporting and analytics they were used to in their previous role. They may be unaware that those reports were built over years by a dedicated data team. If you implement these projects as requested, they will tie up a good part of your team’s precious resources. The problem is that your stakeholder is focused too much on the solution instead of the problem (just as your team was in Trap #3). Therefore, when they describe the desired layout, tables and filters to you, respond by asking what questions they want to be able to answer. Identify the truly essential elements, explain your situation (limited resources) and cut out all the “nice to have” bits.

This is even more important because you may find that the new reporting is rarely used, even though it was built exactly as requested. This is due to your stakeholders feeling the same pressure you do: they need to divide their time wisely across many tasks with fewer resources than before, so time for number crunching is limited.

To hold your stakeholders accountable, involve them in your roadmap planning. Ask them how often they expect to use the new report or tool, and explain that you need to weigh each request against the others based on its impact. Remind them that you will check how much they actually use what they asked for.

Main Takeaway: Focus on your stakeholders’ problem, not a pre-defined solution, and discuss which elements of the reporting are truly indispensable. Involve them in your roadmap planning to prioritize and hold them accountable.


Closing remarks

I hope you find comfort or advice in one of the five situations described, and that it helps you make different mistakes rather than the same ones. As a quote attributed to Otto von Bismarck (German chancellor from 1871 to 1890) puts it:

“Only a fool learns from his own mistakes. The wise man learns from the mistakes of others.”