Data Engineers: The New IT Bottleneck?

Question: “How many data scientists does it take to screw in a lightbulb?”

Answer: “One, if she has a data engineer on speed dial.”

This joke is not meant to denigrate data scientists or mock their data engineering skills. Companies don't hire data scientists to be data engineers. Yet in many companies, data scientists spend too much time finding, profiling, formatting, and transforming data and publishing machine learning output. This undermines their value to their company.

The Data Science Assembly Line 

Recognizing this dilemma, some managers surround data scientists with a squadron of data engineers who excel at data programming and pipeline creation. Although data scientists can code and manipulate data, they are hard pressed to build complex data pipelines that pull and integrate big data from dozens of disparate sources at different rates and levels of granularity. That’s what data engineers do. Combining both on a single team is a match made in heaven, or so it seems.

Common wisdom holds that you need two or three data engineers to support a single data scientist. For companies with complex data requirements, that ratio may grow to four or five data engineers.[1] Pairing data scientists with one or more data engineers increases the productivity of data scientists, enabling them to focus on what they do best: applying math, statistics, and machine learning algorithms to address business constraints and optimize business processes.

By creating a data science team instead of hiring lone-wolf data scientists, organizations create an assembly line of specialists: data engineers prepare data → data scientists create machine learning models → product engineers deploy the models. Everyone does what they do best, the team's productivity climbs, and everyone is happy.

But not so fast! 

Downsides of specializing. This specialist approach to analytics has some downsides. For one, it dramatically increases headcount, nullifying the cost savings from making individual data scientists more productive. Second, it creates coordination delays: one specialist hands off work to another who may not be ready to pick it up right away. Eventually, this produces a bottleneck that slowly strangles productivity.

Third, as any child who's played the game "Telephone" knows, whenever one person communicates a task to another, something gets lost in translation. This loss of context often leads to missing data or requirements that force rework and further delay. Finally, specialists never get the satisfaction of seeing a project through from beginning to end. As a result, they are less invested in the outcome and don't get the opportunity to optimize a solution to meet business needs.

Solving the Conundrum

So if both a data science assembly line and a solo data scientist lead to sub-optimal results, what should a chief data and analytics officer do?

The good news is that we've been down this path before. Data analysts have long been victims of an IT bottleneck, first waiting for IT to create custom reports in the 1990s, and then custom data sets and departmental data marts in the 2000s. Some companies have largely liberated data analysts from this constraint by giving them self-service tools and governed access to curated data sets via a data catalog. IT's role is no longer to deliver analysis but to facilitate it by focusing on IT-appropriate tasks, such as ingesting, cleaning, securing, standardizing, and lightly integrating data. (See "A Reference Architecture for Self-Service Analytics.")

Platform Abstractions. In the same way, we need to liberate data scientists by making them less dependent on IT, not more (i.e., not by surrounding them with a bevy of data engineers). One way to do that is to create a more robust data and applications infrastructure that abstracts the complexity of data engineering and deployment tasks. Jeff Magnusson, vice president of data platform at Stitch Fix, wrote a controversial and widely circulated article, "Engineers Shouldn't Write ETL: A Guide to Building a High-Functioning Data Science Department," arguing that companies should not have dedicated ETL or data engineers.

In a recent podcast with Eckerson Group, Magnusson says, “I’d rather focus good, strong engineers on building tools and abstractions to make ETL, data movement, and data science easier versus having those folks engineering each specific data pipeline that needs to get developed. And so, by creating those tools, that in turn empowers data scientists to take full ownership of their pipelines from data acquisition to production, and then they can control their iteration cycles, and that often increases velocity.”
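As a concrete illustration of what such an abstraction might look like (entirely hypothetical, not Stitch Fix's actual tooling), consider a minimal pipeline framework that platform engineers build once, letting data scientists register their own steps without writing plumbing code:

```python
from typing import Callable, List

class Pipeline:
    """Chains named steps so data scientists supply only the
    transformation logic, not the plumbing around it."""

    def __init__(self, name: str):
        self.name = name
        self.steps: List[Callable] = []

    def step(self, fn: Callable) -> Callable:
        """Register a function as the next stage of the pipeline."""
        self.steps.append(fn)
        return fn

    def run(self, data):
        # The framework owns sequencing; a real one would add retries,
        # logging, scheduling, and monitoring here.
        for fn in self.steps:
            data = fn(data)
        return data

# The data scientist's view: declare steps, never touch the plumbing.
churn = Pipeline("churn_features")

@churn.step
def acquire(_):
    # Stand-in for a source connector the platform team maintains.
    return [{"customer": "a", "orders": 3}, {"customer": "b", "orders": 0}]

@churn.step
def label(rows):
    return [dict(r, churned=(r["orders"] == 0)) for r in rows]

print(churn.run(None))
```

This is the division of labor Magnusson advocates: engineers own the framework, data scientists own the steps, and no one hand-codes each individual pipeline.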

Of course, the idea of building a common platform that abstracts underlying complexity has been common wisdom in data warehousing circles for decades. Most data warehousing teams have hordes of ETL developers building a single, centralized, business-modeled data warehouse designed to serve a multiplicity of current and future reporting needs, not the spreadmarts of individual analysts. The data warehouse abstracts the data so that regular business users armed with a BI tool can query it without deep knowledge of the underlying data structures or SQL.

This is not to say that data scientists should source data from a data warehouse; they can't use pre-formatted, aggregated data that is designed for business reporting, not machine learning. But they should be able to find relevant data sets and prepare them without deep technical knowledge. They need a library (or catalog) of existing data pipelines, data sets, and other artifacts that they can quickly search, modify, and use to do their work. And they need an elastic, easy-to-configure platform with built-in guardrails that enables them to test, run, and deploy data models without wreaking havoc on other users of the system or creating downstream errors and performance issues once they deploy a model.
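As a sketch of the "searchable library of artifacts" idea, here is a toy catalog with hypothetical names; a real data catalog adds lineage, governance, and access control on top of basic search:

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class CatalogEntry:
    """One registered artifact: a data set, pipeline, or model."""
    name: str
    owner: str
    tags: Set[str] = field(default_factory=set)

# Hypothetical entries registered by the platform team.
catalog: List[CatalogEntry] = [
    CatalogEntry("daily_orders", "data-eng", {"orders", "batch", "curated"}),
    CatalogEntry("web_clickstream", "data-eng", {"events", "streaming"}),
]

def search(keyword: str) -> List[CatalogEntry]:
    """Return entries whose name or tags contain the keyword."""
    return [e for e in catalog if keyword in e.name or keyword in e.tags]

print([e.name for e in search("orders")])  # -> ['daily_orders']
```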

Self-Service Data Science Tools. Most organizations can't afford to hire top-notch platform engineers to build their own self-service data science environments. Fortunately, many vendors have recognized the need and developed products that increase the productivity of data scientists and make data science easy enough for the average data analyst, or citizen data scientist.

A new class of data science tool, called AutoML, performs much of the manual heavy lifting involved in creating data science models. For example, DataRobot, the leader in this space, automatically creates the data pipeline required to generate a model using a specific algorithm. It also tests dozens of models simultaneously, handles all the processing and workloads behind the scenes, and provides one-click deployment and execution through its application programming interface (API), among other things. AutoML tools essentially abstract the math and infrastructure required to create, tune, and run analytic models.
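To illustrate the core pattern rather than DataRobot's product, here is a minimal sketch of the "test many models and rank them automatically" idea using scikit-learn; a commercial AutoML tool layers automated feature engineering, tuning, and deployment on top of this:

```python
# Fit several candidate models, score each the same way, and rank them.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "random_forest": RandomForestClassifier(n_estimators=200),
    "gradient_boosting": GradientBoostingClassifier(),
}

# Build a leaderboard: mean cross-validated accuracy per candidate.
leaderboard = sorted(
    ((cross_val_score(model, X, y, cv=5).mean(), name)
     for name, model in candidates.items()),
    reverse=True,
)

for score, name in leaderboard:
    print(f"{name}: {score:.3f}")
```

The leaderboard pattern, scoring every candidate identically and ranking the results, is the heart of what AutoML products automate at far greater scale.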

Also emerging are DataOps products from vendors such as Infoworks and StreamSets that greatly simplify the process of creating data pipelines from diverse sets of both streaming and batch data. These GUI-based tools enable citizen and professional data scientists alike to ingest, join, split, and model data without writing any code. Once a pipeline is designed, the tools automate job execution and monitor performance to ensure applications meet service-level agreements. In effect, they are ETL tools for the modern era of self-service data science and analysis.
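These tools are GUI-based, so there is no vendor code to show, but the pipeline logic they automate looks roughly like this hand-coded equivalent (file and column names are hypothetical):

```python
import pandas as pd

# Ingest: one batch extract per source system (stand-ins for connectors).
orders = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [20.0, 35.0, 15.0]})
customers = pd.DataFrame({"customer_id": [1, 2], "region": ["east", "west"]})

# Join: enrich orders with customer attributes.
enriched = orders.merge(customers, on="customer_id", how="left")

# Split: route rows to a different downstream target per region.
for region, frame in enriched.groupby("region"):
    frame.to_csv(f"orders_{region}.csv", index=False)  # one file per consumer
```

A DataOps tool replaces each of these steps with a configurable component and then schedules, executes, and monitors the whole flow.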

And cloud platform providers (Amazon Web Services, Microsoft Azure, and Google Cloud Platform) now offer a dizzying array of platform services designed to simplify the process of creating, running, and deploying data pipelines and analytic models.

Summary 

If your goal is to maximize the productivity of your expensive data scientists, think twice before hiring a bevy of data engineers. Although this approach works in theory, it struggles in practice: good data engineers are scarce, and handoffs between specialists breed delays and rework. It might be better to invest the money in a robust data and application infrastructure that abstracts the complexities of building and managing data pipelines and promotes self-service for data scientists.

[1] Jesse Anderson, "Data Engineers vs. Data Scientists," O'Reilly, April 11, 2018.

Wayne Eckerson
