Data Engineering Coming of Age
The data scientist gets much attention as an important role in the age of analytics. Equally important, but with less trumpeting, is the role of data engineer. The data scientist finds meaning and insights in data. Data engineers design and build the data ecosystem that is essential to analytics. Data engineers are responsible for the databases, data pipelines, and data services that are prerequisites to data analysis and data science.
Why all the hype for data science with so much less attention to data engineering? Perhaps it is because Harvard Business Review hasn’t declared data engineer the sexiest job of the century. Maybe it is because business executives see data scientist results while data engineering results are less visible, viewed as infrastructure or “backroom stuff.” Regardless of hype levels, however, data engineering has come of age. At a recent conference session on data architecture, Michele Goetz of Forrester Research reported finding twelve times as many unfilled data engineering jobs as data science jobs.
I believe the large number of unfilled jobs reflects the complexity of data engineering. Data engineer is a demanding job that requires both breadth and depth. Breadth of knowledge ranges from relational databases to NoSQL, from batch ETL to data stream processing, and from traditional data warehousing to data lakes. Depth of skills includes hands-on work with Hadoop, programming in Java or Python, and data modeling from star schema to document stores and graph databases. The data engineer is part database engineer, building the databases that implement data warehouses and data lakes, and part software engineer, building the processes, pipelines, and services that move data through the ecosystem and make it accessible to data consumers. (See figure 1.)
Figure 1 – Data Engineering Activities and Deliverables
Breadth, depth, and diversity of skills indicate that data engineering is better undertaken as a team responsibility than as a solo activity. No individual can be the expert in all areas of data engineering. A team of people with complementary skills is a more practical approach than data engineering silos. Data engineers need to be team players and they also need to be collaborators working with data scientists, architects, stewards, and subject matter experts (SMEs) as shown in figure 2. Data engineering teams also collaborate with data curators, data governance teams, BI and data warehousing teams, and database administrators (DBAs).
Figure 2 – Data Engineers as Collaborators
Data science and data engineering work together, with the data engineer organizing data and optimizing it for specific use cases. This frees the scientists to focus on what they can discover and learn from the data. Data engineers provide databases, datasets, and data services for data scientists, and also for other data consumers including data analysts, business analysts, and report writers. To be highly effective in meeting the needs of this diverse group of people, engineers must be capable with structured and unstructured data, with SQL and NoSQL databases, with data lakes and data warehouses, with high-latency and real-time data, with data architecture and system architecture, and with design and programming.
I reviewed dozens of data engineer job postings to build a list of the 24 most common skills needed. From that list I created a Data Engineering Skills Assessment and Gap Analysis tool that you can use to self-assess your capabilities (available for download at the top right of this page). Provide responses about your experience and education and your organization’s level of need for each skill. The tool will then calculate skill level and skills gap information. High skill areas show your strengths. Wide gaps show the areas where you should pursue more education and experience. Narrow gaps show the areas where you’ll make the greatest contributions to a data engineering team.
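To make the idea of a skills gap concrete, here is a minimal sketch of how such a calculation might work. It is not the downloadable tool's actual formula, which this article does not spell out. The 0 to 5 rating scale, the averaging of experience and education into a skill level, and the names `SkillResponse`, `skill_level`, `skill_gap`, and `assess` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SkillResponse:
    """One self-assessment answer (hypothetical structure, assumed 0-5 scale)."""
    skill: str
    experience: int  # self-rated hands-on experience, 0-5
    education: int   # self-rated training or education, 0-5
    need: int        # organization's level of need for the skill, 0-5


def skill_level(r: SkillResponse) -> float:
    # Assumption: skill level is the plain average of experience and education.
    return (r.experience + r.education) / 2


def skill_gap(r: SkillResponse) -> float:
    # Assumption: gap is how far need exceeds skill level, never below zero.
    return max(r.need - skill_level(r), 0.0)


def assess(responses: list[SkillResponse]) -> list[tuple[str, float, float]]:
    """Return (skill, level, gap) rows, widest gap first."""
    rows = [(r.skill, skill_level(r), skill_gap(r)) for r in responses]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Made-up sample responses, for illustration only.
responses = [
    SkillResponse("SQL", experience=5, education=4, need=5),
    SkillResponse("Stream processing", experience=1, education=2, need=4),
]
report = assess(responses)
for skill, level, gap in report:
    print(f"{skill}: level={level:.1f}, gap={gap:.1f}")
```

In this sketch the widest gap sorts to the top, which matches the reading above: wide gaps point to areas for more education and experience, while narrow gaps mark the skills where you can contribute right away.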