An individual responsible for developing, deploying, managing, and monitoring data pipelines that ingest and transform data from one or more sources to one or more targets.
Data engineering performs the lifecycle of work involved in assembling data for analytics: architecting, designing, building, operating, and adapting the pipelines that ingest, process, and deliver data from source to consumer. While the discipline was born in the world of SQL, ETL, and data warehousing, data engineering has evolved in recent years to support data science. Many data engineers now execute projects that involve scripting in Python and R, often to apply advanced algorithms to semi-structured or unstructured data.
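The ingest, process, and deliver stages above can be sketched as three small Python functions. This is a minimal illustration with hypothetical field names (`customer`, `amount`) and an inline CSV string standing in for a real source; a production pipeline would read from databases, files, or streams and write to a target system.

```python
import csv
import io

def ingest(raw_csv):
    """Ingest: parse raw CSV text from a (hypothetical) source into row dicts."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows):
    """Process: clean string fields and cast amounts to numeric types."""
    return [
        {
            "customer": row["customer"].strip().title(),
            "amount": float(row["amount"]),
        }
        for row in rows
    ]

def deliver(rows):
    """Deliver: aggregate totals per customer for a downstream consumer."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals

raw = "customer,amount\n alice ,10.5\nBob,4.5\n alice ,2.0\n"
print(deliver(transform(ingest(raw))))  # {'Alice': 12.5, 'Bob': 4.5}
```

Real pipelines add the concerns the definition mentions, such as monitoring, scheduling, and error handling, but the source-to-consumer flow follows this same shape.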
Data engineering is a critically important part of analytics that receives little attention compared to data science. Recent research shows 12 times as many unfilled data engineer jobs as data scientist positions; the demand for data engineers clearly outstrips the supply, and the gap continues to grow. The large number of unfilled jobs reflects the complexity of data engineering, as the breadth and depth of required skills limit the number of qualified candidates. Breadth of knowledge ranges from relational databases to NoSQL, from batch ETL to data stream processing, and from traditional data warehousing to data lakes. Depth of skills includes hands-on work with Hadoop; programming in Java, Python, R, Scala, or other languages; and data modeling from relational and star-schema to document stores and graph databases. The data engineer is part database engineer (building the databases that implement data warehouses, data lakes, and analytic sandboxes) and part software engineer (building the processes, pipelines, and services that move data through the ecosystem and make it accessible to data consumers). One goal of data fabric is to automate much of data engineering to increase reuse and repeatability, and to expand data engineering capacity.
Data engineers do the heavy lifting required to create and manage the information supply chain; we used to call these individuals ETL developers and data architects. They identify source data, map data flows, model databases, define and monitor data transformation jobs, and work with database administrators to create, manage, and tune databases and optimize their performance. Some also design business views for business users, especially when those views are built within a database.