A data integration process that pulls data from one or more source systems, transforms it on an external server, and then loads it into a target repository. This can be accomplished in batches or via streaming.
Extract-Transform-Load (ETL)... is the most widely used data pipeline pattern. From the early 1990s it was the de facto standard for integrating data into a data warehouse, and it continues to be a common pattern for data warehousing, data lakes, operational data stores, and master data hubs. Data is extracted from a data store such as an operational database, then cleansed, standardized, and integrated in the transform step before being loaded into a target database. ETL processing is executed as scheduled batch processing, and data latency is inherent in batch processing. Mini-batch and micro-batch processing help to reduce data latency, but zero-latency ETL is not practical. ETL works well when complex data transformations are required. It is especially well-suited for data integration when not all data sources are ready at the same time. As each individual source becomes ready, it is extracted independently of the other sources. When all source data extracts are complete, processing continues with the transformation and loading of the entire set of data.
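The flow described above (extract each source independently, then transform and load the combined set) can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference implementation: the source names, the customer table, the field names, and the deduplicate-by-email rule are all assumptions made for the example, and an in-memory SQLite database stands in for the target.

```python
import sqlite3

def extract(source_rows):
    """Extract: copy the raw records out of one source (here, a plain list)."""
    return list(source_rows)

def transform(staged):
    """Transform: cleanse, standardize, and integrate records from all sources."""
    merged = {}
    for source_name, rows in staged.items():
        for row in rows:
            name = row["name"].strip().title()    # standardize the name
            email = row["email"].strip().lower()  # standardize the key
            if not email:                         # cleanse: skip rows with no key
                continue
            # Integrate: one record per email; later sources overwrite earlier ones.
            merged[email] = (email, name, source_name)
    return sorted(merged.values())

def load(conn, records):
    """Load: write the transformed records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer "
        "(email TEXT PRIMARY KEY, name TEXT, source TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO customer VALUES (?, ?, ?)", records)
    conn.commit()

def run_etl(sources, conn):
    # Each source is extracted independently, as it becomes available.
    staged = {name: extract(rows) for name, rows in sources.items()}
    # Only when every extract is complete does transformation and loading begin.
    records = transform(staged)
    load(conn, records)
    return len(records)

# Hypothetical sample data for two source systems.
sources = {
    "crm": [
        {"name": " ada lovelace ", "email": "Ada@Example.com"},
        {"name": "no email", "email": " "},
    ],
    "billing": [
        {"name": "ADA LOVELACE", "email": "ada@example.com "},
        {"name": "alan turing", "email": "alan@example.com"},
    ],
}
conn = sqlite3.connect(":memory:")
loaded = run_etl(sources, conn)
```

Keeping extract separate from transform is what lets sources arrive on different schedules: in a real batch job the staged extracts would typically be written to files or staging tables rather than held in memory.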
We’ll define Big ETL as having a majority of the following properties (much like the familiar 4 Vs): (1) The need to process “really big data”: your data volume is measured in multiple terabytes or greater. (2) The data includes semi-structured or unstructured types: JSON, Avro, etc. (3) You are interacting with non-traditional data storage platforms: NoSQL, Hadoop, and other distributed file systems (S3, Gluster, etc.).
Simply speaking, ETL aimed to displace hand-written code for populating data warehouses (and marts) with automated procedures, both in the initial build and, more importantly, in the ongoing changes needed as business needs evolved. Procedures are designed and edited through a graphical drag-and-drop interface. Metadata describing the steps and actions is stored and reused. This metadata either drives a data processing engine directly or is transformed into code prior to execution.
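To make the idea of metadata driving a processing engine concrete, here is a minimal sketch. The step format, the operation names (rename, uppercase, filter_not_null), and the sample rows are hypothetical and do not come from any particular ETL product; the point is only that the pipeline is data, and a small generic engine interprets it.

```python
# Hypothetical step metadata, the kind of description a graphical designer
# might store and reuse instead of hand-written transformation code.
PIPELINE = [
    {"op": "rename", "from": "cust_nm", "to": "name"},
    {"op": "uppercase", "field": "country"},
    {"op": "filter_not_null", "field": "name"},
]

def op_rename(rows, step):
    """Rename one column in every row."""
    return [
        {(step["to"] if key == step["from"] else key): value
         for key, value in row.items()}
        for row in rows
    ]

def op_uppercase(rows, step):
    """Upper-case the value of one text column in every row."""
    field = step["field"]
    return [{**row, field: row[field].upper()} for row in rows]

def op_filter_not_null(rows, step):
    """Keep only rows where the given column is present and not None."""
    return [row for row in rows if row.get(step["field"]) is not None]

# The engine's registry: maps an operation name in the metadata to code.
OPERATIONS = {
    "rename": op_rename,
    "uppercase": op_uppercase,
    "filter_not_null": op_filter_not_null,
}

def run(pipeline, rows):
    """Generic engine: apply each metadata step to the rows, in order."""
    for step in pipeline:
        rows = OPERATIONS[step["op"]](rows, step)
    return rows

rows = [
    {"cust_nm": "Ada", "country": "uk"},
    {"cust_nm": None, "country": "us"},
]
result = run(PIPELINE, rows)
```

Changing the pipeline then means editing the metadata rather than the engine, which is the property that makes ongoing changes cheaper than maintaining hand-written code. The alternative mentioned above, generating code from the metadata before execution, would replace the run loop with a code generator over the same step list.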