A data integration process that pulls data from one or more source systems, loads it into a target repository, and then transforms it using the repository's own engine. It contrasts with traditional ETL, which transforms data on an external server before loading it into the target.
Spark can accelerate ETL processing, for example by combining and executing the three tasks (extract, transform, and load) in-memory with a single set of commands. It can also streamline ELT, in which the data is extracted, loaded, and then transformed once it arrives at the target. In addition, Spark can help profile big data sources, identifying data patterns and classifications, either before extraction or after the data is consolidated at the target.
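The single-pass, in-memory idea can be sketched without Spark itself. The following is a minimal plain-Python illustration, not Spark code: generators chain extract, transform, and load into one fused pipeline with no intermediate staging files, loosely analogous to how Spark's DataFrame API executes the three steps together in memory. All function and field names here are illustrative assumptions.

```python
def extract(rows):
    """Yield raw records from a hypothetical source (here, an in-memory list)."""
    yield from rows

def transform(records):
    """Normalize each record as it streams past; nothing is staged to disk."""
    for rec in records:
        yield {"id": rec["id"], "name": rec["name"].strip().title()}

def load(records, target):
    """Append transformed records to the target store."""
    target.extend(records)
    return target

source = [{"id": 1, "name": "  ada lovelace "}, {"id": 2, "name": "ALAN TURING"}]
warehouse = []

# One fused pipeline: the three steps are expressed once and executed together,
# rather than as three separate batch jobs with intermediate copies.
load(transform(extract(source)), warehouse)
```

Because the stages are lazy generators, each record flows through all three steps before the next is read, which is the latency-reducing property the paragraph above attributes to in-memory execution.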
Extract-Load-Transform (ELT) is a variation on ETL conceived to offset some of the latency of pure ETL processing. Waiting for all transformation work to complete delays the availability of data for business use; loading immediately after extraction and then transforming the data in place reduces that delay. ELT improves data availability most when multiple sources are not ready for extraction at the same time. With ETL, early-arriving sources would be held in a staging area until all were ready; with ELT, the data warehouse itself serves as the staging area, so each source becomes available for use as soon as it is extracted and loaded. Data transformation is then performed in place in the warehouse once all sources are loaded. Data quality and data privacy are sometimes a concern with ELT processing: when data is made available for use before transformation, it is widely exposed without first undergoing data cleansing or the masking or obfuscation of sensitive data.
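The load-then-transform-in-place pattern can be sketched as follows, using Python's built-in sqlite3 module as a stand-in for the warehouse engine (table names, column names, and the masking rule are all illustrative assumptions, not a specific product's API). Raw extracts land in the warehouse untouched; the transformation, including the sensitive-data masking the paragraph above warns should not be skipped, is then performed by the repository's own SQL engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the target warehouse
cur = conn.cursor()

# Load: the raw extract goes straight into the warehouse, no external staging.
cur.execute("CREATE TABLE raw_customers (id INTEGER, email TEXT, name TEXT)")
cur.executemany(
    "INSERT INTO raw_customers VALUES (?, ?, ?)",
    [(1, "ada@example.com", "  ada "), (2, "alan@example.com", "ALAN")],
)

# Transform: performed in place by the repository's engine once sources are
# loaded. Masking the email column addresses the privacy exposure of serving
# raw, untransformed data.
cur.execute("""
    CREATE TABLE customers AS
    SELECT id,
           substr(email, 1, 1) || '***@'
               || substr(email, instr(email, '@') + 1) AS email_masked,
           trim(name) AS name
    FROM raw_customers
""")
conn.commit()

rows = cur.execute(
    "SELECT id, email_masked, name FROM customers ORDER BY id"
).fetchall()
```

Until the `customers` table is built, only `raw_customers` exists, which is exactly the exposure window the data quality and privacy caveat describes.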