Big Data Requires Big ETL
Data is being generated in greater volumes than ever before thanks to new data sources such as sensors, application logs, Internet of Things (IoT) devices, and social networks. Furthermore, to drive business revenue and efficiency, IT is pressed to acquire ever more data, and to keep up with the ever-expanding storage and processing requirements that come with it.
As businesses look beyond the relational database for solutions to their big data challenges, Extract, Transform, Load (ETL) has become the next component of the analytic architecture poised for major evolution. Much of this new data is semi-structured or even unstructured, and constantly evolving data models are rendering the established tools for structured data processing nearly useless.
Because the majority of available tools were born in a world of “single server” processing, they cannot scale to the enormous and unpredictable volumes of incoming data we are experiencing today. We need to adopt frameworks that can natively scale across a large number of machines and elastically scale up and down based on processing requirements. Much like the conceptual tipping point that gave rise to the term “big data,” ETL has reached a similar inflection point. We therefore nominate the term “Big ETL” to describe this new era of ETL processing.
We’ll define Big ETL as having a majority of the following properties (much like the familiar 4 Vs):
- The need to process “really big data” – your data volume is measured in multiple terabytes or greater.
- The data includes semi-structured or unstructured types – JSON, Avro, etc.
- You are interacting with non-traditional data storage platforms – NoSQL databases, Hadoop, and distributed file and object stores (S3, Gluster, etc.).
Free Tools Change Everything
Unlike traditional ETL platforms, which are largely proprietary commercial products, the majority of Big ETL platforms are powered by open source, including Hadoop (MapReduce), Spark, and Storm. This is interesting for several reasons:
- Open-source projects are driven by developers from a large number of diverse organizations. This leads to new features that reflect a varied set of challenges across solutions and industries. It also creates a community of developers and users working together to improve the product.
- One of the most important features of an ETL platform is its ability to connect to a range of data platforms. Instead of waiting for a vendor to develop a new component, new integrations are developed by the community. If you need to connect a MapReduce pipeline to Redis, or build a Spark SQL job on top of Cassandra, no problem: chances are someone has already done this and open-sourced their work (see the sketch after this list). And if you don’t find what you need, you can build it and open source it yourself.
- Most importantly, the fact that these engines are open source (free) removes barriers to innovation. Organizations that have a great use case for processing big data are no longer constrained by expensive, proprietary enterprise solutions. By leveraging open-source technology and cloud-based architecture, cutting-edge systems can be built at a very low cost.
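To make the connector point above concrete, here is a minimal sketch of a Spark SQL query over a Cassandra table using the community-built spark-cassandra-connector (an open-source project, not part of Spark itself). The keyspace, table, and column names are hypothetical, and the connector package would need to be supplied to the cluster (for example via `spark-submit --packages`).

```python
# Sketch: Spark SQL over a Cassandra table via the open-source
# spark-cassandra-connector. Keyspace, table, and column names are made up.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cassandra-etl-sketch")
    # Contact point for the Cassandra cluster (assumption: local node).
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()
)

# Load a Cassandra table as a DataFrame through the connector's data source.
events = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="analytics", table="page_views")
    .load()
)

# Register the DataFrame and aggregate it with Spark SQL.
events.createOrReplaceTempView("page_views")
daily = spark.sql(
    "SELECT view_date, COUNT(*) AS views "
    "FROM page_views GROUP BY view_date"
)
daily.show()
```

The point is less the specific query than the fact that the integration itself came from the community rather than from an ETL vendor’s release cycle.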
Return to Command Line
Unlike traditional ETL tools, most of the tooling for Big ETL does not rely on GUI-based development. Those familiar with the traditional landscape will recognize that almost all legacy ETL tools use a palette-based, drag-and-drop development environment. In the new world of Big ETL, development is largely accomplished by coding against the platform’s APIs or through high-level Domain-Specific Languages (DSLs). These DSLs include Hive, a SQL-like framework for developing big data processing jobs, and Pig, a multipurpose procedural data-flow language.
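To illustrate the contrast with drag-and-drop tooling, here is a minimal sketch of an extract-transform-load flow expressed in code, once with Spark’s DataFrame API and once as a Hive-style SQL query run through Spark. The bucket path and column names are hypothetical.

```python
# Sketch: the kind of flow a GUI ETL tool would model with a palette of
# boxes, expressed against Spark's APIs. Paths and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("code-vs-gui-sketch").getOrCreate()

# Extract: read semi-structured JSON application logs.
logs = spark.read.json("s3a://example-bucket/app-logs/*.json")

# Transform: filter, derive a column, and aggregate with the DataFrame API.
errors_by_service = (
    logs.filter(F.col("level") == "ERROR")
        .withColumn("day", F.to_date("timestamp"))
        .groupBy("service", "day")
        .count()
)

# The same transform expressed in a SQL-like DSL, Hive-style.
logs.createOrReplaceTempView("logs")
errors_by_service_sql = spark.sql("""
    SELECT service, to_date(timestamp) AS day, COUNT(*) AS errors
    FROM logs
    WHERE level = 'ERROR'
    GROUP BY service, to_date(timestamp)
""")

# Load: write the result for downstream consumers.
errors_by_service.write.mode("overwrite").parquet(
    "s3a://example-bucket/marts/errors_by_service/"
)
```

A handful of lines of code, or a handful of SQL, stands in for what a graphical tool would represent as a canvas of connected components – and it can be versioned, tested, and reused like any other code.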
To some extent, this return to coding is a reflection of the landscape’s maturity. As time goes on, we will likely see more GUI-based development environments appear. However, the movement increasingly calls into question the value proposition of traditional graphical ETL tools. Is graphical development always more efficient than writing code? Does it create obstacles when we need to represent highly customized and complex data flows? Has the promise of metadata and reusability been fully delivered by GUI-based tools? And, finally, are these concepts just as achievable, and just as central, in code?
With the advent of YARN, Hadoop 2’s new resource manager, we will see an increasing number of legacy tools add big data processing capabilities, but the movement toward code will definitely continue.
More Pre-processing Required
Because NoSQL stores and Hadoop cannot efficiently perform ad-hoc joins and aggregations, more ETL is required to pre-compute data in the form of new data sets or materialized views that support end-user query patterns. In the NoSQL world, it is common to see the same event appear in several rows and/or collections, each aggregated by different dimensions and at different levels of granularity.
Access to most dimensional data must be denormalized into the related events or facts. This means Big ETL now carries an increased workload in materializing these additional views. Additionally, process orchestration, error recovery, and data quality become more critical than ever to ensure that the required data redundancy does not introduce anomalies.
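A minimal sketch of what this pre-computation might look like, again in PySpark with hypothetical table, path, and column names: order facts are denormalized against a customer dimension, then materialized into several views, each aggregated at a different grain to match an expected query pattern.

```python
# Sketch: denormalize events against a dimension table and materialize
# several pre-aggregated views, one per expected query pattern.
# Table names, paths, and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("materialize-views-sketch").getOrCreate()

orders = spark.read.parquet("s3a://example-bucket/raw/orders/")
customers = spark.read.parquet("s3a://example-bucket/raw/customers/")

# Denormalize: fold the customer dimension into each order fact.
order_facts = orders.join(customers, on="customer_id", how="left")

# Materialize the same events at several grains, since the NoSQL target
# cannot join or aggregate them ad hoc at query time.
by_day = order_facts.groupBy("order_date").agg(F.sum("amount").alias("revenue"))
by_region = order_facts.groupBy("region", "order_date").agg(
    F.sum("amount").alias("revenue")
)
by_customer = order_facts.groupBy("customer_id").agg(
    F.count("*").alias("orders"),
    F.sum("amount").alias("lifetime_value"),
)

for name, view in [("by_day", by_day), ("by_region", by_region),
                   ("by_customer", by_customer)]:
    view.write.mode("overwrite").parquet(
        f"s3a://example-bucket/views/orders_{name}/"
    )
```

Each extra view is more work for the pipeline and one more copy of the data to keep consistent, which is exactly why orchestration and data quality checks grow in importance.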
The bottom line: not only do we have to process enormous data volumes, we also have to process them more extensively, and take greater care with data quality and data monitoring.
These are exciting times in data management. Innovation is the only sustainable competitive advantage a company can have, and we have seen unprecedented technology breakthroughs over the past few years. IT departments are busy enabling businesses with opportunities that were unimaginable just a few years ago, the open-source community is driving innovation at a mind-bending pace, and Big ETL will continue to tackle new and exciting data management challenges, displacing brittle legacy architectures throughout the enterprise.
This article was co-authored by Joe Caserta and Elliot Cordo and a version of this was originally published February 9, 2015 on data-informed.com.