Sparking Data Warehouse Modernization

Apache Spark has caught fire since its inception at UC Berkeley ten years ago. This distributed, in-memory data processing platform has snuffed out the batch-oriented, disk-based MapReduce as a preferred engine for Hadoop, cloud object store data lakes and Kafka streaming platforms. More surprisingly, it is becoming a fixture in the world of data warehousing, replacing traditional extract, transform, load (ETL) tools as a means to populate modern cloud data warehouses.

A primary reason is performance. By processing data mostly in memory, Spark avoids unnecessary I/O round trips to disk, cutting latency by orders of magnitude compared with its predecessor. Another reason is developer friendliness. Spark’s flexible support for the Scala, Java, Python and R programming languages has encouraged many coding teams to build it into their data lake and streaming environments to address advanced analytic use cases such as machine learning. For batch or streaming workloads, Spark is increasingly the engine of choice, deployed by a third or more of enterprises.

Many EDW teams now run ETL jobs on Spark to improve performance and meet new analytics requirements from the business. This is particularly appealing for new cloud options such as Snowflake’s SQL data warehouse, which has a close partnership with the Spark platform and notebook provider Databricks. When you need fast, efficient analysis of large and growing datasets, some of which require ACID-compliant handling, a Spark-EDW combination can make sense.

As with any initiative, Spark-EDW rollouts require disciplined planning and execution. This article offers guidelines based on the successes and lessons learned from numerous enterprise deployments. I’d welcome your thoughts and feedback.

  • Carefully define your business use case. The first question to answer is not how to design your data architecture, or whether and how to use Spark, but rather what decisions business users want to make. Do they need to tailor next-best-offer recommendations for online shoppers, reduce the risk of fraud or support operational reporting dashboards? Specifying your analytics problem shapes everything that follows: architectural design, the implementation process, Spark configuration parameters and so on. (For insights on the interplay between “business discovery” and your data strategy, check out Dave Wells’ new blog “The Scope and Complexities of Data Strategy.”)
  • Inventory your necessary datasets. Each of the use cases above maps well to data warehouse structures, and each requires a distinct combination of datasets and data types. Inputs for a next-best-offer recommendation, triggered by a purchase event, might include that customer’s online and in-store purchase history, their clickstream history on the vendor’s website and pre-established purchase patterns of customers with similar profiles. Spark can load and transform those pieces into a unified, commonly formatted dataset for real-time recommendations or periodic analytics within EDWs like Snowflake (see the first sketch after this list). Adding other inputs to the mix, such as real-time tracking of the customer’s mobile purchases or app usage, might require new collectors or additional components.
  • Consider the role of advanced algorithms. Machine learning can improve the accuracy of each of the use cases above by learning to refine its predictions from selected, frequently tuned features or variable inputs (see the feature-pipeline sketch after this list). The more data and processing power you feed into ML, the more effective it will be, which has significant implications for overall architectural design and planning. (Also see the article I wrote with Jordan Martz, “Data Management Best Practices for Machine Learning.”)
  • Select your data warehouse. Spark can layer onto traditional systems to speed up what you have today. But the benefits might be higher with modern cloud-based EDWs such as Snowflake and Azure SQL DW, which many enterprises already are embracing to better absorb fast-rising data volumes, variety and velocity. These platforms offer advantages in resource elasticity, decoupled compute and storage scalability, and pricing economics that serve large Spark-friendly workloads quite efficiently.
  • Define the role of Spark in ETL. Spark’s performance and throughput advantages make it well suited to ETL on large datasets. Spark can accelerate ETL processing, for example by combining and executing the three tasks (extract, transform and load) in memory with a single set of coding commands (see the ETL sketch after this list). It also can streamline ELT, in which the data is extracted, loaded and then transformed once it arrives at the target. In addition, Spark can help with data discovery, identifying patterns and classifications in big data sources before extraction and/or when the data is consolidated at the target. All of this helps structure and govern new data types more efficiently in an EDW environment.
  • Outsource the integration work where possible. You also might consider pre-packaged options such as Azure Databricks, which performs many cloud, EDW and Spark integration steps for you out of the box. While vendor Spark distributions might lag the very latest features from the Apache community, the additional testing they undergo often makes them more stable. Databricks provides a vetted Spark solution, developer notebooks, ML frameworks, pipeline productivity tools, training and API integration with a variety of environments. (Also check out my article, “Best Practices for Real Time Data Pipelines with Change Data Capture and Spark.”)
  • Plan on a phased implementation. Like any ambitious data initiative, Spark EDW projects (either on-premises or cloud-based) need a methodical, iterative approach that accounts for a steep learning curve. If you choose to modernize on a cloud-based DW platform, this might require migrating on-premises data and metadata into your new DW, then creating, adapting or porting over your existing ETL framework. Many organizations start with a test pilot, then expand their cloud datasets and use cases over time. Machine learning entails multiple test phases, in which the software learns from training data, then progressively applies predictions to production data. Each of these steps carries resource and time requirements.
  • Define your skill and training requirements. Most data warehouse managers have long experience with ETL and SQL scripting, so they can more easily use the Spark SQL module to execute SQL queries (see the Spark SQL sketch after this list). But these traditional DW practitioners are less familiar with Java, Scala, Python and R, the primary languages needed to develop Spark jobs. In many cases they will need new training or new personnel such as data scientists to apply Spark to the data warehouse.
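
To make the dataset-inventory guideline concrete, here is a minimal PySpark sketch that joins hypothetical purchase-history, clickstream and lookalike-pattern datasets into one commonly formatted DataFrame. The paths, join keys and column names are illustrative assumptions, not a reference implementation.

```python
# Minimal PySpark sketch: unify hypothetical purchase, clickstream and
# lookalike-pattern datasets into one commonly formatted DataFrame.
# Paths, join keys and column names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nbo-dataset-inventory").getOrCreate()

purchases   = spark.read.parquet("s3://lake/sales/purchase_history/")
clickstream = spark.read.json("s3://lake/web/clickstream/")
patterns    = spark.read.parquet("s3://lake/analytics/lookalike_patterns/")

unified = (
    purchases
    .join(clickstream, on="customer_id", how="left")   # add web behavior
    .join(patterns, on="segment_id", how="left")        # add lookalike patterns
    .withColumn("purchase_ts", F.to_timestamp("purchase_ts"))
    .select("customer_id", "segment_id", "product_id",
            "purchase_ts", "page_views", "propensity_score")
)
```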
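
For the machine learning point, the feature-pipeline sketch below shows one way to express selected, frequently tuned features in Spark MLlib. The input table, feature columns and the choice of a logistic regression fraud model are assumptions for illustration only.

```python
# Illustrative Spark MLlib sketch: assemble assumed features and fit a simple
# fraud-scoring model. The table, columns and estimator choice are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("fraud-features").getOrCreate()
labeled_txns = spark.read.parquet("s3://lake/risk/labeled_transactions/")

feature_cols = ["txn_amount", "txn_count_24h", "avg_basket_size",
                "days_since_last_login"]

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=feature_cols, outputCol="raw_features"),
    StandardScaler(inputCol="raw_features", outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="is_fraud"),
])

train, test = labeled_txns.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)          # learns from training data
scored = model.transform(test)       # applies predictions to held-out data
```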
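
The ETL sketch below strings extract, transform and load together in one set of PySpark commands, writing the curated result to Snowflake through the spark-snowflake connector. It assumes the connector is on the classpath; the JDBC source, credentials and table names are placeholders.

```python
# ETL sketch: extract from an operational database, transform in memory,
# load into Snowflake. Assumes the spark-snowflake connector is available;
# connection details and table names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-etl").getOrCreate()

# Extract: pull raw orders from an operational source over JDBC.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://ops-db:5432/sales")
          .option("dbtable", "public.orders")
          .option("user", "reporting").option("password", "<password>")
          .load())

# Transform: cleanse and aggregate in memory.
daily = (orders
         .filter(F.col("status") == "COMPLETE")
         .groupBy("order_date", "region")
         .agg(F.sum("amount").alias("revenue"),
              F.countDistinct("customer_id").alias("customers")))

# Load: write the curated result into the warehouse.
sf_options = {"sfURL": "<account>.snowflakecomputing.com", "sfUser": "etl_user",
              "sfPassword": "<password>", "sfDatabase": "ANALYTICS",
              "sfSchema": "PUBLIC", "sfWarehouse": "ETL_WH"}
(daily.write.format("net.snowflake.spark.snowflake")
      .options(**sf_options)
      .option("dbtable", "DAILY_REVENUE")
      .mode("overwrite")
      .save())
```

For an ELT pattern, the same job could land the raw orders in a staging table first and run the aggregation as SQL inside the warehouse.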
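
Finally, the Spark SQL sketch below shows how a SQL-fluent warehouse team can register a DataFrame as a temporary view and query it with familiar SQL rather than DataFrame code. The source path and column names are again illustrative.

```python
# Spark SQL sketch: query a DataFrame with plain SQL via a temporary view.
# The source path and column names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-on-spark").getOrCreate()

spark.read.parquet("s3://lake/sales/orders/").createOrReplaceTempView("orders")

daily_revenue = spark.sql("""
    SELECT order_date, region, SUM(amount) AS revenue
    FROM   orders
    WHERE  status = 'COMPLETE'
    GROUP  BY order_date, region
    ORDER  BY order_date
""")
daily_revenue.show(10)
```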

It’s hard to overstate the upside of modernizing your data warehouse environment with Spark, which explains why so many enterprises are going down this path. Their lessons learned along the way, summarized here, can be useful signposts for those just getting started.


Kevin Petrie

Kevin is the VP of Research at Eckerson Group, where he manages the research agenda and writes about topics such as data integration, data observability, machine learning, and cloud data...
