Striking the Right Balance with the Modern Cloud Data Platform
ABSTRACT: The cloud data platform must strike a balance on three dimensions: suite vs. extensibility, standardization vs. customization, and stability vs. agility.
To understand the role of the cloud data platform, picture a seesaw with kids piling onto each side. It’s not easy to keep everyone in balance—and out of the dirt.
The cloud data platform is a combination of data warehouse and data lake elements that supports workloads ranging from business intelligence (BI) and data science to data-driven applications. Like a data warehouse, it queries and transforms data, delivers query outputs to BI tools, and secures and governs data usage. Like a data lake, it stores myriad types of data objects and applies data science models to those objects. Cloud data platforms, also known as data lakehouses, enable high performance and scalability on elastic cloud infrastructure.
Needless to say, that’s a lot for one platform to support.
Data engineers, data analysts, data scientists, and developers all pile onto this seesaw. To support them all, the cloud data platform needs to strike a careful balance. First, it needs to combine the benefits of a software suite (all-in-one) and extensibility. Second, it needs to offer both standard and custom approaches. Third, it needs to give data teams both stability and agility.
This blog explores these three dichotomies to help data teams select a cloud data platform that satisfies their complex requirements. We start with suite vs. extensibility.
1. Suite vs. Extensibility
Suite
The cloud data platform should integrate the basic functionality of data management into a suite. Like a data warehouse, it should enable data engineers to build tables, validate records, and track lineage. Like a data lake, it should help apply multiple analytics methods to its wide-ranging datasets. It should execute structured query language (SQL) commands to retrieve data for BI tools, Python programs to run machine learning models, and Java programs to feed data into mobile applications. And the suite continues to expand. Cloud data platforms now support the full lifecycle for machine learning projects, including data and feature engineering, model development, and model production.
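To make the suite idea concrete, here is a minimal sketch of one store serving both a SQL query for BI and a Python routine for data science. It assumes an in-memory SQLite database as a stand-in for the platform; the `orders` table, its sample rows, and the function names are hypothetical and chosen only for illustration.

```python
import sqlite3
from statistics import mean


def build_store():
    """Create an in-memory database that stands in for the platform's storage."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("east", 100.0), ("east", 140.0), ("west", 80.0), ("west", 120.0)],
    )
    return conn


def bi_revenue_by_region(conn):
    """BI-style workload: an aggregate SQL query a dashboard might issue."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    ).fetchall()
    return dict(rows)


def naive_forecast(conn, region):
    """Data-science-style workload: a toy 'model' that predicts the next
    order amount as the mean of past orders in the region."""
    cursor = conn.execute("SELECT amount FROM orders WHERE region = ?", (region,))
    return mean(row[0] for row in cursor)


conn = build_store()
print(bi_revenue_by_region(conn))
print(naive_forecast(conn, "east"))
```

Both workloads read the same table, which is the point of a suite: no copy of the data has to move between a warehouse and a separate data science environment.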
Extensibility
However, the cloud data platform cannot be all things to all people. It must also enable data teams to add functionality by integrating with an ecosystem of best-of-breed tools. This ecosystem includes a variety of commercial and open-source offerings. For example:
● Workflow tools such as Airflow orchestrate analytical and operational workflows.
● Data pipeline tools such as Fivetran and dbt help ingest and transform data.
● Libraries such as PyTorch and TensorFlow provide the building blocks that data scientists use to define and train AI/ML models.
● Notebooks such as Jupyter and Zeppelin help develop and customize AI/ML algorithms.
● Feature stores such as Tecton and Rasgo help define AI/ML features.
Data teams should select a cloud data platform whose data management suite still offers this level of extensibility.
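As a rough illustration of what a workflow tool contributes, the sketch below orders a handful of tasks by their dependencies using Python's standard `graphlib` module. It is not Airflow's API; the task names and the dependency graph are hypothetical, and a real orchestrator adds scheduling, retries, and monitoring on top of this ordering step.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
WORKFLOW = {
    "ingest": set(),
    "transform": {"ingest"},
    "train_model": {"transform"},
    "refresh_dashboard": {"transform"},
}


def execution_order(workflow):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(workflow).static_order())


order = execution_order(WORKFLOW)
print(order)
```

The platform's extensibility matters here because the orchestrator sits outside the platform yet has to trigger and observe work that runs inside it.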
2. Standardization vs. Customization
Standardization
The cloud data platform should help data teams standardize datasets, data pipelines, and consumption patterns to make data management more efficient. It helps standardize data by performing transformation tasks such as joining tables, filtering out unneeded data, converting file formats, or tagging semi-structured and unstructured data. It helps standardize pipelines by automating the configuration, execution, and monitoring of ETL commands, the transformation (T) step in particular. In addition, the cloud data platform helps create standard consumption patterns, using application programming interfaces (APIs) to deliver data to the BI tools, AI/ML models, and applications that consume it. In these ways, data teams get standard artifacts they can use, adapt, and reuse as they scale the business.
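The transformation tasks named above (joining tables, filtering out unneeded data, converting file formats) can be sketched in a few lines of standard-library Python. The CSV contents, column names, and the `standardize` function are assumptions made up for this example; a real platform would run the equivalent logic at scale in SQL or a distributed engine.

```python
import csv
import io
import json

# Hypothetical source extracts in CSV form.
CUSTOMERS_CSV = (
    "customer_id,name,status\n"
    "1,Acme,active\n"
    "2,Globex,inactive\n"
    "3,Initech,active\n"
)
ORDERS_CSV = (
    "order_id,customer_id,amount\n"
    "10,1,250.00\n"
    "11,2,75.50\n"
    "12,3,120.25\n"
    "13,1,30.00\n"
)


def read_csv(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def standardize(customers_csv, orders_csv):
    """Join orders to customers, drop inactive customers, and emit JSON Lines."""
    # Filter: keep only active customers, indexed by id for the join.
    active = {
        c["customer_id"]: c
        for c in read_csv(customers_csv)
        if c["status"] == "active"
    }
    records = []
    for order in read_csv(orders_csv):
        customer = active.get(order["customer_id"])
        if customer is None:
            continue  # order belongs to a filtered-out customer
        # Join and cast string fields to typed values.
        records.append(
            {
                "order_id": int(order["order_id"]),
                "customer": customer["name"],
                "amount": float(order["amount"]),
            }
        )
    # Convert format: one JSON object per line.
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


print(standardize(CUSTOMERS_CSV, ORDERS_CSV))
```

Once a transformation like this is written, it becomes one of the standard artifacts the section describes: the same function can be reused for each new extract that follows the same layout.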
Customization
However, the cloud data platform also needs to support customization. Expert data teams need to ingest and transform new datasets to address new use cases that the business demands. The cloud data platform must support custom-scripted pipelines that address these new use cases, and deliver data for consumption by custom models or applications. To ensure they get this, data teams should select a cloud data platform that supports various programming languages, as well as development frameworks that help build, test, revise, deploy, and roll back different versions of pipeline code. They also need to maintain open access to the best-of-breed ecosystem.
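One way to picture deploying and rolling back versions of pipeline code is a small registry that tracks which version is active. The `PipelineRegistry` class below is a hypothetical sketch, not any vendor's framework; real development frameworks tie this to source control, testing, and deployment tooling.

```python
class PipelineRegistry:
    """Keeps every deployed version of a row-level transform and tracks
    which one is currently active."""

    def __init__(self):
        self._versions = []  # list of (label, transform) in deploy order
        self._active = None  # index into self._versions, or None if empty

    def deploy(self, label, transform):
        """Register a new version and make it the active one."""
        self._versions.append((label, transform))
        self._active = len(self._versions) - 1

    def rollback(self):
        """Reactivate the version deployed immediately before the active one."""
        if self._active is None or self._active == 0:
            raise RuntimeError("no earlier version to roll back to")
        self._active -= 1

    @property
    def active_label(self):
        if self._active is None:
            return None
        return self._versions[self._active][0]

    def run(self, rows):
        """Apply the active transform to each row."""
        if self._active is None:
            raise RuntimeError("no pipeline version deployed")
        _, transform = self._versions[self._active]
        return [transform(row) for row in rows]


registry = PipelineRegistry()
registry.deploy("v1", lambda amount: round(amount * 1.0, 2))
registry.deploy("v2", lambda amount: round(amount * 1.1, 2))  # custom uplift
registry.rollback()  # v2 misbehaves, so return to v1
print(registry.active_label)
```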
3. Stability vs. Agility
Stability
The cloud data platform needs to help data teams maintain stability. These teams must meet their service level agreements (SLAs), support chargeback, and ensure compliance. The platform should make it easy for cloud operations (CloudOps) engineers to configure virtualized cloud resources, monitor performance and capacity utilization, and spot and remediate issues. It can also expose its workings (logs, traces, metrics, and alerts) to observability tools that monitor and optimize performance across heterogeneous environments. By stabilizing workloads in these ways, the cloud data platform makes them more predictable and lower risk.
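As a simple picture of how exposed metrics support SLAs, the sketch below checks query latencies against a per-workload target using a nearest-rank percentile. The `QueryMetric` record, the workload names, the thresholds, and the choice of the 95th percentile are all assumptions for illustration; actual observability tools define their own metric formats and alert rules.

```python
import math
from dataclasses import dataclass


@dataclass
class QueryMetric:
    workload: str      # hypothetical workload name, e.g. "bi" or "ml"
    latency_ms: float  # observed query latency in milliseconds


def sla_breaches(metrics, sla_ms, percentile=0.95):
    """Return {workload: observed latency} for each workload whose
    nearest-rank percentile latency exceeds its SLA target."""
    by_workload = {}
    for metric in metrics:
        by_workload.setdefault(metric.workload, []).append(metric.latency_ms)

    breaches = {}
    for workload, latencies in by_workload.items():
        latencies.sort()
        # Nearest-rank percentile: smallest value with at least
        # `percentile` of the observations at or below it.
        rank = max(1, math.ceil(percentile * len(latencies)))
        observed = latencies[rank - 1]
        # Workloads without a target are never flagged.
        if observed > sla_ms.get(workload, float("inf")):
            breaches[workload] = observed
    return breaches


metrics = [
    QueryMetric("bi", 100.0),
    QueryMetric("bi", 120.0),
    QueryMetric("bi", 110.0),
    QueryMetric("bi", 900.0),
    QueryMetric("ml", 50.0),
    QueryMetric("ml", 60.0),
]
print(sla_breaches(metrics, {"bi": 500.0, "ml": 100.0}))
```

A check like this is only possible when the platform exposes its metrics, which is why the ability to feed observability tools belongs on the stability side of the balance.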
Agility
However, the cloud data platform still needs to help data teams stay agile. They need to help the business address urgent market demands, innovate, and release competitive enhancements. Data scientists might need to spin up a sandbox to explore new data, train a new ML model on that data, and deploy that model into production. Developers might need to build, test, and release—then monitor and revise—a new data-driven application that operationalizes the ML model’s outputs. Data teams should select a cloud data platform that supports fast actions and iterations like these.
So, what does this look like in practice? Shell offers an instructive case study. Shell uses the Databricks Lakehouse platform to integrate data management tasks and support a variety of standard BI and custom data science use cases. Its data scientists use best-of-breed tools to devise agile, custom models that predict inventory levels across the supply chain and recommend purchases to customers. Shell maintains stability while supporting the agile projects it needs to compete.
Cloud data platforms will continue to evolve and absorb functionality. As they do, data teams should evaluate and select those that strike the right balance and keep everyone from toppling into the dirt.