Twelve Must-Have Characteristics of a Modern Data Stack
ABSTRACT: The modern data stack must be automated, low code/no code, AI-assisted, graph-enabled, multimodal, streaming, distributed, meshy, converged, polyglot, open, and governed.
The term “modern data stack” can mislead because it is not really a stack of anything. Rather, it refers to a loose collection of technologies, often cloud-based, that together process and store data to support modern analytics. The specific technologies vary by company and use case. In aggregate, though, they must provide a fairly consistent set of capabilities to the enterprise using them. This blog defines the modern data stack by describing those must-have characteristics. The stack must be automated, low code/no code, AI-assisted, graph-enabled, multimodal, streaming, distributed, meshy, converged, polyglot, open, and governed.
That’s a lot to manage! Let’s explore each characteristic in turn.
- Automated. The modern data stack needs its data prep, cataloging, and other tools to help automate the work of data processing. Data teams can only keep up with business requirements if they automate the ingestion, transformation, classification, tagging, linking, joining, and schema creation of tables, files, images, and so on. And they can only ensure data quality if they automate testing and QA processes. By configuring the rules for these various tasks and letting the tools perform the work, data analysts, engineers, and scientists can manage exploding data supply and demand.
- Low code/no code. The modern data stack must support many of the tasks described above with a low code/no code approach that avoids hand-writing code wherever possible. For example, citizen data scientists and citizen data integrators might use a graphical user interface (GUI) to inspect, prepare, and analyze data without making requests of overburdened data engineers. Data engineers and developers can also use the GUI to save time building basic ingestion, transformation, and query jobs, and they can still pull up a command-line window to write code for more complex SQL or Python jobs.
- AI-assisted. New AI assistants such as OpenAI’s ChatGPT now provide a higher level of intelligence for the modern data stack. They help perform tasks such as discovering datasets, creating starter code, explaining algorithms, debugging pipelines, and writing documentation. While AI assistants require expert oversight to avoid errors and reduce compliance risk, they have the potential to boost productivity. In a recent LinkedIn poll by Eckerson Group, about 40% of practitioners said they already use ChatGPT for data engineering. (Voting was not final at the time of publication.)
- Graph-enabled. The modern data stack must include the ability to build and glean intelligence from knowledge graphs. By organizing and portraying entities as networks of interlinked nodes, knowledge graphs can shed new light on the relationships between those entities. For example, graphs can help analysts visualize human relationships to detect fraud, build social marketing campaigns, or predict the spread of disease. And graph-based catalogs can help data analysts, engineers, or scientists organize their data assets. In these and other ways, graphs have become a must-have element of the modern data stack.
- Multimodal. The “extract, transform, and load” (ETL) pattern, dominant for decades, can no longer support today’s diverse use cases on its own. The modern data stack must enable multiple patterns, including ETL, ELT, ETLT, and reverse ETL. For example, ELT might work well for a complex transformation that requires the processing capabilities of a data warehouse. ETLT, meanwhile, might transform data lightly in flight, then perform more robust transformations after loading it into the target. And reverse ETL delivers data from the data warehouse to SaaS applications.
- Streaming. Modern data volumes and velocities make it infeasible to keep loading duplicative batches of operational records into the data warehouse on an hourly, daily, or weekly basis. Rather, the modern data stack depends on streaming pipelines to extract and load incremental updates in real time to analytics platforms. These streaming pipelines use technologies such as change data capture, Apache Kafka, and Spark Streaming to deliver data at latencies that range from milliseconds to minutes. Streaming improves processing efficiency, reduces bandwidth requirements, and drives faster analytics. It also enables enterprises to capture and analyze new sources such as IoT sensors, clickstreams, and IT system logs.
- Distributed. The modern data stack must store and process distributed datasets across hybrid, multi-cloud environments. This is because Fortune 2000 enterprises still depend on some legacy applications and cannot afford to move all their data to the cloud. They also continue to spin up new analytics projects on new clouds to meet fast-changing business requirements. The modern data stack must enable enterprises to manage all this data, wherever it resides, while tying it together with new architectural approaches such as a data fabric or data mesh.
- Meshy. The data mesh can help manage this modern, distributed data stack. Its creator, Zhamak Dehghani, proposes that domain experts within business units deliver their data as a product to users throughout the organization, leveraging a self-service platform and federated governance framework. The modern data stack supports this vision by empowering both data engineers and business domain experts to create standard, modular, and reusable data products. (Also check out my colleague Jay Piscioneri’s blog to understand why the vision of the data mesh is both “exciting and unnerving.”)
- Converged. While distributed and meshy, the modern data stack also needs converged technologies that help do more work with fewer tools and platforms. For example, the data lakehouse supports both business intelligence and data science projects. Like a data warehouse, the lakehouse transforms and queries tables with familiar SQL functions. Like a data lake, it stores and processes multi-structured data in an elastic cloud object store. Another converged platform is the data fabric, which combines data discovery, ingestion, transformation, cataloging, and security. Other converged technologies include data pipeline tools that combine ingestion, transformation, DataOps, and orchestration; and data catalogs that address observability and data marketplaces.
- Polyglot. Much like our modern multilingual world, the modern data stack must accommodate different stakeholders who speak different programming languages. SQL remains the lingua franca for databases, data warehouses, and the pipelines that attach to them. But many data scientists prefer to build their algorithms and pipelines with Python, and many developers of Apache Spark applications prefer Scala. The modern data stack must support these and other popular languages such as Java and Ruby.
- Open. The natural corollary to a polyglot stack is an open stack. The modern data stack needs open APIs and open data formats to ensure its contributors have full access to today’s ecosystem of innovative tools. Analysts, engineers, and managers need the ability to plug and play best-of-breed tools, confident they will interoperate with the rest of the ecosystem. These tools span commercial products and open-source projects such as Apache Spark for data processing, Kafka for streaming, and dbt for transformation. They also include the GitHub software development environment, the Apache Airflow platform for workflow orchestration, and the Apache Iceberg open table format for analytics.
- Governed. Lest we forget, the modern data stack needs strong governance even more than traditional architectures do. Data governance teams must be able to create and enforce policies, standards, and rules about data usage across the stack. The heterogeneous nature of the modern data stack, coupled with the “black box” nature of generative AI tools, makes governance harder than ever. Enterprises must govern each aspect of managing the modern data stack, including design, operation, monitoring, and adaptation, to ensure they control risk and assist compliance efforts. More than any other characteristic described here, governance depends on strong people and process rather than technology.
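To make the streaming characteristic a bit more concrete, here is a minimal Python sketch of the idea behind incremental updates: rather than reloading a full batch of operational records, a pipeline applies only the changes to the target. Everything named here is hypothetical and invented for illustration. The `ChangeEvent` record, its fields, and the `apply_changes` function do not come from any particular change data capture tool, and the target "table" is just an in-memory dictionary keyed by primary key.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One hypothetical change record (not any vendor's actual format)."""
    op: str                               # "insert", "update", or "delete"
    key: int                              # primary key of the affected row
    row: Optional[Dict[str, Any]] = None  # changed column values; None for deletes


def apply_changes(target: Dict[int, Dict[str, Any]],
                  events: Iterable[ChangeEvent]) -> Dict[int, Dict[str, Any]]:
    """Apply change events, in order, to an in-memory stand-in for a target table."""
    for event in events:
        if event.op == "delete":
            # Remove the row if present; ignore deletes for unknown keys.
            target.pop(event.key, None)
        elif event.op in ("insert", "update"):
            # Merge so an update carrying only changed columns keeps the rest.
            target[event.key] = {**target.get(event.key, {}), **(event.row or {})}
        else:
            raise ValueError(f"unknown operation: {event.op!r}")
    return target


if __name__ == "__main__":
    orders = {
        1: {"status": "new", "amount": 40},
        2: {"status": "new", "amount": 15},
    }
    stream = [
        ChangeEvent("update", 1, {"status": "shipped"}),
        ChangeEvent("delete", 2),
        ChangeEvent("insert", 3, {"status": "new", "amount": 99}),
    ]
    print(apply_changes(orders, stream))
```

In a real pipeline the events would arrive from a change data capture tool or a Kafka topic, and the target would be a warehouse or lakehouse table, but the principle is the same: only the three changes travel over the wire, not the whole orders table.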
These must-have characteristics add up to a tall order for today’s enterprise. Because no single tool or vendor can meet all these needs, data teams must select, integrate, and cultivate the right elements with a healthy focus on people and process. We’ll be discussing how to make this happen at the Modern Data Stack Conference in San Francisco on Tuesday, April 4th. I’ll chair a panel with George Fraser of Fivetran, Barr Moses, CEO of Monte Carlo, Benn Stancil of Mode, and Sarah Catanzaro of Amplify Partners. Join us if you can and stay tuned on LinkedIn for updates on what I learn.