Future Data Architectures: Musings from 2018
A dizzying time warp occurs between the summer solstice and the fourth of July: the days get shorter but the temperature rises; school is out, but neighborhoods get quiet; the year is half over but summer is just starting; we rush to clear our desks so we can relax without them.
As we prepare for our summer idylls, it seems wholly appropriate to reflect on the future of data analytics architectures. Already in 2018, we’ve helped several clients modernize their data architectures, with more to come. They’ve stretched us into thinking about how current data and analytics technology advancements will evolve in the near future. Here are my summer musings about where it’s all headed:
- Process Data in Place. It has always been the goal of data architects to minimize data movement and replication. Moving and duplicating data is costly and error-prone, and it creates synchronization problems. This was always just a pipe dream, and then along came Hadoop and now cloud platforms, which make the dream more of a reality. Organizations with small data volumes, simple queries, and low numbers of concurrent users can query an object store directly using a SQL engine such as Presto. This is a harbinger of things to come. However, in the near term, for performance reasons, we still need to filter, aggregate, and push data into a columnar in-memory store, whether a database, cube, or cache. The laws of big data physics are obstinate, but I’m confident that in the next several years, we’ll be querying more and more data directly without moving it into dedicated analytical stores.
- Process Apps and Data in One Place. Of course, if we take the statement above to its logical conclusion, then it’s inevitable that we’ll process both data and transactions in the same environment in the future. Today, there is an entire category of databases, known as Hybrid Transactional/Analytical Processing (HTAP) systems, that do this. HTAP databases can both update and query streaming data without having to transform and move new records into a dedicated analytical store. This makes HTAP databases ideal for handling real-time applications with analytical requirements, such as operational reporting and operational analytics (e.g., fraud detection, online recommendations). Products such as SAP HANA, Oracle Exadata, MemSQL, and Splice Machine offer HTAP capabilities. HTAP is not ideal for complex queries against large volumes of historical data, at least not yet. But the path ahead is clear.
- Stack Convergence. Outside of the core database engines that read and write data, there is another revolution going on. For years, BI vendors have converged upon a single analytics stack with integrated reporting, dashboards, exploration, and advanced analytics for on-premises, web, and cloud deployments. At the same time, data integration stacks have also converged, providing integrated support for data capture, profiling, cleansing, integration, scheduling, preparation, cataloging, and monitoring for on-premises, web, big data, and cloud deployments. Leading vendors are still a few years from offering truly robust end-to-end stacks, but the future is clear: you will buy one database, one data pipeline platform, and one analytics platform.
- Multi-Stack Convergence. At the same time, BI and data integration stacks are bleeding into each other. BI vendors have increasingly added data management and preparation functions, while data integration and database vendors have added analytic and visualization components. Qlik, for instance, has long had an in-memory engine, but almost every other BI vendor has added similar capabilities. Alteryx long ago added data preparation capabilities, but now most BI vendors offer similar sets of functionality. Data integration and pipeline vendors have been more timid about adding analytics and visualization, though some, such as Teradata, Zaloni, Dremio, and Incorta, have not held back.
- User and IT Platforms. This convergence will soon yield a power user triad: a platform that provides an integrated data catalog, data preparation tool, and exploration/reporting tool. This will become a standard toolset for power users who need true self-service capabilities. Similarly, we’re seeing the advent of an IT-centric data pipeline platform that ingests, cleans, secures, and governs data as it flows from sources into a data lake and beyond, whether in batch or real-time, in increments or snapshots or both.
- Semantic Backbones. Finally, the most encouraging development is the emergence of external abstraction layers that insulate business users from changes in database engines, data pipeline tools, and BI tools. AtScale offers one of the first universal semantic layers that supports a multiplicity of database engines and BI tools. A universal semantic layer enables business professionals to spend most of their time fleshing out business models rather than building plumbing. Likewise, Dremio offers an abstraction layer that serves as a data backbone and distribution layer, insulating business users and IT administrators from the complexities of sourcing diverse data from inside and outside the organization.
- Open Source Semantics. Ideally, I’d like to see an open source semantic backbone that vendors could plug their component software into. That would benefit user organizations and create a vibrant marketplace for components, much like Apple’s App Store for mobile apps. Think about it: if AtScale or Dremio open sourced their products, it would dramatically change how organizations think about purchasing software. Of course, any vendor with a long trail of development on the front or back ends (e.g., Information Builders, Oracle, SAP, or IBM) could open source their core and then compete on components. Can some vendors step up and do this? Pretty please? A bigger vendor, like IBM, might be ideal since they have enough components left to drive revenue.
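To make the semantic-layer idea from the list above a bit more concrete, here is a minimal Python sketch of what such an abstraction might look like. It is an illustration only, not how AtScale, Dremio, or any other product actually works. The class names (`Measure`, `SemanticModel`), the table `lake.sales_fact`, and the column names are all hypothetical. The point is simply that business terms are defined once and translated into physical SQL, so the underlying engine or schema can change without users rewriting their questions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measure:
    name: str        # business-facing name, e.g. "revenue"
    expression: str  # physical SQL expression (hypothetical column names)


@dataclass(frozen=True)
class SemanticModel:
    table: str        # physical table the model sits on top of
    dimensions: dict  # business name -> physical column
    measures: dict    # business name -> Measure

    def to_sql(self, measure, by):
        """Translate a business question into SQL over the physical table."""
        m = self.measures[measure]
        cols = [self.dimensions[d] for d in by]
        select = ", ".join(f"{c} AS {d}" for c, d in zip(cols, by))
        group = ", ".join(cols)
        return (
            f"SELECT {select}, {m.expression} AS {m.name} "
            f"FROM {self.table} GROUP BY {group}"
        )


# Hypothetical model: all physical names below are made up for illustration.
model = SemanticModel(
    table="lake.sales_fact",
    dimensions={"region": "rgn_cd", "month": "sale_month"},
    measures={"revenue": Measure("revenue", "SUM(net_amt)")},
)

# A business user asks for "revenue by region and month" in business terms.
sql = model.to_sql("revenue", by=["region", "month"])
print(sql)
```

If the physical table or columns change, only the model definition needs updating; the request for "revenue by region and month" stays the same. A real semantic layer would also handle joins, security, SQL dialect differences, and caching, none of which this sketch attempts.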
The future is all around us. It doesn’t just magically appear. By extending current trends a few years into the future, we can get a clearer vision of what kind of tooling and architectures we’ll see in the next three to five years.