The Rise of the Cloud Data Lake Engine: Architecting for Real-Time SQL Queries
No longer an adolescent with Hadoop awkwardness, the cloud data lake is approaching adulthood thanks to the nurturing of cloud object storage, new processors, and elastic compute capability. It now runs many workloads, including data science and the business intelligence (BI) work traditionally owned by the enterprise data warehouse (EDW), but it still faces some final growing pains when it comes to speed, complexity, and efficiency.
The cloud data lake engine is a new category of analytics platform that encourages cloud data lake maturity by further improving these three characteristics. It applies real-time SQL querying and a consolidated semantic layer to multi-tenant object storage. It enables enterprises to support interactive BI workloads, simplify data access, and use resources more efficiently.
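As a rough illustration of what a consolidated semantic layer means in practice, the sketch below defines a business-friendly SQL view once, over raw data, and lets every consumer query that view. It is only an analogy: Python's built-in sqlite3 module stands in for a data lake engine, and the table, view, and column names (raw_orders, completed_orders, amt_cents, and so on) are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the query engine; a real engine would
# read files in object storage instead of a local table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Raw data as it might land in the lake: terse names, coded values.
    CREATE TABLE raw_orders (
        order_id INTEGER, cust_id INTEGER, amt_cents INTEGER, status TEXT
    );
    INSERT INTO raw_orders VALUES
        (1, 10, 2500, 'C'),
        (2, 10, 1000, 'X'),
        (3, 11, 4000, 'C');

    -- Semantic layer: one shared definition of a "completed order",
    -- with readable column names and units.
    CREATE VIEW completed_orders AS
    SELECT order_id,
           cust_id AS customer_id,
           amt_cents / 100.0 AS amount_usd
    FROM raw_orders
    WHERE status = 'C';
""")


def revenue_by_customer(connection):
    """Query the shared view rather than re-deriving the logic per tool."""
    query = """
        SELECT customer_id, SUM(amount_usd)
        FROM completed_orders
        GROUP BY customer_id
        ORDER BY customer_id
    """
    return connection.execute(query).fetchall()


print(revenue_by_customer(conn))
```

Because the filtering and unit conversion live in the view, a dashboard and a notebook that both query completed_orders share one definition instead of maintaining two.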
The cloud data lake engine is faster than its predecessors because it uses in-memory, columnar, and parallel processing to meet real-time or near-real-time service-level agreements, breaking a longstanding speed barrier for BI. It is simpler, providing a central control point for data engineers to create the semantic layer. It also reduces the need for separate OLAP cubes, BI extracts, and other derived datasets, as well as for semantic layers within BI visualization tools. Finally, the cloud data lake engine is more efficient, reducing compute cycles, cumbersome ETL work, and infrastructure requirements.
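To make the columnar and parallel ideas concrete, here is a minimal sketch using only the Python standard library. It pivots row-oriented records into columns, so an aggregate touches only the column it needs, and then sums that column in chunks whose partial results are combined. The sample data and the helper names to_columns and parallel_sum are hypothetical, and Python threads are used only to show the split-and-combine pattern; they do not provide the CPU parallelism of a real engine.

```python
from concurrent.futures import ThreadPoolExecutor

# Row-oriented records: each row carries every field.
rows = [
    {"region": "east", "sales": 10, "units": 1},
    {"region": "west", "sales": 20, "units": 2},
    {"region": "east", "sales": 30, "units": 3},
    {"region": "west", "sales": 40, "units": 4},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {key: [] for key in records[0]}
    for record in records:
        for key, value in record.items():
            columns[key].append(value)
    return columns


def parallel_sum(values, chunks=2):
    """Sum a column by splitting it into chunks and combining partial sums."""
    size = max(1, -(-len(values) // chunks))  # ceiling division
    parts = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=chunks) as pool:
        return sum(pool.map(sum, parts))


columns = to_columns(rows)
# Only the "sales" column is scanned; "region" and "units" are never read.
total_sales = parallel_sum(columns["sales"])
print(total_sales)
```

In the columnar layout the aggregate reads a single contiguous list rather than visiting every field of every row, which is the property that analytic engines exploit at much larger scale.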
The cloud data lake engine can yield business benefits by increasing analytics value, democratizing data consumption, and assisting governance efforts. Enterprise data teams should evaluate commercial and open source technologies—in particular, their real-time and semantic layer capabilities, compute usage methods, and support for open architectures—as they consider layering a cloud data lake engine onto their environments.