Let’s first review why the earlier generation of data lakes failed to meet enterprise requirements. In a nutshell, they were slow, complex, and inefficient. First-generation data lakes were slow because batch-oriented processors such as Hadoop MapReduce and Apache Hive could not meet the low-latency requirements of BI queries. They were complex because they required programming expertise to manage. Above all, the gen-1 data lake, typically based on a monolithic vendor stack of Hadoop components, was inefficient. As with other on-premises platforms, data teams struggled to manage and tune the underlying resources to meet SLAs for performance and availability. Compute and storage, coupled within each node, could not scale independently, which led to under-utilization, broken SLAs, and cost overruns. Without isolation capabilities, workloads competed with one another for the same resources.

The second-generation data lake, built for the cloud, is faster, simpler, and more efficient than its adolescent predecessor. It is almost an adult. As Hadoop fizzled, cloud service providers (CSPs) rolled in with second-generation data lakes based on object storage that scales easily to the levels required for both traditional BI and data science.

First, gen-2 data lakes are faster because they use in-memory processors such as Apache Spark to reduce latency and improve throughput. This brings them closer to the performance levels required for mainstream BI use cases such as scheduled or ad hoc reporting, scorecards, and dashboards.

Second, gen-2 data lakes simplify management. They integrate with numerous automated CSP tools and services for data management and advanced analytics. Developers and data scientists can rapidly select, customize, and deploy pre-built machine learning models. Data engineers can build data pipelines using automated tools that reduce the need for expert scripting.

Third, gen-2 data lakes operate more efficiently. They improve utilization and lower cost compared with their on-premises predecessors. Cloud storage and compute resources scale up or down separately and on demand. Queries, transformation jobs, and other workloads consume CPU resources only as needed, and workloads can be isolated from one another.
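To make the efficiency argument concrete, the following minimal Python sketch compares a coupled cluster, where every node bundles a fixed amount of storage and compute, with a decoupled one, where compute is sized on its own. All figures (a 400 TB, 64-core workload on nodes with 10 TB and 16 cores each) and the function names are hypothetical and chosen only for illustration; they do not describe any particular vendor's platform.

```python
import math


def coupled_nodes(storage_tb, cores, node_storage_tb, node_cores):
    """Nodes needed when each node bundles fixed storage and compute.

    The cluster must satisfy the larger of the two requirements,
    so the other resource ends up over-provisioned.
    """
    by_storage = math.ceil(storage_tb / node_storage_tb)
    by_compute = math.ceil(cores / node_cores)
    return max(by_storage, by_compute)


def utilization(required, provisioned):
    """Fraction of provisioned capacity that the workload actually needs."""
    return required / provisioned


# Hypothetical storage-heavy, compute-light workload.
storage_tb, cores = 400, 64
# Hypothetical node shape: 10 TB of storage and 16 cores per node.
node_storage_tb, node_cores = 10, 16

# Coupled (gen-1 style): storage drives the node count, CPU sits idle.
nodes = coupled_nodes(storage_tb, cores, node_storage_tb, node_cores)
provisioned_cores = nodes * node_cores
coupled_cpu_utilization = utilization(cores, provisioned_cores)

# Decoupled (gen-2 style): compute is sized independently of storage.
decoupled_cores = math.ceil(cores / node_cores) * node_cores
decoupled_cpu_utilization = utilization(cores, decoupled_cores)

print(f"coupled:   {nodes} nodes, CPU utilization {coupled_cpu_utilization:.0%}")
print(f"decoupled: {decoupled_cores} cores, CPU utilization {decoupled_cpu_utilization:.0%}")
```

Under these assumed numbers, the coupled cluster needs 40 nodes to hold the data and leaves roughly 90 percent of its cores idle, whereas the decoupled design provisions only the compute the workload requires.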
(Whitepaper)