A data lake is a type of repository that stores structured, semi-structured, and unstructured data in its native format. Data lakes originated as on-premises repositories running on Hadoop but have evolved to run in the cloud as object stores.
Let’s first review why the earlier generation of data lakes failed to meet enterprise requirements. In a nutshell, they were slow, complex, and inefficient. First-generation data lakes were slow because batch-oriented processors such as Apache MapReduce and Hive failed to meet the low-latency requirements of BI queries. They were complex because they required programming expertise to manage. Above all, the gen-1 data lake, typically based on a monolithic vendor stack of Hadoop components, was inefficient. As with other on-premises platforms, data teams struggled to manage and tune the underlying resources to meet SLAs for performance and availability. Compute and storage, coupled within each node, could not scale independently. This led to under-utilization, broken SLAs, and cost overruns. Without isolation capabilities, workloads competed with one another.

The second-generation data lake, built for the cloud, is faster, simpler, and more efficient than its adolescent predecessor. It is almost an adult. As Hadoop fizzled, cloud service providers (CSPs) rolled in with second-generation data lakes based on object storage that easily scales to the levels required for both traditional BI and data science.

First, gen-2 data lakes are faster because they use in-memory processors such as Apache Spark to reduce latency and improve throughput. This brings them closer to the performance levels required for mainstream BI use cases such as scheduled or ad hoc reporting, scorecards, and dashboards.

Second, gen-2 data lakes simplify management. They support numerous automated CSP tools and services for data management and advanced analytics. Developers and data scientists can rapidly select, customize, and deploy pre-built machine learning models. Data engineers can build data pipelines using automated tools that reduce the need for expert scripting.

Third, gen-2 data lakes operate more efficiently.
They improve utilization and cost compared with their on-premises predecessors. Cloud storage and compute resources scale up or down separately and on demand. Queries, transformation jobs, and other workloads consume CPU resources as needed, and workloads can be isolated.
Data lakes have been a godsend for data analysts and data scientists. For years, these power users subsisted on a steady diet of data dumps dished out by slow-moving IT departments. The lucky ones gained access to the staging area of a data warehouse, or perhaps an operational system. Most manipulated the data locally on a workstation or desktop computer, created a model, and uploaded it to a corporate server where systems engineers might embed it in an operational application… A data lake reverses this archaic dynamic.
This area, the data lake, serves as a stopping point, and some would say a dumping point, for data in transit from the source system. Data inserted into the data lake remains in the same format as the source system; usually you would not change its structure. This comes in handy for several reasons, one being that you can troubleshoot without returning to the source and taxing a production transactional system.
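The landing pattern above can be sketched in a few lines. This is a minimal, hypothetical example that uses a local filesystem path to stand in for the lake's object store (in practice this would be a cloud API call, such as an S3 upload, but the principle is identical): the source extract is copied byte-for-byte, with no parsing or schema change, into a date-partitioned landing path. The function name `land_raw_extract` and the `landing/` layout are illustrative assumptions, not a standard.

```python
import shutil
from datetime import date
from pathlib import Path

def land_raw_extract(source_file: str, lake_root: str, source_system: str) -> Path:
    """Copy a source extract into the lake's landing zone unchanged.

    The file is stored exactly as delivered, under a date-partitioned
    path, so later troubleshooting can replay it without re-querying
    the production transactional system.
    """
    today = date.today()
    # e.g. <lake_root>/landing/erp/2024/06/30/orders.csv (layout is illustrative)
    dest_dir = Path(lake_root) / "landing" / source_system / f"{today:%Y/%m/%d}"
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / Path(source_file).name
    shutil.copy2(source_file, dest)  # byte-for-byte copy: no parsing, no schema change
    return dest
```

Because nothing is transformed on the way in, the landing zone preserves an exact audit copy of what the source sent; downstream zones of the lake can then parse, cleanse, and restructure it as needed.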