Daniel Graham: Data Lakes vs. Data Warehouses
In this episode, Daniel Graham dissects the capabilities of data lakes and compares them to data warehouses. He talks about the primary use cases of data lakes and how they are vital to big data ecosystems. He then goes on to explain the role of data warehouses, which are still responsible for timely and accurate data but no longer have a central role. In the end, Wayne Eckerson and Dan Graham settle on a common definition for modern data architectures.
Daniel Graham has more than 30 years in IT, consulting, research, and product marketing, with almost 30 years at leading database management companies. Dan was a Strategy Director in IBM’s Global BI Solutions division and General Manager of Teradata’s high-end server divisions. During his tenure as a product marketer, Dan has been responsible for MPP data management systems, data warehouses, and data lakes, and most recently, the Internet of Things and streaming systems.
- There are three use cases for a data lake: fast data loading first and foremost, a distribution hub second, and third a discovery zone with reduced governance for data scientists.
- The data warehouse is not the center anymore; the future is all distributed. The amount of data outside the data warehouse is bigger than what's stored inside.
- It's a whole lot cheaper to put large amounts of unstructured data in the data lake instead of a data warehouse.
- If you are dealing with a video data format, it's better to put it in a data lake for performance reasons.
- The data warehouse has all the trustworthy reliable data whereas the data lake tends to have all the raw data.
- The point of the data warehouse is to make it easy for business users to do all kinds of queries with BI tools.
- The data lake is more for the programmers and the data engineers, often the data scientists.
- If somebody gets rid of their data warehouse, you can sell all the stock in those guys!
- BI tools with high-performance in-memory engines can't replace a data warehouse.
- Modern BI tools are necessary but don't perform well with data lake environments.
- A lot of data warehouses today are real-time data warehouses with several thousand users logging in every day.
- The data lake is not full data integration; it's just one level of refinement.
- 20% of the data lakes will be relational by 2020.
- Most companies have accidental data architectures. Now they need to remodel their data architectures looking for synergies between different systems.
- Any time there is an arrow/line between two data systems in the architecture, that is where your cost and complexity are because that's where data transforming takes place.
- The data warehouse, the data lakes, and analytics; all of it needs DataOps.
- IaaS will slowly lose its shine and be replaced by PaaS and APIs in the future.
Below is one question and answer from the podcast:
Wayne Eckerson: Is it a requirement to have a data lake even when you have a data warehouse?
Daniel Graham: The data lake is not for everyone. If you're working with 10 or 20 terabytes of data, you probably don't need a data lake. If your workload fits on a couple of Dell servers, you don't need a data lake. Remember that data lakes were born at Yahoo to handle thousands of servers interacting with millions of clients every minute. Size really matters here. So don't jump into the data lake just because it's trendy. If you do, you might want to update your resume at the same time.
The data lake answer depends on the customer's needs, of course, and if you're facing an avalanche of big data, then you probably need a data lake. There are three primary use cases for a data lake:
1. The data is arriving in real time, faster than it's prudent to load it into a database. Tools like Kafka and HDFS are really good at this; they can capture real-time streams and load data at that speed.
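A minimal sketch of the fast-loading pattern described here: buffering a high-rate event stream and flushing it to storage in micro-batches, the way Kafka-to-HDFS pipelines typically land data. The class and batch size below are hypothetical stand-ins for illustration, not a real Kafka or HDFS API:

```python
import time
from collections import deque


class MicroBatchLoader:
    """Buffer fast-arriving events and flush them in batches,
    mimicking a Kafka-consumer -> HDFS append pattern."""

    def __init__(self, batch_size=1000):
        self.batch_size = batch_size
        self.buffer = deque()
        self.flushed_batches = []  # stand-in for files landed in the lake

    def ingest(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Drain the buffer into one batch (a real pipeline would
        append this batch to HDFS/S3 instead of a Python list)."""
        if not self.buffer:
            return
        batch = [self.buffer.popleft() for _ in range(len(self.buffer))]
        self.flushed_batches.append(batch)


loader = MicroBatchLoader(batch_size=3)
for i in range(7):
    loader.ingest({"sensor": i, "ts": time.time()})
loader.flush()  # drain the remainder: 7 events land as batches of 3, 3, 1
```

The point of the pattern is that no per-event database transaction happens on the hot path; events are appended in bulk, which is what makes lake-style ingestion cheaper than loading row by row into a warehouse.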
2. The second need for a data lake arises when your data integration architecture is getting big and complex. The data integration tools work, but you need a distribution hub to capture and distribute thousands of files daily. So now you have a big data problem associated with gathering files and distributing them through your data hub. Some data lakes capture all the data feeds in a single system and distribute them to downstream applications, including the data warehouse: the data comes out of the data lake and goes into the data warehouse. You also need strong metadata capture from catalog tools that lets you keep track of all the files coming and going, because the worst thing that can happen is losing files: you don't know where they are or what they're named. Worse still, you find ten files with the same name and slight differences, and now you don't know which one you need. So a metadata catalog is crucial to the data distribution strategy for a data lake.
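To make the same-name problem concrete, here is a sketch of a tiny metadata catalog that registers every file landing in the hub under a content checksum, so two files that share a name can still be told apart. The class names, feed names, and file contents are invented for illustration; real catalog tools do far more than this:

```python
import hashlib
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    name: str       # file name as delivered
    checksum: str   # SHA-256 of the contents: the real identity
    source: str     # which feed delivered it


class MetadataCatalog:
    """Track every file landing in the distribution hub by content
    checksum, so two same-named files are never confused."""

    def __init__(self):
        self.entries = []

    def register(self, name, content: bytes, source: str):
        checksum = hashlib.sha256(content).hexdigest()
        entry = CatalogEntry(name, checksum, source)
        self.entries.append(entry)
        return entry

    def lookup(self, name):
        """All registered versions of a file name, oldest first."""
        return [e for e in self.entries if e.name == name]


catalog = MetadataCatalog()
catalog.register("orders.csv", b"id,amount\n1,10\n", source="erp_feed")
catalog.register("orders.csv", b"id,amount\n1,10\n2,25\n", source="web_feed")

# Two files share a name but differ by checksum and source
versions = catalog.lookup("orders.csv")
```

Keying on content rather than name is the design choice that resolves the "ten files, same name, slight differences" scenario: the catalog can always say which physical file each downstream consumer actually received.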
3. The third and last use case is agile analytics, or what some people call the discovery zone for data scientists. They need a place to explore unknown questions with unknown data. The word scientist implies that we don't know the questions and we don't know the answers. We need to experiment; we need to fail forward. Not fast, but forward. If you're working with big data, the agile discovery zone for machine learning and data scientists is crucial. You need a lesser amount of governance, a lot of performance, and plenty of chances just to experiment. Some of the experiments blow up; some of them will produce magic for you.
So it's true that in some cases scientists can do their work on a small amount of data, even on their laptops. But access to big data dramatically helps the accuracy of the answer, especially when locating outliers. Having tens or hundreds of terabytes of consumer or sensor data is going to produce a lot more accuracy and a lot more insight than five or ten megabytes to gigabytes.
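The point about sample size and outliers can be sketched with a simple z-score check (a stand-in for whatever detection method a data scientist would actually use; the numbers are invented). In a tiny sample, an extreme value drags the mean and standard deviation so far that it no longer looks extreme; with more baseline data, the same value is flagged immediately:

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [v for v in values if std > 0 and abs(v - mean) / std > threshold]


small = [10.0, 11.0, 100.0]            # tiny sample: the spike drags the mean to ~40
large = [10.0, 11.0] * 50 + [100.0]    # big sample: the baseline is well estimated

small_flags = zscore_outliers(small)   # [] -- the 100.0 spike goes undetected
large_flags = zscore_outliers(large)   # [100.0] -- the same spike stands out
```

This is the statistical reason a discovery zone over big data beats a laptop sample: rare events only become visible once the surrounding "normal" behavior is estimated from enough data.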