The Truth About Data Lakes In Plain English
Data, data everywhere and not a drop to drink. This is how many executives feel about their environments. There are several components to a great data warehouse and bringing insights to users and executives. The data lake can be one of those pieces.
What Is A Data Lake?
Let’s say you are on the way to Disney World from Texas (the greatest state on the planet) and you are driving (I don’t recommend this by the way) and you make a stop at a family member’s house in Destin (a beach area in north Florida). You stop there to refresh, relax, reload and get cleaned up to finish the trip. Your cousin’s house is the stopping point while you prepare for the last leg of your trip.
In this way, we can think of a data lake as a stopping point for data. Keep in mind that data can live forever in the data lake, just like you decide to move in with your cousin and never make it to Disney. We will cover this later.
Data from the source system that is on its way to the data warehouse may need to “stop off” in another database. Just like you need to be cleaned up at your cousin’s house before arriving at Disney, the data will be cleaned at this stage and prepared to go into the data warehouse. This area has undergone many different names, the newest being a data lake, some may call it a staging area, others may call it a landing area. I’ve even heard some people refer to this area as an ODS (operational data store). Technically, there are subtle differences in each of these, however for the sake of this post we will use them interchangeably, as I’ve heard done over the years.
What Is It Used For?
This area, the data lake, is used as a stopping point, and some would say dumping point, for data that is being transported from the source system. Data that is inserted into the data lake would be in the same format as the source system. Usually you would not change the structure of the data from the source system. This comes in handy for several reasons, one being troubleshooting without having to return to the source, taxing a production transactional system.
The data lake also offers a place to process the data away from the transactional system and stage it for loading into the warehouse. This keeps processing away from both the pre and post systems for loading data. A well-defined and maintained data lake also makes reloading a data warehouse simple. No need to hit a front-end system; just rerun routines for loading the warehouse.
The Good, The Bad, And The Ugly
In recent years people have argued that the OLAP reporting system should hit the data lake directly. There are pros and cons to this approach.
Enabling experienced power users to hit the data lake directly has several benefits:
- Empowers users. Knowledgeable users can have quick access to raw data to run queries and dig deep into the data without having to wait on processes that may or may not exist to move data to the data warehouse.
- Quick connect. Users, such as data scientists, do not need to wait on IT for connection to a data warehouse. They can go directly to the data lake and run queries and or extract data.
- Tool agnostic. For most BI Implementations there is a tool of choice. However, sometimes users want to use their own tools, and direct access to a data lake makes this is possible. (Not that it’s impossible with the data warehouse as well)
- SMEs know the data and can help their department. Having knowledgeable people hit the data can be instrumental for the advancement of BI. Subject matter experts know the complexities and idiosyncrasies of their data and can work around the ‘dirtiness’ of the data.
If you allow direct access, keep a few things in mind.
- User knowledge. If users aren’t steeped in SQL you could find yourself utilizing most of your BI team’s time to trouble shoot bad SQL statements, which could detract from your overall business intelligence goals and projects.
- Bad data. Sometimes the data in the data lake is not pre-cleaned, and users could be getting data that prompts more questions than answers. Also, lack of conformed dimensions in data lakes makes things a bit more difficult to navigate for users.
- Support. If you have a small BI team, allowing them to bring their own OLAP tool could be dangerous and cause a support nightmare. It’s difficult enough to support one tool and be well versed on the upgrades and features available. Allowing any and every tool could cause more harm than good. Although there are tools with unique features, many of the core features are the same; pick one and stick with it.
- The Ugly. Even the benefits of direct access to the data lake are based on a WELL-BUILT data lake. Otherwise, you’ll have a data swamp that isn’t usable to anyone. Worst case scenario, tables are undefined with no source to target documents or they are out of date.
As with most things, the answer to how to use your data lake depends on your environment and ability to support your users. Keep the above items in mind when designing or redesigning your “resting place” i.e. your data lake. Just try not to get too comfortable at your cousin’s house, remember the goal is Disney – Your Data Warehouse!
Don’t Worry, BI