James Serra: Myths of Modern Data Management

In this podcast, Wayne Eckerson and James Serra discuss myths of modern data management. Some of the myths discussed include ‘all you need is a data lake,’ ‘the data warehouse is dead,’ ‘we don’t need OLAP cubes anymore,’ ‘cloud is too expensive and latency is too slow,’ and ‘you should always use a NoSQL product over an RDBMS.’

Serra is a big data and data warehousing solutions architect at Microsoft with over thirty years of IT experience. He is a popular blogger and speaker and has presented at dozens of Microsoft PASS and other events. Prior to Microsoft, Serra was an independent data warehousing and business intelligence architect and developer.

Key Takeaways

  • Data lakes and data warehouses work great together
  • Data warehouses are for questions, data lakes are for when you don’t know what questions to ask
  • A tabular model is really a type of cube
  • Cloud is often way more secure than on-premise
  • NoSQL solutions have their place, but they’re for a limited use case
  • You can start quicker in the cloud but you have to spend time up front building up the architecture – there are no shortcuts

The following is an edited transcript of the podcast.

Wayne: The top myth that you and I discussed was “All we need is a data lake.” Why do you think that’s a myth?

James: I’ve seen a number of customers come up with that idea and try to implement a solution just on a data lake, bypassing a relational database system. They always failed for a number of reasons.

One of the biggest is that when you delve into a data lake, you’re looking at Hadoop technologies, and it’s a world that many end users don’t know anything about. Data lakes are usually built in Hadoop – and Hadoop is just a glorified file folder – so if you just create a bunch of folders, slap in some files, and tell an end user, ‘Hey, go fire up Hive and create a metadata store and use Pig to clean it, and there you go,’ well, 99 percent of end users have no idea how to do that.

So a data lake is great for a number of reasons, one of which is giving certain end users quick access to the data, but that’s limited to the data scientists and the very powerful end users who are familiar with those tools. Most other end users want the data in a relational database management system, so they can just go to a blank canvas, list fields that they want in their report, drag them over there, and they can start creating reports and dashboards.

So I always recommend a data lake in combination with a data warehouse. It’s the best of both worlds: you have quick access to the data for those power users, and then a relational database system that the other users can access.

The great thing about data lakes is schema-on-read. I can put data in there in its raw format and I don’t have to do anything upfront, but you have to pay the piper somewhere. Schema-on-read means that when I pull the data out, I do have to supply a schema – that need doesn’t go away – and that’s where the complexity comes in. Many end users are not able to do that and don’t want to do that. So there are cases where you need IT to do it for them, and that’s where the relational database comes in. Having that data lake as a staging and exploratory area is great for those end users, and for everybody else you can copy that data into a relational database.
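
A minimal sketch of what schema-on-read looks like in practice, assuming PySpark and made-up folder paths and column names: the raw files are dropped into the lake untouched, and the schema is supplied only when somebody reads them.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# Landing data in the lake is just copying raw files into a folder;
# no modeling or schema work happens up front.

# "Paying the piper" on read: whoever queries the raw folder has to
# declare the schema at that point.
sales_schema = StructType([
    StructField("order_id",   StringType()),
    StructField("customer",   StringType()),
    StructField("amount",     DoubleType()),
    StructField("order_time", TimestampType()),
])

sales = (spark.read
         .schema(sales_schema)            # schema applied at read time, not at load time
         .option("header", "true")
         .csv("/datalake/raw/sales/"))    # hypothetical raw folder in the lake

sales.groupBy("customer").sum("amount").show()
```

Every consumer of that raw folder has to know the schema and cleansing rules; loading the conformed result into the relational data warehouse is what spares the broader user population from that step.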

Wayne: So does that mean you don’t need a data warehouse anymore? That’s another common myth that we hear out there.

James: Yeah, and when I hear people say, ‘Is the data warehouse dead?’ I think what they really mean to say is ‘Is the traditional data warehouse dead?’ meaning the relational data warehouse. The fact is you definitely need a data warehouse.

The same reasons for wanting a data warehouse apply now as in the past. It reduces stress on the production system. It’s a way to integrate many sources of data. It’s a way to keep historical records. It’s a great way to rename tables when you bring them over from source systems, giving them better naming conventions so end users can pull data out. It protects against source system changes and upgrades. It’s a place to do master data management. It’s where you do your data quality, and it becomes one version of the truth.

That’s why we recommend customers have the enterprise data warehouse as that version of the truth. From there you can export data into cubes or data marts and use them for specific areas of your business, but the enterprise data warehouse is where to collect it all and make that a single version of the truth. And they’re needed now more than ever because of the proliferation of other data sources, whether they be Internet of Things devices, Facebook and Twitter data, competitive data, weather data, and it just goes on and on. The way to get great value out of this data and make better business decisions is to have these various types of data in one data warehouse so those end users can query this data and get better value with it.

You hear a lot of people saying data is the new currency, the new gold, the new oil. This is just a way to understand how to get ahead in your industry: make better use of this data. It gets to the point where a lot of companies go through a lot of work to create a large data warehouse, and they not only use it for their own internal purposes but start selling that data because they realize how valuable it is. They’ve started making a secondary business out of collecting all this data and reselling it to others.

Wayne: Okay, so you’re advocating for a dual environment, with the data lake as your ingestion staging area and data science sandbox, and then a data warehouse for refined data that has been validated, cleaned, and designed for reporting and analysis.

James: Exactly.

Wayne: We recommend the same thing to our clients. One question that always comes up, though, is should you move all of your source data, both relational and non-relational sources, into the data lake first before you move it into the data warehouse? Or can you, or should you, move some of that data, like the relational sources, right into the data warehouse and skip the data lake, which then really becomes a place where you put all your unstructured data?

James: Sure, I hear that. Think of a customer who has a large data warehouse on-prem and is moving it to the cloud. They’re expanding it out to include a data lake, and they’ve already written all these packages to move all this data into the data warehouse.

If we go to them and say every one of those packages – and sometimes there could be thousands – has to be changed to go to the data lake first and then to the data warehouse, it’s going to be a lot of extra work. What’s the value of that? Well, you only get the value if that data needs to be in a data lake because it needs to be combined with other data.

So some customers use the data lake in reverse, where the data goes all the way to the data warehouse and then goes to the data lake. They do that because they want to incorporate non-relational, semi-structured data. They can’t put that in a data warehouse, so they put it in a data lake, but then they say, ‘Well, okay, we generate reports off of that, but we also want to combine it with some data stored in the data warehouse.’ So in that case some of that data from the data warehouse gets copied into the data lake. It kind of reverses the process, but there is no reason you have to move all the data into the data lake. Maybe over time you can get to that point, but let’s get something up and running as quickly as possible, and that will limit the data that goes into the data lake. So it’s a case-by-case basis, but when you’re dealing with a lot of relational databases, people often skip the data lake and go right to the data warehouse.

Wayne: Yeah, especially if you already have ETL written for it as you say. I’ve also seen cases where companies want all the data in the data lake, but they don’t want to take relational data, unpack it into a file format, and repack it into a relational format. So they’ll replicate that source data, relational data, and they’ll shunt one copy to the relational data warehouse and one copy into the data lake. That’s another way to get around that issue where you want all your data in the data lake for your data scientists, but you don’t have to make more work for yourself by unformatting and reformatting that relational data. That would be for new sources.

James: Yeah, exactly. The reality is I can’t propose this as one solution to solve everything. It’s going to be on a case-by-case basis. There will be a lot of exceptions to what’s called the normal modern day warehouse architecture. A lot of it deals with migrations and they’ve already got those built, and now we’re going to say, ‘Well, what’s the value of adding this extra work in there?’

So I’m in total agreement on that, and there’s a big difference between migrating what you already have and creating a new solution. The reality is 90 percent of the customers are migrating so they have to take those things into account.

Wayne: Now, here’s another issue that I’ve run across a lot: you’ve built something in your data lake and now you want to productionize it because other people in the company want access to it. Where do you productionize it, in the data lake where the data already exists, or do you kind of reproduce all of that data and logic in a data warehouse? Do you have any thoughts on that?

James: At a high level I like to think of the data lake as your staging and preparation area and your data warehouse as your serving and presentation layer. The data warehouse is also for security and compliance, because it has easier mechanisms for security compared to a data lake in Hadoop. Hadoop is file-based, so you can’t easily take a file that has all this information in it and make it accessible only to certain people. That’s why the data gets moved to the data warehouse.

So you combine the two. You can still have those power users access the data lake, but generally that’s limited to a handful of people, because it’s also so difficult to do security in the data lake. If I just dump a bunch of files in there, they have to know how to clean the data, how to join it with other data, and how to master it. All that work has got to be in the power user’s head, whereas in the data warehouse IT does a lot of that work.

I generally break it into those two, but it’s the best of both worlds because you can have certain people access the data lake. The data lake also gives you the ability to use data when you don’t know the questions to ask. You’ve seen this data, but you haven’t played with it: ‘Let’s go in there. Let’s see if it has value. Let’s find out some questions to ask this data.’ That’s as opposed to a data warehouse, where you know the questions to ask: ‘I’m going to create this dashboard. It’s going to look like this. I’ll do the work upfront and move the data in there.’ So the data lake becomes more of an exploratory sandbox, and data scientists use it for predictive modeling and machine learning.

The data warehouse is generally looking at the past – what happened and why it happened. So you separate those two out and think of that functionality and separation of tasks when you’re building out your solutions.

Wayne: Now, to be fair, a lot of enhancements have been made to Hadoop in terms of security, SQL support, and metadata, so you can get more traditional, relational-type views and the ability to create tables using Parquet files. The reality is that these two environments, relational and Hadoop, are converging in functionality. They’ll never completely converge, so I think the workload split you talked about makes a lot of sense, but I’m seeing more and more people find it harder to distinguish what you can and can’t do in each environment, because they’re starting to mimic each other in many ways. But let’s move on.

Another myth that we hear often is that we don’t need OLAP cubes anymore, and I’m wondering what you think of that, especially coming from Microsoft, which has moved Analysis Services into the cloud. And it’s not a cube – it’s in a tabular format. So is that a myth or is that reality?

James: It’s a myth. We have the tabular model, which is in essence a type of cube, though we don’t use that word, in that it aggregates and summarizes data. So you process data into this tabular model or multidimensional cube, and, because it is a summary of the data, it’s going to be much faster than hitting a data warehouse or a data lake.

So if I’m going to build out a dashboard where I need millisecond response times because the end user is slicing and dicing and doesn’t want to wait a couple of seconds, you’re generally not going to get that with a data warehouse. You want the dashboard attached to the cube to get that millisecond response time. In addition, it forces you to create the semantic layer on top of the data. This makes it much easier to use the data because you’re pre-joining everything and taking the complexity out of the end users’ hands and putting it in the cube. Then they can do self-service BI by using the cube and just dragging fields to a canvas to create the reports.

You also get benefits inherent in the cube that you don’t have in a data warehouse, such as hierarchies, key performance indicators, additional security, and advanced time calculations. All these things come into play in a cube or tabular model. Most of our customer solutions have a data warehouse, which is used for operational reporting and ad hoc queries. For dashboards – queries that are predetermined and need millisecond response times – the data from the data warehouse is put into the tabular model or cube and queried through the dashboard with those fast queries.

Wayne: So you’re equating cubes with tabular models, which by definition are tables, but they’re running in memory, so they’re really fast and you’re not pre-calculating stuff. You’re kind of aggregating on demand, right?

James: Well, there’s some confusion because Microsoft had Analysis Services, which was multidimensional. Then they came out with a tabular model using DAX, which was much easier to use than MDX, and the concept behind the scenes is the same: you’re aggregating the data.

In the tabular model, it does all this in memory, so it’s extremely fast. It does have some limitations, but over the years it has come closer to having all the functionality of the multidimensional model. One of the limitations is you’re limited to what you can put in memory, and once you go above that it can be slow. Although the compression is very high when it puts the data in memory, so you can fit pretty big tables inside the tabular model. But the idea is the same whether it’s a tabular model or a multidimensional model: it’s an aggregation of the data. It’s very fast-performing, and you’re doing the work to remove the joins. You’re doing the joins when you build out the tabular model and taking the complexity away from the end user.
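
A toy illustration of why a pre-joined, pre-aggregated model answers dashboard queries so quickly – this is only a pandas sketch with invented table and column names, standing in for what a tabular model or cube does when it is processed:

```python
import pandas as pd

# Detailed fact table, as it would sit in the data warehouse (millions of rows in practice).
fact_sales = pd.DataFrame({
    "product_id": [1, 1, 2, 2, 2],
    "amount":     [10.0, 20.0, 5.0, 7.5, 2.5],
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-01-07",
                                  "2024-02-01", "2024-02-14"]),
})
dim_product = pd.DataFrame({"product_id": [1, 2], "category": ["Bikes", "Helmets"]})

# "Processing the cube": pre-join and pre-aggregate once, up front.
summary = (fact_sales
           .merge(dim_product, on="product_id")           # joins are done for the end user
           .assign(month=lambda d: d["order_date"].dt.to_period("M"))
           .groupby(["category", "month"], as_index=False)["amount"].sum())

# Dashboard query: a lookup against the small in-memory summary,
# not a scan of the detailed fact table.
print(summary[summary["category"] == "Bikes"])
```

The joins and scans over the detailed data are paid for once when the model is processed; the dashboard query only touches the small in-memory summary.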

Wayne: Is there a limit on the amount of data you can put into analysis services in the cloud?

James: Yeah, there are service tiers and they’ve changed them. I don’t have them off the top of my head, but there is a size limit because it’s based on memory. There are different performance tiers you can scale up and down and pay more for the higher performance tiers where they allow for larger tabular models.

Wayne: Microsoft is obviously a big advocate of the public cloud. It’s got its own public cloud platform called Azure. And I know you’ve talked to a lot of folks who are peddling myths about the cloud, at least from Microsoft’s perspective. For instance, ‘The cloud is too insecure. It’s too expensive. The latency to get data up to the cloud and back to company data centers is too slow and too costly’ – are these myths or are they reality?

James: Most are myths. There’s one in there that is a reality, so let’s start with that one. To get data into the cloud – obviously, if the data is born on-prem, you’re going to have to put a copy of it into the cloud. Although we do see most solutions using cloud-born data, there’s still plenty of data that is on-prem and will continue to be. So now we have the complexity of moving that data to the cloud if we have our data warehouse or data lake in the cloud.

So what does that entail? Well, it’s only going to go as fast as the pipe between your on-prem environment and the Azure cloud, and if you have a slow pipe you’re going to have issues. There are ways to increase that pipe speed. We have a thing called ExpressRoute, which essentially gives you the ability to transfer a large amount of data at an extra cost, and it can also be made redundant. So what we see companies do, if they’re pumping out many gigabytes or terabytes of data, is use ExpressRoute to upload it. But they won’t do it just at the end of the day like you would in a traditional data warehouse. A way around that is to upload it during the day – you can put the data up there every hour so you don’t have to do it all at once.
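
To make the ‘upload it during the day’ idea concrete, here is a rough sketch assuming the azure-storage-blob Python SDK, with a hypothetical connection string, container, and file layout: each hourly extract is pushed as it is produced instead of one large end-of-day transfer.

```python
import datetime
from azure.storage.blob import BlobServiceClient

# Hypothetical connection details; in practice these would come from configuration or Key Vault.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("datalake-staging")

def upload_hourly_extract(local_path: str) -> None:
    """Upload one hour's worth of extracted data instead of a single end-of-day load."""
    now = datetime.datetime.utcnow()
    blob_name = f"sales/{now:%Y/%m/%d}/extract_{now:%H}.csv"  # partitioned by date and hour
    with open(local_path, "rb") as data:
        container.upload_blob(name=blob_name, data=data, overwrite=True)

# Called once per hour (for example by a scheduler), so the pipe between on-prem
# and the cloud carries many small loads rather than one large nightly transfer.
upload_hourly_extract("extract_latest.csv")
```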

There are many little solutions like that to keep the pipe between your on-prem environment and the cloud from being a barrier, especially with our products where you can scale them up. A lot of clients will move data, import it, scale up the product they’re using, then scale it back down and have end users start doing reporting off of it. But they need the data to get in there quickly, and that’s one of the benefits of the cloud: you can scale up or down. You can’t do that on-prem.

Another one that’s a myth is security. The cloud is way more secure than anything I’ve ever seen on-prem – not even the same ballpark. We’ve talked to CEOs and CTOs many times who think they have a pretty secure data center, and then we have them visit one of our data centers and they walk away realizing just how much goes into these hyperscale cloud data centers. And we have groups in Microsoft that are monitoring the cloud for people trying to hack in or do denial-of-service attacks and all that.

You can imagine if Azure ever got hacked how much it would cost in lost revenue, because people would feel that it’s not secure enough. So the level we go to with monitoring and firewalls is just insane. And what we have in place for high availability and disaster recovery is at a much grander scale than anything anybody’s got on-prem.

The cost is a myth too. People think the cloud costs more, but it’s a pay-as-you-go service, so you’re not paying upfront. It can get expensive, but you get the benefit of not having to predetermine an amount of dollars.

You also get the ability to scale up, so you don’t have to buy the biggest machines. Then you have things like unlimited scalability with some of our storage solutions. And there are a lot of things you get in the cloud that you can’t really put a dollar figure on. For example, if I want to build a solution on-prem I need to order the hardware, rack it and stack it, install everything, and configure everything, which takes weeks or months. In the cloud I can go in there and within a couple of clicks I have this service available.

So what is the benefit in monetary figures of building a solution quicker? You really can’t put a number on it. However, all these additional benefits from the cloud have indirect costs, so that’s another thing to keep in mind.

Wayne: There are a lot of variables that go into cloud-based pricing, and I’ve heard that the break-even point is 2.3 years, but I think that average is hard to determine. What I have seen is that you just have to be a lot more alert when using the public cloud, because if you don’t turn off things you’re not using, it will cost more. So you do have to be judicious about what you’re managing and make sure you’re only paying for what you’re actually using.

James: Yeah, sure. It’s kind of a double-edged sword, because you have the ability to fire things up quickly, but you also have the ability to forget you have them. You can save money by pausing them or shutting them down, but you have to remember to do that. We’ve come a long way at Microsoft with tools to automate that, so you can specify when a machine should be shut down and when it should restart. You use PowerShell to do that.
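
The automation itself is typically PowerShell, as mentioned above; as a rough equivalent sketch, the same idea using the Azure management SDK for Python (azure-mgmt-compute and azure-identity) might look like the following, where the subscription, resource group, VM names, and schedule are assumptions and the scheduling mechanism (Azure Automation, a cron job, and so on) is left out:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

SUBSCRIPTION_ID = "<subscription-id>"   # hypothetical
RESOURCE_GROUP = "analytics-rg"         # hypothetical
DEV_VMS = ["etl-dev-01", "etl-dev-02"]  # hypothetical dev/test machines

compute = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

def stop_for_the_night() -> None:
    """Deallocate dev/test VMs so they stop accruing compute charges overnight."""
    for vm in DEV_VMS:
        compute.virtual_machines.begin_deallocate(RESOURCE_GROUP, vm).result()

def start_for_the_day() -> None:
    """Bring the same VMs back before working hours."""
    for vm in DEV_VMS:
        compute.virtual_machines.begin_start(RESOURCE_GROUP, vm).result()
```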

We’ve even gotten to a point where we’ve purchased companies so we can offer these tools for free, and we’ll look at the environment and say, ‘Well, you have these servers and you’re only utilizing 10 percent of them. You should scale them down,’ or, ‘These ones are not being used at all.’ So we’re really big into trying to save you money as one of the benefits of moving to the cloud. The tooling knows which servers are getting used and which are not. It may even tell you, ‘You should upgrade the scale of this server because it’s being over-utilized,’ and so on. So those are the cost savings that can come into play by having these additional tools available, which generally are not available on-prem.

Wayne: Yeah, that’s a big win. Now, one thing I’ve heard lately is that public clouds are not as reliable as on-premises enterprise computing environments – or at least that companies are used to a certain degree of uptime and a minimal number of outages. Have you heard this? Is it myth or reality?

James: Most of our cloud products have SLAs, everything from three nines to five nines, so they’re generally a lot more reliable than anything you’ll get on-prem. In addition, they can easily be made even more reliable. If you look at SQL Server on-prem, you can set up availability groups to have other servers that keep copies of the data.

You could do that for disaster recovery or to scale out reads, and we have that ability in the cloud too, but we’ve gone an extra step. If you look at our PaaS solutions like SQL Database, it has all the features of SQL Server and it has built-in disaster recovery. So you go to a website, you have a map of the United States, you click on where you want a copy of that data, you hit ‘Go,’ and within a few minutes that secondary server is set up. It’s so much easier to create disaster recovery in the cloud. How do you put a price on that, where you save all this manpower not only in setting it up but in maintaining it, and it’s all done by Microsoft?

Wayne: Some people are saying that you should use NoSQL databases instead of relational databases. What do you think, true or not?

James: It’s not true. Gartner will tell you that a lot of new technology that comes out gets positioned as a solution for everything. Data lakes were like that. NoSQL came out six or seven years ago, and it got a lot of media attention because ‘it’s going to replace regular databases like SQL Server.’ Microsoft has a product called Cosmos DB. It’s equivalent to things like MongoDB and Cassandra, and they’re both great products, but their use case is limited to very high-performance solutions.

They came around because before the Internet you never had a solution with more than a few thousand users – maybe the most transactions per second was a couple of thousand. But when the Internet came around, websites and mobile devices opened solutions like Pokémon GO up to millions of users, and now transaction rates can go into the millions. SQL Server is a great product, but it’s not going to give you more than a million transactions per second even on the best hardware.

So what happens if I have five million or 10 million transactions per second? You could do things like sharding and write a lot of code and add complexity to get SQL Server to work, but that’s where a product like Cosmos DB comes in: it can support millions of transactions per second. It’s all built in – it auto-shards and has the ability to deal with data in JSON format. If you want to do that in SQL Server you’d have to convert it. Now, SQL Server 2017 does support JSON, but it’s not built for a million transactions per second.
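
A minimal sketch of that JSON-native model, assuming the azure-cosmos Python SDK and invented account, database, container, and document details: the document is stored as the JSON it arrived in, with no conversion into relational rows.

```python
from azure.cosmos import CosmosClient, PartitionKey

# Hypothetical account details; in practice these come from configuration or Key Vault.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<account-key>")
database = client.create_database_if_not_exists("telemetry")
container = database.create_container_if_not_exists(
    id="device-readings",
    partition_key=PartitionKey(path="/deviceId"),  # the partition key spreads load across shards
)

# An IoT-style reading written exactly as the JSON it arrived in.
reading = {
    "id": "reading-000123",
    "deviceId": "sensor-42",
    "temperatureC": 21.7,
    "recordedAt": "2024-03-01T12:00:00Z",
}
container.upsert_item(reading)
```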

So Cosmos DB and the NoSQL products really shine when you have a lot of transactions per second and you have the data in JSON, especially when you’re dealing with IoT devices. The reality is, if you have less than 100 terabytes and 2,000 transactions per second, any of the NoSQL solutions are going to be overkill, because they’re more expensive than relational databases.

So when I talk to clients, I say these NoSQL solutions have their place, but it’s a limited use case, and 90 percent of use cases are going to use a relational database. NoSQL got a lot of play because of the Googles and Yahoos using those products, but they operate at a grand scale that almost no other company is at. That’s why there was this great myth of using NoSQL for everything. In reality they’re great products for a limited use case.

Wayne: Are there any other myths or common fallacies that you encounter on a regular basis that you’d like to address?

James: We’ve hit the major ones. I would say the other one is that you still need the data governance that goes into building solutions. You still have upfront work in building out solutions. The cloud should not be seen as ‘I can get something done quickly without much planning; I can build the solution without data governance.’ When you get into the data lake and just throw data into it, that’s where it becomes a data swamp if you don’t put data governance on top. You can start quicker in the cloud, but that doesn’t mean you bypass the standards you’d follow if you were building on-prem. You can build solutions that are going to last a lot longer in the cloud by spending more time upfront creating the architecture and using products that give you flexibility, so that a year later you don’t end up saying, ‘Oh, no. Our data warehouse solution won’t support all this. We’ve got to redo it all.’

Wayne: Well said. There are no shortcuts when it comes to data management.
