Joe Caserta: Exploring Modern Data Platforms

Joe Caserta is President of a New York City-based consulting firm he founded in 2001 and a longtime data guy. In 2004, Joe teamed up with data warehousing legend Ralph Kimball to write the book The Data Warehouse ETL Toolkit. Today he is one of the leading authorities on big data implementations. This makes Joe one of the few individuals with in-the-trenches experience on both sides of the data divide: traditional data warehousing on relational databases, and big data implementations on Hadoop and the cloud. His perspectives are always insightful.

Key Findings:

  • There used to be ETL developers; now there are data engineers who create data pipelines
  • DevOps is hot - it's a key difference between data warehousing and modern data engineering on analytics platforms
  • Knowing Spark and Python is crucial for data engineers
  • People shouldn't need to know where their data is
  • More organizations are using algorithms to make decisions
  • For now, there is no way to automate data modeling

This is an excerpt from the podcast interview between Wayne W. Eckerson and Joe Caserta.

Wayne: What is a modern data analytics infrastructure?

Joe: Good question and because I’m a consultant I can say it depends. I can talk about the trends of what we’re doing with our clients, which is a pretty good sample set of what’s going on in the industry and in the world.

So modern data engineering for us typically means being on the cloud. Nearly 100% of our work in 2017 has been either migrating to the cloud or building something from scratch in the cloud. If there is some legacy on-premises tech there, sometimes we extend it, but typically any new initiative where analytics is a central feature is going to be a cloud solution. Whether it's AWS or Google or Azure or something else is a mixed bag. It really depends on the features, functions, and religions involved.

So, with that, once we have a cloud infrastructure, then typically what's included is some kind of object data store. If it's AWS, that would be S3; if it's Google Cloud, it would be Google Cloud Storage (GCS); on Azure it would be what they call Blob Storage. Which flavor of the cloud doesn't really matter that much. So now we have cloud storage, an object store, and then we need some kind of queryable, BI-friendly environment. Typically that is still some kind of relational, MPP-type database, still on the cloud. In Google it would be BigQuery. On AWS it would be Redshift or Snowflake.

Then the final piece is the data transformations and the orchestration. Typically what we've been doing is using Spark for all of that. So all of the ETL that is done, all of the movement of data from the object data store to the relational database, is done using Python code or SQL code, always in Spark. The analytics on top of it would still be Spark, and then on the relational database sits some kind of newer, lightweight BI tool. That's really the infrastructure.
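A minimal sketch of that pattern in PySpark, with hypothetical bucket paths and columns: read raw data from the object store, transform it with Spark, and write a curated, query-ready dataset that a warehouse such as Redshift, Snowflake, or BigQuery could then load through its own connector.

```python
# Sketch only: bucket names, paths, and columns are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_pipeline").getOrCreate()

# Extract: raw files land in the object store (s3a://, gs://, or wasbs:// paths)
raw = spark.read.json("s3a://example-raw-zone/orders/2018/")

# Transform: cleanse and aggregate with ordinary Spark/SQL logic
daily_sales = (
    raw.filter(F.col("status") == "complete")
       .withColumn("order_date", F.to_date("order_ts"))
       .groupBy("order_date", "region")
       .agg(F.sum("amount").alias("total_sales"))
)

# Load: persist a columnar, BI-friendly copy; a warehouse would typically be
# loaded from here via its own connector or COPY/LOAD command.
daily_sales.write.mode("overwrite").parquet("s3a://example-curated-zone/daily_sales/")
```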

This all fits very neatly into what we call “the corporate data pyramid”, which we’ll get into. So that’s just the infrastructure side and then we have to build it, maintain it, and productionalize it.

In 2011, probably 20% of our business was doing this, and it has increased by 10% to 20% each year; now in 2018 it's all of our business. But what has changed in the past year is that now we're integrating the concept of DevOps with the analytics platform. And this is one of the big differences between traditional data warehousing and modern data engineering and analytics platforms: you used to have some kind of business application that would generate all of the data, then it would waterfall all of that data into a data warehouse, and then you would do reporting for human consumption. The really big thing that's changed in the last couple of years is that the analytics platform is now tightly integrated and tightly coupled with the business applications. The business applications now depend on the analytics in order to function and in order to create the user experience, based on recommendation engines or scoring, things like propensity to buy or propensity to do whatever we're trying to measure.

So now we need different SLAs. What's happening is that the development and the deployment of change has to be very tightly coupled with the business applications. So now we're becoming part of the DevOps environment, and when we build a new analytics platform from scratch using modern data engineering, we're using things like containerization for deployment rather than a big-bang, bundled manual deployment.

Wayne: How do you put analytics in a container and are you talking about embedding analytics into an ERP application or something else?

Joe: We're using something like Docker, where we're putting microservices into containers and deploying them as containers. So, for example, say you want a new feature of your application, like 'What ads should I be showing?' or 'What headlines should I be showing?' to this customer based on their demographics. To deploy that, we would take all of the relevant components throughout the ecosystem, put them into a container, and just deploy that, rather than doing a big-bang deployment or even a scheduled deployment.

So we would take the little bits of Python code, little bits of data, whatever is required, and deploy just that section of the solution. It then goes through the DevOps lifecycle, where the change is detected, deployed, and tested, all automated with a virtual push of a button. You can do that continuously.
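A hypothetical sketch of the kind of small Python scoring service Joe describes packaging into a container: the business application calls it to decide which ad or headline to show. The framework (Flask), endpoint, model file, and feature names are illustrative assumptions, not Caserta's actual implementation.

```python
# Illustrative scoring microservice; everything here is a placeholder.
from flask import Flask, jsonify, request
import pickle

app = Flask(__name__)

with open("propensity_model.pkl", "rb") as f:   # a model trained in the data lake
    model = pickle.load(f)

@app.route("/score", methods=["POST"])
def score():
    customer = request.get_json()               # e.g. {"age": 34, "visits": 12, "tenure_days": 400}
    features = [[customer["age"], customer["visits"], customer["tenure_days"]]]
    return jsonify({"propensity_to_buy": float(model.predict_proba(features)[0][1])})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)          # the container image runs this entry point
```

The point is that this one small piece, the code plus whatever model and reference data it needs, can be versioned, tested, and deployed on its own rather than as part of a big-bang release.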

In the past, there would be one release cycle for your applications and a different release cycle for your data warehouse and analytics, and now it doesn’t really matter. Now it’s just continuous deployment of both. This is the way we’ve been doing things [here at Caserta] probably for the past year, but it really gained momentum about maybe six months ago. It’s becoming the way we design systems from scratch.

Wayne: So are the clients requesting you do this, or are you suggesting, 'Hey, to get more value out of your data and analytics, let's push this back into your operational, customer-facing applications and we will take care of that for you'?

Joe: Typically, we don’t push anything to our clients. They tell us what problem they’re trying to solve and we help them solve it and one of the things that they’ve wanted to solve is standing up DevOps within their regular production and business application ecosystem.

For a lot of our clients this whole concept of DevOps is new even on the business side. For software development companies, the product vendors, for them it’s not so new, but for banks and insurance companies and media companies, they’re just dipping their toes in it now, so we’re helping them establish it not only for the analytics platform, but holistically through the company as well.

Wayne: It is interesting to see things coming back together and that virtual loop being closed. Do you find you have to appoint different people to handle the DevOps from the analytics perspective or do your developers take care of it as part of their go-to market skills?

Joe: No, it is a different skill set, so we've actually built a DevOps practice within Caserta to handle it. Now we have a team of people that specializes in this stuff: understanding Docker, understanding Kubernetes, understanding Selenium, understanding the container registry for the different clouds and how it all works together, and, when it comes to automated testing, exactly what you test and how you test it. And then the orchestration of all of this. That is a different skill set. It's very, very different from data. But the two worlds have finally converged.

Wayne: There are a lot of organizations out there who are trying to migrate to a more modern data platform, modern analytics platform. What do you see as the biggest challenges that they face?

Joe: So, the DevOps thing is something that everyone is trying to get their head around right now. When you have a whole staff of people who know SQL and know relational databases and now we say, 'Okay, but all of your data is going to go to an object data store,' they ask: what does that look like? How should the data be organized? How do you query it? How do you use it? It takes that type of training, but, to be honest, it's really not as much of a leap as it was even a year ago.

The evolution is happening very, very, very rapidly. A year or two ago we'd say, 'You need to use an object data store,' and we were speaking some foreign language. Now they get it, and they say, 'Okay, let's do it.' I think what's happened is that over the years people started dipping their toes in and realizing the economics of it. Hadoop was the gateway drug for this type of platform, where you could start experiencing drastic cost reduction with enhanced capabilities, and this is just taking it to the next level.

Wayne: Let's talk about the object stores. Are they replacing Hadoop, and in short order as you imply, or not?

Joe: When I think of Hadoop, I think of two different things. There's HDFS, the Hadoop Distributed File System, and then there's the Hadoop ecosystem that has other applications. We are still using the Hadoop ecosystem: we're using Google Dataproc, or on AWS we're using EMR, which is basically just Hadoop. But in place of HDFS, for several years we have been using the object data stores that are native to the clouds.

Wayne: What is the role of the object store? What do you do there? Do you just load all your raw data or do you ever query it?

Joe: We came up with what we call the "data pyramid". The data pyramid consists of four components or four tiers. The first tier is purely ingestion of data, or data staging. This is where we receive data coming from other sources. It could be internal data, external data, structured data, unstructured data. It doesn't matter. We just bring everything in with full fidelity and then we persist it. So if you ever want to rerun your data pipeline or your ETL, you always have the data.
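A minimal sketch of that ingestion tier, assuming an AWS-style setup with a hypothetical raw-zone bucket: incoming files are landed untouched, partitioned by source and arrival date, so any pipeline can be replayed later.

```python
# Sketch only: bucket and source names are hypothetical.
import datetime
import boto3

s3 = boto3.client("s3")

def persist_raw(local_file: str, source: str, bucket: str = "example-raw-zone") -> str:
    """Upload an incoming file to an immutable, date-partitioned landing area."""
    key = f"{source}/ingest_date={datetime.date.today().isoformat()}/{local_file.rsplit('/', 1)[-1]}"
    s3.upload_file(local_file, bucket, key)     # original bytes preserved: full fidelity
    return f"s3://{bucket}/{key}"

# persist_raw("orders_20180115.json", source="web_orders")
# -> "s3://example-raw-zone/web_orders/ingest_date=2018-01-15/orders_20180115.json"
```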

Then, once you have all of your data, you build what is affectionately known as the "data lake". The data lake takes your data from the ingestion area and enriches it with data that you have internally, like reference data or master data; it also does some organization and some light governance. Now that data is available for analytics, and you can sit something like Spark on top of it. And this paradigm doesn't need to be big data. It's just data, and whether it's big data or small data doesn't really matter. I guess the only difference is that if you don't have big data, you may not necessarily need Spark.

You can just use Python, but even if you are using Spark, you still use Python, because Spark has a great Python API. So you would typically use Python for analytics. You can use R for analytics. You can use SQL for analytics. It doesn't really matter. Spark supports them all, and you can hit all of your data in the object data store.

Now we have the ingestion layer and we have the data lake layer, but what we don't have yet is some kind of workspace or sandbox for people to create new datasets, do some sampling, do some testing, do some what-ifs, build some models. What we do is build ephemeral workspaces. The thing that's really nice about being on the cloud is that we can spin up a new cluster to work in that has Spark. You can enrich data. You can change data. You can do whatever you like. Then there needs to be some kind of information lifecycle management (ILM) that says: if you want to retain this data, we need to graduate it, or promote it, either back into the lake or downstream to a data warehouse. If not, it's ephemeral, so it's going to go away in some timeframe. Usually it's 90 days. If you don't do anything with it, 90 days later that cluster is going to go down.
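One possible way to implement that 90-day rule, assuming an AWS setup where the workspace data lives under a sandbox prefix in S3; the bucket and prefix names are hypothetical, and tearing down the compute cluster itself (for example via EMR or Dataproc idle timeouts) would be handled separately.

```python
# Sketch only: expire untouched sandbox data after 90 days.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-analytics-workspaces",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-ephemeral-workspaces",
                "Filter": {"Prefix": "sandbox/"},
                "Status": "Enabled",
                "Expiration": {"Days": 90},   # anything not graduated back to the lake goes away
            }
        ]
    },
)
```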

That eliminates what used to be shadow IT departments, and this also takes the burden off IT because you could spin up those clusters as a sophisticated user. You don’t need to be IT to spin up a data science cluster. You could just be part of the data organization on the business side.

Then, you could do your science and analytics. You could get your findings and then you could either go through the data governance process of retaining or you can just drop it.

So those are the three tiers. You have the ingestion area, the data lake, and the data science workspaces. Finally, what we have is still a data warehouse.

You asked if there's still a place for data warehousing in modern data engineering, and the answer is yes, but it has morphed a bit. The morphing is that it's on the cloud and it's usually a columnar data store. When we do our designing and our data modeling, we're still thinking dimensionally. We still think about customers and products and stores and all the people, places, and things that we create dimensions for, and we also keep thinking about events and transactions and facts. So we still think about a dimensional data model, but the way we implement it may vary depending on the platform. You don't need to be so religious and pure anymore. The technologies out there don't like it when you're religious and pure. It's okay to have a wide fact table with lots of attributes in it. That was a big, big, big no-no in the days of Oracle, SQL Server, and DB2 relational databases.

Wayne: Is that because the performance of these in-memory columnar databases is so good that you can get away with that or something else?

Joe: Well, it actually likes it better. It's really the preferred method. It behaves better when everything is coexisting in a single store rather than having data distributed across a bunch of different tables. The query optimizers actually prefer a single set of data that they can just scan through. So that's a big change: logically we still create dimensional data models, but physically we do some denormalization, taking a lot of what would normally be out in dimensions and folding it into the facts.
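A sketch of that denormalization in PySpark, assuming hypothetical curated tables: the model is still designed dimensionally, but the dimension attributes are folded into one wide fact table for the columnar store to scan.

```python
# Sketch only: table and column names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wide_fact").getOrCreate()

sales   = spark.read.parquet("s3a://example-curated-zone/fact_sales/")
product = spark.read.parquet("s3a://example-curated-zone/dim_product/")
store   = spark.read.parquet("s3a://example-curated-zone/dim_store/")

wide_fact = (
    sales.join(product, "product_key")          # fold product attributes into the fact
         .join(store, "store_key")              # fold store attributes into the fact
         .select("order_date", "quantity", "amount",
                 "product_name", "category", "brand",
                 "store_name", "region", "country")
)

# A columnar engine (BigQuery, Redshift, Snowflake) scans this one wide table
# instead of resolving the joins at query time.
wide_fact.write.mode("overwrite").parquet("s3a://example-curated-zone/wide_fact_sales/")
```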

Wayne: I always thought we did that because the BI tools want to build flat tables, but it sounds like Google BigQuery loves that too.

Joe: It’s a win-win because the BI tools do like that and people like that. Even some of the newer, sexier BI tools, they still kind of force you to know where your data is. I think we’re getting closer and closer to making that go away where you shouldn’t have to know where it is. I think Google BigQuery does a pretty good job of that, but I do think that we’re not finished yet.

I think the industry still needs to take the next leap, and the next leap is natural language processing for querying data, where you can just open up a search bar and say, 'Give me sales by region,' and it will give you a graph of your sales by region, and you don't need to know: is region in a location table? Is it in the store table? Is it in the customer table? You shouldn't have to know these things anymore. It's 2018; you should be able to just ask. We should be able to ask Alexa what our sales by region are, and she should just tell us, and I think we're getting closer and closer to that.

Wayne: Yeah, we're starting to see some tools come out, like ThoughtSpot, that are all NLP, and we're seeing it incorporated into others like Power BI.

Joe: I think NLP is the next logical baby step, but I think we’re probably going to leapfrog it pretty quickly directly into voice. I was on a panel with a bunch of data scientists about three or four months ago, and they were asking us for predictions. My prediction still to this day is I think keyboards are going to become something that we talk about in the history books that we don’t use anymore. In our lifetime we’re going to see that happen. I think it’s going to be all about voice.

You know, chat bots are a big hot item right now with a lot of our clients, and they're using artificial intelligence as well. Artificial intelligence, chat bots, BI, and modern data engineering are all starting to converge, and I think it's not going to be long before we can just say, 'Hey, computer, what are my sales by region?' and you're going to hear a voice that says, 'Wayne, you did $1,257,000 yesterday.' That is the near-term future, I think.

Wayne: I was talking to a director of analytics who’s definitely on the bleeding edge of technology, and they are querying the object store directly with Presto. They have data scientists. They’ve got rid of all their data analysts, and they’re using algorithms to make all their decisions. Talk about AI going wild, right? They trust the algorithms more than humans looking at dashboards. So I’m wondering if that’s science fiction or maybe you can see that as a reality. They have a hundred PhD data scientists on their staff, so not everyone can do this, but it certainly is maybe a scene from the future.

Joe: Well, it's not in the future. It's now. Many of our clients are doing that, or at least approaching it. And, again, remember I said all of your data is in the object data store, and then you push some of your data out of the object data store (that's why we call it a pyramid, because it gets smaller as you go up). So all of your data is in the data lake, and then from the data lake you're pushing some of your data out to the relational database for consumption. As for why most of the data being used stays in the object data store... well, there are two things.

It’s in the object data store because machines don’t really care what the data looks like. It’s pretty much known queries from the machine and the machine knows what to do with that data. The reason you need to put it in a relational store is because human beings like to write SQL, and they like to use BI tools, but what we’re seeing is in order to write SQL and use BI tools you just need a SQL engine. You don’t necessarily need to have a separate platform, and that’s why we’re seeing things like Spectrum and Athena and Presto. And what we’re seeing is that SQL engines, instead of moving the data into a separate place to function, are just reading the object data stores.
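A hedged sketch of that pattern, using Athena purely as an example of a SQL engine reading the object store in place; the database, table, and bucket names are hypothetical, and the table is assumed to be an external table defined over files in S3.

```python
# Sketch only: query data where it sits, with no separate database load.
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="""
        SELECT region, SUM(total_sales) AS sales
        FROM daily_sales            -- external table over s3://example-curated-zone/daily_sales/
        GROUP BY region
    """,
    QueryExecutionContext={"Database": "example_lake"},
    ResultConfiguration={"OutputLocation": "s3://example-query-results/"},
)
print(response["QueryExecutionId"])   # results land in the output location when the scan finishes
```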

Wayne: Do you think that’s the way of the future? We won’t need a relational database for users to query. They’ll just query the object store?

Joe: I think so. That’s a big question mark. Well, we’re already seeing it. I think there is a little bit of latency that users have to contend with when you’re taking a SQL engine and sitting it on top of an object data store, but with in-memory it’s going to be a non-issue. But I do think that it’s going to be the future.

Right now, whether BigQuery has its data in its own repository or in Google Storage doesn't really matter. All we care about is the semantic layer, and this goes back to Oracle. At the end of the day, Oracle is basically just a semantic layer. All the data is still sitting on a file system somewhere, and it's the same thing here. We have a SQL engine, whether it's Redshift or Snowflake or Impala or Presto or Athena. It doesn't matter.

Wayne: So all the technology that we’ve used to deliver insights to users is changing and it’s simplifying the world where we don’t have to move data. We just ingest it, and then we can query it there, but the one thing that seems to still be a requirement is the semantics and modeling the data – and I guess the question is how much do you have to integrate the data before you can model it for users to consume and where does that happen, with what tools?

Joe: And that’s why you still need consultants or you still need people. That also may be changing and we did talk about why it’s changing…

If you want to get a whole picture of your customer journey, think of all the touch points of a customer: they're getting mailings, you're doing phone calls, you're sending emails, they're interacting on the website, they're taking advantage of coupons, and there are campaigns sent to them. Every single one of those things I just mentioned is a different system, and each one of those systems tracks things in its own way. They may call the same customer different things. Some may track a household, some an individual, some a minor or a parent. There's a lot of complexity going on when you're trying to string all of this together, and today there's no way to do that automatically.

We need to look at the data, profile the data, and see what the mappings are, build some algorithms to do either deterministic matching or probabilistic matching, and there’s a lot of trial and error involved. Once it’s all built, then you can do this identity resolution in the pipeline. If you were to hire Caserta, we do all of that in Spark. We do all of that in Python code, and it’s all custom-built because every system is different. Every way you can possibly identify a customer or a product or some kind of event can be different. So it always has to be custom. That body of work, I don’t see that becoming automated or obsolete. Somebody has to figure that out and do all of those mappings and do all that identity resolution process and then figure out how to string it all together. Once you have that built then you could do really good allocation models and customer journey heat maps, and then the data gets really, really interesting.
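A heavily simplified sketch of that identity-resolution work in PySpark: a deterministic match on a normalized email key, with a fuzzy name match standing in for the probabilistic side. The source tables, column names, and distance threshold are illustrative assumptions; a real pipeline involves far more rules, profiling, and tuning.

```python
# Sketch only: sources, columns, and threshold are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("identity_resolution").getOrCreate()

crm = spark.read.parquet("s3a://example-lake/salesforce_customers/")
web = spark.read.parquet("s3a://example-lake/web_profiles/")

norm = lambda c: F.lower(F.trim(F.col(c)))      # normalize keys before matching

# Deterministic: exact match on lower-cased, trimmed email
deterministic = crm.alias("c").join(
    web.alias("w"), norm("c.email") == norm("w.email"), "inner"
).select("c.customer_id", "w.web_id")

# Fuzzy fallback for records with no email match (probabilistic-style rule)
unmatched = web.join(deterministic, "web_id", "left_anti")
fuzzy = crm.alias("c").join(
    unmatched.alias("w"),
    F.levenshtein(norm("c.full_name"), norm("w.full_name")) <= 2,   # illustrative threshold
).select("c.customer_id", "w.web_id")

resolved = deterministic.union(fuzzy)           # one cross-system customer mapping
```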

And then you could build algorithms to use that customer journey to find out behavior patterns of people who leave and you could do interventions to avoid attrition, and then all of that could be automated, served up as a microservice and embedded in your business application, and then the data gets really interesting. For all of this you don’t need a relational database. All of this is done right in the data lake, and that’s the exciting part. And that use case is really just a true story of how business is becoming dependent on analytics and how analytics is driving the business application.

Wayne: So it used to be that preparing the data and creating the model was 80% of the work. Is that still true even though all the technology has changed?

Joe: Again, it depends on what you're trying to do. If all of the data is internal and aligned and you're getting data from just a few sources, it starts to become trivial, but what we're seeing is that that's less and less the case. What's becoming the case is that every business process is handled by a different system. So now we have Workday and we have Salesforce, and there's a platform-as-a-service or software-as-a-service offering for every different component of a business. It's causing business processes to become very, very fragmented.

Then, you know, the world goes around, so that comes back around. Back in the '80s Ralph Kimball created the dimensional model bus matrix, and every project that we do starts with the dimensional model bus matrix.

So what we have to do is figure out what all of these business processes are, capture what they are and what they do, and go over the dimensions associated with all of these business processes. Then we figure out where that data comes from and start building out the data lake. From the very onset of any project we're thinking dimensionally and we're thinking in the bus matrix fashion. We use the bus matrix as a prioritization roadmap and we start hitting one process at a time. Once you have the bus matrix and you can see the intersection of all of your different dimensions for your business across all of your processes, it becomes very easy to start planning out your data integration.

So for each point of integration, you need some kind of matching rule or matching algorithm, and then you just start building. That's not really repeatable; it has to be done every time. Every time you have data coming from two different systems, you have to build the bus matrix, and for every intersection you have to come up with a plan for how to integrate that data.
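As a toy illustration of how the bus matrix drives that planning, here is a sketch (with invented processes and dimensions) of mapping each business process to its conformed dimensions and surfacing the shared dimensions that need matching rules.

```python
# Sketch only: processes and dimensions are made up for illustration.
from collections import Counter

bus_matrix = {
    # business process : conformed dimensions it touches
    "web_orders":      {"customer", "product", "date"},
    "email_campaigns": {"customer", "campaign", "date"},
    "store_sales":     {"customer", "product", "store", "date"},
}

# Every dimension shared by two or more processes is an intersection that
# needs a matching rule or identity-resolution algorithm.
usage = Counter(d for dims in bus_matrix.values() for d in dims)
shared = sorted(d for d, n in usage.items() if n > 1)
print(shared)   # e.g. ['customer', 'date', 'product'] -> prioritized integration points
```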

Wayne: One thing that hasn’t changed from the old world of data warehousing is the need to get in there, roll up your sleeves, understand the source data and then model it, and harmonize it into something that the business can start to use. So that still takes time, maybe even more time than it used to. Is that true?

Joe: The only thing that’s different is we need to match all of the similar data, components of data, like the customers and the products and all of the nouns of the business, but I guess what’s different is in the data lake we don’t necessarily need to have a customer table and a product table. We can leave the datasets fairly raw as long as we can join the datasets through some kind of mapping or through some kind of algorithm, and I think that’s what’s different.

So when we ultimately send it to the data warehouse for human consumption, the casual business user is not going to write an algorithm to join customers from Salesforce to customers from their DMP to customers from their transaction system. They just want to know.

They want to select 'Joe Caserta, my customer; what events have happened with Joe?' and see a list of all of the events. That's a very, very different type of query from what somebody in the data lake wants. That person is going to want to build a model to predict what Joe is going to do next, and for that they want raw data. Data scientists like raw data. The more you prepare the data, the more you just frustrate them.

So, all we really want is to have it very lightly governed to keep everyone safe, and we want to allow them to connect the dots between all of the different datasets, and that’s what the identity resolution algorithms do.

Wayne: This leads us into a discussion of roles in IT and that obviously is changing in many ways, but maybe not in some other ways. One acronym you have not used is ETL, and in the old data warehousing world there were more ETL developers than any other role out there. How is the role of IT and the traditional data warehousing shop changing? What new skills should people be learning?

Joe: As you know, I wrote the book on ETL. I was Mr. ETL for decades. The largest part of our projects is still ETL. We just don't call it ETL anymore. So there are two things that changed. We used to have ETL developers who wrote ETL. Now we have data engineers who create data pipelines, and it is different…

Wayne: Are those really different?

Joe: It is different. So, conceptually, we're still extracting data from some place. We're still transforming it into something and we're placing it back on disks somewhere. So, conceptually, we're still doing ETL. Physically, it's very different. Physically, all the data is in an object data store unless we're pushing it out to the MPP on the cloud.

Before, when we thought of extraction, we would extract data from some other system, and when we were loading, we would load it into yet another system. Now we're doing it pretty much all in the same place. The lines between all of these different systems are getting blurrier every day. That's why we stopped thinking about it as "this system and that system". It's your entire corporate data ecosystem, and in any ecosystem the components need the other components to stay alive; once you start removing things, other things start breaking.

So we still need all of those different sources and targets. It's just where they are and what technologies we're using that's evolving tremendously. That's all changing, and the way we do our ETL is changing. The thought of extracting from a source, having an ETL server, bringing all the data to the server, running all of your ETL, and then loading it to another database: that's changed. We don't do that anymore. Now we have everything on the cloud, in your data ecosystem, on some kind of object data store, and then we're running Spark or some other kind of engine where we can write Python, or we can write SQL, or we can write Java. The languages have drastically changed. We used to do either 100% SQL or some kind of proprietary user interface for an ETL tool, whether it was Informatica, DataStage, or Ab Initio.

Because we're now integrated with DevOps, with the production systems, and with the business transactions, in order to be agile we're back to writing code, the way we were doing it in 1986 with COBOL. We're writing mostly Python, and so the ETL developers are now data engineers. The data engineer has to know how to organize data and how to manipulate data using this new paradigm.

Wayne: They have to know how to write Python and use Spark as an ETL developer.

Joe: If you want to be hired by Caserta, you will be given an online Spark and Python test. We will give you some business scenarios and you have to resolve those scenarios using Python, and if you can't, you don't get hired. It's pretty binary. You're scored on a scale of 1 to 10, and the bar depends on whether you're a junior, a consultant, or an architect. We use that scoring to place you, but if you don't pass the test, you don't get hired. You have to know Python. You have to know Spark.

Wayne: So, that’s the future. If people want to jump on this new data engineering bandwagon, those are the core skills they need.

Joe: Well, that's definitely today. If I had my crystal ball, I would say that the world goes around and keeps going round and round. Back in the '80s, we wrote code, and then we had these tools like Informatica that would write the code for us or allow us to do the work without writing code. There are even things out there now, like Paxata, that are 100% Spark-based and allow you to do a lot of data manipulation without writing code.

So I think there will be a trend of products built from the ground up to live in a distributed-compute world, whether it's based on Spark or something else, where you don't have to be a technologist in order to manipulate data. I think we'll see a lot of innovation and new data pipeline tools like AWS Glue and Trifacta. There are tools out there starting to emerge that say, 'Yes, this is the future. Yes, we do need to be able to hit data that's in an object data store. Yes, we need to distribute our workloads across a cluster. But, no, you don't have to be a Python programmer to do it. You can use our user interface.' It has to happen. It's just slow going at the moment. There's little out there, and no one is really leading the charge on this yet. I think somebody needs to.

Wayne: IT used to do everything when it came to data warehousing. They built the data warehouse, the applications, the reports, and the dashboards. What is IT’s role these days?

Joe: In our client base, the role of IT, it’s really a mixed bag. They’re all morphing, but they’re morphing in very different ways, and it really depends on the organization.

With some of our clients, the data organization rolls up to the CFO. Sometimes it rolls up to the COO, and sometimes it rolls up to the CIO. If the data organization rolls up to the CIO, then besides the tools and technologies, very little is changing; they're still the people doing all of this work. Where it rolls up elsewhere, to the CMO or the COO or the CFO, the other divisions are becoming less and less dependent on IT, and IT is really part of infrastructure. But now all of your infrastructure has moved into the cloud, so IT is still responsible for the health and monitoring of what's going on in the cloud.

If you liked the podcast, please subscribe. If you want more content from business intelligence to data management to data science, browse through the Eckerson Group website at eckerson.com.
