Lenin Gali: Past and Future of the Cloud and Big Data

Lenin Gali is a data analytics practitioner who has always been on the leading edge of where business and technology intersect. He was one of the first to move data analytics to the cloud as BI director at ShareThis, a social media services provider. While at Ubisoft, he was instrumental in defining an enterprise analytics strategy and developing a data platform, built on Hadoop and Teradata, that brought game and business data together, enabling thousands of data users to build better games and services. He is now spearheading the creation of a Hadoop-based data analytics platform at Quotient, a digital marketing technology firm in the retail industry.

Key Findings:

  • Don’t build in the cloud; build for the cloud
  • Deciding to work with data is not a box you check; it’s an ongoing journey that demands a cultural shift
  • Every cloud provider is different; figure out which one best fits your needs
  • There is no perfect solution in the cloud; it’s not magic
  • You can forklift legacy systems into the cloud, but don’t expect that to solve all your problems
  • Organizing data is foundational; organize it on a use case basis
  • Applications, services, and data warehouses must be fault tolerant

This is an excerpt of a podcast conversation between Wayne W. Eckerson and Lenin Gali.

Wayne: You’ve spent a lot of time doing work in the cloud and were one of the first to bring data analytics to the cloud when you were at ShareThis. From your vantage point, what are the benefits and challenges of running data analytics in the cloud?

Lenin: I’ll tell you a little story about how we came to do everything in the cloud.

ShareThis was a small startup in the Midwest, and I came in to help it embrace innovation and scale. In those days, every time we did anything with data it was a big pain, and the biggest challenges were always scale, agility, flexibility, innovation, and opportunity. There were many questions we had to figure out.

Doing these things in a relational database or an MPP system in those days was very expensive. The cloud was very attractive, but also very complex, so it was a tough decision. I was posed the question: stay in a data center or go to the cloud? The company was going to be built on top of data, so we needed that scale. We didn’t have the resources to do everything we thought we could and still spend all our money on data center-related activity and resources, so we went all-in on Amazon and AWS. The challenge was its newness and its unknowns. Everything we did was risky. There were days and nights I spent doing research, hiring people, and figuring out the right strategy and ways to do things.

The benefits – I think it was the best thing we ever did. It helped move the company to where it is today. It’s still evolving and growing. The company went from zero to millions of dollars in revenue, moved from one type of product to a multi-pronged approach, and really became a data powerhouse.

The benefits are agility, scale, flexibility, innovation, and opportunity, whereas the challenges have been selecting tools, defining a strategy for data organization and data governance, and managing growth, storage, access, processing, operations, and deriving value from the data.

The biggest challenge going into the cloud was figuring out how to operate it. In those days, the data team was a single team, and most people’s skills were tool-driven. If you were a business intelligence person, you used Informatica, Oracle Data Warehouse, MicroStrategy, or Tableau. Everything was a tool and you simply integrated it, but when you went to the cloud, none of those tools were readily available. You had to figure out everything yourself.

What I’ve learned is don’t build in the cloud, build for the cloud.

Wayne: What do you mean by that?

Lenin: Even now, a lot of people who move into the cloud think it is a data center, infrastructure as a service. Back in the day there were a lot of things people didn’t know. We’ve had to learn and figure out things pretty quickly, like building an infrastructure in a data center, scaling, defining architecture, and developing programs that understand that the underlying servers can disappear at any minute. They’re not stable.

In the cloud it’s all virtualization, which means any machine you power up and add to your cluster may disappear. It can stop at any time, so your application could break if it assumes the underlying infrastructure is always there. That means you have to build for fault tolerance, and that is the fundamental difference between building in the cloud and building for the cloud.

If you treat it like infrastructure, like a data center, you’ll just move yourself into the cloud and expect your application to work the way it does in the data center, because it is infrastructure. However, the quirks and intricacies of the cloud require you to make sure your application is fault-tolerant, scalable, agile, and able to adapt to uncertainty. Without those qualities, the struggle and the challenge are going to hit you right away.
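A minimal sketch of that fault-tolerance idea: the caller assumes any node can vanish mid-request and retries with exponential backoff. The function names and the simulated flaky service below are hypothetical, for illustration only.

```python
import time

def with_retries(operation, max_attempts=4, base_delay=0.1):
    """Retry a flaky operation with exponential backoff.

    In the cloud, any underlying server can disappear mid-request,
    so callers should treat failure as normal and recover automatically.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up only after the final attempt
            time.sleep(base_delay * 2 ** (attempt - 1))

# Simulated cloud service whose node "disappears" on the first two calls.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("instance terminated")
    return "payload"

result = with_retries(flaky_fetch)  # succeeds on the third attempt
```

An application built this way keeps working when a virtual machine is reclaimed, which is exactly the "build for the cloud" posture described above.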

Wayne: For companies looking to get into the cloud today, especially those who have existing legacy on premise, what would you advise?

Lenin: Working with data is a journey. Whether you’re thinking about starting that journey, or you’ve already started, are stuck with what you’ve built, and need to move to the cloud to refactor and re-architect, it’s a journey. It’s not just a reorganization, a simple move, or building or buying a platform. It’s a cultural shift, an organizational shift, and a technology shift. So you need to set expectations before venturing into this type of journey with data, because the benefits take time, whether on premises or in the cloud.

My recommendation is to have a strategy. Understand your goals and make a good tool selection. And go with simple and small improvements. Walk before you run. And you need people. You need skills and you have to be able to fail quickly and recover. I consider every opportunity to fail a success for your long-term strategy. That empowered us to run faster.

Wayne: Are there specific ‘gotchas’ today that people should know about when it comes to doing data analytics in the cloud?

Lenin: Every organization is different, so it’s very hard to speak about any one in particular. But you have to understand the architecture. The gotchas come down to why you’re going there. What is the mission? Is it cost cutting, cost optimization, or cost savings? Or is it to build the flexibility to leverage data to build products as a strategy for your company? Strategy is at the core.

Apart from that, I would say, selecting the right tool or platform. If you’re on Amazon you have enormous toolsets and the maturity is quite different. Google is quite different, and the toolsets and opportunities to work with data are different. Azure is a bit different too. Every cloud provider gives you tools and an environment that you have to adapt to.

First make sure that you’re going to the right place. That requires you to understand which cloud is the right cloud for your business. Then, be flexible. It’s one thing to go into the cloud and say, ‘I want to be independent, agnostic to the cloud vendor, and operate on my own, so I’ll take the tools I’ve already paid for and move them in.’ The alternative is a hybrid approach where you take on cloud-native optimizations, because the reason cloud providers build those services is to make it seamless for you to move data between applications and services. If you don’t adapt to what they’re offering, your costs aren’t going to make you comfortable.

Wayne: So, you’re saying that it’s actually more cost-effective to buy new platform-specific services than bring your own legacy tools into the cloud?

Lenin: My recommendation to anyone venturing into this particular journey is never to undertake a re-architecture and a migration as a single project, that is, trying to re-architect everything at the same time you move into the cloud.

You could take on that mission, but then you have to do two things: maintain what you have and support ongoing projects and initiatives in the existing environment, and stand up an entirely new team. It’s a huge investment and a distraction. Your team will have to learn to re-architect while disturbing the people who are probably your main knowledge base, and who then probably won’t be able to deliver what the business is expecting. A lot of problems come with that kind of strategy.

Wayne: So, you’re suggesting that you forklift your legacy into the cloud, and then begin a process to re-architect?

Lenin: The strategy has to be multi-pronged. There is no perfect solution when you’re going into the cloud. Don’t expect it to work like magic. Like I said, it’s not building in the cloud, it’s building for the cloud. So if your application isn’t going to scale today, don’t expect it to scale when it goes into the cloud.

With most projects, you may have to reconsider even going into the cloud based on the way the application is built. If it’s not reasonable for you to move it, you probably don’t want to forklift it. You may have to re-architect for good, because the application is probably at the end of its life or the cloud offers no forklift capability for it.

In that situation, I would recommend not moving any of that legacy, that old architecture, or even applications into the cloud. But for a majority, even the clouds that have matured today, you can forklift; move your licenses and then adapt to the cloud.

As long as the path to migrate is simple and you keep your current commitments and set the right expectations, you’re going to make your team successful and the project a success. You’re not going to lose a whole lot by moving into the cloud. The investment is probably going to be a little high going in, but it’s worth it.

Wayne: Would it be safe to say that the best way to start with the cloud is to do some small projects where you can design for the cloud and learn its nuances before you tackle your legacy systems?

Lenin: Yeah, that is typically how most companies have learned to do it, but a legacy move is never an easy thing. If a project is separate from your existing legacy, it’s the perfect candidate.

The flip side is that there are certain optimizations you can do in the cloud that you couldn’t do in a data center. You can certainly build deployment automation, which you couldn’t do in a data center, where you had to get the hardware, certify it, and put it all together. In the cloud, that network and groundwork is very easy, so you can create a Docker container or even use Kubernetes.

There are lots of opportunities to automate the groundwork so you don’t repeat the work you typically do in a traditional environment when moving into the cloud. If you do that, there are fewer mistakes and everything is configurable and modularized. I think that’s where the journey to the cloud starts for everybody: knowing how to go into the cloud and adopt it at the core, rather than taking your existing practices and assuming they will work when you get there.
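One small way to make that groundwork "configurable and modularized" is to generate per-environment deployment configs from a single template instead of editing them by hand. The template, environment names, and settings below are hypothetical, just to illustrate the idea.

```python
from string import Template

# One template, many environments -- automating the repetitive
# groundwork instead of redoing it by hand for each deployment.
SERVICE_TEMPLATE = Template(
    "service=$name\nenvironment=$env\nreplicas=$replicas\n"
)

# Hypothetical per-environment settings.
ENVIRONMENTS = {
    "staging": {"replicas": 2},
    "production": {"replicas": 8},
}

def render_config(name, env):
    """Produce a deployable config for one service in one environment."""
    settings = ENVIRONMENTS[env]
    return SERVICE_TEMPLATE.substitute(
        name=name, env=env, replicas=settings["replicas"]
    )

cfg = render_config("ingest-api", "production")
```

Because every environment is rendered from the same template, a change is made once and propagates everywhere, which is where the "fewer mistakes" benefit comes from.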

Wayne: Then there’s the whole topic of hybrid cloud where you’ve got some things in the cloud and some on premises. That seems even more complex, but sometimes a necessary step.

Lenin: Yes, I totally agree. There is no perfect way of running everything in the cloud or everything on premises. If you are in a situation where you have some applications running in the cloud and you need that data to come from your on premises enterprise data warehouse, you have to make sure the data comes back. And most initiatives that are in the cloud are where the big data is.

So how do you move data between those highly scalable cloud applications and your on-premises systems? My experience is that you don’t want to bring raw data that is humongous in volume on premises. You want to figure out how to condense it. How do you summarize it? What is the purpose? Why do you want it? Because it’s cheaper to keep high-volume data in the cloud than on premises.
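The condense-before-you-move idea can be sketched as a simple aggregation: raw events stay in the cloud, and only a compact summary travels on premises. The event schema and field names below are hypothetical, for illustration only.

```python
from collections import Counter

# Hypothetical raw clickstream events; in the cloud these would
# number in the billions and stay there.
raw_events = [
    {"day": "2017-03-01", "page": "home"},
    {"day": "2017-03-01", "page": "home"},
    {"day": "2017-03-01", "page": "pricing"},
    {"day": "2017-03-02", "page": "home"},
]

def summarize(events):
    """Condense raw events into per-day, per-page view counts.

    The small summary, not the humongous raw data, is what gets
    shipped back on premises.
    """
    counts = Counter((e["day"], e["page"]) for e in events)
    return [
        {"day": day, "page": page, "views": n}
        for (day, page), n in sorted(counts.items())
    ]

summary = summarize(raw_events)  # 3 summary rows from 4 raw events
```

At real volumes the same shape of aggregation would run in the cloud (for example as a batch job) with only the output transferred, keeping on-premises storage and bandwidth costs down.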

Wayne: What do you see as the benefits and challenges of Hadoop? What would you recommend to people who are considering Hadoop?

Lenin: Unfortunately, like I said, it’s an evolution. It’s a journey. Like every other application and technology, it’s a tool. It’s a platform. It has evolved into a data enterprise platform hub with different services that are offered for different solutions that anyone can take advantage of.

It’s important to understand that working with data is a journey. And you also need to have an architectural understanding of Hadoop and understand why you’re using Hadoop.

It also depends on whether a company started with an engineering mindset or is a more traditional company. Maybe data is not the foundation for the products and services they have built over time. Or it is their foundation, but the data they get out of it has never been leveraged to make it valuable. Those are the companies that struggle. Mammoth companies, like insurance and financial institutions, whose products have evolved over years, will find it very difficult to collect data, bring it together, and understand its characteristics across the organization. That’s when the challenges start.

A broad strategy to bring all the data together at once will fail. Start small. Bring in one dataset and make sure you can organize it. Then bring in another dataset. In the end, data isn’t any different whether it’s small or big. It’s just the volume, the velocity at which it comes at you, and the variety of the datasets that make a lot of this complex. Other than that, data is still the same. You still need to know why you have it, what it is going to do for you, and how you are going to work with it. You still need the tools, the people, and the understanding. So organization of data is at the core, especially when working with Hadoop.

Wayne: I think the organization of the data is still something that requires a lot of thought and a lot of manual effort. I wonder if you could comment more on that.

Lenin: It wasn’t such an important thing early on because people were trying to figure out how to get the data together. It was more about, ‘Okay, I have data all over the place. I just need to bring it together,’ right? So the focus was just building things to bring it into Hadoop and just dump it there.

That was the strategy initially, because people did not understand that when you bring so much data together at once and you don’t know what the data is, what it can do for you, or how to organize it, it becomes a swamp. Where people got frustrated, it’s because there was no holistic strategy to bring the data together. Everybody wanted their own piece, because organizations were verticalized and their teams were distributed, disconnected, and disjointed. Strategies were all over the place and everybody was trying to go faster. This is a recipe for disaster no matter which product or which strategy you take. I can understand the frustration of the organizations that went through this and then figured out, ‘Oh, man, what kind of mess have I created!’

Wayne: How does an organization organize their data? We’re talking about modeling the data in Hive or something else?

Lenin: No. You can still bring the data together. There is a pace at which you need to deliver results, right? That’s why I was saying big data is a journey, and it is not something you do and it’s over and done.

It never ends, and that’s the key, because data is the essential element that keeps changing. It’s changing because of the demand you’re getting from the people who use the data. Data is the only thing that connects everything in an organization, and it’s a goldmine that is not pure. You have to ensure that you bring in raw data and compartmentalize it initially.

Wayne: Are you recommending a use case-driven approach to organizing data, not enterprise data model like we used to do?

Lenin: Yes. It’s tough to work with data because it requires organization, structure, and understanding of purpose. If you bring data in thinking that it will work itself out, it never will. The need for data goes from department to department. It molds itself to the usage of the department. So building an enterprise warehouse and then thinking it will magically work…it won’t.

So you have to make sure the business use case is important and know why you need to structure data for that use case. Because there are other factors you need to understand as well.

Wayne: If we’re designing for use cases, what is going to stitch all these use cases together to give us conformity, consistency, and enterprise views?

Lenin: It’s like the layering of a cake. You’re not going to build a cake with one ingredient. If you want to have your cake and eat it too, you have to layer it, and none of that comes without making sure you understand what you want. You can go with something simple at first, but once you understand how it works, you can start experimenting and building the layers.

It’s important culturally, too, because no one team is going to be successful on its own. It’s one thing to bring the data together and organize it. It’s another thing for the teams that will consume the data to understand how you’ve organized it. It’s impossible to say, ‘Oh, my team is just supposed to be managing data and putting it together.’ If you don’t produce a catalog that helps people understand what the data is and where it is, the people looking at the data are never going to touch it, because it’s so complex. It will be impossible for them to figure out what’s there. That’s why the principles that worked with smaller datasets are still relevant and important for big data.
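The catalog idea can be reduced to something very small: for each dataset, record where it lives, who owns it, and what it contains, so consumers can search for it instead of spelunking through the lake. The dataset names, locations, and owners below are made up, purely for illustration.

```python
# A minimal in-memory data catalog: each entry says where a dataset
# lives, who owns it, and what it contains.
catalog = {}

def register(name, location, owner, description):
    """Record one dataset so consumers can discover it."""
    catalog[name] = {
        "location": location,
        "owner": owner,
        "description": description,
    }

def search(keyword):
    """Return dataset names whose description mentions the keyword."""
    return sorted(
        name for name, meta in catalog.items()
        if keyword.lower() in meta["description"].lower()
    )

# Hypothetical entries for a raw layer and a curated layer.
register("clicks_raw", "s3://lake/raw/clicks/", "data-eng",
         "Raw clickstream events, partitioned by day")
register("clicks_daily", "s3://lake/curated/clicks_daily/", "analytics",
         "Daily clickstream summaries for reporting")

hits = search("clickstream")  # finds both layers
```

A real deployment would use a dedicated metadata service rather than a dictionary, but the cultural point is the same: the team that organizes the data publishes enough metadata that consumers can find and trust it.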

Wayne: I was talking to a client today that has small data, under a hundred gigabytes, and we talked about using OLAP as a relevant way for them to proceed, because OLAP provides a dimensional view of data that most people in the organization find intuitive. Of course, cubes don’t scale to the terabyte- and petabyte-sized environments that most companies have, but how do we get back to an environment where we’re giving users those dimensional views of data inside a big data environment?

Lenin: It’s a challenge when you’re working with data to always have a solution at hand. It’s important to know the volume, variety, and veracity at which the data moves and changes and the right approach to manage and maintain that data in the way that supports your business needs. Again, if you lose the opportunity to provide value to the organization, you’ve probably already lost the opportunity to make an impact. When that happens, the interest goes away and people lose their confidence.

It’s not an easy thing. There’s no perfect solution. It’s always based on the circumstances you’re making the decision in and the people and technology you have. I can give you tons of examples of how I made that type of choice when I was at ShareThis.

We’ve used MongoDB. We’ve used Cassandra, for its scalability and the distributed data center capabilities it has, for storing a certain type of dataset. We’ve used Hadoop for the different batch processing we needed to do. We’ve used MySQL and Redshift for reporting and data warehousing. You have to leverage the tool that fits your purpose. Eventually, there are other ways you can evolve from it.

I did replace a lot of these tools over time. That is the best part of being in a startup versus a company that is a mammoth bounded by so many different factors. I don’t envy people in either one. Each has its own risks and rewards, and I’ve been on both sides at Ubisoft and ShareThis.

Wayne: If you get out your crystal ball, what do you think is the next big thing that will transform the way we use data to make decisions?

Lenin: AI, IoT, cloud, and virtual reality devices that are going to be on your body or with you. It’s all about efficiency, automation, processing in real-time, and the ability to visualize and consume. These technologies will become transparent and invisible so you don’t even know that data is determining a decision for you.

And the reason automation and AI go hand-in-hand is that AI can automate repetitive, error-prone human tasks. We want to simplify those tasks and free humans to do better things, to extract value rather than process data.

There are a lot of tools already implementing AI and machine learning. AI is a broad umbrella that includes machine learning; there are so many components within AI. It’s not easy to become an AI expert. You have to know machine learning and deep learning. You have to work through a lot of elements before you can eventually reap the benefits of AI.

If you liked the podcast, please subscribe!

Henry H. Eckerson

Henry Eckerson covers business intelligence and analytics at Eckerson Group and has a keen interest in artificial intelligence, deep learning, predictive analytics, and cloud data warehousing. When not researching and...
