Some Thoughts on MPP, Cloud, and MPP in the Cloud

Historically, core decision-support and analytics workloads have resisted migration to the cloud. The reasons for this had to do, first, with the demanding performance requirements of typical decision-support and analytic workloads and, second, with the extent to which reports, queries, dashboards, etc. are embedded in core business processes at all levels of the enterprise. Thanks to the availability of creditable cloud MPP RDBMS services (Amazon’s Redshift, Microsoft’s Azure SQL Data Warehouse, Snowflake, Teradata IntelliCloud, etc.), as well as to an explosion in cloud business intelligence (BI), analytic, and machine learning services, the first of these reasons seems a lot less compelling than it used to be.

The upshot is that the performance, availability, and reliability issues that (individually and collectively) once militated against the movement of decision-support and analytics workloads to the cloud have, in critical ways, been ameliorated. There’s even a case to be made that a strategy which distributes decision-support and analytical workloads between and among cloud and on-premises environments permits an organization to optimize for performance, availability, reliability, security, and ease of use/access. This last leads to an overwhelming question, however: what kinds of workloads will run best in which context?

That’s the (inflation-adjusted) $64 billion question.

Given the importance of these workloads to day-to-day business decision-making, to say nothing of long-term business planning, it behooves an enterprise to make the right decision about where – or, more precisely, in which environment or context – to host which workloads.

From Ferrari to Fiat

Does the ready availability of (comparatively cheap) DWaaS MPP capacity make cloud a viable destination for most if not all decision-support and analytics workloads? After all, aren’t MPP databases ideally suited for the distributed cloud? Isn’t the distribution of resources in the cloud roughly analogous to the distribution of resources in an MPP cluster?

Not necessarily. To understand why this is the case, let’s try a thought experiment. Imagine two Formula One race cars, each of which has been designed with a specific environment (a purpose-built racing circuit), specific conditions (abundant resources), and specific performance characteristics in mind. On any given circuit, both race cars will behave predictably: one car might have superior handling characteristics on tight curves, the other might have better acceleration or a higher top speed on straightaways. Depending on the specific conditions of this or that track, one car might have an advantage relative to the other. But because they’re both operating in the type of environment for which they were originally designed, their performance (mutatis mutandis) will be predictable. Now imagine driving either car to work on your daily commute. This might be doable, but is it practical? More importantly, is it scalable? Neither car was designed with the constraints, conditions, or performance characteristics of the commuting workload in mind. Would you really want to drive a Formula One race car to work, day in and day out, over a protracted period of time? Probably not.

Most MPP databases were designed (prior to the Age of Cloud) with an ideal context or environment in mind. The engineers who built them attempted to optimize for specific constraints, conditions, and performance characteristics. In some cases, Database A might indeed be better suited for certain types of workloads – think of this as analogous to superior handling on hairpin curves, or greater speed on straightaways – than Database B. It’s even conceivable that Database A might be faster in all conditions than Database B – or any other MPP platform – in the ideal environment for which it was designed. Like a Formula One race car, however, Database A will perform sub-optimally if it’s used outside of this environment, with its definite constraints, conditions, and performance characteristics. In other words, just because a database is designed for large-scale parallel operation doesn’t mean it will automatically scale – predictably, reliably, and cost-effectively – in the cloud: not without significant modification, at least. This is not to say MPP doesn’t scale in the cloud. It is to say that you can’t just take any old MPP database and – voila! – pronounce it cloud-ready.

Performance, Elasticity, and MPP

The good news is that none of the most prominent cloud MPP players has done anything like this. At this point, in fact, most of the available MPP DWaaS offerings are more or less adept at demonstrating the elasticity that is the defining property of cloud. It’s by virtue of this elasticity that subscribers are able to realize substantial cloud-related cost savings. An MPP DWaaS is “elastic” to the degree that it permits subscribers to add or subtract capacity, either by adjusting virtual compute and/or storage resources on a per-node basis (so-called scale up/down) or by adding or subtracting nodes to increase or decrease the size of the parallel-processing cluster (scale out/in).

Another dimension of elasticity has to do with the ability to start, stop, pause, or resume resources independently of one another. This is what is meant by “decoupling” resources. Thanks to resource decoupling, subscribers can (for example) add or subtract compute resources without also increasing or decreasing the amount of available storage, or vice-versa. (In some MPP DWaaS offerings, subscribers must increase compute and storage in lock-step with one another.) This same flexibility permits a customer to (e.g.) pause or shut down unused compute resources for certain periods of time, such as at night or on weekends.

Still another dimension of elasticity has to do with how rapidly all of this (scale up/down, scale out/in, starting, stopping, pausing, and resuming resources) can take place. Scale out/in is particularly problematic, especially as regards adding (or subtracting) storage: resizing an MPP cluster – or, more precisely, redistributing data across nodes in an MPP cluster to optimize for parallel performance – can be a time-consuming task. This is because most MPP databases must revert to read-only mode during resize operations. Depending on the volume of data being redistributed, the data warehouse could be in read-only mode for hours or even days.

In other words, an MPP DWaaS is only so elastic.
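As a concrete (and deliberately simplified) illustration, these dimensions of elasticity can be modeled in a few lines of Python. This is a toy model, not any vendor’s API; the class, field, and method names are all hypothetical.

```python
# Toy model of a "decoupled" elastic cluster: compute and storage can be
# resized -- or compute paused -- independently of one another.
from dataclasses import dataclass

@dataclass
class ElasticCluster:
    nodes: int            # scale out/in: number of nodes in the cluster
    vcpus_per_node: int   # scale up/down: per-node compute
    storage_tb: float     # decoupled storage pool, independent of node count
    paused: bool = False

    def scale_out(self, extra_nodes: int) -> None:
        """Add nodes; in a real MPP system, data must then be redistributed."""
        self.nodes += extra_nodes

    def resize_storage(self, new_tb: float) -> None:
        """Because storage is decoupled, this requires no change to compute."""
        self.storage_tb = new_tb

    def pause_compute(self) -> None:
        """Stop paying for compute (say, nights or weekends); storage persists."""
        self.paused = True

cluster = ElasticCluster(nodes=4, vcpus_per_node=8, storage_tb=10.0)
cluster.resize_storage(25.0)   # storage grows, compute untouched
cluster.pause_compute()        # compute stops, data stays put
print(cluster.nodes, cluster.storage_tb, cluster.paused)  # 4 25.0 True
```

In a lock-step offering, by contrast, `resize_storage` would force a corresponding change to `nodes` or `vcpus_per_node`; the decoupled model is what lets the subscriber pay only for what each resource actually needs.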

In the same way, an MPP DWaaS can be only so performant. This has to do, first, with the nuts and bolts of MPP itself and, second, with the challenge of scaling MPP in the cloud.

In the MPP model, be it on-premises or in the cloud, the challenge is to parallelize a workload over many nodes. This is relatively easy to do – it’s a matter of decomposing the workload into many small operations or tasks – assuming there are no dependencies between those tasks. Thus the high-level challenge is to manage (and maintain data integrity with respect to) the dependencies that obtain between and among parallelized tasks. If this is a hard problem in a conventional (on-premises) context – and it is – it’s exponentially more difficult in the cloud, where it’s exacerbated by the vicissitudes of cloud itself. In other words, problems that arise out of (i.e., are byproducts of) the very features that make cloud, you know, cloud – the virtualization, decoupling, and distribution of resources – work to exacerbate the core challenge of managing parallel dependencies. In the first place, virtualized compute, storage, and network resources are slower (particularly with respect to I/O performance) than their physically instantiated kith. Second, the decoupling of resource dependencies – such that compute, storage, and network resources can be managed independently of one another – imposes additional performance penalties, especially with respect to increased latency. Finally, cloud resources aren’t just decoupled from one another, they’re distributed, too: decoupled compute and storage resources don’t necessarily have to reside on the same racks or on racks in the same buildings, let alone in buildings on the same data center campuses. The sum total of all of these challenges – features, not bugs, of the cloud model – degrades the performance, availability, and reliability of the DWaaS, be it a conventional or an MPP design.

So that’s one challenge. Once you dig into the nitty-gritty of MPP, you’re confronted with another challenge – that of computational scale. MPP uses algorithms to (a) break a workload down into x constitutive jobs and (b) distribute these jobs for processing across y nodes in a cluster while at the same time (c) identifying and accounting for all requisite compute dependencies. Thanks to (c), the algorithm can’t just start a batch of concurrent jobs and kick off new operations once this or that job finishes. It must, instead, (d) sequence all operations into a computational pipeline. This last, (d), is what turns a hard problem into a nasty problem.
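Steps (a) through (d) can be sketched with a dependency graph and a topological sort. The job names and dependency graph below are hypothetical, and real MPP optimizers are far more elaborate, but the skeleton is the same: jobs whose dependencies are satisfied run concurrently in “waves,” while dependent jobs must wait their turn in the pipeline.

```python
# Sketch of (a)-(d): decompose a workload into jobs with dependencies,
# then sequence them into a pipeline of concurrently runnable waves.
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# (a)/(c): constitutive jobs, each mapped to its prerequisite jobs
deps = {
    "scan_orders":    set(),
    "scan_customers": set(),
    "join":           {"scan_orders", "scan_customers"},
    "aggregate":      {"join"},
}

# (b)/(d): sequence into a pipeline; each wave could be fanned out across nodes
ts = TopologicalSorter(deps)
ts.prepare()
pipeline = []
while ts.is_active():
    wave = sorted(ts.get_ready())  # all jobs whose dependencies are satisfied
    pipeline.append(wave)
    ts.done(*wave)                 # mark the wave finished, unlocking successors

print(pipeline)
# [['scan_customers', 'scan_orders'], ['join'], ['aggregate']]
```

The nastiness the author describes shows up when the graph has thousands of jobs, the waves are uneven, and a single straggler node holds up every downstream wave.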

A third challenge is that of concurrency. The examples above posit a single job that’s distributed across all pertinent nodes in a cluster. But what happens when you put two people on an MPP cluster? Or just a single person running multiple concurrent jobs? Or dozens of people running multiple concurrent jobs? Or dozens of people running multiple concurrent jobs, some of which (people and/or jobs) are more important than others? Unless these jobs are somehow self-aware and, well, deferential, each is going to expect to be able to use all of the cluster’s available resources. An MPP database solves this problem by itself managing the scheduling, execution, and prioritization of workloads. (In the argot of MPP, this is called “workload management.”) Stop me if you’ve heard this before, but workload management is a Very Hard Problem. And that’s the rub: Most cloud MPP services are relatively new, and it takes years – decades, even – to develop robust workload management capabilities.
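A toy sketch of the idea, assuming a simple priority queue and a fixed concurrency cap (both hypothetical; production workload managers are vastly more sophisticated, juggling admission control, resource quotas, and preemption):

```python
# Toy workload manager: admit queued jobs by priority, capped by a
# concurrency limit, so no single job assumes the whole cluster.
import heapq

class WorkloadManager:
    def __init__(self, max_concurrent: int):
        self.max_concurrent = max_concurrent
        self.running: set[str] = set()
        self.queue: list[tuple[int, str]] = []  # (priority, job); lower runs sooner

    def submit(self, job: str, priority: int) -> None:
        heapq.heappush(self.queue, (priority, job))
        self._admit()

    def finish(self, job: str) -> None:
        self.running.discard(job)
        self._admit()  # a freed slot lets the highest-priority waiter in

    def _admit(self) -> None:
        while self.queue and len(self.running) < self.max_concurrent:
            _, job = heapq.heappop(self.queue)
            self.running.add(job)

wm = WorkloadManager(max_concurrent=2)
wm.submit("ceo_dashboard", priority=0)  # high priority, admitted at once
wm.submit("nightly_batch", priority=9)  # low priority, but a slot is free
wm.submit("ad_hoc_query", priority=5)   # cluster full: waits in the queue
print(sorted(wm.running))   # ['ceo_dashboard', 'nightly_batch']
wm.finish("nightly_batch")
print(sorted(wm.running))   # ['ad_hoc_query', 'ceo_dashboard']
```

Even this toy exposes the hard questions: should a high-priority arrival preempt a running low-priority job? How are memory and I/O, not just slots, apportioned? Answering those well is what takes vendors years.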

Practical Considerations

Again, this isn’t to say that relocating decision support and analytical workloads from an on-premises to a cloud environment is impossible. It isn’t even to say that it’s especially difficult.

It is to say that it’s important to be alert to the strengths and weaknesses, the advantages and disadvantages, of the cloud. Understand and accept that you can’t meaningfully compare physical capacity with cloud capacity: that not only will on-premises workloads require more compute capacity to scale in the cloud – and that, moreover, there’s no simple rule of thumb for this – but that it might be impossible to scale some on-premises workloads in the cloud.

Understand and accept that workloads or use cases with certain definite characteristics (e.g., high levels of concurrency, a large number of simultaneous connections, highly granular workload management requirements) will continue to defy relocation to cloud.

Understand and accept, too, that core decision-support systems which have, over time, become interpenetrated with the business processes they’re supposed to support will be especially difficult to relocate to the cloud. Not impossible: difficult. These workloads power the applications, portals, and services, the reports, dashboards, scorecards, and other analytical tools that are used by different people in different roles at different times across any number of different business processes. They’re the types of workloads that try [wo]men’s souls – exhausting the patience of both the line-of-business and IT. It’s impossible even to isolate them as “workloads” or “use cases,” so bound up are they with business processes: moving them to cloud means moving the entire process to cloud. Over the years, I’ve spoken with IT and business people who’ve despaired of ever fixing some of these processes, so obscure or poorly understood are their enabling IT antecedents[i]. (This is a function of several factors, including: poor or non-existent documentation; poorly understood legacy systems, interfaces; the accumulation of maintenance-related kludges and cruft, and so on.) Cloud could be a good, even ideal, destination for some of these processes. Getting them there, however, will take time, money, and (that most priceless of qualities) perseverance.

As one long-time Teradata user (a DBA with a large American media conglomerate) told me, decision-support and analytical workloads have a way of self-sorting such that certain workloads/use cases eventually come to be seen as obvious candidates for the cloud – or, more precisely, for the private as distinct from the public cloud – and others come to be seen as non-negotiable candidates for conventional on-premises deployments.

The upshot, this DBA told me, is that his company uses MPP DWaaSs from both Teradata and Snowflake to complement its on-premises Teradata systems. “Initially, the loads [that are now] on Snowflake were on [on-premises] Teradata, but the cost factor and the scale-up factor just made them too expensive [to run in that context],” he explained.

His company uses Snowflake primarily to ingest, process, and analyze data from other cloud services, especially Google Analytics. Its experience provides a fascinating demonstration of the incredibly rapid rate of innovation – and evolution – in the MPP DWaaS space. Until this year, after all, Teradata IntelliCloud DWaaS was not “elastic” in the most meaningful sense of the term: because of resource dependencies, it was not possible to start, stop, pause, or resume instances. Consequently, the decision to deploy on Snowflake’s DWaaS to support the Google Analytics use case was dictated by circumstance, said this DBA, who noted that his company does use Teradata Cloud[ii] to provide a disaster recovery (DR) option for its on-premises Teradata deployment: “We went to Snowflake because we only had to pay for compute when we used it. We could spin up [compute] when we wanted to process data and turn it off when we weren’t using it. We also could bring [this processed data] to a small data warehouse so that our analysts could go query that. Finally, we also had the option of bringing summarized results back to Teradata and building [production] reporting on top of that.”

Another neat thing about this use case is that the media conglomerate could actually move the data processing it does in Snowflake to another MPP DWaaS – e.g., Teradata IntelliCloud, Azure SQL Data Warehouse, Amazon Redshift – with little difficulty. It could also shift this data processing to a DWaaS running in a private cloud, be it a virtual private cloud, an on-premises private cloud, or a managed private cloud. (This last lives either on-premises or in a colocation facility. Teradata, for example, has a managed private cloud option for IntelliCloud.)

Hybridity’s the Thing 

The company’s use of the DWaaS is, then, consistent with a hybrid multi-cloud strategy.

Hybridity’s the thing. For the present and for the foreseeable future, the best and most pragmatic cloud strategy will be a hybrid one that mixes public and private clouds – and which also reserves a place for conventional, physically instantiated, on-premises resources. Private cloud is not a new idea, but enterprises seem to have rediscovered it with a vengeance. (This might be a function of increasing concern about so-called cloud service provider “lock-in.”[iii]) Through 2017 and the first half of 2018, for example, most North American businesses either kick-started new private cloud deployments or expanded existing private cloud investments. Forrester analyst Lauren Nelson reports that “90 percent of [survey] respondents say they’re developing a comprehensive cloud strategy over the next 12 months, while 81 percent specifically note that they are implementing, have implemented, or are expanding private cloud.” 451 Group also found a significant shift to private cloud in a recent report. In the last 18 months, in fact, we’ve seen the introduction of commodity technologies – viz., Microsoft’s Azure Stack and VMware’s VMware Cloud on AWS offerings – that are designed to run in both the IaaS public cloud (on Azure and AWS, respectively) and in the enterprise private cloud, wherever it lives. As the kids, your kids, like to say: hybridity is, well, lit.

There’s something else, too. It isn’t just that enterprises are turning to private cloud, or that this trend, taken in tandem with ongoing (and undiminished) enterprise investment in public cloud, has resulted in an increase in (de facto) hybrid cloud deployments, it’s that the logic of cloud – i.e., the uptake of cloud-like technologies, concepts, and practices – is transforming the enterprise. Cloud-driven technological innovation is blurring the boundaries between discrete “cloud” and “on-premises” contexts; in the future, for example, almost all “on-premises” workloads will run as virtualized operating system, application, storage, etc. instances hosted on high-density hardware configurations. (I haven’t talked about container virtualization in the context of data management or decision support and analytical workloads. That is not an oversight.[iv]) Like the cloud resources of today, tomorrow’s on-premises “clouds” will be highly dependent on automated provisioning, management, and monitoring software.

The upshot is that the clear boundaries which today demarcate the public cloud from the private cloud from conventional on-premises systems will be obscured. Cloud will be a pervasive metaphor – so pervasive, in fact, that the term itself could conceivably fall out of use.

The best thing an enterprise can do to position itself for the future is to understand and accept the limitations (the strengths and weaknesses, the advantages and disadvantages) of cloud itself, along with those of the different cloud contexts – viz., public, private, VPC – and of conventional on-premises IT resources. At this point, each of these deployment options still has its place, even if, over time, an increasing number of the decision support and analytical workloads that are today hosted on conventional, on-premises IT kit will move to a cloud environment of some kind. This cloud environment doesn’t necessarily have to be in the public cloud, however. More important, it doesn’t necessarily have to be just one public cloud service, or, for that matter, a single cloud context. Savvy enterprises will distribute workloads across multiple cloud service providers and across multiple cloud contexts, with some workloads also staying put (on physically instantiated hardware resources) in local data centers.

[i] I once interviewed an IT executive with a well-known North American rail transport company. This company had tried for about a decade to rehome about 60 relict reports that were hosted on a costly legacy system. The company had financed at least two separate migration projects, had spent hundreds of thousands of dollars on software and services, and had naught to show for it. At one point, it despaired of ever shutting down the legacy system. Success came via a data warehouse automation tool from WhereScape. “Success” in this context meant the bottom-up process of exploring, discovering, and remapping the source systems, database tables, columns, rows, etc. used in the reports. It’s an extreme example, but illustrative nonetheless. The task of relocating core decision-support workloads to cloud can be daunting. A lot of this stuff won’t move overnight. 

[ii] A predecessor service to IntelliCloud.

[iii] Mounting concern among customers about cloud service provider lock-in is reflected in other recent surveys, too. See, among others: Eric Newmark, Larry Carvalho, Kimberly Knickle, et al., IDC FutureScape: Worldwide Cloud 2018 Predictions (Framingham: International Data Corp. [IDC], November 2017); or, for an example from 2016, Adam Ronthal and Donald Feinberg, Separating Cloud Resources for Data Management Increases Flexibility and Helps Prevent Lock-In (Stamford: Gartner Inc., August 2016).

[iv] There are a host of reasons why Containers + Data Management = a Very Hard Problem. Actually, it’s more akin to a Wicked Problem. If there’s interest, I’d be happy to unpack this, uh, problematicity in a follow-up post.

Stephen Swoyer
