Data Architect's Guide to Containers: How, When, and Why to Use Containers with Analytics
This is the first in a multi-part series on container virtualization, otherwise known as operating system-level virtualization. This series amounts to a deep dive on containers. It will explore what containers are and how they work; their advantages and disadvantages; their pitfalls and pratfalls; and, not least, how they’re different from other modes of virtualization. Given the perspective of this blog, however, I’m especially concerned to explore how—or, more precisely, for which purposes or use cases—containers can be used with data and analytical workloads.
i. Introduction: One of these things is not like the other
One obvious difference between virtual machines (VMs) and containers has to do with size: VM images tend to be much larger than container images. Think of a VM image as analogous to the system hard drive (e.g., “C:\”) on a Windows computer. In fact, for all intents and purposes, a VM image basically is a replica of a Windows, Linux, or other operating system (OS). In practice, a running instance of a VM consists of an OS + a virtualized hardware environment. The VM is, then, a virtual instance of a computer that runs inside a non-virtualized computer: a kind of binary digital homunculus. The problem is that most modern OSs require 1 GB or more of storage at a minimum. The size of a 64-bit Windows 10 VM image is roughly 20 GB without any application software. Moreover, all modern operating systems (and Windows OSs, especially) include an assortment of programs, libraries, and services that are not strictly necessary for (or used by) all applications. To sum up, there is a lot of clutter, clunk, and redundancy—cruft, in so many words—in the hardware virtualization model.
Another difference between VMs and containers has to do with abstraction. Containers are less “abstracted” than VMs, which, in practice, tends to give them a performance advantage.
Analogically, a running VM is a lot like a Matryoshka doll, i.e., a computer that nests inside another computer. Since virtually all modern computers are controlled by an OS, a running VM could be thought of as an OS that nests inside another OS. This other (“host”) OS can be a conventional computer system (Windows, Linux, xBSD) or a computer system that is designed specifically for hosting multiple VMs at scale, called a hypervisor. When you break it down into its constitutive parts, an OS consists of a kernel + device drivers + programs that run outside of kernel space. The kernel controls execution, scheduling, and I/O for all programs, services, routines, etc. It provides a standard set of APIs that programs can exploit to access system resources and services. It standardizes other critical functions, too. But the kernel by definition is a kind of tyrant: it wants and needs to be the boss of the environment in which it lives. A complicating factor is that the host context in which the VM runs has its own kernel, too. Only one kernel can be the boss, however, and (for several reasons), we’d prefer that kernel to be the host, not the VM guest. The genius of hardware virtualization is that the host—or software running on the host—tricks the guest VM into thinking that it is the one that is in charge. And it is, so to speak. As far as the guest VM is concerned, its OS kernel has exclusive access to its own (virtualized) hardware environment. A virtual machine monitor mediates between the guest VM and the host, translating the former’s attempts to control its virtual hardware environment into system calls that it passes on to the host. This latter controls access to key physical resources, such as processors, memory, physical disks, and network interfaces.
The problem is that a program’s performance is affected by its access to the kernel. If it has the kernel’s ear—if the kernel gives first priority to its messages—it will receive privileged access to system resources. In most cases, its performance will benefit accordingly. However, an application running in a VM is twice removed from access to system resources: it communicates first with the VM’s OS kernel, which (via the virtual machine monitor) communicates with the boss (host) kernel. Only then is it granted access to resources.
A container, by contrast, is comparatively unconstrained. A running instance of a container doesn’t have its own kernel and doesn’t access (or run in the context of) a virtualized hardware environment. Instead, it interacts with the host OS’ kernel, which (in typical use cases) vouchsafes it access to non-virtualized system resources. In general, an application virtualized in a container will achieve I/O performance superior to that of an app in a VM.
ii. Containers as a force for disintermediation
In this and other respects, container virtualization is a force for disintermediation. Think of a container as analogous to a video game ROM cartridge. Packaged in a stand-alone standardized form-factor, the ROM cartridge should install and run on any supported platform. It isn’t just that the ROM cartridge is designed for reuse, it’s that it is designed to provide exactly the same service experience in reuse: Cycle the power to the console and the ROM is reborn, the binary digital equivalent of Stanley Kubrick’s—or Friedrich Nietzsche’s—star child: innocence and forgetting, a new beginning. You know, Phil Connors in Groundhog Day.
Nor is that all. A ROM cartridge is self-contained in that it includes all of the program code it needs to run on its host platform, no more, no less. Metaphorically speaking, the ROM could be thought of as a kind of prototype for the container. A container, like a ROM cartridge, is designed for portability and reliability, especially in reuse. A container, like a ROM cartridge, is packaged in a standardized, standalone format: viz., an executable package, in which is rolled up all of the program code (OS and program binaries, scripts, libraries, configuration files, etc.) it needs to run. A container, like a ROM cartridge, is newly born, its contents unchanged, at run time. This analogy isn’t perfect, however. The ROM cartridge is a persistent object; the container is not: it has no existence independent of its execution. In most cases, for reasons I explain below, containers aren’t reused, but, instead, are freshly generated at run time.
So why are we suddenly hearing so much about containers? In the first instance, they truly do constitute a lightweight alternative to VMs, which, at a minimum, require a basic operating system environment (with its retinue of programs and services) as a substrate. In the second instance, containers are more performant and (theoretically) more secure than VMs, too.
From the perspective of the analyst or data scientist, containers are valuable for a number of reasons. For one thing, container virtualization has the potential to substantively transform the means by which data is created, exchanged, and consumed in self-service discovery, data science, and other practices. The container model permits an analyst to share not only the results of her analysis, but the data, transformations, models, etc. she used to produce it. Should the analyst wish to share this work with her colleagues, she could, within certain limits, encapsulate what she’s done in a container. In addition to this, containers claim to confer several other distinct advantages—not least of which is a consonance with DataOps, DevOps and similar continuous software delivery practices—that I will explore in this series.
To get a sense of what is different and valuable about containers, let’s look more closely at some of the other differences between containers, VMs, and related modes of virtualization.
iii. Virtual machines, hypervisors, and proto-containers
VMs are managed by a hypervisor, also called a virtual machine monitor, which does two things: first, it provides an abstraction layer between the VM itself and the physical hardware (a Type-1 hypervisor) or the host operating system (a Type-2 hypervisor) on which the VM and the hypervisor must run. Second, the hypervisor is the means of controlling (starting, stopping, pausing, etc.) and monitoring VMs. Most hypervisors emulate a basic x86 (32- or 64-bit) instruction set architecture (ISA), complete with virtual I/O chipsets, virtual video adapters, and, of course, virtual access to the host’s microprocessors. But VMs can emulate non-x86 ISAs, too. The hosted QEMU hypervisor, for example, is capable of emulating x86 and x86-64 CPUs, as well as the MIPS, Alpha, SPARC, and other ISAs. If you’re one of those technology pack-rats who still has a copy of Windows NT 4.0 lying around—and if you have too much time on your hands—install the MIPS version of NT 4.0 in QEMU. It will work.
If all of this sounds a little, or a lot, heavy-duty, it is. In this scheme, virtualizing a Jenkins automation server basically entails creating a computer within a computer: a Linux VM managed by a hypervisor (or by a hypervisor running “inside” an operating system host) that consumes a proportion of available compute (e.g., one or two virtual processors), memory (512 MB or more of virtual memory), storage (2 GB or more of virtual storage), and network resources. This invites a question: Isn’t there some way to do this without resorting to crufty and resource-intensive hardware-level virtualization? One solution, often employed by savvy Unix sysops, is to construct a so-called chroot “jail.” A sysop might create a chroot to “jail” a program such as a multi-user text game in a context that provides some degree of isolation vis-à-vis the host’s other user space programs. Chroots have their problems, which is one of the reasons we don’t use them to implement OS-level virtualization at scale. Conceptually, however, the classic chroot jail shares a few notable similarities with the “virtual” container.
For example, a chroot consists solely of the files and folders that are needed to run a program, process, etc. In practice, running a program in a chroot involves replicating both the files and file system structure it requires in a subdirectory nested somewhere inside the host OS’ file system. As far as the chrooted program is concerned, this subdirectory becomes the root directory of its own file system; the chroot even runs its own versions of Linux userland programs. This isolates it from the rest of the programs, processes, etc. running on the host. The idea is that in the event of, e.g., compromise, the user who “owns” the chrooted programs (whatever her intentions, good or bad) cannot escape the context of the chroot’s “jail.”
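To make this concrete, here is a minimal sketch (Python, standard library only) of what “replicating the files and file system structure” looks like. The directory layout and the choice of sh are illustrative; a real jail would also need the program’s shared libraries, and actually entering the jail requires root privileges:

```python
import os
import shutil
import tempfile

def build_chroot_tree(root):
    """Replicate a minimal file system layout inside `root`."""
    for sub in ("bin", "lib", "etc", "home/jailed"):
        os.makedirs(os.path.join(root, sub), exist_ok=True)
    # Copy in the one program the jail is meant to run. A real jail
    # would also need that program's shared libraries under lib/.
    sh = shutil.which("sh")
    if sh:
        shutil.copy2(sh, os.path.join(root, "bin", "sh"))

jail = tempfile.mkdtemp(prefix="jail-")
build_chroot_tree(jail)

# With root privileges, the process could now call os.chroot(jail) and
# os.chdir("/"), after which `jail` becomes "/" as far as the jailed
# program is concerned: it cannot see anything above it.
```

Everything the jailed program can reach lives under that one subdirectory; that is the whole trick, and it is the same trick, conceptually, that a container image plays.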
iv. Containers and container virtualization: Isolation by any other name…
The chief advantage of the chroot is isolation. Container virtualization uses a similar technique to achieve isolation: like a chroot, a container consists of all of the executables (binaries and scripts), supporting libraries, configuration files, etc. that a program, process, etc. needs to run.
Beyond this conceptual similarity, chroots and containers are wholly different beasts, however.
A chroot is implemented at the level of a subdirectory on the host file system; Docker, Linux Containers (LXC), and other schemes use an executable package—the container itself, which is (usually) freshly generated at run time—to virtualize a program. A manager daemon (e.g., Docker Engine) provides a set of functions analogous to those of a hypervisor. In this way, proponents claim that containers achieve simplified security, networking, and maintenance. Also important is the fact that each container runs in its own separate kernel namespace. This permits the container and its processes to be isolated from other containers, as well as (many of) the programs/processes running on the host itself. A program or process that is isolated from other programs/processes cannot “see” and does not “know” of their existence. If it cannot see and does not know of them, neither can an attacker, in the event of compromise.
Unlike a VM image, the ideal container does not have an existence independent of its execution. It is, rather, quintessentially disposable in the sense that it is generated at run time from two or more layers, each of which is instantiated in an image. Conceptually, these “layers” could be thought of as analogous to, e.g., Photoshop layers: by superimposing a myriad of layers, one on top of the other, an artist or designer can create a rich final image. This isn’t the only (or primary) reason she uses layering, however. In container virtualization as in Photoshop, layering promotes portability, reusability, and convenience. In Docker, for example, layering permits granular versioning, such that each change is preserved automatically in a new “intermediate” layer. As with a version control system, the developer can revert to a prior layer (i.e., version) if something she does should break an otherwise functional configuration. In this way, the developer can consolidate multiple programs or processes into a single image, layering each intermediate image—e.g., Jenkins, Kubernetes, Git, Redis, and other essentials—on top of a “base” image. This last typically replicates the file system hierarchy (with all necessary files) of a host operating system, such as any of several flavors of Linux or, by way of a compatibility layer such as WINE, a Windows-like environment. At run time, the Docker Engine generates a fresh container from scratch.
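The layering idea can be sketched in a few lines of Python. This is a loose analogy, not Docker’s actual union file system: each “layer” is just a dictionary mapping file paths to contents, and generating the container amounts to flattening the stack, with upper layers shadowing lower ones (all names and paths below are made up):

```python
from collections import ChainMap

# Each "layer" maps file paths to contents (all names here are made up).
base_os = {"/etc/os-release": "debian", "/bin/sh": "<shell binary>"}
runtime = {"/usr/bin/python3": "<python interpreter>"}
app = {"/app/server.py": "print('hello')", "/etc/os-release": "debian-patched"}

# Generating the container flattens the stack into a single view;
# lookups hit the uppermost layer first, so the app layer's copy of
# /etc/os-release shadows the base image's copy.
container_fs = ChainMap(app, runtime, base_os)
```

Docker’s copy-on-write storage drivers (overlay2 and friends) are far more involved than this, but the shadowing behavior is the same in spirit: rebuilding only the top layer leaves the layers beneath it untouched and shareable between images.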
What is the difference between (a) freshly generating a container at run time and (b) restarting an already-created VM image? As it happens, this gets at the crux of the issue: it has to do with what is unique, valuable, and different about containers and container virtualization—namely, that they’re disposable by design. The simple explanation is that, for most production use cases, containers are created only to be destroyed. That this is the case is (mostly) uncontroversial. Why this is the case is the subject of this series.
v. Overview of research
To summarize, I find that:
Containers are a lightweight alternative to hardware virtualization. Containers require fewer pre-allocated system resources—in the form of processor, memory, storage, and network capacity—than VMs. There is another important difference, too. A container consists solely of the executable files (binaries and scripts), system and program libraries, configuration files, and userland tools an application or process needs to run. A VM, by contrast, requires that the user install a complete operating system—Linux, xBSD, Windows, etc.—to run application software.
Containers promise convenience, portability, security, and ease of maintenance. But so do VMs. To paraphrase philosopher Alasdair MacIntyre: Whose convenience? Which kind of portability? At the very least, this frame invites the obvious question: are containers more portable than VMs? Yes, in the sense that they are, in general, smaller. Does container virtualization provide greater isolation against portability-breaking dependencies than hardware virtualization? That is a more fraught question. Are containers more secure than VMs? In theory, yes, although this, too, is an issue of some contention. As for the promise of simplified maintenance, the answer seems to be: it depends. Proponents can find plenty to like in containers; conversely, skeptics can find no shortage of stuff to dislike. It might be that the set of useful applications for containers, like that for VMs—or for non-virtualized applications—is bounded.
Container virtualization is more performant than hardware virtualization. Unlike VMs, containers do not abstract access to system hardware. Docker, for example, uses a management engine—Docker Daemon—to spawn, respawn, stop, pause, modify, etc. containers. Docker Daemon is controlled by a command-line interface (CLI) tool, docker, which has dozens of different child commands. Once it is running, however, a container’s access to the host OS kernel is (relative to a VM) unfettered. This gives containers a performance edge vis-à-vis VMs; in most cases, in fact, they are able to achieve I/O performance on par with local (non-containerized) applications.
These and other advantages are offset by real-world trade-offs, however. Container virtualization emphasizes portability and manageability at the expense of persistence. This is by design. Containers are, in a special sense, disposable: in fact, for most production use cases, a best practice is to generate new containers as needed at run time. In other words, containers are always exactly the same and always behave exactly the same way because they’re always generated, newly born, at run time. This is a feature, not a bug, of the container model. It’s why container nomenclature distinguishes between “images” and “containers.” In Docker, two or more image “layers” are laminated together to create an executable package. This is the container. It can be configured to run almost any combination of applications and/or processes. An Apache container consists of a combination of a base image (typically, a core OS, such as Debian or RHEL) and one or more image layers, such as Apache, along with, if desired, MySQL and PHP, as well as any requisite executables, libraries, config files, etc. At run time, the Docker Daemon generates and runs a fresh container from these image layers. There are several very good reasons for this, as I explain below.
How, or where, is data persisted in this model? Not at the level of the container. If a container writes data to disk, it does so to a virtual file system that “lives” inside the container itself. This file system is its innate means of persistence. Its contents do not typically persist once the container terminates, crashes, etc.; nor are they recreated anew (with the container) at run time. Another container-specific feature, isolation, is also achieved at some cost to persistence. Again, this is by design. The application or process that lives in the container never writes data directly to the host file system. It is isolated to the degree that it does not share storage with any other container, or with the local applications, processes, daemons, etc. that live on the host operating system.
The crux of the issue is that the concept of in-container persistence is at odds with good container philosophy. Typically, organizations don’t reuse containers; in practice, in fact, it is common to invoke a container with the “--rm” (remove) flag. This instructs the management engine to automatically delete the container once it exits. Why is this common? Because organizations want their containers to behave exactly the same way each and every time they run. So instead of using automation software to (re)spawn, stop, or pause an existing copy of a container, they have it create a new container generated from pre-fab images. Other reasons for this have to do with security, stability, and, not least, practicality: e.g., over time, multiple copies of the same container—each with its own data—will accrue and fill up available disk space. It is easier to create a container from scratch and destroy it than to manage its lifecycle. Organizations are busy enough with lifecycle management as it is.
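To see why “create, run, destroy, repeat” beats lifecycle management, consider a toy model of the image/container distinction. The class names are hypothetical, and this sketches the behavior, not Docker’s implementation: the image is an immutable template, every run spawns a pristine instance, and deleting the instance (the --rm analogue) costs nothing, because an identical one can always be generated from the image.

```python
import copy

class Image:
    """An immutable template, loosely analogous to a Docker image."""
    def __init__(self, name, files):
        self.name = name
        self._files = files

    def run(self):
        # Every run yields a brand-new container with a pristine copy
        # of the image's contents; instances are never reused.
        return Container(copy.deepcopy(self._files))

class Container:
    """A disposable running instance; its state dies with it."""
    def __init__(self, files):
        self.files = files

    def write(self, path, data):
        self.files[path] = data  # gone once the container is discarded

image = Image("etl-job", {"/app/job.py": "<script>"})

first = image.run()
first.write("/tmp/scratch.csv", "intermediate rows")
del first  # the --rm analogue: discard the instance, state and all

second = image.run()  # no memory of first's scratch file
```

Because the second run starts from the same pristine image, it behaves identically to the first, which is exactly the property organizations are buying when they throw containers away.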
How containers handle persistence is problematic, but not a deal-breaker. There are a number of ways to manage persistence for containers. All entail trade-offs, particularly with respect to portability and isolation. To configure a container for persistence is to complicate the task of deploying and securing it. It is likewise to introduce complexity that can compromise availability, if not performance.
The immaturity of containers is problematic, but not a deal-breaker. Linux Containers (LXC) debuted a decade ago; Docker, which originally built on LXC, followed in 2013. An analogous concept—the chroot jail—is older still: the chroot mechanism dates back to the late 1970s, and chroot “jails” were a common hardening technique by the 1990s.
Container virtualization is far from an exact science. While the I/O performance of containers is, on the whole, superior to that of VMs, container availability, by general consensus, is not. At present, we lack a peer-reviewed empirical assessment of the long-term availability and performance of containers in production. We likewise lack a comparison of the long-term availability and performance of containers vis-à-vis VMs. We can, however, identify one obvious difference between containers and VMs—namely, the means of abstraction. With respect to a VM, the means of abstraction is a hypervisor; in the case of a container, it is the engine or daemon that is used to manage (spawn, respawn, stop, remove, etc.) running instances of containers.
With this in mind, there is some empirical basis for the argument that the container daemon constitutes a more vulnerable single point of failure than the hypervisor.
Another, even more basic issue has to do with statelessness, which, in the context of container virtualization, is a virtue, not a vice. The container model’s built-in bias in favor of statelessness underpins several of the advantages (viz., convenience, portability, simplified maintenance) it claims to achieve vis-à-vis VMs. It also manifests itself in the practical priority that most organizations give to what I call “disposability” in their disposition and management of containers: ideally, a running instance of a container is not unique; for this reason, it can be respawned—quickly, with minimal service interruption—if necessary. This is why containers are typically destroyed (deleted) after they exit or terminate. This makes sense: containers are smaller than VMs, but can still consume hundreds of megabytes, or even one gigabyte or more, of free disk space.
The emphasis, then, is on quickly detecting and respawning non-responding (or missing) containers. Most organizations will employ load-balancing and cluster management software (Docker Swarm, Kubernetes) to automate these tasks. A general best practice is that organizations should not use containers to provide services that require fault tolerance, data persistence, and data consistency.
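What such cluster managers automate is, at bottom, a detect-and-respawn loop. Here is a toy sketch in Python; the function names are hypothetical, and real systems such as Kubernetes implement this with liveness probes and a reconciliation loop rather than anything this naive:

```python
import itertools

def supervise(spawn, is_healthy, max_restarts=5):
    """Detect a dead worker and respawn it from scratch."""
    worker = spawn()
    restarts = 0
    while not is_healthy(worker):
        if restarts == max_restarts:
            raise RuntimeError("worker keeps failing; giving up")
        # Because instances are interchangeable, recovery is simply
        # "throw the old one away and start a fresh one".
        worker = spawn()
        restarts += 1
    return worker, restarts

# Simulate a worker that comes up dead twice before running healthy.
ids = itertools.count()
spawn = lambda: next(ids)
is_healthy = lambda worker_id: worker_id >= 2

worker, restarts = supervise(spawn, is_healthy)
```

Note that nothing in the loop tries to repair or resume the failed worker; disposability is what makes the recovery strategy this simple.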
These and other trade-offs are not arbitrary. Rather, an inverse relation obtains, such that some of the same features which make containers valuable—namely, convenient portability, sandbox-like isolation, and simplified maintenance—can also be detrimental, depending on context. Call it a case of a virtue becoming a vice, and, er, vice-versa.
Containers and data management are, in an essential sense, at cross purposes. Remember that rap about virtues becoming vices—and vice-versa? This is especially true with respect to the idea of using containers in connection with analytical workloads. In the first case, container virtualization gives priority to convenience, portability, and simplified manageability at the expense of data persistence, or, for that matter, data history—both of which are, conversely, top priorities for data management. Another way to put this is to say that data management is self-consciously stateful in precisely the way that container virtualization is self-consciously stateless.
The larger point is that there is an essential tension between the priorities and purposes of container virtualization and those of data management. This tension is exacerbated by the general characteristics of analytical workloads themselves.
The essential “distributedness” of data and analytical workloads is an additional complicating factor. What I’m calling “distributedness” takes two basic forms.
The first is that of people and process distributedness: data and analytical workloads are distributed because people—and, ipso facto, the data they create, consume, and need to work with—are distributed. A person might be in one physical place and require data that lives in another physical place, or, for that matter, another context—be it an organizational context (a separate business unit), a local-virtual context (a shared disk file system spanning a campus-wide SAN), a remote-virtual context (a cloud block storage service, such as Amazon EBS), etc. The practical upshot of this is that data and analytical workloads tend to occur at different times and in different contexts for different reasons. In spite of data governance policies and controls, data and analytical workloads are not necessarily managed and/or coordinated across an organization.
The second is that of computational distributedness: Some data and analytical workloads entail processing at so massive a scale as to outstrip the capacity of any single compute resource—be it a VM, a container, or a physical computer system. One solution is to break them down into sequences of smaller operations that can be processed concurrently or in parallel by two or more compute resources, typically termed “nodes.”
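The decomposition can be sketched as follows, with Python threads standing in for nodes. In a real deployment the “nodes” would be separate machines, VMs, or containers coordinated by a framework; the split-process-recombine logic is what matters here:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """One node's share of the work."""
    return sum(chunk)

def distributed_sum(data, nodes=4):
    # Break the workload into smaller operations...
    size = max(1, len(data) // nodes)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # ...farm them out (threads stand in for nodes here)...
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        partials = list(pool.map(partial_sum, chunks))
    # ...and recombine the partial results.
    return sum(partials)

total = distributed_sum(list(range(1_000)))  # 499500
```

The price of this decomposition is a new set of dependencies: the final recombination step cannot run until every node has delivered its partial result, which is precisely where the trouble discussed next comes in.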
Container virtualization can help organizations manage (both kinds of) distributedness. It can also exacerbate the problems posed by (both kinds of) distributedness. On the one hand, containers give an organization an additional option for easily, or cost-effectively, distributing workloads, especially in the case of data pipelines that entail many concurrent operations. Imagine that an analyst or data scientist needs to access, engineer, and move data between and among different contexts. Her goal? To create a training data set that she can use to develop and train one or more predictive models. This could be an ideal use-case for containers: she could, e.g., import a series of Python scripts or modules into Jupyter, customize them to perform the manipulations and/or transformations she needs, and run them in Kubernetes. Moreover, if she should want to share the models, data transformations, training data sets, notes, etc. that she’s created, she can easily roll them up into a container. Portability, for the win.
On the other hand, each of the dependencies that is introduced when a workload is distributed constitutes a potential point of failure in a pipeline or data flow. For example, a queued task might depend on one or more distributed tasks that are executing concurrently (such that Job C depends on Jobs A and B, which are either running on separate nodes, or on separate processors on a single node) or in parallel (such that a single task is broken up into Jobs A, B, and C, which may execute locally, on a single node, or be distributed across multiple nodes). What happens if Job B should fail, however? Job C cannot kick off without the outputs of Jobs A and B. The result is a pipeline stall. Thanks both to the practical constraints and to the vicissitudes of the container model, this stall could become catastrophic: in general, a container does not persist (write) data outside of its context; should a container exit (crash, terminate, or disappear) or hang while processing a workload, its progress (and output to that point) is lost. The queued operation rolls back, da capo.
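The stall can be sketched in a few lines of Python. Here Jobs A and B run concurrently, Job C needs both of their outputs, and a failed job leaves nothing behind to resume from; all names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

class PipelineStall(Exception):
    pass

def run_pipeline(job_a, job_b, job_c):
    """Run A and B concurrently; C depends on both of their outputs."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(job_a)
        future_b = pool.submit(job_b)
        try:
            out_a, out_b = future_a.result(), future_b.result()
        except Exception as exc:
            # The failed "container" persisted nothing outside itself,
            # so there is no partial output to resume from: the whole
            # pipeline rolls back, da capo.
            raise PipelineStall(f"upstream job failed: {exc}") from exc
    return job_c(out_a, out_b)

def job_a():
    return ["rows from source A"]

def job_b():
    raise RuntimeError("container exited mid-job")

def job_c(a, b):
    return a + b
```

Calling run_pipeline(job_a, job_b, job_c) here raises PipelineStall; swap job_b for a healthy implementation and Job C runs to completion.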
The problem here is two-fold: (1) computational distribution introduces stateful dependencies that constitute potential points of failure; (2) the vicissitudes of the container model exacerbate this problem, such that stateful dependencies + containers = the potential for catastrophe—i.e., process termination, data loss, etc.
One solution is to insist on a pragmatic diremption, of sorts: i.e., to recognize that a data processing workflow which entails multiple dependencies is a good candidate for encapsulation into one or more containers if it is a product of (or initiated as part of) the self-service use case. This is, conversely, to recognize that a similarly complex data processing workload—albeit one designed to support a reusable, productionized use case—is probably not a good candidate for container virtualization. This isn’t to say that one or more operations in such a workflow couldn’t, or shouldn’t, be performed in containers. It is to say that the workflow itself—as a repeatable, reusable process—probably should not be encapsulated in toto into one or more containers.
vi. Preliminary Conclusion
I’ll have much more to say about containers, about the several and varied modes of virtualization; the challenging (logistical and computational) characteristics of data and analytical workloads; the importance of pragmatism; and other, related issues in the series of blogs that will follow. For the present, I want to stress something—namely, a virtue—that so often takes flight whenever and wherever new technology is concerned: common sense.
I’ve seen too many organizations take leave of their (common) senses in deploying new technologies—Hadoop is perhaps the most frustrating example—to serve purposes or to support workloads for which they were not designed. Container virtualization is valuable precisely because it gives enterprises new flexibility—new options—for dealing with stateless workloads, of which there is no lack. The use of containers to support stateful workloads is a much more problematic proposition, however; it almost certainly would not be a thing, per se, but for the moiling and roiling efforts, the ceaseless agitation, of interested vendors.
I keep coming back to two great conversations I’ve had with people I respect over the last month. The first elicited the simple observation that state should not be managed at the level of a container. “If the state is important, should it be sitting locally in a container? That’s pretty bad state management,” one friend said to me. Another friend framed the issue in terms not unlike those of the Hippocratic Oath. Instead of “First, do no harm,” this person suggests: “Will what I propose to do lead to better outcomes? If so, what are they? Why better? How better?”
Notes

1. Caveat developer: a recommended best practice is to encapsulate a single process or program per container.

2. This performance advantage is less a function of the maturity of containers and/or container management software than of the architectural advantages of the container model itself.

3. These workloads cannot cost-effectively be processed by a single system. Rewind to the late 1990s and we’d find very large SMP systems from IBM Corp., Hewlett-Packard Co. (HP), and the former Sun Microsystems Inc. crushing MPP systems in most contemporary decision support benchmarks. The market abandoned large SMP systems (32+ individual processors) because they could not compete with the price/performance of MPP kit.

4. Again, this is not to say that it cannot. It is, however, to refer back to the finding that the priority of persistence is, to a degree, at odds with the design, purpose, and historical usage of containers.