Chapter 4

Computing as a Service

“In pioneer days they used oxen for heavy pulling, and when one ox

couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be

trying for bigger computers, but for more systems of computers.”

—Admiral Grace Hopper

The two simplest forms of cloud computing are perhaps the best known to scientists

and engineers: storage and computing on demand. We covered storage in part I of

the book; we turn next to cloud computing: computing as a service.

A cloud can provide near-instant access to as much computing and storage

as you need, when you need it, along with the ability to obtain computers with

the speciﬁc conﬁguration(s) that you want. Furthermore, these capabilities are all

just an API call or a few mouse clicks away. As we will see, computing services

can be delivered in several diﬀerent ways. The most basic is known in the cloud

industry as infrastructure as a service (IaaS) because it provides virtualized

infrastructure to its users. To understand what IaaS compute looks like, let’s

consider some typical scenarios.

• You need 100 CPUs now, rather than when your job reaches the head of the

queue in your lab cluster.

• You need to access a GPU cluster or a computer with 1 TB of memory, but

such a resource does not exist in the lab.

• You need computers loaded with 10 different Linux variants to test the

portability of your new software prior to distribution.

• You want a machine outside your firewall that you can share with your

collaborators for a project.

Each scenario can be addressed in multiple ways by cloud computing. The

question we look at in this chapter is how to determine the best approach and how

to evaluate the pros and cons of picking a solution.
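
To make "just an API call" concrete, the following is a minimal sketch of how
the first scenario might be requested from Python on Amazon EC2 with the boto3
library. The image ID, instance type, and region are placeholders chosen for
illustration, not values from this book, and the helper names are our own. The
function that builds the request is kept separate from the one that sends it,
so the request can be inspected without an AWS account.

```python
def build_vm_request(image_id, instance_type, count=1, key_name=None):
    """Assemble keyword arguments for the EC2 run_instances call."""
    request = {
        "ImageId": image_id,            # which VM image to boot
        "InstanceType": instance_type,  # which hardware configuration to use
        "MinCount": count,              # launch exactly `count` instances
        "MaxCount": count,
    }
    if key_name is not None:
        request["KeyName"] = key_name   # SSH key pair already registered with AWS
    return request


def launch_vms(request, region="us-east-1"):
    """Send the request to EC2; requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(**request)
    return [instance["InstanceId"] for instance in response["Instances"]]


# Placeholder image ID and instance type; substitute real ones before launching.
request = build_vm_request("ami-0123456789abcdef0", "t3.micro", count=2)
print(request)
```

Calling `launch_vms(request)` would start (and begin billing for) the
instances, so the sketch stops at printing the request.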

4.1 Virtual Machines and Containers

Public cloud data centers comprise many thousands of individual servers. Some

servers are used exclusively for data services and supporting infrastructure and

others for hosting your computations. When you compute in the cloud, you do not

run directly on one of these servers in the way that you would in a conventional

computational cluster. Instead, you are provided with a virtual machine running

your favorite operating system. A virtual machine is just the software image

of a complete machine that can be loaded onto the server and run like any other

program. The server in the data center runs a piece of software called a hypervisor

that allocates and manages the server’s resources that are granted to its “guest”

virtual machines. In the next chapter, we delve into how virtualization works, but

the key idea is that when you run in a VM, it looks exactly like a server running

whatever operating system the VM is conﬁgured to run.

For the cloud operator, virtualization has huge advantages. First, the cloud

provider can provide dozens of diﬀerent operating systems packaged as VMs for

the user to choose from. To the hypervisor, all VMs look the same and can be

managed in a uniform way. The cloud management system (sometimes called

the fabric controller) can select which server to use to run the requested VM

instances, and it can monitor the health of each VM. If needed, the cloud monitor

can run many VMs simultaneously on a single server. If a VM instance crashes,

it does not crash the server. The cloud monitor can record the event and restart

the VM. User applications running in diﬀerent VMs on the same server are largely

unaware of each other. (A user may notice another VM when it affects the

performance or response of their own VM.)

We provide in chapter 5 detailed instructions on how to deploy VMs on the

Amazon and Azure public clouds, and on OpenStack private clouds.

Containers are similar to VMs but are based on a different technology and

serve a slightly diﬀerent purpose. Rather than run a full OS, a container is layered

on top of the host OS and uses that OS’s resources in a clever way. Containers

allow you to package up an application and all of its library dependencies and data

into a single, easy-to-manage unit. When you launch the container, the application

can be conﬁgured to start up, go through its initialization, and be running in

seconds. For example, you can run a web server in one container and a database

server in another; these two containers can discover each other and communicate

as needed. Or, if you have a special simulation program in a container, you can

start multiple instances of the container on the same host.

Containers have the advantage of being extremely lightweight. Once you have

downloaded a container to a host, you can start it and the application(s) that

it contains quasi-instantly. Part of the reason for this speed is that a container

instance can share libraries with other container instances. VMs, because they are

complete OS instances, can take a few minutes to start up. You can effectively

run many more containers than VMs on a single host machine. Figure 4.1

illustrates the difference between the software stack of a server running

multiple VMs and that of a cloud server running a single OS and multiple

containers.

Figure 4.1: Virtual machines vs. containers on a typical cloud server.

Building a container to run a single application is simple compared with the

task of customizing a VM to run a single application. All you need to do is create a

script that identiﬁes the needed libraries, source ﬁles, and data. You can then run

the script on your laptop to test the container, before uploading the container to a

repository, from where it can be downloaded to any cloud. Importantly, containers

are completely portable across different clouds. In general, VM images cannot be

ported from one cloud framework to another.
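
For Docker, the script mentioned above is a Dockerfile. As a hedged
illustration, the Python sketch below writes out a small Dockerfile from a list
of libraries and source files. The helper name `render_dockerfile`, the file
names `simulate.py` and `params.json`, and the choice of the `python:3.11-slim`
base image are all assumptions made for the example.

```python
import json


def render_dockerfile(base_image, pip_packages, source_files, command):
    """Return Dockerfile text for a single-application container."""
    lines = [f"FROM {base_image}"]              # starting image with OS libraries
    if pip_packages:
        lines.append("RUN pip install " + " ".join(pip_packages))
    lines.append("WORKDIR /app")
    for path in source_files:
        lines.append(f"COPY {path} /app/")      # source files and data
    # Exec-form CMD: the program that starts when the container is launched.
    lines.append("CMD " + json.dumps(command))
    return "\n".join(lines) + "\n"


dockerfile = render_dockerfile(
    base_image="python:3.11-slim",
    pip_packages=["numpy"],
    source_files=["simulate.py", "params.json"],
    command=["python", "simulate.py"],
)
print(dockerfile)
```

Saving the printed text as `Dockerfile` next to the listed files and running
`docker build` in that directory would then produce an image that can be tested
on a laptop before it is uploaded to a repository.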

Containers also have downsides. The most serious issue is security. Because

containers share the same host OS instance, two containers running on the same

host are less isolated than two VMs running on that host. Managing the network

ports and IP addresses used by containers can be slightly more confusing than

when working with VMs. Furthermore, containers are often run on top of VMs,

which can exacerbate the confusion.

Table 4.1: Virtual machines and containers, compared.

Virtual machines                            Containers
Heavyweight                                 Lightweight
Fully isolated; hence more secure           Process-level isolation; hence less secure
No automation for configuration             Script-driven configuration
Slow deployment                             Rapid deployment
Easy port and IP address mapping            More abstract port and IP mappings
Custom images not portable across clouds    Completely portable

In chapter 6, we work through examples of using containers in some detail,

describing in particular the Docker container system (docker.com), how to run

containers, and how to create your own containers.

Table 4.1 lists some of the pros and cons of the virtual machine and

container approaches. Needless to say, both technologies are evolving rapidly.

4.2 Advanced Computing Services

Cloud vendors such as Amazon, Microsoft, and Google have many additional

services to help your research, including special data analysis clusters, tools to

handle massive streams of events from instruments, and special machine learning

tools. We discuss these services in part III of the book.

A common issue of concern to scientists and engineers is scale. VMs and

containers are a great way to virtualize a single machine image. However, many

scientific applications require multiple machines to process large amounts of data or to perform

a complex simulation. You may already know how to run parallel programs on

clusters of machines and you now want to know whether you can run those same

programs on the cloud. The answer depends on the speciﬁcs of the application.

Most high-performance parallel applications are based on the Message Passing

Interface (MPI) standard [147]. As we describe in chapter 7, Amazon and Azure

provide an extensive set of tools for building Linux MPI clusters.

Cloud users can also exploit parallelism in other ways. For example, many task

(MT) parallelism [221] is used to tackle problems in which you have hundreds

of similar tasks to run, each (largely) independent of the others. Another method is

called MapReduce [108], made popular by the Hadoop computational framework

[258]. MapReduce is related to a style of parallel computing known as bulk

synchronous parallelism (BSP) [139]. We cover these topics, also, in chapter 7.

A compelling feature of the cloud is that it provides many ways to create highly

scaled applications that are also interactive. The Spark system [265], originally

developed at the University of California, Berkeley, is more flexible than Hadoop and is

a form of BSP computing that can be used interactively from Jupyter. Google has

released a service called Cloud Datalab, based on Jupyter, for interactive control

of its data analytics cloud. The Microsoft Cloud Business Intelligence (Cloud BI)

tool supports interactive access to data queries and visualization. We discuss these

tools in chapter 8.

Managing your cloud computing resources can become complicated when

you need to scale beyond a few VMs or containers. Keeping track of many

processes spread over many cloud VMs is not easy. Fortunately, several new

tools have been adopted by the public clouds to help with this challenge. For

managing large numbers of containers, you can use the Docker Swarm tools

(docker.com/products/docker-swarm) and Google’s Kubernetes container management

[ ] (which Google uses for its own container management). Many

people use the venerable HTCondor system [243] to manage many task parallel

computations. (HTCondor is used in the Globus Genomics system that we describe

in chapter 11.) Mesos [154] provides another distributed operating system with

a web interface that allows you to manage many applications in the cloud at the

same time. All of these systems are already available, or can easily be deployed,

on cloud platforms. We describe them in chapter 7.

One other computation service programming model that is common in cloud

computing is dataflow. This model plays a significant role in the analysis of

streaming data, as discussed in chapter 9.

4.3 Serverless Computing

An interesting recent trend in cloud computing is the introduction of serverless

computing as a new paradigm for service delivery. As we show in the chapters

ahead, computation and data analysis can be deployed in the cloud via a range

of special services. In the majority of cases, the user must deploy VMs, either

directly or indirectly, to support these capabilities. Doing so takes time, and the

user is responsible for deleting the VMs when they are no longer needed. At times,

however, this overhead is not acceptable, such as when you want an action to take

place in response to a relatively rare event. The cost of keeping a VM running

continuously so that a program can wait for the event may be unacceptably high.

Serverless computing is similar to the old Unix concepts of a daemon and cron

jobs, whereby a program is managed by the operating system and is executed

only when speciﬁc conditions arise. In serverless computing, the user provides a

simple function to be executed, again under certain conditions. For example, the

user may wish to perform some bookkeeping when a new ﬁle is created in a cloud

repository or to receive a notiﬁcation when an important event occurs. The cloud

provider keeps a set of machines running to execute these functions on the user’s

behalf; the user is charged only for the execution of the task, not for maintaining

the servers. We return to this topic in chapters 9 and 18.
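
As a rough illustration of the "simple function" the user provides, the sketch
below does some bookkeeping whenever a new file appears in a cloud repository.
It is modeled on the `handler(event, context)` signature and the storage event
layout used by AWS Lambda with S3; other providers use different conventions.
The bucket and file names are invented, and the last lines merely simulate the
provider invoking the function.

```python
def handler(event, context):
    """Record the location and size of each newly created file."""
    entries = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        entries.append({
            "bucket": s3["bucket"]["name"],
            "key": s3["object"]["key"],
            "size": s3["object"].get("size", 0),
        })
    # A real function might write these entries to a database or send a
    # notification; here they are simply returned.
    return {"files_logged": len(entries), "entries": entries}


# Simulate the cloud provider calling the function for one new file.
fake_event = {
    "Records": [
        {"s3": {"bucket": {"name": "lab-data"},
                "object": {"key": "run42/output.csv", "size": 1024}}}
    ]
}
result = handler(fake_event, context=None)
print(result)
```

Note that the function holds no state between invocations and contains no code
that waits for events; waiting is the provider's job.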

4.4 Pros and Cons of Public Cloud Computing

Public cloud computing has both pros and cons. Important advantages include

the following.

• Cost: If you need a resource for only a few hours or days, the cloud is much

cheaper than buying a new machine.

• Scalability: You are starting a new lab and want to start with a small number

of servers, but as your research group grows, you want to be able to expand

easily without the hassle of managing your own racks of servers.

• Access: A researcher in a small university or an engineer in a small company

may not even have a computer room or a lab with room for racks of machines.

The cloud may be the only choice.

• Configurability: For many scientific disciplines, you can get complete virtual

machines or containers with all the standard software you need pre-installed.

• Variety: Public cloud systems provide access to a growing diversity of

computer systems. Amazon and Azure each provide dozens of machine

configurations, ranging from a single core with a gigabyte of memory to

multicore systems with GPU accelerators and massive amounts of memory.

• Security: Commercial cloud providers have excellent security. They also

make it easy to create a virtual network that integrates cloud resources into

your network.

• Upgradeability: Cloud hardware is constantly upgraded. Hardware that you

buy is out of date the day that it is delivered, and becomes obsolete quickly.

• Simplicity: You can manage your cloud resources from a web portal that is

easy to navigate. Managing your own private cluster may require

sophisticated system administration skills.

Disadvantages of computing as a service include the following.

• Cost: You pay for public cloud by the hour and the byte. Computing

the total cost of ownership of a cluster of machines housed in a university

lab or data center is not easy. In many environments, power and system

administration are subsidized by the institution. If you need to pay only for

hardware, then running your own cluster may be cheaper than renting the

same services in the cloud. Another accounting oddity, perhaps peculiar to

the U.S., is that some universities charge overhead on funds obtained from a

federal funding source for public cloud but not when equivalent funds are

used for hardware purchases.

Academic researchers may also have the option of accessing a national facility

such as Jetstream, Chameleon, or the European science cloud. The cost here

is the work involved in writing a proposal.

• Variety: The cloud does not provide every type of computing that you may

require, at least not today. In particular, it is not a proper substitute for

a large supercomputer. As we show in chapter 7, both Amazon and Azure

support the allocation of fairly sophisticated high performance computing

(HPC) clusters. However, these clusters are not at the scale or performance

level of the top 500 supercomputers.

• Security: Your research concerns highly sensitive data, such as medical

information, that cannot be moved outside your firewall. As stated above,

there are ways to extend your network into the cloud, and the cloud providers

provide HIPAA-compliant facilities. However, the paperwork involved in

getting approval to use these solutions may be daunting.

• Dependence on one cloud vendor (often referred to as vendor lock-in). This

situation is changing, however. As the public clouds converge in many of

their standard offerings and compete on price, moving applications between

cloud vendors has become easier.

Another way to think about the cost of computing is to weigh the cost of

computing for an individual, who may use only a few hours of computing per week,

versus for an entire institution, such as a university or large research center, which

may in aggregate use many tens of thousands of hours per week. One sensible

solution is for the institution to sign a long-term contract with a public cloud

provider that allows the institution to pick up the tab for its researchers. There are

several ways to make this approach economically attractive to both the institution

and the cloud provider. For example, universities can negotiate cloud access as

part of a package deal involving software licenses or institutional data back-up.
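
The rent-versus-buy trade-off can be made concrete with a little arithmetic.
The sketch below computes the number of hours of use at which buying a machine
becomes cheaper than renting an equivalent cloud VM. All figures are
hypothetical (a $5,000 server and a $0.50 per hour VM), and the model is
deliberately crude: it ignores storage and data transfer charges, hardware
depreciation, and staff time unless they are folded into the hourly overhead.

```python
def break_even_hours(purchase_price, cloud_hourly_rate, owned_hourly_overhead=0.0):
    """Hours of use beyond which owning a machine is cheaper than renting.

    owned_hourly_overhead is what you pay per hour to run your own machine
    (power, administration); it is 0 when the institution subsidizes these.
    """
    saving_per_hour = cloud_hourly_rate - owned_hourly_overhead
    if saving_per_hour <= 0:
        return float("inf")  # renting never costs more per hour than owning
    return purchase_price / saving_per_hour


# Hypothetical prices: a $5,000 server versus a $0.50/hour cloud VM.
subsidized = break_even_hours(5000.0, 0.50)          # power and admin are free
unsubsidized = break_even_hours(5000.0, 0.50, 0.25)  # $0.25/hour overhead
print(subsidized, unsubsidized)  # 10000.0 20000.0
```

With these made-up numbers, an individual who computes 5 hours per week would
need roughly 38 years to reach the 10,000-hour break-even point, whereas a
machine kept busy around the clock reaches it in a little over a year.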

An institution may also have the resources and expertise required to build a

private mini-data center, in which it can deploy an OpenStack cloud that it makes

available to all of its employees. Depending on the institution and its workloads,

this approach may be more cost effective than others. It may also be required

for data protection reasons. This approach leaves open the possibility of a hybrid

solution, in which you spill over to the public cloud when the private cloud is

saturated. This is often referred to as cloud bursting. This hybrid solution has

become a common model for many large corporations and is well supported by

cloud providers such as Microsoft and IBM. Aristotle (federatedcloud.org) is an

example of an academic cloud that supports cloud bursting.
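
The placement rule behind cloud bursting can be sketched in a few lines. In the
hypothetical example below, each job is described only by the number of cores
it needs; jobs go to the private cloud while it has free cores and spill over
to the public cloud once it is saturated. Real schedulers also account for job
completion, data location, and cost, none of which is modeled here.

```python
def place_jobs(job_cores, private_capacity):
    """Assign each job to the private cloud if it fits, else burst to public."""
    free = private_capacity
    placement = []
    for cores in job_cores:
        if cores <= free:
            free -= cores                 # job fits in the private cloud
            placement.append("private")
        else:
            placement.append("public")    # private cloud saturated: burst
    return placement


jobs = [8, 16, 8, 4, 32]                  # cores requested by each job
placement = place_jobs(jobs, private_capacity=32)
print(placement)
# ['private', 'private', 'private', 'public', 'public']
```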

4.5 Summary

Clouds can provide computing resources as a service, at scales ranging from a

single virtual machine with one virtual core and a few gigabytes of memory to full

HPC clusters. The type and scale of service you use depend on the nature of the

application. The following are examples.

• You may need only an extra Linux or Windows server to run an application,

and you do not want to saddle your laptop with the extra load. In this case,

a single VM running on a multicore server with a large memory is all you

may need. You deploy it in a few minutes and remove it when you are done.

• You want to run a standard application such as a database to share with other

users. In this case, the easiest solution is to run a containerized instance of

your favorite database on a VM designed to run containers or on a dedicated

container hosting service.

• You have an MPI-based parallel program that does not need thousands of

processors to get the job done in a reasonable amount of time. In this case

the public clouds have simple tools to spin up an HPC cluster that you can

use for your job.

• You have a thousand tasks to run that produce data you need to analyze

interactively. This may be a job for Spark with a Jupyter front end, or for

Hadoop if it can be rendered as a MapReduce computation.

• If the thousand-task computations are more loosely coupled and can be

widely distributed, then HTCondor is a natural choice.

• If you are processing large streams of data from external sources, a dataflow

stream processing tool may be the best solution.

Other considerations, notably cost and security, can also enter into your choice

of resource and system. We discuss security in chapter 15.

4.6 Resources

Each of the major public cloud vendors provides excellent tutorials for using its

computing services. In addition, entire books have been written on each topic

covered in this chapter. Four that we particularly like are: Programming AWS

EC2, by Jurg van Vliet and Flavia Paganelli [251]; Amazon Web Services in

Action, by Andreas Wittig and Michael Wittig [263]; Programming Google App

Engine with Python, by Dan Sanderson [231]; and Microservices, IoT and Azure:

Leveraging DevOps and Microservice Architecture to deliver SaaS Solutions, by Bob

Familiar [119]. We point to additional resources in the chapters that follow.