Chapter 7
Scaling Deployments
“I fixed my eyes on the largest cloud, as if, when it passed out of my
sight, I might have the good luck to pass with it.”
—Sylvia Plath, The Bell Jar
Public clouds are built from the ground up to allow customers to scale their
deployed services to fit their needs. The most common motivation for scaling in
industry is to allow services to support thousands or millions of concurrent users.
For example, an online media streaming site may initially be deployed on a single
VM instance. As its business grows, it may need to scale to use 100s or even
10,000s of servers at peak times, and then scale back when business is slow.
As we discuss in chapter 14, such multiuser scaling can also be of interest to
some research users. However, the more urgent need facing scientists and engi neers
is often to run a bigger simulation or to process more data. To do so, they need to
be able to obtain access to 10 or 100 or 1,000 servers quickly and easily, and then
run parallel programs eciently across those servers. In this chapter, we describe
how dierent forms of parallel computation can be mapped to the cloud, and some
of the many software tools that clouds make available for this purpose.
We return to various of these tools in part III of the book, when we look at
data an alytics , streaming, and machine learning, each of which increasingly require
large-scale computing.
7.1. Paradigms of Parallel Computing in the Cloud
7.1 Paradigms of Parallel Computing in the Cloud
We consider in this chapter five commonly used parallel computing paradigms for
scaling deployments in the cl oud .
The first is highly synchronous
single program multiple data
computing. This paradigm dominates supercomputer applications and is often the
first that scientific programmers want to try when they meet the cloud. We will
show how to use SPMD programming models in the cloud, pointing out where
care must be taken to get good performance.
The second paradigm i s
many task parallelism
, in which a large queue or
queues of tasks may be executed in any order, with the results of each task stored
in a database or in files, o r used to define a new task that is added to a queue.
Such computations were termed embarrassingly parallel by early parallel computing
researchers [131]; we prefer happily parallel. It adapts well to clouds.
The third paradigm is
bulk synchronous parallelism
(BSP) [
]. Originally
proposed as a model for analyzing parallel algorithms, BSP is based on processes or
threads of execution that repeatedly perform independent computations, exchange
data, and then synchronize at a barrier. (When a process reaches a barrier in
its program, it stops and cannot proceed until all other threads/processes in the
synchronization group reach the same point.) A contemporary example of th e BSP
model i s the
style made famous by Google and now widely used in
its Hadoop and Spark realizations.
The fourth paradigm is the graph execution model, in which computation
is represented by a directed, usually acyclic, graph of tasks. Execution begins at a
source of the graph. Each node is scheduled for execution when all incoming edges
to that node come from task nodes that h ave completed. Graphs can be constructed
by hand or alternatively generated by a compiler from a more traditional-looking
program that describes the graph either implicitly or explicitly. We describe graph
execution systems in chapters 8 an d 9, including the data analytics tool
and the
Spark Streaming
Apache Flink
Google Dataflow
systems. The graph execution model is also used in machine learning tools, s uch
as the
Google TensorFlow
Microsoft Cognitive Toolkit
systems that
we discuss in chapter 10. In many of these cases, the execution of a graph node
involves a parallel operation on on e or more distributed data structures. If the
graph is just a linear sequence of nodes, then execution becomes an instance of
BSP concurrency.
The fifth paradigm we consider is
and actors. In the actor
model of paral lel programm ing , computing is performed by many
Chapter 7. Scaling Deployments
communicate via messages. Each actor has its own internal private memory and
goes into action when it receives a message. Based on the message, it can change its
internal state and then send messages to other actors. You can implement an actor
as a simple web service. If you limit the service to being largely stateless, you are
now in th e realm of
, a dominant design paradigm for large cloud
applications. Microservices are often implemented as container instances, and they
depend on robust container management infrastructure such as
Kubernetes [79], which we describe in later sections.
We discuss in the following how each of these paradigms can be implemented in
cloud environments. We look at the impact that GPU accelerators are having on
science in the cloud and examine how we can build traditional high-performance
computing (HPC) clusters. We then move on to more cloud-centric systems that
have evolved for scalable data analytics. This last group includes Mesos and
Kubernetes for managing large collections of microservices, as well as MapReduce-
style computing systems such as Hadoop and Spark.
7.2 SPMD and HPC-style Paralleli sm
Scientists and engineers frequently ask, “Can I do traditional high-performance
computing (HP C) in the cloud?” The answer us ed to be, “not really”: Indeed, a
2011 U.S. Department of Energy report [
] and a 2012 National Aeronautics
and Space Administration (NASA) report [
] both concluded that cloud d id not
meet expectations for HPC applications. However, things have changed as cloud
providers have deployed both high-speed node types and specialized interprocessor
communication hardware. It is now feasible to achieve excellent performance for
HPC applications on cloud systems for modest numbers of processing cores, as we
now discuss.
7.2.1 Message Passing Interface in the Cloud
The Message Passing Interface (MPI) remains the standard for large-scale parallel
computing i n science and engineering. The specification allows for portable and
modular parallel programs; implementations allow communicating processes to
exchange messages frequently and at high speed. Amazon
and Azure
both oer special
HPC-style cluster hardware in their data centers that can achieve high performance
for MPI applications. Cycl e Computing reported back in 2015 th at “we have seen
comparable performance [for MPI applications on EC2 vs. s upercomputers] up
7.2. SPMD and HPC-style Parallelism
to 256 cores for most workloads, and even up to 1024 for certain workloads that
aren’t as tightly-coupled. ” [
] The key to getting good performance is configu ring
advanced networking, as we discuss below when we describe how to create virtual
clusters on Amazon and Azure.
7.2.2 GPUs in the Cloud
Graphics processing units have been used as accelerators in supercomputing for
about 10 years. Many top supercomputing applications now rely on compute
nodes with GPUs to give massive accelerations to dense linear algeb ra calculations.
Coincidentally this type of calculation is central to the computations needed to
deep neural networks
(NNs), which lie at the nexus of interest in machine
learning by the technology industry. Consequently it is no surprise that GPUs are
making their way into cloud servers.
The most common GPUs used in the current systems are from NVIDIA and
are programmed with a single instruction multiple data (SIMD) model in a system
called CUDA. SIMD operators are, in eect, array operations where each instruction
applies to entire arrays of data. For example, a state-of-the-art system in 2017, the
NVIDIA K80, has 4, 992 CUDA cores with a dual-GPU design delivering up to 2.91
teraflops (10
floating point operations per second) double-precision performance
and 8.73 teraflops i n single precision. Amazon, Microsoft, and IBM have made
K80-supported servers available in their public clouds. The Amazon EC2 P2
instances in corporate up to eight NVIDIA Tesla K80 accelerators, each runni ng
a pair of NVIDIA GK210 GPUs. The Azure NC24r series has four K80s with
224 GB memory, 1.5 TB solid state disk (SSD), and hi gh-speed networking. You
do not need to configure a particularly large cluster of these machine instances to
create a petaflop (10
floating point operations/second) virtual computer.
MPI cloud computing for proton therapy
. This example illustrates how spe-
cialized node types (in this case, GPU-equipped nodes with 10 gigabit/s interconnect)
allow cloud computing to deliver transformational computing power for a time-
sensitive m edical application. It also involves the use of MPI for inter-instance
communication, Apache Mesos for acquiring and configuring virtual clusters, and
Globus for data movement betwee n hospitals and cloud.
Chard and colleagues have used cloud computing to reconstruct three-dimensional
proton computed tomography (pCT) images in support of proton cancer treat-
ment [
]: see figure 7.1 on the following page. In a typical clin ical usage modality,
protons are first used in an online position verification scan and are then targeted at
cancerous cells. The res ults of the verification scan must be returned rapidly.
Chapter 7. Scaling Deployments
10 cm
Figure 7.1: Proton c omputed tomography. Protons pass left to right through sensor planes
and traverse the target before stopping in the detector at th e far right.
proton histories, resulting in an input dataset of
100 GB. The reconstruction process
is complex, involving multiple processes and multiple stages. First, each participating
process reads a subset of proton histories into memory and performs some preliminary
calculations to remove abnormal histories. Then, filtered back projection is used to
estimate an initial recons truction solution, from which most likely paths (MLPs) are
estimated for the proton through the target. The voxels of the MLP for each proton
identif y the nonzero coecients in a set of nonlinear equations that must then be
iteratively solved to construct the image. This solution phase takes the bulk of the
time and can be accelerated by using GPUs and by caching the MLP paths (up to
2 TB for a 2-billion-history image) to avoid recomputation.
Chard et al. use a high-performance, MPI-based parallel reconstruction code
developed by Karonis and colleagues that, when run on a standalone cluster with
60 GPU-equipped compute nodes, can reconstruct two billion histories in seven
minutes [
]. They configure this code to run on an Amazon virtual cluster of GPU-
equipped compute instances, selected to meet requirements for compute performance
and memory. They deploy on each instance a VM image configured to run the pCT
software plus associated dependencies (e.g., MPI for inter-instance communication).
The motivation for developing this pCT reconstruction service is to enable rapid
turnaround processing for pCT data from many hospitals, as required for clinical use
of proton therapy. Thus, the pCT service also incorporates a scheduler component,
responsible for acce p ting requests from client hospitals, acquiring and configuring
appropriately sized virtual clusters (the size depending on the number of proton
histories to be processed), and dispatching reconstruction tasks to those clusters.
They use the Apache Mesos scheduler (section 7.6.6 on page 124) for this task and
the Globus transfer service (section 3.6 on page 51) for data movement.
This pCT service can compute reconstructions for $10 per image when using
7.2. SPMD and HPC-style Parallelism
Amazon spot instances (see page 80 for a discussion of spot instances). This is a
revolutionary new capability, considering that the alternative is for each hospital with
7.2.3 Deploying an HPC Cluster on Amazon
A useful tool for building HPC clusters on the Amazon cloud i s Amazon’s
service, which enables the automated deployment of complex collec-
tions of related services, such as multiple EC2 instances, load balancers, special
subnetworks connecting these components, and security groups that apply across
the collection. A specific deployment is described by a template that defines all
needed parameters. Once required parameter values are set in the template, you
can launch multiple identical instances of the described deployment. You can both
create new templates and modify existing templates.
We do not go into those complexities here bu t instead consider one particular
CfnCluster (CloudFormation Cl uster)
. This tool is a set of Python
scripts that you can install and run on your Linux, Mac, or Windows computer to
invoke CloudFormation, as follows, to build a private, custom HPC cluster.
sudo pip install cfncluster
cfncluster configure
The configuration script asks you various questions, including the following.
1. A name for your cluster template (e.g., “mycluster”)
2. You r Amazon access keys
3. The region in which you want your cluster to be deployed
4. The name for your virtual private cloud (VPC)
5. The name for the key pair that you want to use
6. Virtual Private Cloud (VPC) and subnet IDs
The following command causes a basic cluster to be launched, in the manner
illustrated in figure 7.2 on the following page. (As you have not yet specified the
number or type of compute nodes to start, or what image to run on those nodes,
defaults are used.) This process typically takes about 10 minutes.
cfncluster create mycluster
Chapter 7. Scaling Deployments
Figure 7.2: CloudFormation steps involved in launching a private HPC cloud from a
CfnCluster template.
The “create” command returns a Gangli a URL. Ganglia is a well-known and
frequently used cluster monitoring tool. Following that link takes you to a Ganglia
view of your HPC clus ter. The default settings for a new cluster are autoscale
compute nodes and the gridEngine scheduler. This combination works well if you
are running a lot of no n-MPI batch jobs. With autoscale, compute nodes are shut
down when not in us e, and new nodes are started when load increases.
Now let us deploy a new cluster with better compute nodes and a better
scheduler. The following command deletes your current deployment.
cfncluster delete mycluster
To create a new and improved deployment that you can use for MPI programs,
find the directory
on your PC and edit the file
. Look for
a section called
[cluster mycloud]
, and add or edit the following four lines in
that section.
compute_instance_type = c3.xlarge
initial_queue_size = 4
maintain_initial_size = true
scheduler = slurm
7.2. SPMD and HPC-style Parallelism
The choice of compute instance type here is important. The
type supports what Amazon calls
enhanced networking
, which mean s that it
runs on hardware and with software that support Single-Root I/O Virtualization
(SR-IOV). As we discuss in more detail in section 13.3.2 o n page 286, this technology
reduces latency and increases bandwidth by allowing VMs to access the physical
network interface card (NIC) directly.
You do not need to specify a VM image because the default contains all libraries
needed for HPC MPI-style computing. You have also stated that you want all of
your compute nodes to stay around and not be managed by autoscale, an d that you
want Slurm to be the scheduler. We have found that these options make interactive
MPI-based computing much ea sier. When you now issue the create comma nd, as
follows, CloudFormation deploys an HPC cluster with these properties.
cfncluster create mycluster
Now let us run an MPI program on the new cluster. You first login to the
cluster’s head node, using the key pair that we used to create the cluster. On a
PC you can use PuTTY, and on a Mac you can use
from the command line.
The user is
. First you need to set up some path information. At the
command prompt of the head node, issue the following three commands.
export PATH=/usr/lib64/mpich/bin :$PATH
export LD_LIBRARY_PATH =/usr/ lib64/mpich/lib
export I_MPI_PMI_LIBRARY =/opt/slurm/lib/
You next need to know the local IP addresses of your compute nodes. Create a
file called ip-print.c, and put the following C code in it.
# include <stdio.h>
# include <mpi.h>
# include <stdlib.h>
main(int argc , char ** argv )
char hostname [1024];
gethostname(hostname , 1024);
printf("%s\n", hostname );
You can then use the Slurm command
to run copies of this prog ram
across the enti re cluster. For example, if your cluster has 16 nodes, run:
mpicc ip -print.c
srun -n 16 /home /ec2 -user /a. out > machines
Chapter 7. Scaling Deployments
The output file
should then contain multiple IP addresses of the form
, where
is a number between 1 and 255, one for each of your compute
nodes. (Another way to find these private IP addresses is to go to the EC2 console
and look at the details for each running server labeled Compute. You need to do
this for each instance, so if you have a lot of instances, the above method i s faster.)
Now you can run the simple MPI program in figure 7.3 on the next page to
test the cluster. Call this
. This program starts with MPI node 0
and sen ds the number -1 to MPI node 1. MPI node 1 sends 0 to node 2, node 2
sends 1 to node 3, and so on. Compile and run this with the command below.
mpicc ring.c
mpirun -np 7 -machinefile ./ machines /home/ec2 -user/a.out
Assuming seven processes running on four nodes, you should get results like
the following.
Warning: Permanently added the RSA host key for IP address ' ' to
the list of known hosts .
Warning: Permanently added the RSA host key for IP address ' ' to
the list of known hosts .
Warning: Permanently added the RSA host key for IP address ' ' to
the list of known hosts .
Warning: Permanently added the RSA host key for IP address' ' to
the list of known hosts .
Received number -1 from process 0 on node ip -10 -0 -1 -77
Received number 0 from process 1 on node ip -10-0-1-75
Received number 1 from process 2 on node ip -10-0-1-78
Received number 2 from process 3 on node ip -10-0-1-76
Received number 3 from process 4 on node ip -10-0-1-77
Received number 4 from process 5 on node ip -10-0-1-75
Received number 5 from process 6 on node ip -10-0-1-78
Observe that the
command distributed the MPI nodes uniformly across
the cluster. You are ready to do some real HPC computing on your virtual private
MPI cluster. You might, for example, measure message transit times. We converted
our C program into one in which our nodes are connected in a ring, and measured
the time fo r messages to go around this ring ten million times. We found that the
average time to send a message and have it received at the destination was a bout
70 microseconds. While this is not a standard way to measure message latencies,
the resu lt is consistent with that seen in other cloud cluster implementations. It
is at leas t 10 times slower than conventional supercomputers, but note that the
underlying protocol used here is IP and that we do not use any special cluster
network such as InfiniBand. The full code is available in GitHub and is linked
from the book website in the Extras tab.
7.2. SPMD and HPC-style Parallelism
# include <mpi.h>
# include <stdio.h>
# include <stdlib.h>
# include <time.h>
int main( int argc , char ** argv ) {
// Initialize the MPI environment
// Find out rank , size
int rank , world_size , number ;
MPI_Comm_rank(MPI_COMM_WORLD , &rank);
MPI_Comm_size(MPI_COMM_WORLD , &world_size);
char hostname [1024];
gethostname(hostname , 1024);
// We assume at least two processes for this task
if (world_size < 2) {
fprintf(stderr , "World size must be >1 for %s\n", argv [0]);
if (rank == 0) {
// If we are rank 0, set number to -1 & send it to process 1
number = -1;
MPI_Send (&number , 1, MPI_INT , 1, 0, MPI_COMM_WORLD );
else if (rank > 0 && rank < world_size) {
MPI_Recv (&number , 1, MPI_INT , rank -1, 0, MPI_COMM_WORLD ,
printf("Received number %d from process %d on node %s\n",
number , rank -1, hostname);
number = number +1;
if (rank+1 < world_size)
MPI_Send (&number , 1, MPI_INT , rank+1, 0, MPI_COMM_WORLD );
MPI_Finalize ();
Figure 7.3: The MPI program
that we use for performance experiments.
Chapter 7. Scaling Deployments
7.2.4 Deploying an HPC Cluster on Azure
We describe two approaches to building an HP C cluster on Azure. The first is to
use Azure’s service deployment orchestration service,
Quick Start
. Like Amazon
CloudFormation, it is based on templates. The templates are stored in GitHub
and can be invoked directly from the GitHub page. For example, figure 7.4 shows
the template start page for creating a Slurm cluster.
Figure 7.4: Azure Slurm start page [8].
When you cl ick
Deploy to Azure
, you are taken to the Azure login p age and
then directly to the web form for completing the S lu rm cluster deployment. At
this point you enter the names for the new resource group that defines your cluster,
the numb er and type of compute nodes, and a few details about the network. We
refer you to the Microsoft tutorial [
] for more details on how to buil d a Slurm
cluster and launch jobs on that cluster.
A second approach to HPC computing on Azure is
Azure Batch
, which
supports the management of large pools of VMs that can handle large batch
jobs, such as many task-parallel job s (described in the next section) or large MPI
computations. As illustrated in figure 7.5 on the next page, four major steps are
involved in using this service.
1. Upload your application binaries a nd input data to Azure storage.
Define the pool of compute VMs that you want to use, specifying the desired
VM size and OS image.
3. Define a Job, a container for tasks executed i n your VM pool.
4. Create tas ks that are lo aded into the Job and executed in the VM pool.