Chapter 8
Data Analytics in the Cloud
“Science is what we understand well enough to explain to a computer.
Art is everything else we do.”
—Donald Knuth
What we know today as public clouds were originally created as internal data
centers to support services such as e-commerce, e-mail, and web search, each of
which involved the acquisition of large data collections. To optimize these services,
companies started performing massive amounts of analysis on those data. After
these data centers morphed into public clouds, the methods developed for these
analysis tasks were increasingly made available as services and open source software.
Universities and other companies have also contributed to a growing collection
of excellent open source tools. Together this collection represents an enormous
ecosystem of software supported by a large community of contributors.
The topic of data analytics in the cloud is huge and rapidly evolving, and could
easily fill an entire volume itself. Furthermore, as we observed in chapter 7, science
and engineering are themselves rapidly evolving towards a new data-driven fourth
paradigm of data-intensive discovery and design [ ]. We survey some of the most
significant approaches to cloud data analytics and, as we have done throughout
this book, leave you with examples of experiments that you can try on your own.
We begin with the first major cloud data analysis tool, Hadoop, and describe
its evolution to incorporate Apache YARN. Rather than describe the traditional
Hadoop MapReduce tool, we focus on the more modern and flexible Spark system.
Both Amazon and Microsoft have integrated versions of YARN into their standard
service offerings. We describe how to use Spark with the Amazon version of YARN,
Amazon Elastic MapReduce, which we use to analyze Wikipedia data. We also
present an example of the use of the Azure version of YARN, HDInsight.
We then turn to the topic of analytics on truly massive data collections. We
introduce Azure Data Lake and illustrate the use of both Azure Data Lake Analytics
and Amazon’s similar Athena analytics platform. Finally, we describe a tool from
Google called Cloud Datalab, which we use to explore National Oceanographic
and Atmospheric Administration (NOAA) data.
8.1 Hadoop and YARN
We have already introduced Hadoop and the MapReduce concept in chapter 7.
Now is the time to go beyond an introduction. When Hadoop was introduced, it
was widely seen as the tool to use to solve many large data analysis problems. It
was not necessarily efficient, but it worked well for extremely large data collections
distributed over large clusters of servers.
The Hadoop Distributed File System (HDFS) is a key Hadoop building
block. Written in Java, HDFS is completely portable and based on standard
network TCP sockets. When deployed, it has a single NameNode, used to track
data location, and a cluster of DataNodes, used to hold the distributed data
structures. Individual files are broken into 64 MB blocks, which are distributed
across the DataNodes and also replicated to make the system more fault tolerant.
As illustrated in figure 8.1 on the following page, the NameNode keeps track of
the location of each file block and the replicas.
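The arithmetic behind this blocking is simple: a file occupies ceil(size / 64 MB) blocks, and each block is stored once per replica (three copies is the usual HDFS default). The following sketch illustrates the bookkeeping in plain Python; `hdfs_footprint` is an illustrative helper, not an HDFS API:

```python
import math

BLOCK_SIZE = 64 * 2**20   # the classic 64 MB HDFS block size
REPLICATION = 3           # common default replication factor (an assumption here)

def hdfs_footprint(file_size: int) -> tuple:
    """Return (number of blocks, total bytes stored across all DataNodes)."""
    blocks = math.ceil(file_size / BLOCK_SIZE)
    return blocks, file_size * REPLICATION

# The 59 GB Wikipedia dump used later in this section splits into 882 blocks:
blocks, stored = hdfs_footprint(59189269956)
```

With three-way replication, the NameNode must track the locations of all 882 blocks and their replicas, roughly 2,600 block-to-DataNode mappings for this one file.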
HDFS is not a POSIX file system: it is write-once, read-many, and only
eventually consistent. However, command line tools make it usable in a manner
similar to a standard Unix file system. For example, the following commands create
a “directory” in HDFS, pull a copy of Wikipedia from a website, push those data
to HDFS (where they are blocked, replicated and stored), and list the directory.
$ hadoop fs -mkdir /user/wiki
$ curl -s -L http://dumps.wikimedia.org/enwiki/...multistream.xml.bz2 \
    | bzip2 -d | hadoop fs -put - /user/wiki/wikidump-en.xml
$ hadoop fs -ls /user/wiki
Found 1 items
-rw-r--r--  hadoop 59189269956 21:29 /user/wiki/wikidump-en.xml
Hadoop and HDFS were originally created to support only Hadoop MapReduce
tasks. However, the ecosystem rapidly grew to include other tools. In addition, the
original Hadoop MapReduce tool could not support important application classes
Figure 8.1: Hadoop Distributed File System with four DataNodes and two files broken
into blocks and distributed. The NameNode keeps track of the blocks and replicas.
such as those requiring an iterative application of MapReduce [ ] or the reuse
of distributed data structures.
YARN (Yet Another Resource Negotiator) represents the evolution of
the Hadoop ecosystem into a full distributed job management system. It has a
resource manager and scheduler that communicates with node manager processes
in each worker node. Applications connect to the resource manager, which then
spins up an application manager for that application instance. As illustrated in
figure 8.2 on the next page, the application manager interacts with the resource
manager to obtain “containers” for its worker nodes on the cluster of servers. This
model allows multiple applications to run on the system concurrently.
YARN is similar in many respects to the Mesos system described in chapter 7.
The primary difference is that YARN is designed to schedule MapReduce-style
jobs, whereas Mesos is designed to support a more general class of computations,
including containers and microservices. Both systems are widely used.
8.2 Spark
Spark’s design addresses limitations in the original Hadoop MapReduce computing
paradigm. In Hadoop’s linear dataflow structure, programs read input data from
disk, map a function across the data, reduce the results of the map, and store
reduction results on disk. Spark supports a more general graph execution model
Figure 8.2: YARN distributed resource manager architecture.
that allows for iterative MapReduce as well as more efficient data reuse. Spark
is also interactive and much faster than pure Hadoop. It runs on both YARN
and Mesos, as well as on a laptop and in a Docker container. In the following
paragraphs we provide a gentle introduction to Spark and present some data
analytics examples that involve its use.
A central Spark construct is the Resilient Distributed Dataset (RDD), a
data collection that is distributed across servers and mapped to disk or memory,
providing a restricted form of distributed shared memory. Spark is implemented
in Scala, an interpreted, statically typed object-functional language. Spark has
a library of Scala parallel operators, similar to the Map and Reduce operations
used in Hadoop, that perform transformations on RDDs. (The library also has
a nice Python binding.) More precisely, Spark has two types of operations:
transformations, which map RDDs into new RDDs, and actions, which return values
to the main program: usually the read-eval-print loop, such as Jupyter.
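The transformation/action distinction is easiest to see in code. The sketch below emulates the semantics with plain Python lists standing in for RDD partitions, so no Spark installation is needed; `partitioned_map` and `partitioned_reduce` are illustrative names, not Spark API:

```python
from functools import reduce

# Two "partitions" of a dataset, as Spark would distribute them to workers.
partitions = [[1, 2, 3, 4], [5, 6, 7, 8]]

# A transformation maps each partition to a new partition.
# (In Spark this is lazy: nothing runs until an action is invoked.)
def partitioned_map(f, parts):
    return [[f(x) for x in part] for part in parts]

# An action does most of its work within each partition,
# then combines the per-partition results across partitions.
def partitioned_reduce(f, parts):
    per_partition = [reduce(f, part) for part in parts]  # on each worker
    return reduce(f, per_partition)                      # across partitions

squared = partitioned_map(lambda x: x * x, partitions)   # transformation
total = partitioned_reduce(lambda a, b: a + b, squared)  # action: sum of squares
```

In real Spark the equivalent would be `sc.parallelize(data, 2).map(lambda x: x * x).reduce(lambda a, b: a + b)`, with the per-partition work executed on the workers rather than locally.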
8.2.1 A Simple Spark Program
We introduce the use of Spark by using it to implement a trivial program that
computes an approximation to π via this identity:

    π/4 = Σ_{i=0}^{∞} (−1)^i / (2i + 1)

Our program, shown in figure 8.3 on the following page and in notebook 11,
uses a map operation to compute (−1)^i/(2i + 1) for each of n values of i, and a reduce operation
to sum the results of those computations. The program creates a one-dimensional
array of integers that we then convert to an RDD partitioned into two pieces. (The
array is not big, but we ran this example on a dual-core Mac Mini.)
Figure 8.3: Computing π with Spark.
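Because figure 8.3 appears only as an image, the following plain-Python sketch mirrors the structure of the notebook code, assuming the Leibniz series π/4 = Σ (−1)^i/(2i + 1); the names `terms`, `mapped`, and `pi_approx` are illustrative, and in real Spark the first line would be `rdd = sc.parallelize(range(n), 2)` to create the two-partition RDD:

```python
import math

n = 100000  # number of series terms; more terms give a better approximation

# In Spark: rdd = sc.parallelize(range(n), 2) creates an RDD in two partitions.
terms = range(n)

# Map step: compute one series term per element.
mapped = [(-1) ** i / (2 * i + 1) for i in terms]

# Reduce step: sum the terms; four times the sum approximates pi.
pi_approx = 4 * sum(mapped)
```

With Spark, the map and reduce would be RDD methods (`rdd.map(...).reduce(...)`) executed in parallel on the two partitions.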
In Spark, the partitions are distributed to the workers. Parallelism is achieved
by applying the computational parts of Spark operators on each partition in parallel,
using multiple threads per worker. For actions, such as a reduce,
most of the work is done on each partition and then across partitions as needed.
The Spark Python library exploits Python’s ability to create anonymous functions
using the lambda operator. Code is generated for these functions, and they can
then be shipped by Spark’s work scheduler to the workers for execution on each
RDD partition. In this case we use a simple MapReduce computation.
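This partition-parallel execution pattern can be imitated on a single machine with a thread pool, which is roughly what a multi-threaded Spark worker does with its local partitions. The sketch below is a local analogy, not Spark code; `work` stands in for the computational part of an operator:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Two partitions of the input, as the scheduler would hand them to workers.
partitions = [range(0, 50), range(50, 100)]

def work(part):
    # The per-partition piece of a map + reduce: sum of squares of the partition.
    return reduce(lambda a, b: a + b, (x * x for x in part))

# Run the per-partition work in parallel threads...
with ThreadPoolExecutor(max_workers=2) as pool:
    per_partition = list(pool.map(work, partitions))

# ...then combine across partitions, as an action does.
total = reduce(lambda a, b: a + b, per_partition)
```

The lambdas here play the same role as those shipped to Spark workers: small anonymous functions applied independently to each partition.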
8.2.2 A More Interesting Spark Program: K-means Clustering
We now consider the more interesting example of k-means clustering [ ]. Suppose
you have 10,000 points on a plane, and you want to find the k new points that are
the centroids of the k clusters that partition the set. In other words, each point is
to be assigned to the set corresponding to the centroid to which it is closest. Our Spark
solution is in notebook 12.
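The assignment step, finding the centroid nearest to a given point, is the heart of the algorithm. Here is a minimal plain-Python sketch; `closest` is an illustrative helper name, and the notebook's actual code may differ:

```python
def closest(point, centroids):
    """Return the index of the centroid nearest to a 2-D point.

    Uses squared Euclidean distance, which preserves the ordering of
    distances and avoids an unnecessary square root.
    """
    def dist2(c):
        return (point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2
    return min(range(len(centroids)), key=lambda i: dist2(centroids[i]))

# Three example centroids; each input point is assigned to the nearest one.
centroids = [(0.0, 0.0), (5.0, 5.0), (0.0, 10.0)]
```

In the Spark version this function is the body of a map over the RDD of points, producing (cluster index, point) pairs that a reduce step averages into new centroids.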
We use an array to hold the k centroids. We initialize this array with
random values and then apply an iterative MapReduce algorithm, repeating these
two steps until the centroids have not mo