Part III:
The Cloud as Platform
“May your mountains rise into and above the clouds.”
—Edward Abbey
As we noted in chapter 1, the cloud is a lot more than a virtual computer: it is a
rich ecosystem of services that can slash the expertise, time, and money required
to build sophisticated applications. For example, say that you want to build
an application to monitor environmental sensors and alert you when a certain
combination of sensors indicates anomalous behavior. Or perhaps you need to
explore a large archive of image data for samples to train a deep neural network,
so that you can analyze a much larger archive. Or you have been charged with
building a system to deliver large quantities of genomic data to collaborators
around the world. Each of these sounds like a massive undertaking, but as we will
see, cloud services can make each of them surprisingly easy.
In this regard, a cloud serves as a platform: an environment that allows you
to develop, run, and manage applications without the need to set up, run, and
maintain the hardware and software infrastructure that would otherwise be needed
to host those applications. In the environmental sensor example, you can receive,
enqueue, and process events without having to write the complex software that
would normally be used to perform those tasks. Your application can, furthermore,
scale to handle more events automatically, without you having to implement
specialized load balancing logic. Need access control? Archiving? Auditing? Data
analytics? Each of these capabilities is easily available.
This concept of a platform is not new to science and engineering. Many people
use Matlab, Mathematica, SPSS, SAS, R, or Python, each of which provides
capabilities that simplify the development of certain classes of application. When
hosted on the cloud, these same tools can become collaborative information-
processing laboratories.
In general, then, a cloud platform comprises a set of software components that
are operated by the cloud provider and that software developers can incorporate
into their applications, for example by REST API calls. Many systems satisfy
this broad definition, and a surprising number of those systems have been used
in science and engineering in one way or another. (Scientists and engineers are
enterprising people!) For example, Facebook provides a set of programming
interfaces and tools that developers can use to integrate with the “social graph”
that Facebook maintains of personal relations and information. Researchers have
used this platform’s capabilities to implement scientific collaboration systems and
even peer-to-peer resource-sharing systems in which Facebook friends share storage
space on their computers [88]. The Twitter and Salesforce platforms have seen
similar use.
The number of cloud platform capabilities is so large that we cannot hope to
do them justice here. Instead, we focus on four classes of cloud platform services:
• Data analytics, as implemented with the Hadoop and YARN tools including
Spark. We show how data analytics can be used on Amazon Elastic MapReduce,
Azure HDInsight, and Google’s Cloud Datalab. We also look at data warehouse
tools such as Azure Data Lake and Amazon Athena.
• Streaming data services, which have become a fully integrated part of
the public cloud landscape. Amazon Kinesis and its analytics tools, along
with Azure Event Hubs and Stream Analytics, are easily used and powerful.
The open source community also has developed a rich collection of tools for
monitoring and analyzing streaming data.
• Machine learning services, which combine open source libraries and
interactive cloud-based development environments to provide exciting new
capabilities. Deep learning is revolutionizing the field because of the
availability of extremely large data collections and powerful computing platforms.
• Globus platform services, which provide identity, group, and research
data management capabilities that simplify the development of applications
and systems that integrate people and data at disparate locations, such as
research data management portals.