Chapter 1
Orienting in the cloud universe
“I’ve also grown weary of reading about clouds in a book. Doesn’t this
piss you off? You’re reading a nice story, and suddenly the writer has
to stop and describe the clouds. Who cares?”
—George Carlin, “Seven Things I’m Tired Of
We start this journey into cloud computing for science and engineering by introducing
important concepts and the structure of the book, and reviewing tools that
you should know in order to obtain the most value from this material.
1.1 Cloud: Computer, assistant, and platform
Scientists and engineers can apply cloud capabilities in their work in many different
ways. We find it useful to think in terms of three categories of use.
First, a cloud is an elastic computer: a source of on-demand computing
and storage that you can call upon when you need computing or storage capacity
larger than, or different from, what is available locally. Accessing this capacity
in the cloud may be cheaper, faster, and/or more convenient than acquiring and
operating your own computing and storage systems. While there are differences
between the cloud computing and storage offerings from different cloud providers,
they provide quite similar capabilities: in particular, object storage and execution
of virtual machines and containers. We cover this infrastructure as a service
(IaaS) technology and its applications in Parts I and II.
Figure 1.1: Scientists can use clouds in three distinct ways: as a source of on-demand
computing and storage on which to run their own software (left); as a source of software
that can be run over the network (center); and as a source of new platform capabilities that
can allow development of new types of software (right).
Second, a cloud is a tireless laboratory assistant: a source of powerful
software that can perform certain tasks more effectively and/or cheaply than you
can yourself: for example, Academia.edu, Google Scholar, and ResearchGate to
access information about publications, facilitating research and citation; GitHub to
manage software and documents, facilitating collaboration, software sharing, and
reproducibility; Google Docs, Box, and Dropbox to share data; Science Exchange
to order experiments; Figshare for publishing data; Globus to move and manage
large data; Skype and other services for communication; and many others. In each
case, you can avoid substantial cognitive, administrative, and financial burdens
that you or members of your laboratory would incur if they had to perform these
tasks themselves. These software as a service (SaaS) capabilities are important,
but are largely out of scope for this book, although we do discuss how to build
your own software as a service in Chapter 14.
Third, a cloud is a programming platform: that is, a collection of powerful
software mechanisms that you can use to build software with capabilities that
would be difficult or expensive to duplicate in your own lab: for example, an event
processing system that can process millions of events per second, a database that can
scale to billions of rows, an identity management service that can handle dozens of
different identity providers, a data transfer service that can move terabytes securely
and reliably, or a service that is replicated in multiple geographic regions to ensure
continuous operations. These platform capabilities are arguably the most exciting
part of cloud computing, because they enable individual programmers to create
and operate software systems that would otherwise require large teams. They allow
the cloud to be used as an interactive environment for large-scale computational
experimentation and discovery. They can also be the most challenging to use
effectively, because they have often been developed for use cases rather different
from traditional technical computing. In addition, it is in this area that we see the
biggest variation across cloud vendors in terms of capabilities and interfaces. We
discuss these platform as a service (PaaS) capabilities in Part III.
Inevitably, the boundaries between these different types of cloud system and
cloud usage are not always crisply defined. For example, a growing number of
software-as-a-service offerings provide APIs that allow them to be used as platform
services, and we often see (as discussed in Part III) platform services enhancing
the value of virtual computer offerings.
1.2 The cloud landscape
The cloud landscape is large, diverse, and complex. The U.S. National Institute of
Standards and Technology lists five essential characteristics of cloud computing:
on-demand self-service, broad network access, resource pooling, rapid elasticity
or expansion, and measured service [197]. Today, thousands of companies offer
services with some or all of these characteristics, from low-level computing and
storage to sophisticated software: see Figure 1.2. But apart from the collaboration
and content management systems listed above, few of the commercial cloud services
shown in the figure are relevant to science and engineering.
One major exception is in the realm of cloud infrastructure: the elastic compute
services that allow individuals to acquire storage and computing on demand. Here,
the landscape is simpler, particularly when we focus on providers with offerings
relevant to science and engineering. (Others specialize in specific products, such
as Oracle for databases or AT&T for telecom.) Three vendors, Amazon, Google,
and Microsoft, dominate the industry, as shown in Table 1.1,
and each has proven useful for science and engineering. We focus in this book
on the services provided by those three providers and by one academic research
cloud, Jetstream [122] jetstream-cloud.org. Nevertheless, other cloud providers
are also impressive. For example, the New York-based DigitalOcean is popular
in the software engineering and cloud application development community, while
Rackspace supports those using the Amazon and Microsoft clouds as well as
running its own cloud servers. European cloud providers include 1&1, UpCloud,
City Cloud, CloudSigma, CloudWatt, and Aruba Cloud. Large telecommunications
and search companies, such as China’s Baidu, are also rapidly building cloud data
centers. Together, these various companies operate more than one hundred data
centers around the globe, containing an estimated ten million servers and vast
storage. (We base these estimates on news articles [63, 169, 96, 204].)
Footnote: A shaded, rounded rectangle denotes an https URL, in this case https://jetstream.org.
Figure 1.2: While dated, Bessemer Venture Partners’ picture of the top 300 cloud
computing companies in 2012 conveys the vast range of cloud service providers.
The cloud services operated by Amazon, Google, and Microsoft are commonly
referred to as public clouds, by analogy with the public utilities (power, telephone,
water, sewer, etc.) on which most of us depend in our daily lives. Like public
utilities, they provide computing, storage, and/or other services to any member
of the public with the ability to pay. (Public clouds are not regulated like public
utilities, leading some to argue that the term is inappropriate.)
Table 1.1: Major cloud infrastructure providers of relevance to research.
Amazon     Market leader. Computing, storage, and platform services. Extensively used in science and engineering.
Microsoft  Second biggest player. Computing, storage, and platform services with both individual and enterprise customers.
Google     Began with a service called App Engine and is now using that experience to release a full suite of cloud capabilities.
In contrast, a private cloud is operated by a private institution or individual
to provide computing, storage, and/or other services to a more limited audience.
We can think of them as being analogous to a private electricity generator, although
the utility that they provide is not electricity but on-demand access to computing,
storage, or software. Private clouds are frequently deployed in larger enterprises.
IBM, VMware, and Microsoft are major providers of proprietary solutions for building
on-premises cloud-like systems. OpenStack openstack.org is the dominant
open source cloud software solution, particularly in the US; this software is used,
for example, by Jetstream and in some public clouds, such as that operated by
Rackspace. OpenNebula [202] opennebula.org is another prominent open source
solution, used extensively in Europe. The European Union has announced plans
for a Europe-wide science cloud [163] by 2018 hnscicloud.eu. There are also
numerous academic cloud projects in Europe.
Figure 1.3: Private (including academic and community), public, and hybrid clouds.
This distinction between public and private clouds may appear minor, but it
has important implications. Because the major public clouds operate at a scale
far larger than any private cloud, they can offer a broad set of powerful features:
for example, elasticity, fine-grained billing, high reliability due to geographic
distribution, a wide variety of resource types, and rich sets of platform services.
Equally importantly, they can achieve substantial economies of scale.
Private clouds, in contrast, typically offer a limited set of cloud-like capabilities:
for example, just the ability to deploy virtual machine instances and store objects.
As we will see in Parts I and II of this book, these capabilities are enough to support
interesting applications. However, the lack of the many other services offered by
the Amazon, Microsoft, and Google platforms limits the range of things that can
be done. Private clouds may nevertheless be preferred in some circumstances,
for example because a specific workload can be run more cost-effectively on an
in-house infrastructure, or because a company or researcher does not wish sensitive
data to leave their premises. In such cases, so-called hybrid clouds may be used
to run selected tasks on public clouds: a process that is sometimes termed cloud
bursting. (Cloud computing has inspired some terrible terminology!)
A community cloud is a private cloud deployment designed to support a
specific community: for example, the genomics community or a set of companies
or academic institutions who want to share resources. The term academic cloud
is sometimes used to refer to a private or community cloud focused on the needs
of the academic community. Figure 1.3 depicts these different cloud types.
Private or public? The merits of private clouds are hotly debated. Proponents
of private clouds argue that it can be significantly cheaper to acquire and operate
dedicated computers and storage systems than to buy time on a public cloud. To give
just one example, let’s consider the problem of providing online access to a petabyte
of data. Storing that petabyte for a year in the Amazon public cloud object store
would cost, as of January 2017, $252,000 in the Amazon US East region. In contrast,
you can buy a 1 PB capacity SpectraLogic Verde system, a high-capacity storage
device, for $75,000, and that system of course should be usable for several years.
A second frequently cited reason for using a private cloud is the need to protect
sensitive data. Increasingly, such data can be stored on public cloud resources from
a regulatory and policy perspective, as we discuss in Chapter 15, but again the costs
can be daunting, particularly if your institution provides you with secure storage
and computing at subsidized rates. (If they do not, then the public cloud can enable
research that you could not undertake otherwise.)
Critics respond that private cloud enthusiasts underestimate the costs associated
with creating and running a cloud computing system, the difficulties inherent in
achieving high reliability and security, and the benefits of a truly elastic cloud that
always has available capacity. (Returning to your petabyte, who pays for power, space,
operations, and support? What about backups? And if you don’t need online access,
Amazon provides an archival storage service, Glacier, that can store that
same petabyte for just $48,000 for a year, with automated migration of infrequently
used data from object store to archive.) The question of whether to build or buy is
complex, with the answer depending on many factors. Suffice to say that you should
be careful to consider all relevant factors when choosing your cloud solution.
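The arithmetic behind these figures can be made explicit. The following sketch uses only the prices quoted above (as of January 2017). The decimal petabyte (one million gigabytes) and the one-, three-, and five-year horizons are our own assumptions for illustration, and the calculation deliberately ignores power, space, operations, support, and backups, which critics argue are substantial.

```python
# Prices quoted in the text (January 2017); everything else is an assumption.
S3_PER_PB_YEAR = 252_000       # Amazon object store, US East, dollars per PB-year
GLACIER_PER_PB_YEAR = 48_000   # Amazon Glacier archive, dollars per PB-year
DEVICE_PRICE = 75_000          # one-time purchase of a 1 PB storage device

GB_PER_PB = 1_000_000          # assumption: decimal petabyte
MONTHS_PER_YEAR = 12


def per_gb_month(dollars_per_pb_year):
    """Convert a yearly per-petabyte price to dollars per GB per month."""
    return dollars_per_pb_year / (GB_PER_PB * MONTHS_PER_YEAR)


def cumulative_costs(years):
    """Return (object store, archive, device purchase) dollar totals after `years` years."""
    return (S3_PER_PB_YEAR * years, GLACIER_PER_PB_YEAR * years, DEVICE_PRICE)


if __name__ == "__main__":
    print(f"Object store: ${per_gb_month(S3_PER_PB_YEAR):.3f} per GB-month")
    print(f"Archive:      ${per_gb_month(GLACIER_PER_PB_YEAR):.3f} per GB-month")
    for years in (1, 3, 5):    # hypothetical planning horizons
        store, archive, device = cumulative_costs(years)
        print(f"{years} yr: object store ${store:,}, archive ${archive:,}, device ${device:,}")
```

The device figure covers purchase only, so the comparison shows the size of the gap that operating costs would have to fill, not which option is cheaper overall.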
We focus primarily in this book on public clouds, as they tend to be more
capable, more accessible, and easier to use than other clouds. However, we do
include material on the Eucalyptus and OpenStack software packages that are commonly
used to create private, community, and academic clouds, and on the Jetstream
academic cloud. Table 1.2 lists some private clouds from the academic community.
Table 1.2: Some private research clouds and their characteristics.
Name        Description
Aristotle   Hybrid cloud for academic research, integrating Eucalyptus private cloud clusters and public cloud providers. federatedcloud.org
Bionimbus   A cloud-based infrastructure for managing, analyzing and sharing genomics datasets. bionimbus.opensciencedatacloud.org
Chameleon   A configurable experimental environment for large-scale cloud research. chameleoncloud.org
Jetstream   Cloud computing for the U.S. academic community, operated as part of the XSEDE research network. jetstream-cloud.org
RedCloud    Subscription-based cloud that provides virtual servers and storage on demand. www.cac.cornell.edu/redcloud/
1.3 A guide to this book
This book has been written with you, the student, in mind. (Even if you are a
senior scientist or engineer, we know that you are still a student at heart!) Your
discipline may be physics, astronomy, biology, engineering, computer science, the
humanities, or one of the newer disciplines called computational or data science.
You may have come to this book because you have heard of new ways of computing
in the cloud and want to learn whether they matter to you. Perhaps:
• you have a lot of data that must be analyzed by remote collaborators;
• your current computing platform (e.g., your laptop) is no longer big enough
for your needs, and you lack access to a large cluster or supercomputer;
• you have access to a supercomputer, but it does not work well for interactive
data analysis and collaboration tasks;
• you want to apply new computational methods, such as machine learning or
stream analytics, that are hard to install, operate, and scale; or
• you want to make software or data available to your community as a service.
We organize this book into five parts (see Figure 1.4),
covering the following topics:
1. Managing data in the cloud: We describe the various types of data
storage systems that are available for use in the cloud, and illustrate how you
can interact with these services using a cloud portal or directly with code.
2. Computing in the cloud: Here we explore the spectrum of cloud computing
capabilities. These range from deploying single virtual machines
or containers to support basic interactive science experiments to clusters of
machines to do data analytics or traditional HPC computation.
3. Cloud as platform: Beyond data storage and computing there are high-level
services that are particularly well suited to research applications. We examine
data analysis, machine learning, and streaming data analysis methods. We
also look at some specialized cloud tools designed specifically for science.
4. Building your own cloud: It is possible to build a basic cloud from
scratch using some powerful open source software packages. We describe two
examples and some of the tools needed.
5. Security and other topics: Security is always a major concern for any
online activity. We address this topic at the end of the book, not because it
is unimportant but because managing security requires an understanding of
cloud architecture, as presented in previous chapters. We also consider some
concerns and thoughts about future cloud evolution.
1.4 Accessing the cloud: Web, APIs, and SDKs
We have explained how the cloud can be used variously as a virtual computer,
assistant, or platform. But how exactly do you use it for each of these things? We
provide details in later chapters, but let us first explain some basic concepts.
1.4.1 Web interfaces, APIs, SDKs, and CLIs
Most cloud services can be accessed in multiple ways. First, most support access via
the web, thus permitting intuitive point and click access without any programming
or even local software installation (beyond a web browser) on your part. The
availability of such intuitive interfaces is part of the attraction of cloud services.
Figure 1.4: The cloud for science, from the ground up. [Diagram of the five parts of the book: Part I, Managing data in the cloud; Part II, Computing in the cloud; Part III, The cloud as platform; Part IV, Building your own cloud; Part V, Security and other topics.]
A web interface becomes tedious if the same or similar actions must be performed
repeatedly. In such cases, you likely want to write programs that issue
requests to cloud services on your behalf. Fortunately, most cloud services support
such programmatic access. Typically, they support a Representational State
Transfer (REST) application programming interface (API) that permits requests
to be transmitted via the secure Hypertext Transfer Protocol (HTTPS) that is
used by web browsers. (This common use of HTTPS is not a coincidence: the web
interfaces discussed in the first paragraph are often implemented via browser-hosted
JavaScript programs that generate such REST messages.) REST APIs are the key
to programmatic interactions with cloud services.
The meaning of REST. This term was introduced by Roy Fielding in 2000 [121],
who defined a set of principles that should be followed to build distributed systems
that have desirable properties of the World Wide Web, such as performance, reliability,
scalability, and simplicity. These principles define that, among other things, a REST
(or RESTful) web service should refer to objects by uniform resource identifiers,
such as myserver.org/myobject, and that operations on these objects should be
performed via HTTP operations, with for example a PUT being usually interpreted
as a request to create an object and a GET as a request to access its contents. We
give examples of REST operations below.
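The convention just described can be illustrated with a toy, in-memory model. This is only a sketch: the `handle` function and the status codes it returns are our own illustrative choices, not part of any real service, and no network traffic is involved.

```python
# A toy model of REST semantics: objects are named by URIs, and the HTTP
# method determines the operation. Nothing here talks to a real service.
store = {}  # maps URI -> bytes


def handle(method, uri, body=None):
    """Apply an HTTP-style method to the object named by `uri`.

    Returns a (status_code, payload) pair, loosely following HTTP conventions.
    """
    if method == "PUT":            # create (or replace) the object
        store[uri] = body or b""
        return 200, b""
    if method == "GET":            # access the object's contents
        if uri in store:
            return 200, store[uri]
        return 404, b""
    if method == "DELETE":         # remove the object
        if store.pop(uri, None) is not None:
            return 204, b""
        return 404, b""
    return 405, b""                # method not supported in this sketch


if __name__ == "__main__":
    print(handle("PUT", "myserver.org/myobject", b"hello"))
    print(handle("GET", "myserver.org/myobject"))
    print(handle("DELETE", "myserver.org/myobject"))
    print(handle("GET", "myserver.org/myobject"))
```

The point of the sketch is that the URI names the object while the method alone selects the operation, which is what lets browsers, command-line tools, and SDKs all talk to the same service.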
One way to interact with cloud services programmatically is to write programs
that generate REST messages directly. However, while constructing REST messages
“by hand” may appeal to hard-core system programmers, you will normally want
to access cloud services via software development kits (SDKs) that you install
on your computer. Such SDKs permit access from programming languages such as
Python (our choice in this book), C++, Go, Java, PHP, and Ruby. (Sorry, Fortran
programmers, but Fortran SDKs are few and far between.) They typically render
operations on cloud services in ways that are consistent with the programming
model of the language in question. Cloud vendors typically provide SDKs for
accessing their services, but there are also good open source ones available, and if
you do not like any of them, you are free to develop your own.
Accessing a cloud service. We use a simple example to illustrate these different
approaches to accessing cloud services. Consider the Amazon Simple Storage Service
(S3), which, as we describe in Chapter 2, allows you to create and access containers
called buckets, within which you can store and retrieve byte strings called objects.
The Amazon web interface allows you to interact with S3 simply by pointing and
clicking. For example, Figure 1.5 shows it being used to create
a new bucket called cloud4sciencebucket, located within the US Standard region.
(Amazon, like other cloud providers, operates many data centers around the world.
The US Standard region is located in northern Virginia.) Such intuitive interfaces
that can be used without any programming or even local software installation (beyond
a web browser) on the part of the user are part of the attraction of cloud services.
S3 also defines a REST API that you can use to manipulate buckets and objects
programmatically. Thus, instead of using the Amazon web interface, we could have
created the bucket with name cloud4sciencebucket via a PUT request on the URI
cloud4sciencebucket.s3.amazonaws.com. The following shows the syntax of this
PUT request, although omitting some of the header fields for simplicity.
PUT / HTTP/1.1
Host: cloud4sciencebucket.s3.amazonaws.com
Content-Length: length
Date: date
Authorization: authorization string
<CreateBucketConfiguration
xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<LocationConstraint>US Standard</LocationConstraint>
</CreateBucketConfiguration>
Similarly, a DELETE operation on the same URI requests deletion of the bucket
that we just created, and a GET operation on that URI returns some or all of the
objects that may have subsequently been placed in the bucket.
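To see that these three operations differ only in their HTTP method, the following sketch builds (but does not send) the corresponding requests with Python's standard `urllib.request` module. It is illustrative only: the helper name `bucket_request` is ours, and a real S3 request must also carry the date and authorization headers shown in the PUT example, which we omit, so these request objects are for inspection only.

```python
from urllib.request import Request

# URI of the bucket from the example above.
BUCKET_URI = "https://cloud4sciencebucket.s3.amazonaws.com/"


def bucket_request(method):
    """Build, without sending, an HTTP request on the bucket URI."""
    return Request(BUCKET_URI, method=method)


if __name__ == "__main__":
    for method, meaning in [("PUT", "create the bucket"),
                            ("GET", "list objects in the bucket"),
                            ("DELETE", "delete the bucket")]:
        req = bucket_request(method)
        print(f"{req.get_method():6s} {req.full_url}  # {meaning}")
```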
In later chapters, we describe such cloud service APIs and SDKs for a range
of cloud infrastructure and platform services. Not covered in this book, but also
interesting, are the APIs and SDKs provided by many of the SaaS offerings listed
above: for example, Dropbox, Google Docs, LinkedIn, Science Exchange, and GitHub.
(Not all SaaS provide APIs: Google Scholar and ResearchGate, sadly, do not.)
The fact that we can easily access most cloud services both via web browser and
programmatically is one of the reasons why cloud computing has proved so impactful.
Finally, we show how an SDK simplifies interactions with cloud services. The
following Python code uses the Boto3 SDK to interact with Amazon S3. We obtain
an S3 resource; delete the bucket created previously with the REST API; create the
bucket again; and upload a file to the newly created bucket.
import boto3
# US Standard corresponds to the us-east-1 region
s3 = boto3.resource('s3', region_name='us-east-1')
# Delete the bucket previously created with the REST API
s3.Bucket('cloud4sciencebucket').delete()
# Create that bucket again. No LocationConstraint is given because
# us-east-1 is the default location; other regions require
# CreateBucketConfiguration={'LocationConstraint': region}
bucket = s3.create_bucket(Bucket='cloud4sciencebucket')
# Upload a file 'test.jpg' into the newly created bucket
with open('test.jpg', 'rb') as f:
    bucket.put_object(Key='test.jpg', Body=f)
Figure 1.5: The Amazon S3 web interface at console.aws.amazon.com/s3, here being
used to create a new bucket called cloud4sciencebucket in the US Standard region.
1.4.2 Local and cloud-hosted applications
The fact that we can, in a few lines, write programs that result in sophisticated
actions occurring in a cloud service is exciting. But where should those programs
run? One obvious location is your laptop or workstation, and indeed that may be
the right place for many purposes. For example, we might use a slightly expanded
version of the example program to upload 1,000 files from our laptop to S3.
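Such an expanded program might look like the following sketch. The function name `upload_directory` is hypothetical, and we assume that the bucket object behaves like a Boto3 S3 Bucket resource, whose `upload_file` method takes a local file name and an object key. Each file is stored under its own file name.

```python
from pathlib import Path


def upload_directory(bucket, directory):
    """Upload every regular file directly inside `directory` to `bucket`.

    `bucket` is expected to behave like a Boto3 S3 Bucket resource, i.e., to
    provide upload_file(Filename, Key). Returns the list of keys uploaded.
    """
    keys = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():                            # skip subdirectories
            bucket.upload_file(str(path), path.name)  # key = file name
            keys.append(path.name)
    return keys
```

With Boto3 installed, AWS credentials configured, and the bucket already created, it could be called as `upload_directory(boto3.resource('s3').Bucket('cloud4sciencebucket'), './data')`, where `./data` is a hypothetical local directory holding the files.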
However, in other cases, we want to run a program elsewhere: for example,
because we want the program to keep running once we close our laptop, we cannot
easily install required software on our local computer, or our program is intended
to provide services to other people. In such cases, a natural thing is to run our
program in the cloud. We discuss this topic in detail in Part II of this book, where
we see that we can create a cloud-hosted virtual computer, via either web interfaces
or APIs/SDKs, much as we created a bucket in the example above.
In summary, the cloud can be viewed both as a source of services and as a
place to run programs. Cloud services can be accessed from web browsers or from
programs—programs that can themselves run locally or in the cloud. It is this
diversity of usage modalities, and the relative simplicity of the methods by which
these usage modalities are employed, that accounts for the power of the cloud.
1.5 Tools used in this book
We make extensive use in both this book and its supporting online notebooks of
some standard tools that go beyond the world of cloud computing: the Python
programming language, Jupyter web-based computing tool, GitHub version control
and collaboration system, and Globus research data management service. We
recommend that any researcher who aspires to become proficient in scientific
computing master all four systems. All are quite accessible and are supported by
excellent online resources. Time spent mastering them will be repaid many times
over in more productive research. We give a brief introduction to each here.
1.5.1 Python
You need some basic programming knowledge to get the most out of this book.
Most science and engineering students know the Python programming language,
so we use Python for our programming examples. If you do not know Python, the
book should still be interesting, but trust us, Python is easy and also tremendously
fun. Learn the basics, at least.
The easiest way to get Python working on your computer is to install the free
Anaconda distribution provided by Continuum Analytics continuum.io/downloads.
This distribution includes multiple tools for installing and updating both Python
and installed packages. It is separate from any OS-level version of Python, and is
easy to completely uninstall. It works well on Windows, Mac, and Linux.
Alternatively, you can also create your Python environment manually, installing
Python, package managers, and Python packages separately. Packages like NumPy
and Pandas can be difficult to get working, however, particularly on Windows.
Anaconda simplifies this setup considerably, regardless of your OS.
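Whichever route you choose, a quick check that the environment is usable can save time later. The sketch below reports which of a few packages can be found by the running interpreter. The function name `check_packages` and the package list (import names `numpy`, `pandas`, and `boto3`) are our own choices for illustration.

```python
import importlib.util
import sys


def check_packages(names):
    """Return a dict mapping each top-level package name to True if importable."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for name, found in check_packages(["numpy", "pandas", "boto3"]).items():
        print(f"{name:8s} {'found' if found else 'MISSING'}")
```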
1.5.2 Jupyter: An interactive, web-based computing tool
To facilitate access to the various methods and tools presented in this book, we
provide complete source code for most code examples in the form of Jupyter
notebooks. Jupyter Notebook, or simply Jupyter, is a web application that
allows you to create and share documents (“notebooks”) containing live code,
equations, visualizations, and explanatory text. Figure 1.6 shows
what Jupyter looks like in your web browser. The code for the notebook in this
figure is in the code repository as Notebook 1, as documented in Chapter 17.
To install Jupyter for Python, use the Python package installer pip or download
and install Anaconda from Continuum Analytics. Later in this book we demonstrate
how to install Python and Jupyter as a virtual machine or Docker container running
in a remote cloud server.
Our use of Jupyter emphasizes that cloud computing lends itself to interactive
exploration. We make almost all of the examples in this book available as Jupyter
notebooks. Most were developed by the authors during interactive sessions using
one or more of the cloud platforms described in this book.
Figure 1.6: A sample Jupyter notebook. This notebook includes four cells. The first is a
markdown cell, i.e., one containing text. The following three provide Python code that
can be run from your web browser, with the last producing a visualization.
1.5.3 The GitHub version control system
We also recommend that you master GitHub. A version control system is a tool for
keeping track of changes that have been made to a document over time. GitHub is
a hosting service for projects that use the Git version control system. Both the Git
tool and the GitHub site are increasingly often used by researchers to create digital
lab notebooks that record the data files, programs, papers, and other resources
associated with a project, with automatic tracking of the changes that are made
to those resources over time [240]. GitHub also makes it easy for collaborators to
work together on a project, whether a program or a paper: changes made by each
contributor are recorded and can easily be reconciled. For example, we used GitHub
to create this book, with both the authors and reviewers checking in changes and
comments at different times and time zones. Ram [222] provides a nice description
of how Git/GitHub can be used to promote reproducibility and transparency in
research. We also use GitHub to provide access to the online notebooks that
accompany this text. You can find the repository at SciEngCloud.github.io.
1.5.4 Globus
We also use the Globus software as a service globus.org in our examples, and think
that you will find it useful as well. This cloud-hosted service implements research
data, identity, and credential management capabilities that can greatly simplify
the development of cloud applications that need to access resources located on
university campuses, national computing centers, and other facilities. It comprises a
set of cloud-hosted software-as-a-service services (data transfer and synchronization,
authorization, data sharing, data publication, data search, and groups), plus a
simple software component, Globus Connect, deployable on computers associated
with storage systems, including laptops, lab servers, campus compute clusters,
cloud storage, and scientific instruments. REST APIs and a Python SDK simplify
integration into applications. We provide more information about Globus in later
chapters, as we introduce examples that demonstrate its use to manage access to
data, replicate data across sites, and publish data, among other things.
1.6 Summary
Pioneering scientists and engineers are already using the cloud in their work, often
in areas in which new data sources or modeling methods require more or different
resources than are easily available in the laboratory: for example, in the analysis
of urban [76, 84, 266] and environmental [110, 117, 137, 142, 252] data and in
biomedical data analysis and modeling [33, 73, 188, 203, 260]. We are certain that
many more researchers are reaching similar turning points and are thus finding
themselves needing to think about computing in new ways. This book is for them
and for a generation of computer scientists and engineers who recognize that cloud
computing is going to be essential for their careers.
It is almost a futile exercise to write a practical, hands-on book about a
technology that is evolving as rapidly as cloud. Our choices of major vendors and
services may well be rendered obsolete by new developments. Nevertheless, we
expect that most core concepts and tools will remain valid for a long time. The
Unix operating system, hot in the 1970s, lives on in Linux today; the Python
programming language has been with us for 25 years. As new and better ideas
emerge with implications for science and engineering, whether incremental or
revolutionary, we will strive to update the online resources and, as and when
feasible, produce revised versions of this text.
1.7 Resources
The U.S. National Institute of Standards and Technology provides a useful definition
of cloud computing, as “a model for enabling ubiquitous, convenient, on-demand
network access to a shared pool of configurable computing resources (e.g., networks,
servers, storage, applications and services) that can be rapidly provisioned and
released with minimal management effort or service provider interaction” [197].
We recommend Charles Severance’s Python for Informatics: Exploring Information
[233], which covers basic Python and provides material relevant to web
data and MySQL. This book is freely available online and is supported by excellent
online lectures and exercises.
There are many Jupyter resources, starting with the main Jupyter site, jupyter.org.
Fernando Pérez and Brian Granger have an excellent blog called the “State of
Jupyter” [219] that is both a history and a look ahead.
Each public cloud has a portal where you can learn about its services,
obtain free accounts with modest-sized allocations for experimentation, and track
your data resources and compute activities: Amazon (aws.amazon.com), Microsoft
(azure.microsoft.com), and Google (cloud.google.com). Amazon and Microsoft
grant programs can provide larger allocations: see aws.amazon.com/grants and
research.microsoft.com/azure. The latter sites list examples of cloud use in research,
as do reports by Gannon et al. [135] and Lifka et al. [182].
To access the NSF-funded Jetstream cloud you need an allocation through the
XSEDE program. If you are a U.S. academic researcher you can qualify for an
allocation. Details are at jetstream-cloud.org and www.xsede.org.
Bibliography
[1] Access control at the project level.
https://cloud.google.com/storage/docs/access-control/iam.
[2] Apache Flink dataflow programming model.
https://ci.apache.org/projects/flink/flink-docs-release-1.2/concepts/programming-model.html.
[3] Assignments for Udacity deep learning class with TensorFlow.
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/udacity.
[4] AWS Case Study: Animoto. https://aws.amazon.com/solutions/case-studies/animoto/.
[5] AWS Identity and Access Management best practices.
http://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html.
[6] Azure Batch Shipyard recipes.
https://github.com/Azure/batch-shipyard/tree/master/recipes.
[7] Azure Data Lake Store Python SDK.
https://github.com/Azure/azure-data-lake-store-python.
[8] Azure: Deploy a slurm cluster.
https://github.com/Azure/azure-quickstart-templates/tree/master/slurm/README.md.
[9] Bare metal on OpenStack: Ironic. https://wiki.openstack.org/wiki/Ironic.
[10] CentOS 7 / RHEL 7 Open ports.
http://www.linuxbrigade.com/centos-7-rhel-7-open-ports/.
[11] Cloudbridge documentation. http://cloudbridge.readthedocs.io/en/latest/.
[12] Containers on OpenStack: Magnum. https://wiki.openstack.org/wiki/Magnum.
[13] Deep learning AMI Amazon Linux version.
https://aws.amazon.com/marketplace/pp/B01M0AXXQB.
[14] Euca2ools overview.
https://docs.hpcloud.com/eucalyptus/4.3.0/euca2ools-guide/index.html.
[15] Eucalyptus EDGE network configuration.
https://docs.eucalyptus.com/eucalyptus/4.3/install-guide/nw_edge.html.
[16] Eucalyptus installation guide.
https://docs.eucalyptus.com/eucalyptus/latest/shared/install_section.html.
[17] Eucalyptus network configuration requirements.
https://docs.hpcloud.com/eucalyptus/4.3.0/install-guide/preparing_firewalls.html.
[18] Eucalyptus: Plan services placement.
https://docs.eucalyptus.com/eucalyptus/latest/install-guide/services_understanding.html.
[19] Eucalyptus: Planning networking modes.
https://docs.eucalyptus.com/eucalyptus/latest/install-guide/planning_networking_modes.html.
[20] Galaxy on Jetstream. https://wiki.galaxyproject.org/Cloud/Jetstream.
[21] Get started: Create Apache Spark cluster on HDInsight Linux and run interactive
queries using Spark SQL.
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-jupyter-spark-sql/.
[22] Globus endpoint activation.
https://docs.globus.org/api/transfer/endpoint_activation/.
[23] Google Cloud Dataflow: Complete Examples.
https://cloud.google.com/dataflow/examples/all-examples.
[24] Google Cloud Datalab Quickstart.
https://cloud.google.com/datalab/docs/quickstarts/quickstart-local.
[25] IBM Analytics Stream Computing.
http://www.ibm.com/analytics/us/en/technology/stream-computing/.
[26] The Kubernetes project. http://kubernetes.io.
[27] Layers library reference. https://www.cntk.ai/pythondocs/layerref.html.
[28] Linux RAID. https://raid.wiki.kernel.org/index.php/Linux_Raid.
[29] Machine Learning Library (MLlib) guide.
https://spark.apache.org/docs/latest/ml-guide.html.
[30] Making secure requests to Amazon Web Services.
https://aws.amazon.com/articles/1928.
[31] Microsoft Azure Event Hubs.
https://azure.microsoft.com/en-us/services/event-hubs/.
[32] Microsoft Azure Stack. https://azure.microsoft.com/en-us/overview/azure-stack/.
[33] NCBI BLAST on Windows Azure.
https://www.microsoft.com/en-us/download/details.aspx?id=52513.
[34] Ocean Observatories Initiative. http://oceanobservatories.org.
[35] The Open Compute Project. http://opencompute.org.
[36] OpenStack documentation: CPU topologies.
https://docs.openstack.org/admin-guide/compute-cpu-topologies.html.
[37] OpenStack in production: Hints and tips from the CERN OpenStack cloud team.
http://openstack-in-production.blogspot.co.uk/.
[38] OpenStack Newton release notes. https://www.openstack.org/software/newton/.
[39] OpenStack: Operators mailing list.
http://lists.openstack.org/pipermail/openstack-operators/.
[40] OpenStack: Scientific working group.
https://wiki.openstack.org/wiki/Scientific_working_group.
[41] Predict with pre-trained models.
http://mxnet.io/tutorials/python/predict_imagenet.html.
[42] Rados object storage utility. http://docs.ceph.com/docs/giant/man/8/rados/.
[43] Riak cloud storage. http://docs.basho.com/riak/cs/2.1.1/.
[44] Sample applications built using Amazon Machine Learning.
https://github.com/awslabs/machine-learning-samples.
[45] Spark SQL, DataFrames and Datasets guide.
http://spark.apache.org/docs/latest/sql-programming-guide.html.
[46] The Red Hat Package Manager. https://en.wikipedia.org/wiki/RPM_Package_Manager.
[47] Theano deep learning library. http://www.deeplearning.net/software/theano/.
[48] Transferring RDA data with Globus.
http://ncarrda.blogspot.com/2015/06/transferring-rda-data-with-globus.html.
[49] TripleO online documentation. http://tripleo.org.
[50] VMware Cloud Foundation. https://www.vmware.com/products/cloud-foundation.html.
[51] Welcome to Bridges. https://www.psc.edu/index.php/resources/computing/bridges.
[52] What is IAM? https://docs.aws.amazon.com/IAM/latest/UserGuide/Introduction.html.
[53] Setup Linux Network Bridges on CentOS for Nova Networking, Nov 2015.
https://platform9.com/support/setup-network-bridges-on-centos-nova-networking/.
[54] OpenStack user survey, Oct 2016.
https://www.openstack.org/assets/survey/October2016SurveyReport.pdf.
[55] Using AWS in the context of New Zealand privacy considerations. Technical report,
Oct. 2016.
https://d0.awsstatic.com/whitepapers/compliance/Using_AWS_in_the_context_of_New_Zealand_Privacy_Considerations.pdf.
[56] G. Agha. An overview of actor languages. SIGPLAN Notices, 21(10):58–67, June
1986.
[57] T. Akidau. The world beyond batch: Streaming 102, Jan 2016.
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102.
[58] T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, R. J. Fernández-Moctezuma,
R. Lax, S. McVeety, D. Mills, F. Perry, E. Schmidt, and S. Whittle. The dataflow
model: A practical approach to balancing correctness, latency, and cost in
massive-scale, unbounded, out-of-order data processing. Proceedings of the VLDB
Endowment, 8(12):1792–1803, Aug. 2015.
[59] T. Akidau and F. Perry. Dataflow/Beam and Spark: A Programming
Model Comparison, Feb 2016.
https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison.
[60] A. Aliper, S. Plis, A. Artemov, A. Ulloa, P. Mamoshina, and A. Zhavoronkov. Deep
learning applications for predicting pharmacological properties of drugs and drug
repurposing using transcriptomic data. Molecular Pharmaceutics, 2016.
[61] W. Allcock, J. Bresnahan, R. Kettimuthu, M. Link, C. Dumitrescu, I. Raicu, and
I. Foster. The Globus striped GridFTP framework and server. In ACM/IEEE
Conference on Supercomputing, page 54, 2005.
[62] B. Allen, J. Bresnahan, L. Childers, I. Foster, G. Kandaswamy, R. Kettimuthu,
J. Kordas, M. Link, S. Martin, K. Pickett, and S. Tuecke. Software as a service for
data scientists. Communications of the ACM, 55(2):81–88, Feb. 2012.
[63] S. Anthony. How big is the Cloud? ExtremeTech, May 2012.
http://www.extremetech.com/computing/129183-how-big-is-the-cloud.
[64] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan,
M. J. Franklin, A. Ghodsi, et al. Spark SQL: Relational data processing in Spark. In
ACM SIGMOD International Conference on Management of Data, pages 1383–1394,
2015.
[65] P. Bailis and K. Kingsbury. The network is reliable. Queue, 12(7):20, 2014.
[66] R. Barga, J. Goldstein, M. Ali, and M. Hong. Consistent streaming through time:
A vision for event stream processing. In Conference on Innovative Data Systems
Research, pages 363–374, 2007.
[67] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer,
I. Pratt, and A. Warfield. Xen and the art of virtualization. ACM SIGOPS Operating
Systems Review, 37(5):164–177, 2003.
[68] W. Barnett, V. Welch, A. Walsh, and C. A. Stewart. A roadmap for using NSF
cyberinfrastructure with InCommon, 2011.
https://www.incommon.org/federation/cyberroadmap.html.
[69] L. A. Barroso, J. Clidaras, and U. Hölzle. The datacenter as a computer: An
introduction to the design of warehouse-scale machines. Synthesis Lectures on
Computer Architecture, 8(3):1–154, 2013.
[70] S. Beer. Brain of the Firm. Penguin Press, 1972.
[71] D. Bernstein. Containers and cloud: From LXC to Docker to Kubernetes. IEEE
Cloud Computing, 1(3):81–84, 2014.
[72] P. Bernstein, S. Berkov, J. Thelin, and S. Burkhardt. Orleans - Virtual Actors.
http://research.microsoft.com/en-us/projects/orleans/.
[73] K. Bhuvaneshwar, D. Sulakhe, R. Gauba, A. Rodriguez, R. Madduri, U. Dave,
L. Lacinski, I. Foster, Y. Gusev, and S. Madhavan. A case study for cloud based high
throughput analysis of NGS data using the Globus Genomics system. Computational
and Structural Biotechnology Journal, 13:64–74, 2015.
[74] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D.
Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, and J. Zhao. End to end
learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
[75] F. Bonomi, R. Milito, J. Zhu, and S. Addepalli. Fog computing and its role in the
internet of things. In MCC Workshop on Mobile Cloud Computing, pages 13–16.
ACM, 2012.
[76] D. E. Boyle, D. C. Yates, and E. M. Yeatman. Urban sensor data streams: London
2013. IEEE Internet Computing, 17(6):12–20, 2013.
[77] T. Bray. One Amazon year, December 2015.
https://www.tbray.org/ongoing/When/201x/2015/12/01/One-Amazon-Year.
[78] E. Brewer. CAP twelve years later: How the “rules” have changed. Computer,
45(2):23–29, 2012.
[79] E. Brewer. Kubernetes and the path to cloud native. In 6th ACM Symposium on
Cloud Computing, pages 167–167. ACM, 2015.
[80] J. Bryce. Embracing datacenter diversity. In OpenStack Austin. 2016.
https://www.openstack.org/videos/video/embracing-datacenter-diversity.
[81] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: Efficient iterative data
processing on large clusters. Proceedings of the VLDB Endowment, 3(1-2):285–296,
2010.
[82] S. Bugiel, S. Nürnberger, T. Pöppelmann, A.-R. Sadeghi, and T. Schneider.
AmazonIA: When elasticity snaps back. In 18th ACM conference on Computer and
Communications Security, pages 389–400. ACM, 2011.
[83] J. Cantarella, C. Shonkwiler, and E. Uehara. A fast direct sampling algorithm for
equilateral closed polygons, Jan 2017. https://arxiv.org/abs/1510.02466v2.
[84] C. Catlett, T. Malik, B. Goldstein, J. Giuffrida, Y. Shao, A. Panella, D. Eder, E. v.
Zanten, R. Mitchum, S. Thaler, and I. Foster. Plenario: An open data discovery
and exploration platform for urban science. Bulletin of the IEEE Computer Society
Technical Committee on Data Engineering, pages 27–42, 2014.
[85] A. Caulfield, E. Chung, A. Putnam, H. Angepat, J. Fowers, M. Haselman, S. Heil,
M. Humphrey, P. Kaur, J.-Y. Kim, D. Lo, T. Massengill, K. Ovtcharov,
M. Papamichael, L. Woods, S. Lanka, D. Chiou, and D. Burger. A cloud-scale
acceleration architecture. In 49th Annual IEEE/ACM International Symposium on
Microarchitecture, October 2016.
[86] M. Cezar. Setting up NTP (Network Time Protocol) Server in
RHEL/CentOS 7. Tecmint, Mar 2015.
http://www.tecmint.com/install-ntp-server-in-centos/.
[87] K. M. Chandy, O. Etzion, and R. von Ammon. Event processing. Dagstuhl Seminar
Proceedings 10201, Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, Germany,
2011.
[88] K. Chard, S. Caton, O. F. Rana, and K. Bubendorfer. Social cloud: Cloud computing
in social networks. IEEE CLOUD, 10:99–106, 2010.
[89] K. Chard, J. Pruyne, B. Blaiszik, R. Ananthakrishnan, S. Tuecke, and I. Foster.
Globus data publication as a service: Lowering barriers to reproducible science. In
11th IEEE International Conference on eScience, 2015.
[90] R. Chard, K. Chard, K. Bubendorfer, L. Lacinski, R. Madduri, and I. Foster.
Cost-aware cloud provisioning. In IEEE 11th International Conference on e-Science,
pages 136–144, 2015.
[91] R. Chard, R. Madduri, N. Karonis, K. Chard, K. Duffin, C. Ordonez, T. Uram,
J. Fleischauer, I. Foster, M. Papka, and J. Winans. Scalable pCT image
reconstruction delivered as a cloud service. IEEE Transactions on Cloud Computing, 2015.
http://ieeexplore.ieee.org/document/7160740/.
[92] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and
Z. Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous
distributed systems. CoRR, abs/1512.01274, 2015.
[93] Y. Chen, V. Paxson, and R. H. Katz. What’s new about cloud computing security.
University of California, Berkeley Report No. UCB/EECS-2010-5 January, 2010.
[94] T. Cheng and J. Wang. Application of a Dynamic Recurrent Neural Network in
Spatio-Temporal Forecasting, pages 173–186. Springer Berlin Heidelberg, Berlin,
Heidelberg, 2007.
[95] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties
of neural machine translation: Encoder-decoder approaches. arXiv preprint
arXiv:1409.1259, 2014.
[96] J. Clark. 5 numbers that illustrate the mind bending size of Amazon’s cloud.
Bloomberg Global Tech, Nov 2014.
http://www.bloomberg.com/news/2014-11-14/5-numbers-that-illustrate-the-mind-bending-size-of-amazon-s-cloud.html.
[97] Cloud Computing Security Working Group. NIST Cloud Computing Security
Reference Architecture. Special Publication 500-299, National Institute of
Standards and Technology, 2013.
http://collaborate.nist.gov/twiki-cloud-computing/pub/CloudComputing/CloudSecurity/NIST_Security_Reference_Architecture_2013.05.15_v1.0.pdf.
[98] D. T. Cohen, G. W. Hatchard, and S. G. Wilson. Population trends in incorporated
places: 2000 to 2013. Technical Report P25-1142, US Census, Mar 2015.
[99] A. Conesa, P. Madrigal, S. Tarazona, D. Gomez-Cabrero, A. Cervera, A. McPherson,
M. W. Szcześniak, D. J. Gaffney, L. L. Elo, X. Zhang, et al. A survey of best
practices for RNA-seq data analysis. Genome biology, 17(1):13, 2016.
[100] F. J. Corbató and V. Vyssotsky. Introduction and overview of the Multics system.
IEEE Annals of the History of Computing, 14(2):12–13, 1992.
[101] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J. Furman, S. Ghemawat,
A. Gubarev, C. Heiser, P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li,
A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito,
M. Szymaniak, C. Taylor, R. Wang, and D. Woodford. Spanner: Google’s globally
distributed database. ACM Transactions on Computer Systems, 31(3):8, 2013.
[102] T. Cowles, J. Delaney, J. Orcutt, and R. Weller. The Ocean Observatories Initiative:
Sustained ocean observing across a range of spatial scales. Marine Technology
Society Journal, 44(6):54–64, 2010.
[103] D. R. Cox. The regression analysis of binary sequences. Journal of the Royal
Statistical Society. Series B (Methodological), pages 215–242, 1958.
[104] R. J. Creasy. The origin of the VM/370 time-sharing system. IBM Journal of
Research and Development, 25(5):483–490, 1981.
[105] J. Czyzyk, M. P. Mesnier, and J. J. Moré. The NEOS server. IEEE Computational
Science and Engineering, 5(3):68–75, 1998.
[106] E. Dart, L. Rotman, B. Tierney, M. Hester, and J. Zurawski. The Science DMZ: A
network design pattern for data-intensive science. Scientific Programming,
22(2):173–185, 2014.
[107] F. De Carlo. DMagic data management system. http://dmagic.readthedocs.io.
[108] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters.
Communications of the ACM, 51(1):107–113, 2008.
[109] E. Deelman, K. Vahi, M. Rynge, G. Juve, R. Mayani, and R. F. da Silva. Pegasus
in the cloud: Science automation through workflow technologies. IEEE Internet
Computing, 20(1):70–76, 2016.
[110] P. Dhingra, K. Tolle, and D. Gannon. Using cloud-based analytics to save lives.
Cloud Computing in Ocean and Atmospheric Sciences, page 221, 2016.
[111] S. Dieleman. My solution for the Galaxy Zoo challenge, Apr 2014.
http://benanne.github.io/2014/04/05/galaxy-zoo.html.
[112] C. Docan, M. Parashar, and S. Klasky. DataSpaces: An interaction and coordination
framework for coupled simulation workflows. Cluster Computing, 15(2):163–181,
2012.
[113] A. Dubey and D. Wagle. Delivering software as a service. The McKinsey Quarterly,
6(2007):2007, 2007.
[114] D. Eadline. Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data
Computing in the Apache Hadoop 2 Ecosystem. Addison-Wesley, 2016.
[115] G. Eisenhauer, M. Wolf, H. Abbasi, and K. Schwan. Event-based systems:
Opportunities and challenges at exascale. In 3rd ACM International Conference on
Distributed Event-Based Systems, 2009.
[116] S. Ekanayake, S. Kamburugamuve, and G. Fox. SPIDAL: high performance data
analytics with Java and MPI on large multicore HPC clusters. In Spring Simulation
Multi-Conference, pages 3–6, 2016.
[117] J. Elliott, D. Kelly, J. Chryssanthacopoulos, M. Glotter, K. Jhunjhnuwala, N. Best,
M. Wilde, and I. Foster. The parallel system for integrating impact models and
sectors (pSIMS). Environmental Modelling & Software, 62:509–516, 2014.
[118] O. Etzioni. Deep learning isn’t a dangerous magic genie. It’s
just math. Wired, June 2016.
https://www.wired.com/2016/06/deep-learning-isnt-dangerous-magic-genie-just-math/.
[119] B. Familiar. Microservices, IoT and Azure: Leveraging DevOps and Microservice
Architecture to deliver SaaS Solutions. APress, 2015.
[120] M. R. Ferré. Cloud native applications (for dummies), 2014.
http://www.it20.info/2014/12/cloud-native-applications-for-dummies/.
[121] R. T. Fielding. Architectural styles and the design of network-based software
architectures. PhD thesis, University of California, Irvine, 2000.
[122] J. Fischer, S. Tuecke, I. Foster, and C. A. Stewart. Jetstream: A distributed cloud
infrastructure for underresourced higher education communities. In 1st Workshop on
The Science of Cyberinfrastructure: Research, Experience, Applications and Models,
pages 53–61. ACM, 2015.
[123] I. Foster. Globus Online: Accelerating and democratizing science through
cloud-based services. IEEE Internet Computing, 15(3):70–73, May 2011.
[124] I. Foster, K. Chard, and S. Tuecke. The discovery cloud: Accelerating and
democratizing research on a global scale. In IEEE International Conference on Cloud
Engineering, pages 68–77. IEEE, 2016.
[125] I. Foster, R. Ghani, R. S. Jarmin, F. Kreuter, and J. I. Lane, editors. Big Data and
Social Science: A Practical Guide to Methods and Tools. Taylor & Francis Group,
2016. See also http://www.bigdatasocialscience.com.
[126] I. Foster and C. Kesselman. The history of the grid. In High Performance Computing:
From Grids and Clouds to Exascale, pages 3–30. IOS Press, 2011.
[127] I. Foster, Y. Zhao, I. Raicu, and S. Lu. Cloud computing and grid computing
360-degree compared. In Grid Computing Environments Workshop, pages 1–10. IEEE,
2008.
[128] A. Fox, D. A. Patterson, and S. Joseph. Engineering Software as a Service: An
Agile Approach using Cloud Computing. Strawberry Canyon LLC, 2013.
[129] G. Fox and D. Gannon. Using clouds for technical computing, 2013.
http://www.academia.edu/14845479/Using_Clouds_for_Technical_Computing.
[130] G. Fox, S. Jha, and L. Ramakrishnan. Streaming and Steering Applications:
Requirements and Infrastructure. http://streamingsystems.org.
[131] G. C. Fox, R. D. Williams, and G. C. Messina. Parallel computing works! Morgan
Kaufmann, 2014.
[132] B. H. Frank. AWS wants to dominate beyond the public cloud with Lambda updates.
PC World, Dec. 2016. http://www.pcworld.com/article/3147389/.
[133] D. Gannon. Performance Analysis of a Cloud Microservice-based
ML Classifier, Oct 2015.
https://esciencegroup.com/2015/10/08/performance-analysis-of-a-cloud-microservice-based-ml-classifier/.
[134] D. Gannon. CNTK revisited. A new deep learning toolkit release
from Microsoft, Nov 2016.
https://esciencegroup.com/2016/11/10/cntk-revisited-a-new-deep-learning-toolkit-release-from-microsoft/.
[135] D. Gannon, D. Fay, D. Green, K. Takeda, and W. Yi. Science in the cloud: Lessons
from three years of research projects on Microsoft Azure. In 5th International
Workshop on Scientific Cloud Computing, pages 1–8. ACM, 2014.
[136] Gartner Research. Software as a Service (SaaS).
http://www.gartner.com/it-glossary/software-as-a-service-saas.
[137] K. Gee and W. Hunt. Enhancing stormwater management benefits of rainwater
harvesting via innovative technologies. Journal of Environmental Engineering,
142(8):04016039, 2016.
[138] L. George. HBase: The Definitive Guide: Random Access to Your Planet-Size Data.
O’Reilly Media, Inc., 2011.
[139] A. V. Gerbessiotis and L. G. Valiant. Direct bulk-synchronous parallel algorithms.
Journal of Parallel and Distributed Computing, 22(2):251–267, 1994.
[140] S. Goasguen. Enjoy Kubernetes with Python.
https://www.linux.com/learn/kubernetes/enjoy-kubernetes-python.
[141] J. Goecks, A. Nekrutenko, J. Taylor, and T. G. Team. Galaxy: A comprehensive
approach for supporting accessible, reproducible, and transparent computational
research in the life sciences. Genome Biol, 11(8):R86, 2010.
[142] J. Gong, P. Yue, and H. Zhou. Geoprocessing in the Microsoft cloud computing
platform–Azure. In Joint Symposium of ISPRS Technical Commission IV &
AutoCarto, page 6, 2010.
[143] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
http://www.deeplearningbook.org.
[144] A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent
neural networks. In IEEE International Conference on Acoustics, Speech, and Signal
Processing, pages 6645–6649. IEEE, 2013.
[145] A. Greenberg, J. Hamilton, D. A. Maltz, and P. Patel. The cost of a cloud: Research
problems in data center networks. ACM SIGCOMM Computer Communication
Review, 39(1):68–73, 2008.
[146] K. Gremban. Get started with access management in the Azure portal.
https://docs.microsoft.com/en-us/azure/active-directory/role-based-access-control-what-is.
[147] W. Gropp, E. Lusk, and R. Thakur. Using MPI-2: Advanced features of the Message
Passing Interface. MIT Press, 1999.
[148] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques.
Morgan-Kaufmann, 2011.
[149] D. Hardt. OAuth 2.0 authorization framework specification, 2012.
http://tools.ietf.org/html/rfc6749.
[150] J. A. Hartigan and M. A. Wong. Algorithm AS 136: A k-means clustering algorithm.
Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108,
1979.
[151] K. Hashizume, D. G. Rosado, E. Fernández-Medina, and E. B. Fernandez. An
analysis of security issues for cloud computing. Journal of Internet Services and
Applications, 4(1):1, 2013.
[152] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition.
CoRR, abs/1512.03385, 2015.
[153] T. Hey, S. Tansley, and K. Tolle. The Fourth Paradigm: Data-Intensive Scientific
Discovery. Kindle, 2009.
[154] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. H. Katz,
S. Shenker, and I. Stoica. Mesos: A platform for fine-grained resource sharing
in the data center. In USENIX Symposium on Networked Systems Design and
Implementation, pages 22–22, 2011.
[155] B. Holzman. Fermilab HEPCloud: An elastic computing facility for High Energy
Physics. In International Conference on Computing in High Energy Physics. 2016.
https://indico.cern.ch/event/432527/contributions/1072465/.
[156] A. Howard. Running MPI applications in Amazon EC2, May 2015.
https://cyclecomputing.com/running-mpi-applications-in-amazon-ec2/.
[157] W. Huang, A. Ganjali, B. H. Kim, S. Oh, and D. Lie. The state of public
infrastructure-as-a-service cloud security. ACM Computing Surveys, 47(4):68, 2015.
[158] T. Hunt. Introducing “Have I been pwned?” – aggregating accounts
across website breaches, Dec 2013.
https://www.troyhunt.com/introducing-have-i-been-pwned/.
[159] A. Jain, S. P. Ong, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia,
D. Gunter, D. Skinner, G. Ceder, et al. The materials project: A materials genome
approach to accelerating materials innovation. APL Materials, 1(1):011002, 2013.
[160] S. Jain, A. Kumar, S. Mandal, J. Ong, L. Poutievski, A. Singh, S. Venkata,
J. Wanderer, J. Zhou, M. Zhu, et al. B4: Experience with a globally-deployed software
defined WAN. ACM SIGCOMM Computer Communication Review, 43(4):3–14,
2013.
[161] Y. Jia and E. Shelhamer. Caffe. http://caffe.berkeleyvision.org/.
[162] B. Johnson. Cloud computing is a trap, warns GNU founder Richard Stallman.
Guardian Newspaper, Sep 2008.
https://www.theguardian.com/technology/2008/sep/29/cloud.computing.richard.stallman.
[163] B. Jones. Towards the European open science cloud, 2015.
http://doi.org/10.5281/zenodo.16001.
[164] N. Jouppi. Google supercharges machine learning tasks with TPU custom chip.
Google Cloud Platform Blog, May 2016.
https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html.
[165] S. Kamburugamuve and G. Fox. Survey of distributed stream processing, Feb 2016.
https://www.researchgate.net/publication/299411481.
[166] S. Kamburugamuve, P. Wickramasinghe, S. Ekanayake, and G. Fox. Anatomy of
machine learning algorithm implementations in MPI, Spark, and Flink, Jan 2017.
https://www.researchgate.net/publication/312426658.
[167] N. T. Karonis, K. L. Duffin, C. E. Ordoñez, B. Erdelyi, T. D. Uram, E. C. Olson,
G. Coutrakon, and M. E. Papka. Distributed and hardware accelerated computing
for clinical medical imaging using proton computed tomography (pCT). Journal of
Parallel and Distributed Computing, 73(12):1605–1612, 2013.
[168] A. Karpathy. The unreasonable effectiveness of recurrent neural networks, Feb 2015.
http://karpathy.github.io/2015/05/21/rnn-effectiveness.
[169] M. Kassner. A look at Amazon’s world class data center
ecosystem, Dec 2014.
http://www.techrepublic.com/article/a-look-at-amazons-world-class-data-center-ecosystem.
[170] S. Kemp. Password-less logins with OpenSSH, 2005.
https://debian-administration.org/article/152/.
[171] R. D. King, J. Rowland, S. G. Oliver, M. Young, W. Aubrey, E. Byrne, M. Liakata,
M. Markham, P. Pir, L. N. Soldatova, A. Sparkes, K. E. Whelan, and A. Clare. The
automation of science. Science, 324(5923):85–89, 2009.
[172] G. Klimeck, M. McLennan, S. P. Brophy, G. B. Adams III, and M. S. Lundstrom.
nanohub.org: Advancing education and research in nanotechnology. Computing in
Science & Engineering, 10(5):17–23, 2008.
[173] S. Kulkarni, N. Bhagat, M. Fu, V. Kedigehalli, C. Kellogg, S. Mittal, J. M. Patel,
K. Ramasamy, and S. Taneja. Twitter Heron: Stream processing at scale. In ACM
SIGMOD International Conference on Management of Data, pages 239–250. ACM,
2015.
[174] H. S. Kuyuk, R. M. Allen, H. Brown, M. Hellweg, I. Henson, and D. Neuhauser.
Designing a network-based earthquake early warning algorithm for California: ElarmS-2.
Bulletin of the Seismological Society of America, 2013.
[175] M. Lamanna. The LHC computing grid project at CERN. Nuclear Instruments
and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors
and Associated Equipment, 534(1):1–6, 2004.
[176] K. A. Lawrence, M. Zentner, N. Wilkins-Diehr, J. A. Wernert, M. Pierce, S. Marru,
and S. Michael. Science gateways today and tomorrow: Positive perspectives of
nearly 5000 members of the research community. Concurrency and Computation:
Practice and Experience, 27(16):4252–4268, 2015.
[177] J. Layton. A container for HPC.
https://www.admin-magazine.com/HPC/Articles/Singularity-A-Container-for-HPC.
[178] J. A. Le, H. El-Askary, M. Allali, and D. Struppa. Application of recurrent neural
networks for drought projections in California. Atmospheric Research, 187, 2017.
[179] H. Lee. Simple Azure. https://readthedocs.org/projects/simple-azure/.
[180] P. D. Lena, K. Nagata, and P. F. Baldi. Deep spatio-temporal architectures
and learning for protein structure prediction. In Advances in Neural Information
Processing Systems 25, pages 512–520. Curran Associates, Inc., 2012.
[181] Y. Li. Introduction to Docker secrets management.
https://blog.docker.com/2017/02/docker-secrets-management.
[182] D. Lifka, I. Foster, S. Mehringer, M. Parashar, P. Redfern, C. Stewart, and S. Tuecke.
XSEDE cloud survey report, 2013. http://hdl.handle.net/2142/45766.
[183] I. Liu and B. Ramakrishnan. Bach in 2014: Music composition with recurrent neural
network. CoRR, abs/1412.3191, 2014.
[184] Q. Liu, J. Logan, Y. Tian, H. Abbasi, N. Podhorszki, J. Y. Choi, S. Klasky, R. Tchoua,
J. Lofstead, R. Oldfield, M. Parashar, N. Samatova, K. Schwan, A. Shoshani, M. Wolf,
K. Wu, and W. Yu. Hello ADIOS: The challenges and lessons of developing leadership
class I/O frameworks. Concurrency and Computation: Practice and Experience,
26(7):1453–1473, 2014.
[185]
Y. Liu, A. Padmanabhan, and S. Wang. CyberGIS Gateway for enabling data-rich
geospatial research and education. Concurrency and Computation: Practice and
Experience,27(2):395407,2015.
[186]
R. Madduri, K. Chard, R. Chard, L. Lacinski, A. Rodriguez, D. Sulakhe, D. Kelly,
U. Dave , and I. Foster. The Globus Galaxies platform: Delivering science gateways
as a service. Concurrency Practice and Experience,27(16):43444360,2015.
[187]
R. K. Madduri, D. Sulakhe, L. Lacinski, B. Liu, A. Rodriguez, K. Chard, U. J. Dave,
and I. T. Foster. Experiences building Globus Genomics: A next-generation sequenc-
ing analysis se rvic e using Galaxy, Globus, and Amazon Web Services. Concurrency
Practice and Experience,26(13):22662279,2014.
[188]
P. K. Mantha, A. Luckow, and S. Jha. Pilot-MapRedu ce: An extensible and flexible
MapReduce implementation for distributed data. In 3rd International Workshop on
MapReduce and Its Applications, pages 17–24. ACM, 2012.
[189]
J. Margolis. Amazon Echo’s role in deep space exploration.
Financial Times,Jan2017.
https://www.ft.com/content/
24529e30-d0e3-11e6-b06b-680c49b4b4c0.
[190]
N. Marz and J. Warren. Big Data Principles and best practices of scalable realtime
data systems.Manning,2015.
[191]
A. Matsunaga, J. Fortes, K. Keahey, and M. Tsugawa. Sky computing. IEEE
Internet Computing,13:4351,2009.
[192] K. Matthias and S. P. Kane. Docker: Up and Running.OReilly,2016.
[193]
W. McKinney. Python for Data Analysis: Data Wrangling with Pandas, NumPy,
and IPython.OReillyMedia,2015.
[194]
N. Mehrotra, L. Franks, P. McKay, R. McAllister, and J. Gao. Get started:
Create Apache Spark cluster in Azure HDInsight and run interactive queries
using Spark SQL.
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-jupyter-spark-sql/.
[195]
N. Mehrotra, R. McMurray, L. Franks, and J. Gao. Machine learning: Predic-
tive analysis on food inspection data using MLlib with Apache Spark cluster
on HDInsight Linux.
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-machine-learning-mllib-ipython.
[196]
P. Mehrotra, J. Djomehri, S. Heistand, R. Hood, H. Jin, A. Lazanoff, S. Saini, and
R. Biswas. Performance evaluation of Amazon EC2 for NASA HPC applications.
In 3rd Workshop on Scientific Cloud Computing, pages 41–50. ACM, 2012.
[197]
P. Mell and T. Grance. The NIST definition of cloud computing. Special Publication
800-145, National Institute of Standards and Technology, 2011.
http://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-145.pdf.
[198]
X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman,
D. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia,
and A. Talwalkar. MLlib: Machine learning in Apache Spark. Journal of Machine
Learning Research, 17(34):1–7, 2016.
[199]
F. Meyer, D. Paarmann, M. D’Souza, R. Olson, E. M. Glass, M. Kubal, T. Paczian,
A. Rodriguez, R. Stevens, A. Wilke, et al. The metagenomics RAST server–A public
resource for the automatic phylogenetic and functional analysis of metagenomes.
BMC Bioinformatics, 9(1):386, 2008.
[200]
Microsoft Research Connections. MSR Courseware.
https://github.com/MSRConnections/Azure-training-course.
[201]
M. A. Miller, W. Pfeiffer, and T. Schwartz. Creating the CIPRES science gateway
for inference of large phylogenetic trees. In Gateway Computing Environments
Workshop, pages 1–8, 2010.
[202]
D. Milojičić, I. M. Llorente, and R. S. Montero. OpenNebula: A cloud management
tool. IEEE Internet Computing, 15(2):11–14, 2011.
[203]
N. M. Mohamed, H. Lin, and W.-C. Feng. Accelerating data-intensive genome
analysis in the cloud. In 5th International Conference on Bioinformatics and
Computational Biology, 2013.
[204]
T. P. Morgan. A rare peek at the massive scale of AWS. EnterpriseTech, Nov 2014.
http://www.enterprisetech.com/2014/11/14/rare-peek-massive-scale-aws.
[205]
A. Morin, J. Urban, P. D. Adams, I. Foster, A. Sali, D. Baker, and P. Sliz. Shining
light into black boxes. Science, 336(6078):159–160, 2012.
[206]
A. Mouat. Docker security: Using containers safely in production.
https://gallery.mailchimp.com/979c70339150d05eec1531104/files/Docker_Security_Red_Hat.pdf.
[207]
A. C. Muller and S. Guido. Introduction to Machine Learning with Python: A Guide
for Data Scientists. O’Reilly Publishing, 2017.
[208]
N. Nakata, J. P. Chang, J. F. Lawrence, and P. Boué. Body wave extraction and
tomography at Long Beach, California, with ambient-noise interferometry. Journal of
Geophysical Research: Solid Earth, 120(2):1159–1173, 2015.
[209]
F. Nelli. Python Data Analytics: Data Analysis and Science using Pandas, Matplotlib
and the Python Programming Language. Apress, 2015.
[210]
M. A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015.
http://neuralnetworksanddeeplearning.com.
[211]
B. Nikolic. Data processing for the Square Kilometre Array telescope.
http://www.mrao.cam.ac.uk/~bn204/publications/2015/SKA-SDP-Streaming.pdf.
[212]
D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and
D. Zagorodnov. The Eucalyptus open-source cloud-computing system. In 9th
IEEE/ACM International Symposium on Cluster Computing and the Grid, pages
124–131, 2009.
[213]
C. Olah. Understanding LSTM networks, Aug 2015.
http://colah.github.io/posts/2015-08-Understanding-LSTMs/.
[214]
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A
not-so-foreign language for data processing. In ACM SIGMOD International Conference
on Management of Data, pages 1099–1110. ACM, 2008.
[215]
R. Orihuela and D. Bass. Help wanted: Black belts in data, Jun 2015.
http://www.bloomberg.com/news/articles/2015-06-04/help-wanted-black-belts-in-data.
[216]
K. Ovtcharov, O. Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, and E. Chung. Toward
accelerating deep learning at scale using specialized hardware in the datacenter. In
27th HotChips Symposium on High-Performance Chips. IEEE, August 2015.
[217]
D. F. Parkhill. The Challenge of the Computer Utility. Addison-Wesley Educational
Publishers, 1966.
[218]
N. Paskin. Digital object identifier (DOI) system. Encyclopedia of Library and
Information Sciences, 3:1586–1592, 2010.
[219]
F. Pérez and B. Granger. The state of Jupyter, Jan 2017.
https://www.oreilly.com/ideas/the-state-of-jupyter.
[220]
D. A. Phillips, C. Puskas, M. Santillan, L. Wang, R. W. King, W. M. Szeliga,
T. Melbourne, M. Murray, M. Floyd, and T. A. Herring. Plate Boundary Observatory
and related networks: GPS data analysis methods and geodetic products. Reviews
of Geophysics, 54:759–808, 2016.
[221]
I. Raicu, I. Foster, and Y. Zhao. Many-task computing for grids and supercomputers.
In IEEE Workshop on Many-Task Computing on Grids and Supercomputers, 2008.
[222]
K. Ram. Git can facilitate greater reproducibility and increased transparency in
science. Source Code for Biology and Medicine, 8(1):7, 2013.
[223]
L. Ramakrishnan, P. T. Zbiegel, S. Campbell, R. Bradshaw, R. S. Canon, S. Coghlan,
I. Sakrejda, N. Desai, T. Declerck, and A. Liu. Magellan: Experiences from a science
cloud. In 2nd International Workshop on Scientific Cloud Computing, pages 49–58.
ACM, 2011.
[224] S. Raschka. Python Machine Learning. Packt Publishing, 2016.
[225] K. Reitz. Requests: HTTP for humans. http://docs.python-requests.org.
[226] J. Richer. OAuth 2.0 token introspection. RFC 7662, IETF, 2015.
[227]
M. Rosenblum and T. Garfinkel. Virtual machine monitors: Current technology
and future trends. Computer, 38(5):39–47, 2005.
[228]
M. Russinovich. Report from Open Networking Summit: Achieving hyper-scale
with software defined networking. http://bit.ly/2laCxLT.
[229]
S. Ryza, U. Laserson, S. Owen, and J. Wills. Advanced Analytics with Spark:
Patterns for Learning from Data at Scale. O’Reilly Media, 2015.
[230]
N. Sakimura, J. Bradley, M. Jones, B. d. Medeiros, and C. Mortimore. OpenID
Connect Core 1.0 incorporating errata set 1, 2014.
http://openid.net/specs/openid-connect-core-1_0.html.
[231]
D. Sanderson. Programming Google App Engine with Python: Build and Run
Scalable Python Apps on Google’s Infrastructure. O’Reilly Press, 2015.
[232]
M. Satyanarayanan. The emergence of edge computing. Computer, 50(1):30–39,
2017.
[233]
C. Severance. Python for informatics: Exploring information, 2013.
http://www.pythonlearn.com/book.php.
[234]
D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driess-
che, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Diele-
man, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach,
K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep
neural networks and tree search. Nature, 529(7587):484–489, 2016.
[235]
F. Simorjay. Shared responsibilities for cloud computing. Technical
report, Microsoft, Mar 2016.
https://gallery.technet.microsoft.com/Shared-Responsibilities-81d0ff91.
[236]
A. Singh, J. Ong, A. Agarwal, G. Anderson, A. Armistead, R. Bannon, S. Boving,
G. Desai, B. Felderman, P. Germano, et al. Jupiter rising: A decade of Clos
topologies and centralized control in Google’s datacenter network. ACM SIGCOMM
Computer Communication Review, 45(4):183–197, 2015.
[237]
L. Smarr and C. E. Catlett. Metacomputing. Communications of the ACM, 35(6):44–53,
1992.
[238] R. M. Stallman. Who does that server really serve? Boston Review, 35(2), 2010.
[239]
R. Stevens, P. Woodward, T. DeFanti, and C. Catlett. From the I-WAY to the
National Technology Grid. Communications of the ACM, 40(11):50–60, 1997.
[240]
C. Strasser. Git/GitHub: A primer for researchers, 2014.
http://datapub.cdlib.org/2014/05/05/github-a-primer-for-researchers/.
[241]
A. Szalay and J. Gray. The world-wide telescope. Science, 293(5537):2037–2040,
2001.
[242]
T. Tetrick. Best practices for securing access to your Azure virtual machines, Jun 2014.
https://blogs.technet.microsoft.com/uspartner_ts2team/2014/06/04/best-practices-for-securing-access-to-your-azure-virtual-machines/.
[243]
D. Thain, T. Tannenbaum, and M. Livny. Distributed computing in practice:
The Condor experience. Concurrency and Computation: Practice and Experience,
17(2-4):323–356, 2005.
[244]
B. Tierney, J. Metzger, J. Boote, E. Boyd, A. Brown, R. Carlson, M. Zekauskas,
J. Zurawski, M. Swany, and M. Grigoriev. perfSONAR: Instantiating a global network
measurement framework. In SOSP Workshop on Real Overlays and Distributed
Systems, 2009.
[245]
J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither, A. Grimshaw, V. Hazlewood,
S. Lathrop, D. Lifka, G. D. Peterson, R. Roskies, J. R. Scott, and N. Wilkins-Diehr.
XSEDE: Accelerating scientific discovery. Computing in Science & Engineering,
16(5):62–74, 2014.
[246]
R. Tudoran, A. Costan, G. Antoniu, and H. Soncu. TomusBlobs: Towards
communication-efficient storage for MapReduce applications in Azure. In 12th
IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pages
427–434, 2012.
[247]
S. Tuecke, R. Ananthakrishnan, K. Chard, M. Lidman, B. McCollam, and I. Foster.
Globus Auth: A research identity and access management platform. In 12th IEEE
International Conference on e-Science, 2016.
[248]
T. Tugend. UCLA to be first station in nationwide computer network, July 1969.
http://www.lk.cs.ucla.edu/LK/Bib/REPORT/press.html.
[249]
J. Turnbull. The Docker Book: Containerization is the new virtualization. Kindle,
2014.
[250]
A. Vahdat. A look inside Google’s data center networks, 2015.
https://cloudplatform.googleblog.com/2015/06/A-Look-Inside-Googles-Data-Center-Networks.html.
[251] J. van Vliet and F. Paganelli. Programming AWS EC2. O’Reilly Press, 2011.
[252]
T. C. Vance, N. Merati, C. Yang, and M. Yuan. Cloud Computing in Ocean and
Atmospheric Sciences. Elsevier, 2016.
[253]
J. VanderPlas. Python Data Science Handbook: Essential Tools for Working with
Data. O’Reilly Media, 2017.
[254]
J. Varia. Tips for securing your EC2 instance.
https://aws.amazon.com/articles/1233.
[255]
N. Vijayakumar and B. Plale. Performance evaluation of rate-based join window
sizing for asynchronous data streams. In 13th IEEE International Symposium on
High Performance Distributed Computing, pages 260–261, 2004.
[256]
W. Vogels. MXNet: Deep learning framework of choice at AWS, Nov 2016.
http://www.allthingsdistributed.com/2016/11/mxnet-default-framework-deep-learning-aws.html.
[257]
M. M. Waldrop. The Dream Machine: JCR Licklider and the Revolution that Made
Computing Personal. Viking Penguin, 2001.
[258] T. White. Hadoop: The Definitive Guide. O’Reilly Media, Inc., 2012.
[259]
M. Wilde, M. Hategan, J. M. Wozniak, B. Clifford, D. S. Katz, and I. Foster. Swift:
A language for distributed parallel scripting. Parallel Computing, 37(9):633–652,
2011.
[260]
J. Wilkening, A. Wilke, N. Desai, and F. Meyer. Using clouds for metagenomics: A
case study. In IEEE International Conference on Cluster Computing, pages 1–6,
2009. http://www.mcs.anl.gov/papers/P1665A.pdf.
[261]
N. Wilkins-Diehr, D. Gannon, G. Klimeck, S. Oster, and S. Pamidighantam. TeraGrid
science gateways and their impact on science. Computer, 41(11), 2008.
[262]
K. Williams, E. Bilsland, A. Sparkes, W. Aubrey, M. Young, L. N. Soldatova,
K. De Grave, J. Ramon, M. de Clare, W. Sirawaraporn, S. G. Oliver, and R. D. King.
Cheaper faster drug development validated by the repositioning of drugs against
neglected tropical diseases. Journal of the Royal Society Interface,12(104):20141289,
2015.
[263] A. Wittig and M. Wittig. Amazon Web Services in Action. Manning Press, 2015.
[264]
D. Xue, P. V. Balachandran, J. Hogden, J. Theiler, D. Xue, and T. Lookman.
Accelerated search for materials with targeted properties by adaptive design. Nature
Communications, 7, 2016.
[265]
M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark:
Cluster computing with working sets. In HotCloud, 2010.
https://www.usenix.org/legacy/event/hotcloud10/tech/full_papers/Zaharia.pdf.
[266]
Y. Zheng, X. Chen, Q. Jin, Y. Chen, X. Qu, X. Liu, E. Chang, W.-Y. Ma, Y. Rui,
and W. Sun. A cloud-based knowledge discovery system for monitoring fine-grained
air quality. Technical Report MSR-TR-2014–40, Microsoft Research, 2014.
Index
23andMe genotyping service, 253
academic cloud, 6
access tokens, in OAuth2, 231
ACID semantics, in database, 27
activation functions, 205
actors, 96
ADIOS, 165
Advanced Message Queuing Protocol, 112, 122
Advanced Photon Source, 240
Amazon cloud services, 29
Amazon Machine Learning, 203
Athena analytics, 149
Aurora relational database service, 33
Batch, 305
CloudFormation, 100
CloudTrail auditing, 318
CloudWatch metrics, 318
Deep Learning AMI, 212
DynamoDB, 31, 116, 309
EC2 Container Service, 343
EC2 Container Service (ECS), 114–120
Elastic Block Store (EBS), 29, 77, 264, 305
Elastic Compute Cloud, 75–80
Elastic File System (EFS), 29
Elastic MapReduce (EMR), 31, 143–146
Elasticsearch Service, 34
Glacier archival storage, 6, 31
Identity and Access Management (IAM), 115,
320
Lex voice input, 202
Polly text to speech, 202
Redshift data warehouse, 33
Rekognition deep learning, 202
Relational Data Service (RDS), 33
Relational Database Service (RDS), 306, 309
Route 53, 305
Simple Email Service (SES), 309
Simple Queue Service (SQS), 34, 116, 117, 170
Simple Storage Service (S3), 10, 31, 38–41,
309
Titan graph database, 34
Virtual Private Cloud, 305
Virtual Private Cloud (VPC), 309
Animoto, 299
Apache CloudStack, 260
Apache libcloud Python SDK, 38
Apache Parquet, 150
Apache software foundation
Beam, 184
Flink, 187
Kafka, 180
YARN, 137
application whitelisting, 318
Argonne National Laboratory, 240
Aristotle academic cloud, 7, 70
Array of Things urban observatory, 164, 171, 339
artificial neural networks, 204
arXiv, document classifier for, 113, 197
Atmosphere, 35
Aurora relational database service, 33
AWS Batch, 305
Azure cloud services, 29
Azure Batch, 105
Azure Stack private cloud, 260
Blob storage service, 31
Cortana cognitive services, 220
Data Lake, 33, 148
DocumentDB, 33
Event Hubs, 34, 162, 175–179
File Storage, 30
Graph Engine, 34
HDInsight, 32
Machine Learning, 197–201
Queue storage service, 34
Quick Start orchestration, 105
Role-Based Access Control (RBAC), 319
Security Center service, 318
SQL Database service, 33
Storage Explorer, 31, 44
Stream Analytics, 175–179
Table storage service, 32
Threat Analytics service, 318
U-SQL data analytics tool, 149
Virtual Machines, 80–81
Azure Stack private cloud software, 260
back propagation, 205
bag of task parallelism, 107
BigQuery, 150
Binder, 91
Bionimbus academic cloud, 7
bisection bandwidth, 106
Blob, 31
blob, binary large object, 24
Bridges computer system, 284
bucket, storage aggregation concept, 10
as used in Amazon cloud, 31
bulk synchronous parallelism (BSP), 67, 96, 108
Cayley graph database, 30, 34
Celery, a Python package, 122
Ceph distributed storage system, 35
CfnCluster (CloudFormation Cluster), 100
Chameleon academic cloud, 7, 284
Cinder, an OpenStack service, 284
client-side encryption, 323
Cloud BI, 67
cloud bursting, 6, 70
Cloud Datalab, 67
cloud native application, 62, 298, 335–337
Cloud Native Computing Foundation, 335
Cloud Pub/Sub, 34
Cloud Security Alliance, 328
cloud, types of
academic, 6
community, 6
discovery, 346
hybrid, 6
private, 5
public, 4
CloudBridge Python SDK, 38, 50, 84
CloudStack cloud software, 260, 261
CloudTrail, an Amazon cloud service, 318
CloudWatch, an Amazon cloud service, 318
CNTK, 210
see Microsoft Cognitive Toolkit, 210
community cloud, 6
container, aggregation construct in object store, 24
container, server virtualization method, 64, 85–94
compared with virtual machine, 66
Docker support for, 86
sharing secrets, 320
Singularity as alternative to Docker, 94
Content distribution networks, 339
convolutional neural network, 207
cost
comparative studies in physics, 128, 332
of cloud for pCT analysis, 99
of different instance types, 80
savings by elastic provisioner, 80
data model, 26
data stream analytics, 161
Data Transfer Node, 244
database management system (DBMS), 26
dataflow, 67
Datalab, 151
deep learning, 134, 204–212
TensorFlow toolkit, 215–218
deep neural network, 98, 206
Department of Energy, xiii, 128, 245
DevOps, 111
DMagic system, 240
Docker, 85
Swarm container management, 67, 125, 320
document store, 27
DSpace, 91
edge computing, 338
Elastic MapReduce (EMR), 143–146
enhanced networking, 101
ESnet, 128, 245
Eucalyptus cloud software, 73, 261–281
deployment planning, 263
euca2ools command line interface, 275
single cluster cloud, 267
eventual consistency, 27
Fermilab, 128, 332
Field Programmable Gate Arrays, 338
file shares, 29
filtered back projection, 99
fourth paradigm, 128
Galaxy workflow system, 83, 91, 304
Ganglia monitoring tool, 100
gcfuse, 128
Genome Wide Association Study (GWAS), 253
GeoDeepDive, 128
Gigabit testbeds, 330
GitHub, 13
and cloud access keys, 317, 326
Glance, an OpenStack service, 284
Globus application examples
data sharing at Advanced Photon Source, 240
NCAR Research Data Archive, 247
Sanger Institute Imputation Service, 253
Globus Genomics, 108, 128
Globus research data management service, 298
accounts, 236
endpoints, 226
Globus Connect, 51, 304
identity providers supported, 236
publication service, 239
Google cloud services, 29
AppEngine, 82
BigQuery, 33
Bigtable NoSQL, 47–48
Cayley graph database, 30, 34
Cloud Bigtable, 32
Cloud Dataflow, 184
Cloud Datastore, 30, 32, 122
Cloud Pub/Sub, 34
Cloud pub/sub, 122
Cloud SQL, 33
Cloud Storage, 31
Coldline archival storage, 31
Compute Engine, 30, 82
Datastore NoSQL, 48–50
Kubernetes, 120–124
local SSD storage, 30
persistent disk storage, 30
Spanner, 33
storage services, 46–50
graph execution model, 96
Graphics processing unit (GPU), 80, 98, 99, 212,
218, 335
Hadoop, 96
Hadoop Distributed File System (HDFS), 136
HBase NoSQL database, 158
HDInsight, 147
Health Insurance Portability and Accountability Act
(HIPAA), 323
HEPCloud project, 128
high-performance computing, 94, 97–107, 283
and streaming, 165
on Amazon cloud, 100–103
on Azure cloud, 105–106
scaling challenges, 106
Hive data warehouse tool, 158
HTCondor job management system, 67, 128, 304
hybrid cloud, 6
hypervisor, 64, 74, 286
InCommon identity management federation, 226
infinite loop, see loop, infinite
infrastructure as a service (IaaS), 1, 63, 262, 318
Internet of Things, 175, 339
iRODS, 91
iSCSI Extensions for RDMA, 287
Jetstream academic cloud, 7, 35, 284
Jupyter, 13
JupyterHub multiuser system, 326
Kafka, 180
key pair, obtaining for Amazon cloud, 38, 75, 112
key-value store, 27
Keystone, an OpenStack service, 284
Kinesis, 34
Kinesis Analytics, 167
Kinesis Firehose, 167
Kinesis Streams, 167
Kubernetes, 67, 97
Lambda at the Edge, 339
local SSD, Google cloud service, 30
logistic function, 193
logistic regression, 193
loop, infinite, see infinite loop
Lustre parallel file system, 24
Machine learning, 191
scikit-learn package, 91
Vowpal Wabbit, 91, 198
machine learning, 134, 191–223
Amazon Machine Learning platform, 202–203
Azure Machine Learning, 197–201
MXNet open source library, 212–215
Spark MLlib, 192–197
magic operators, 143
manager worker parallelism, 107
many task parallelism, 66, 96, 107–108
MapReduce, 67, 96, 108
Mesos, 67, 97, 99
Message Passing Interface (MPI), 66
application to proton therapy, 99
in the cloud, 97
metacomputer, 330
microservice, 96, 110–122
and cloud native applications, 335
managing keys for, 320
Microsoft cloud, see Azure cloud services
Microsoft Cognitive Toolkit, 96, 210, 218
multitenancy, 288, 298, 301, 306
MySQL, 26
Amazon Aurora compatible with, 33
National Institute of Standards and Technology, xiii,
3, 328
National Institutes of Health, xiii
National Science Foundation, xiii
National Security Agency, 321
Neutron, an OpenStack service, 284
NGINX web proxy with load balancing, 125
Nimbus cloud software, 73
NoSQL database, 27
Nova, an OpenStack service, 284
OAuth 2.0 authorization framework, 231
object store, 24
object, cloud storage unit, 10
Ocean Observatories Initiative, 163
Oozie workflow management tool, 158
Open Compute Project, 337
Open Researcher and Contributor ID (ORCID), 236
OpenID Connect Core 1.0 (OIDC), 231, 236
OpenNebula cloud software, 73, 259, 261
OpenStack
Cinder block storage, 284
Glance image service, 284
Keystone identity component, 284
Neutron networking component, 284
Nova compute component, 284
Swift object storage, 284
OpenStack cloud services
Shared File Systems, 35
Swift object storage, 34
OpenStack cloud software, 73, 259, 283–296
and HPC, 284
and scientific workloads, 285
core services, 284
deployment, 288
persistent disk, Google cloud service, 30
personal health information (PHI), 323
Phoenix, 158
Pig, 158
platform as a service (PaaS), 3, 318
Portable Operating System Interface (POSIX), 23,
56
PostgreSQL, 26, 33, 80, 309
private cloud, 5
public cloud, 4
pros and cons, 68
publish/subscribe, 34
Python packages
Apache libcloud, 38
Azure Data Lake Store SDK, 148
Boto3 SDK, 11, 40, 76, 92, 168
Celery remote procedure call, 122
CloudBridge SDK, 38, 50, 84
Globus Auth SDK, 231–239
Globus Transfer SDK, 53, 227–230, 239
Google Cloud SDK, 46
Requests HTTP library, 253
scikit-learn machine learning, 91, 192
query language, 26
RabbitMQ message broker, 122, 126
recurrent neural network (RNN), 208
RedCloud academic cloud, 7
reinforced learning, 220
relational database, 26
Relational Database Service, 306
Representational State Transfer, 9
research data portal, 243
resilient distributed dataset (RDD), 138, 171, 192,
194
resource owner, 232
resource server, 232
role-based security, 113
Route 53, the Amazon service, 305
Sanger Imputation Service, 253
Scala, 138
scale, challenges of, 66
Science DMZ, 244
science gateway, 302
scikit-learn machine learning, 91
Secure Socket Layer, 321
server-side encryption, 322
serverless computing, 62, 67
service-level agreement, 107
Simple Azure, 81
single program multiple data, 96
Single-Root I/O Virtualization, 287
SMB, 30
software as a service (SaaS), 2, 299–301, 318
software development kits, 10
solid state disk (SSD), 30, 98
Spark, 67, 96, 137–143
DataFrames, 142, 192
Simple example program, 138
Streaming, 170
Spark MLlib (machine learning), 192–197
Chicago restaurant example, 193
Estimators, 192
Pipeline, 192
Transformers, 192
Storage Service Encryption, 323
Swarm, 67
Swift parallel scripting language, 34
Swift, an OpenStack service, 284
TensorFlow machine learning library, 96, 157, 207–208,
212, 215–218, 338, 344
Titan graph database, 34
topologies, 180
training set, 194
Transport Layer Security, 321
tumbling window, 178
U-SQL data analytics tool, 149
Union File System, 86
Urban informatics, 163
UUID, universally unique identifier
use to name buckets, 46
used by Globus, 53, 236
virtual machine, 64, 73–84
compared with container, 66
instance storage, 77
Virtual Private Cloud, 305
virtual private network, 324, 326
virtualization, 74
VMWare Cloud Foundation, 260
Vowpal Wabbit learning system, 91, 198
webHDFS, 148
XSEDE, 35
YARN, 137
Zeppelin web-based notebook, 143