Chapter 1

Orienting in the cloud universe

“I’ve also grown weary of reading about clouds in a book. Doesn’t this

piss you oﬀ? You’re reading a nice story, and suddenly the writer has

to stop and describe the clouds. Who cares?”

—George Carlin, “Seven Things I’m Tired Of”

We start this journey into cloud computing for science and engineering by introducing important concepts and the structure of the book, and reviewing tools that you should know in order to obtain the most value from this material.

1.1 Cloud: Computer, assistant, and platform

Scientists and engineers can apply cloud capabilities in their work in many diﬀerent

ways. We ﬁnd it useful to think in terms of three categories of use.

First, a cloud is an elastic computer: a source of on-demand computing

and storage that you can call upon when you need computing or storage capacity

larger than, or diﬀerent from, what is available locally. Accessing this capacity

in the cloud may be cheaper, faster, and/or more convenient than acquiring and

operating your own computing and storage systems. While there are diﬀerences

between the cloud computing and storage oﬀerings from diﬀerent cloud providers,

they provide quite similar capabilities: in particular, object storage and execution

of virtual machines and containers. We cover this infrastructure as a service (IaaS) technology and its applications in Parts I and II.


Figure 1.1: Scientists can use clouds in three distinct ways: as a source of on-demand computing and storage on which to run their own software (left); as a source of software that can be run over the network (center); and as a source of new platform capabilities that can allow development of new types of software (right).

Second, a cloud is a tireless laboratory assistant: a source of powerful

software that can perform certain tasks more eﬀectively and/or cheaply than you

can yourself: for example, Academia.edu, Google Scholar, and ResearchGate to

access information about publications, facilitating research and citation; GitHub to

manage software and documents, facilitating collaboration, software sharing, and

reproducibility; Google Docs, Box, and Dropbox to share data; Science Exchange

to order experiments; Figshare for publishing data; Globus to move and manage large data; Skype and other services for communication; and many others. In each case, you can avoid substantial cognitive, administrative, and financial burdens

that you or members of your laboratory would incur if they had to perform these

tasks themselves. These software as a service (SaaS) capabilities are important,

but are largely out of scope for this book, although we do discuss how to build

your own software as a service in Chapter 14.

Third, a cloud is a programming platform: that is, a collection of powerful software mechanisms that you can use to build software with capabilities that would be difficult or expensive to duplicate in your own lab: for example, an event

processing system that can process millions of events per second, a database that can

scale to billions of rows, an identity management service that can handle dozens of

diﬀerent identity providers, a data transfer service that can move terabytes securely

and reliably, or a service that is replicated in multiple geographic regions to ensure

continuous operations. These platform capabilities are arguably the most exciting


part of cloud computing, because they enable individual programmers to create

and operate software systems that would otherwise require large teams. They allow

the cloud to be used as an interactive environment for large-scale computational

experimentation and discovery. They can also be the most challenging to use

eﬀectively, because they have often been developed for use cases rather diﬀerent

from traditional technical computing. In addition, it is in this area that we see the

biggest variation across cloud vendors in terms of capabilities and interfaces. We

discuss these platform as a service (PaaS) capabilities in Part III.

Inevitably, the boundaries between these diﬀerent types of cloud system and

cloud usage are not always crisply deﬁned. For example, a growing number of

software-as-a-service offerings provide APIs that allow them to be used as platform

services, and we often see (as discussed in Part III) platform services enhancing

the value of virtual computer oﬀerings.

1.2 The cloud landscape

The cloud landscape is large, diverse, and complex. The U.S. National Institute of Standards and Technology lists five essential characteristics of cloud computing: on-demand self-service, broad network access, resource pooling, rapid elasticity or expansion, and measured service [197]. Today, thousands of companies offer services with some or all of these characteristics, from low-level computing and storage to sophisticated software: see Figure 1.2. But apart from the collaboration

and content management systems listed above, few of the commercial cloud services

shown in the ﬁgure are relevant to science and engineering.

One major exception is in the realm of cloud infrastructure: the elastic compute services that allow individuals to acquire storage and computing on demand. Here, the landscape is simpler, particularly when we focus on providers with offerings

relevant to science and engineering. (Others specialize in speciﬁc products, such

as Oracle for databases or AT&T for telecom.) Three vendors, Amazon, Google,

and Microsoft, dominate the industry, as shown in Table 1.1 on the next page,

and each has proven useful for science and engineering. We focus in this book

on the services provided by those three providers and by one academic research

cloud, Jetstream [122] (jetstream-cloud.org)∗. Nevertheless, other cloud providers are also impressive. For example, the New York-based DigitalOcean is popular

in the software engineering and cloud application development community, while

Rackspace supports those using the Amazon and Microsoft clouds as well as

∗ A shaded, rounded rectangle denotes an https URL, in this case https://jetstream-cloud.org


Figure 1.2: While dated, Bessemer Venture Partners’ picture of the top 300 cloud

computing companies in 2012 conveys the vast range of cloud service providers.

running its own cloud servers. European cloud providers include 1&1, UpCloud,

City Cloud, CloudSigma, CloudWatt, and Aruba Cloud. Large telecommunications

and search companies, such as China’s Baidu, are also rapidly building cloud data

centers. Together, these various companies operate more than one hundred data

centers around the globe, containing an estimated ten million servers and vast

storage. (We base these estimates on news articles [63, 169, 96, 204].)

The cloud services operated by Amazon, Google, and Microsoft are commonly referred to as public clouds, by analogy with the public utilities (power, telephone,

Table 1.1: Major cloud infrastructure providers of relevance to research.

Provider    Description
Amazon      Market leader. Computing, storage, and platform services. Extensively used in science and engineering.
Microsoft   Second biggest player. Computing, storage, and platform services with both individual and enterprise customers.
Google      Began with a service called App Engine and is now using that experience to release a full suite of cloud capabilities.


water, sewer, etc.) on which most of us depend in our daily lives. Like public

utilities, they provide computing, storage, and/or other services to any member

of the public with the ability to pay. (Public clouds are not regulated like public

utilities, leading some to argue that the term is inappropriate.)

In contrast, a private cloud is operated by a private institution or individual to provide computing, storage, and/or other services to a more limited audience. We can think of private clouds as being analogous to private electricity generators, although the utility that they provide is not electricity but on-demand access to computing, storage, or software. Private clouds are frequently deployed in larger enterprises. IBM, VMware, and Microsoft are major providers of proprietary solutions for building on-premises cloud-like systems. OpenStack (openstack.org) is the dominant open source cloud software solution, particularly in the US; this software is used, for example, by Jetstream and in some public clouds, such as that operated by Rackspace. OpenNebula [202] (opennebula.org) is another prominent open source solution, used extensively in Europe. The European Union has announced plans for a Europe-wide science cloud [163] by 2018 (hnscicloud.eu). There are also numerous academic cloud projects in Europe.

Figure 1.3: Private (including academic and community), public, and hybrid clouds.

This distinction between public and private clouds may appear minor, but it

has important implications. Because the major public clouds operate at a scale

far larger than any private cloud, they can oﬀer a broad set of powerful features:

for example, elasticity, ﬁne-grained billing, high reliability due to geographic

distribution, a wide variety of resource types, and rich sets of platform services.

Equally importantly, they can achieve substantial economies of scale.


Private clouds, in contrast, typically oﬀer a limited set of cloud-like capabilities:

for example, just the ability to deploy virtual machine instances and store objects.

As we will see in Parts I and II of this book, these capabilities are enough to support

interesting applications. However, the lack of the many other services offered by the Amazon, Microsoft, and Google platforms limits the range of things that can

be done. Private clouds may nevertheless be preferred in some circumstances,

for example because a speciﬁc workload can be run more cost-eﬀectively on an

in-house infrastructure, or because a company or researcher does not wish sensitive data to leave their premises. In such cases, so-called hybrid clouds may be used to run selected tasks on public clouds: a process that is sometimes termed cloud bursting. (Cloud computing has inspired some terrible terminology!)

A community cloud is a private cloud deployment designed to support a specific community: for example, the genomics community or a set of companies or academic institutions who want to share resources. The term academic cloud is sometimes used to refer to a private or community cloud focused on the needs of the academic community. Figure 1.3 depicts these different cloud types.

Private or public? The merits of private clouds are hotly debated. Proponents

of private clouds argue that it can be signiﬁcantly cheaper to acquire and operate

dedicated computers and storage systems than to buy time on a public cloud. To give

just one example, let’s consider the problem of providing online access to a petabyte

of data. Storing that petabyte for a year in the Amazon public cloud object store

would cost, as of January 2017, $252,000 in the Amazon US East region. In contrast,

you can buy a 1 PB capacity SpectraLogic Verde system, a high-capacity storage

device, for $75,000, and that system of course should be usable for several years.

A second frequently cited reason for using a private cloud is the need to protect

sensitive data. Increasingly, such data can be stored on public cloud resources from

a regulatory and policy perspective, as we discuss in Chapter 15, but again the costs

can be daunting, particularly if your institution provides you with secure storage

and computing at subsidized rates. (If they do not, then the public cloud can enable

research that you could not undertake otherwise.)

Critics respond that private cloud enthusiasts underestimate the costs associated with creating and running a cloud computing system, the difficulties inherent in achieving high reliability and security, and the benefits of a truly elastic cloud that

always has available capacity. (Returning to your petabyte, who pays for power, space,

operations, and support? What about backups? And if you don’t need online access,

Amazon provides an archival storage service, Glacier, that can store that

same petabyte for just $48,000 for a year, with automated migration of infrequently

used data from object store to archive.) The question of whether to build or buy is

complex, with the answer depending on many factors. Suﬃce to say that you should

be careful to consider all relevant factors when choosing your cloud solution.
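The storage figures quoted above can be reduced to comparable unit rates with a few lines of arithmetic. The following sketch uses only the prices given in the text (as of January 2017) and assumes decimal units (1 PB = 1,000,000 GB). The variable and function names are ours, and the hardware figure deliberately excludes the power, space, operations, support, and backup costs that critics point to, so it should be read as a lower bound on the cost of the private option.

```python
# Prices quoted in the text (January 2017, Amazon US East region).
S3_PER_PB_YEAR = 252_000       # USD to keep 1 PB in the object store for a year
GLACIER_PER_PB_YEAR = 48_000   # USD to keep 1 PB in Glacier for a year
VERDE_PURCHASE = 75_000        # USD one-time price of a 1 PB SpectraLogic Verde

GB_PER_PB = 1_000_000          # assumption: decimal units
MONTHS_PER_YEAR = 12


def per_gb_month(per_pb_year):
    """Convert a yearly price for 1 PB into USD per GB per month."""
    return per_pb_year / GB_PER_PB / MONTHS_PER_YEAR


s3_rate = per_gb_month(S3_PER_PB_YEAR)
glacier_rate = per_gb_month(GLACIER_PER_PB_YEAR)

# Months of object-store fees that equal the hardware purchase price alone.
months_to_match_purchase = VERDE_PURCHASE / (S3_PER_PB_YEAR / MONTHS_PER_YEAR)

print(f"Object store: ${s3_rate:.3f} per GB-month")
print(f"Glacier:      ${glacier_rate:.3f} per GB-month")
print(f"Purchase price equals {months_to_match_purchase:.1f} months of object storage")
```

Expressing each option as a rate makes it easier to substitute current prices, which change frequently, and to add the operating costs that the purchase price omits.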


We focus primarily in this book on public clouds, as they tend to be more

capable, more accessible, and easier to use than other clouds. However, we do

include material on the Eucalyptus and OpenStack software that is commonly used to create private, community, and academic clouds, and on the Jetstream academic cloud. Table 1.2 lists some private clouds from the academic community.

Table 1.2: Some private research clouds and their characteristics.

Name        Description
Aristotle   Hybrid cloud for academic research, integrating Eucalyptus private cloud clusters and public cloud providers. federatedcloud.org
Bionimbus   A cloud-based infrastructure for managing, analyzing and sharing genomics datasets. bionimbus.opensciencedatacloud.org
Chameleon   A configurable experimental environment for large-scale cloud research. chameleoncloud.org
Jetstream   Cloud computing for the U.S. academic community, operated as part of the XSEDE research network. jetstream-cloud.org
RedCloud    Subscription-based cloud that provides virtual servers and storage on demand. www.cac.cornell.edu/redcloud/

1.3 A guide to this book

This book has been written with you, the student, in mind. (Even if you are a

senior scientist or engineer, we know that you are still a student at heart!) Your

discipline may be physics, astronomy, biology, engineering, computer science, the

humanities, or one of the newer disciplines called computational or data science.

You may have come to this book because you have heard of new ways of computing

in the cloud and want to learn whether they matter to you. Perhaps:

• you have a lot of data that must be analyzed by remote collaborators;
• your current computing platform (e.g., your laptop) is no longer big enough for your needs, and you lack access to a large cluster or supercomputer;
• you have access to a supercomputer, but it does not work well for interactive data analysis and collaboration tasks;
• you want to apply new computational methods, such as machine learning or stream analytics, that are hard to install, operate, and scale; or
• you want to make software or data available to your community as a service.


We organize this book into five parts (see Figure 1.4 on the following page),

covering the following topics:

1. Managing data in the cloud: We describe the various types of data storage systems that are available for use in the cloud, and illustrate how you can interact with these services using a cloud portal or directly with code.
2. Computing in the cloud: Here we explore the spectrum of cloud computing capabilities. These range from deploying single virtual machines or containers to support basic interactive science experiments, to clusters of machines that do data analytics or traditional HPC computation.
3. Cloud as platform: Beyond data storage and computing there are high-level services that are particularly well suited to research applications. We examine data analysis, machine learning, and streaming data analysis methods. We also look at some specialized cloud tools designed specifically for science.
4. Building your own cloud: It is possible to build a basic cloud from scratch using some powerful open source software packages. We describe two examples and some of the tools needed.
5. Security and other topics: Security is always a major concern for any online activity. We address this topic at the end of the book, not because it is unimportant but because managing security requires an understanding of cloud architecture, as presented in previous chapters. We also consider some concerns and thoughts about future cloud evolution.

1.4 Accessing the cloud: Web, APIs, and SDKs

We have explained how the cloud can be used variously as a virtual computer,

assistant, or platform. But how exactly do you use it for each of these things? We

provide details in later chapters, but let us ﬁrst explain some basic concepts.

1.4.1 Web interfaces, APIs, SDKs, and CLIs

Most cloud services can be accessed in multiple ways. First, most support access via the web, thus permitting intuitive point-and-click access without any programming or even local software installation (beyond a web browser) on your part. The availability of such intuitive interfaces is part of the attraction of cloud services.


Figure 1.4: The cloud for science, from the ground up. The figure groups the book's topics into five parts: Part I, Managing data in the cloud; Part II, Computing in the cloud; Part III, The cloud as platform; Part IV, Building your own cloud; Part V, Security and other topics.

A web interface becomes tedious if the same or similar actions must be performed repeatedly. In such cases, you likely want to write programs that issue requests to cloud services on your behalf. Fortunately, most cloud services support such programmatic access. Typically, they support a Representational State Transfer (REST) application programming interface (API) that permits requests to be transmitted via the secure Hypertext Transfer Protocol (HTTPS) that is used by web browsers. (This common use of HTTPS is not a coincidence: the web interfaces discussed in the first paragraph are often implemented via browser-hosted JavaScript programs that generate such REST messages.) REST APIs are the key to programmatic interactions with cloud services.

The meaning of REST. This term was introduced by Roy Fielding in 2000 [121], who defined a set of principles that should be followed to build distributed systems that have desirable properties of the World Wide Web, such as performance, reliability, scalability, and simplicity. These principles specify that, among other things, a REST (or RESTful) web service should refer to objects by uniform resource identifiers, such as myserver.org/myobject, and that operations on these objects should be performed via HTTP operations, with for example a PUT being usually interpreted as a request to create an object and a GET as a request to access its contents. We give examples of REST operations below.
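To make these conventions concrete, here is a toy, in-memory sketch of how a RESTful service might map HTTP methods onto operations on objects named by URIs. It is not a real web service and not any particular provider's API: the class name ToyRestService, its handle method, and the choice of status codes are assumptions made for illustration.

```python
class ToyRestService:
    """In-memory stand-in for a RESTful service (illustrative only)."""

    def __init__(self):
        self._objects = {}  # maps URI -> stored contents

    def handle(self, method, uri, body=None):
        """Return an (HTTP status code, response body) pair for a request."""
        if method == "PUT":
            # PUT creates the object named by the URI, or replaces it.
            status = 200 if uri in self._objects else 201
            self._objects[uri] = body
            return status, None
        if method == "GET":
            # GET returns the contents of the object, if it exists.
            if uri in self._objects:
                return 200, self._objects[uri]
            return 404, None
        if method == "DELETE":
            # DELETE removes the object, if it exists.
            if uri in self._objects:
                del self._objects[uri]
                return 204, None
            return 404, None
        return 405, None  # method not supported in this sketch


service = ToyRestService()
created = service.handle("PUT", "myserver.org/myobject", b"hello")
fetched = service.handle("GET", "myserver.org/myobject")
deleted = service.handle("DELETE", "myserver.org/myobject")
missing = service.handle("GET", "myserver.org/myobject")
print(created, fetched, deleted, missing)
```

The point of the sketch is that the URI names the object while the HTTP method names the operation, so a client needs no service-specific verbs to create, read, or remove an object.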

One way to interact with cloud services programmatically is to write programs

that generate REST messages directly. However, while constructing REST messages

“by hand” may appeal to hard-core system programmers, you will normally want


to access cloud services via software development kits (SDKs) that you install

on your computer. Such SDKs permit access from programming languages such as

Python (our choice in this book), C++, Go, Java, PHP, and Ruby. (Sorry, Fortran

programmers, but Fortran SDKs are few and far between.) They typically render

operations on cloud services in ways that are consistent with the programming

model of the language in question. Cloud vendors typically provide SDKs for

accessing their services, but there are also good open source ones available, and if

you do not like any of them, you are free to develop your own.

Accessing a cloud service. We use a simple example to illustrate these different approaches to accessing cloud services. Consider the Amazon Simple Storage Service (S3), which, as we describe in Chapter 2, allows you to create and access containers called buckets, within which you can store and retrieve byte strings called objects. The Amazon web interface allows you to interact with S3 simply by pointing and clicking. For example, Figure 1.5 on the following page shows it being used to create a new bucket called cloud4sciencebucket, located within the US Standard region. (Amazon, like other cloud providers, operates many data centers around the world. The US Standard region is located in northern Virginia.) Such intuitive interfaces that can be used without any programming or even local software installation (beyond a web browser) on the part of the user are part of the attraction of cloud services.

S3 also defines a REST API that you can use to manipulate buckets and objects programmatically. Thus, instead of using the Amazon web interface, we could have created the bucket with name cloud4sciencebucket via a PUT request on the URI cloud4sciencebucket.s3.amazonaws.com. The following shows the syntax of this PUT request, although omitting some of the header fields for simplicity.

PUT / HTTP/1.1
Host: cloud4sciencebucket.s3.amazonaws.com
Content-Length: length
Date: date
Authorization: authorization string

<CreateBucketConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <LocationConstraint>US Standard</LocationConstraint>
</CreateBucketConfiguration>

Similarly, a DELETE operation on the same URI requests deletion of the bucket that we just created, and a GET operation on that URI returns some or all of the objects that may have subsequently been placed in the bucket.
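The same three operations can be expressed in Python using only the standard library. The sketch below merely constructs the request objects and does not send them: a real S3 request must also carry the Date and Authorization headers shown above, whose computation we omit, so sending these unsigned requests would be expected to fail. The helper name bucket_request is an assumption made for illustration.

```python
import urllib.request

BUCKET_URI = "https://cloud4sciencebucket.s3.amazonaws.com/"

# Request body taken from the PUT example above.
CREATE_BODY = (
    b'<CreateBucketConfiguration '
    b'xmlns="http://s3.amazonaws.com/doc/2006-03-01/">'
    b'<LocationConstraint>US Standard</LocationConstraint>'
    b'</CreateBucketConfiguration>'
)


def bucket_request(method, body=None):
    """Build (but do not send) an HTTP request on the bucket URI."""
    return urllib.request.Request(BUCKET_URI, data=body, method=method)


create = bucket_request("PUT", CREATE_BODY)  # request creation of the bucket
listing = bucket_request("GET")              # request the bucket's contents
remove = bucket_request("DELETE")            # request deletion of the bucket

for req in (create, listing, remove):
    print(req.get_method(), req.full_url)

# Sending a request would look like this, once it has been signed:
# with urllib.request.urlopen(create) as response:
#     print(response.status)
```

Only the HTTP method and the body differ among the three requests; the URI that identifies the bucket stays the same.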

In later chapters, we describe such cloud service APIs and SDKs for a range

of cloud infrastructure and platform services. Not covered in this book, but also


interesting, are the APIs and SDKs provided by many of the SaaS oﬀerings listed

above: for example, Dropbox, Google Docs, LinkedIn, Science Exchange, and GitHub.

(Not all SaaS provide APIs: Google Scholar and ResearchGate, sadly, do not.)

The fact that we can easily access most cloud services both via web browser and

programmatically is one of the reasons why cloud computing has proved so impactful.

Finally, we show how an SDK simpliﬁes interactions with cloud services. The

following Python code uses the Boto3 SDK to interact with Amazon S3. We obtain

an S3 resource; delete the bucket created previously with the REST API; create the

bucket again; and upload a ﬁle to the newly created bucket.

import boto3

s3 = boto3.resource('s3')

# Delete the bucket previously created with the REST API
s3.Bucket('cloud4sciencebucket').delete()

# Create that bucket again, specifying location
bucket = s3.create_bucket(Bucket='cloud4sciencebucket',
                          CreateBucketConfiguration={
                              'LocationConstraint': 'us-standard'})

# Upload a file 'test.jpg' into the newly created bucket
with open('test.jpg', 'rb') as f:
    bucket.put_object(Key='test.jpg', Body=f)

Figure 1.5: The Amazon S3 web interface at console.aws.amazon.com/s3, here being used to create a new bucket called cloud4sciencebucket in the US Standard region.


1.4.2 Local and cloud-hosted applications

The fact that we can, in a few lines, write programs that result in sophisticated

actions occurring in a cloud service is exciting. But where should those programs run? One obvious location is your laptop or workstation, and indeed that may be the right place for many purposes. For example, we might use a slightly expanded

version of the example program to upload 1,000 ﬁles from our laptop to S3.
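Such an expanded program might look like the following sketch. To keep it independent of credentials, the function accepts any object that offers an upload_file(Filename, Key) method, which is the signature provided by a Boto3 Bucket resource. The names upload_directory and RecordingBucket are hypothetical, and the stand-in class exists only so that the logic can be tried without contacting S3.

```python
import tempfile
from pathlib import Path


def upload_directory(bucket, directory, pattern="*"):
    """Upload every file in `directory` matching `pattern` to `bucket`.

    `bucket` is assumed to provide upload_file(Filename, Key), as a Boto3
    S3 Bucket resource does. Returns the list of object keys uploaded.
    """
    keys = []
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            # Use the file name as the object key.
            bucket.upload_file(Filename=str(path), Key=path.name)
            keys.append(path.name)
    return keys


class RecordingBucket:
    """Stand-in for a Boto3 Bucket: records uploads instead of sending them."""

    def __init__(self):
        self.uploaded = []

    def upload_file(self, Filename, Key):
        self.uploaded.append((Filename, Key))


# Dry run against the stand-in, using a few small temporary files.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        (Path(tmp) / f"file{i}.jpg").write_bytes(b"example")
    (Path(tmp) / "notes.txt").write_text("not an image")
    dry_run_bucket = RecordingBucket()
    uploaded_keys = upload_directory(dry_run_bucket, tmp, "*.jpg")

print(uploaded_keys)
```

With Boto3 installed and AWS credentials configured, one might instead pass boto3.resource('s3').Bucket('cloud4sciencebucket') as the first argument. The files are then uploaded one at a time, which is simple but not the fastest approach for 1,000 files.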

However, in other cases, we want to run a program elsewhere: for example,

because we want the program to keep running once we close our laptop, we cannot

easily install required software on our local computer, or our program is intended

to provide services to other people. In such cases, a natural thing is to run our

program in the cloud. We discuss this topic in detail in Part II of this book, where

we see that we can create a cloud-hosted virtual computer, via either web interfaces

or APIs/SDKs, much as we created a bucket in the example above.

In summary, the cloud can be viewed both as a source of services and as a

place to run programs. Cloud services can be accessed from web browsers or from

programs—programs that can themselves run locally or in the cloud. It is this

diversity of usage modalities, and the relative simplicity of the methods by which

these usage modalities are employed, that accounts for the power of the cloud.

1.5 Tools used in this book

We make extensive use in both this book and its supporting online notebooks of

some standard tools that go beyond the world of cloud computing: the Python

programming language, Jupyter web-based computing tool, GitHub version control

and collaboration system, and Globus research data management service. We

recommend that any researcher who aspires to become proﬁcient in scientiﬁc

computing master all four systems. All are quite accessible and are supported by excellent online resources. Time spent mastering them will be repaid many times

over in more productive research. We give a brief introduction to each here.

1.5.1 Python

You need some basic programming knowledge to get the most out of this book.

Most science and engineering students know the Python programming language,

so we use Python for our programming examples. If you do not know Python, the

book should still be interesting, but trust us, Python is easy and also tremendously

fun. Learn the basics, at least.


The easiest way to get Python working on your computer is to install the free Anaconda distribution provided by Continuum Analytics (continuum.io/downloads). This distribution includes multiple tools for installing and updating both Python

and installed packages. It is separate from any OS-level version of Python, and is

easy to completely uninstall. It works well on Windows, Mac, and Linux.

Alternatively, you can also create your Python environment manually, installing

Python, package managers, and Python packages separately. Packages like NumPy

and Pandas can be difficult to get working, however, particularly on Windows.

Anaconda simpliﬁes this setup considerably, regardless of your OS.

1.5.2 Jupyter: An interactive, web-based computing tool

To facilitate access to the various methods and tools presented in this book, we

provide complete source code for most code examples in the form of Jupyter notebooks. Jupyter Notebook, or simply Jupyter, is a web application that

allows you to create and share documents (“notebooks”) containing live code,

equations, visualizations, and explanatory text. Figure 1.6 on the next page shows

what Jupyter looks like in your web browser. The code for the notebook in this

ﬁgure is in the code repository as Notebook 1, as documented in Chapter 17.

To install Jupyter for Python, use the Python package installer pip or download and install Anaconda from Continuum Analytics. Later in this book we demonstrate how to install Python and Jupyter as a virtual machine or Docker container running

in a remote cloud server.

Our use of Jupyter emphasizes that cloud computing lends itself to interactive

exploration. We make almost all of the examples in this book available as Jupyter

notebooks. Most were developed by the authors during interactive sessions using

one or more of the cloud platforms described in this book.

1.5.3 The GitHub version control system

We also recommend that you master GitHub. A version control system is a tool for

keeping track of changes that have been made to a document over time. GitHub is

a hosting service for projects that use the Git version control system. Both the Git

tool and the GitHub site are increasingly often used by researchers, to create digital

lab notebooks that record the data ﬁles, programs, papers, and other resources

associated with a project, with automatic tracking of the changes that are made

to those resources over time [240]. GitHub also makes it easy for collaborators to work together on a project, whether a program or a paper: changes made by each


Figure 1.6: A sample Jupyter notebook. This notebook includes four cells. The first is a

markdown cell, i.e., one containing text. The following three provide Python code that

can be run from your web browser, with the last producing a visualization.

contributor are recorded and can easily be reconciled. For example, we used GitHub

to create this book, with both the authors and reviewers checking in changes and


comments at different times and in different time zones. Ram [222] provides a nice description of how Git/GitHub can be used to promote reproducibility and transparency in research. We also use GitHub to provide access to the online notebooks that accompany this text. You can find the repository at SciEngCloud.github.io.

1.5.4 Globus

We also use the Globus software as a service (globus.org) in our examples, and think

that you will ﬁnd it useful as well. This cloud-hosted service implements research

data, identity, and credential management capabilities that can greatly simplify

the development of cloud applications that need to access resources located on

university campuses, national computing centers, and other facilities. It comprises a

set of cloud-hosted software-as-a-service services (data transfer and synchronization, authorization, data sharing, data publication, data search, and groups), plus a simple software component, Globus Connect, deployable on computers associated

with storage systems, including laptops, lab servers, campus compute clusters,

cloud storage, and scientiﬁc instruments. REST APIs and a Python SDK simplify

integration into applications. We provide more information about Globus in later

chapters, as we introduce examples that demonstrate its use to manage access to

data, replicate data across sites, and publish data, among other things.

1.6 Summary

Pioneering scientists and engineers are already using the cloud in their work, often

in areas in which new data sources or modeling methods require more or diﬀerent

resources than are easily available in the laboratory: for example, in the analysis of urban [266] and environmental [110, 117, 137, 142, 252] data and in biomedical data analysis and modeling [188, 203, 260]. We are certain that

many more researchers are reaching similar turning points and are thus ﬁnding

themselves needing to think about computing in new ways. This book is for them

and for a generation of computer scientists and engineers who recognize that cloud

computing is going to be essential for their careers.

It is almost a futile exercise to write a practical, hands-on book about a

technology that is evolving as rapidly as cloud. Our choices of major vendors and

services may well be rendered obsolete by new developments. Nevertheless, we

expect that most core concepts and tools will remain valid for a long time. The

Unix operating system, hot in the 1970s, lives on in Linux today; the Python

programming language has been with us for 25 years. As new and better ideas

emerge with implications for science and engineering, whether incremental or

revolutionary, we will strive to update the online resources and, as and when

feasible, produce revised versions of this text.

1.7 Resources

The U.S. National Institute of Standards and Technology provides a useful

definition of cloud computing, as “a model for enabling ubiquitous, convenient, on-demand

network access to a shared pool of conﬁgurable computing resources (e.g., networks,

servers, storage, applications and services) that can be rapidly provisioned and

released with minimal management eﬀort or service provider interaction” [197].

We recommend Charles Severance’s Python for Informatics: Exploring

Information [233], which covers basic Python and provides material relevant to web

data and MySQL. This book is freely available online and is supported by excellent

online lectures and exercises.

There are many Jupyter resources, a good number of them on the main Jupyter

site, jupyter.org. Fernando Pérez and Brian Granger have an excellent

blog post, “State of Jupyter” [219], that is both a history and a look ahead.

Each public cloud has a portal where you can learn about its services,

obtain a free account with a modest-sized allocation for experimentation, and track

your data resources and compute activities: Amazon (aws.amazon.com), Microsoft

(azure.microsoft.com), and Google (cloud.google.com). Amazon and Microsoft

grant programs can provide larger allocations: see aws.amazon.com/grants and

research.microsoft.com/azure. The latter sites list examples of cloud use in research,

as do reports by Gannon et al. [135] and Lifka et al. [182].

To access the NSF-funded Jetstream cloud you need an allocation through the

XSEDE program. If you are a U.S. academic researcher you can qualify for an

allocation. Details are at jetstream-cloud.org and www.xsede.org.

Bibliography

[1] Access control at the project level. https://cloud.google.com/storage/docs/access-control/iam.

[2] Apache Flink dataflow programming model. https://ci.apache.org/projects/flink/flink-docs-release-1.2/concepts/programming-model.html.

[3] Assignments for Udacity deep learning class with TensorFlow. https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/udacity.

[4] AWS Case Study: Animoto. https://aws.amazon.com/solutions/case-studies/animoto/.

[5] AWS Identity and Access Management best practices. http://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html.

[6] Azure Batch Shipyard recipes. https://github.com/Azure/batch-shipyard/tree/master/recipes.

[7] Azure Data Lake Store Python SDK. https://github.com/Azure/azure-data-lake-store-python.

[8] Azure: Deploy a slurm cluster. https://github.com/Azure/azure-quickstart-templates/tree/master/slurm/README.md.

[9] Bare metal on OpenStack: Ironic. https://wiki.openstack.org/wiki/Ironic.

[10] CentOS 7 / RHEL 7 – Open ports. http://www.linuxbrigade.com/centos-7-rhel-7-open-ports/.

[11] Cloudbridge documentation. http://cloudbridge.readthedocs.io/en/latest/.

[12] Containers on OpenStack: Magnum. https://wiki.openstack.org/wiki/Magnum.

[13] Deep learning AMI Amazon Linux version. https://aws.amazon.com/marketplace/pp/B01M0AXXQB.

[14] Euca2ools overview. https://docs.hpcloud.com/eucalyptus/4.3.0/euca2ools-guide/index.html.

[15] Eucalyptus EDGE network configuration. https://docs.eucalyptus.com/eucalyptus/4.3/install-guide/nw_edge.html.

[16] Eucalyptus installation guide. https://docs.eucalyptus.com/eucalyptus/latest/shared/install_section.html.

[17] Eucalyptus network configuration requirements. https://docs.hpcloud.com/eucalyptus/4.3.0/install-guide/preparing_firewalls.html.

[18] Eucalyptus: Plan services placement. https://docs.eucalyptus.com/eucalyptus/latest/install-guide/services_understanding.html.

[19] Eucalyptus: Planning networking modes. https://docs.eucalyptus.com/eucalyptus/latest/install-guide/planning_networking_modes.html.

[20] Galaxy on Jetstream. https://wiki.galaxyproject.org/Cloud/Jetstream.

[21] Get started: Create Apache Spark cluster on HDInsight Linux and run interactive queries using Spark SQL. https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-jupyter-spark-sql/.

[22] Globus endpoint activation. https://docs.globus.org/api/transfer/endpoint_activation/.

[23] Google Cloud Dataflow: Complete Examples. https://cloud.google.com/dataflow/examples/all-examples.

[24] Google Cloud Datalab Quickstart. https://cloud.google.com/datalab/docs/quickstarts/quickstart-local.

[25] IBM Analytics Stream Computing. http://www.ibm.com/analytics/us/en/technology/stream-computing/.

[26] The Kubernetes project. http://kubernetes.io.

[27] Layers library reference. https://www.cntk.ai/pythondocs/layerref.html.

[28] Linux RAID. https://raid.wiki.kernel.org/index.php/Linux_Raid.

[29] Machine Learning Library (MLlib) guide. https://spark.apache.org/docs/latest/ml-guide.html.

[30] Making secure requests to Amazon Web Services. https://aws.amazon.com/articles/1928.

[31] Microsoft Azure Event Hubs. https://azure.microsoft.com/en-us/services/event-hubs/.

[32] Microsoft Azure Stack. https://azure.microsoft.com/en-us/overview/azure-stack/.

[33] NCBI BLAST on Windows Azure. https://www.microsoft.com/en-us/download/details.aspx?id=52513.

[34] Ocean Observatories Initiative. http://oceanobservatories.org.

[35] The Open Compute Project. http://opencompute.org.

[36] OpenStack documentation: CPU topologies. https://docs.openstack.org/admin-guide/compute-cpu-topologies.html.

[37] OpenStack in production: Hints and tips from the CERN OpenStack cloud team. http://openstack-in-production.blogspot.co.uk/.

[38] OpenStack Newton release notes. https://www.openstack.org/software/newton/.

[39] OpenStack: Operators mailing list. http://lists.openstack.org/pipermail/openstack-operators/.

[40] OpenStack: Scientific working group. https://wiki.openstack.org/wiki/Scientific_working_group.

[41] Predict with pre-trained models. http://mxnet.io/tutorials/python/predict_imagenet.html.

[42] Rados object storage utility. http://docs.ceph.com/docs/giant/man/8/rados/.

[43] Riak cloud storage. http://docs.basho.com/riak/cs/2.1.1/.

[44] Sample applications built using Amazon Machine Learning. https://github.com/awslabs/machine-learning-samples.

[45] Spark SQL, DataFrames and Datasets guide. http://spark.apache.org/docs/latest/sql-programming-guide.html.

[46] The Red Hat Package Manager. https://en.wikipedia.org/wiki/RPM_Package_Manager.

[47] Theano deep learning library. http://www.deeplearning.net/software/theano/.

[48] Transferring RDA data with Globus. http://ncarrda.blogspot.com/2015/06/transferring-rda-data-with-globus.html.

[49] TripleO online documentation. http://tripleo.org.

[50] VMware Cloud Foundation. https://www.vmware.com/products/cloud-foundation.html.

[51] Welcome to Bridges. https://www.psc.edu/index.php/resources/computing/bridges.

[52] What is IAM? https://docs.aws.amazon.com/IAM/latest/UserGuide/Introduction.html.

[53] Setup Linux Network Bridges on CentOS for Nova Networking, Nov 2015. https://platform9.com/support/setup-network-bridges-on-centos-nova-networking/.

[54] OpenStack user survey, Oct 2016. https://www.openstack.org/assets/survey/October2016SurveyReport.pdf.

[55] Using AWS in the context of New Zealand privacy considerations. Technical report, Oct. 2016. https://d0.awsstatic.com/whitepapers/compliance/Using_AWS_in_the_context_of_New_Zealand_Privacy_Considerations.pdf.

[56] G. Agha. An overview of actor languages. SIGPLAN Notices, 21(10):58–67, June 1986.

[57] T. Akidau. The world beyond batch: Streaming 102, Jan 2016. https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102.

[58] T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, R. J. Fernández-Moctezuma, R. Lax, S. McVeety, D. Mills, F. Perry, E. Schmidt, and S. Whittle. The dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proceedings of the VLDB Endowment, 8(12):1792–1803, Aug. 2015.

[59] T. Akidau and F. Perry. Dataflow/Beam and Spark: A Programming Model Comparison, Feb 2016. https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison.

[60] A. Aliper, S. Plis, A. Artemov, A. Ulloa, P. Mamoshina, and A. Zhavoronkov. Deep learning applications for predicting pharmacological properties of drugs and drug repurposing using transcriptomic data. Molecular Pharmaceutics, 2016.

[61] W. Allcock, J. Bresnahan, R. Kettimuthu, M. Link, C. Dumitrescu, I. Raicu, and I. Foster. The Globus striped GridFTP framework and server. In ACM/IEEE Conference on Supercomputing, page 54, 2005.

[62] B. Allen, J. Bresnahan, L. Childers, I. Foster, G. Kandaswamy, R. Kettimuthu, J. Kordas, M. Link, S. Martin, K. Pickett, and S. Tuecke. Software as a service for data scientists. Communications of the ACM, 55(2):81–88, Feb. 2012.

[63] S. Anthony. How big is the Cloud? ExtremeTech, May 2012. http://www.extremetech.com/computing/129183-how-big-is-the-cloud.

[64] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, et al. Spark SQL: Relational data processing in Spark. In ACM SIGMOD International Conference on Management of Data, pages 1383–1394, 2015.

[65] P. Bailis and K. Kingsbury. The network is reliable. Queue, 12(7):20, 2014.

[66] R. Barga, J. Goldstein, M. Ali, and M. Hong. Consistent streaming through time: A vision for event stream processing. In Conference on Innovative Data Systems Research, pages 363–374, 2007.

[67] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. ACM SIGOPS Operating Systems Review, 37(5):164–177, 2003.

[68] W. Barnett, V. Welch, A. Walsh, and C. A. Stewart. A roadmap for using NSF cyberinfrastructure with InCommon, 2011. https://www.incommon.org/federation/cyberroadmap.html.

[69] L. A. Barroso, J. Clidaras, and U. Hölzle. The datacenter as a computer: An introduction to the design of warehouse-scale machines. Synthesis Lectures on Computer Architecture, 8(3):1–154, 2013.

[70] S. Beer. Brain of the Firm. Penguin Press, 1972.

[71] D. Bernstein. Containers and cloud: From LXC to Docker to Kubernetes. IEEE Cloud Computing, 1(3):81–84, 2014.

[72] P. Bernstein, S. Berkov, J. Thelin, and S. Burkhardt. Orleans - Virtual Actors. http://research.microsoft.com/en-us/projects/orleans/.

[73] K. Bhuvaneshwar, D. Sulakhe, R. Gauba, A. Rodriguez, R. Madduri, U. Dave, L. Lacinski, I. Foster, Y. Gusev, and S. Madhavan. A case study for cloud based high throughput analysis of NGS data using the Globus Genomics system. Computational and Structural Biotechnology Journal, 13:64–74, 2015.

[74] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, and J. Zhao. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.

[75] F. Bonomi, R. Milito, J. Zhu, and S. Addepalli. Fog computing and its role in the internet of things. In MCC Workshop on Mobile Cloud Computing, pages 13–16. ACM, 2012.

[76] D. E. Boyle, D. C. Yates, and E. M. Yeatman. Urban sensor data streams: London 2013. IEEE Internet Computing, 17(6):12–20, 2013.

[77] T. Bray. One Amazon year, December 2015. https://www.tbray.org/ongoing/When/201x/2015/12/01/One-Amazon-Year.

[78] E. Brewer. CAP twelve years later: How the “rules” have changed. Computer, 45(2):23–29, 2012.

[79] E. Brewer. Kubernetes and the path to cloud native. In 6th ACM Symposium on Cloud Computing, pages 167–167. ACM, 2015.

[80] J. Bryce. Embracing datacenter diversity. In OpenStack Austin. 2016. https://www.openstack.org/videos/video/embracing-datacenter-diversity.

[81] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: Efficient iterative data processing on large clusters. Proceedings of the VLDB Endowment, 3(1-2):285–296, 2010.

[82] S. Bugiel, S. Nürnberger, T. Pöppelmann, A.-R. Sadeghi, and T. Schneider. AmazonIA: When elasticity snaps back. In 18th ACM conference on Computer and Communications Security, pages 389–400. ACM, 2011.

[83] J. Cantarella, C. Shonkwiler, and E. Uehara. A fast direct sampling algorithm for equilateral closed polygons, Jan 2017. https://arxiv.org/abs/1510.02466v2.

[84] C. Catlett, T. Malik, B. Goldstein, J. Giuffrida, Y. Shao, A. Panella, D. Eder, E. v. Zanten, R. Mitchum, S. Thaler, and I. Foster. Plenario: An open data discovery and exploration platform for urban science. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, pages 27–42, 2014.

[85] A. Caulfield, E. Chung, A. Putnam, H. Angepat, J. Fowers, M. Haselman, S. Heil, M. Humphrey, P. Kaur, J.-Y. Kim, D. Lo, T. Massengill, K. Ovtcharov, M. Papamichael, L. Woods, S. Lanka, D. Chiou, and D. Burger. A cloud-scale acceleration architecture. In 49th Annual IEEE/ACM International Symposium on Microarchitecture, October 2016.

[86] M. Cezar. Setting up NTP (Network Time Protocol) Server in RHEL/CentOS 7. Tecmint, Mar 2015. http://www.tecmint.com/install-ntp-server-in-centos/.

[87] K. M. Chandy, O. Etzion, and R. von Ammon. Event processing. Dagstuhl Seminar Proceedings 10201, Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, Germany, 2011.

[88] K. Chard, S. Caton, O. F. Rana, and K. Bubendorfer. Social cloud: Cloud computing in social networks. IEEE CLOUD, 10:99–106, 2010.

[89] K. Chard, J. Pruyne, B. Blaiszik, R. Ananthakrishnan, S. Tuecke, and I. Foster. Globus data publication as a service: Lowering barriers to reproducible science. In 11th IEEE International Conference on eScience, 2015.

[90] R. Chard, K. Chard, K. Bubendorfer, L. Lacinski, R. Madduri, and I. Foster. Cost-aware cloud provisioning. In IEEE 11th International Conference on e-Science, pages 136–144, 2015.

[91] R. Chard, R. Madduri, N. Karonis, K. Chard, K. Duffin, C. Ordonez, T. Uram, J. Fleischauer, I. Foster, M. Papka, and J. Winans. Scalable pCT image reconstruction delivered as a cloud service. IEEE Transactions on Cloud Computing, 2015. http://ieeexplore.ieee.org/document/7160740/.

[92] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274, 2015.

[93] Y. Chen, V. Paxson, and R. H. Katz. What’s new about cloud computing security. University of California, Berkeley Report No. UCB/EECS-2010-5 January, 2010.

[94] T. Cheng and J. Wang. Application of a Dynamic Recurrent Neural Network in Spatio-Temporal Forecasting, pages 173–186. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007.

[95] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.

[96] J. Clark. 5 numbers that illustrate the mind bending size of Amazon’s cloud. Bloomberg Global Tech, Nov 2014. http://www.bloomberg.com/news/2014-11-14/5-numbers-that-illustrate-the-mind-bending-size-of-amazon-s-cloud.html.

[97] Cloud Computing Security Working Group. NIST Cloud Computing Security Reference Architecture. Special Publication 500-299, National Institute of Standards and Technology, 2013. http://collaborate.nist.gov/twiki-cloud-computing/pub/CloudComputing/CloudSecurity/NIST_Security_Reference_Architecture_2013.05.15_v1.0.pdf.

[98] D. T. Cohen, G. W. Hatchard, and S. G. Wilson. Population trends in incorporated places: 2000 to 2013. Technical Report P25-1142, US Census, Mar 2015.

[99] A. Conesa, P. Madrigal, S. Tarazona, D. Gomez-Cabrero, A. Cervera, A. McPherson, M. W. Szcześniak, D. J. Gaffney, L. L. Elo, X. Zhang, et al. A survey of best practices for RNA-seq data analysis. Genome biology, 17(1):13, 2016.

[100] F. J. Corbató and V. Vyssotsky. Introduction and overview of the Multics system. IEEE Annals of the History of Computing, 14(2):12–13, 1992.

[101] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, and D. Woodford. Spanner: Google’s globally distributed database. ACM Transactions on Computer Systems, 31(3):8, 2013.

[102] T. Cowles, J. Delaney, J. Orcutt, and R. Weller. The Ocean Observatories Initiative: Sustained ocean observing across a range of spatial scales. Marine Technology Society Journal, 44(6):54–64, 2010.

[103] D. R. Cox. The regression analysis of binary sequences. Journal of the Royal Statistical Society. Series B (Methodological), pages 215–242, 1958.

[104] R. J. Creasy. The origin of the VM/370 time-sharing system. IBM Journal of Research and Development, 25(5):483–490, 1981.

[105] J. Czyzyk, M. P. Mesnier, and J. J. Moré. The NEOS server. IEEE Computational Science and Engineering, 5(3):68–75, 1998.

[106] E. Dart, L. Rotman, B. Tierney, M. Hester, and J. Zurawski. The Science DMZ: A network design pattern for data-intensive science. Scientific Programming, 22(2):173–185, 2014.

[107] F. De Carlo. DMagic data management system. http://dmagic.readthedocs.io.

[108] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[109] E. Deelman, K. Vahi, M. Rynge, G. Juve, R. Mayani, and R. F. da Silva. Pegasus in the cloud: Science automation through workflow technologies. IEEE Internet Computing, 20(1):70–76, 2016.

[110] P. Dhingra, K. Tolle, and D. Gannon. Using cloud-based analytics to save lives. Cloud Computing in Ocean and Atmospheric Sciences, page 221, 2016.

[111] S. Dieleman. My solution for the Galaxy Zoo challenge, Apr 2014. http://benanne.github.io/2014/04/05/galaxy-zoo.html.

[112] C. Docan, M. Parashar, and S. Klasky. DataSpaces: An interaction and coordination framework for coupled simulation workflows. Cluster Computing, 15(2):163–181, 2012.

[113] A. Dubey and D. Wagle. Delivering software as a service. The McKinsey Quarterly, 6(2007):2007, 2007.

[114] D. Eadline. Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem. Addison-Wesley, 2016.

[115] G. Eisenhauer, M. Wolf, H. Abbasi, and K. Schwan. Event-based systems: Opportunities and challenges at exascale. In 3rd ACM International Conference on Distributed Event-Based Systems, 2009.

[116] S. Ekanayake, S. Kamburugamuve, and G. Fox. SPIDAL: high performance data analytics with Java and MPI on large multicore HPC clusters. In Spring Simulation Multi-Conference, pages 3–6, 2016.

[117] J. Elliott, D. Kelly, J. Chryssanthacopoulos, M. Glotter, K. Jhunjhnuwala, N. Best, M. Wilde, and I. Foster. The parallel system for integrating impact models and sectors (pSIMS). Environmental Modelling & Software, 62:509–516, 2014.

[118] O. Etzioni. Deep learning isn’t a dangerous magic genie. It’s just math. Wired, June 2016. https://www.wired.com/2016/06/deep-learning-isnt-dangerous-magic-genie-just-math/.

[119] B. Familiar. Microservices, IoT and Azure: Leveraging DevOps and Microservice Architecture to deliver SaaS Solutions. APress, 2015.

[120] M. R. Ferré. Cloud native applications (for dummies), 2014. http://www.it20.info/2014/12/cloud-native-applications-for-dummies/.

[121] R. T. Fielding. Architectural styles and the design of network-based software architectures. PhD thesis, University of California, Irvine, 2000.

[122] J. Fischer, S. Tuecke, I. Foster, and C. A. Stewart. Jetstream: A distributed cloud infrastructure for underresourced higher education communities. In 1st Workshop on The Science of Cyberinfrastructure: Research, Experience, Applications and Models, pages 53–61. ACM, 2015.

[123] I. Foster. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing, 15(3):70–73, May 2011.

[124] I. Foster, K. Chard, and S. Tuecke. The discovery cloud: Accelerating and democratizing research on a global scale. In IEEE International Conference on Cloud Engineering, pages 68–77. IEEE, 2016.

[125] I. Foster, R. Ghani, R. S. Jarmin, F. Kreuter, and J. I. Lane, editors. Big Data and Social Science: A Practical Guide to Methods and Tools. Taylor & Francis Group, 2016. See also http://www.bigdatasocialscience.com.

[126] I. Foster and C. Kesselman. The history of the grid. In High Performance Computing: From Grids and Clouds to Exascale, pages 3–30. IOS Press, 2011.

[127] I. Foster, Y. Zhao, I. Raicu, and S. Lu. Cloud computing and grid computing 360-degree compared. In Grid Computing Environments Workshop, pages 1–10. IEEE, 2008.

[128] A. Fox, D. A. Patterson, and S. Joseph. Engineering Software as a Service: An Agile Approach using Cloud Computing. Strawberry Canyon LLC, 2013.

[129] G. Fox and D. Gannon. Using clouds for technical computing, 2013. http://www.academia.edu/14845479/Using_Clouds_for_Technical_Computing.

[130] G. Fox, S. Jha, and L. Ramakrishnan. Streaming and Steering Applications: Requirements and Infrastructure. http://streamingsystems.org.

[131] G. C. Fox, R. D. Williams, and G. C. Messina. Parallel computing works! Morgan Kaufmann, 2014.

[132] B. H. Frank. AWS wants to dominate beyond the public cloud with Lambda updates. PC World, Dec. 2016. http://www.pcworld.com/article/3147389/.

[133] D. Gannon. Performance Analysis of a Cloud Microservice-based ML Classifier, Oct 2015. https://esciencegroup.com/2015/10/08/performance-analysis-of-a-cloud-microservice-based-ml-classifier/.

[134] D. Gannon. CNTK revisited. A new deep learning toolkit release from Microsoft, Nov 2016. https://esciencegroup.com/2016/11/10/cntk-revisited-a-new-deep-learning-toolkit-release-from-microsoft/.

[135] D. Gannon, D. Fay, D. Green, K. Takeda, and W. Yi. Science in the cloud: Lessons from three years of research projects on Microsoft Azure. In 5th International Workshop on Scientific Cloud Computing, pages 1–8. ACM, 2014.

[136] Gartner Research. Software as a Service (SaaS). http://www.gartner.com/it-glossary/software-as-a-service-saas.

[137] K. Gee and W. Hunt. Enhancing stormwater management benefits of rainwater harvesting via innovative technologies. Journal of Environmental Engineering, 142(8):04016039, 2016.

[138] L. George. HBase: The Definitive Guide: Random Access to Your Planet-Size Data. O’Reilly Media, Inc., 2011.

[139] A. V. Gerbessiotis and L. G. Valiant. Direct bulk-synchronous parallel algorithms. Journal of Parallel and Distributed Computing, 22(2):251–267, 1994.

[140] S. Goasguen. Enjoy Kubernetes with Python. https://www.linux.com/learn/kubernetes/enjoy-kubernetes-python.

[141] J. Goecks, A. Nekrutenko, J. Taylor, and T. G. Team. Galaxy: A comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol, 11(8):R86, 2010.

[142] J. Gong, P. Yue, and H. Zhou. Geoprocessing in the Microsoft cloud computing platform–Azure. In Joint Symposium of ISPRS Technical Commission IV & AutoCarto, page 6, 2010.

[143] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[144] A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 6645–6649. IEEE, 2013.

[145] A. Greenberg, J. Hamilton, D. A. Maltz, and P. Patel. The cost of a cloud: Research problems in data center networks. ACM SIGCOMM Computer Communication Review, 39(1):68–73, 2008.

[146] K. Gremban. Get started with access management in the Azure portal. https://docs.microsoft.com/en-us/azure/active-directory/role-based-access-control-what-is.

[147] W. Gropp, E. Lusk, and R. Thakur. Using MPI-2: Advanced features of the Message Passing Interface. MIT Press, 1999.

[148] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan-Kaufmann, 2011.

[149] D. Hardt. OAuth 2.0 authorization framework specification, 2012. http://tools.ietf.org/html/rfc6749.

[150] J. A. Hartigan and M. A. Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108, 1979.

[151] K. Hashizume, D. G. Rosado, E. Fernández-Medina, and E. B. Fernandez. An analysis of security issues for cloud computing. Journal of Internet Services and Applications, 4(1):1, 2013.

[152] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[153] T. Hey, S. Tansley, and K. Tolle. The Fourth Paradigm: Data-Intensive Scientific Discovery. Kindle, 2009.

[154] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. H. Katz, S. Shenker, and I. Stoica. Mesos: A platform for fine-grained resource sharing in the data center. In USENIX Symposium on Networked Systems Design and Implementation, pages 22–22, 2011.

[155] B. Holzman. Fermilab HEPCloud: An elastic computing facility for High Energy Physics. In International Conference on Computing in High Energy Physics. 2016. https://indico.cern.ch/event/432527/contributions/1072465/.

[156] A. Howard. Running MPI applications in Amazon EC2, May 2015. https://cyclecomputing.com/running-mpi-applications-in-amazon-ec2/.

[157] W. Huang, A. Ganjali, B. H. Kim, S. Oh, and D. Lie. The state of public infrastructure-as-a-service cloud security. ACM Computing Surveys, 47(4):68, 2015.

[158] T. Hunt. Introducing “Have I been pwned?” – aggregating accounts across website breaches, Dec 2013. https://www.troyhunt.com/introducing-have-i-been-pwned/.

[159] A. Jain, S. P. Ong, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder, et al. The materials project: A materials genome approach to accelerating materials innovation. APL Materials, 1(1):011002, 2013.

[160] S. Jain, A. Kumar, S. Mandal, J. Ong, L. Poutievski, A. Singh, S. Venkata, J. Wanderer, J. Zhou, M. Zhu, et al. B4: Experience with a globally-deployed software defined WAN. ACM SIGCOMM Computer Communication Review, 43(4):3–14, 2013.

[161] Y. Jia and E. Shelhamer. Caffe. http://caffe.berkeleyvision.org/.

[162] B. Johnson. Cloud computing is a trap, warns GNU founder Richard Stallman. Guardian Newspaper, Sep 2008. https://www.theguardian.com/technology/2008/sep/29/cloud.computing.richard.stallman.

[163] B. Jones. Towards the European open science cloud, 2015. http://doi.org/10.5281/zenodo.16001.

[164] N. Jouppi. Google supercharges machine learning tasks with TPU custom chip. Google Cloud Platform Blog, May 2016. https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html.

[165] S. Kamburugamuve and G. Fox. Survey of distributed stream processing, Feb 2016. https://www.researchgate.net/publication/299411481.

[166] S. Kamburugamuve, P. Wickramasinghe, S. Ekanayake, and G. Fox. Anatomy of machine learning algorithm implementations in MPI, Spark, and Flink, Jan 2017. https://www.researchgate.net/publication/312426658.

[167] N. T. Karonis, K. L. Duffin, C. E. Ordoñez, B. Erdelyi, T. D. Uram, E. C. Olson, G. Coutrakon, and M. E. Papka. Distributed and hardware accelerated computing for clinical medical imaging using proton computed tomography (pCT). Journal of Parallel and Distributed Computing, 73(12):1605–1612, 2013.

[168] A. Karpathy. The unreasonable effectiveness of recurrent neural networks, May 2015. http://karpathy.github.io/2015/05/21/rnn-effectiveness.

[169] M. Kassner. A look at Amazon’s world class data center ecosystem, Dec 2014. http://www.techrepublic.com/article/a-look-at-amazons-world-class-data-center-ecosystem.

[170] S. Kemp. Password-less logins with OpenSSH, 2005. https://debian-administration.org/article/152/.

[171] R. D. King, J. Rowland, S. G. Oliver, M. Young, W. Aubrey, E. Byrne, M. Liakata, M. Markham, P. Pir, L. N. Soldatova, A. Sparkes, K. E. Whelan, and A. Clare. The automation of science. Science, 324(5923):85–89, 2009.

[172] G. Klimeck, M. McLennan, S. P. Brophy, G. B. Adams III, and M. S. Lundstrom. nanohub.org: Advancing education and research in nanotechnology. Computing in Science & Engineering, 10(5):17–23, 2008.

[173] S. Kulkarni, N. Bhagat, M. Fu, V. Kedigehalli, C. Kellogg, S. Mittal, J. M. Patel, K. Ramasamy, and S. Taneja. Twitter Heron: Stream processing at scale. In ACM SIGMOD International Conference on Management of Data, pages 239–250. ACM, 2015.

[174] H. S. Kuyuk, R. M. Allen, H. Brown, M. Hellweg, I. Henson, and D. Neuhauser. Designing a network-based earthquake early warning algorithm for California: ElarmS-2. Bulletin of the Seismological Society of America, 2013.

[175] M. Lamanna. The LHC computing grid project at CERN. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 534(1):1–6, 2004.

[176] K. A. Lawrence, M. Zentner, N. Wilkins-Diehr, J. A. Wernert, M. Pierce, S. Marru, and S. Michael. Science gateways today and tomorrow: Positive perspectives of nearly 5000 members of the research community. Concurrency and Computation: Practice and Experience, 27(16):4252–4268, 2015.

[177] J. Layton. A container for HPC. https://www.admin-magazine.com/HPC/Articles/Singularity-A-Container-for-HPC.

[178] J. A. Le, H. El-Askary, M. Allali, and D. Struppa. Application of recurrent neural networks for drought projections in California. Atmospheric Research, 187, 2017.

[179] H. Lee. Simple Azure. https://readthedocs.org/projects/simple-azure/.

[180] P. D. Lena, K. Nagata, and P. F. Baldi. Deep spatio-temporal architectures and learning for protein structure prediction. In Advances in Neural Information Processing Systems 25, pages 512–520. Curran Associates, Inc., 2012.

[181] Y. Li. Introduction to Docker secrets management. https://blog.docker.com/2017/02/docker-secrets-management.

[182] D. Lifka, I. Foster, S. Mehringer, M. Parashar, P. Redfern, C. Stewart, and S. Tuecke. XSEDE cloud survey report, 2013. http://hdl.handle.net/2142/45766.

[183] I. Liu and B. Ramakrishnan. Bach in 2014: Music composition with recurrent neural network. CoRR, abs/1412.3191, 2014.

[184] Q. Liu, J. Logan, Y. Tian, H. Abbasi, N. Podhorszki, J. Y. Choi, S. Klasky, R. Tchoua, J. Lofstead, R. Oldfield, M. Parashar, N. Samatova, K. Schwan, A. Shoshani, M. Wolf, K. Wu, and W. Yu. Hello ADIOS: The challenges and lessons of developing leadership class I/O frameworks. Concurrency and Computation: Practice and Experience, 26(7):1453–1473, 2014.

[185]

Y. Liu, A. Padmanabhan, and S. Wang. CyberGIS Gateway for enabling data-rich

geospatial research and education. Concurrency and Computation: Practice and

Experience,27(2):395–407,2015.

[186]

R. Madduri, K. Chard, R. Chard, L. Lacinski, A. Rodriguez, D. Sulakhe, D. Kelly,

U. Dave , and I. Foster. The Globus Galaxies platform: Delivering science gateways

as a service. Concurrency – Practice and Experience,27(16):4344–4360,2015.

[187]

R. K. Madduri, D. Sulakhe, L. Lacinski, B. Liu, A. Rodriguez, K. Chard, U. J. Dave,

and I. T. Foster. Experiences building Globus Genomics: A next-generation sequencing

analysis service using Galaxy, Globus, and Amazon Web Services. Concurrency

– Practice and Experience, 26(13):2266–2279, 2014.

[188]

P. K. Mantha, A. Luckow, and S. Jha. Pilot-MapReduce: An extensible and ﬂexible

MapReduce implementation for distributed data. In 3rd International Workshop on

MapReduce and Its Applications, pages 17–24. ACM, 2012.

[189]

J. Margolis. Amazon Echo’s role in deep space exploration.

Financial Times, Jan 2017.

https://www.ft.com/content/24529e30-d0e3-11e6-b06b-680c49b4b4c0.

[190]

N. Marz and J. Warren. Big Data – Principles and best practices of scalable realtime

data systems. Manning, 2015.

[191]

A. Matsunaga, J. Fortes, K. Keahey, and M. Tsugawa. Sky computing. IEEE

Internet Computing, 13:43–51, 2009.

[192] K. Matthias and S. P. Kane. Docker: Up and Running. O’Reilly, 2016.

[193]

W. McKinney. Python for Data Analysis: Data Wrangling with Pandas, NumPy,

and IPython. O’Reilly Media, 2015.

[194]

N. Mehrotra, L. Franks, P. McKay, R. McAllister, and J. Gao. Get started:

Create Apache Spark cluster in Azure HDInsight and run interactive queries

using Spark SQL.

https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-jupyter-spark-sql/.

[195]

N. Mehrotra, R. McMurray, L. Franks, and J. Gao. Machine learning: Predictive

analysis on food inspection data using MLlib with Apache Spark cluster

on HDInsight Linux.

https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-machine-learning-mllib-ipython.

[196]

P. Mehrotra, J. Djomehri, S. Heistand, R. Hood, H. Jin, A. Lazanoﬀ, S. Saini, and

R. Biswas. Performance evaluation of Amazon EC2 for NASA HPC applications.

In 3rd Workshop on Scientiﬁc Cloud Computing, pages 41–50. ACM, 2012.

[197]

P. Mell and T. Grance. The NIST deﬁnition of cloud computing. Special Publication

800-145, National Institute of Standards and Technology, 2011.

http://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-145.pdf.

[198]

X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman,

D. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia,

and A. Talwalkar. MLlib: Machine learning in Apache Spark. Journal of Machine

Learning Research, 17(34):1–7, 2016.

[199]

F. Meyer, D. Paarmann, M. D’Souza, R. Olson, E. M. Glass, M. Kubal, T. Paczian,

A. Rodriguez, R. Stevens, A. Wilke, et al. The metagenomics RAST server–A public

resource for the automatic phylogenetic and functional analysis of metagenomes.

BMC Bioinformatics, 9(1):386, 2008.

[200]

Microsoft Research Connections. MSR Courseware.

https://github.com/MSRConnections/Azure-training-course.

[201]

M. A. Miller, W. Pfeiﬀer, and T. Schwartz. Creating the CIPRES science gateway

for inference of large phylogenetic trees. In Gateway Computing Environments

Workshop, pages 1–8, 2010.

[202]

D. Milojičić, I. M. Llorente, and R. S. Montero. OpenNebula: A cloud management

tool. IEEE Internet Computing, 15(2):11–14, 2011.

[203]

N. M. Mohamed, H. Lin, and W.-C. Feng. Accelerating data-intensive genome

analysis in the cloud. In 5th International Conference on Bioinformatics and

Computational Biology, 2013.

[204]

T. P. Morgan. A rare peek at the massive scale of AWS. EnterpriseTech, Nov 2014.

http://www.enterprisetech.com/2014/11/14/rare-peek-massive-scale-aws.

[205]

A. Morin, J. Urban, P. D. Adams, I. Foster, A. Sali, D. Baker, and P. Sliz. Shining

light into black boxes. Science, 336(6078):159–160, 2012.

[206]

A. Mouat. Docker security: Using containers safely in production.

https://gallery.mailchimp.com/979c70339150d05eec1531104/files/Docker_Security_Red_Hat.pdf.

[207]

A. C. Muller and S. Guido. Introduction to Machine Learning with Python: A Guide

for Data Scientists. O’Reilly Publishing, 2017.

[208]

N. Nakata, J. P. Chang, J. F. Lawrence, and P. Boué. Body wave extraction and

tomography at Long Beach, California, with ambient-noise interferometry. Journal of

Geophysical Research: Solid Earth, 120(2):1159–1173, 2015.

[209]

F. Nelli. Python Data Analytics: Data Analysis and Science using Pandas, Matplotlib

and the Python Programming Language. Apress, 2015.

[210]

M. A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015.

http://neuralnetworksanddeeplearning.com.

[211]

B. Nikolic. Data processing for the Square Kilometre Array telescope.

http://www.mrao.cam.ac.uk/~bn204/publications/2015/SKA-SDP-Streaming.pdf.

[212]

D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseﬀ, and

D. Zagorodnov. The Eucalyptus open-source cloud-computing system. In 9th

IEEE/ACM International Symposium on Cluster Computing and the Grid, pages

124–131, 2009.

[213]

C. Olah. Understanding LSTM networks, Aug 2015.

http://colah.github.io/posts/2015-08-Understanding-LSTMs/.

[214]

C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: A not-so-

foreign language for data processing. In ACM SIGMOD International Conference

on Management of Data, pages 1099–1110. ACM, 2008.

[215]

R. Orihuela and D. Bass. Help wanted: Black belts in data, Jun 2015.

http://www.bloomberg.com/news/articles/2015-06-04/help-wanted-black-belts-in-data.

[216]

K. Ovtcharov, O. Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, and E. Chung. Toward

accelerating deep learning at scale using specialized hardware in the datacenter. In

27th HotChips Symposium on High-Performance Chips. IEEE, August 2015.

[217]

D. F. Parkhill. The Challenge of the Computer Utility. Addison-Wesley Educational

Publishers, 1966.

[218]

N. Paskin. Digital object identiﬁer (DOI) system. Encyclopedia of Library and

Information Sciences, 3:1586–1592, 2010.

[219]

F. Pérez and B. Granger. The state of Jupyter, Jan 2017.

https://www.oreilly.com/ideas/the-state-of-jupyter.

[220]

D. A. Phillips, C. Puskas, M. Santillan, L. Wang, R. W. King, W. M. Szeliga,

T. Melbourne, M. Murray, M. Floyd, and T. A. Herring. Plate Boundary Observatory

and related networks: GPS data analysis methods and geodetic products. Reviews

of Geophysics, 54:759–808, 2016.

[221]

I. Raicu, I. Foster, and Y. Zhao. Many-task computing for grids and supercomputers.

In IEEE Workshop on Many-Task Computing on Grids and Supercomputers, 2008.

[222]

K. Ram. Git can facilitate greater reproducibility and increased transparency in

science. Source Code for Biology and Medicine, 8(1):7, 2013.

[223]

L. Ramakrishnan, P. T. Zbiegel, S. Campbell, R. Bradshaw, R. S. Canon, S. Coghlan,

I. Sakrejda, N. Desai, T. Declerck, and A. Liu. Magellan: Experiences from a science

cloud. In 2nd International Workshop on Scientiﬁc Cloud Computing, pages 49–58.

ACM, 2011.

[224] S. Raschka. Python Machine Learning. Packt Publishing, 2016.

[225] K. Reitz. Requests: HTTP for humans. http://docs.python-requests.org.

[226] J. Richer. OAuth 2.0 token introspection. RFC 7662, IETF, 2015.

[227]

M. Rosenblum and T. Garﬁnkel. Virtual machine monitors: Current technology

and future trends. Computer, 38(5):39–47, 2005.

[228]

M. Russinovich. Report from Open Networking Summit: Achieving hyper-scale

with software deﬁned networking. http://bit.ly/2laCxLT.

[229]

S. Ryza, U. Laserson, S. Owen, and J. Wills. Advanced Analytics with Spark:

Patterns for Learning from Data at Scale. O’Reilly Media, 2015.

[230]

N. Sakimura, J. Bradley, M. Jones, B. d. Medeiros, and C. Mortimore. OpenID

Connect Core 1.0 incorporating errata set 1, 2014.

http://openid.net/specs/openid-connect-core-1_0.html.

[231]

D. Sanderson. Programming Google App Engine with Python: Build and Run

Scalable Python Apps on Google’s Infrastructure. O’Reilly Press, 2015.

[232]

M. Satyanarayanan. The emergence of edge computing. Computer,50(1):30–39,

2017.

[233]

C. Severance. Python for informatics: Exploring information, 2013.

http://www.pythonlearn.com/book.php.

[234]

D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche,

J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman,

D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach,

K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep

neural networks and tree search. Nature, 529(7587):484–489, 2016.

[235]

F. Simorjay. Shared responsibilities for cloud computing. Technical

report, Microsoft, Mar 2016.

https://gallery.technet.microsoft.com/Shared-Responsibilities-81d0ff91.

[236]

A. Singh, J. Ong, A. Agarwal, G. Anderson, A. Armistead, R. Bannon, S. Boving,

G. Desai, B. Felderman, P. Germano, et al. Jupiter rising: A decade of Clos

topologies and centralized control in Google’s datacenter network. ACM SIGCOMM

Computer Communication Review, 45(4):183–197, 2015.

[237]

L. Smarr and C. E. Catlett. Metacomputing. Communications of the ACM, 35(6):44–53,

1992.

[238] R. M. Stallman. Who does that server really serve? Boston Review, 35(2), 2010.

[239]

R. Stevens, P. Woodward, T. DeFanti, and C. Catlett. From the I-WAY to the

National Technology Grid. Communications of the ACM, 40(11):50–60, 1997.

[240]

C. Strasser. Git/GitHub: A primer for researchers, 2014.

http://datapub.cdlib.org/2014/05/05/github-a-primer-for-researchers/.

[241]

A. Szalay and J. Gray. The world-wide telescope. Science, 293(5537):2037–2040,

2001.

[242]

T. Tetrick. Best practices for securing access to your

Azure virtual machines, Jun 2014.

https://blogs.technet.microsoft.com/uspartner_ts2team/2014/06/04/best-practices-for-securing-access-to-your-azure-virtual-machines/.

[243]

D. Thain, T. Tannenbaum, and M. Livny. Distributed computing in practice:

The Condor experience. Concurrency and Computation: Practice and Experience,

17(2-4):323–356, 2005.

[244]

B. Tierney, J. Metzger, J. Boote, E. Boyd, A. Brown, R. Carlson, M. Zekauskas,

J. Zurawski, M. Swany, and M. Grigoriev. perfSONAR: Instantiating a global network

measurement framework. In SOSP Workshop on Real Overlays and Distributed

Systems, 2009.

[245]

J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither, A. Grimshaw, V. Hazlewood,

S. Lathrop, D. Lifka, G. D. Peterson, R. Roskies, J. R. Scott, and N. Wilkins-Diehr.

XSEDE: Accelerating scientiﬁc discovery. Computing in Science & Engineering,

16(5):62–74, 2014.

[246]

R. Tudoran, A. Costan, G. Antoniu, and H. Soncu. TomusBlobs: Towards

communication-eﬃcient storage for MapReduce applications in Azure. In 12th

IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pages

427–434, 2012.

[247]

S. Tuecke, R. Ananthakrishnan, K. Chard, M. Lidman, B. McCollam, and I. Foster.

Globus Auth: A research identity and access management platform. In 12th IEEE

International Conference on e-Science, 2016.

[248]

T. Tugend. UCLA to be ﬁrst station in nationwide computer network, July 1969.

http://www.lk.cs.ucla.edu/LK/Bib/REPORT/press.html.

[249]

J. Turnbull. The Docker Book: Containerization is the new virtualization. Kindle,

2014.

[250]

A. Vahdat. A look inside Google’s data center networks, 2015.

https://cloudplatform.googleblog.com/2015/06/A-Look-Inside-Googles-Data-Center-Networks.html.

[251] J. van Vliet and F. Paganelli. Programming AWS EC2. O’Reilly Press, 2011.

[252]

T. C. Vance, N. Merati, C. Yang, and M. Yuan. Cloud Computing in Ocean and

Atmospheric Sciences. Elsevier, 2016.

[253]

J. VanderPlas. Python Data Science Handbook: Essential Tools for Working with

Data. O’Reilly Media, 2017.

[254]

J. Varia. Tips for securing your EC2 instance.

https://aws.amazon.com/articles/1233.

[255]

N. Vijayakumar and B. Plale. Performance evaluation of rate-based join window

sizing for asynchronous data streams. In 13th IEEE International Symposium on

High Performance Distributed Computing, pages 260–261, 2004.

[256]

W. Vogels. MXNet – Deep learning framework of choice at

AWS, Nov 2016.

http://www.allthingsdistributed.com/2016/11/mxnet-default-framework-deep-learning-aws.html.

[257]

M. M. Waldrop. The Dream Machine: JCR Licklider and the Revolution that Made

Computing Personal. Viking Penguin, 2001.

[258] T. White. Hadoop: The Deﬁnitive Guide. O’Reilly Media, Inc., 2012.

[259]

M. Wilde, M. Hategan, J. M. Wozniak, B. Cliﬀord, D. S. Katz, and I. Foster. Swift:

A language for distributed parallel scripting. Parallel Computing, 37(9):633–652,

2011.

[260]

J. Wilkening, A. Wilke, N. Desai, and F. Meyer. Using clouds for metagenomics: A

case study. In IEEE International Conference on Cluster Computing, pages 1–6,

2009. http://www.mcs.anl.gov/papers/P1665A.pdf.

[261]

N. Wilkins-Diehr, D. Gannon, G. Klimeck, S. Oster, and S. Pamidighantam. TeraGrid

science gateways and their impact on science. Computer, 41(11), 2008.

[262]

K. Williams, E. Bilsland, A. Sparkes, W. Aubrey, M. Young, L. N. Soldatova,

K. De Grave, J. Ramon, M. de Clare, W. Sirawaraporn, S. G. Oliver, and R. D. King.

Cheaper faster drug development validated by the repositioning of drugs against

neglected tropical diseases. Journal of the Royal Society Interface,12(104):20141289,

2015.

[263] A. Wittig and M. Wittig. Amazon Web Services in Action. Manning Press, 2015.

[264]

D. Xue, P. V. Balachandran, J. Hogden, J. Theiler, D. Xue, and T. Lookman.

Accelerated search for materials with targeted properties by adaptive design. Nature

Communications, 7, 2016.

[265]

M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark:

Cluster computing with working sets. In HotCloud, 2010.

https://www.usenix.org/legacy/event/hotcloud10/tech/full_papers/Zaharia.pdf.

[266]

Y. Zheng, X. Chen, Q. Jin, Y. Chen, X. Qu, X. Liu, E. Chang, W.-Y. Ma, Y. Rui,

and W. Sun. A cloud-based knowledge discovery system for monitoring ﬁne-grained

air quality. Technical Report MSR-TR-2014–40, Microsoft Research, 2014.

Index

23andMe genotyping service, 253

academic cloud, 6

access tokens, in OAuth2, 231

ACID semantics, in database, 27

activation functions, 205

actors, 96

ADIOS, 165

Advanced Message Queuing Protocol, 112, 122

Advanced Photon Source, 240

Amazon cloud services, 29

Amazon Machine Learning, 203

Athena analytics, 149

Aurora relational database service, 33

Batch, 305

CloudFormation, 100

CloudTrail auditing, 318

CloudWatch metrics, 318

Deep Learning AMI, 212

DynamoDB, 31, 116, 309

EC2 Container Service, 343

EC2 Container Service (ECS), 114–120

Elastic Block Store (EBS), 29, 77, 264, 305

Elastic Compute Cloud, 75–80

Elastic File System (EFS), 29

Elastic MapReduce (EMR), 31, 143–146

Elasticsearch Service, 34

Glacier archival storage, 6, 31

Identity and Access Management (IAM), 115,

320

Lex voice input, 202

Polly text to speech, 202

Redshift data warehouse, 33

Rekognition deep learning, 202

Relational Data Service (RDS), 33

Relational Database Service (RDS), 306, 309

Route 53, 305

Simple Email Service (SES), 309

Simple Queue Service (SQS), 34, 116, 117, 170

Simple Storage Service (S3), 10, 31, 38–41,

309

Titan graph database, 34

Virtual Private Cloud, 305

Virtual Private Cloud (VPC), 309

Animoto, 299

Apache CloudStack, 260

Apache libcloud Python SDK, 38

Apache Parquet, 150

Apache software foundation

Beam, 184

Flink, 187

Kafka, 180

YARN, 137

application whitelisting, 318

Argonne National Laboratory, 240

Aristotle academic cloud, 7, 70

Array of Things urban observatory, 164, 171, 339

artiﬁcial neural networks, 204

arXiv, document classiﬁer for, 113, 197

Atmosphere, 35

Aurora relational database service, 33

AWS Batch, 305

Azure cloud services, 29

Azure Batch, 105

Azure Stack private cloud, 260

Blob storage service, 31

Cortana cognitive services, 220

Data Lake, 33, 148

DocumentDB, 33

Event Hubs, 34, 162, 175–179

File Storage, 30

Graph Engine, 34

HDInsight, 32

Machine Learning, 197–201

Queue storage service, 34

Quick Start orchestration, 105

Role-Based Access Control (RBAC), 319

Security Center service, 318

SQL Database service, 33

Storage Explorer, 31, 44

Stream Analytics, 175–179

Table storage service, 32

Threat Analytics service, 318

U-SQL data analytics tool, 149

Virtual Machines, 80–81

Azure Stack private cloud software, 260

back propagation, 205

bag of task parallelism, 107

BigQuery, 150

Binder, 91

Bionimbus academic cloud, 7

bisection bandwidth, 106

Blob, 31

blob, binary large object, 24

Bridges computer system, 284

bucket, storage aggregation concept, 10

as used in Amazon cloud, 31

bulk synchronous parallelism (BSP), 67, 96, 108

Cayley graph database, 30, 34

Celery, a Python package, 122

Ceph distributed storage system, 35

CfnCluster (CloudFormation Cluster), 100

Chameleon academic cloud, 7, 284

Cinder, an OpenStack service, 284

client-side encryption, 323

Cloud BI, 67

cloud bursting, 6, 70

Cloud Datalab, 67

cloud native application, 62, 298, 335–337

Cloud Native Computing Foundation, 335

Cloud Pub/Sub, 34

Cloud Security Alliance, 328

cloud, types of

academic, 6

community, 6

discovery, 346

hybrid, 6

private, 5

public, 4

CloudBridge Python SDK, 38, 50, 84

CloudStack cloud software, 260, 261

CloudTrail, an Amazon cloud service, 318

CloudWatch, an Amazon cloud service, 318

CNTK, 210

see Microsoft Cognitive Toolkit, 210

community cloud, 6

container, aggregation construct in object store, 24

container, server virtualization method, 64, 85–94

compared with virtual machine, 66

Docker support for, 86

sharing secrets, 320

Singularity as alternative to Docker, 94

Content distribution networks, 339

convolutional neural network, 207

cost

comparative studies in physics, 128, 332

of cloud for pCT analysis, 99

of diﬀerent instance types, 80

savings by elastic provisioner, 80

data model, 26

data stream analytics, 161

Data Transfer Node, 244

database management system (DBMS), 26

dataﬂow, 67

Datalab, 151

deep learning, 134, 204–212

TensorFlow toolkit, 215–218

deep neural network, 98, 206

Department of Energy, xiii, 128, 245

DevOps, 111

DMagic system, 240

Docker, 85

Swarm container management, 67, 125, 320

document store, 27

DSpace, 91

edge computing, 338

Elastic MapReduce (EMR), 143–146

enhanced networking, 101

ESnet, 128, 245

Eucalyptus cloud software, 73, 261–281

deployment planning, 263

euca2ools command line interface, 275

single cluster cloud, 267

eventual consistency, 27

Fermilab, 128, 332

Field Programmable Gate Arrays, 338

ﬁle shares, 29

ﬁltered back projection, 99

fourth paradigm, 128

Galaxy workﬂow system, 83, 91, 304

Ganglia monitoring tool, 100

gcfuse, 128

Genome Wide Association Study (GWAS), 253

GeoDeepDive, 128

Gigabit testbeds, 330

GitHub, 13

and cloud access keys, 317, 326

Glance, an OpenStack service, 284

Globus application examples

data sharing at Advanced Photon Source, 240

NCAR Research Data Archive, 247

Sanger Institute Imputation Service, 253

Globus Genomics, 108, 128

Globus research data management service, 298

accounts, 236

endpoints, 226

Globus Connect, 51, 304

identity providers supported, 236

publication service, 239

Google cloud services, 29

AppEngine, 82

BigQuery, 33

Bigtable NoSQL, 47–48

Cayley graph database, 30, 34

Cloud Bigtable, 32

Cloud Dataﬂow, 184

Cloud Datastore, 30, 32, 122

Cloud Pub/Sub, 34

Cloud pub/sub, 122

Cloud SQL, 33

Cloud Storage, 31

Coldline archival storage, 31

Compute Engine, 30, 82

Datastore NoSQL, 48–50

Kubernetes, 120–124

local SSD storage, 30

persistent disk storage, 30

Spanner, 33

storage services, 46–50

graph execution model, 96

Graphics processing unit (GPU), 80, 98, 99, 212,

218, 335

Hadoop, 96

Hadoop Distributed File System (HDFS), 136

HBase NoSQL database, 158

HDInsight, 147

Health Insurance Portability and Accountability Act

(HIPAA), 323

HEPCloud project, 128

high-performance computing, 94, 97–107, 283

and streaming, 165

on Amazon cloud, 100–103

on Azure cloud, 105–106

scaling challenges, 106

Hive data warehouse tool, 158

HTCondor job management system, 67, 128, 304

hybrid cloud, 6

hypervisor, 64, 74, 286

InCommon identity management federation, 226

inﬁnite loop, see loop, inﬁnite

infrastructure as a service (IaaS), 1, 63, 262, 318

Internet of Things, 175, 339

iRODS, 91

iSCSI Extensions for RDMA, 287

Jetstream academic cloud, 7, 35, 284

Jupyter, 13

JupyterHub multiuser system, 326

Kafka, 180

key pair, obtaining for Amazon cloud, 38, 75, 112

key-value store, 27

Keystone, an OpenStack service, 284

Kinesis, 34

Kinesis Analytics, 167

Kinesis Firehose, 167

Kinesis Streams, 167

Kubernetes, 67, 97

Lambda at the Edge, 339

local SSD, Google cloud service, 30

logistic function, 193

logistic regression, 193

loop, inﬁnite, see inﬁnite loop

Lustre parallel ﬁle system, 24

Machine learning, 191

scikit-learn package, 91

Vowpal Wabbit, 91, 198

machine learning, 134, 191–223

Amazon Machine Learning platform, 202–203

Azure Machine Learning, 197–201

MXNet open source library, 212–215

Spark MLlib, 192–197

magic operators, 143

manager worker parallelism, 107

many task parallelism, 66, 96, 107–108

MapReduce, 67, 96, 108

Mesos, 67, 97, 99

Message Passing Interface (MPI), 66

application to proton therapy, 99

in the cloud, 97

metacomputer, 330

microservice, 96, 110–122

and cloud native applications, 335

managing keys for, 320

Microsoft cloud, see Azure cloud services

Microsoft Cognitive Toolkit, 96, 210, 218

multitenancy, 288, 298, 301, 306

MySQL, 26

Amazon Aurora compatible with, 33

National Institute of Standards and Technology, xiii,

3, 328

National Institutes of Health, xiii

National Science Foundation, xiii

National Security Agency, 321

Neutron, an OpenStack service, 284

NGINX web proxy with load balancing, 125

Nimbus cloud software, 73

NoSQL database, 27

Nova, an OpenStack service, 284

OAuth 2.0 authorization framework, 231

object store, 24

object, cloud storage unit, 10

Ocean Observatories Initiative, 163

Oozie workﬂow management tool, 158

Open Compute Project, 337

Open Researcher and Contributor ID (ORCID), 236

OpenID Connect Core 1.0 (OIDC), 231, 236

OpenNebula cloud software, 73, 259, 261

OpenStack

Cinder block storage, 284

Glance image service, 284

Keystone identity component, 284

Neutron networking component, 284

Nova compute component, 284

Swift object storage, 284

OpenStack cloud services

Shared File Systems, 35

Swift object storage, 34

OpenStack cloud software, 73, 259, 283–296

and HPC, 284

and scientiﬁc workloads, 285

core services, 284

deployment, 288

persistent disk, Google cloud service, 30

personal health information (PHI), 323

Phoenix, 158

Pig, 158

platform as a service (PaaS), 3, 318

Portable Operating System Interface (POSIX), 23,

PostgreSQL, 26, 33, 80, 309

private cloud, 5

public cloud, 4

pros and cons, 68

publish/subscribe, 34

Python packages

Apache libcloud, 38

Azure Data Lake Store SDK, 148

Boto3 SDK, 11, 40, 76, 92, 168

Celery remote procedure call, 122

CloudBridge SDK, 38, 50, 84

Globus Auth SDK, 231–239

Globus Transfer SDK, 53, 227–230, 239

Google Cloud SDK, 46

Requests HTTP library, 253

scikit-learn machine learning, 91, 192

query language, 26

RabbitMQ message broker, 122, 126

recurrent neural network (RNN), 208

RedCloud academic cloud, 7

reinforced learning, 220

relational database, 26

Relational Database Service, 306

Representational State Transfer, 9

research data portal, 243

resilient distributed dataset (RDD), 138, 171, 192,

194

resource owner, 232

resource server, 232

role-based security, 113

Route 53, the Amazon service, 305

Sanger Imputation Service, 253

Scala, 138

scale, challenges of, 66

Science DMZ, 244

science gateway, 302

scikit-learn machine learning, 91

Secure Socket Layer, 321

server-side encryption, 322

serverless computing, 62, 67

service-level agreement, 107

Simple Azure, 81

single program multiple data, 96

Single-Root I/O Virtualization, 287

SMB, 30

software as a service (SaaS), 2, 299–301, 318

software development kits, 10

solid state disk (SSD), 30, 98

Spark, 67, 96, 137–143

DataFrames, 142, 192

Simple example program, 138

Streaming, 170

Spark MLlib (machine learning), 192–197

Chicago restaurant example, 193

Estimators, 192

Pipeline, 192

Transformers, 192

Storage Service Encryption, 323

Swarm, 67

Swift parallel scripting language, 34

Swift, an OpenStack service, 284

TensorFlow machine learning library, 96, 157, 207–

208, 212, 215–218, 338, 344

Titan graph database, 34

topologies, 180

training set, 194

Transport Layer Security, 321

tumbling window, 178

U-SQL data analytics tool, 149

Union File System, 86

Urban informatics, 163

UUID, universally unique identiﬁer

use to name buckets, 46

used by Globus, 53, 236

virtual machine, 64, 73–84

compared with container, 66

instance storage, 77

Virtual Private Cloud, 305

virtual private network, 324, 326

virtualization, 74

VMWare Cloud Foundation, 260

Vowpal Wabbit learning system, 91, 198

webHDFS, 148

XSEDE, 35

YARN, 137

Zeppelin web-based notebook, 143
