Part I

Managing Data in the Cloud

[Figure: roadmap of the book's five parts. Part I: Managing data in the cloud. Part II: Computing in the cloud. Part III: The cloud as platform. Part IV: Building your own cloud. Part V: Security and other topics.]

Data storage was the first manifestation of the cloud. Amazon’s S3 public data
storage service was launched in 2006. In 2008, Dropbox was introduced as a cloud
service that could replace sharing files by passing around USB flash drives.
Microsoft’s SkyDrive cloud storage service, introduced in 2007, was later
integrated with a service called Live Mesh that allowed synchronization across
multiple machines, and in 2014 it was rebranded as OneDrive following a legal
dispute over the use of the word “Sky.” Google introduced Google Drive in 2012.

These services all demonstrate the utility of a cloud service that gives you
access to data anywhere, at any time, and on any device. However, they represent
only one of the data storage models that are important for cloud computing. In
this first part of the book we explore the following models.

• File system storage is the well-known model of organizing data into folders
  and directories. In the cloud, file storage is usually accessed by attaching a
  virtual disk to a virtual machine.

• Blob storage, where Blob is shorthand for Binary Large Object, provides
  a flat object model for data. It is extremely scalable, in ways that are
  challenging for file systems.

• Databases provide highly structured data collections. We consider three
  primary types of database in this book:
    - Relational databases, which have a formal algebra of composition that
      can be invoked by the structured query language, SQL.
    - Tables and NoSQL databases, which are more easily distributed over
      multiple machines.
    - Graph databases, in which data are represented as a graph of nodes
      and edges.

• Data warehouses support search over massive amounts of data.
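As a rough illustration of how three of these models differ, the following sketch uses only the Python standard library. It is not tied to any cloud provider: the dictionary standing in for a blob store, the `readings` table, the sample graph, and the helper names `list_keys`, `average_temp_by_run`, and `reachable` are all hypothetical and chosen only for this example.

```python
import sqlite3

# Blob/object model: a flat namespace that maps keys to bytes.
# The "/" inside a key is only a naming convention, not a real directory.
object_store = {
    "experiments/run1/data.csv": b"t,temp\n0,20.5\n",
    "experiments/run2/data.csv": b"t,temp\n0,21.0\n",
    "README.txt": b"sample data",
}


def list_keys(store, prefix=""):
    """Return the sorted keys that start with the given prefix."""
    return sorted(key for key in store if key.startswith(prefix))


def average_temp_by_run(rows):
    """Relational model: load (run, t, temp) rows into a table and query with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (run TEXT, t INTEGER, temp REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT run, AVG(temp) FROM readings GROUP BY run ORDER BY run"
    ).fetchall()
    conn.close()
    return result


# Graph model: nodes and directed edges, stored here as adjacency lists.
graph = {
    "sample_A": ["instrument_1"],
    "instrument_1": ["lab_X"],
    "lab_X": [],
}


def reachable(graph, start):
    """Return the set of nodes reachable from start by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen
```

Listing keys by prefix mimics how a flat object store can present folder-like views without storing any hierarchy, while the SQL query and the graph traversal show the kinds of questions the other two models are organized to answer.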

The two chapters in Part I first explore these different models and then describe
the landscape of storage offerings from the various cloud providers. One important
capability of cloud data models for science is the support that they provide for
managing data remotely; each provides an API and SDK that can be used to script
data management tasks. We use various simple examples to illustrate the use of
Python SDKs for Amazon, Azure, Google, and OpenStack. We also describe the
Globus file management services, which are of particular importance for scientific
applications in which data are often produced and consumed outside the cloud
and need to move seamlessly among different locations.
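To give a flavor of what scripting a data management task with a Python SDK can look like, here is a minimal sketch written against the client interface of boto3, the AWS SDK for Python. It is an illustration under stated assumptions rather than an example from the chapters that follow: the helper name `upload_and_list`, the bucket name, and the file paths are hypothetical, and the function takes the client as a parameter so that it does not itself require credentials.

```python
def upload_and_list(s3_client, bucket, local_path, key):
    """Upload one local file, then return the keys stored under the same prefix.

    s3_client is assumed to behave like boto3.client("s3"): it must offer
    upload_file(Filename, Bucket, Key) and list_objects_v2(Bucket=..., Prefix=...).
    """
    s3_client.upload_file(local_path, bucket, key)

    # Treat everything up to the last "/" in the key as a folder-like prefix.
    prefix = (key.rsplit("/", 1)[0] + "/") if "/" in key else ""

    # A single list_objects_v2 call returns at most 1000 keys; a real script
    # would paginate. "Contents" is absent when nothing matches the prefix.
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return [obj["Key"] for obj in response.get("Contents", [])]


# Hypothetical usage with real credentials configured (names are placeholders):
#   import boto3
#   keys = upload_and_list(boto3.client("s3"), "my-bucket",
#                          "results.csv", "experiments/run1/results.csv")
```

Passing the client in makes the same function usable with any object that offers these two methods, which also makes it easy to try out without touching a real bucket.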