Part I
Managing Data in the Cloud
[Figure: roadmap of the book's five parts. Part I: Managing data in the cloud. Part II: Computing in the cloud. Part III: The cloud as platform. Part IV: Building your own cloud. Part V: Security and other topics.]
Data storage was the first manifestation of the cloud. Amazon’s S3 public data
storage service was launched in 2006. In 2008, Dropbox was introduced as a cloud
service that could replace the practice of sharing files by passing around USB
flash drives. That same year, Microsoft introduced its SkyDrive cloud storage
service, which was later integrated with a service called Live Mesh that allowed
synchronization across multiple machines. In 2014, SkyDrive was rebranded as
OneDrive following a lawsuit over the use of the word “Sky.” Google introduced
Google Drive in 2012.
These services all demonstrate the utility of a cloud service that gives you
access to data anywhere, at any time, and on any device. However, they
represent only one of the data storage models that are important for cloud
computing. In this first part of the book we explore the following models.
File system storage is the well-known model of organizing data into folders
and directories. In the cloud, file storage is usually accessed by attaching a
virtual disk to a virtual machine.

Blob storage, where Blob is shorthand for Binary Large Object, provides
a flat object model for data. It is extremely scalable, in ways that are
challenging for file systems.

Databases provide highly structured data collections. We consider three
primary types of database in this book:
  1. Relational databases, which have a formal algebra of composition that
     can be invoked by the structured query language, SQL.
  2. Tables and NoSQL databases, which are more easily distributed over
     multiple machines.
  3. Graph databases, in which data are represented as a graph of nodes
     and edges.

Data warehouses that can support and enable search over massive amounts
of data.
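
As a rough illustration of how these models differ, the following Python sketch represents the same tiny, made-up dataset four ways using only the standard library. It is not how any cloud service is implemented: the run names, owners, and helper functions (`list_blobs`, `reachable`) are assumptions introduced here, a plain dictionary stands in for a blob store and for a NoSQL table, and an in-memory SQLite database stands in for a relational service.

```python
import sqlite3

# Blob storage: a flat namespace that maps keys to bytes.
# The "/" in a key is only a naming convention, not a real directory.
blobs = {
    "runs/2016/a.csv": b"1,2,3",
    "runs/2016/b.csv": b"4,5,6",
    "runs/2017/c.csv": b"7,8,9",
}

def list_blobs(store, prefix):
    """Return the sorted keys that start with prefix (folder-like listing)."""
    return sorted(key for key in store if key.startswith(prefix))

# Relational database: a table with a fixed schema, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (name TEXT PRIMARY KEY, year INTEGER, owner TEXT)")
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?)",
    [("a", 2016, "ada"), ("b", 2016, "bob"), ("c", 2017, "ada")],
)
runs_2016 = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM runs WHERE year = ? ORDER BY name", (2016,)
    )
]

# NoSQL-style table: rows are looked up by a (partition, row) key, and each
# row may carry its own set of fields.
table = {
    ("ada", "a"): {"year": 2016},
    ("bob", "b"): {"year": 2016, "note": "rerun"},
    ("ada", "c"): {"year": 2017},
}
ada_rows = sorted(row for (partition, row) in table if partition == "ada")

# Graph: nodes and directed edges stored as adjacency lists.
graph = {"ada": ["a", "c"], "bob": ["b"], "a": [], "b": [], "c": []}

def reachable(graph, start):
    """Return the set of nodes reachable from start, including start."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen

print(list_blobs(blobs, "runs/2016/"))
print(runs_2016)
print(ada_rows)
print(sorted(reachable(graph, "ada")))
```

The blob listing works purely by string prefix, which is why a flat object model is easy to spread across many machines, whereas the SQL query depends on a schema that the database enforces.
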
The two chapters in Part I first explore these different models and then describe
the landscape of storage offerings from the various cloud providers. One important
capability of cloud data models for science is the support that they provide for
managing data remotely; each provides an API and SDK that can be used to script
data management tasks. We use various simple examples to illustrate the use of
Python SDKs for Amazon, Azure, Google, and OpenStack. We also describe the
Globus file management services, which are of particular importance for scientific
applications in which data are often produced and consumed outside the cloud
and need to move seamlessly among different locations.
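
To give a flavor of what scripting a data management task can look like, here is a minimal Python sketch that uploads a local directory tree to an object store. The `InMemoryObjectStore` class and its `put` and `list` methods are hypothetical stand-ins invented for this illustration. They are not the API of Amazon, Azure, Google, OpenStack, or Globus, whose SDKs each have their own calls, and the file names are made up.

```python
import tempfile
from pathlib import Path

class InMemoryObjectStore:
    """Hypothetical stand-in for a cloud SDK client; not a real provider API."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        """Store bytes under a flat string key."""
        self._objects[key] = data

    def list(self, prefix=""):
        """Return the sorted keys that start with prefix."""
        return sorted(k for k in self._objects if k.startswith(prefix))

def upload_directory(store, local_dir, prefix):
    """Upload every file under local_dir, using its relative path as the key."""
    root = Path(local_dir)
    uploaded = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            key = f"{prefix}/{path.relative_to(root).as_posix()}"
            store.put(key, path.read_bytes())
            uploaded.append(key)
    return uploaded

# Usage: build a small throwaway directory tree and upload it.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "raw").mkdir()
    (Path(tmp) / "raw" / "sample.csv").write_text("x,y\n1,2\n")
    (Path(tmp) / "notes.txt").write_text("calibration run\n")
    store = InMemoryObjectStore()
    keys = upload_directory(store, tmp, "experiment-1")

print(keys)
```

With a real SDK, only the body of `put` and `list` would change; the loop that walks the local files and derives object keys stays the same.
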