Chapter 2
Storage as a Service
“As a general rule the most successful man in life is the man who has
the best information.”
—Benjamin Disraeli
Science is concerned above all with data: with their acquisition, preservation,
organization, analysis, and exchange. Thus we begin this book with a discussion
of major cloud data storage services. These services collectively support a wide
range of data storage models, from unstructured objects to relational tables, and
offer a variety of performance, reliability, and cost characteristics. Collectively
they provide the scientist or engineer with a wonderfully rich, if initially daunting,
set of data storage capabilities.
In this chapter and the next, we introduce important cloud data storage
concepts and illustrate how these concepts are realized in major cloud storage
systems, using a range of examples to show how to use these systems to outsource
simple data management tasks. In later chapters, we build on this foundation
to show how cloud storage systems can be used in conjunction with other cloud
services to construct powerful data management and analysis capabilities, for
example when a data store is combined with an event notification service and
compute service to enable analysis of streaming data.
2.1 Three Motivating Examples
Many cloud services that we use routinely—services such as Box, Dropbox,
OneDrive, Google Docs, YouTube, Facebook, and Netflix—are, above all, data
services. Each run s in the cloud, hosts digital content in the cloud, and provides
specialized methods for accessing, storing, and sharing that content. Each is
built on one or more—often multiple—storage services that have been variously
optimized for speed, scale, reliability, and/or consistency. Our interest here is in
how the basic infrastructure components on which these applications are built can
be applied to science and engineering problems. To address this question, we need
to understand the various types of storage systems that are common in the cloud
so that we may evaluate their relative merits. To provide context, we consider the
following three science and engineering use cases.
UC1
A climate science laboratory has assembled a set of simulation output files,
each in Network Common Data Form (NetCDF) format: some 20 TB in
total. These data are to be made accessible via interactive tools running in
a web portal. The data sizes are such that data need to be partitioned to
enable distributed analyses over multiple machines running in parallel.
UC2
A seismic observatory is acquiring records describing experimental observa-
tions, each specifying the time of the observation, experimental parameters,
and the measurement itself, in CSV format. There may be a total of 1,000,000
records totaling some 100 TB when they finish collecting. They need to store
these data to enable easy access by a large team, and to permit tracking of
the data inventory and its accesses.
UC3
A team of scientists operates a collection of several thousand instruments,
each of which generates a data record every few seconds. The individual
records are not large, but managing the aggregate stream of all outputs
in order to perform analyses across the entire collection every few hours
introduces data management challenges. This problem is similar to that of
analyzing large web traffic or social media streams.
Each use case requires different storage and processing models. In the paragraphs
and chapters that follow, we divide these scenarios into more specific data
collection examples and show how each can map to specific cloud storage services.
2.2 Storage Models
Before reviewing specific cloud storage services, we say a few words about storage
models: that is, the different ways in which data can be organized in a storage
system. An exciting feature of cloud storage systems is that they support a wide
range of different storage models: not just the file systems that most researchers use
on a daily basis, but also more specialized models such as object stores, relational
databases, table stores, NoSQL databases, graph databases, data warehouses, and
archival storage. Furthermore, their implementations of these models are often
highly scalable, adapting easily from megabytes to hundreds of terabytes and
beyond, while avoiding the need for dedicated operations expertise on the part
of the user. Given the many challenges faced by scientists and engineers as they
struggle with rapidly growing data, cloud storage can thus be the answer to their
prayers. But one needs to understand the properties of these different storage
models in order to choose the right system.
The right storage model for a data collection can depend on not only the
nature and size of the data, but also the analyses to be performed, sharing plans,
and update frequencies, among other factors. We review here some of the more
important storage models supported by cloud storage services and, for each, their
principal capabilities and their pros and cons for various purposes. This material
provides background for the detailed service descriptions in the next chapter.
2.2.1 File Systems
The one storage model with which every scientist and engineer is surely familiar
is the file system, organized around a tree of directories or folders. This model
has proven to be an extremely intuitive and useful data storage abstraction. The
standard API for the Unix-derived version of the file system is called the Portable
Operating System Interface (POSIX). We are all familiar with the POSIX file
system: we use it every day on our Apple, Linux, or Windows computer. Using
command line tools, graphical user interfaces, or APIs, we create, read, write, and
delete files located within directories.
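These POSIX-style operations look the same from any language; here, for instance, is a short Python sketch that creates a directory tree, then writes, reads, and deletes a file within it (the directory and file names are invented for illustration):

```python
import pathlib
import tempfile

# Create a directory tree inside a scratch area:
root = pathlib.Path(tempfile.mkdtemp())
datadir = root / "experiments" / "run01"
datadir.mkdir(parents=True)

# Create, write, and read back a file:
f = datadir / "results.csv"
f.write_text("time,value\n0,1.5\n")
print(f.read_text().splitlines()[0])  # time,value

# Delete the file; the directory remains.
f.unlink()
print(f.exists())  # False
```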
The file system storage model has important advantages. It allows for the
direct use of many existing programs without modification: we can navigate a file
system with familiar file system browsers, run programs written in our favorite
analysis tool (Python, R, Stata, SPSS, Mathematica, etc.) on files, and share
files via email. The file system model also provides a straightforward mechanism
for representing hierarchical relationships among data—directories and files—and
supports concurrent access by multiple readers. In the early 1990s, the POSIX
model was extended to distributed network file systems and there were some
attempts at wide area versions. By the late 1990s, the Linux Cluster community
created Lustre, a true parallel file system that supports the POSIX standard.
The file system model also has disadvantages as a basis for science and engi-
neering, particularly as data volumes grow. From a data modeling perspective, it
provides no support for enforcing conventions concerning the representation of data
elements and their relationships. Thus, while one may, for example, choose to store
environmental data in NetCDF files, genomes in FASTA files, and experimental
observations in comma-separated-value (CSV) files, the file system does nothing
to prevent the use of inconsistent representations within those files. Furthermore,
the rigid hierarchical organization enforced by a file system often does not match
the relationships that one wants to capture in science. Lacking any information on
the data model, file systems cannot help users navigate complex data collections.
The file system model also has problems from a scalability perspective: the need
to maintain consistency as multiple processes read and write a file system can lead
to bottlenecks in file system implementations.
For these reasons, cloud storage services designed for large quantities of data
frequently adopt different storage models, as we discuss next.
2.2.2 Object Stores
The object storage model, like the file system model, stores unstructured binary
objects. In the database world, objects are often referred to as blobs, for binary
large object, and we use that name here when it is consistent with terminology
adopted by cloud vendors. An object/blob store simplifies the file system model
in important ways: in particular, it eliminates hierarchy and forbids updates to
objects once created. Different object storage services differ in their details, but in
general they support a two-level folder-file hierarchy that allows for the creation of
object containers, each of which can hold zero or more objects. Each object is
identified by a unique identifier and can have various metadata associated with
it. Objects cannot be modified once uploaded: they can only be deleted or, in
object stores that support versioning, replaced.
We can use this storage model to store the NetCDF data in use case UC1. As
shown in figure 2.1 on the following page, we create a single container and store
each NetCDF file in that container as an object, with the NetCDF filename as the
object identifier. Any authorized individual who possesses that object identifier
can then access the data via simple HTTP requests or API calls.
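The container/object semantics just described can be sketched with a small in-memory class. This is illustrative only, not any vendor's API; all names here are invented:

```python
class ObjectStore:
    """Minimal in-memory sketch of object-store semantics:
    a two-level container/object hierarchy with immutable objects."""

    def __init__(self):
        self.containers = {}

    def create_container(self, name):
        self.containers.setdefault(name, {})

    def put_object(self, container, object_id, data, metadata=None):
        objects = self.containers[container]
        if object_id in objects:
            # Objects cannot be modified once uploaded.
            raise ValueError("objects are immutable: delete or version instead")
        objects[object_id] = (bytes(data), metadata or {})

    def get_object(self, container, object_id):
        data, _meta = self.containers[container][object_id]
        return data

    def delete_object(self, container, object_id):
        del self.containers[container][object_id]

# Use case UC1: one container, one object per NetCDF file,
# with the filename as the object identifier.
store = ObjectStore()
store.create_container("climate-sim-outputs")
store.put_object("climate-sim-outputs", "run042.nc", b"...netcdf bytes...",
                 metadata={"format": "NetCDF"})
print(store.get_object("climate-sim-outputs", "run042.nc"))
```

A real object store would return the data over HTTP rather than from memory, but the access pattern, a container name plus an object identifier, is the same.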
The object store model has important advantages in terms of simplicity, performance,
and reliability. The fact that objects cannot be modified once created
makes it easy to build highly scalable and reliable implementations. For example,
each object can be replicated across multiple physical storage devices to increase
resilience and (when there are many concurrent readers) performance, without any
specialized synchronization logic in the implementation to deal with concurrent
updates. Objects can be moved manually or automatically among storage classes
with different performance and cost parameters.
The object store model also has limitations. It provides little support for
organizing data and no support for search: A user must know an object’s identifier
in order to access it. Thus, an object store would likely be inadequate as a basis for
organizing the 1,000,000 environmental records of UC2: We would need to create
a separate index to map from file characteristics to object identifiers. Nor does
an object store provide any mechanism for working with structured data. Thus,
for example, while we could load each UC2 dataset into a separate object, a user
would likely have to download the entire object to a computer to compute on its
contents. Finally, an object store cannot easily be mounted as a file system or
accessed with existing tools in the ways that a file system can.
Figure 2.1: Object storage model with versioning. Each NetCDF file is stored in a separate
container, and all versions of the same NetCDF file are stored in the same container.
2.2.3 Relational Databases
A database is a structured collection of data about entities and their relationships.
It models real-world objects, both entities (e.g., microscopes, experiments, and
samples, as in UC2 above) and relationships (e.g., “sample1” was tested on “July
24”), and captures structure in ways that allow these entities and relationships
to be queried for analysis. A database management system (DBMS) is a
software suite designed to safely store and efficiently manage databases and to
assist with the maintenance and discovery of the relationships that databases
represent. In general, a DBMS encompasses three components: its data model
(which defines how data are represented), its query language (which defines
how the user interacts with the data), and support for transactions and crash
recovery (to ensure reliable execution despite system failures).
For a variety of reasons, science and engineering data often belong in a database
rather than in files or objects. The use of a DBMS simplifies data management and
manipulation and provides for efficient querying and analysis, durable and reliable
storage, scaling to large data sizes, validation of data formats, and management of
concurrent accesses.
While DBMSs in general, and cloud-based DBMSs in particular, support a
wide variety of data formats, two major classes can be distinguished: relational
and NoSQL. (We also discuss another class, graph databases, below.)
Relational DBMSs allow for the efficient storage, organization, and analysis of
large quantities of tabular data: data organized as tables, in which rows represent
entities (e.g., experiments) and columns represent attributes of those entities
(e.g., experimenter, sample, result). The associated Structured Query Language
(SQL) can then be used to specify a wide range of operations on such tables,
such as compositions and joins. For example, the following SQL joins two tables,
Experiments and People, to find all experiments performed by Smith:
select experiment_id from Experiments, People
  where Experiments.person_id = People.person_id
  and People.name = 'Smith';
SQL statements can be executed with high efficiency thanks to sophisticated
indexing and query planning techniques. Thus this join can be executed quickly
even if there are millions of records in the tables being joined.
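This join can be tried directly from Python using the standard library's SQLite bindings; the table contents here are invented for illustration:

```python
import sqlite3

# Build two small tables matching the join example in the text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE People (person_id INTEGER, name TEXT);
    CREATE TABLE Experiments (experiment_id INTEGER, person_id INTEGER);
    INSERT INTO People VALUES (1, 'Smith'), (2, 'Jones');
    INSERT INTO Experiments VALUES (10, 1), (11, 2), (12, 1);
""")

# The join from the text: all experiments performed by Smith.
rows = conn.execute("""
    SELECT experiment_id FROM Experiments, People
    WHERE Experiments.person_id = People.person_id
    AND People.name = 'Smith'
    ORDER BY experiment_id
""").fetchall()

print(rows)  # [(10,), (12,)]
```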
Many open source, commercial, and cloud-hosted relational DBMSs exist.
Among the open source DBMSs, MySQL and PostgreSQL (often simply Postgres)
are particularly widely used. Both MySQL and Postgres are available in
cloud-hosted forms. In addition, cloud vendors offer specialized relational DBMSs
that are designed to scale to particularly large data sizes.
Relational databases have two important properties. First, they support a
relational algebra that provides a clear, mathematical meaning to the SQL language,
facilitating efficient and correct implementations. Second, they support ACID
semantics, a term that captures four important database properties: Atomicity
(the entire transaction succeeds or fails), Consistency (the data collection is never
left in an invalid or conflicting state), Isolation (concurrent transactions cannot
interfere with each other), and Durability (once a transaction completes, system
failures cannot invalidate the result).
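Atomicity, for example, can be observed directly in any relational DBMS. In this small SQLite sketch, a transaction that fails partway through leaves no trace:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, value REAL)")

# A transaction that fails partway through is rolled back entirely:
try:
    with conn:  # commits on success, rolls back on exception
        conn.execute("INSERT INTO samples VALUES (1, 3.14)")
        conn.execute("INSERT INTO samples VALUES (1, 2.72)")  # duplicate key!
except sqlite3.IntegrityError:
    pass

# Neither insert survived: the table is still empty.
print(conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0])  # 0
```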
2.2.4 NoSQL Databases
While relational DBMSs have long dominated the database world, other technologies
have become popular for some application classes. A relational DBMS is almost
certainly the right technology to use for highly structured datasets of moderate size.
But if your data are less regular (if, for example, you are dealing with large amounts
of text or if different items have different properties) or extremely large, you may
want to consider a NoSQL DBMS. The design of these systems has typically been
motivated by a desire to scale the quantities of data and number of users that can
be supported, and to deal with unstructured data that are not easily represented
in tabular form. For example, a key-value store can organize large numbers of
records, each of which associates an arbitrary key with an arbitrary value. (A
variant called a document store permits text search on the stored values.)
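In its simplest form, a key-value store behaves like a persistent dictionary, and the document-store variant adds search over the stored values. The sketch below is illustrative only; it mirrors no particular product's API:

```python
class KeyValueStore:
    """Sketch of the key-value model: arbitrary key -> arbitrary value."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]


class DocumentStore(KeyValueStore):
    """Document-store variant: adds naive text search over stored values."""

    def search(self, term):
        return [k for k, v in self._data.items() if term in v]


docs = DocumentStore()
docs.put("obs-001", "temperature reading at station A")
docs.put("obs-002", "humidity reading at station B")
print(docs.search("temperature"))  # ['obs-001']
```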
NoSQL databases have limitations relative to relational databases. The name
NoSQL is derived from “non SQL,” meaning that they do not support the full
relational algebra. For example, they typically do not support queries that join
two tables, such as those shown above.
Another definition of NoSQL is “not only SQL,” meaning that most of SQL
is supported but other properties are available. For example, a NoSQL database
may allow for the rapid ingest of large quantities of unstructured data, such as
the instrument events of UC3. Arbitrary data can be stored without modifications
to a database schema, and new columns introduced over time as data and/or
understanding evolves. NoSQL databases in the cloud are often distributed over
multiple servers and also replicated over different data centers. Hence they often
fail to satisfy all of the ACID properties. Consistency is often replaced by eventual
consistency, meaning that database state may be momentarily inconsistent across
replicas. This relaxation of ACID properties is acceptable if your concern is to
respond rapidly to queries about the current state of a store’s inventory. It may
be unacceptable if the data in question are, for example, medical records.
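Eventual consistency can be sketched as replicas that apply updates asynchronously: a read served by a lagging replica returns stale data until propagation completes. All names in this sketch are invented for illustration:

```python
import collections

class EventuallyConsistentStore:
    """Two replicas: writes go to the primary and are propagated
    to the secondary only when sync() runs (simulating replication lag)."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self.pending = collections.deque()

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))  # replication is deferred

    def read(self, replica, key):
        store = self.primary if replica == "primary" else self.secondary
        return store.get(key)

    def sync(self):
        # Apply all pending updates; the replicas converge.
        while self.pending:
            key, value = self.pending.popleft()
            self.secondary[key] = value

s = EventuallyConsistentStore()
s.write("inventory:widget", 41)
print(s.read("secondary", "inventory:widget"))  # None: not yet propagated
s.sync()
print(s.read("secondary", "inventory:widget"))  # 41: replicas have converged
```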
Challenges of scale: The CAP theorem (from Foster et al. [125]). For many
years, and still today, the big relational database vendors (Oracle, IBM, Sybase,
Microsoft) were the mainstay of how data were stored. During the Internet boom,
startups looking for low-cost alternatives to commercial relational DBMSs turned
to MySQL and PostgreSQL. However, these systems proved inadequate for big sites
as they could not cope well with large traffic spikes, as for example when many
customers all suddenly wanted to order the same item. That is, they did not scale.
An obvious solution to scaling databases is to distribute and/or replicate data
across multiple computers, for example by distributing different tables, or different
rows from the same table. However, distribution and replication also introduce
challenges, as we now explain. Let us first define some terms. In a system that
comprises multiple computers:
Consistency indicates that all computers see the same data at the same time.
Availability indicates that every request receives a response about whether it
succeeded or failed.
Partition tolerance indicates that the system continues to operate even if a
network failure prevents computers from communicating.
An important result in distributed systems (the “CAP Theorem” [78]) observes
that it is not possible to create a distributed system with all three properties. This
situation creates a challenge with large transactional datasets. Distribution is needed
for high performance, but as the number of computers grows, so too does the
likelihood of network disruption among computers [65]. As strict consistency cannot
be achieved at the same time as availability and partition tolerance, the DBMS
designer must choose between high consistency or high availability for a particular
system.
The right combination of availability and consistency will depend on the needs of
the service. For example, in an e-commerce setting, we may choose high availability
for a checkout process to ensure that revenue-producing requests to add items to
a shopping cart are honored. Errors can be hidden from the customer and sorted
out later. However, for order submission, when a customer submits an order, we
should favor consistency, because several services (credit card processing, shipping
and handling, reporting) need to access the data simultaneously.
2.2.5 Graph Databases
A graph is a data structure in which edges connect nodes. Graphs are useful when
we need to search data based on relationships among data items. For example, in
UC2, measurements from different experiments might be related by their use of the
same measurement modality; in a database of scientific publications, publications
can be represented as nodes and citations, shared authors, or even shared concepts
as edges. Often graph databases are built on top of existing NoSQL databases.
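A minimal sketch of the publication example: papers as nodes, citations as edges, and a traversal that follows relationships rather than matching table rows. The graph contents are invented for illustration:

```python
import collections

# Publications as nodes; directed "cites" edges (invented data).
cites = {
    "paperA": ["paperB", "paperC"],
    "paperB": ["paperD"],
    "paperC": [],
    "paperD": [],
}

def reachable(graph, start):
    """All publications transitively cited by `start` (breadth-first search)."""
    seen, frontier = set(), collections.deque([start])
    while frontier:
        node = frontier.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append(neighbor)
    return seen

print(sorted(reachable(cites, "paperA")))  # ['paperB', 'paperC', 'paperD']
```

A graph database provides this kind of relationship-following query as a first-class operation, typically with its own query language, rather than requiring the application to implement the traversal itself.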
2.2.6 Data Warehouses
The term data warehouse is commonly used to refer to data management systems
optimized to support analytic queries that involve reading large datasets. Data
warehouses have different design goals and properties than do DBMSs. For example,
a medical center’s clinical DBMS is typically designed to enable many concurrent
requests to read and update information on individual patients (e.g., “what is
Ms. Smith’s current weight?” or “Mr. Jones was prescribed Aspirin”). Data from
this DBMS are uploaded periodically (e.g., once a day) into the medical center’s
data warehouse to support aggregate queries such as “What factors are correlated
with length of stay?” As we discuss in the next section, several cloud vendors offer
data warehouse solutions that can scale to extremely large data volumes.
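The contrast between the two workloads can be seen in miniature with SQLite: the transactional query touches one row, while the warehouse-style query aggregates over the whole table. The data are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stays (patient TEXT, ward TEXT, days INTEGER);
    INSERT INTO stays VALUES
        ('p1', 'cardiology', 5), ('p2', 'cardiology', 7),
        ('p3', 'oncology', 12), ('p4', 'oncology', 10);
""")

# Transactional (clinical DBMS style): one patient's record.
one = conn.execute("SELECT days FROM stays WHERE patient = 'p1'").fetchone()

# Analytic (warehouse style): an aggregate over the whole table.
avg = conn.execute("""
    SELECT ward, AVG(days) FROM stays GROUP BY ward ORDER BY ward
""").fetchall()

print(one)  # (5,)
print(avg)  # [('cardiology', 6.0), ('oncology', 11.0)]
```

At warehouse scale the second kind of query reads billions of rows, which is why warehouse systems use columnar layouts and massive parallelism rather than the row-oriented, update-friendly structures of a transactional DBMS.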
2.3 The Cloud Storage Landscape
The major public cloud companies provide a rich collection of storage services.
The cloud providers described here are Amazon Web Services (hereafter referred
to as Amazon), Microsoft Azure (hereafter Azure), and Google Cloud (hereafter
Google). Table 2.1 lists selected offerings from these three major vendors. (When
a single vendor has multiple services in a category, those services tend to have quite
different characteristics.) There are also other, more specialized storage services
not listed in the table, some of which we mention in the following. We expand
upon each of the rows in this table in the text that follows.
We do not list OpenStack options in the table because they are dependent
upon the specific deployment, are not part of the OpenStack standard, and are
not as extensive as those provided by the three main public clouds. Nevertheless,
some standards for file services exist, as we discuss in section 2.3.7 on page 34.
2.3.1 File Systems
File systems (also referred to as file shares) are virtual data drives that can be
attached to virtual machines. We describe the following services in greater detail
in part II of the book. Amazon’s Elastic Block Store (EBS) and Elastic File
System (EFS) services offer related but different services. EBS is a device that
Table 2.1: Storage as a service options from major public cloud vendors.

Model       Amazon                    Google                  Azure
----------  ------------------------  ----------------------  --------------------
Files       Elastic File System       Google Cloud            Azure File Storage
            (EFS), Elastic Block      attached file system
            Store (EBS)
Objects     Simple Storage Service    Cloud Storage           Blob Storage Service
            (S3)
Relational  Relational Database       Cloud SQL, Spanner      Azure SQL
            Service (RDS), Aurora
NoSQL       DynamoDB, HBase           Cloud Datastore,        Azure Tables, HBase
                                      Bigtable
Graph       Titan                     Cayley                  Graph Engine
Warehouse   Redshift                  BigQuery                Data Lake
analytics
you can mount onto a single Amazon EC2 compute server instance at a time; it is
designed for applications that require low-latency access to data from a single EC2
instance. For example, you might use it to store working data that are to be read
and written frequently by an application, but that are too large to fit in memory.
EFS, in contrast, is a general-purpose file storage service. It provides a file system
interface, file system access semantics (e.g., strong consistency, file locking), and
concurrently-accessible storage for many Amazon EC2 instances. You might use
EFS to hold state that is to be read and written by many concurrent processes.
Note that both EBS and EFS can be accessed directly only by EC2 instances, that
is, from inside the Amazon cloud.
Google Compute Engine has a different attached storage model. There are
three types of attached disks (and also a way to attach an object store). The
cheapest, persistent disks, can be up to 64 TB in size. Local SSD (solid state
disk) is higher performance but more expensive and can be up to 3 TB. Finally,
RAM disk is in-memory, limited to 208 GB, and expensive. Persistent disks can
be accessed anywhere in a zone, but SSD and RAM are only accessible by the
instance to which they are attached.
The Azure File Storage service allows users to create file shares in the cloud
that can be accessed by a special protocol, SMB, that allows Microsoft Windows
VMs and Linux VMs to mount these file shares as standard parts of their file
system. These file shares can also be mounted on your Windows or Mac.
2.3.2 Object Stores
Amazon’s Simple Storage Service (S3) was historically its first cloud service.
It is highly popular: as of 2016, it reportedly holds trillions of objects in billions
of containers, which S3 calls buckets. S3 is a classic object store, with all of the
properties listed in section 2.2.2 on page 24. We describe S3 in more detail in
section 3.2 on page 38, where we also present examples of its use. The related
Glacier service is designed for long-term, secure, durable, extremely low cost data
archiving. Access times for an object in Glacier may be several hours, so this is
not for applications that need rapid data access.
Google’s Cloud Storage, like Amazon S3, provides a basic object storage
system that is durable, replicated, and highly available. It supports three storage
tiers, each with different performance and price levels. The most expensive is
Standard multiregional, the mid-range tier is Regional (DRA), and the bottom
tier is Nearline. Standard storage is for data you expect to access often, DRA
is for batch jobs for which response time is not a critical issue, and Nearline is
for cold storage and disaster recovery. Google Cloud also has Coldline, which is
similar to AWS Glacier.
Azure Storage offers a suite of services with a similar scope in terms of
models supported to those provided by Amazon and Google. Azure provides the
user with a unified view of many of its storage types associated with their account,
as shown in figure 2.2 on the next page. This integration means that you can use
the Azure Storage Explorer tool (storageexplorer.com) to see and manage all of
these storage products from your PC or Mac. While Azure storage services were
originally optimized for close integration with Microsoft Windows environments,
Linux is now an important part of Azure, so this differentiation is less noticeable.
The Azure Blob storage service, like Amazon’s S3, is concerned with highly
reliable storage of unstructured objects, which Microsoft calls blobs. Like Amazon
and Google Cloud, Azure blob storage has tiered storage and pricing. The tiers
are hot for frequently accessed data and cool for data that are accessed less often.
2.3.3 NoSQL Services
Amazon’s DynamoDB is a powerful NoSQL database based on an extensible
key-value model: for each row, the primary key column is the only required attribute,
but any number of additional columns can be defined, indexed, and made searchable
in various ways, including full-text search via Elasticsearch. DynamoDB’s rich
feature set defies a concise description, but we illustrate some of its uses in
section 3.2 on page 38. Related to this is Amazon Elastic MapReduce (EMR),
Figure 2.2: Major Azure storage types.
discussed in section 8.3 on page 143, which allows analysis of large quantities of
data with Spark and other data analytics platforms.
Cloud Bigtable, Google’s highly scalable NoSQL database service, is the
same database that powers many core Google services, including Search, Analytics,
Maps, and Gmail. Bigtable maps two arbitrary strings (row key and column key)
and a timestamp (permitting versioning and garbage collection) to an associated
arbitrary byte array. It is designed to handle such large and sparse datasets in a
manner that is efficient in space used and that supports massive workloads, while
providing low latency and high bandwidth. You deploy Bigtable on a Google-hosted
cluster, which can be dynamically resized if needed. The open source Apache
HBase database system, fundamental to the Apache Hadoop system, is compatible
with Bigtable. Google’s Cloud Datastore has many similarities to Bigtable. An
important difference is that it implements ACID semantics and thus the user
does not have to wait for possible inconsistencies to be resolved, as is required for
Bigtable. Cloud Datastore has a much richer set of SQL-like operators than do
many of the other NoSQL systems described here.
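Bigtable's data model, a mapping from (row key, column key, timestamp) to a byte array, can be sketched as a sparse in-memory map. This is an illustration of the model only, not the Bigtable API; the row and column names are invented:

```python
class SparseTable:
    """Sketch of the Bigtable data model: (row key, column key, timestamp)
    maps to a byte array; reads return the most recent version."""

    def __init__(self):
        self.cells = {}

    def put(self, row, column, timestamp, value):
        self.cells[(row, column, timestamp)] = value

    def get_latest(self, row, column):
        versions = [(ts, v) for (r, c, ts), v in self.cells.items()
                    if r == row and c == column]
        return max(versions)[1] if versions else None

t = SparseTable()
t.put("sensor#42", "readings:temp", 1000, b"21.5")
t.put("sensor#42", "readings:temp", 2000, b"22.1")  # a newer version
print(t.get_latest("sensor#42", "readings:temp"))   # b'22.1'
```

The timestamp dimension is what permits the versioning and garbage collection mentioned above: older versions remain addressable until a retention policy discards them.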
The Azure Table storage service is a simple NoSQL key-value store, designed
to support the highly reliable storage of any large number of key-value pairs. It is
similar to Amazon DynamoDB. Its query capabilities are limited, but it can support
many queries at modest cost. Azure HDInsight provides an implementation of the
Hadoop storage service hosted on Azure cloud computers, with implementations of
popular big data tools, including Spark, the HBase NoSQL database, and the Hive
SQL database implemented to run efficiently and scalably on top of that Hadoop
fabric. We describe this service in more detail in part III. DocumentDB is a
NoSQL service, like Table, but supporting full text indexing and query, albeit at
higher cost due to the greater resources needed for indexing.
2.3.4 Relational Databases
Relational databases are a mature technology, so the main innovation in the cloud
is deployments that scale to especially large sizes.
Amazon’s Relational Database Service (RDS) allows you to set up a conventional
relational database (e.g., MySQL or Postgres) on Amazon computers,
thus permitting MySQL and Postgres applications to be ported to Amazon without
change. The MySQL-compatible Amazon Aurora service provides higher
scalability, performance, and resilience than an RDS MySQL instance: it can scale
to many terabytes, replicate data across data centers (known as availability zones),
and create many read replicas to support large numbers of concurrent reads.
Google’s Cloud SQL relational database service has similar capabilities to
those provided by such services in Amazon and Azure. Their Spanner system [101]
is a globally distributed relational database that provides ACID transactions and
SQL semantics with high scaling and availability.
Azure’s SQL Database provides a relational database service similar to
Amazon RDS. It is based on their mature SQL Server technology and is highly
available and scalable.
2.3.5 Warehouse Analytics
Cloud data warehouses are designed specifically for running analytics queries over
large collections. You interact with them from the cloud portal or via REST APIs.
We discuss these systems in greater detail in part III.
Amazon Redshift is a data warehouse system, designed to support high-performance
execution of analytic and reporting workloads against large datasets.
For massive data analytics, Google provides the BigQuery petascale data
warehouse. BigQuery is fully distributed and replicated, so durability is not an
issue. It also supports SQL query semantics.
The Azure Data Lake is a full suite of data analytics tools built on the
open-source YARN and WebHDFS platforms.
2.3.6 Graphs and More
Each of the three cloud providers provides not only graph databases but also
other services, including messaging and stream services that, like warehouses, are
important tools for stream and log analytics. Messaging services allow applications
to send and receive messages using what are referred to as publish/subscribe
semantics. They allow one application to wait on a queue for a message to arrive
while other applications prepare the message and send it to the queue. This
capability is important for many cloud applications that need to distribute work
tasks or to process streams of incoming events. We discuss messaging applications
in chapters 7 through 9.
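This wait-on-a-queue pattern can be sketched with Python's standard thread-safe queue. Cloud messaging services add durability, fan-out to many subscribers, and scale, but the core semantics are the same:

```python
import queue
import threading

work_queue = queue.Queue()
results = []

def consumer():
    # Wait on the queue for messages; a None message signals shutdown.
    while True:
        message = work_queue.get()
        if message is None:
            break
        results.append(f"processed {message}")

worker = threading.Thread(target=consumer)
worker.start()

# Another application prepares messages and sends them to the queue.
for record in ["event-1", "event-2", "event-3"]:
    work_queue.put(record)
work_queue.put(None)  # shutdown signal
worker.join()

print(results)  # ['processed event-1', 'processed event-2', 'processed event-3']
```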
Amazon’s Titan extension to DynamoDB supports graph databases. Other
Amazon services not listed in table 2.1 include the Simple Queue Service (SQS:
see section 9.3 on page 167) and the Elasticsearch Service, a cloud-hosted
version of the Elasticsearch open-source search and analytics engine. Amazon
Kinesis, also discussed in section 9.3, supports analysis of stream data.
Google supports the open source graph database Cayley. Its Cloud Pub/Sub
service provides messaging in a similar manner to Amazon SQS.
The Azure graph database, Graph Engine, is a distributed, in-memory large
graph processing system. The Queue service is the Azure pub/sub service, similar
to Google Cloud Pub/Sub. The Azure Event Hubs service is similar to Amazon
Kinesis. We return to these services in chapter 9.
2.3.7 OpenStack Storage Services and Jetstream
The OpenStack open source cloud software supports only a few standard storage
services: object storage, block storage, and file system storage. However, many
research groups and some large companies such as IBM are investing substantial
effort in improving this situation.
The OpenStack object storage service is called Swift (not to be confused with the Swift parallel scripting language [259] or Apple’s Swift language). Like Amazon S3 and the Azure Blob service, Swift implements a REST API that users can use to store, delete, manage permissions of, and associate metadata with immutable unstructured data objects located within containers. These objects are replicated across multiple storage servers for fault tolerance and performance reasons, and can be accessed from anywhere.
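To make the phrase "implements a REST API" concrete: Swift addresses every object by a URL of the form /v1/&lt;account&gt;/&lt;container&gt;/&lt;object&gt;, authenticates requests with an X-Auth-Token header, and attaches user metadata via X-Object-Meta-* headers; a PUT to the URL stores the object, GET retrieves it, and DELETE removes it. The sketch below only constructs the request components; the endpoint, account, and token values are invented placeholders, not a real deployment.

```python
def swift_object_request(endpoint, account, container, obj, token, metadata=None):
    """Build the URL and headers for a Swift object operation.

    Swift addresses an object as <endpoint>/v1/<account>/<container>/<object>;
    PUT stores it, GET retrieves it, DELETE removes it, and
    X-Object-Meta-* headers carry user-defined metadata.
    """
    url = f"{endpoint}/v1/{account}/{container}/{obj}"
    headers = {"X-Auth-Token": token}
    for name, value in (metadata or {}).items():
        headers[f"X-Object-Meta-{name}"] = value
    return url, headers

# Placeholder endpoint, account, and token; a real deployment returns
# these from its authentication service.
url, headers = swift_object_request(
    "https://swift.example.org", "AUTH_project", "results", "run42.nc",
    token="placeholder-token", metadata={"experiment": "climate-run-42"})

print(url)  # → https://swift.example.org/v1/AUTH_project/results/run42.nc
```

An HTTP library (for example, Python's urllib.request) would then issue the PUT or GET against this URL with these headers.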
The OpenStack Shared File Systems service, like the Amazon EFS and Azure File service, implements a file system model in the cloud environment. Users interact with this service by mounting remote file systems, called shares, on their virtual machine instances. They can also create shares, configuring the file system protocol supported; manage access to shares; delete shares; and configure rate limits and quotas, among other things. Shares can be mounted on any number of client machines, using NFS, CIFS, GlusterFS, or HDFS drivers. Shares can be accessed only from virtual machine instances running on the OpenStack cloud.
Many OpenStack deployments use the Ceph distributed storage system to manage their storage. This open source system supports object storage with interfaces compatible with Amazon S3 and OpenStack Swift. It also supports block-level and file-level storage.
We use the U.S. National Science Foundation’s Jetstream cloud for OpenStack examples. Operated as part of the XSEDE supercomputer project [245] (xsede.org), Jetstream is more experimental than the large public clouds because it is designed to support new classes of interactive scientific computing. Jetstream runs the OpenStack object store, based on Ceph, which implements the Swift API. The primary user interaction with Jetstream is through a system known as Atmosphere, built by the University of Arizona as part of the NSF iPlant collaborative. Atmosphere is designed to manage virtual machines, data, and visualization tools for communities of scientists, and provides a volume management system for mounting external volumes on VMs. We explore Atmosphere in greater detail in section 5.5 on page 82. Jetstream also operates the Globus identity, group, and file management services, which we describe in the next chapter.
2.4 Summary
As this chapter has illustrated, the data storage models used in the cloud are as varied as the types of data and the types of processing that scientists would want to use. Let us return to our use cases and see how they map to the data types.
The first use case, involving climate simulation output files in NetCDF format, is clearly a case for Amazon S3, Google Cloud Storage, or Azure Blob Storage. Each blob can be up to 1 TB in size (5 TB for S3). As we show in chapter 3, each service provides simple APIs that can be used to access data. Another solution for moving the data to and from S3 is to use the Globus file transfer protocols, which have been optimized for managing big data objects: see section 3.6 on page 51.
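Object size matters in practice: S3, for example, accepts at most 5 GB in a single PUT request, so a larger climate output file must be split into a multipart upload (parts of 5 MB to 5 GB each, at most 10,000 parts, 5 TB total). A small planning sketch, independent of any provider SDK; the 1 GB default part size is an arbitrary choice for illustration:

```python
import math

# Amazon S3 limits: single PUT <= 5 GB; multipart objects up to 5 TB.
GB = 1024 ** 3
SINGLE_PUT_LIMIT = 5 * GB
MAX_OBJECT_SIZE = 5 * 1024 * GB      # 5 TB

def plan_upload(size_bytes, part_size=1 * GB):
    """Return ('single', 1) or ('multipart', n_parts) for an object of
    the given size, based on S3's request and object size limits."""
    if size_bytes > MAX_OBJECT_SIZE:
        raise ValueError("object exceeds the 5 TB S3 object limit")
    if size_bytes <= SINGLE_PUT_LIMIT:
        return ("single", 1)
    return ("multipart", math.ceil(size_bytes / part_size))

print(plan_upload(2 * GB))       # → ('single', 1)
print(plan_upload(100 * GB))     # → ('multipart', 100)
```

The provider SDKs (and tools such as Globus) perform this kind of splitting automatically; the sketch just shows why a "simple" upload of a large file is more than one request.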
The second use case, involving 1,000,000 records describing experimental observations, could also be handled with simple blob storage, but the cloud presents us with better solutions. The simplest is to use a standard relational SQL database, but the merits of this approach depend on how strict we are with the schema that describes the data. Do all records have the same structure, or do some have fields that others do not? In the latter case, a NoSQL database may be a superior solution. Other factors are scale and the possible need for parallel access. Cloud NoSQL stores like Azure Tables, Amazon DynamoDB, and Google Bigtable are massively scalable and replicated. Unlike conventional SQL database solutions, they are designed for highly parallel massive data analysis.
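The schema question above can be checked mechanically: before committing to a relational table, compare the field sets of a sample of records. A minimal sketch, with invented observation records:

```python
def uniform_schema(records):
    """Return True if every record has exactly the same set of fields,
    i.e., the data fits a fixed relational schema."""
    field_sets = {frozenset(record) for record in records}
    return len(field_sets) <= 1

# Invented example records: the third adds a field the others lack.
observations = [
    {"id": 1, "instrument": "A", "temp_K": 288.1},
    {"id": 2, "instrument": "B", "temp_K": 287.9},
    {"id": 3, "instrument": "A", "temp_K": 288.4, "humidity": 0.61},
]

print(uniform_schema(observations[:2]))  # → True: fits a fixed SQL schema
print(uniform_schema(observations))      # → False: varying fields suggest NoSQL
```

A relational database would force the extra field into a nullable column for every record; a NoSQL table simply stores each record with whatever fields it has.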
The third use case, involving a massive set of instrument event records, is also appropriate for cloud NoSQL databases. However, data warehouses such as Amazon Redshift and Azure Data Lake are designed to be complete platforms for performing data analytics on massive data collections. If our instrument records are streaming in real time, we can use event streaming tools based on publish/subscribe semantics, as we discuss in chapter 9.
2.5 Resources
The storage capabilities of the public clouds are rapidly evolving and thus it is important to consult the relevant documentation, which is easily accessed from the cloud portals: aws.amazon.com for Amazon, azure.microsoft.com for Azure, and cloud.google.com for Google.
Troy Hunt uses the example of his “Have I Been Pwned” site, which uses the Azure Table service to enable rapid searches against more than one billion compromised accounts, to illustrate some of the pros and cons of Azure’s Table, DocumentDB, and SQL Database services [158].