Chapter 2
Storage as a Service
“As a general rule the most successful man in life is the man who has
the best information.”
—Benjamin Disraeli
Science is concerned above all with data: with their acquisition, preservation,
organization, analysis, and exchange. Thus we begin this book with a discussion
of major cloud data storage services. These services collectively support a wide
range of data storage models, from unstructured objects to relational tables, and
offer a variety of performance, reliability, and cost characteristics. Collectively
they provide the scientist or engineer with a wonderfully rich, if initially daunting,
set of data storage capabilities.
In this chapter and the next, we introduce important cloud data storage
concepts and illustrate how these concepts are realized in major cloud storage
systems, using a range of examples to show how to use these systems to outsource
simple data management tasks. In later chapters, we build on this foundation
to show how cloud storage systems can be used in conjunction with other cloud
services to construct powerful data management and analysis capabilities, for
example when a data store is combined with an event notification service and
compute service to enable analysis of streaming data.
2.1 Three Motivating Examples
Many cloud services that we use routinely—services such as Box, Dropbox,
OneDrive, Google Docs, YouTube, Facebook, and Netflix—are, above all, data
services. Each run s in the cloud, hosts digital content in the cloud, and provides
specialized methods for accessing, storing, and sharing that content. Each is
built on one or more—often multiple—storage services that have been variously
optimized for speed, scale, reliability, and/or consistency. Our interest here is in
how the basic infrastructure components on which these applications are built can
be applied to science and engineering problems. To address this question, we need
to understand the various types of storage systems that are common in the cloud
so that we may evaluate their relative merits. To provide context, we consider the
following three science and engineering use cases.
UC1
A climate science laboratory has assembled a set of simulation output files,
each in Network Common Data Form (NetCDF) format: some 20 TB in
total. These data are to be made accessible via interactive tools running in
a web portal. The data sizes are such that data need to be partitioned to
enable distributed analyses over multiple machines running in parallel.
UC2
A seismic observatory is acquiring records describing experimental observa-
tions, each specifying the time of the observation, experimental parameters,
and the measurement itself, in CSV format. There may be a total of 1,000,000
records totaling some 100 TB when they finish collecting. They need to store
these data to enable easy access by a large team, and to permit tracking of
the data inventory and its accesses.
UC3
A team of scientists operates a collection of several thousand instruments,
each of which generates a data record every few seconds. The individual
records are not large, but managing the aggregate stream of all outputs
in order to perform analyses across the entire collection every few hours
introduces data management challenges. This problem is similar to that of
analyzing large web traffic or social media streams.
Each use case requires different storage and processing models. In the paragraphs
and chapters that follow, we divide these scenarios into more specific data
collection examples and show how each can map to specific cloud storage services.
2.2 Storage Models
Before reviewing specific cloud storage services, we say a few words about storage
models: that is, the different ways in which data can be organized in a storage
system. An exciting feature of cloud storage systems is that they support a wide
range of different storage models: not just the file systems that most researchers use
on a daily basis, but also more specialized models such as object stores, relational
databases, table stores, NoSQL databases, graph databases, data warehouses, and
archival storage. Furthermore, their implementations of these models are often
highly scalable, adapting easily from megabytes to hundreds of terabytes and
beyond, while avoiding the need for dedicated operations expertise on the part
of the user. Given the many challenges faced by scientists and engineers as they
struggle with rapidly growing data, cloud storage can thus be the answer to their
prayers. But one needs to understand the properties of these different storage
models in order to choose the right system.
The right storage model for a data collection can depend on not only the
nature and size of the data, but also the analyses to be performed, sharing plans,
and update frequencies, among other factors. We review here some of the more
important storage models supported by cloud storage services and, for each, their
principal capabilities and their pros and cons for various purposes. This material
provides background for the detailed service descriptions in the next chapter.
2.2.1 File Systems
The one storage model with which every scientist and engineer is surely familiar
is the file system, organized around a tree of directories or folders. This model
has proven to be an extremely intuitive and useful data storage abstraction. The
standard API for the Unix-derived version of the file system is called the Portable
Operating System Interface (POSIX). We are all familiar with the POSIX file
system: we use it every day on our Apple, Linux, or Windows computer. Using
command line tools, graphical user interfaces, or APIs, we create, read, write, and
delete files located within directories.
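These POSIX-style operations look the same from any language; here, for instance, is a short Python sketch that creates a directory tree, then writes, reads, and deletes a file within it (the directory and file names are invented for illustration):

```python
import pathlib
import tempfile

# Create a directory tree inside a scratch area:
root = pathlib.Path(tempfile.mkdtemp())
datadir = root / "experiments" / "run01"
datadir.mkdir(parents=True)

# Create, write, and read back a file:
f = datadir / "results.csv"
f.write_text("time,value\n0,1.5\n")
print(f.read_text().splitlines()[0])  # time,value

# Delete the file; the directory remains.
f.unlink()
print(f.exists())  # False
```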
The file system storage model has important advantages. It allows for the
direct use of many existing programs without modification: we can navigate a file
system with familiar file system browsers, run programs written in our favorite
analysis tool (Python, R, Stata, SPSS, Mathematica, etc.) on files, and share
files via email. The file system model also provides a straightforward mechanism
for representing hierarchical relationships among data—directories and files—and
supports concurrent access by multiple readers. In the early 1990s, the POSIX
model was extended to distributed network file systems and there were some
attempts at wide area versions. By the late 1990s, the Linux Cluster community
created Lustre, a true parallel file system that supports the POSIX standard.
The file system model also has disadvantages as a basis for science and engi-
neering, particularly as data volumes grow. From a data modeling perspective, it
provides no support for enforcing conventions concerning the representation of data
elements and their relationships. Thus, while one may, for example, choose to store
environmental data in NetCDF files, genomes in FASTA files, and experimental
observations in comma-separated-value (CSV) files, the file system does nothing
to prevent the use of inconsistent representations within those files. Furthermore,
the rigid hierarchical organization enforced by a file system often does not match
the relationships that one wants to capture in science. Lacking any information on
the data model, file systems cannot help users navigate complex data collections.
The file system model also has problems from a scalability perspective: the need
to maintain consistency as multiple processes read and write a file system can lead
to bottlenecks in file system implementations.
For these reasons, cloud storage services designed for large quantities of data
frequently adopt different storage models, as we discuss next.
2.2.2 Object Stores
The object storage model, like the file system model, stores unstructured binary
objects. In the database world, objects are often referred to as blobs, for binary
large object, and we use that name here when it is consistent with terminology
adopted by cloud vendors. An object/blob store simplifies the file system model
in important ways: in particular, it eliminates hierarchy and forbids updates to
objects once created. Different object storage services differ in their details, but in
general they support a two-level folder-file hierarchy that allows for the creation of
object containers, each of which can hold zero or more objects. Each object is
identified by a unique identifier and can have various metadata associated with
it. Objects cannot be modified once uploaded: they can only be deleted or, in
object stores that support versioning, replaced.
We can use this storage model to store the NetCDF data in use case UC1. As
shown in figure 2.1 on the following page, we create a single container and store
each NetCDF file in that container as an object, with the NetCDF filename as the
object identifier. Any authorized individual who possesses that object identifier
can then access the data via simple HTTP requests or API calls.
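The container/object semantics just described can be sketched with a small in-memory class. This is illustrative only, not any vendor's API; all names here are invented:

```python
class ObjectStore:
    """Minimal in-memory sketch of object-store semantics:
    a two-level container/object hierarchy with immutable objects."""

    def __init__(self):
        self.containers = {}

    def create_container(self, name):
        self.containers.setdefault(name, {})

    def put_object(self, container, object_id, data, metadata=None):
        objects = self.containers[container]
        if object_id in objects:
            # Objects cannot be modified once uploaded.
            raise ValueError("objects are immutable: delete or version instead")
        objects[object_id] = (bytes(data), metadata or {})

    def get_object(self, container, object_id):
        data, _meta = self.containers[container][object_id]
        return data

    def delete_object(self, container, object_id):
        del self.containers[container][object_id]

# Use case UC1: one container, one object per NetCDF file,
# with the filename as the object identifier.
store = ObjectStore()
store.create_container("climate-sim-outputs")
store.put_object("climate-sim-outputs", "run042.nc", b"...netcdf bytes...",
                 metadata={"format": "NetCDF"})
print(store.get_object("climate-sim-outputs", "run042.nc"))
```

A real object store would return the data over HTTP rather than from memory, but the access pattern, a container name plus an object identifier, is the same.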
The object store model has important advantages in terms of simplicity, performance,
and reliability. The fact that objects cannot be modified once created
makes it easy to build highly scalable and reliable implementations. For example,
each object can be replicated across multiple physical storage devices to increase
resilience and (when there are many concurrent readers) performance, without any
specialized synchronization logic in the implementation to deal with concurrent
updates. Objects can be moved manually or automatically among storage classes
with different performance and cost parameters.
The object store model also has limitations. It provides little support for
organizing data and no support for search: A user must know an object’s identifier
in order to access it. Thus, an object store would likely be inadequate as a basis for
organizing the 1,000,000 environmental records of UC2: We would need to create
a separate index to map from file characteristics to object identifiers. Nor does
an object store provide any mechanism for working with structured data. Thus,
for example, while we could load each UC2 dataset into a separate object, a user
would likely have to download the entire object to a computer to compute on its
contents. Finally, an object store cannot easily be mounted as a file system or
accessed with existing tools in the ways that a file system can.
Figure 2.1: Object storage model with versioning. Each NetCDF file is stored in a separate
container, and all versions of the same NetCDF file are stored in the same container.
2.2.3 Relational Databases
A database is a structured collection of data about entities and their relationships.
It models real-world objects, both entities (e.g., microscopes, experiments, and
samples, as in UC2 above) and relationships (e.g., “sample1” was tested on “July
24”), and captures structure in ways that allow these entities and relationships
to be queried for analysis. A database management system (DBMS) is a
software suite designed to safely store and efficiently manage databases and to
assist with the maintenance and discovery of the relationships that databases
represent. In general, a DBMS encompasses three components: its data model
(which defines how data are represented), its query language (which defines
how the user interacts with the data), and support for transactions and crash
recovery (to ensure reliable execution despite system failures).
For a variety of reasons, science and engineering data often belong in a database
rather than in files or objects. The use of a DBMS simplifies data management and
manipulation and provides for efficient querying and analysis, durable and reliable
storage, scaling to large data sizes, validation of data formats, and management of
concurrent accesses.
While DBMSs in general, and cloud-based DBMSs in particular, support a
wide variety of data formats, two major classes can be distinguished: relational
and NoSQL. (We also discuss another class, graph databases, below.)
Relational DBMSs allow for the efficient storage, organization, and analysis of
large quantities of tabular data: data organized as tables, in which rows represent
entities (e.g., experiments) and columns represent attributes of those entities
(e.g., experimenter, sample, result). The associated Structured Query Language
(SQL) can then be used to specify a wide range of operations on such tables,
such as compositions and joins. For example, the following SQL joins two tables,
Experiments and People, to find all experiments performed by Smith:
select experiment_id from Experiments, People
  where Experiments.person_id = People.person_id
  and People.name = 'Smith';
SQL statements can be executed with high efficiency thanks to sophisticated
indexing and query planning techniques. Thus this join can be executed quickly
even if there are millions of records in the tables being joined.
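This join can be tried directly from Python using the standard library's SQLite bindings; the table contents here are invented for illustration:

```python
import sqlite3

# Build two small tables matching the join example in the text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE People (person_id INTEGER, name TEXT);
    CREATE TABLE Experiments (experiment_id INTEGER, person_id INTEGER);
    INSERT INTO People VALUES (1, 'Smith'), (2, 'Jones');
    INSERT INTO Experiments VALUES (10, 1), (11, 2), (12, 1);
""")

# The join from the text: all experiments performed by Smith.
rows = conn.execute("""
    SELECT experiment_id FROM Experiments, People
    WHERE Experiments.person_id = People.person_id
    AND People.name = 'Smith'
    ORDER BY experiment_id
""").fetchall()

print(rows)  # [(10,), (12,)]
```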
Many open source, commercial, and cloud-hosted relational DBMSs exist.
Among the open source DBMSs, MySQL and PostgreSQL (often simply Postgres)
are particularly widely used. Both MySQL and Postgres are available in
cloud-hosted forms. In addition, cloud vendors offer specialized relational DBMSs
that are designed to scale to particularly large data sizes.
Relational databases have two important properties. First, they support a
relational algebra that provides a clear, mathematical meaning to the SQL language,
facilitating efficient and correct implementations. Second, they support ACID
semantics, a term that captures four important database properties: Atomicity
(the entire transaction succeeds or fails), Consistency (the data collection is never
left in an invalid or conflicting state), Isolation (concurrent transactions cannot
interfere with each other), and Durability (once a transaction completes, system
failures cannot invalidate the result).
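Atomicity, for example, can be observed directly in any relational DBMS. In this small SQLite sketch, a transaction that fails partway through leaves no trace:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, value REAL)")

# A transaction that fails partway through is rolled back entirely:
try:
    with conn:  # commits on success, rolls back on exception
        conn.execute("INSERT INTO samples VALUES (1, 3.14)")
        conn.execute("INSERT INTO samples VALUES (1, 2.72)")  # duplicate key!
except sqlite3.IntegrityError:
    pass

# Neither insert survived: the table is still empty.
print(conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0])  # 0
```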
2.2.4 NoSQL Databases
While relational DBMSs have long dominated the database world, other technologies
have become popular for some application classes. A relational DBMS is almost
certainly the right technology to use for highly structured datasets of moderate size.
But if your data are less regular (if, for example, you are dealing with large amounts
of text or if different items have different properties) or extremely large, you may
want to consider a NoSQL DBMS. The design of these systems has typically been
motivated by a desire to scale the quantities of data and number of users that can
be supported, and to deal with unstructured data that are not easily represented
in tabular form. For example, a key-value store can organize large numbers of
records, each of which associates an arbitrary key with an arbitrary value. (A
variant called a document store permits text search on the stored values.)
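In its simplest form, a key-value store behaves like a persistent dictionary, and the document-store variant adds search over the stored values. The sketch below is illustrative only; it mirrors no particular product's API:

```python
class KeyValueStore:
    """Sketch of the key-value model: arbitrary key -> arbitrary value."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]


class DocumentStore(KeyValueStore):
    """Document-store variant: adds naive text search over stored values."""

    def search(self, term):
        return [k for k, v in self._data.items() if term in v]


docs = DocumentStore()
docs.put("obs-001", "temperature reading at station A")
docs.put("obs-002", "humidity reading at station B")
print(docs.search("temperature"))  # ['obs-001']
```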
NoSQL databases have limitations relative to relational databases. The name
NoSQL is derived from “non SQL,” meaning that they do not support the full
relational algebra. For example, they typically do not support queries that join
two tables, such as those shown above.
Another definition of NoSQL is “not only SQL,” meaning that most of SQL
is supported but other properties are available. For example, a NoSQL database
may allow for the rapid ingest of large quantities of unstructured data, such as
the instrument events of UC3. Arbitrary data can be stored without modifications
to a database schema, and new columns introduced over time as data and/or
understanding evolves. NoSQL databases in the cloud are often distributed over
multiple servers and also replicated over different data centers. Hence they often
fail to satisfy all of the ACID properties. Consistency is often replaced by eventual
consistency, meaning that database state may be momentarily inconsistent across
replicas. This relaxation of ACID properties is acceptable if your concern is to
respond rapidly to queries about the current state of a store’s inventory. It may
be unacceptable if the data in question are, for example, medical records.
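Eventual consistency can be sketched as replicas that apply updates asynchronously: a read served by a lagging replica returns stale data until propagation completes. All names in this sketch are invented for illustration:

```python
import collections

class EventuallyConsistentStore:
    """Two replicas: writes go to the primary and are propagated
    to the secondary only when sync() runs (simulating replication lag)."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self.pending = collections.deque()

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))  # replication is deferred

    def read(self, replica, key):
        store = self.primary if replica == "primary" else self.secondary
        return store.get(key)

    def sync(self):
        # Apply all pending updates; the replicas converge.
        while self.pending:
            key, value = self.pending.popleft()
            self.secondary[key] = value

s = EventuallyConsistentStore()
s.write("inventory:widget", 41)
print(s.read("secondary", "inventory:widget"))  # None: not yet propagated
s.sync()
print(s.read("secondary", "inventory:widget"))  # 41: replicas have converged
```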
Challenges of scale: The CAP theorem (from Foster et al. [125]). For many
years, and still today, the big relational database vendors (Oracle, IBM, Sybase,
Microsoft) were the mainstay of how data were stored. During the Internet boom,
startups looking for low-cost alternatives to commercial relational DBMSs turned
to MySQL and PostgreSQL. However, these systems proved inadequate for big sites
as they could not cope well with large traffic spikes, as for example when many
customers all suddenly wanted to order the same item. That is, they did not scale.
An obvious solution to scaling databases is to distribute and/or replicate data
across multiple computers, for example by distributing different tables, or different
rows from the same table. However, distribution and replication also introduce
challenges, as we now explain. Let us first define some terms. In a system that
comprises multiple computers:
Consistency indicates that all computers see the same data at the same time.
Availability indicates that every request receives a response about whether it
succeeded or failed.
Partition tolerance indicates that the system continues to operate even if a
network failure prevents computers from communicating.
An important result in distributed systems (the “CAP Theorem” [78]) observes
that it is not possible to create a distributed system with all three properties. This
situation creates a challenge with large transactional datasets. Distribution is needed
for high performance, but as the number of computers grows, so too does the
likelihood of network disruption among computers [65]. As strict consistency cannot
be achieved at the same time as availability and partition tolerance, the DBMS
designer must choose between high consistency or high availability for a particular
system.
The right combination of availability and consistency will depend on the needs of
the service. For example, in an e-commerce setting, we may choose high availability
for a checkout process to ensure that revenue-producing requests to add items to
a shopping cart are honored. Errors can be hidden from the customer and sorted
out later. However, for order submission, when a customer submits an order, we
should favor consistency, because several services (credit card processing, shipping
and handling, reporting) need to access the data simultaneously.
2.2.5 Graph Databases
A graph is a data structure in which edges connect nodes. Graphs are useful when
we need to search data based on relationships among data items. For example, in
UC2, measurements from different experiments might be related by their use of the
same measurement modality; in a database of scientific publications, publications
can be represented as nodes and citations, shared authors, or even shared concepts
as edges. Often graph databases are built on top of existing NoSQL databases.
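A minimal sketch of the publication example: papers as nodes, citations as edges, and a traversal that follows relationships rather than matching table rows. The graph contents are invented for illustration:

```python
import collections

# Publications as nodes; directed "cites" edges (invented data).
cites = {
    "paperA": ["paperB", "paperC"],
    "paperB": ["paperD"],
    "paperC": [],
    "paperD": [],
}

def reachable(graph, start):
    """All publications transitively cited by `start` (breadth-first search)."""
    seen, frontier = set(), collections.deque([start])
    while frontier:
        node = frontier.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append(neighbor)
    return seen

print(sorted(reachable(cites, "paperA")))  # ['paperB', 'paperC', 'paperD']
```

A graph database provides this kind of relationship-following query as a first-class operation, typically with its own query language, rather than requiring the application to implement the traversal itself.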
2.2.6 Data Warehouses
The term data warehouse is commonly used to refer to data management systems
optimized to support analytic queries that involve reading large datasets. Data
warehouses have different design goals and properties than do DBMSs. For example,
a medical center’s clinical DBMS is typically designed to enable many concurrent
requests to read and update information on individual patients (e.g., “what is
Ms. Smith’s current weight?” or “Mr. Jones was prescribed Aspirin”). Data from
this DBMS are uploaded periodically (e.g., once a day) into the medical center’s
data warehouse to support aggregate queries such as “What factors are correlated
with length of stay?” As we discuss in the next section, several cloud vendors offer
data warehouse solutions that can scale to extremely large data volumes.
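The contrast between the two workloads can be seen in miniature with SQLite: the transactional query touches one row, while the warehouse-style query aggregates over the whole table. The data are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stays (patient TEXT, ward TEXT, days INTEGER);
    INSERT INTO stays VALUES
        ('p1', 'cardiology', 5), ('p2', 'cardiology', 7),
        ('p3', 'oncology', 12), ('p4', 'oncology', 10);
""")

# Transactional (clinical DBMS style): one patient's record.
one = conn.execute("SELECT days FROM stays WHERE patient = 'p1'").fetchone()

# Analytic (warehouse style): an aggregate over the whole table.
avg = conn.execute("""
    SELECT ward, AVG(days) FROM stays GROUP BY ward ORDER BY ward
""").fetchall()

print(one)  # (5,)
print(avg)  # [('cardiology', 6.0), ('oncology', 11.0)]
```

At warehouse scale the second kind of query reads billions of rows, which is why warehouse systems use columnar layouts and massive parallelism rather than the row-oriented, update-friendly structures of a transactional DBMS.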
2.3 The Cloud Storage Landscape
The major public cloud companies provide a rich collection of storage services.
The cloud providers described here are Amazon Web Services (hereafter referred
to as Amazon), Microsoft Azure (hereafter Azure), and Google Cloud (hereafter
Google). Table 2.1 lists selected offerings from these three major vendors. (When
a single vendor has multiple services in a category, those services tend to have quite
different characteristics.) There are also other, more specialized storage services
not listed in the table, some of which we mention in the following. We expand
upon each of the rows in this table in the text that follows.
We do not list OpenStack options in the table because they are dependent
upon the specific deployment, are not part of the OpenStack standard, and are
not as extensive as those provided by the three main public clouds. Nevertheless,
some standards for file services exist, as we discuss in section 2.3.7 on page 34.
2.3.1 File Systems
File systems (also referred to as file shares) are virtual data drives that can be
attached to virtual machines. We describe the following services in greater detail
in part II of the book. Amazon’s Elastic Block Store (EBS) and Elastic File
System (EFS) services offer related but different services. EBS is a device that
Table 2.1: Storage as a service options from major public cloud vendors.

Model       Amazon                    Google                  Azure
----------  ------------------------  ----------------------  --------------------
Files       Elastic File System       Google Cloud            Azure File Storage
            (EFS), Elastic Block      attached file system
            Store (EBS)
Objects     Simple Storage Service    Cloud Storage           Blob Storage Service
            (S3)
Relational  Relational Database       Cloud SQL, Spanner      Azure SQL
            Service (RDS), Aurora
NoSQL       DynamoDB, HBase           Cloud Datastore,        Azure Tables, HBase
                                      Bigtable
Graph       Titan                     Cayley                  Graph Engine
Warehouse   Redshift                  BigQuery                Data Lake
analytics
you can mount onto a single Amazon EC2 compute server instance at a time; it is
designed for applications that require low-latency access to data from a single EC2
instance. For example, you might use it to store working data that are to be read
and written frequently by an application, but that are too large to fit in memory.
EFS, in contrast, is a general-purpose file storage service. It provides a file system
interface, file system access semantics (e.g., strong consistency, file locking), and
concurrently-accessible storage for many Amazon EC2 instances. You might use
EFS to hold state that is to be read and written by many concurrent processes.
Note that both EBS and EFS can be accessed directly only by EC2 instances, that
is, from inside the Amazon cloud.
Google Compute Engine has a different attached storage model. There are
three types of attached disks (and also a way to attach an object store). The
cheapest, persistent disks, can be up to 64 TB in size. Local SSD (solid state
disk) is higher performance but more expensive and can be up to 3 TB. Finally,
RAM disk is in-memory, limited to 208 GB, and expensive. Persistent disks can
be accessed anywhere in a zone, but SSD and RAM are only accessible by the
instance to which they are attached.
The Azure File Storage service allows users to create file shares in the cloud
that can be accessed by a special protocol, SMB, that allows Microsoft Windows
VMs and Linux VMs to mount these file shares as standard parts of their file
system. These file shares can also be mounted on your Windows or Mac.
2.3.2 Object Stores
Amazon’s Simple Storage Service (S3) was historically its first cloud service.
It is highly popular: as of 2016, it reportedly holds trillions of objects in billions
of containers, which S3 calls buckets. S3 is a classic object store, with all of the
properties listed in section 2.2.2 on page 24. We describe S3 in more detail in
section 3.2 on page 38, where we also present examples of its use. The related
Glacier service is designed for long-term, secure, durable, extremely low cost data
archiving. Access times for an object in Glacier may be several hours, so this is
not for applications that need rapid data access.
Google’s Cloud Storage, like Amazon S3, provides a basic object storage
system that is durable, replicated, and highly available. It supports three storage
tiers, each with different performance and price levels. The most expensive is
Standard multiregional, the mid-range tier is Regional (DRA), and the bottom
tier is Nearline. Standard storage is for data you expect to access often, DRA
is for batch jobs for which response time is not a critical issue, and Nearline is
for cold storage and disaster recovery. Google Cloud also has Coldline, which is
similar to AWS Glacier.
Azure Storage offers a suite of services with a similar scope in terms of
models supported to those provided by Amazon and Google. Azure provides the
user with a unified view of many of its storage types associated with their account,
as shown in figure 2.2 on the next page. This integration means that you can use
the Azure Storage Explorer tool (storageexplorer.com) to see and manage all of
these storage products from your PC or Mac. While Azure storage services were
originally optimized for close integration with Microsoft Windows environments,
Linux is now an important part of Azure, so this differentiation is less noticeable.
The Azure Blob storage service, like Amazon’s S3, is concerned with highly
reliable storage of unstructured objects, which Microsoft calls blobs. Like Amazon
and Google Cloud, Azure blob storage has tiered storage and pricing. The tiers
are hot for frequently accessed data and cool for data that are accessed less often.
2.3.3 NoSQL Services
Amazon’s DynamoDB is a powerful NoSQL database based on an extensible
key-value model: for each row, the primary key column is the only required attribute,
but any number of additional columns can be defined, indexed, and made searchable
in various ways, including full-text search via Elasticsearch. DynamoDB’s rich
feature set defies a concise description, but we illustrate some of its uses in
section 3.2 on page 38. Related to this is Amazon Elastic MapReduce (EMR),
Figure 2.2: Major Azure storage types.
discussed in section 8.3 on page 143, which allows analysis of large quantities of
data with Spark and other data analytics platforms.
Cloud Bigtable, Google’s highly scalable NoSQL database service, is the
same database that powers many core Google services, including Search, Analytics,
Maps, and Gmail. Bigtable maps two arbitrary strings (row key and column key)
and a timestamp (permitting versioning and garbage collection) to an associated
arbitrary byte array. It is designed to handle such large and sparse datasets in a
manner that is efficient in space used and that supports massive workloads, while
providing low latency and high bandwidth. You deploy Bigtable on a Google-hosted
cluster, which can be dynamically resized if needed. The open source Apache
HBase database system, fundamental to the Apache Hadoop system, is compatible
with Bigtable. Google’s Cloud Datastore has many similarities to Bigtable. An
important difference is that it implements ACID semantics and thus the user
does not have to wait for possible inconsistencies to be resolved, as is required for
Bigtable. Cloud Datastore has a much richer set of SQL-like operators than do
many of the other NoSQL systems described here.
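Bigtable's data model, a mapping from (row key, column key, timestamp) to a byte array, can be sketched as a sparse in-memory map. This is an illustration of the model only, not the Bigtable API; the row and column names are invented:

```python
class SparseTable:
    """Sketch of the Bigtable data model: (row key, column key, timestamp)
    maps to a byte array; reads return the most recent version."""

    def __init__(self):
        self.cells = {}

    def put(self, row, column, timestamp, value):
        self.cells[(row, column, timestamp)] = value

    def get_latest(self, row, column):
        versions = [(ts, v) for (r, c, ts), v in self.cells.items()
                    if r == row and c == column]
        return max(versions)[1] if versions else None

t = SparseTable()
t.put("sensor#42", "readings:temp", 1000, b"21.5")
t.put("sensor#42", "readings:temp", 2000, b"22.1")  # a newer version
print(t.get_latest("sensor#42", "readings:temp"))   # b'22.1'
```

The timestamp dimension is what permits the versioning and garbage collection mentioned above: older versions remain addressable until a retention policy discards them.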
The Azure Table storage service is a simple NoSQL key-value store, designed
to support the highly reliable storage of any large number of key-value pairs. It is
similar to Amazon DynamoDB. Its query capabilities are limited, but it can support
many queries at modest cost. Azure HDInsight provides an implementation of the
Hadoop storage service hosted on Azure cloud computers, with implementations of
popular big data tools, including Spark, the HBase NoSQL database, and the Hive
SQL database implemented to run efficiently and scalably on top of that Hadoop
fabric. We describe this service in more detail in part III. DocumentDB is a
NoSQL service, like Table, but supporting full text indexing and query, albeit at
higher cost due to the greater resources needed for indexing.
2.3.4 Relational Databases
Relational databases are a mature technology, so the main innovation in the cloud
is deployments that scale to especially large sizes.
Amazon’s Relational Database Service (RDS) allows you to set up a conventional
relational database (e.g., MySQL or Postgres) on Amazon computers,
thus permitting MySQL and Postgres applications to be ported to Amazon without
change. The MySQL-compatible Amazon Aurora service provides higher
scalability, performance, and resilience than an RDS MySQL instance: it can scale
to many terabytes, replicate data across data centers (known as availability zones),
and create many read replicas to support large numbers of concurrent reads.
Google’s Cloud SQL relational database service has similar capabilities to
those provided by such services in Amazon and Azure. Their Spanner system [101]
is a globally distributed relational database that provides ACID transactions and
SQL semantics with high scaling and availability.
Azure’s SQL Database provides a relational database service similar to
Amazon RDS. It is based on their mature SQL Server technology and is highly
available and scalable.
2.3.5 Warehouse Analytics
Cloud data warehouses are designed specifically for running analytics queries over
large collections. You interact with them from the cloud portal or via REST APIs.
We discuss these systems in greater detail in part III.
Amazon Redshift is a data warehouse system, designed to support high-performance
execution of analytic and reporting workloads against large datasets.
For massive data analytics, Google provides the BigQuery petascale data
warehouse. BigQuery is fully distributed and replicated, so durability is not an
issue. It also supports SQL query semantics.
The Azure Data Lake is a full suite of data analytics tools built on the
open-source YARN and WebHDFS platforms.
2.3.6 Graphs and More
Each of the three cloud providers provides not only graph databases but also
other services, including messaging and stream services that, like warehouses, are
important tools for stream and log analytics. Messaging services allow applications
to send and receive messages using what are referred to as publish/subscribe
semantics. They allow one application to wait on a queue for a message to arrive
while other applications prepare the message and send it to the queue. This
capability is important for many cloud applications that need to distribute work
tasks or to process streams of incoming events. We discuss messaging applications
in chapters 7 through 9.
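This wait-on-a-queue pattern can be sketched with Python's standard thread-safe queue. Cloud messaging services add durability, fan-out to many subscribers, and scale, but the core semantics are the same:

```python
import queue
import threading

work_queue = queue.Queue()
results = []

def consumer():
    # Wait on the queue for messages; a None message signals shutdown.
    while True:
        message = work_queue.get()
        if message is None:
            break
        results.append(f"processed {message}")

worker = threading.Thread(target=consumer)
worker.start()

# Another application prepares messages and sends them to the queue.
for record in ["event-1", "event-2", "event-3"]:
    work_queue.put(record)
work_queue.put(None)  # shutdown signal
worker.join()

print(results)  # ['processed event-1', 'processed event-2', 'processed event-3']
```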
Amazon’s Titan extension to DynamoDB supports graph databases. Other
Amazon services not listed in table 2.1 include the Simple Queue Service (SQS:
see section 9.3 on page 167) and the Elasticsearch Service, a cloud-hosted
version of the Elasticsearch open-source search and analytics engine. Amazon
Kinesis, also discussed in section 9.3, supports analysis of stream data.
Google supports the open source graph database Cayley. Its Cloud Pub/Sub
service provides messaging in a similar manner to Amazon SQS.
The Azure graph database, Graph Engine, is a distributed, in-memory large
graph processing system. The Queue service is the Azure pub/sub service, similar
to Google Cloud Pub/Sub. The Azure Event Hubs service is similar to Amazon
Kinesis. We return to these services in chapter 9.
2.3.7 OpenStack Storage Services and Jetstream
The OpenStack open source cloud software supports only a few standard storage
services: object storage, block storage, and file system storage. However, many
research groups and some large companies such as IBM are investing substantial
effort in improving this situation.
The OpenStack object storage service is called Swift (not to be confused with the Swift parallel scripting language [259] or Apple’s Swift language). Like Amazon S3 and the Azure Blob service, Swift implements a REST API that users can use to store, delete, manage permissions of, and associate metadata with immutable unstructured data objects located within containers. These objects are replicated across multiple storage servers for fault tolerance and performance reasons, and can be accessed from anywhere.
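To make the phrase "implements a REST API" concrete: Swift addresses every object by a URL of the form /v1/&lt;account&gt;/&lt;container&gt;/&lt;object&gt;, authenticates requests with an X-Auth-Token header, and attaches user metadata via X-Object-Meta-* headers; a PUT to the URL stores the object, GET retrieves it, and DELETE removes it. The sketch below only constructs the request components; the endpoint, account, and token values are invented placeholders, not a real deployment.

```python
def swift_object_request(endpoint, account, container, obj, token, metadata=None):
    """Build the URL and headers for a Swift object operation.

    Swift addresses an object as <endpoint>/v1/<account>/<container>/<object>;
    PUT stores it, GET retrieves it, DELETE removes it, and
    X-Object-Meta-* headers carry user-defined metadata.
    """
    url = f"{endpoint}/v1/{account}/{container}/{obj}"
    headers = {"X-Auth-Token": token}
    for name, value in (metadata or {}).items():
        headers[f"X-Object-Meta-{name}"] = value
    return url, headers

# Placeholder endpoint, account, and token; a real deployment returns
# these from its authentication service.
url, headers = swift_object_request(
    "https://swift.example.org", "AUTH_project", "results", "run42.nc",
    token="placeholder-token", metadata={"experiment": "climate-run-42"})

print(url)  # → https://swift.example.org/v1/AUTH_project/results/run42.nc
```

An HTTP library (for example, Python's urllib.request) would then issue the PUT or GET against this URL with these headers.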
The OpenStack Shared File Systems service, like the Amazon EFS and Azure File service, implements a file system model in the cloud environment. Users interact with this service by mounting remote file systems, called shares, on their virtual machine instances. They can also create shares, configuring the file system protocol supported; manage access to shares; delete shares; and configure rate limits and quotas, among other things. Shares can be mounted on any number of client machines, using NFS, CIFS, GlusterFS, or HDFS drivers. Shares can be accessed only from virtual machine instances running on the OpenStack cloud.
Many OpenStack deployments use the Ceph distributed storage system to manage their storage. This open source system supports object storage with interfaces compatible with Amazon S3 and OpenStack Swift. It also supports block-level and file-level storage.
We use the U.S. National Science Foundation’s Jetstream cloud for OpenStack examples. Operated as part of the XSEDE supercomputer project [245] (xsede.org), Jetstream is more experimental than the large public clouds because it is designed to support new classes of interactive scientific computing. Jetstream runs the OpenStack object store, based on Ceph, which implements the Swift API. The primary user interaction with Jetstream is through a system known as Atmosphere, built by the University of Arizona as part of the NSF iPlant collaborative. Atmosphere is designed to manage virtual machines, data, and visualization tools for communities of scientists, and provides a volume management system for mounting external volumes on VMs. We explore Atmosphere in greater detail in section 5.5 on page 82. Jetstream also operates the Globus identity, group, and file management services, which we describe in the next chapter.
2.4 Summary
As this chapter has illustrated, the data storage models used in the cloud are as varied as the types of data and the types of processing that scientists would want to use. Let us return to our use cases and see how they map to the data types.
The first use case, involving climate simulation output files in NetCDF format, is clearly a case for Amazon S3, Google Cloud Storage, or Azure Blob Storage. Each blob can be up to 1 TB in size (5 TB for S3). As we show in chapter 3, each service provides simple APIs that can be used to access data. Another solution for moving the data to and from S3 is to use the Globus file transfer protocols, which have been optimized for managing big data objects: see section 3.6 on page 51.
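Object size matters in practice: S3, for example, accepts at most 5 GB in a single PUT request, so a larger climate output file must be split into a multipart upload (parts of 5 MB to 5 GB each, at most 10,000 parts, 5 TB total). A small planning sketch, independent of any provider SDK; the 1 GB default part size is an arbitrary choice for illustration:

```python
import math

# Amazon S3 limits: single PUT <= 5 GB; multipart objects up to 5 TB.
GB = 1024 ** 3
SINGLE_PUT_LIMIT = 5 * GB
MAX_OBJECT_SIZE = 5 * 1024 * GB      # 5 TB

def plan_upload(size_bytes, part_size=1 * GB):
    """Return ('single', 1) or ('multipart', n_parts) for an object of
    the given size, based on S3's request and object size limits."""
    if size_bytes > MAX_OBJECT_SIZE:
        raise ValueError("object exceeds the 5 TB S3 object limit")
    if size_bytes <= SINGLE_PUT_LIMIT:
        return ("single", 1)
    return ("multipart", math.ceil(size_bytes / part_size))

print(plan_upload(2 * GB))       # → ('single', 1)
print(plan_upload(100 * GB))     # → ('multipart', 100)
```

The provider SDKs (and tools such as Globus) perform this kind of splitting automatically; the sketch just shows why a "simple" upload of a large file is more than one request.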
The second use case, involving 1,000,000 records describing experimental observations, could also be handled with simple blob storage, but the cloud presents us with better solutions. The simplest is to use a standard relational SQL database, but the merits of this approach depend on how strict we are with the schema that describes the data. Do all records have the same structure, or do some have fields that others do not? In the latter case, a NoSQL database may be a superior solution. Other factors are scale and the possible need for parallel access. Cloud NoSQL stores like Azure Tables, Amazon DynamoDB, and Google Bigtable are massively scalable and replicated. Unlike conventional SQL database solutions, they are designed for highly parallel massive data analysis.
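The schema question above can be checked mechanically: before committing to a relational table, compare the field sets of a sample of records. A minimal sketch, with invented observation records:

```python
def uniform_schema(records):
    """Return True if every record has exactly the same set of fields,
    i.e., the data fits a fixed relational schema."""
    field_sets = {frozenset(record) for record in records}
    return len(field_sets) <= 1

# Invented example records: the third adds a field the others lack.
observations = [
    {"id": 1, "instrument": "A", "temp_K": 288.1},
    {"id": 2, "instrument": "B", "temp_K": 287.9},
    {"id": 3, "instrument": "A", "temp_K": 288.4, "humidity": 0.61},
]

print(uniform_schema(observations[:2]))  # → True: fits a fixed SQL schema
print(uniform_schema(observations))      # → False: varying fields suggest NoSQL
```

A relational database would force the extra field into a nullable column for every record; a NoSQL table simply stores each record with whatever fields it has.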
The third use case, involving a massive set of instrument event records, is also appropriate for cloud NoSQL databases. However, data warehouses such as Amazon Redshift and Azure Data Lake are designed to be complete platforms for performing data analytics on massive data collections. If our instrument records are streaming in real time, we can use event streaming tools based on publish/subscribe semantics, as we discuss in chapter 9.
2.5 Resources
The storage capabilities of the public clouds are rapidly evolving and thus it is important to consult the relevant documentation, which is easily accessed from the cloud portals: aws.amazon.com for Amazon, azure.microsoft.com for Azure, and cloud.google.com for Google.
Troy Hunt uses the example of his “Have I Been Pwned” site, which uses the Azure Table service to enable rapid searches against more than one billion compromised accounts, to illustrate some of the pros and cons of Azure’s Table, DocumentDB, and SQL Database services [158].