Chapter 3
Using Cloud Storage Services
“Collecting data is only the first step toward wisdom, but sharing data
is the first step toward community.”
—Henry Louis Gates Jr.
We introduced in chapter 2 a set of important cloud storage concepts and a range
of cloud provider services that implement these concepts in practice. While the
services of different cloud providers are often similar in outline, they invariably
differ in the details. Thus here we describe, and use examples to illustrate, the
storage services of three major public clouds (Amazon, Azure, Google) and of
one major open source cloud system, OpenStack. And because your science will
often involve data that exist outside the cloud, we also describe how you can use
Globus to move data between the cloud and other data centers and to share data
with collaborators.
3.1 Two Access Methods: Portals and APIs
As we discussed in section 1.4 on page 8, cloud providers make available two main
methods for managing data and services: portals and REST APIs.
Each cloud provider’s web portal typically allows you to accomplish anything
that you want to do with a few mouse clicks. We provide several examples of such
web portals to illustrate how they work.
While such portals are good for performing simple actions, they are not ideal
for the repetitive tasks, such as managing hundreds of data objects, that scientists
need to do on a daily basis. For such tasks, we need an interface to the cloud that
we can program. Cloud providers make this possible by providing REST APIs that
programmers can use to access their services programmatically. For programming
convenience, you will usually access these APIs via software development kits
(SDKs), which give programmers language-specific functions for interacting with
cloud services. We discuss the Python SDKs here. The code below is all for
Python 2.7, but is easily converted to Python 3.
Each cloud has special features that make it unique, and thus the different
cloud providers' REST APIs and SDKs are not identical. Two efforts are under
way to create a standard Python SDK: CloudBridge [11] and Apache Libcloud
(libcloud.apache.org). While both aim to support the standard tasks for all clouds,
those tasks are only the lowest common denominator of cloud capabilities; many
unique capabilities of each cloud are available only through the REST API and
SDK for that platform. At the time of this writing, Libcloud is not complete; we
will provide an online update when it is ready and fully documented. However, we
do make use of CloudBridge in our OpenStack examples.
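To give a flavor of what such a provider-independent SDK looks like, the following
minimal sketch uses Apache Libcloud's storage driver to connect to S3 and list the
containers (buckets) visible to an account. The key values are placeholders, and the
details may change as Libcloud matures; consult libcloud.apache.org for the current API.

from libcloud.storage.types import Provider
from libcloud.storage.providers import get_driver

# Obtain the S3 storage driver and authenticate with our Amazon keys
driver_class = get_driver(Provider.S3)
driver = driver_class('YOUR ACCESS KEY', 'your secret key')

# List the containers (S3 buckets) visible to this account
for container in driver.list_containers():
    print(container.name)

# The same code works for another provider if we request a different
# driver, e.g., get_driver(Provider.AZURE_BLOBS)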
Building a data sample collection in the cloud. We use the following simple
example throughout this chapter to illustrate the use of Amazon, Azure, and Google
cloud storage services. We have a collection of data samples stored on our personal
computer, and for each sample we have four items of metadata: item number, creation
date, experiment id, and a text string comment. To enable access to these samples by
our collaborators, we want to upload them to cloud storage and to create a searchable
table, also hosted in the cloud, containing the metadata and cloud storage URL for
each object, as shown in figure 3.1 on the following page.
We assume that each data sample is in a binary file on our personal computer
and that the associated metadata are contained in a comma separated value (CSV)
file, with one line per item, also on our personal computer. Each line in this CSV file
has the following format:
item id, experiment id, date, filename, comment string
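For example, a metadata file describing two samples might contain lines like the
following (the values here are purely illustrative).

1, exp1, 3/15/2016, exp1data1.bin, first sample from experiment exp1
2, exp1, 3/15/2016, exp1data2.bin, second sample from experiment exp1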
3.2 Using Amazon Cloud Storage Services
Our Amazon solution to the example problem uses S3 to store the blobs and
DynamoDB to store the table. We first need our Amazon key pair, i.e., access
key plus secret key, which we can obtain from the Amazon IAM Management
Console. Having created a new user, we select the create access key button to
create our security credentials, which we can then download, as shown in figure 3.2
on the following page.
Figure 3.1: The simple cloud storage use case that we employ in this chapter involves the
upload of a collection of data blobs to cloud storage and the creation of a NoSQL table
containing metadata.
Figure 3.2: Downloading security credentials from the Amazon IAM Management Console.
We can proceed to create the required S3 bucket, upload our blobs to that
bucket, and so forth, all from the Amazon web portal. (We showed in figure 1.5
on page 11 the use of this portal to create a bucket.) However, there are a lot of
blobs, so we instead use the Amazon Python Boto3 SDK for these tasks. Details
on how to install this SDK are found in the link provided in the Resources section
at the end of this chapter.
Boto3 considers each service to be a resource. Thus, to use the S3 system, we
need to create an S3 resource object. To do that, we need to specify the credentials
that we obtained from the IAM Management Console. There are several ways to
provide these credentials to our Python program. The simplest is to provide them
as special named parameters to the resource instance creation function, as follows.
import boto3
s3 = boto3.resource('s3',
        aws_access_key_id='YOUR ACCESS KEY',
        aws_secret_access_key='your secret key')
This approach is not recommended from a security perspective, since credentials
in code have a habit of leaking, if, for example, code is pushed to a shared repository:
see section 15.1 on page 315. Fortunately, this method is needed only if your
Python program is running as a separate service on a machine or a container
that does not have access to your security keys. If you are running on your own
machine, the proper solution is to have in your home directory a .aws subdirectory
that contains two protected files: config, containing your default Amazon region,
and credentials, containing your access and secret keys. If we have this directory in place, then the
access key and secret key parameters are not needed.
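For reference, minimal versions of these two files (with placeholder values) look as
follows; boto3 and the AWS command line tools both read this standard layout.

# ~/.aws/config
[default]
region = us-west-2

# ~/.aws/credentials
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY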
Having created the S3 resource object, we can now create the S3 bucket,
datacont, in which we will store our data objects. The following code performs
this action. Note the (optional) second argument to the create_bucket call, which
specifies the geographic region in which the bucket should be created. At the time
of writing, Amazon operates in 13 regions; region us-west-2 is located in Oregon.
import boto3
s3 = boto3.resource('s3')
s3.create_bucket(Bucket='datacont',
    CreateBucketConfiguration={'LocationConstraint': 'us-west-2'})
Now that we have created the new bucket, we can load our data objects into it
with a command such as the following.
# Upload a file, 'test.jpg', into the newly created bucket
s3.Object('datacont', 'test.jpg').put(
    Body=open('/home/mydata/test.jpg', 'rb'))
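To check that the upload worked, we can list the bucket's contents and read back
the stored object's size, again using only boto3 calls on the same s3 resource and
the test.jpg object uploaded above.

# List the objects now stored in the bucket
for obj in s3.Bucket('datacont').objects.all():
    print(obj.key)

# Retrieve the stored object's size in bytes
print(s3.Object('datacont', 'test.jpg').content_length)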
Having learned how to upload a file to S3, we can now create the DynamoDB
table in which we will store metadata and references to S3 objects. We create this
table by defining a special key that is composed of a PartitionKey and a RowKey.
NoSQL systems such as DynamoDB are distributed over multiple storage devices,
which enables the construction of extremely large tables that can then be accessed
in parallel by many servers, each accessing one storage device. Hence the table's
aggregate bandwidth is multiplied by the number of storage devices. DynamoDB
distributes data by row: for any row, every element in that row is mapped to the
same device. Thus, to determine the device on which a data value is located, you
need only look up the PartitionKey, which is hashed to an index that determines
the physical storage device in which the row resides. The RowKey specifies that
items are stored in order sorted by the RowKey value. While it is not necessary to
have both keys, we also illustrate the use of the RowKey here. We can use the following
code to create the DynamoDB table.
dyndb = boto3.resource('dynamodb', region_name='us-west-2')

# The first time that we define a table, we use
table = dyndb.create_table(
    TableName='DataTable',
    KeySchema=[
        {'AttributeName': 'PartitionKey', 'KeyType': 'HASH'},
        {'AttributeName': 'RowKey', 'KeyType': 'RANGE'}
    ],
    AttributeDefinitions=[
        {'AttributeName': 'PartitionKey', 'AttributeType': 'S'},
        {'AttributeName': 'RowKey', 'AttributeType': 'S'}
    ],
    # Provisioned read/write capacity is required when creating a table
    ProvisionedThroughput={'ReadCapacityUnits': 5, 'WriteCapacityUnits': 5}
)

# Wait for the table to be created
table.meta.client.get_waiter('table_exists').wait(TableName='DataTable')

# If the table has been previously defined, use:
# table = dyndb.Table("DataTable")
We are now ready to read the metadata from the CSV file, move the data
objects into the blob store, and enter the metadata row into the table. We do
this as follows. Recall that our CSV file format has item[3] as filename, item[0]
as itemID, item[1] as experimentID, item[2] as date, and item[4] as comment.
Note that we need to state explicitly, via ACL='public-read', that the URL for
the data file is to be publicly readable. The complete code is in notebook 2.
import csv

urlbase = "https://s3-us-west-2.amazonaws.com/datacont/"

with open('\path-to-your-data\experiments.csv', 'rb') as csvfile:
    csvf = csv.reader(csvfile, delimiter=',', quotechar='|')
    for item in csvf:
        body = open('\path-to-your-data\datafiles\\' + item[3], 'rb')
        s3.Object('datacont', item[3]).put(Body=body)
        md = s3.Object('datacont', item[3]).Acl().put(ACL='public-read')
        url = urlbase + item[3]
        metadata_item = {'PartitionKey': item[0], 'RowKey': item[1],
                         'description': item[4], 'date': item[2], 'url': url}
        table.put_item(Item=metadata_item)
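Once the table is populated, a collaborator with appropriate credentials can look
up a sample's metadata by key. The following sketch uses hypothetical key values
'1' and 'exp1'; get_item returns a dictionary whose 'Item' entry holds the stored
attributes.

dyndb = boto3.resource('dynamodb', region_name='us-west-2')
table = dyndb.Table('DataTable')

# Fetch the metadata row whose keys are '1' (item id) and 'exp1' (experiment id)
response = table.get_item(
    Key={'PartitionKey': '1', 'RowKey': 'exp1'})
item = response['Item']
print(item['url'], item['description'])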
3.3 Using Microsoft Azure Storage Services
We first note some basic differences between your Amazon and Azure accounts.
As we described above, your Amazon account ID is defined by a pair consisting of
your access key and your secret key. Similarly, your Azure account is defined by
your personal ID and a subscription ID. Your personal ID is probably your email
address, so that is public; the subscription ID is something to keep secret.
We use Azure's standard blob storage and Table service to implement the
example. The differences between Amazon DynamoDB and the Azure Table service
are subtle. With the Azure Table service, each row has the fields PartitionKey,
RowKey, comments, date, and URL as before, but this time the RowKey is a unique
integer for each row. The PartitionKey is used as a hash to locate the row in a
specific storage device, and the RowKey is a unique global identifier for the row.
In addition to these semantic differences between DynamoDB and Azure Tables,
there are fundamental differences between the Amazon and Azure object storage
services. In S3, you create buckets and then create blobs within a bucket. S3 also
provides an illusion of folders, although these are actually just blob name prefixes
(e.g., folder1/). In contrast, Azure storage is based on Storage Accounts, a
higher level abstraction than buckets. You can create as many storage accounts as
you want; each can contain five different types of objects: blobs, containers, file
shares, tables, and queues. Blobs are stored in bucket-like containers that can also
have a pseudo directory-like structure, similar to S3 buckets.
Given your user ID and subscription ID, you can use the Azure Python SDK
to create storage accounts, much as we create buckets in S3. However, we find it
easier to use the Azure portal. Log in and click on storage account in the menu
on the left to bring up a panel for storage accounts. To add a new account, click
on the + sign at the top of the panel. You need to supply a name and some
additional parameters such as location, duplication, and distribution. Figure 3.3
shows the storage account that we added, called escistore.
One big difference between S3 and Azure Storage accounts is that each storage
account comes with two unique access keys, either of which can be used to access
and modify the storage account. Unlike S3, you do not need the subscription
ID or user ID to add containers, tables, blobs, or queues; you only need a valid
key. You can also invalidate either key and generate new keys at any time from
the portal. The reason for having two keys is that you can use one key for your
long-running services that use that storage account and the other to allow another
entity temporary access. By regenerating that second key, you terminate access by
the third party.
Azure storage accounts are, by default, private. You can also set up a public
storage account, as we show in an example on the next page, and grant limited,
temporary access to a private account by creating, from the portal, a Shared
Access Signature for the account. Various access right properties can be configured
in this signature, including the period for which it is valid.
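Such a signature can also be generated programmatically with the Azure storage
Python SDK used below. The following sketch is only illustrative: the account name
and key are placeholders, and the function and permission names reflect the
azure-storage SDK current at the time of writing. It creates a read-only token for the
datacont container that expires after 24 hours; the token is appended as a query
string to any blob URL that you share.

from datetime import datetime, timedelta
from azure.storage.blob import BlockBlobService, ContainerPermissions

block_blob_service = BlockBlobService(account_name='escistore',
                                      account_key='your storage key')

# Create a read-only shared access signature valid for 24 hours
sas_token = block_blob_service.generate_container_shared_access_signature(
    'datacont',
    permission=ContainerPermissions.READ,
    expiry=datetime.utcnow() + timedelta(hours=24))
print(sas_token)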
Figure 3.3: Azure portal after the storage account has been created.
Having created the storage account and installed the SDK (see section 3.8 on
page 57), we can proceed to the initialization as follows. The create_table()
function returns true if a new table was created and false if the table already exists.
import azure.storage
from azure.storage.table import TableService, Entity
from azure.storage.blob import BlockBlobService
from azure.storage.blob import PublicAccess

# First, access the blob service
block_blob_service = BlockBlobService(account_name='escistore',
                                      account_key='your storage key')
block_blob_service.create_container('datacont',
                                    public_access=PublicAccess.Container)

# Next, create the table in the same storage account
table_service = TableService(account_name='escistore',
                             account_key='your account key')
if table_service.create_table('DataTable'):
    print("Table created")
else:
    print("Table already there")
The code to upload the data blobs and to build the table is almost identical to
that used for Amazon S3 and DynamoDB. The only differences are the lines that
first manage the upload and then insert items into the table. To upload the data
to the blob storage we use the create_blob_from_path() function, which takes
three parameters: the container, the name to give the blob, and the path to the
source, as shown in the following.
import csv

with open('\path-to-your-data\experiments.csv', 'rb') as csvfile:
    csvf = csv.reader(csvfile, delimiter=',', quotechar='|')
    for item in csvf:
        print(item)
        block_blob_service.create_blob_from_path(
            'datacont', item[3],
            "\path-to-your-files\datafiles\\" + item[3]
        )
        url = "https://escistore.blob.core.windows.net/datacont/" + item[3]
        metadata_item = {'PartitionKey': item[0], 'RowKey': item[1],
                         'description': item[4], 'date': item[2], 'url': url}
        table_service.insert_entity('DataTable', metadata_item)
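As with DynamoDB, collaborators can then retrieve metadata from the table. A
minimal sketch, again assuming the hypothetical key values '1' and 'exp1', uses
get_entity to fetch a single row and query_entities with an OData filter to list
all rows in a partition.

# Look up a single metadata entity by its PartitionKey and RowKey
entity = table_service.get_entity('DataTable', '1', 'exp1')
print(entity.url, entity.description)

# List every row whose PartitionKey is '1'
rows = table_service.query_entities(
    'DataTable', filter="PartitionKey eq '1'")
for row in rows:
    print(row.RowKey, row.url)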
A nice desktop tool called Azure Storage Explorer is available for Windows,
Mac, and Linux. We tested the code above with a single CSV file with only four
lines and four small blobs. Figure 3.4 shows the Storage Explorer views of the
blob container and table contents.
Figure 3.4: Azure Storage Explorer view of the contents of the blob container and table.