Chapter 3
Using Cloud Storage Services
“Collecting data is only the first step toward wisdom, but sharing data
is the first step toward community.”
—Henry Louis Gates Jr.
We introduced in chapter 2 a set of important cloud storage concepts and a range
of cloud provider services that implement these concepts in practice. While the
services of different cloud providers are often similar in outline, they invariably
differ in the details. Here we therefore describe, and illustrate with examples, the
storage services of three major public clouds (Amazon, Azure, Google) and of
one major open source cloud system, OpenStack. And because your science will
often involve data that exist outside the cloud, we also describe how you can use
Globus to move data between the cloud and other data centers and to share data
with collaborators.
3.1 Two Access Methods: Portals and APIs
As we discussed in section 1.4 on page 8, cloud providers make available two main
methods for managing data and services: portals and REST APIs.
Each cloud provider's web portal typically allows you to accomplish anything
that you want to do with a few mouse clicks. We provide several examples of such
web portals to illustrate how they work.
While such portals are good for performing simple actions, they are not ideal
for the repetitive tasks, such as managing hundreds of data objects, that scientists
need to do on a daily basis. For such tasks, we need an interface to the cloud that
we can program. Cloud providers make this possible by providing REST APIs that
programmers can use to access their services programmatically. For programming
convenience, you will usually access these APIs via software development kits
(SDKs), which give programmers language-specific functions for interacting with
cloud services. We discuss the Python SDKs here. The code below is all for
Python 2.7, but is easily converted to Python 3.
Each cloud has special features that make it unique, and thus the different
cloud providers' REST APIs and SDKs are not identical. Two efforts are under
way to create a standard Python SDK: CloudBridge [11] and Apache Libcloud
libcloud.apache.org. While both aim to support the standard tasks for all clouds,
those tasks are only the lowest common denominator of cloud capabilities; many
unique capabilities of each cloud are available only through the REST API and
SDK for that platform. At the time of this writing, Libcloud is not complete; we
will provide an online update when it is ready and fully documented. However, we
do make use of CloudBridge in our OpenStack examples.
Building a data sample collection in the cloud. We use the following simple
example throughout this chapter to illustrate the use of Amazon, Azure, and Google
cloud storage services. We have a collection of data samples stored on our personal
computer, and for each sample we have four items of metadata: item number, creation
date, experiment id, and a text string comment. To enable access to these samples by
our collaborators, we want to upload them to cloud storage and to create a searchable
table, also hosted in the cloud, containing the metadata and cloud storage URL for
each object, as shown in figure 3.1 on the following page.
We assume that each data sample is in a binary file on our personal computer
and that the associated metadata are contained in a comma separated value (CSV)
file, with one line per item, also on our personal computer. Each line in this CSV file
has the following format:
item id, experiment id, date, filename, comment string
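For illustration, a metadata file in this format might contain lines such as the
following; the values are hypothetical and serve only to show the layout.

1, experiment1, 6/6/16, exp1, first calibration run
2, experiment1, 6/6/16, exp2, repeat of calibration run
3, experiment2, 6/7/16, exp3, first production sample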
3.2 Using Amazon Cloud Storage Services
Our Amazon solution to the example problem uses S3 to store the blobs and
DynamoDB to store the table. We first need our Amazon key pair, i.e., access
key plus secret key, which we can obtain from the Amazon IAM Management
Console. Having created a new user, we select the create access key button to
create our security credentials, which we can then download, as shown in figure 3.2
on the following page.
Figure 3.1: The simple cloud storage use case that we employ in this chapter involves the
upload of a collection of data blobs to cloud storage and the creation of a NoSQL table
containing metadata.
Figure 3.2: Downloading security credentials from the Amazon IAM Management Console.
We can proceed to create the required S3 bucket, upload our blobs to that
bucket, and so forth, all from the Amazon web portal. (We showed in figure 1.5
on page 11 the use of this portal to create a bucket.) However, there are a lot of
blobs, so we instead use the Amazon Python Boto3 SDK for these tasks. Details
on how to install this SDK are found in the link provided in the Resources section
at the end of this chapter.
Boto3 considers each service to be a resource. Thus, to use the S3 system, we
need to create an S3 resource object. To do that, we need to specify the credentials
that we obtained from the IAM Management Console. There are several ways to
provide these credentials to our Python program. The simplest is to provide them
as special named parameters to the resource instance creation function, as follows.
import boto3
s3 = boto3.resource('s3',
                    aws_access_key_id='YOUR ACCESS KEY',
                    aws_secret_access_key='your secret key')
This approach is not recommended from a security perspective, since credentials
in code have a habit of leaking if, for example, code is pushed to a shared repository:
see section 15.1 on page 315. Fortunately, this method is needed only if your
Python program is running as a separate service on a machine or a container
that does not have access to your security keys. If you are running on your own
machine, the proper solution is to have a home directory .aws that contains two
protected files: config, containing your default Amazon region, and credentials,
containing your access and secret keys. If we have this directory in place, then the
access key and secret key parameters are not needed.
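For reference, the standard layout of these two files is as follows; the region and
key values shown are placeholders that you replace with your own.

~/.aws/config:

[default]
region = us-west-2

~/.aws/credentials:

[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY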
Having created the S3 resource object, we can now create the S3 bucket,
datacont, in which we will store our data objects. The following code performs
this action. Note the (optional) second argument to the create_bucket call, which
specifies the geographic region in which the bucket should be created. At the time
of writing, Amazon operates in 13 regions; region us-west-2 is located in Oregon.
import boto3
s3 = boto3.resource('s3')
s3.create_bucket(Bucket='datacont',
                 CreateBucketConfiguration={
                     'LocationConstraint': 'us-west-2'})
Now that we have created the new bucket, we can load our data objects into it
with a command such as the following.
# Upload a file, 'test.jpg', into the newly created bucket
s3.Object('datacont', 'test.jpg').put(
    Body=open('/home/mydata/test.jpg', 'rb'))
Having learned how to upload a file to S3, we can now create the DynamoDB
table in which we will store metadata and references to S3 objects. We create this
table by defining a special key that is composed of a PartitionKey and a RowKey.
NoSQL systems such as DynamoDB are distributed over multiple storage devices,
which enables the construction of extremely large tables that can then be accessed in
parallel by many servers, each accessing one storage device. Hence the table's
aggregate bandwidth is multiplied by the number of storage devices. DynamoDB
distributes data by row: for any row, every element in that row is mapped to the
same device. Thus, to determine the device on which a data value is located, you
need only look up the PartitionKey, which is hashed to an index that determines
the physical storage device in which the row resides. The RowKey specifies that
items are stored in order sorted by the RowKey value. While it is not necessary to
have both keys, we also illustrate the use of RowKey here. We can use the following
code to create the DynamoDB table.
dyndb = boto3.resource('dynamodb', region_name='us-west-2')

# The first time that we define a table, we use
table = dyndb.create_table(
    TableName='DataTable',
    KeySchema=[
        {'AttributeName': 'PartitionKey', 'KeyType': 'HASH'},
        {'AttributeName': 'RowKey', 'KeyType': 'RANGE'}
    ],
    AttributeDefinitions=[
        {'AttributeName': 'PartitionKey', 'AttributeType': 'S'},
        {'AttributeName': 'RowKey', 'AttributeType': 'S'}
    ],
    # Provisioned read/write capacity is required when creating the table
    ProvisionedThroughput={'ReadCapacityUnits': 5, 'WriteCapacityUnits': 5}
)

# Wait for the table to be created
table.meta.client.get_waiter('table_exists').wait(TableName='DataTable')

# If the table has been previously defined, use:
# table = dyndb.Table("DataTable")
We are now ready to read the metadata from the CSV file, move the data
objects into the blob store, and enter the metadata row into the table. We do
this as follows. Recall that our CSV file format has item[3] as filename, item[0]
as itemID, item[1] as experimentID, item[2] as date, and item[4] as comment.
Note that we need to state explicitly, via ACL='public-read', that the URL for
the data file is to be publicly readable. The complete code is in notebook 2.
import csv

urlbase = "https://s3-us-west-2.amazonaws.com/datacont/"

with open('\path-to-your-data\experiments.csv', 'rb') as csvfile:
    csvf = csv.reader(csvfile, delimiter=',', quotechar='|')
    for item in csvf:
        body = open('\path-to-your-data\datafiles\\'+item[3], 'rb')
        s3.Object('datacont', item[3]).put(Body=body)
        md = s3.Object('datacont', item[3]).Acl().put(ACL='public-read')
        url = urlbase + item[3]
        metadata_item = {'PartitionKey': item[0], 'RowKey': item[1],
                         'description': item[4], 'date': item[2], 'url': url}
        table.put_item(Item=metadata_item)
3.3 Using Microsoft Azure Storage Services
We first note some basic differences between your Amazon and Azure accounts.
As we described above, your Amazon account ID is defined by a pair consisting of
your access key and your secret key. Similarly, your Azure account is defined by
your personal ID and a subscription ID. Your personal ID is probably your email
address, so that is public; the subscription ID is something to keep secret.
We use Azure's standard blob storage and Table service to implement the
example. The differences between Amazon DynamoDB and the Azure Table service
are subtle. With the Azure Table service, each row has the fields PartitionKey,
RowKey, comments, date, and URL as before, but this time the RowKey is a unique
integer for each row. The PartitionKey is used as a hash to locate the row on a
specific storage device, and the RowKey is a unique global identifier for the row.
In addition to these semantic differences between DynamoDB and Azure Tables,
there are fundamental differences between the Amazon and Azure object storage
services. In S3, you create buckets and then create blobs within a bucket. S3 also
provides an illusion of folders, although these are actually just blob name prefixes
(e.g., folder1/). In contrast, Azure storage is based on Storage Accounts, a
higher level abstraction than buckets. You can create as many storage accounts as
you want; each can contain five different types of objects: blobs, containers, file
shares, tables, and queues. Blobs are stored in bucket-like containers that can also
have a pseudo directory-like structure, similar to S3 buckets.
Given your user ID and subscription ID, you can use the Azure Python SDK
to create storage accounts, much as we create buckets in S3. However, we find it
easier to use the Azure portal. Log in and click on storage account in the menu
on the left to bring up a panel for storage accounts. To add a new account, click
on the + sign at the top of the panel. You need to supply a name and some
additional parameters such as location, duplication, and distribution. Figure 3.3
shows the storage account that we added, called escistore.
One big difference between S3 and Azure Storage accounts is that each storage
account comes with two unique access keys, either of which can be used to access
and modify the storage account. Unlike S3, you do not need the subscription
ID or user ID to add containers, tables, blobs, or queues; you need only a valid
key. You can also invalidate either key and generate new keys at any time from
the portal. The reason for having two keys is that you can use one key for your
long-running services that use that storage account and the other to grant another
entity temporary access. By regenerating that second key, you terminate access by
the third party.
Azure storage accounts are, by default, private. You can also set up a public
storage account, as we show in an example on the next page, and grant limited,
temporary access to a private account by creating, from the portal, a Shared
Access Signature for the account. Various access right properties can be configured
in this signature, including the period for which it is valid.
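Shared access signatures can also be generated programmatically. The following is
a minimal sketch using the legacy azure.storage SDK employed in this chapter; it
assumes the block_blob_service object created in the initialization code later in
this section, and the call names should be checked against your installed SDK version.

from datetime import datetime, timedelta
from azure.storage.blob import BlobPermissions

# Generate a read-only signature for one blob, valid for one hour
sas_token = block_blob_service.generate_blob_shared_access_signature(
    'datacont', 'exp1',
    permission=BlobPermissions.READ,
    expiry=datetime.utcnow() + timedelta(hours=1))

# Build a URL that embeds the signature
sas_url = block_blob_service.make_blob_url('datacont', 'exp1',
                                           sas_token=sas_token)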
Figure 3.3: Azure portal after the storage account has been created.
Having created the storage account and installed the SDK (see section 3.8 on
page 57), we can proceed to the initialization as follows. The create_table()
function returns true if a new table was created and false if the table already exists.
import azure.storage
from azure.storage.table import TableService, Entity
from azure.storage.blob import BlockBlobService
from azure.storage.blob import PublicAccess

# First, access the blob service
block_blob_service = BlockBlobService(account_name='escistore',
                                      account_key='your storage key')
block_blob_service.create_container('datacont',
                                    public_access=PublicAccess.Container)

# Next, create the table in the same storage account
table_service = TableService(account_name='escistore',
                             account_key='your account key')
if table_service.create_table('DataTable'):
    print("Table created")
else:
    print("Table already there")
The code to upload the data blobs and to build the table is almost identical to
that used for Amazon S3 and DynamoDB. The only differences are the lines that
first manage the upload and then insert items into the table. To upload the data
to the blob storage we use the create_blob_from_path() function, which takes
three parameters: the container, the name to give the blob, and the path to the
source, as shown in the following.
import csv

with open('\path-to-your-data\experiments.csv', 'rb') as csvfile:
    csvf = csv.reader(csvfile, delimiter=',', quotechar='|')
    for item in csvf:
        print(item)
        block_blob_service.create_blob_from_path(
            'datacont', item[3],
            "\path-to-your-files\datafiles\\"+item[3]
        )
        url = "https://escistore.blob.core.windows.net/datacont/"+item[3]
        metadata_item = {'PartitionKey': item[0], 'RowKey': item[1],
                         'description': item[4], 'date': item[2], 'url': url}
        table_service.insert_entity('DataTable', metadata_item)
A nice desktop tool called Azure Storage Explorer is available for Windows,
Mac, and Linux. We tested the code above with a single CSV file with only four
lines and four small blobs. Figure 3.4 shows the Storage Explorer views of the
blob container and table contents.
Figure 3.4: Azure Storage Explorer view of the contents of container datacont (above)
and the table DataTable (below) in storage account escidata.
Queries in the Azure table service Python SDK are limited to searching for
PartitionKey; however, simple filters and projections are possible. For example,
you can search for all rows associated with experiment1 and then select the url
column as shown below.
tasks = table_service.query_entities('DataTable',
    filter="PartitionKey eq 'experiment1'", select='url')
for task in tasks:
    print(task.url)
Our trivial dataset has two data blobs associated with experiment1, and thus
the query yields two results:
https://escistore.blob.core.windows.net/datacont/exp1
https://escistore.blob.core.windows.net/datacont/exp2
Similar queries are possible with Amazon DynamoDB, as sketched below. The
Azure Python code for this example is in notebook 3.
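For instance, the following sketch retrieves the same URLs from the DynamoDB
table created in section 3.2; it assumes that table object and uses the standard
Key condition helper from Boto3.

from boto3.dynamodb.conditions import Key

# Fetch all rows whose partition key is 'experiment1' and collect their URLs
response = table.query(
    KeyConditionExpression=Key('PartitionKey').eq('experiment1'))
urls = [row['url'] for row in response['Items']]
print(urls)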
3.4 Using Google Cloud Storage Services
The Google Cloud has long been a central part of Google's operations, but it is
relatively new as a public platform to rival Amazon and Azure. Google Docs,
Gmail, and Google's data storage services are well known, but the company's first
foray into providing computing and data services of potential interest to science
was Google App Engine, which is not discussed here. Recently Google has pulled
many of its internally used services together into a public platform called Google
Cloud, which includes its data storage services, its NoSQL services Cloud Datastore
and Bigtable, and various computational services that we describe in part III.
To use Google Cloud, you need an account. Google currently offers a small but
free 60-day trial account. Once you have an account, you can install the Google
Cloud SDK, which consists of the gcloud command-line tool and the gsutil
package. These tools are available for Linux, Mac OS, and Windows.
To get going, you need to install the gsutil package and then execute gcloud
init, which prompts you to log into your Google account. You also need to create
and/or select a project to work on. Once this is done, your desktop machine is
authenticated and authorized to access the Google Cloud Platform services. While
this is convenient, you must do a bit more work to write Python scripts that can
access your resources from anywhere. We discuss this topic later.
For now, if we bring up our Jupyter notebook on our local machine, it is
authenticated automatically. Creating a storage bucket and uploading data are
now easy. You can create a bucket from the console or programmatically. Note
that your bucket name must be unique across all of Google Cloud, so when creating
a bucket programmatically, you may wish to use a universally unique identifier
(UUID) as the name. For simplicity, we do not do that here.
from gcloud import storage
client = storage.Client()

# Create a bucket with name 'afirstbucket'
bucket = client.create_bucket('afirstbucket')

# Create a blob with name 'my-test-file.txt' and load some data
blob = bucket.blob('my-test-file.txt')
blob.upload_from_string('this is test content!')
blob.make_public()
The blob is now created and can be accessed at the following address.
https://storage.googleapis.com/afirstbucket/my-test-file.txt
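Because the blob was made public, any HTTP client can fetch it from that URL.
You can also read it back through the storage client itself; the following is a minimal
sketch, assuming the get_blob() and download_as_string() calls available in this
client library.

# Read the blob back through the storage client
blob2 = bucket.get_blob('my-test-file.txt')
print(blob2.download_as_string())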
Google Cloud has several NoSQL table storage services. We illustrate the use
of two here: Bigtable and Datastore.
3.4.1 Google Bigtable
Bigtable is the progenitor of Apache HBase, the NoSQL store built on the Hadoop
Distributed File System (HDFS). Bigtable and HBase are designed for large data
collections that must be accessed quickly by major data analytics jobs. Provisioning
a Bigtable instance requires provisioning a cluster of servers. This task is most
easily performed from the console. Figure 3.5 illustrates the creation of an instance
called cloud-book-instance, which we provision on a cluster of three nodes.
Figure 3.5: Using the Google Cloud console to create a Bigtable instance.
The following code illustrates how we can connect to our Bigtable instance and
then create a table, a column family, and a row in that table. Column
families are the groups of columns that organize the contents of a row; each row
has a single unique key.
from gcloud import bigtable
clientbt = bigtable.Client(admin=True)
clientbt.start()
instance = clientbt.instance('cloud-book-instance')

table = instance.table('book-table')
table.create()
# Table has been created
column_family = table.column_family('cf')
column_family.create()

# Now insert a row with key 'key1' and columns 'experiment', 'date', 'link'
row = table.row('key1')
row.set_cell('cf', 'experiment', 'exp1')
row.set_cell('cf', 'date', '6/6/16')
row.set_cell('cf', 'link', 'http://some_location')
row.commit()
Bigtable is important because it is used with other Google Cloud services
for data analysis and is an ideal solution for truly massive data collections. The
Python API for Bigtable is a bit awkward to use because the operations are
all asynchronous remote procedure calls (RPCs): when a Python create() call
returns, the object created may not yet be available. For those who want to
experiment, we provide in notebook 4 code illustrating the use of Bigtable.
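For completeness, here is a hedged sketch of reading that row back with the same
client library; the read_row() call and the cells attribute are assumptions about
this SDK version and should be checked against its documentation.

# Read the row with key 'key1' and print the stored cell values
row_data = table.read_row('key1')
for column, cells in row_data.cells['cf'].items():
    print(column, cells[0].value)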
3.4.2 Google Cloud Datastore
Notebook 5 shows how to implement equivalent functionality by using Datastore.
This service is replicated, supports ACID transactions and SQL-like queries, and
is easier to use from the Python SDK than Bigtable. We create a table as follows.
from gcloud import datastore
clientds = datastore.Client()
key = clientds.key('blobtable')
To add an entity to the table, we write the following.
entity = datastore.Entity(key=key)
entity['experiment-name'] = 'experiment name'
entity['date'] = 'the date'
entity['description'] = 'the text describing the experiment'
entity['url'] = 'the url'
clientds.put(entity)
You can now implement the use case with ease. You first create a bucket
book-datacont on the Google Cloud portal. Again, it is easiest to perform this
operation on the portal because the name must be unique across Google Cloud. (If
you include a fixed name in your code, then the create call will fail when you rerun
your program.) Furthermore, the bucket name book-datacont is now taken, so
when you try this you need to choose your own unique name. You can then create
a datastore table book-table as follows.
from gcloud import storage
from gcloud import datastore
import csv

client = storage.Client()
clientds = datastore.Client()
bucket = client.bucket('book-datacont')
key = clientds.key('book-table')
Your blob uploader and table builder can now be written as follows. The code
is the same as that provided for the other systems, with the exception that we do
not use a partition or row key.
with open('\path-to-your-data\experiments.csv', 'rb') as csvfile:
    csvf = csv.reader(csvfile, delimiter=',', quotechar='|')
    for item in csvf:
        print(item)
        blob = bucket.blob(item[3])
        data = open("\path-to-your-data\datafiles\\"+item[3], 'rb')
        blob.upload_from_file(data)
        blob.make_public()
        url = "https://storage.googleapis.com/book-datacont/"+item[3]
        entity = datastore.Entity(key=key)
        entity['experiment-name'] = item[0]
        entity['experiment-id'] = item[1]
        entity['date'] = item[2]
        entity['description'] = item[4]
        entity['url'] = url
        clientds.put(entity)
Datastore has an extensive query interface that can be used from the portal;
some, but not all, features are also available from the Python API. For example,
we can write the following to find the URLs for experiment1.
query = clientds.query(kind=u'book-table')
query.add_filter(u'experiment-name', '=', 'experiment1')
results = list(query.fetch())
urls = [result['url'] for result in results]
3.5 Using OpenStack Cloud Storage Services
OpenStack is used by IBM and Rackspace for their public cloud offerings and also
by many private clouds. We focus our OpenStack examples in this book on the
NSF Jetstream cloud. As we discussed in chapter 2, Jetstream is not intended to
duplicate the current public clouds, but rather to offer services that are tuned to
the specific needs of the science community. One missing component is a standard
NoSQL database service, so we cannot implement the data catalog example that
we presented for the other clouds.
A Python SDK called CloudBridge works with Jetstream and other OpenStack-based
clouds. (CloudBridge also works with Amazon, but it is less comprehensive
than Boto3.) To use CloudBridge, you first create a provider object, identifying
the cloud with which you want to work and supplying your credentials, for example
as follows.
from cloudbridge.cloud.factory import CloudProviderFactory, ProviderList

js_config = {"os_username": "your user name",
             "os_password": "your password",
             "os_auth_url": "https://jblb.jetstream-cloud.org:35357/v3",
             "os_user_domain_name": "tacc",
             "os_tenant_name": "tenant name",
             "os_project_domain_name": "tacc",
             "os_project_name": "tenant name"
             }

js = CloudProviderFactory().create_provider(ProviderList.OPENSTACK,
                                            js_config)
You may now use the provider object reference to first create a bucket (also
called a container) and then upload a binary object to that new bucket, as follows.
# Create new bucket
bucket = js.object_store.create('my_bucket_name')

# Create new object within bucket
buckobj = bucket[0].create_object('stuff')
fo = open('\path to your data\stuff.txt', 'rb')

# Upload file contents to new object
buckobj.upload(fo)
To verify that these actions worked, you can log into the OpenStack portal
and check the current container state, as shown in figure 3.6 on the following page.
The complete code is in notebook 6.
Figure 3.6: View of containers in the OpenStack object store.
3.6 Transferring and Sharing Data with Globus
When using cloud resources in science or engineering, we often need to copy data
between cloud and non-cloud systems. For example, we may need to move genome
sequence files from a sequencing center to the cloud for analysis, and analysis
results to our laboratory. Figure 3.7 shows how Globus services can be used
for such purposes. We see in this figure three different storage systems: one
associated with a sequencing center, a cloud storage service, and one located on a
personal computer. Each runs a lightweight Globus Connect agent that allows it
to participate in Globus transfers.
The Globus Connect agent enables a computer system to interact with the
Globus file transfer and sharing service. With it, the user can easily create a Globus
endpoint on practically any system, from a personal laptop to a cloud or national
supercomputer. Globus Connect comes in two flavors: Globus Connect Personal,
for use by a single user on a personal machine, and Globus Connect Server, for
use on multiuser computing and storage resources.
Globus Connect supports a wide variety of storage systems, including both
various POSIX-compliant storage systems (Linux, Windows, MacOS; Lustre, GPFS,
OrangeFS, etc.) and various specialized systems (HPSS, HDFS, S3, Ceph RadosGW
via the S3 API, Spectra Logic BlackPearl, and Google Drive). It also interfaces to a
variety of different user identity and authentication mechanisms.
Note that all of the public cloud examples presented earlier in this chapter
involved client-server interactions: in each case, we had to run the data upload
to the cloud from the machine that mounts the storage system with the data.
Figure 3.7: Globus transfer and sharing services used to exchange data among sequencing
center, remote storage system (in this case, Google Drive), and a personal computer.
In contrast, Globus allows third-party transfers, meaning that from a computer A
you can drive a transfer from one endpoint B to another endpoint C. This capability
is often important when automating scientific workflows.
The figure depicts a series of five data manipulation operations. (1) A researcher
requests, for example via the Globus web interface, that a set of files be transferred
from a sequencing center to another storage system, in this case the Google Drive
cloud storage system. (2) The transfer then proceeds without further engagement
by the requesting researcher, who can shut down her laptop, go to lunch, or do
whatever else is needed. The Globus cloud service (not shown in the figure) is
responsible for completing the transfer, retrying failed transfers if required, and
notifying the user of failure only if repeated retries are not successful. The user
requires only a web browser to access the service, can transfer data to and from
any storage system that runs the Globus Connect software, and can authenticate
using an institutional credential. Steps 3–5 are concerned with data sharing, which
we discuss in section 3.6.2 on page 54.
3.6.1 Transferring Data with Globus
Figure 3.8 shows the Globus web interface being used to transfer a file. Cloud
services are being used here in two ways: Data are being transferred from cloud
storage, in this case Amazon S3; and the Globus service is cloud-hosted software
as a service—running, as we discuss in chapter 14, on the Amazon cloud.
Figure 3.8: Globus transfer web interface. We have selected a single file on the Globus
endpoint Globus Vault to transfer to the endpoint My laboratory workstation.
Globus also provides a REST API and a Python SDK for that API, allowing
you to drive transfers entirely from Python programs. We use the code in figure 3.9
on page 55 to illustrate how the Python SDK can be used to perform and then
monitor a transfer. The first lines of code, labeled (a), create a transfer client
instance. This handles connection management, security, and other administrative
issues. The code then (b) specifies the identifiers for the source and destination
endpoints. Each Globus endpoint and user is named by a universally unique
identifier (UUID). An endpoint's identifier can be determined via the Globus web
client or programmatically, via an endpoint search API. For example, with the
Python SDK you can write:
tc = globus_sdk.TransferClient(...)
for ep in tc.endpoint_search('String to search for'):
    print(ep['display_name'])
In figure 3.9, we hard-code the identifiers of two endpoints that the Globus
team operates for use in tutorials. The code also specifies (c) the source and
destination paths for the transfer.
Next, the code (d) ensures that the endpoints are activated. In order for the
transfer service to perform operations on endpoint file systems, it must have a
credential to authenticate to the endpoint as a specific local user. The process
of providing such a credential to the service is called endpoint activation [22].
The endpoint_autoactivate function checks whether the endpoint is activated
for the calling user, or can be automatically activated using a cached credential
that will not expire for at least a specified period of time. Otherwise it returns a
failure condition, in which case the user can use the Globus web interface to
authenticate and thus provide a credential that Globus can use for some time
period. We show code that handles that case for the destination endpoint.
In (e)–(g), we assemble and submit the transfer request, providing the endpoint
identifiers, the source and destination paths, and (since we want to transfer a
directory) the recursive flag. In (h), we check for task status. This blocking call
returns after a specified timeout or when the task terminates, whichever is sooner.
In this case, we choose to (i) terminate the task if it has not completed by the
timeout; we could instead repeat the wait.
More examples of the Globus Python SDK are in notebook 8. Globus also
provides a command line interface (CLI), implemented via the Python SDK, that
can be used to perform the operations just described. For example, the following
command transfers a directory from one endpoint to another. More details on how
to use the CLI are available online at docs.globus.org.
globus transfer --recursive \
"ddb59aef-6d04-11e5-ba46-22000b92c6ec":shared_dir \
"ddb59af0-6d04-11e5-ba46-22000b92c6ec":destination_directory
3.6.2 Sharing Data with Globus
Globus also makes it easy to share data with colleagues, as shown in figure 3.7
on page 52, steps 3–5. A shared endpoint is a dynamically created construct
that enables a folder on an existing endpoint to be shared with others. To use this
feature, you first create a shared endpoint, designating an existing endpoint and
folder, and then grant read and/or write permissions on that shared endpoint to
the Globus user(s) and/or group(s) that you want to be able to access it. Shared
endpoints can be created and managed both via the Globus web interface (see
figure 3.10 on page 56) and programmatically, as we show in chapter 11.
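As a hedged preview of the programmatic route, the following sketch uses the
transfer client from figure 3.9 to create a shared endpoint and grant a collaborator
read access; the host path and the collaborator identity UUID are hypothetical
placeholders, and the call names should be checked against the SDK documentation.

# Create a shared endpoint rooted at a folder on an existing endpoint
shared_ep = tc.create_shared_endpoint({
    "DATA_TYPE": "shared_endpoint",
    "host_endpoint": dest_endpoint_id,
    "host_path": "/~/shared_dir/",
    "display_name": "My shared data"})

# Grant read access to a collaborator's Globus identity (placeholder UUID)
tc.add_endpoint_acl_rule(shared_ep["id"], {
    "DATA_TYPE": "access",
    "principal_type": "identity",
    "principal": "COLLABORATOR-IDENTITY-UUID",
    "path": "/",
    "permissions": "r"})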
# (a) Prepare transfer client
import globus_sdk
tc = globus_sdk.TransferClient()

# (b) Define the source and destination endpoints for the transfer
source_endpoint_id = 'ddb59aef-6d04-11e5-ba46-22000b92c6ec'
dest_endpoint_id = 'ddb59af0-6d04-11e5-ba46-22000b92c6ec'

# (c) Define the source and destination paths for the transfer
source_path = '/share/godata/'
dest_path = '/~/'

# (d) Ensure endpoints are activated
tc.endpoint_autoactivate(source_endpoint_id, if_expires_in=3600)
r = tc.endpoint_autoactivate(dest_endpoint_id, if_expires_in=3600)
while r['code'] == 'AutoActivationFailed':
    print('To activate endpoint, open URL in browser:')
    print('https://www.globus.org/app/endpoints/%s/activate'
          % dest_endpoint_id)
    # For Python 2.X, use raw_input() instead
    input('Press ENTER after activating the endpoint:')
    r = tc.endpoint_autoactivate(dest_endpoint_id, if_expires_in=3600)

# (e) Start transfer set up
tdata = globus_sdk.TransferData(tc, source_endpoint_id,
                                dest_endpoint_id,
                                label='My test transfer')

# (f) Specify a recursive transfer of directory contents
tdata.add_item(source_path, dest_path, recursive=True)

# (g) Submit transfer request
r = tc.submit_transfer(tdata)
print('Task ID:', r['task_id'])

# (h) Wait for transfer to complete, with timeout
done = tc.task_wait(r['task_id'], timeout=1000)

# (i) Check for success; cancel if not completed by timeout
if done:
    print('Task completed')
else:
    tc.cancel_task(r['task_id'])
    print('Task did not complete in time')
Figure 3.9: Using the Globus Python SDK to initiate and monitor a transfer request.
Figure 3.10: Globus web interface for creating a shared endpoint.
3.7 Summary
We have introduced in this chapter fundamental methods for interacting with cloud
storage services. We focused, in particular, on blob services and NoSQL table
services. As we noted in chapter 2, these are far from being the only cloud storage
offerings. Indeed, as we show in the next chapter, POSIX file storage systems
attached to virtual machines are particularly important for high-performance
applications. So too are data analytic warehouses, which we discuss in part III.
We also delved more deeply in this chapter into cloud access methods first
introduced in section 1.4 on page 8. We showed how cloud provider portals can be
used for interactive access to resources and services, and also how Python SDKs
can be used to script data management and analysis workflows. We used them,
in particular, to script the task of uploading a set of data objects and building a
searchable NoSQL table of metadata where each row of the table corresponds to
one of the uploaded blobs of data.
As you studied the Python code that we provided for Amazon, Azure, and
Google, you may have been frustrated that the programs for the different versions
are almost but not quite the same. Why not have a single API and corresponding
SDK for all clouds? As we have said, various attempts to create such a uniform
SDK are under way, but these attempts can cover only the common intersection
of each cloud's capabilities. Each cloud service grew out of a different culture in
a different company, and so they are not identical. The resulting creative ferment has
contributed to the explosion of tools and concepts that we describe in this book.
3.8 Resources
We provide all examples in this chapter as Jupyter notebooks, as described in
chapter 17. You first need to install the SDKs for each cloud. The SDKs and links
to documentation are here:
Amazon Boto3 aws.amazon.com/sdk-for-python/
Azure azure.microsoft.com/en-us/develop/python/
Google Cloud cloud.google.com/sdk/
OpenStack CloudBridge cloudbridge.readthedocs.io/en/latest/
Globus Python SDK github.com/globus/globus-sdk-python
Globus CLI github.com/globus/globus-cli
You also need an account on each cloud. Chapter 1 provides links to the portals
where you can obtain a trial account.