Chapter 3
Using Cloud Storage Services
“Collecting data is only the first step toward wisdom, but sharing data
is the first step toward community.”
—Henry Louis Gates Jr.
We introduced in chapter 2 a set of important cloud storage concepts and a range
of cloud provider services that implement these concepts in practice. While the
services of different cloud providers are often similar in outline, they invariably
differ in the details. Here we therefore describe, and illustrate with examples, the
storage services of three major public clouds (Amazon, Azure, Google) and of
one major open source cloud system, OpenStack. And because your science will
often involve data that exist outside the cloud, we also describe how you can use
Globus to move data between the cloud and other data centers and to share data
with collaborators.
3.1 Two Access Methods: Portals and APIs
As we discussed in section 1.4 on page 8, cloud providers make available two main
methods for managing data and services: portals and REST APIs.
Each cloud provider's web portal typically allows you to accomplish anything
that you want to do with a few mouse clicks. We provide several examples of such
web portals to illustrate how they work.
While such portals are good for performing simple actions, they are not ideal
for the repetitive tasks, such as managing hundreds of data objects, that scientists
need to do on a daily basis. For such tasks, we need an interface to the cloud that
we can program. Cloud providers make this possible by providing REST APIs that
programmers can use to access their services programmatically. For programming
convenience, you will usually access these APIs via software development kits
(SDKs), which give programmers language-specific functions for interacting with
cloud services. We discuss the Python SDKs here. The code below is all for
Python 2.7, but is easily converted to Python 3.
Each cloud has special features that make it unique, and thus the different
cloud providers' REST APIs and SDKs are not identical. Two efforts are under
way to create a standard Python SDK: CloudBridge [11] and Apache Libcloud
libcloud.apache.org. While both aim to support the standard tasks for all clouds,
those tasks are only the lowest common denominator of cloud capabilities; many
unique capabilities of each cloud are available only through the REST API and
SDK for that platform. At the time of this writing, Libcloud is not complete; we
will provide an online update when it is ready and fully documented. However, we
do make use of CloudBridge in our OpenStack examples.
Building a data sample collection in the cloud. We use the following simple
example throughout this chapter to illustrate the use of Amazon, Azure, and Google
cloud storage services. We have a collection of data samples stored on our personal
computer, and for each sample we have four items of metadata: item number, creation
date, experiment id, and a text string comment. To enable access to these samples by
our collaborators, we want to upload them to cloud storage and to create a searchable
table, also hosted in the cloud, containing the metadata and cloud storage URL for
each object, as shown in figure 3.1 on the following page.
We assume that each data sample is in a binary file on our personal computer
and that the associated metadata are contained in a comma separated value (CSV)
file, with one line per item, also on our personal computer. Each line in this CSV file
has the following format:
item id, experiment id, date, filename, comment string
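For illustration, a metadata file in this format might contain lines such as the
following; the values are hypothetical and serve only to show the layout.

1, experiment1, 6/6/16, exp1, first calibration run
2, experiment1, 6/6/16, exp2, repeat of calibration run
3, experiment2, 6/7/16, exp3, first production sample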
3.2 Using Amazon Cloud Storage Services
Our Amazon solution to the example problem uses S3 to store the blobs and
DynamoDB to store the table. We first need our Amazon key pair, i.e., access
key plus secret key, which we can obtain from the Amazon IAM Management
Console. Having created a new user, we select the create access key button to
create our security credentials, which we can then download, as shown in figure 3.2
on the following page.
Figure 3.1: The simple cloud storage use case that we employ in this chapter involves the
upload of a collection of data blobs to cloud storage and the creation of a NoSQL table
containing metadata.
Figure 3.2: Downloading security credentials from the Amazon IAM Management Console.
We can proceed to create the required S3 bucket, upload our blobs to that
bucket, and so forth, all from the Amazon web portal. (We showed in figure 1.5
on page 11 the use of this portal to create a bucket.) However, there are a lot of
blobs, so we instead use the Amazon Python Boto3 SDK for these tasks. Details
on how to install this SDK are found in the link provided in the Resources section
at the end of this chapter.
Boto3 considers each service to be a resource. Thus, to use the S3 system, we
need to create an S3 resource object. To do that, we need to specify the credentials
that we obtained from the IAM Management Console. There are several ways to
provide these credentials to our Python program. The simplest is to provide them
as special named parameters to the resource instance creation function, as follows.
import boto3
s3 = boto3.resource('s3',
                    aws_access_key_id='YOUR ACCESS KEY',
                    aws_secret_access_key='your secret key')
This approach is not recommended from a security perspective, since credentials
in code have a habit of leaking if, for example, code is pushed to a shared repository:
see section 15.1 on page 315. Fortunately, this method is needed only if your
Python program is running as a separate service on a machine or a container
that does not have access to your security keys. If you are running on your own
machine, the proper solution is to have a home directory .aws that contains two
protected files: config, containing your default Amazon region, and credentials,
containing your access and secret keys. If we have this directory in place, then the
access key and secret key parameters are not needed.
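For reference, the standard layout of these two files is as follows; the region and
key values shown are placeholders that you replace with your own.

~/.aws/config:

[default]
region = us-west-2

~/.aws/credentials:

[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY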
Having created the S3 resource object, we can now create the S3 bucket,
datacont, in which we will store our data objects. The following code performs
this action. Note the (optional) second argument to the create_bucket call, which
specifies the geographic region in which the bucket should be created. At the time
of writing, Amazon operates in 13 regions; region us-west-2 is located in Oregon.
import boto3
s3 = boto3.resource('s3')
s3.create_bucket(Bucket='datacont',
                 CreateBucketConfiguration={
                     'LocationConstraint': 'us-west-2'})
Now that we have created the new bucket, we can load our data objects into it
with a command such as the following.
# Upload a file, 'test.jpg', into the newly created bucket
s3.Object('datacont', 'test.jpg').put(
    Body=open('/home/mydata/test.jpg', 'rb'))
Having learned how to upload a file to S3, we can now create the DynamoDB
table in which we will store metadata and references to S3 objects. We create this
table by defining a special key that is composed of a PartitionKey and a RowKey.
NoSQL systems such as DynamoDB are distributed over multiple storage devices,
which enables the construction of extremely large tables that can then be accessed in
parallel by many servers, each accessing one storage device. Hence the table's
aggregate bandwidth is multiplied by the number of storage devices. DynamoDB
distributes data by row: for any row, every element in that row is mapped to the
same device. Thus, to determine the device on which a data value is located, you
need only look up the PartitionKey, which is hashed to an index that determines
the physical storage device in which the row resides. The RowKey specifies that
items are stored in order sorted by the RowKey value. While it is not necessary to
have both keys, we also illustrate the use of RowKey here. We can use the following
code to create the DynamoDB table.
dyndb = boto3.resource('dynamodb', region_name='us-west-2')

# The first time that we define a table, we use
table = dyndb.create_table(
    TableName='DataTable',
    KeySchema=[
        {'AttributeName': 'PartitionKey', 'KeyType': 'HASH'},
        {'AttributeName': 'RowKey', 'KeyType': 'RANGE'}
    ],
    AttributeDefinitions=[
        {'AttributeName': 'PartitionKey', 'AttributeType': 'S'},
        {'AttributeName': 'RowKey', 'AttributeType': 'S'}
    ],
    # Provisioned read/write capacity is required when creating the table
    ProvisionedThroughput={'ReadCapacityUnits': 5, 'WriteCapacityUnits': 5}
)

# Wait for the table to be created
table.meta.client.get_waiter('table_exists').wait(TableName='DataTable')

# If the table has been previously defined, use:
# table = dyndb.Table("DataTable")
We are now ready to read the metadata from the CSV file, move the data
objects into the blob store, and enter the metadata row into the table. We do
this as follows. Recall that our CSV file format has item[3] as filename, item[0]
as itemID, item[1] as experimentID, item[2] as date, and item[4] as comment.
Note that we need to state explicitly, via ACL='public-read', that the URL for
the data file is to be publicly readable. The complete code is in notebook 2.
import csv

urlbase = "https://s3-us-west-2.amazonaws.com/datacont/"

with open('\path-to-your-data\experiments.csv', 'rb') as csvfile:
    csvf = csv.reader(csvfile, delimiter=',', quotechar='|')
    for item in csvf:
        body = open('\path-to-your-data\datafiles\\'+item[3], 'rb')
        s3.Object('datacont', item[3]).put(Body=body)
        md = s3.Object('datacont', item[3]).Acl().put(ACL='public-read')
        url = urlbase + item[3]
        metadata_item = {'PartitionKey': item[0], 'RowKey': item[1],
                         'description': item[4], 'date': item[2], 'url': url}
        table.put_item(Item=metadata_item)
3.3 Using Microsoft Azure Storage Services
We first note some basic differences between your Amazon and Azure accounts.
As we described above, your Amazon account ID is defined by a pair consisting of
your access key and your secret key. Similarly, your Azure account is defined by
your personal ID and a subscription ID. Your personal ID is probably your email
address, so that is public; the subscription ID is something to keep secret.
We use Azure's standard blob storage and Table service to implement the
example. The differences between Amazon DynamoDB and the Azure Table service
are subtle. With the Azure Table service, each row has the fields PartitionKey,
RowKey, comments, date, and URL as before, but this time the RowKey is a unique
integer for each row. The PartitionKey is used as a hash to locate the row on a
specific storage device, and the RowKey is a unique global identifier for the row.
In addition to these semantic differences between DynamoDB and Azure Tables,
there are fundamental differences between the Amazon and Azure object storage
services. In S3, you create buckets and then create blobs within a bucket. S3 also
provides an illusion of folders, although these are actually just blob name prefixes
(e.g., folder1/). In contrast, Azure storage is based on Storage Accounts, a
higher level abstraction than buckets. You can create as many storage accounts as
you want; each can contain five different types of objects: blobs, containers, file
shares, tables, and queues. Blobs are stored in bucket-like containers that can also
have a pseudo directory-like structure, similar to S3 buckets.
Given your user ID and subscription ID, you can use the Azure Python SDK
to create storage accounts, much as we create buckets in S3. However, we find it
easier to use the Azure portal. Log in and click on storage account in the menu
on the left to bring up a panel for storage accounts. To add a new account, click
on the + sign at the top of the panel. You need to supply a name and some
additional parameters such as location, duplication, and distribution. Figure 3.3
shows the storage account that we added, called escistore.
One big difference between S3 and Azure Storage accounts is that each storage
account comes with two unique access keys, either of which can be used to access
and modify the storage account. Unlike S3, you do not need the subscription
ID or user ID to add containers, tables, blobs, or queues; you need only a valid
key. You can also invalidate either key and generate new keys at any time from
the portal. The reason for having two keys is that you can use one key for your
long-running services that use that storage account and the other to grant another
entity temporary access. By regenerating that second key, you terminate access by
the third party.
Azure storage accounts are, by default, private. You can also set up a public
storage account, as we show in an example on the next page, and grant limited,
temporary access to a private account by creating, from the portal, a Shared
Access Signature for the account. Various access right properties can be configured
in this signature, including the period for which it is valid.
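Shared access signatures can also be generated programmatically. The following is
a minimal sketch using the legacy azure.storage SDK employed in this chapter; it
assumes the block_blob_service object created in the initialization code later in
this section, and the call names should be checked against your installed SDK version.

from datetime import datetime, timedelta
from azure.storage.blob import BlobPermissions

# Generate a read-only signature for one blob, valid for one hour
sas_token = block_blob_service.generate_blob_shared_access_signature(
    'datacont', 'exp1',
    permission=BlobPermissions.READ,
    expiry=datetime.utcnow() + timedelta(hours=1))

# Build a URL that embeds the signature
sas_url = block_blob_service.make_blob_url('datacont', 'exp1',
                                           sas_token=sas_token)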
Figure 3.3: Azure portal after the storage account has been created.
Having created the storage account and installed the SDK (see section 3.8 on
page 57), we can proceed to the initialization as follows. The create_table()
function returns true if a new table was created and false if the table already exists.
import azure.storage
from azure.storage.table import TableService, Entity
from azure.storage.blob import BlockBlobService
from azure.storage.blob import PublicAccess

# First, access the blob service
block_blob_service = BlockBlobService(account_name='escistore',
                                      account_key='your storage key')
block_blob_service.create_container('datacont',
                                    public_access=PublicAccess.Container)

# Next, create the table in the same storage account
table_service = TableService(account_name='escistore',
                             account_key='your account key')
if table_service.create_table('DataTable'):
    print("Table created")
else:
    print("Table already there")
The code to upload the data blobs and to build the table is almost identical to
that used for Amazon S3 and DynamoDB. The only differences are the lines that
first manage the upload and then insert items into the table. To upload the data
to the blob storage we use the create_blob_from_path() function, which takes
three parameters: the container, the name to give the blob, and the path to the
source, as shown in the following.
import csv

with open('\path-to-your-data\experiments.csv', 'rb') as csvfile:
    csvf = csv.reader(csvfile, delimiter=',', quotechar='|')
    for item in csvf:
        print(item)
        block_blob_service.create_blob_from_path(
            'datacont', item[3],
            "\path-to-your-files\datafiles\\"+item[3]
        )
        url = "https://escistore.blob.core.windows.net/datacont/"+item[3]
        metadata_item = {'PartitionKey': item[0], 'RowKey': item[1],
                         'description': item[4], 'date': item[2], 'url': url}
        table_service.insert_entity('DataTable', metadata_item)
A nice desktop tool called Azure Storage Explorer is available for Windows,
Mac, and Linux. We tested the code above with a single CSV file with only four
lines and four small blobs. Figure 3.4 shows the Storage Explorer views of the
blob container and table contents.
Figure 3.4: Azure Storage Explorer view of the contents of container datacont (above)
and the table DataTable (below) in storage account escidata.
Queries in the Azure table service Python SDK are limited to searching for
PartitionKey; however, simple filters and projections are possible. For example,
you can search for all rows associated with experiment1 and then select the url
column as shown below.
tasks = table_service.query_entities('DataTable',
    filter="PartitionKey eq 'experiment1'", select='url')
for task in tasks:
    print(task.url)
Our trivial dataset has two data blobs associated with experiment1, and thus
the query yields two results:
https://escistore.blob.core.windows.net/datacont/exp1
https://escistore.blob.core.windows.net/datacont/exp2
Similar queries are possible with Amazon DynamoDB, as sketched below. The
Azure Python code for this example is in notebook 3.
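For instance, the following sketch retrieves the same URLs from the DynamoDB
table created in section 3.2; it assumes that table object and uses the standard
Key condition helper from Boto3.

from boto3.dynamodb.conditions import Key

# Fetch all rows whose partition key is 'experiment1' and collect their URLs
response = table.query(
    KeyConditionExpression=Key('PartitionKey').eq('experiment1'))
urls = [row['url'] for row in response['Items']]
print(urls)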
3.4 Using Google Cloud Storage Services
The Google Cloud has long been a central part of Google's operations, but it is
relatively new as a public platform to rival Amazon and Azure. Google Docs,
Gmail, and Google's data storage services are well known, but the company's first
foray into providing computing and data services of potential interest to science
was Google App Engine, which is not discussed here. Recently Google has pulled
many of its internally used services together into a public platform called Google
Cloud, which includes its data storage services, its NoSQL services Cloud Datastore
and Bigtable, and various computational services that we describe in part III.
To use Google Cloud, you need an account. Google currently offers a small but
free 60-day trial account. Once you have an account, you can install the Google
Cloud SDK, which consists of the gcloud command-line tool and the gsutil
package. These tools are available for Linux, Mac OS, and Windows.
To get going, you need to install the gsutil package and then execute gcloud
init, which prompts you to log into your Google account. You also need to create
and/or select a project to work on. Once this is done, your desktop machine is
authenticated and authorized to access the Google Cloud Platform services. While
this is convenient, you must do a bit more work to write Python scripts that can
access your resources from anywhere. We discuss this topic later.
For now, if we bring up our Jupyter notebook on our local machine, it is
authenticated automatically. Creating a storage bucket and uploading data are
now easy. You can create a bucket from the console or programmatically. Note
that your bucket name must be unique across all of Google Cloud, so when creating
a bucket programmatically, you may wish to use a universally unique identifier
(UUID) as the name. For simplicity, we do not do that here.
from gcloud import storage
client = storage.Client()

# Create a bucket with name 'afirstbucket'
bucket = client.create_bucket('afirstbucket')

# Create a blob with name 'my-test-file.txt' and load some data
blob = bucket.blob('my-test-file.txt')
blob.upload_from_string('this is test content!')
blob.make_public()
The blob is now created and can be accessed at the following address.
https://storage.googleapis.com/afirstbucket/my-test-file.txt
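Because the blob was made public, any HTTP client can fetch it from that URL.
You can also read it back through the storage client itself; the following is a minimal
sketch, assuming the get_blob() and download_as_string() calls available in this
client library.

# Read the blob back through the storage client
blob2 = bucket.get_blob('my-test-file.txt')
print(blob2.download_as_string())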
Google Cloud has several NoSQL table storage services. We illustrate the use
of two here: Bigtable and Datastore.
3.4.1 Google Bigtable
Bigtable is the progenitor of Apache HBase, the NoSQL store built on the Hadoop
Distributed File System (HDFS). Bigtable and HBase are designed for large data
collections that must be accessed quickly by major data analytics jobs. Provisioning
a Bigtable instance requires provisioning a cluster of servers. This task is most
easily performed from the console. Figure 3.5 illustrates the creation of an instance
called cloud-book-instance, which we provision on a cluster of three nodes.
Figure 3.5: Using the Google Cloud console to create a Bigtable instance.
The following code illustrates how we can connect to our Bigtable instance and
then create a table, a column family, and a row in that table. Column
families are the groups of columns that organize the contents of a row; each row
has a single unique key.
from gcloud import bigtable
clientbt = bigtable.Client(admin=True)
clientbt.start()
instance = clientbt.instance('cloud-book-instance')

table = instance.table('book-table')
table.create()
# Table has been created
column_family = table.column_family('cf')
column_family.create()

# Now insert a row with key 'key1' and columns 'experiment', 'date', 'link'
row = table.row('key1')
row.set_cell('cf', 'experiment', 'exp1')
row.set_cell('cf', 'date', '6/6/16')
row.set_cell('cf', 'link', 'http://some_location')
row.commit()
Bigtable is important because it is used with other Google Cloud services
for data analysis and is an ideal solution for truly massive data collections. The
Python API for Bigtable is a bit awkward to use because the operations are
all asynchronous remote procedure calls (RPCs): when a Python create() call
returns, the object created may not yet be available. For those who want to
experiment, we provide in notebook 4 code illustrating the use of Bigtable.
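For completeness, here is a hedged sketch of reading that row back with the same
client library; the read_row() call and the cells attribute are assumptions about
this SDK version and should be checked against its documentation.

# Read the row with key 'key1' and print the stored cell values
row_data = table.read_row('key1')
for column, cells in row_data.cells['cf'].items():
    print(column, cells[0].value)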
3.4.2 Google Cloud Datastore
Notebook 5 shows how to implement equivalent functionality by using Datastore.
This service is replicated, supports ACID transactions and SQL-like queries, and
is easier to use from the Python SDK than Bigtable. We create a table as follows.
from gcloud import datastore
clientds = datastore.Client()
key = clientds.key('blobtable')
To add an entity to the table, we write the following.
entity = datastore.Entity(key=key)
entity['experiment-name'] = 'experiment name'
entity['date'] = 'the date'
entity['description'] = 'the text describing the experiment'
entity['url'] = 'the url'
clientds.put(entity)
You can now implement the use case with ease. You first create a bucket
book-datacont on the Google Cloud portal. Again, it is easiest to perform this
operation on the portal because the name must be unique across Google Cloud. (If
you include a fixed name in your code, then the create call will fail when you rerun
your program.) Furthermore, the bucket name book-datacont is now taken, so
when you try this you need to choose your own unique name. You can then create
a datastore table book-table as follows.
from gcloud import storage
from gcloud import datastore
import csv

client = storage.Client()
clientds = datastore.Client()
bucket = client.bucket('book-datacont')
key = clientds.key('book-table')
Your blob uploader and table builder can now be written as follows. The code
is the same as that provided for the other systems, with the exception that we do
not use a partition or row key.
with open('\path-to-your-data\experiments.csv', 'rb') as csvfile:
    csvf = csv.reader(csvfile, delimiter=',', quotechar='|')
    for item in csvf:
        print(item)
        blob = bucket.blob(item[3])
        data = open("\path-to-your-data\datafiles\\"+item[3], 'rb')
        blob.upload_from_file(data)
        blob.make_public()
        url = "https://storage.googleapis.com/book-datacont/"+item[3]
        entity = datastore.Entity(key=key)
        entity['experiment-name'] = item[0]
        entity['experiment-id'] = item[1]
        entity['date'] = item[2]
        entity['description'] = item[4]
        entity['url'] = url
        clientds.put(entity)
Datastore has an extensive query interface that can be used from the portal;
some, but not all, features are also available from the Python API. For example,
we can write the following to find the URLs for experiment1.
query = clientds.query(kind=u'book-table')
query.add_filter(u'experiment-name', '=', 'experiment1')
results = list(query.fetch())
urls = [result['url'] for result in results]
3.5 Using OpenStack Cloud Storage Services
OpenStack is used by IBM and Rackspace for their public cloud offerings and also
by many private clouds. We focus our OpenStack examples in this book on the
NSF Jetstream cloud. As we discussed in chapter 2, Jetstream is not intended to
duplicate the current public clouds, but rather to offer services that are tuned to
the specific needs of the science community. One missing component is a standard
NoSQL database service, so we cannot implement the data catalog example that
we presented for the other clouds.
A Python SDK called CloudBridge works with Jetstream and other OpenStack-based
clouds. (CloudBridge also works with Amazon, but it is less comprehensive
than Boto3.) To use CloudBridge, you first create a provider object, identifying
the cloud with which you want to work and supplying your credentials, for example
as follows.
from cloudbridge.cloud.factory import CloudProviderFactory, ProviderList

js_config = {"os_username": "your user name",
             "os_password": "your password",
             "os_auth_url": "https://jblb.jetstream-cloud.org:35357/v3",
             "os_user_domain_name": "tacc",
             "os_tenant_name": "tenant name",
             "os_project_domain_name": "tacc",
             "os_project_name": "tenant name"
             }

js = CloudProviderFactory().create_provider(ProviderList.OPENSTACK,
                                            js_config)
You may now use the provider object reference to first create a bucket (also
called a container) and then upload a binary object to that new bucket, as follows.
# Create new bucket
bucket = js.object_store.create('my_bucket_name')

# Create new object within bucket
buckobj = bucket[0].create_object('stuff')
fo = open('\path to your data\stuff.txt', 'rb')

# Upload file contents to new object
buckobj.upload(fo)
To verify that these actions worked, you can log into the OpenStack portal
and check the current container state, as shown in figure 3.6 on the following page.
The complete code is in notebook 6.
Figure 3.6: View of containers in the OpenStack object store.
3.6 Transferring and Sharing Data with Globus
When using cloud resources in science or engineering, we often need to copy data
between cloud and non-cloud systems. For example, we may need to move genome
sequence files from a sequencing center to the cloud for analysis, and analysis
results to our laboratory. Figure 3.7 shows how Globus services can be used
for such purposes. We see in this figure three different storage systems: one
associated with a sequencing center, a cloud storage service, and one located on a
personal computer. Each runs a lightweight Globus Connect agent that allows it
to participate in Globus transfers.
The Globus Connect agent enables a computer system to interact with the
Globus file transfer and sharing service. With it, the user can easily create a Globus
endpoint on practically any system, from a personal laptop to a cloud or national
supercomputer. Globus Connect comes in two flavors: Globus Connect Personal,
for use by a single user on a personal machine, and Globus Connect Server, for
use on multiuser computing and storage resources.
Globus Connect supports a wide variety of storage systems, including both
various POSIX-compliant storage systems (Linux, Windows, MacOS; Lustre, GPFS,
OrangeFS, etc.) and various specialized systems (HPSS, HDFS, S3, Ceph RadosGW
via the S3 API, Spectra Logic BlackPearl, and Google Drive). It also interfaces to a
variety of different user identity and authentication mechanisms.
Note that all of the public cloud examples presented earlier in this chapter
involved client-server interactions: in each case, we had to run the data upload
to the cloud from the machine that mounts the storage system with the data.
Figure 3.7: Globus transfer and sharing services used to exchange data among sequencing
center, remote storage system (in this case, Google Drive), and a personal computer.
In contrast, Globus allows third-party transfers, meaning that from a computer A
you can drive a transfer from one endpoint B to another endpoint C. This capability
is often important when automating scientific workflows.
The figure depicts a series of five data manipulation operations. (1) A researcher
requests, for example via the Globus web interface, that a set of files be transferred
from a sequencing center to another storage system, in this case the Google Drive
cloud storage system. (2) The transfer then proceeds without further engagement
by the requesting researcher, who can shut down her laptop, go to lunch, or do
whatever else is needed. The Globus cloud service (not shown in the figure) is
responsible for completing the transfer, retrying failed transfers if required, and
notifying the user of failure only if repeated retries are not successful. The user
requires only a web browser to access the service, can transfer data to and from
any storage system that runs the Globus Connect software, and can authenticate
using an institutional credential. Steps 3–5 are concerned with data sharing, which
we discuss in section 3.6.2 on page 54.
3.6.1 Transferring Data with Globus
Figure 3.8 shows the Globus web interface being used to transfer a file. Cloud
services are being used here in two ways: Data are being transferred from cloud
storage, in this case Amazon S3; and the Globus service is cloud-hosted software
as a service—running, as we discuss in chapter 14, on the Amazon cloud.
Figure 3.8: Globus transfer web interface. We have selected a single file on the Globus
endpoint Globus Vault to transfer to the endpoint My laboratory workstation.
Globus also provides a REST API and a Python SDK for that API, allowing
you to drive transfers entirely from Python programs. We use the code in figure 3.9
on page 55 to illustrate how the Python SDK can be used to perform and then
monitor a transfer. The first lines of code, labeled (a), create a transfer client
instance. This handles connection management, security, and other administrative
issues. The code then (b) specifies the identifiers for the source and destination
endpoints. Each Globus endpoint and user is named by a universally unique
identifier (UUID). An endpoint's identifier can be determined via the Globus web
client or programmatically, via an endpoint search API. For example, with the
Python SDK you can write:
tc = globus_sdk.TransferClient(...)
for ep in tc.endpoint_search('String to search for'):
    print(ep['display_name'])
In figure 3.9, we hard-code the identifiers of two endpoints that the Globus
team operates for use in tutorials. The code also specifies (c) the source and
destination paths for the transfer.
Next, the code (d) ensures that the endpoints are activated. In order for the
transfer service to perform operations on endpoint file systems, it must have a
credential to authenticate to the endpoint as a specific local user. The process
of providing such a credential to the service is called endpoint activation [22].
The endpoint_autoactivate function checks whether the endpoint is activated
for the calling user, or can be automatically activated using a cached credential
that will not expire for at least a specified period of time. Otherwise it returns a
failure condition, in which case the user can use the Globus web interface to
authenticate and thus provide a credential that Globus can use for some time
period. We show code that handles that case for the destination endpoint.
In (e)–(g), we assemble and submit the transfer request, providing the endpoint
identifiers, the source and destination paths, and (since we want to transfer a
directory) the recursive flag. In (h), we check for task status. This blocking call
returns after a specified timeout or when the task terminates, whichever is sooner.
In this case, we choose to (i) terminate the task if it has not completed by the
timeout; we could instead repeat the wait.
More examples of the Globus Python SDK are in notebook 8. Globus also
provides a command line interface (CLI), implemented via the Python SDK, that
can be used to perform the operations just described. For example, the following
command transfers a directory from one endpoint to another. More details on how
to use the CLI are available online at docs.globus.org.
globus transfer --recursive \
"ddb59aef-6d04-11e5-ba46-22000b92c6ec":shared_dir \
"ddb59af0-6d04-11e5-ba46-22000b92c6ec":destination_directory
3.6.2 Sharing Data with Globus
Globus also makes it easy to share data with colleagues, as shown in figure 3.7
on page 52, steps 3–5. A shared endpoint is a dynamically created construct
that enables a folder on an existing endpoint to be shared with others. To use this
feature, you first create a shared endpoint, designating an existing endpoint and
folder, and then grant read and/or write permissions on that shared endpoint to
the Globus user(s) and/or group(s) that you want to be able to access it. Shared
endpoints can be created and managed both via the Globus web interface (see
figure 3.10 on page 56) and programmatically, as we show in chapter 11.
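As a hedged preview of the programmatic route, the following sketch uses the
transfer client from figure 3.9 to create a shared endpoint and grant a collaborator
read access; the host path and the collaborator identity UUID are hypothetical
placeholders, and the call names should be checked against the SDK documentation.

# Create a shared endpoint rooted at a folder on an existing endpoint
shared_ep = tc.create_shared_endpoint({
    "DATA_TYPE": "shared_endpoint",
    "host_endpoint": dest_endpoint_id,
    "host_path": "/~/shared_dir/",
    "display_name": "My shared data"})

# Grant read access to a collaborator's Globus identity (placeholder UUID)
tc.add_endpoint_acl_rule(shared_ep["id"], {
    "DATA_TYPE": "access",
    "principal_type": "identity",
    "principal": "COLLABORATOR-IDENTITY-UUID",
    "path": "/",
    "permissions": "r"})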
# (a) Prepare transfer client
import globus_sdk
tc = globus_sdk.TransferClient()

# (b) Define the source and destination endpoints for the transfer
source_endpoint_id = 'ddb59aef-6d04-11e5-ba46-22000b92c6ec'
dest_endpoint_id = 'ddb59af0-6d04-11e5-ba46-22000b92c6ec'

# (c) Define the source and destination paths for the transfer
source_path = '/share/godata/'
dest_path = '/~/'

# (d) Ensure endpoints are activated
tc.endpoint_autoactivate(source_endpoint_id, if_expires_in=3600)
r = tc.endpoint_autoactivate(dest_endpoint_id, if_expires_in=3600)
while r['code'] == 'AutoActivationFailed':
    print('To activate endpoint, open URL in browser:')
    print('https://www.globus.org/app/endpoints/%s/activate'
          % dest_endpoint_id)
    # For Python 2.X, use raw_input() instead
    input('Press ENTER after activating the endpoint:')
    r = tc.endpoint_autoactivate(dest_endpoint_id, if_expires_in=3600)

# (e) Start transfer set up
tdata = globus_sdk.TransferData(tc, source_endpoint_id,
                                dest_endpoint_id,
                                label='My test transfer')

# (f) Specify a recursive transfer of directory contents
tdata.add_item(source_path, dest_path, recursive=True)

# (g) Submit transfer request
r = tc.submit_transfer(tdata)
print('Task ID:', r['task_id'])

# (h) Wait for transfer to complete, with timeout
done = tc.task_wait(r['task_id'], timeout=1000)

# (i) Check for success; cancel if not completed by timeout
if done:
    print('Task completed')
else:
    tc.cancel_task(r['task_id'])
    print('Task did not complete in time')
Figure 3.9: Using the Globus Python SDK to initiate and monitor a transfer request.
Figure 3.10: Globus web interface for creating a shared endpoint.
3.7 Summary
We have introduced in this chapter fundamental methods for interacting with cloud
storage services. We focused, in particular, on blob services and NoSQL table
services. As we noted in chapter 2, these are far from being the only cloud storage
offerings. Indeed, as we show in the next chapter, POSIX file storage systems
attached to virtual machines are particularly important for high-performance
applications. So too are data analytic warehouses, which we discuss in part III.
We also delved more deeply in this chapter into cloud access methods first
introduced in section 1.4 on page 8. We showed how cloud provider portals can be
used for interactive access to resources and services, and also how Python SDKs
can be used to script data management and analysis workflows. We used them,
in particular, to script the task of uploading a set of data objects and building a
searchable NoSQL table of metadata where each row of the table corresponds to
one of the uploaded blobs of data.
As you studied the Python code that we provided for Amazon, Azure, and
Google, you may have been frustrated that the programs for the different versions
are almost but not quite the same. Why not have a single API and corresponding
SDK for all clouds? As we have said, various attempts to create such a uniform
SDK are under way, but these attempts can cover only the common intersection
of each cloud's capabilities. Each cloud service grew out of a different culture in
a different company, and so they are not identical. The resulting creative ferment has
contributed to the explosion of tools and concepts that we describe in this book.
3.8 Resources
We provide all examples in this chapter as Jupyter notebooks, as described in
chapter 17. You first need to install the SDKs for each cloud. The SDKs and links
to documentation are here:
Amazon Boto3 aws.amazon.com/sdk-for-python/
Azure azure.microsoft.com/en-us/develop/python/
Google Cloud cloud.google.com/sdk/
OpenStack CloudBridge cloudbridge.readthedocs.io/en/latest/
Globus Python SDK github.com/globus/globus-sdk-python
Globus CLI github.com/globus/globus-cli
You also need an account on each cloud. Chapter 1 provides links to the portals
where you can obtain a trial account.