Chapter 11
The Globus Research Data
Management Platform
“Give me where to stand, and I will move the earth.”
—Archimedes
We have seen how powerful cloud-based data storage and analysis services can simplify working with large data. But not all science and engineering data live in the cloud. Research is highly collaborative and distributed, and frequently requires specialized resources: data stores, supercomputers, instruments. Thus data are created, consumed, and stored in a variety of locations, including specialized scientific laboratories, national facilities, and institutional computer centers. Data movement and sharing and authentication and authorization are perennial challenges that can impose considerable friction on research and collaboration.
We describe in this chapter a set of platform services that address these challenges. The Globus cloud service provides data movement, data sharing, and credential and identity management capabilities. We described briefly in section 3.6 on page 51 how these services can be accessed as software as a service, via web interfaces. Here, we introduce more details on these services and describe the Python SDKs that permit their use from within applications. We focus in particular on how the Globus Auth service makes it straightforward to build science services that can accept identities from different identity providers, use standard protocols for authentication and authorization, and thus integrate naturally into a global ecosystem of service providers and consumers. As a use case, we show how these capabilities can be used to build research data portals.
11.1 Challenges and Opportunities of Distributed Data
Data movement is central to many research activities, including analysis, collaboration, publication, and data preservation. However, given its importance and ubiquity, this task remains surprisingly challenging in practice: storage systems have different security configurations, achieving good transfer performance is nontrivial, and as data sizes increase the likelihood of errors increases. Scientists and engineers frequently struggle with such mundane tasks as authenticating and authorizing user access to storage systems, establishing high-speed data connections, and recovering from faults while a transfer proceeds.
Authentication and authorization are similarly central to science and engineering, and for related reasons. Researchers often find themselves needing to navigate a complex world of different identities, authentication methods, and credentials as they access resources in different locations. For example, say that you need to transfer data repeatedly from two sites A and B to a storage system at your home institution H. You have accounts at A and H, with identities UA and UH; site B will accept your home institution identity, thanks to the InCommon identity management federation [68]. You will commonly need to authenticate once for each transfer: a painful process and one that prevents scripting. You would prefer to instead authenticate as UA and UH just once, and then perform subsequent transfers from A and B to H without further authentications.
The data sharing problem sits at the intersection of these two challenges. Say
you want to grant a collaborator access to data at your home institution. Setting
up an account just for that purpose is typically a time-consuming process, if it is
possible at all. And it forces your collaborator to deal with yet another username
and password. You need to be able to enable access without a local account.
Globus services address these and other related challenges that arise when our work requires the integration of resources across different locations. As well as an easy-to-use, web-browser based interface, Globus provides REST APIs and Python SDKs to enable the integration of Globus solutions into applications in ways that reduce development costs and increase security, performance, and reliability.
11.2 The Globus Platform
Globus was first introduced in 2010 as a software-as-a-service solution to the problem of moving data between pairs of storage systems or endpoints [62, 123]. (An endpoint is a storage system that has been connected to the Globus cloud services by using software called Globus Connect.) The Amazon-hosted Globus
software handles the complexity involved in transfers, such as authenticating and authorizing user access to endpoints, creating a high-speed data connection between endpoints, and recovering from faults while a transfer proceeds. Importantly, it implements a third-party transfer model in which no data are transferred via the Globus service: instead, data are transferred directly between endpoint pairs by using a protocol called GridFTP that provides specialized support for high performance and reliability [61]. Globus can also perform rsync-like updates when doing repeated transfers, allowing transfer of only new or modified files from the source to the destination. Direct HTTPS transfers to and from endpoints are also supported, allowing web browser access to data stored on Globus endpoints.
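To make the rsync-like behavior concrete, here is a minimal sketch of a repeated transfer using the Globus Python SDK; it assumes an already authenticated TransferClient, and the endpoint UUIDs and paths shown are placeholders rather than real values:

from globus_sdk import TransferClient, TransferData

tc = TransferClient()  # Assumes credentials are already configured
SRC = 'ddb59aef-0000-0000-0000-000000000000'  # Placeholder source UUID
DST = 'ddb59af0-0000-0000-0000-000000000000'  # Placeholder dest UUID

# sync_level='checksum' requests rsync-like behavior: only files that
# are new or whose checksums differ are copied to the destination
tdata = TransferData(tc, SRC, DST, label='Nightly sync',
                     sync_level='checksum')
tdata.add_item('/data/', '/backup/data/', recursive=True)
task = tc.submit_transfer(tdata)
print('Submitted transfer task', task['task_id'])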
The Globus team has subsequently built on this initial Globus Transfer service by adding Auth for identity and credential management, Groups for group management, Sharing for data sharing, and Publication and Data Search for data management. Importantly, the Globus team also created REST APIs and Python SDKs to allow these capabilities to be used programmatically, from within applications. It is these platform capabilities that we describe in this chapter, building on the introductory material in section 3.6.1 on page 52, where we showed how to use the Globus Python SDK to initiate, monitor, and control data transfers. We first provide additional details on the programmatic use of Globus Sharing capabilities, then introduce the use of Globus Auth, and finally present illustrative examples of the use of these capabilities.
11.2.1 Globus Transfer and Sharing
We introduced Globus Sharing capabilities in section 3.6.2 on page 54. Here we
show how to use the Python SDK to manage sharing programmatically. Recall
that Globus Sharing allows a user to make a specified folder on a Globus endpoint
accessible to other Globus users. Figure 11.1 shows the idea. Bob has enabled
sharing of folder ~/shared_dir on Regular endpoint by creating Shared endpoint,
and then granting Jane access to that shared endpoint. Jane can then use Globus
Transfer to read and/or write files in the shared folder, depending on what rights
she has been granted.
As is the case with the Globus Transfer service presented in chapter 3, all data
sharing capabilities offered by the Globus web interface are also accessible via
the Python SDK. The code in figure 11.2 on page 229 illustrates their use. We
explain each of the two functions in the figure in turn. We use both functions in
section 11.5.3 on page 247 as part of a research data portal implementation.
Figure 11.1: The Globus shared endpoint construct allows an authorized administrator of a Globus endpoint (say Bob) to create a shared endpoint granting access to a folder within that endpoint, to which they can then authorize access by others (say Jane).
First, the create_share function: We assume that we have previously initiated a transfer object, tc, in the manner illustrated in the first lines of figure 3.9 on page 55, and that this object is passed to the function, along with the endpoint identifier and path for the folder that is to be shared (“Regular endpoint” and ~/shared_dir, respectively, in figure 11.1). The function uses the Globus SDK function operation_mkdir to request creation of the specified folder on the specified endpoint. It then creates a parameter structure, calls the Globus SDK function create_shared_endpoint to create a shared endpoint for the new directory, and finally returns the identifier for the new endpoint.
Second, the grant_access function: This function requires both tc and a Globus Auth client reference, ac (we introduce Auth in the next section); identifiers for the shared endpoint (share_id) for which sharing is to be enabled and the user (user_id: a UUID, as with endpoint identifiers) to whom access is to be granted; the type of access to be granted (atype: can be 'r', 'w', or 'rw'); and a message to be emailed to the user upon completion. The function uses the Globus Auth SDK function get_identities to determine the identities that are associated with the user for whom sharing is to be enabled, and extracts from this list an email address. It then uses the Globus Transfer SDK function add_endpoint_acl_rule to add an access control rule to the shared endpoint, granting the specified access type to the specified user.
11.2.2 The rule_data Structure
Our example program passes a rule_data structure to the add_endpoint_acl_rule function. The various elements specify, among other things:

'principal_type': the type of principal to which the rule applies;
#
# Create a shared endpoint on specified 'endpoint' and 'folder';
# return the endpoint id for the new endpoint.
# Supplied 'tc' is a Globus Transfer client reference.
#
def create_share(tc, endpoint, folder):
    # Create directory to be shared
    tc.operation_mkdir(endpoint, path=folder)
    # Create the shared endpoint on the specified folder
    shared_ep_data = {
        'DATA_TYPE': 'shared_endpoint',
        'host_endpoint': endpoint,
        'host_path': folder,
        'display_name': 'Share ' + folder,
        'description': 'New shared endpoint'
    }
    r = tc.create_shared_endpoint(shared_ep_data)
    # Return identifier of the newly created shared endpoint
    return r['id']
#
# Grant 'user_id' access 'atype' on 'share_id'; email 'message'.
# Supplied 'tc' and 'ac' are Globus Transfer and Auth client refs.
#
def grant_access(tc, ac, share_id, user_id, atype, message):
    # (1) Determine an email address for the user
    r = ac.get_identities(ids=user_id)
    email = r['identities'][0]['email']
    rule_data = {
        'DATA_TYPE': 'access',
        'principal_type': 'identity',  # To whom is access granted?
        'principal': user_id,          # To an individual user
        'path': '/',                   # Grant access to this path
        'permissions': atype,          # Grant specified access
        'notify_email': email,         # Email invite to this address
        'notify_message': message      # Include this message in email
    }
    r = tc.add_endpoint_acl_rule(share_id, rule_data)
    return r
Figure 11.2: Functions that use the Globus Python SDK to create a shared endpoint and grant a user access to it.
'principal': as the 'principal_type' is 'identity', this is the user id with whom sharing is to be enabled;

'permissions': the type of access being granted: in this case read-only ('r'), but could also be read and write ('rw');

'notify_email': an email address to which an invitation to access the shared endpoint should be sent; and

'notify_message': a message to include in the invitation email.

The 'principal_type' element can also take the value 'group', in which case the 'principal' element must be a group id. Alternatively, it can take the values 'all_authenticated_users' or 'anonymous', in which cases the 'principal' element must be an empty string.
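For example, a minimal sketch of a rule_data structure that grants a group read access (the group UUID shown is a placeholder, not a real group):

rule_data = {
    'DATA_TYPE': 'access',
    'principal_type': 'group',  # Grant access to a whole group
    'principal': 'fdb38a24-0000-0000-0000-000000000000',  # Placeholder
    'path': '/',
    'permissions': 'r'          # Read-only access for group members
}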
11.3 Identity and Credential Management
We noted above the challenges that users face when authenticating to different sites and services in the course of their work. Similarly, service developers need mechanisms for establishing the identity of a requesting user and for determining what that user is authorized to do. Figure 11.3 illustrates some of the concepts and issues involved. An end user wants to run an application that makes requests to remote services on her behalf. Those remote services may themselves want to make further calls to other dependent services. For consistency with commonly used terminology, we refer to the user as the resource owner, the application as the client, and each remote and dependent service as a resource server.
Two interrelated problems frequently arise in such contexts. The first concerns the use of alternative identity providers. A resource server frequently wants to establish the identity of the user (i.e., resource owner) who issued an incoming request, often to determine whether to grant access and sometimes simply to log who is using their service. In the past, developers of resource servers often implemented their own username-password authentication systems, but such approaches are inconvenient and insecure. Instead, we want to allow a resource server to accept credentials from other identity providers: for example, that associated with a user’s home institution. Furthermore, different resource servers may require different credentials. For example, to transfer a file from the University of Chicago to Lawrence Berkeley National Laboratory, I must authenticate with both my Chicago and my Berkeley identities to establish my credentials to access file systems at Chicago and Berkeley, respectively.
Figure 11.3: A schematic of the entities and interactions that engage in distributed
resource accesses, using the terminology of OAuth2.
The second problem concerns (restricted) delegation. A resource server may need to perform actions on behalf of a requesting user. For example, it may need to transfer files or perform computations. It may then need credentials that allow it to establish its authority to perform such actions. (This requirement is especially important if the resource server needs to operate in an unattended manner, for example so that it can continue file transfers or computations while the user eats lunch.) However, users may not want to grant unlimited rights to a remote service to perform actions on their behalf, due to the potential for harm if a credential is compromised. Thus, the ability to restrict the rights that are delegated is important. For example, you might be ok with a service reading, but not writing, files on a certain server. And you certainly do not want a compromised service to be able to use other services that you have not authorized.
As we describe in the following, the cloud-hosted Globus Auth service addresses
these and other related concerns.
11.3.1 Globus Auth Is an Authorization Service
Globus Auth leverages two widely used web standards, the OAuth 2.0 Authorization Framework (OAuth2) [149] and OpenID Connect Core 1.0 (OIDC) [230], to implement solutions to these problems. OAuth2 is a widely used protocol that applications can use to provide client applications with secure delegated access: the delegation that we spoke about above. It works over HTTP and uses access tokens to authorize servers, applications, and other entities. OIDC is a simple identity layer on top of the OAuth protocol.
The cloud-hosted Globus Auth service is what OAuth2 calls an authorization server. As such, it can issue access tokens to a client after successfully authenticating the resource owner and obtaining authorization from that resource owner for the client to access resources provided by a resource server. (This
Figure 11.4: A Globus Auth consent request, in this case for the Globus web application.
authorization process typically involves a request for consent, such as those shown in figure 11.4.) The resource owner in this scenario is typically an end user, who authenticates to a Globus Auth-managed Globus account using an identity issued by one of an extensible set of (federated) identity providers supported by Globus Auth. A resource owner could also be a robot, agent, or service acting on its own behalf, rather than on behalf of a user; the client may be either an application (e.g., web, mobile, desktop, command line) or another service acting as a client, as we explain in subsequent discussion.
Having obtained an access token, the client can then present that token as part of a request to the resource server for which the token applies, to demonstrate that it is authorized to make the request. The token is included in the request via the HTTPS Authorization header.
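As a minimal sketch of this convention (the url and access_token variables are illustrative assumptions, standing in for a real resource server URL and a token previously issued by Globus Auth):

import requests

# Present an access token to a resource server via the HTTPS
# Authorization header, using the standard Bearer scheme
response = requests.get(url,
                        headers={'Authorization': 'Bearer ' + access_token})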
Access tokens are thus the key to OAuth2 and Globus Auth. An access token represents an authorization issued by a resource owner to a client, authorizing the client to request access to a specified resource server on the resource owner’s behalf. As we describe later, the resource server can then ask the Globus Auth authorization service for details on what rights have been granted: a process that is referred to as “introspection.” For example, if the resource owner in figure 11.3 wants to allow a client (e.g., a web portal) to access a remote service but only for purposes of reading during the next hour, introspection of the associated token can reveal those restrictions. Globus Auth thus addresses the problems of (restricted) delegation. It also supports the linking of multiple identities, as we discuss below, to address the problem of alternative identity providers.
A resource server receiving a token from a client can thus determine that the
resource owner has authorized it to perform certain actions on the resource owner’s
behalf. What if the resource server then wants to reach out to other resource servers,
for example to Globus Transfer to request a data transfer? A problem arises: the
resource server has a token that authorizes it to perform actions itself, but it has
no token that it can present to the Globus Transfer service to demonstrate that
the resource owner (the end user in our example) has authorized transfers.
This is where dependent services come in. When a resource server R is registered with Globus Auth, it can specify services that it needs to access to perform its functions: its dependent services, say S and T. A request from R to Globus Auth for authorization then causes Globus Auth to request consent from the user not only for R but also for the dependent services S and T. We saw an example of this scenario in figure 11.4: the Globus web application has registered Globus Transfer and Globus Groups as dependent services, and thus you see the user being asked to consent to those uses. Once consent has been granted, the resource server can request additional dependent access tokens, as required, that it can then include in requests to other services that it makes on the authorizing resource owner’s behalf.
OAuth2 and Globus Auth incorporate various complexities and subtleties, but the basic steps are simple. A user accesses an application; Globus Auth authenticates and requests consents from the end user; Globus Auth provides access tokens to the application; the application uses access tokens to access other services; a service receiving an access token can validate it and use it to request dependent access tokens to access other services. Importantly, different actors can play different roles at different times: your web browser can be a client to a web service, that itself can act as a client to other services, and so on.
11.3.2 A Typical Globus Auth Workflow
We use figure 11.5 on page 235 to illustrate how Globus Auth works. The figure looks complicated, but please bear with us: the underlying concepts are straightforward. We describe each of the 12 steps shown in the figure in turn.
1. The end user accesses the application to make a request to a remote service. The application might be a Web client or, alternatively, an application running on the user’s desktop or some other computer.
2. The application contacts Globus Auth to request authorization for the use of a set of scopes. A scope represents a set of capabilities provided by a resource server for which an access token is to be granted. In this case, the application requests two scopes: one for access to login information and one for HTTPS/REST API access.
3. Globus Auth arranges for authentication of the user, using an identity provider that is mutually acceptable to the user and the application. Because the user only authenticates with the authorization server, the user’s credentials are never shared with the client or with Globus Auth.
4. Globus Auth returns an authorization code to the user.
5. The user requests access tokens from Globus Auth, passing the previously acquired authorization code to establish their right to obtain these tokens.
6. Access tokens are returned, one per requested scope. The issuance of multiple tokens enhances security by limiting the impact of a compromise.
7. The client can now use the access token in an HTTPS/REST request to a resource server, by setting an HTTPS Authorization: Bearer header with the appropriate token. (For concreteness, the remote service is here shown as Globus Transfer, but it could be anything.)
8. Using a recent OAuth2 extension [226], the resource server can contact Globus Auth to “introspect” the token and thus obtain answers to questions such as “is the token valid?,” “which resource owner is it for?,” “what client is making the request?,” and “which scope is it for?”
9. Globus Auth responds to the introspection request. The resource server can use the provided information to make an authorization decision as to how it responds to the client request.
10. The resource server can also use its access token to request dependent access tokens for any dependent services. For example, Globus Transfer can retrieve an access token for the Globus Groups resource server, so that it can check if the requesting user is a member of a particular group before taking some action like allowing access to a shared endpoint.
11. Globus Auth returns requested dependent tokens.
Figure 11.5: Entities and interactions involved in Globus Auth-mediated distributed
resource requests. Details are provided in the text.
12. The resource server uses a newly issued dependent access token in an HTTPS/REST request to the second resource server.
There are other OAuth2 and Globus Auth details that are not covered here: for example, refresh tokens (because an access token’s lifetime may be less than that of an application) and the somewhat different methods used in the case of a long-lived application rather than a web browser. Also, an alternative protocol is used for rich clients such as the Javascript-based Globus Transfer client that avoids the need for steps 4 and 5; a variant of this flow supports mobile, command line, and desktop applications: “native apps.” But we have covered the essentials.
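As an illustration of the native app variant, here is a minimal sketch of the flow using the Globus Python SDK; the client ID shown is a placeholder for one obtained by registering an application with Globus Auth:

import globus_sdk

CLIENT_ID = '00000000-0000-0000-0000-000000000000'  # Placeholder

client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
client.oauth2_start_flow()

# Steps 1-3: the user authenticates in a browser and obtains a code
print('Please log in at:', client.oauth2_get_authorize_url())
auth_code = input('Enter the authorization code: ').strip()

# Steps 5-6: exchange the authorization code for access tokens
tokens = client.oauth2_exchange_code_for_tokens(auth_code)
transfer_token = tokens.by_resource_server[
    'transfer.api.globus.org']['access_token']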
11.3.3 Globus Auth Identities
Globus Auth maintains information about the identities that its users may use to authenticate. A Globus Auth identity has a unique, case-insensitive username (for example, user@example.org), issued by an identity provider (e.g., a university, research laboratory, or Google), for which a user or client can prove possession via an authentication process (e.g., presenting a password to the identity provider). Globus Auth manages the use of identities (e.g., to login to clients and services), their properties (e.g., contact information), and relationships among identities (e.g., allowing login to an identity by using another linked, “federated” identity).
Globus Auth neither defines its own identity usernames nor verifies authentication (e.g., via passwords) with identities. Rather, it acts as an intermediary between external identity providers, on the one hand, and clients and services that want to leverage identities issued by those providers, on the other. Globus Auth assigns each identity that it encounters an identifier: a UUID that is guaranteed to be unique among all Globus Auth identities, and that will never be reused. This ID is what resource servers and clients should use as the canonical identifier for a Globus Auth identity. Associated with this ID are an identity provider, a username given to the identity by the provider, and other provider-supplied information such as display name and contact email address.
An example Globus Auth identity. The following is an example of the information that may be associated with a Globus Auth identity:

username          : rocky@wossamotta.edu
id                : de305d54-75b4-431b-adb2-eb6b9e546014
identity_provider : wossamotta.edu
display_name      : Rocket J. Squirrel
email             : rocky@wossamotta.edu
Globus supports more than 100 identity providers, and more are being added all the time. Examples include the many US and international universities and other institutions that support InCommon; various identity providers that support the OpenID Connect protocol; Google; and the Open Researcher and Contributor ID (ORCID). The process of integrating a new identity provider is beyond the scope of this book, but it is a straightforward process. See the Globus documentation for more information.
11.3.4 Globus Accounts
An identity can be used with Globus Auth to create a Globus account. A Globus account has a primary identity, but can also have any number of other identities linked to it as well. Thus, for example, Mr. Squirrel may create a Globus account with the identity above and then link to that account a Google identity, his ORCID, and an identity provided by a scientific facility to which he has access.

A Globus account is not an identity itself. It does not have its own name or identifier. Rather, a Globus account is identified by its primary identity. Similarly, profile information and other metadata are tied to identities, not to accounts. A Globus account is simply a set of identities comprising the primary identity and all identities linked to that primary identity.
11.3.5 Using Globus Auth Identities
Clients and resource servers should always use the Globus Auth-provided identity ID when referring to an identity, for example in access control lists, and when referring to identities in a REST API. Clients and resource servers can use the Globus Auth REST API to map any identity username to its (current) identity ID, and request information about an identity ID (e.g., username, display_name, provider, email), for example as follows:
import globus_sdk

# Obtain reference to Globus Auth client
ac = globus_sdk.AuthClient()

# Get identities associated with username 'globus@globus.org';
# the response contains zero or more identity records
r = ac.get_identities(usernames='globus@globus.org')
ids = [i['id'] for i in r['identities']]

# Get identities associated with a UUID
r = ac.get_identities(ids=ids)
The last command returns a JSON document containing a list of identities, such as the following. (This example document contains just one identity.)

{'identities':
  [{'email': None,
    'id': '46bd0f56-e24f-11e5-a510-131bef46955c',
    'identity_provider': '7daddf46-70c5-45ee-9f0f-7244fe7c8707',
    'name': None,
    'organization': None,
    'status': 'unused',
    'username': 'globus@globus.org'}
  ]
}
11.3.6 Use of Globus Auth by Resource Servers
Having introduced various details of the Globus Auth server, Globus Auth identities, and Globus accounts, we can now turn to the practical question of what we can do with these mechanisms. In particular, we describe how resource servers can use Globus Auth as an authorization server and thus both support sophisticated OAuth2 and OpenID Connect functionality, and leverage other resource servers that use Globus Auth.
Let us consider, for example, a research data service that accepts user requests to analyze genomic sequence data. (We describe an example of such a system, Globus Genomics, in section 14.4 on page 303.) This service is basically a data and code repository with a REST API, which other applications can leverage to access this repository programmatically.
This service is a resource server in the Globus Auth context. It needs to be able to authenticate users, validate user requests, and make requests to other services (e.g., to cloud or institutional storage to retrieve sequence data and store results, and to computing facilities to perform computations) on a user’s behalf. Globus Auth allows us to program each of these capabilities via manipulation of identities, access tokens, and OAuth2 protocol messages.

Assume that some client to this service has already followed steps 1–7 in figure 11.5 on page 235 and thus possesses the necessary access tokens. (The “client” may be a web client to the data server, or some other web, mobile, desktop, or command line application.) Interactions may then proceed as follows.
1. The client makes an HTTPS request to the resource server (the research data service proper: in the following we refer to it as the “data service”) with an Authorization: Bearer header containing an access token. (Step 8 in figure 11.5.)
2. The data service calls the function oauth2_token_introspect provided by the Globus Auth SDK, authorized by the data service’s client identifier and client secret (see below), to validate the request access token, and obtain additional information related to that token (scopes, effective identity, identities set, etc.). If the token is not valid, or is not intended for use with this resource server, Globus Auth returns an error.
3. The data service verifies that the request from its client conforms to the scopes associated with the request access token.

4. The data service verifies the identity of the resource owner (typically an end user) on whose behalf the client is acting. The data service may use this identity as its local account identifier for this user.
5. The data service uses the set of identities associated with the account referred to by the request access token to determine what the request is allowed to do. For example, if the request is to access a resource that is shared with particular identities, the data service should compare all of the account’s identities (primary and linked identity ids) with the resource access control permissions to determine if the request should be granted.
6. The data service may need to act as a client to other (dependent) resource servers, as discussed above. In that case, the data service uses the Globus SDK oauth2_get_dependent_tokens function to get dependent access tokens for use with downstream resource servers, based on the request access token that it received from the client.
7. The data service uses a dependent access token to make a request to a dependent resource server.
8. The data service responds to its client with an appropriate response.
A note regarding the client identifier and client secret mentioned in Step 2: Each client and resource server must register with Globus Auth and obtain a client id and client secret, which they can subsequently use with Globus Auth to prove their identity in the various OAuth2 messages: for example, when swapping an authorization code for an access token, calling token introspect, calling dependent token grant, and using a refresh token to obtain a new access token.
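As a minimal sketch of how a service might hold these credentials, here is one way to construct an Auth client that authenticates as the service itself; the id and secret shown are placeholders, and the helper name load_auth_client simply mirrors the one referenced later in figure 11.6:

import globus_sdk

SERVICE_CLIENT_ID = '11111111-2222-3333-4444-555555555555'  # Placeholder
SERVICE_CLIENT_SECRET = 'REPLACE-WITH-REGISTERED-SECRET'    # Placeholder

def load_auth_client():
    # An Auth client that authenticates as the service itself, as
    # needed for token introspection and dependent token requests
    return globus_sdk.ConfidentialAppAuthClient(
        SERVICE_CLIENT_ID, SERVICE_CLIENT_SECRET)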
11.3.7 Other Globus Capabilities
Globus also supports a growing set of other capabilities beyond those described here. For example, table 11.1 lists additional functions supported by the Globus Transfer Python SDK.
Table 11.1: Some of the close to 50 functions supported by the Globus Transfer Python SDK. (Others mostly implement endpoint administration functions.)

Type                    Function                    Description
Endpoint information    endpoint_search             Search on name, keywords, etc.
                        get_endpoint                Get endpoint information
                        my_shared_endpoint_list     Get endpoints that I manage
File system operations  operation_mkdir             Create a folder on endpoint
                        operation_ls                List contents of endpoint
                        operation_rename            Rename folder or directory
Task management         submit_transfer             Submit a transfer request
                        submit_delete               Submit a delete request
                        cancel_task                 Cancel submitted request
                        task_wait                   Wait for task to complete
Task information        task_list                   Get information about tasks
                        get_task                    Get information about a task
                        task_event_list             Get event info for a task
                        task_successful_transfers   Get successful transfers for task
                        task_pause_info             Get info on why task paused
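As a brief illustration of one of these functions, here is a sketch of endpoint_search, assuming tc is an authenticated TransferClient; the search string is arbitrary:

# Find up to five endpoints whose metadata match 'Tutorial'
for ep in tc.endpoint_search('Tutorial', num_results=5):
    print(ep['id'], ep['display_name'])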
Other Globus services provide other capabilities. Globus Publication, for example, provides user-configurable, cloud-hosted data publication pipelines that can be used to automate the workflows used to make data accessible to others, workflows that will typically include steps such as providing and collecting metadata, moving data to long-term storage, assigning persistent identifiers (e.g., a Digital Object Identifier or DOI [218]), and verifying data correctness [89]. Globus Data Search can be used to search for data on endpoints to which a user has access. See the Globus documentation at docs.globus.org for information on these services.
Data delivery at the Advanced Photon Source. The Advanced Photon Source (APS) at Argonne National Laboratory is typical of many experimental facilities worldwide in that it serves large numbers (thousands) of researchers every year, most of whom visit just for a few days to collect data and then return to their home institution. In the past, data produced during an experiment was invariably carried back on physical media. However, as data sizes have grown and experiments have become more collaborative, that approach has become less effective. Data transfer via network is preferred; the challenge is to integrate data transfer into the experimental workflow of the facility in a way that is fully automated, secure, reliable, and scalable to thousands of users and datasets.

Francesco De Carlo uses Globus APIs to do just that at the APS. His DMagic system [107] implements a variant of the program in figure 11.9 that integrates with APS administrative and facility systems to deliver data to experimental users. When an experiment is approved at the APS, a set of associated researchers are registered in the APS administrative database as approved participants. DMagic leverages this information as follows. Before the experiment begins, it creates a shared endpoint on a large storage system maintained by Argonne's computing facility. DMagic then retrieves from the APS scheduling system the list of approved users for the experiment, and adds permissions for those users to the shared endpoint. It then monitors the experiment data directory at the APS facility and copies new files automatically to that shared endpoint, from which they can be retrieved by any approved user.
11.4 Building a Remotely Accessible Service
Say you want to build a service that can be invoked remotely via a REST API call. Building and invoking a service in this way is straightforward in principle: many libraries exist for defining, implementing, and using REST APIs. Security is perhaps the one major source of complexity, and here Globus Auth can help. The basic issue is that when a remote user makes a request to the service, the service author needs to be able to determine who is making the request and what rights the requestor is passing with the request. For example, the service may
want to know if it is permitted to make Globus transfer requests on behalf of the requesting user. It may also want to know the identity of the requestor, so that the requestor can be given access to a shared endpoint created by the service.
To illustrate how Globus Auth can be used to address these concerns, we present a simple Graph service that accepts requests to generate graphs of temperature data. In response to a request, it retrieves data from a web server, generates graphs, and uses Globus Transfer to transfer the graphs to the requestor. It thus needs to authenticate and authorize the requestor and obtain dependent access tokens for a web server and Globus Transfer. A complete Python implementation of this example service is available at github.com/globus/globus-sample-data-portal, in the folder service. We use extracts (some simplified) from this implementation to illustrate how the Graph service works with Globus Auth.
The relevant authorization code is in figure 11.6 on the next page. The Graph service receives an HTTPS request with a header containing the access token in the form Authorization: Bearer <request-access-token>. It then uses the following code to (1) retrieve the access token and (2) call out to Globus Auth to retrieve information about the token, including its validity, client, scope, and effective identity. The Graph service can then (3–5) verify the token information and (6) authorize the request. (In our example, every request is accepted.)
This sample code has been written so that it only (5) accepts requests from an entity that can supply a PORTAL_CLIENT_ID, a service that we introduce later in the chapter. As we show in the next paragraph, it then requests and obtains dependent access tokens that allow it to transfer data on behalf of that entity. An alternative implementation, to be preferred if we want the Graph service to be more broadly useful, would have it look for the original resource owner’s (end user’s) token and then perform operations on their behalf.
As the Graph service needs to act as a client to the data service on which the datasets are located, it next requests dependent tokens from Globus Auth. This and subsequent code fragments in this section are from the file service/view.py.
# (1) Get the access token from the request
token = get_token(request.headers['Authorization'])

# (2) Introspect token to extract its properties
client = load_auth_client()
token_meta = client.oauth2_token_introspect(token)

# (3) Verify that the token is active
if not token_meta.get('active'):
    raise ForbiddenError()

# (4) Verify that the "audience" for this token is our service
if 'Graph Service' not in token_meta.get('aud', []):
    raise ForbiddenError()

# (5) Verify that identities_set in token includes portal client
if app.config['PORTAL_CLIENT_ID'] != token_meta.get('sub'):
    raise ForbiddenError()

# (6) Token has passed verification: stash in request global object
g.req_token = token
Figure 11.6: Selected and somewhat simplified code from the file service/decorators.py in the Graph service example.
client = load_auth_client()
dependent_tokens = client.oauth2_get_dependent_tokens(token)
Having retrieved these dependent tokens, it extracts from them the two access tokens that allow it to itself act as a client to the Globus Transfer service and to an HTTPS endpoint service from which it will retrieve datasets.
transfer_token = dependent_tokens.by_resource_server[
    'transfer.api.globus.org']['access_token']
http_token = dependent_tokens.by_resource_server[
    'tutorial-https-endpoint.globus.org']['access_token']
The service also extracts from the request details of the datasets to be graphed, and the identity of the requesting user for use when configuring the shared endpoint:

selected_ids = request.form.getlist('datasets')
selected_year = request.form.get('year')
user_identity_id = request.form.get('user_identity_id')
The Graph service next fetches each dataset via an HTTPS request to the data server, using code like the following. The previously obtained http_token provides the credentials required to authenticate to the data server.

response = requests.get(dataset,
                        headers=dict(Authorization='Bearer ' + http_token))
A graph is generated for each dataset. Then, the Globus SDK functions operation_mkdir and add_endpoint_acl_rule are used, as in section 11.2.1 on page 227, to request that Globus Transfer create a new shared endpoint accessible by the user identity that was previously extracted from the request header, user_identity_id. (The transfer_token previously obtained from Globus Auth provides the credentials required to authenticate to Globus Transfer.) Finally, the graph files are transferred to the newly created directory via HTTPS, using the same http_token as previously, and the Graph server sends a response to the requester, specifying the number and location of the graph files.
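As a sketch of that final step, the upload of one graph file might look like the following; https_base is an assumption standing in for the shared endpoint's HTTPS server URL, which the sample portal derives from its configuration:

import requests

# Upload a generated graph to the newly created shared folder via
# HTTPS, authenticating with the dependent token obtained earlier
with open('graph.png', 'rb') as f:
    r = requests.put(https_base + '/graphs/graph.png', data=f,
                     headers={'Authorization': 'Bearer ' + http_token})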
This example shows how Globus Auth allows you to outsource all identity management and authentication functions. Identities can be provided by federated identity providers, such as InCommon and Google. All REST API security functions, including consent and token issuance, validation, and revocation, are provided by Globus Auth. Your service needs only to provide service-specific authorization, which can be performed on the basis of identity or group membership. And because all interactions are compliant with OAuth2 and OIDC standards, any application that speaks these protocols can use your service as it would any other; your service can seamlessly leverage other services; and other services can leverage your service. You can easily build a service to be made available to others as part of the national cyberinfrastructure; equally, you can build a service that dispatches requests to other elements of that cyberinfrastructure.
11.5 The Research Data Portal Design Pattern
To further illustrate the use of Globus platform services in scientific applications and workflows, we describe how they may be used to realize a design pattern that Eli Dart calls the research data portal. In this design pattern, specialized properties of modern research networks are exploited to enable high-speed, secure delivery of data to remote users. In particular, the control logic used to manage data access and delivery is separated from the machinery used to deliver data over high-speed networks. In this way, order-of-magnitude performance improvements can be achieved relative to traditional portal architectures, in which control logic and data servers are co-located behind performance-limiting firewalls and on low-performance web servers.
11.5.1 The Vital Role of Science DMZs and DTNs
A growing number of research universities and laboratories worldwide are connected in a network fabric that links data stores, scientific instruments, and computational facilities at unprecedented speeds: 10 or even 100 gigabits per second (Gb/s). Increasingly, research networks are themselves connected to cloud providers at comparable speeds. Thus, in principle, it should be possible to move data between any element of science and engineering infrastructure with great rapidity.

In practice, real transfers often achieve nothing like these theoretically achievable peak speeds. One common reason for poor performance is firewalls or other bottlenecks in the network connection between the outside world and the device from/to which data are to be transferred: the so-called “last mile” (or, outside the US, the last kilometer). The firewalls are often there for a good reason, such as protecting the sensitive data contained on the administrative computers that are also connected to the global Internet. But they get in the way of high-bandwidth science and engineering traffic. The other common reason for poor performance is using tools not designed for performance, like secure copy (SCP).
Two concepts, the Science DMZ and the Data Transfer Node, are now being widely deployed to overcome this problem. The Science DMZ overcomes the challenges associated with multipurpose enterprise network architectures by placing resources that need high-performance connectivity in a special subnetwork that is close (from a network architecture perspective) to the border router that connects the institution to the high-speed wide area network. Traffic between those resources and the outside world can then bypass internal firewalls.

Note that the point here is not to circumvent security by putting it outside the firewall. Rather, it is about recognizing that there is certain traffic for which firewalls not only slow things down, but are not needed. The Science DMZ uses alternative network security approaches that are appropriate for such traffic. For example, the DTN is not wide open: the Science DMZ router blocks most ports. But the ports necessary for secure, high-performance data transfer are open, and avoid the packet-inspecting firewalls.
A Data Transfer Node (DTN) is a specialized device dedicated to data transfer functions. These devices are typically Linux servers constructed with high quality components, configured for both high-speed wide area data transfer and high-speed access to local storage resources. DTNs run the high-performance Globus Connect data transfer software, introduced in section 3.6 on page 51, to connect their storage to the Globus cloud and thus to the rest of the world. General-purpose computing and business productivity applications, such as email clients and document editors, are not installed; this restriction produces more consistent data transfer behavior and makes security policies easier to enforce.
The Science DMZ design pattern also includes other elements, such as integrated perfSONAR monitoring devices [244] for performance debugging, specialized security configurations, and variants used to integrate supercomputers and other resources. But this brief description covers the essentials. The U.S. Department of Energy's Energy Sciences Network (ESnet) has produced detailed configuration and tuning guides for Science DMZs and DTNs at fasterdata.es.net.
11.5.2 The Research Data Portal Application
Eli Dart coined the term research data portal to indicate a web service designed primarily to serve data to remote users. (The variant that accepts data from remote users, for example for analysis or publication, has similar properties.) A research data portal must be able to authenticate and authorize remote users, allow those users to browse and query a potentially large collection of data, and return selected data (perhaps after subsetting) to remote users. In other words, a research data portal is like a web server, except that the data that it serves may be orders of magnitude larger than typical web pages.
Figure 11.7 shows how research data portals have often been architected in the past. A single data portal server both runs portal logic and serves data from local storage. This architecture is simple but cannot easily achieve high performance. The problem is that the control logic, being concerned with sensitive topics such as authentication and authorization, needs to sit behind the enterprise firewall. But this arrangement means that all data served by the portal also pass through the firewall, which typically means that they are delivered at a small fraction of the theoretical peak performance of the available networks.
As figure 11.8 on the next page shows, Science DMZs and DTNs allow for new architectural approaches that combine high-speed access and secure operations. The basic idea is to separate what we may call the portal's control channel communications (i.e., those concerned with such tasks as user authentication and data search) from its data channel communications (i.e., those concerned with data upload and download). The former can be located on a modestly sized web server computer protected by the institution's firewall, with modest capacity networks, while the latter can be performed via high-speed DTNs and can use specialized protocols such as GridFTP. The research data portal design pattern thus defines distinct roles for the web server, which manages who is allowed to do what, and the Science DMZ, where authorized operations are performed.
Figure 11.7: A legacy data portal, in which both control traffic (queries, etc.) and data traffic must pass through the enterprise firewall. Figure courtesy Eli Dart.
Figure 11.8: A modern research data portal, showing the high-speed data path through
the border router and to the DTN in green and the control path through the enterprise
firewall to the portal server in red. Multiple DTNs provide for high-speed transfer between
network and storage. Figure courtesy Eli Dart.
11.5.3 Implementing the Design Pattern with Globus
We now need mechanisms to allow research code running on the portal server to manage access to, and drive transfers to and from, the DTNs. This is where Globus SDKs come in, as we discuss next. We consider a use case similar to the NCAR Research Data Archive example that follows. A user requests data for download; the portal makes the data available via four steps: (1) create a shared endpoint; (2) copy the requested data to that shared endpoint; (3) set permissions on the shared endpoint to enable access by the requesting user, and email the user a URL that they can use to retrieve data from the shared endpoint; and ultimately (perhaps after several days or weeks), (4) delete the new shared endpoint.
The NCAR Research Data Archive (RDA) [48] (rda.ucar.edu), operated by the U.S. National Center for Atmospheric Research, illustrates some of the issues that can arise when implementing a research data portal. This system contains more than 600 data collections, ranging in size from gigabytes to tens of terabytes, and including meteorological and oceanographic observations, operational and reanalysis model outputs, and remote sensing datasets, along with ancillary datasets, such as topography/bathymetry, vegetation, and land use.

The RDA data portal allows users to browse and search catalogs of environmental datasets, place datasets that they wish to download into a “shopping basket,” and then download selected datasets to their personal computer or other location. (RDA users are primarily researchers at federal and academic research laboratories. In 2014 alone, more than 11,000 people downloaded more than 1.1 petabytes.) The portal must thus implement a range of different functions, some totally domain-independent (e.g., user identities, authentication, and data transfer) and others more domain-specific (e.g., a catalog of environmental data collections). As we see later in the chapter, the beauty of the Globus approach is that much of the domain-independent logic (in particular, that associated with identity management, authentication, data movement, and data sharing) can be outsourced to cloud services.
We present in figure 11.9 a function rdp that implements these actions. As shown in the following, this function takes as arguments the identifier for the endpoint on which the shared endpoint is to be created; the folder on that endpoint for which sharing is to be enabled (here, Share123, or shared_dir in figure 11.1 on page 228); the folder on that endpoint from which the contents of the shared folder are to be copied; the identifier for the user to be granted access to the new endpoint; and an email address to send a notification of the new share.
rdp('b0254878-6d04-11e5-ba46-22000b92c6ec',
'Share123',
'~/TEST/',
'cce13ca1-493a-46e1-a1f0-08bc219638de',
'foster@anl.gov')
As noted in section 3.6 on page 51 and shown in this example, each Globus endpoint and user is named by a universally unique identifier (UUID). An endpoint's identifier can be determined via the Globus web client or programmatically; a user's identifier can be determined programmatically, as we show in notebook 8.
The code in figure 11.9 proceeds as follows. In steps 1 and 2, we obtain Transfer and Auth client references and use endpoint_autoactivate, a Globus SDK function, to ensure that the research data portal admin has a credential that permits access to the endpoint identified by host_id. (See section 3.6.1 on page 52 for more discussion of endpoint_autoactivate.)
In step 3, we call the function create_share of figure 11.2 on page 229, passing as parameters the Transfer client reference, the identifier for the endpoint on which the shared endpoint is to be created, and the path for the folder that is to be shared: in our example call, the directory /~/Share123. As discussed earlier, that function creates a shared endpoint for the new directory. At this point, the new shared endpoint exists and is associated with this directory. However, only the creating user has access to the new shared endpoint.
In step 4, we use a Globus transfer to copy the contents of the folder source_path to the new shared endpoint. (The transfer here is from the endpoint on which the new shared endpoint has been created, but it could be from any Globus endpoint that the research data portal admin is authorized to access.) We have already introduced the Globus Transfer SDK functions used here in section 3.6 on page 51.
In step 5, we call the grant_access function defined in figure 11.2 to grant our user access to the new shared endpoint. The function call specifies the type of access to be granted ('r': read only) and the message to be included in a notification email: 'Your data are available'. The invitation email sent to the user by the Globus SDK function add_endpoint_acl_rule is shown in figure 11.10.
The user is now authorized to download data from the new shared endpoint. That endpoint will typically be left operational for some period, after which it can be deleted, as shown in step 6. Note that deleting a shared endpoint does not delete the data that it contains: the research data portal administrator may want to retain the data for other purposes. If the data are not to be retained, we can use the Globus SDK function submit_delete to delete the folder.
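As a sketch of that cleanup step, assuming tc and host_id as in figure 11.9:

from globus_sdk import DeleteData

# Delete the shared folder and its contents from the host endpoint
ddata = DeleteData(tc, host_id, recursive=True)
ddata.add_item('/~/Share123/')
r = tc.submit_delete(ddata)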
from globus_sdk import TransferClient, TransferData, AuthClient

def rdp(host_id,      # Endpoint on which to create shared endpoint
        shared_dir,   # Directory name for shared endpoint
        source_path,  # Directory to copy shared data from
        user_id,      # User to share with
        email):       # Email address for the new share notification
    # (1) Obtain Transfer and Auth client references
    tc = TransferClient()
    ac = AuthClient()
    # (2) Activate host endpoint
    tc.endpoint_autoactivate(host_id)
    # (3) Create shared endpoint
    share_id = create_share(tc, host_id, '/~/' + shared_dir + '/')
    # (4) Copy data into the shared endpoint
    tc.endpoint_autoactivate(share_id)
    tdata = TransferData(tc, host_id, share_id,
                         label='Copy to share', sync_level='checksum')
    tdata.add_item(source_path, '/~/', recursive=True)
    r = tc.submit_transfer(tdata)
    tc.task_wait(r['task_id'], timeout=1000, polling_interval=10)
    # (5) Set access control to enable access by user
    grant_access(tc, ac, share_id, user_id, 'r',
                 'Your data are available')
    # (6) Ultimately, delete the shared endpoint
    tc.delete_endpoint(share_id)
Figure 11.9: Globus code to implement the research data portal design pattern.
From: Globus Notification <noreply@globus.org>
To: Portal server user <user@user.org>
Subject: Portal server admin (admin@therdp.org) shared folder "/" on "Share123" with you
Globus user Portal server admin (admin@therdp.org) shared the folder "/" on the endpoint
"Share123" (endpoint id: 698062fa-88ed-11e6-b029-22000b92c261) with user@user.org,
with the message:
Your data are available.
Use this URL to access the share:
https://www.globus.org/app/transfer?&origin_id=698062fa-88ed-11e6-b029-
22000b92c261&origin_path=/&add_identity=cce13ca1-493a-46e1-a1f0-08bc219638de
The Globus Team
support@globus.org
Figure 11.10: Invitation email sent by the program in figure 11.9.
A variant of this approach, with certain administrative advantages, is as follows. Rather than having the portal server create a new shared endpoint for each request, a single shared endpoint is created once and the portal is given the access manager role on the shared endpoint so that it can set ACL rules. Then, for each request it creates a folder on the shared endpoint, puts the data in that location, and sets an ACL rule to manage access. Cleanup is then simpler: the portal just removes the ACL rule and deletes the folder. A sketch of this variant follows.
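The following is a minimal sketch of this variant, assuming tc is an authenticated TransferClient, share_id names the long-lived shared endpoint, and the portal holds the access manager role on it; the helper names are illustrative:

import uuid
from globus_sdk import DeleteData

def publish_request(tc, share_id, user_id):
    # Create a per-request folder on the long-lived shared endpoint
    folder = '/' + str(uuid.uuid4()) + '/'
    tc.operation_mkdir(share_id, path=folder)
    # ... copy the requested data into 'folder', as in figure 11.9 ...
    # Grant the user read access to this folder only
    r = tc.add_endpoint_acl_rule(share_id, {
        'DATA_TYPE': 'access',
        'principal_type': 'identity',
        'principal': user_id,
        'path': folder,
        'permissions': 'r'
    })
    return folder, r['access_id']

def cleanup_request(tc, share_id, folder, access_id):
    # Remove the ACL rule, then delete the per-request folder
    tc.delete_endpoint_acl_rule(share_id, access_id)
    ddata = DeleteData(tc, share_id, recursive=True)
    ddata.add_item(folder)
    tc.submit_delete(ddata)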
11.6 The Portal Design Pattern Revisited
The preceding example shows the essentials of a Globus implementation of the research data portal design pattern. We provide in figure 11.11 a more abstract picture of the architecture that makes clear the components involved and their relationships. To recap, the portal web server at the center of the figure is where all custom logic associated with the research data portal sits. This portal server acts as a client, in the Globus Auth/OAuth2 sense, to the other services that it uses to handle the heavy lifting of authentication and authorization (Globus Auth), data transfer and sharing (Globus Transfer), and other computations (Other services). The user accesses portal capabilities via a web browser, and data transfers occur between Globus Connect servers at various locations.
Many variants of this basic research data portal design pattern can be imagined. A minor variant is to prompt the user for where they want their data placed; the portal then submits a transfer on the user's behalf to copy the data to the specified endpoint and path, hence automating yet another step (see the sketch after this paragraph). Or, the data that users access may come from experimental facilities rather than a data archive, in which case data may be deleted after successful download. Access may be granted to groups of users rather than individuals. A portal may allow its users to upload datasets for analysis and then retrieve analysis results. A data publication portal may accept data submissions from users, and load data that pass quality control procedures into a public archive. We give examples of several such variants in the following, and show that each can naturally be expressed in terms of the same basic design pattern.
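As an illustration of the first of these variants, here is a minimal sketch of submitting a transfer on the user's behalf. It assumes that user_tc is a TransferClient authorized with a transfer token obtained from the user during login, and that dest_ep and dest_path were supplied by the user via a web form:

from globus_sdk import TransferData

tdata = TransferData(user_tc, share_id, dest_ep,
                     label='Deliver requested data')
tdata.add_item('/', dest_path, recursive=True)
user_tc.submit_transfer(tdata)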
Similarly, while we have described the research data portal in the context of an institutional Science DMZ, in which (as shown in figure 11.7) the portal server and data store both sit within the research institution, other distributions are also possible and can have advantages. For example, the portal can be deployed on the public cloud for high availability, while the data sit in the Science DMZ to enable direct access from high-speed research networks and/or to avoid public cloud storage charges. Alternatively, the portal can be in the research institution and data in cloud storage. Or both components can be run on cloud resources.
Regardless of the specifics, a research data portal typically needs to perform mundane but important tasks such as determining the identity of a user who wants to access the service; controlling which users are able to access different data and other services within the portal; uploading data reliably, securely, and efficiently from a variety of locations to storage systems within the Science DMZ; downloading data reliably, securely, and efficiently from storage systems within the Science DMZ to a variety of locations; dispatching requests to other services on behalf of users; and logging all actions performed for purposes of audit, accounting, and reporting. Each task is modestly complex to implement and operate reliably and well. Building on top of existing services can not only greatly reduce development costs, but also increase code quality and interoperability via use of standards.
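The first of these tasks, determining user identity, reduces to a few SDK calls once login has completed. A minimal sketch, assuming auth_token is an access token with the openid scope obtained during the OAuth2 login flow:

from globus_sdk import AuthClient, AccessTokenAuthorizer

ac = AuthClient(authorizer=AccessTokenAuthorizer(auth_token))
info = ac.oauth2_userinfo()
print(info['sub'], info.get('preferred_username'))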
As figure 11.9 shows, the benefits of this approach lie not only in the separation of concerns between control logic and data movement. In addition, the portal developer and admin both benefit from the ability to hand off the management of file access and transfers to the Globus service. The use of Globus APIs makes it easy to implement a wide range of behaviors via simple programs; Globus handles the heavy lifting of high-quality, reliable, and secure authentication, authorization, and data management.
Figure 11.11: The research data portal architecture, showing principal components. Only
the portal web server logic needs to be provided by the portal developer. Not shown are
other applications that, like the Browser on the left, may access the portal server: for
example, command line, thick client, or mobile applications.
It is these capabilities that made it easy to realize the example systems mentioned in this chapter: the NCAR Research Data Archive, which provides high-speed delivery of research data to geoscientists; the DMagic data sharing system for data distribution from light sources; and the Sanger Imputation Service (described on the next page), which supports online analysis of user-provided genomic data.
11.7 Closing the Loop: From Portal to Graph Service
We have already shown in section 11.4 on page 240 how to use the Globus Auth SDK to implement a service that responds to requests from a portal server: the arrow labeled REST from the Portal web server to Other services in figure 11.11. Such calls might be used in a research data portal for several reasons. You might want to organize your portal as a lightweight front end (e.g., pure Javascript) that interacts with one or more remote backend services. Another reason is that you might want to provide a public REST API for the main portal machinery, so that other app and service developers can integrate with and build on your portal.
Now we look at the logic and code involved in generating such requests. Our research data service skeleton illustrates this capability. When a user selects the Graph option to request that datasets be graphed, the portal does not perform those graphing operations itself but instead sends a request to a separate Graph service. The request provides the names of the datasets to be graphed. The Graph service retrieves these datasets from a specified location, runs the graphing program, and uploads the resulting graphs to a dynamically created shared endpoint for subsequent retrieval. We describe in the following both the portal server and Graph server code used to implement this behavior.
Figure 11.12 shows a slightly simplified version of the portal code that sets up, sends, and processes the response from the graph request, using the Python Requests library [225]. The code (1) retrieves the access tokens obtained during authentication and extracts the access token for the graph service. (The graph service scope is requested during this flow.) It then (2) assembles the URL, (3) header (containing the Graph service access token), and (4) data for the REST call (including information about the requesting user), and (5) dispatches the call. The remainder of the code (6) checks for a valid response, (7) extracts the location of the newly created graph files from the response, and (8) directs the user to a Globus transfer browser to access the files.
Sanger Institute Imputation Service imputation.sanger.ac.uk. Operated by the Sanger Institute in the UK, this service allows you to upload files containing genome wide association study (GWAS) data from the 23andMe genotyping service and receive back the results of imputation and other analyses that identify genes that you are likely to possess based on those data. The service uses Globus APIs to implement a variant of the research data service design pattern, as follows.
A user who wants to use the service first registers an imputation job. As part of this process, they are prompted for their name, email address, and Globus identity, as well as the type of analysis to be performed. The Sanger service then requests Globus to create a shared endpoint, share that endpoint with the Globus identity provided by the user, and email a link to this endpoint to the user. The user clicks on that link to upload their GWAS data file, and the corresponding imputation task is added to the imputation queue at the Sanger Institute. Once the imputation task is completed, the Sanger service requests Globus to create a second shared endpoint to contain the output and to email the user a link to that new endpoint for download. The overall process differs from that of figure 11.9 only in that a shared endpoint is used for data upload as well as download.
# (1) Get access tokens for the Graph service
tokens = get_portal_tokens()
gs_token = tokens.get('Graph Service')['token']
# (2) Assemble URL for REST call
gs_url = '{}/{}'.format(app.config['SERVICE_URL_BASE'], 'api/doit')
# (3) Assemble request headers
req_headers = dict(Authorization='Bearer {}'.format(gs_token))
# (4) Assemble request data. Note retrieval of user info.
req_data = dict(datasets=selected_ids,
                year=selected_year,
                user_identity_id=session.get('primary_identity'),
                user_identity_name=session.get('primary_username'))
# (5) Post request to the Graph service
resp = requests.post(gs_url,
                     headers=req_headers,
                     data=req_data,
                     verify=False)
# (6) Check for valid response
resp.raise_for_status()
# (7) Extract information from response
resp_data = resp.json()
dest_ep = resp_data.get('dest_ep')
dest_path = resp_data.get('dest_path')
# (8) Show Globus endpoint browser for new data
return redirect(url_for('browse', endpoint_id=dest_ep,
                        endpoint_path=dest_path.lstrip('/')))
Figure 11.12: Slightly simplified version of the function graph() in file portal/view.py at github.com/globus/globus-sample-data-portal.
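On the other side of this call, the Graph service must verify the bearer token it receives before doing any work, using the approach of section 11.4. A minimal sketch, assuming the service is registered as a confidential client with Globus Auth and that client_id, client_secret, and access_token are supplied by its configuration and request handling:

from globus_sdk import ConfidentialAppAuthClient

ac = ConfidentialAppAuthClient(client_id, client_secret)
token_info = ac.oauth2_token_introspect(access_token)
if not token_info.get('active'):
    raise PermissionError('Invalid or expired access token')
# The token's 'sub' and 'scope' fields identify the caller and
# confirm that the Graph service scope was granted
user_id = token_info['sub']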
11.8 Summary
In the distributed, collaborative, data-rich world of modern science, the abilities to transfer, share, and analyze data regardless of location, and to navigate complex security regimes in so doing, are frequently essential to progress. We have described cloud-hosted platform services, Globus Auth, Transfer, and Sharing, to which developers of applications and tools that must operate in this world can outsource responsibility for such tasks. We have used the example of a research data portal to illustrate their use, but they can be used in many different configurations.
11.9 Resources
Globus has extensive online documentation for their REST APIs, Python SDK, and command line interface at dev.globus.org. Dart et al. [106] provide additional information on the Science DMZ concept and design pattern.