Chapter 11
The Globus Research Data
Management Platform
“Give me where to stand, and I will move the earth.”
—Archimedes
We have seen how powerful cloud-based data storage and analysis services can simplify working with large data. But not all science and engineering data live in the cloud. Research is highly collaborative and distributed, and frequently requires specialized resources: data stores, supercomputers, instruments. Thus data are created, consumed, and stored in a variety of locations, including specialized scientific laboratories, national facilities, and institutional computer centers. Data movement and sharing and authentication and authorization are perennial challenges that can impose considerable friction on research and collaboration.
We describe in this chapter a set of platform services that address these challenges. The Globus cloud service provides data movement, data sharing, and credential and identity management capabilities. We described briefly in section 3.6 on page 51 how these services can be accessed as software as a service, via web interfaces. Here, we introduce more details on these services and describe the Python SDKs that permit their use from within applications. We focus in particular on how the Globus Auth service makes it straightforward to build science services that can accept identities from different identity providers, use standard protocols for authentication and authorization, and thus integrate naturally into a global ecosystem of service providers and consumers. As a use case, we show how these capabilities can be used to build research data portals.
11.1 Challenges and Opportunities of Distributed Data
Data movement is central to many research activities, including analysis, collaboration, publication, and data preservation. However, given its importance and ubiquity, this task remains surprisingly challenging in practice: storage systems have different security configurations, achieving good transfer performance is nontrivial, and as data sizes increase the likelihood of errors increases. Scientists and engineers frequently struggle with such mundane tasks as authenticating and authorizing user access to storage systems, establishing high-speed data connections, and recovering from faults while a transfer proceeds.
Authentication and authorization are similarly central to science and engineering, and for related reasons. Researchers often find themselves needing to navigate a complex world of different identities, authentication methods, and credentials as they access resources in different locations. For example, say that you need to transfer data repeatedly from two sites A and B to a storage system at your home institution H. You have accounts at A and H, with identities UA and UH; site B will accept your home institution identity, thanks to the InCommon identity management federation [68]. You will commonly need to authenticate once for each transfer: a painful process and one that prevents scripting. You would prefer to instead authenticate as UA and UH just once, and then perform subsequent transfers from A and B to H without further authentications.
The data sharing problem sits at the intersection of these two challenges. Say
you want to grant a collaborator access to data at your home institution. Setting
up an account just for that purpose is typically a time-consuming process, if it is
possible at all. And it forces your collaborator to deal with yet another username
and password. You need to be able to enable access without a local account.
Globus services address these and other related challenges that arise when our work requires the integration of resources across different locations. As well as an easy-to-use, web-browser based interface, Globus provides REST APIs and Python SDKs to enable the integration of Globus solutions into applications in ways that reduce development costs and increase security, performance, and reliability.
11.2 The Globus Platform
Globus was first introduced in 2010 as a software-as-a-service solution to the problem of moving data between pairs of storage systems or endpoints [62, 123]. (An endpoint is a storage system that has been connected to the Globus cloud services by using software called Globus Connect.) The Amazon-hosted Globus
software handles the complexity involved in transfers, such as authenticating and authorizing user access to endpoints, creating a high-speed data connection between endpoints, and recovering from faults while a transfer proceeds. Importantly, it implements a third-party transfer model in which no data are transferred via the Globus service: instead, data are transferred directly between endpoint pairs by using a protocol called GridFTP that provides specialized support for high performance and reliability [61]. Globus can also perform rsync-like updates when doing repeated transfers, allowing transfer of only new or modified files from the source to the destination. Direct HTTPS transfers to and from endpoints are also supported, allowing web browser access to data stored on Globus endpoints.
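To make the rsync-like behavior concrete, here is a minimal sketch of a repeated transfer using the Globus Python SDK; it assumes an already authenticated TransferClient, and the endpoint UUIDs and paths shown are placeholders rather than real values:

from globus_sdk import TransferClient, TransferData

tc = TransferClient()  # Assumes credentials are already configured
SRC = 'ddb59aef-0000-0000-0000-000000000000'  # Placeholder source UUID
DST = 'ddb59af0-0000-0000-0000-000000000000'  # Placeholder dest UUID

# sync_level='checksum' requests rsync-like behavior: only files that
# are new or whose checksums differ are copied to the destination
tdata = TransferData(tc, SRC, DST, label='Nightly sync',
                     sync_level='checksum')
tdata.add_item('/data/', '/backup/data/', recursive=True)
task = tc.submit_transfer(tdata)
print('Submitted transfer task', task['task_id'])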
The Globus team has subsequently built on this initial Globus Transfer service by adding Auth for identity and credential management, Groups for group management, Sharing for data sharing, and Publication and Data Search for data management. Importantly, the Globus team also created REST APIs and Python SDKs to allow these capabilities to be used programmatically, from within applications. It is these platform capabilities that we describe in this chapter, building on the introductory material in section 3.6.1 on page 52, where we showed how to use the Globus Python SDK to initiate, monitor, and control data transfers. We first provide additional details on the programmatic use of Globus Sharing capabilities, then introduce the use of Globus Auth, and finally present illustrative examples of the use of these capabilities.
11.2.1 Globus Transfer and Sharing
We introduced Globus Sharing capabilities in section 3.6.2 on page 54. Here we
show how to use the Python SDK to manage sharing programmatically. Recall
that Globus Sharing allows a user to make a specified folder on a Globus endpoint
accessible to other Globus users. Figure 11.1 shows the idea. Bob has enabled
sharing of folder ~/shared_dir on Regular endpoint by creating Shared endpoint,
and then granting Jane access to that shared endpoint. Jane can then use Globus
Transfer to read and/or write files in the shared folder, depending on what rights
she has been granted.
As is the case with the Globus Transfer service presented in chapter 3, all data
sharing capabilities offered by the Globus web interface are also accessible via
the Python SDK. The code in figure 11.2 on page 229 illustrates their use. We
explain each of the two functions in the figure in turn. We use both functions in
section 11.5.3 on page 247 as part of a research data portal implementation.
Figure 11.1: The Globus shared endpoint construct allows an authorized administrator of a Globus endpoint (say Bob) to create a shared endpoint granting access to a folder within that endpoint, to which they can then authorize access by others (say Jane).
First, the create_share function: We assume that we have previously initiated a transfer object, tc, in the manner illustrated in the first lines of figure 3.9 on page 55, and that this object is passed to the function, along with the endpoint identifier and path for the folder that is to be shared (“Regular endpoint” and ~/shared_dir, respectively, in figure 11.1). The function uses the Globus SDK function operation_mkdir to request creation of the specified folder on the specified endpoint. It then creates a parameter structure, calls the Globus SDK function create_shared_endpoint to create a shared endpoint for the new directory, and finally returns the identifier for the new endpoint.
Second, the grant_access function: This function requires both tc and a Globus Auth client reference, ac (we introduce Auth in the next section); identifiers for the shared endpoint (share_id) for which sharing is to be enabled and the user (user_id: a UUID, as with endpoint identifiers) to whom access is to be granted; the type of access to be granted (atype: can be 'r', 'w', or 'rw'); and a message to be emailed to the user upon completion. The function uses the Globus Auth SDK function get_identities to determine the identities that are associated with the user for whom sharing is to be enabled, and extracts from this list an email address. It then uses the Globus Transfer SDK function add_endpoint_acl_rule to add an access control rule to the shared endpoint, granting the specified access type to the specified user.
11.2.2 The rule_data Structure
Our example program passes a rule_data structure to the add_endpoint_acl_rule function. The various elements specify, among other things:

'principal_type': the type of principal to which the rule applies;
#
# Create a shared endpoint on specified 'endpoint' and 'folder';
# return the endpoint id for the new endpoint.
# Supplied 'tc' is a Globus Transfer client reference.
#
def create_share(tc, endpoint, folder):
    # Create directory to be shared
    tc.operation_mkdir(endpoint, path=folder)
    # Create the shared endpoint on the specified folder
    shared_ep_data = {
        'DATA_TYPE': 'shared_endpoint',
        'host_endpoint': endpoint,
        'host_path': folder,
        'display_name': 'Share ' + folder,
        'description': 'New shared endpoint'
    }
    r = tc.create_shared_endpoint(shared_ep_data)
    # Return identifier of the newly created shared endpoint
    return r['id']
#
# Grant 'user_id' access 'atype' on 'share_id'; email 'message'.
# Supplied 'tc' and 'ac' are Globus Transfer and Auth client refs.
#
def grant_access(tc, ac, share_id, user_id, atype, message):
    # (1) Determine an email address for the user
    r = ac.get_identities(ids=user_id)
    email = r['identities'][0]['email']
    rule_data = {
        'DATA_TYPE': 'access',
        'principal_type': 'identity',  # To whom is access granted?
        'principal': user_id,          # To an individual user
        'path': '/',                   # Grant access to this path
        'permissions': atype,          # Grant specified access
        'notify_email': email,         # Email invite to this address
        'notify_message': message      # Include this message in email
    }
    r = tc.add_endpoint_acl_rule(share_id, rule_data)
    return r
Figure 11.2: Functions that use the Globus Python SDK to create a shared endpoint and grant a user access to it.
'principal': as the 'principal_type' is 'identity', this is the user id with whom sharing is to be enabled;

'permissions': the type of access being granted: in this case read-only ('r'), but could also be read and write ('rw');

'notify_email': an email address to which an invitation to access the shared endpoint should be sent; and

'notify_message': a message to include in the invitation email.

The 'principal_type' element can also take the value 'group', in which case the 'principal' element must be a group id. Alternatively, it can take the values 'all_authenticated_users' or 'anonymous', in which cases the 'principal' element must be an empty string.
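For example, a minimal sketch of a rule_data structure that grants a group read access (the group UUID shown is a placeholder, not a real group):

rule_data = {
    'DATA_TYPE': 'access',
    'principal_type': 'group',  # Grant access to a whole group
    'principal': 'fdb38a24-0000-0000-0000-000000000000',  # Placeholder
    'path': '/',
    'permissions': 'r'          # Read-only access for group members
}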
11.3 Identity and Credential Management
We noted above the challenges that users face when authenticating to different sites and services in the course of their work. Similarly, service developers need mechanisms for establishing the identity of a requesting user and for determining what that user is authorized to do. Figure 11.3 illustrates some of the concepts and issues involved. An end user wants to run an application that makes requests to remote services on her behalf. Those remote services may themselves want to make further calls to other dependent services. For consistency with commonly used terminology, we refer to the user as the resource owner, the application as the client, and each remote and dependent service as a resource server.
Two interrelated problems frequently arise in such contexts. The first concerns the use of alternative identity providers. A resource server frequently wants to establish the identity of the user (i.e., resource owner) who issued an incoming request, often to determine whether to grant access and sometimes simply to log who is using their service. In the past, developers of resource servers often implemented their own username-password authentication systems, but such approaches are inconvenient and insecure. Instead, we want to allow a resource server to accept credentials from other identity providers: for example, that associated with a user’s home institution. Furthermore, different resource servers may require different credentials. For example, to transfer a file from the University of Chicago to Lawrence Berkeley National Laboratory, I must authenticate with both my Chicago and my Berkeley identities to establish my credentials to access file systems at Chicago and Berkeley, respectively.
Figure 11.3: A schematic of the entities and interactions that engage in distributed
resource accesses, using the terminology of OAuth2.
The second problem concerns (restricted) delegation. A resource server may need to perform actions on behalf of a requesting user. For example, it may need to transfer files or perform computations. It may then need credentials that allow it to establish its authority to perform such actions. (This requirement is especially important if the resource server needs to operate in an unattended manner, for example so that it can continue file transfers or computations while the user eats lunch.) However, users may not want to grant unlimited rights to a remote service to perform actions on their behalf, due to the potential for harm if a credential is compromised. Thus, the ability to restrict the rights that are delegated is important. For example, you might be ok with a service reading, but not writing, files on a certain server. And you certainly do not want a compromised service to be able to use other services that you have not authorized.
As we describe in the following, the cloud-hosted Globus Auth service addresses
these and other related concerns.
11.3.1 Globus Auth Is an Authorization Service
Globus Auth leverages two widely used web standards, the OAuth 2.0 Authorization Framework (OAuth2) [149] and OpenID Connect Core 1.0 (OIDC) [230], to implement solutions to these problems. OAuth2 is a widely used protocol that applications can use to provide client applications with secure delegated access: the delegation that we spoke about above. It works over HTTP and uses access tokens to authorize servers, applications, and other entities. OIDC is a simple identity layer on top of the OAuth protocol.
The cloud-hosted Globus Auth service is what OAuth2 calls an authorization server. As such, it can issue access tokens to a client after successfully authenticating the resource owner and obtaining authorization from that resource owner for the client to access resources provided by a resource server. (This
Figure 11.4: A Globus Auth consent request, in this case for the Globus web application.
authorization process typically involves a request for consent, such as those shown in figure 11.4.) The resource owner in this scenario is typically an end user, who authenticates to a Globus Auth-managed Globus account using an identity issued by one of an extensible set of (federated) identity providers supported by Globus Auth. A resource owner could also be a robot, agent, or service acting on its own behalf, rather than on behalf of a user; the client may be either an application (e.g., web, mobile, desktop, command line) or another service acting as a client, as we explain in subsequent discussion.
Having obtained an access token, the client can then present that token as part of a request to the resource server for which the token applies, to demonstrate that it is authorized to make the request. The token is included in the request via the HTTPS Authorization header.
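As a minimal sketch of this convention (the url and access_token variables are illustrative assumptions, standing in for a real resource server URL and a token previously issued by Globus Auth):

import requests

# Present an access token to a resource server via the HTTPS
# Authorization header, using the standard Bearer scheme
response = requests.get(url,
                        headers={'Authorization': 'Bearer ' + access_token})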
Access tokens are thus the key to OAuth2 and Globus Auth. An access token represents an authorization issued by a resource owner to a client, authorizing the client to request access to a specified resource server on the resource owner’s behalf. As we describe later, the resource server can then ask the Globus Auth authorization service for details on what rights have been granted: a process that is referred to as “introspection.” For example, if the resource owner in figure 11.3 wants to allow a client (e.g., a web portal) to access a remote service but only for purposes of reading during the next hour, introspection of the associated token can reveal those restrictions. Globus Auth thus addresses the problems of (restricted) delegation. It also supports the linking of multiple identities, as we discuss below, to address the problem of alternative identity providers.
A resource server receiving a token from a client can thus determine that the
resource owner has authorized it to perform certain actions on the resource owner’s
behalf. What if the resource server then wants to reach out to other resource servers,
for example to Globus Transfer to request a data transfer? A problem arises: the
resource server has a token that authorizes it to perform actions itself, but it has
no token that it can present to the Globus Transfer service to demonstrate that
the resource owner (the end user in our example) has authorized transfers.
This is where dependent services come in. When a resource server R is registered with Globus Auth, it can specify services that it needs to access to perform its functions: its dependent services, say S and T. A request from R to Globus Auth for authorization then causes Globus Auth to request consent from the user not only for R but also for the dependent services S and T. We saw an example of this scenario in figure 11.4: the Globus web application has registered Globus Transfer and Globus Groups as dependent services, and thus you see the user being asked to consent to those uses. Once consent has been granted, the resource server can request additional dependent access tokens, as required, that it can then include in requests to other services that it makes on the authorizing resource owner’s behalf.
OAuth2 and Globus Auth incorporate various complexities and subtleties, but the basic steps are simple. A user accesses an application; Globus Auth authenticates and requests consents from the end user; Globus Auth provides access tokens to the application; the application uses access tokens to access other services; a service receiving an access token can validate it and use it to request dependent access tokens to access other services. Importantly, different actors can play different roles at different times: your web browser can be a client to a web service, that itself can act as a client to other services, and so on.
11.3.2 A Typical Globus Auth Workflow
We use figure 11.5 on page 235 to illustrate how Globus Auth works. The figure looks complicated, but please bear with us: the underlying concepts are straightforward. We describe each of the 12 steps shown in the figure in turn.
1. The end user accesses the application to make a request to a remote service. The application might be a Web client or, alternatively, an application running on the user’s desktop or some other computer.
2. The application contacts Globus Auth to request authorization for the use of a set of scopes. A scope represents a set of capabilities provided by a resource server for which an access token is to be granted. In this case, the application requests two scopes: one for access to login information and one for HTTPS/REST API access.
3. Globus Auth arranges for authentication of the user, using an identity provider that is mutually acceptable to the user and the application. Because the user only authenticates with the authorization server, the user’s credentials are never shared with the client or with Globus Auth.
4. Globus Auth returns an authorization code to the user.
5. The user requests access tokens from Globus Auth, passing the previously acquired authorization code to establish their right to obtain these tokens.
6. Access tokens are returned, one per requested scope. The issuance of multiple tokens enhances security by limiting the impact of a compromise.
7. The client can now use the access token in an HTTPS/REST request to a resource server, by setting an HTTPS Authorization: Bearer header with the appropriate token. (For concreteness, the remote service is here shown as Globus Transfer, but it could be anything.)
8. Using a recent OAuth2 extension [226], the resource server can contact Globus Auth to “introspect” the token and thus obtain answers to questions such as “is the token valid?,” “which resource owner is it for?,” “what client is making the request?,” and “which scope is it for?”
9. Globus Auth responds to the introspection request. The resource server can use the provided information to make an authorization decision as to how it responds to the client request.
10. The resource server can also use its access token to request dependent access tokens for any dependent services. For example, Globus Transfer can retrieve an access token for the Globus Groups resource server, so that it can check if the requesting user is a member of a particular group before taking some action like allowing access to a shared endpoint.
11. Globus Auth returns requested dependent tokens.
Figure 11.5: Entities and interactions involved in Globus Auth-mediated distributed
resource requests. Details are provided in the text.
12. The resource server uses a newly issued dependent access token in an HTTPS/REST request to the second resource server.
There are other OAuth2 and Globus Auth details that are not covered here: for example, refresh tokens (because an access token’s lifetime may be less than that of an application) and the somewhat different methods used in the case of a long-lived application rather than a web browser. Also, an alternative protocol is used for rich clients such as the Javascript-based Globus Transfer client that avoids the need for steps 4 and 5; a variant of this flow supports mobile, command line, and desktop applications: “native apps.” But we have covered the essentials.
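As an illustration of the native app variant, here is a minimal sketch of the flow using the Globus Python SDK; the client ID shown is a placeholder for one obtained by registering an application with Globus Auth:

import globus_sdk

CLIENT_ID = '00000000-0000-0000-0000-000000000000'  # Placeholder

client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
client.oauth2_start_flow()

# Steps 1-3: the user authenticates in a browser and obtains a code
print('Please log in at:', client.oauth2_get_authorize_url())
auth_code = input('Enter the authorization code: ').strip()

# Steps 5-6: exchange the authorization code for access tokens
tokens = client.oauth2_exchange_code_for_tokens(auth_code)
transfer_token = tokens.by_resource_server[
    'transfer.api.globus.org']['access_token']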
11.3.3 Globus Auth Identities
Globus Auth maintains information about the identities that its users may use to authenticate. A Globus Auth identity has a unique, case-insensitive username (for example, user@example.org), issued by an identity provider (e.g., a university, research laboratory, or Google), for which a user or client can prove possession via an authentication process (e.g., presenting a password to the identity provider). Globus Auth manages the use of identities (e.g., to login to clients and services), their properties (e.g., contact information), and relationships among identities (e.g., allowing login to an identity by using another linked, “federated” identity).
Globus Auth neither defines its own identity usernames nor verifies authentication (e.g., via passwords) with identities. Rather, it acts as an intermediary between external identity providers, on the one hand, and clients and services that want to leverage identities issued by those providers, on the other. Globus Auth assigns each identity that it encounters an identifier: a UUID that is guaranteed to be unique among all Globus Auth identities, and that will never be reused. This ID is what resource servers and clients should use as the canonical identifier for a Globus Auth identity. Associated with this ID are an identity provider, a username given to the identity by the provider, and other provider-supplied information such as display name and contact email address.
An example Globus Auth identity. The following is an example of the information that may be associated with a Globus Auth identity:

username          : rocky@wossamotta.edu
id                : de305d54-75b4-431b-adb2-eb6b9e546014
identity_provider : wossamotta.edu
display_name      : Rocket J. Squirrel
email             : rocky@wossamotta.edu
Globus supports more than 100 identity providers, and more are being added all the time. Examples include the many US and international universities and other institutions that support InCommon; various identity providers that support the OpenID Connect protocol; Google; and the Open Researcher and Contributor ID (ORCID). The process of integrating a new identity provider is beyond the scope of this book, but it is a straightforward process. See the Globus documentation for more information.
11.3.4 Globus Accounts
An identity can be used with Globus Auth to create a Globus account. A Globus account has a primary identity, but can also have any number of other identities linked to it as well. Thus, for example, Mr. Squirrel may create a Globus account with the identity above and then link to that account a Google identity, his ORCID, and an identity provided by a scientific facility to which he has access.

A Globus account is not an identity itself. It does not have its own name or identifier. Rather, a Globus account is identified by its primary identity. Similarly, profile information and other metadata are tied to identities, not to accounts. A Globus account is simply a set of identities comprising the primary identity and all identities linked to that primary identity.
11.3.5 Using Globus Auth Identities
Clients and resource servers should always use the Globus Auth-provided identity ID when referring to an identity, for example in access control lists, and when referring to identities in a REST API. Clients and resource servers can use the Globus Auth REST API to map any identity username to its (current) identity ID, and request information about an identity ID (e.g., username, display_name, provider, email), for example as follows:
import globus_sdk

# Obtain reference to Globus Auth client
ac = globus_sdk.AuthClient()

# Get identities associated with username 'globus@globus.org';
# the response contains zero or more identity records
r = ac.get_identities(usernames='globus@globus.org')
ids = [i['id'] for i in r['identities']]

# Get identities associated with a UUID
r = ac.get_identities(ids=ids)
The last command returns a JSON document containing a list of identities, such as the following. (This example document contains just one identity.)

{'identities':
  [{'email': None,
    'id': '46bd0f56-e24f-11e5-a510-131bef46955c',
    'identity_provider': '7daddf46-70c5-45ee-9f0f-7244fe7c8707',
    'name': None,
    'organization': None,
    'status': 'unused',
    'username': 'globus@globus.org'}
  ]
}
11.3.6 Use of Globus Auth by Resource Servers
Having introduced various details of the Globus Auth server, Globus Auth identities, and Globus accounts, we can now turn to the practical question of what we can do with these mechanisms. In particular, we describe how resource servers can use Globus Auth as an authorization server and thus both support sophisticated OAuth2 and OpenID Connect functionality, and leverage other resource servers that use Globus Auth.
Let us consider, for example, a research data service that accepts user requests to analyze genomic sequence data. (We describe an example of such a system, Globus Genomics, in section 14.4 on page 303.) This service is basically a data and code repository with a REST API, which other applications can leverage to access this repository programmatically.
This service is a resource server in the Globus Auth context. It needs to be able to authenticate users, validate user requests, and make requests to other services (e.g., to cloud or institutional storage to retrieve sequence data and store results, and to computing facilities to perform computations) on a user’s behalf. Globus Auth allows us to program each of these capabilities via manipulation of identities, access tokens, and OAuth2 protocol messages.

Assume that some client to this service has already followed steps 1–7 in figure 11.5 on page 235 and thus possesses the necessary access tokens. (The “client” may be a web client to the data server, or some other web, mobile, desktop, or command line application.) Interactions may then proceed as follows.
1. The client makes an HTTPS request to the resource server (the research data service proper: in the following we refer to it as the “data service”) with an Authorization: Bearer header containing an access token. (Step 8 in figure 11.5.)
2. The data service calls the function oauth2_token_introspect provided by the Globus Auth SDK, authorized by the data service’s client identifier and client secret (see below), to validate the request access token, and obtain additional information related to that token (scopes, effective identity, identities set, etc.). If the token is not valid, or is not intended for use with this resource server, Globus Auth returns an error.
3. The data service verifies that the request from its client conforms to the scopes associated with the request access token.

4. The data service verifies the identity of the resource owner (typically an end user) on whose behalf the client is acting. The data service may use this identity as its local account identifier for this user.
5. The data service uses the set of identities associated with the account referred to by the request access token to determine what the request is allowed to do. For example, if the request is to access a resource that is shared with particular identities, the data service should compare all of the account’s identities (primary and linked identity ids) with the resource access control permissions to determine if the request should be granted.
6. The data service may need to act as a client to other (dependent) resource servers, as discussed above. In that case, the data service uses the Globus SDK oauth2_get_dependent_tokens function to get dependent access tokens for use with downstream resource servers, based on the request access token that it received from the client.
7. The data service uses a dependent access token to make a request to a dependent resource server.
8. The data service responds to its client with an appropriate response.
A note regarding the client identifier and client secret mentioned in Step 2: Each client and resource server must register with Globus Auth and obtain a client id and client secret, which they can subsequently use with Globus Auth to prove their identity in the various OAuth2 messages: for example, when swapping an authorization code for an access token, calling token introspect, calling dependent token grant, and using a refresh token to obtain a new access token.
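As a minimal sketch of how a service might hold these credentials, here is one way to construct an Auth client that authenticates as the service itself; the id and secret shown are placeholders, and the helper name load_auth_client simply mirrors the one referenced later in figure 11.6:

import globus_sdk

SERVICE_CLIENT_ID = '11111111-2222-3333-4444-555555555555'  # Placeholder
SERVICE_CLIENT_SECRET = 'REPLACE-WITH-REGISTERED-SECRET'    # Placeholder

def load_auth_client():
    # An Auth client that authenticates as the service itself, as
    # needed for token introspection and dependent token requests
    return globus_sdk.ConfidentialAppAuthClient(
        SERVICE_CLIENT_ID, SERVICE_CLIENT_SECRET)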
11.3.7 Other Globus Capabilities
Globus also supports a growing set of other capabilities beyond those described here. For example, table 11.1 lists additional functions supported by the Globus Transfer Python SDK.
Table 11.1: Some of the close to 50 functions supported by the Globus Transfer Python SDK. (Others mostly implement endpoint administration functions.)

Type                    Function                    Description
Endpoint information    endpoint_search             Search on name, keywords, etc.
                        get_endpoint                Get endpoint information
                        my_shared_endpoint_list     Get endpoints that I manage
File system operations  operation_mkdir             Create a folder on endpoint
                        operation_ls                List contents of endpoint
                        operation_rename            Rename folder or directory
Task management         submit_transfer             Submit a transfer request
                        submit_delete               Submit a delete request
                        cancel_task                 Cancel submitted request
                        task_wait                   Wait for task to complete
Task information        task_list                   Get information about tasks
                        get_task                    Get information about a task
                        task_event_list             Get event info for a task
                        task_successful_transfers   Get successful transfers for task
                        task_pause_info             Get info on why task paused
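As a brief illustration of one of these functions, here is a sketch of endpoint_search, assuming tc is an authenticated TransferClient; the search string is arbitrary:

# Find up to five endpoints whose metadata match 'Tutorial'
for ep in tc.endpoint_search('Tutorial', num_results=5):
    print(ep['id'], ep['display_name'])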
Other Globus services provide other capabilities. Globus Publication, for example, provides user-configurable, cloud-hosted data publication pipelines that can be used to automate the workflows used to make data accessible to others, workflows that will typically include steps such as providing and collecting metadata, moving data to long-term storage, assigning persistent identifiers (e.g., a Digital Object Identifier or DOI [218]), and verifying data correctness [89]. Globus Data Search can be used to search for data on endpoints to which a user has access. See the Globus documentation at docs.globus.org for information on these services.
Data delivery at the Advanced Photon Source. The Advanced Photon Source (APS) at Argonne National Laboratory is typical of many experimental facilities worldwide in that it serves large numbers (thousands) of researchers every year, most of whom visit just for a few days to collect data and then return to their home institution. In the past, data produced during an experiment was invariably carried back on physical media. However, as data sizes have grown and experiments have become more collaborative, that approach has become less effective. Data transfer via network is preferred; the challenge is to integrate data transfer into the experimental workflow of the facility in a way that is fully automated, secure, reliable, and scalable to thousands of users and datasets.

Francesco De Carlo uses Globus APIs to do just that at the APS. His DMagic system [107] implements a variant of the program in figure 11.9 that integrates with APS administrative and facility systems to deliver data to experimental users. When an experiment is approved at the APS, a set of associated researchers are registered in the APS administrative database as approved participants. DMagic leverages this information as follows. Before the experiment begins, it creates a shared endpoint on a large storage system maintained by Argonne's computing facility. DMagic then retrieves from the APS scheduling system the list of approved users for the experiment, and adds permissions for those users to the shared endpoint. It then monitors the experiment data directory at the APS facility and copies new files automatically to that shared endpoint, from which they can be retrieved by any approved user.
11.4 Building a Remotely Accessible Service
Say you want to build a service that can be invoked remotely via a REST API call. Building and invoking a service in this way is straightforward in principle: many libraries exist for defining, implementing, and using REST APIs. Security is perhaps the one major source of complexity, and here Globus Auth can help. The basic issue is that when a remote user makes a request to the service, the service author needs to be able to determine who is making the request and what rights the requestor is passing with the request. For example, the service may
want to know if it is permitted to make Globus transfer requests on behalf of the requesting user. It may also want to know the identity of the requestor, so that the requestor can be given access to a shared endpoint created by the service.
To illustrate how Globus Auth can be used to address these concerns, we present a simple Graph service that accepts requests to generate graphs of temperature data. In response to a request, it retrieves data from a web server, generates graphs, and uses Globus Transfer to transfer the graphs to the requestor. It thus needs to authenticate and authorize the requestor and obtain dependent access tokens for a web server and Globus Transfer. A complete Python implementation of this example service is available at github.com/globus/globus-sample-data-portal, in the folder service. We use extracts (some simplified) from this implementation to illustrate how the Graph service works with Globus Auth.
The relevant authorization code is in figure 11.6 on the next page. The Graph service receives an HTTPS request with a header containing the access token in the form Authorization: Bearer <request-access-token>. It then uses the following code to (1) retrieve the access token and (2) call out to Globus Auth to retrieve information about the token, including its validity, client, scope, and effective identity. The Graph service can then (3–5) verify the token information and (6) authorize the request. (In our example, every request is accepted.)
This sample code has been written so that it only (5) accepts requests from an entity that can supply a PORTAL_CLIENT_ID, a service that we introduce later in the chapter. As we show in the next paragraph, it then requests and obtains dependent access tokens that allow it to transfer data on behalf of that entity. An alternative implementation, to be preferred if we want the Graph service to be more broadly useful, would have it look for the original resource owner’s (end user’s) token and then perform operations on their behalf.
As the Graph service needs to act as a client to the data service on which the datasets are located, it next requests dependent tokens from Globus Auth. This and subsequent code fragments in this section are from the file service/view.py.
# (1) Get the access token from the request
token = get_token(request.headers['Authorization'])

# (2) Introspect token to extract its properties
client = load_auth_client()
token_meta = client.oauth2_token_introspect(token)

# (3) Verify that the token is active
if not token_meta.get('active'):
    raise ForbiddenError()

# (4) Verify that the "audience" for this token is our service
if 'Graph Service' not in token_meta.get('aud', []):
    raise ForbiddenError()

# (5) Verify that identities_set in token includes portal client
if app.config['PORTAL_CLIENT_ID'] != token_meta.get('sub'):
    raise ForbiddenError()

# (6) Token has passed verification: stash in request global object
g.req_token = token
Figure 11.6: Selected and somewhat simplified code from the file service/decorators.py in the Graph service example.
client = load_auth_client()
dependent_tokens = client.oauth2_get_dependent_tokens(token)
Having retrieved these dependent tokens, it extracts from them the two access tokens that allow it to itself act as a client to the Globus Transfer service and to an HTTPS endpoint service from which it will retrieve datasets.
transfer_token = dependent_tokens.by_resource_server[
    'transfer.api.globus.org']['access_token']
http_token = dependent_tokens.by_resource_server[
    'tutorial-https-endpoint.globus.org']['access_token']
The service also extracts from the request details of the datasets to be graphed, and the identity of the requesting user for use when configuring the shared endpoint:

selected_ids = request.form.getlist('datasets')
selected_year = request.form.get('year')
user_identity_id = request.form.get('user_identity_id')
The Graph service next fetches each dataset via an HTTPS request to the data server, using code like the following. The previously obtained http_token provides the credentials required to authenticate to the data server.

response = requests.get(dataset,
                        headers=dict(Authorization='Bearer ' + http_token))
A graph is generated for each dataset. Then, the Globus SDK functions operation_mkdir and add_endpoint_acl_rule are used, as in section 11.2.1 on page 227, to request that Globus Transfer create a new shared endpoint accessible by the user identity that was previously extracted from the request header, user_identity_id. (The transfer_token previously obtained from Globus Auth provides the credentials required to authenticate to Globus Transfer.) Finally, the graph files are transferred to the newly created directory via HTTPS, using the same http_token as previously, and the Graph server sends a response to the requester, specifying the number and location of the graph files.
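As a sketch of that final step, the upload of one graph file might look like the following; https_base is an assumption standing in for the shared endpoint's HTTPS server URL, which the sample portal derives from its configuration:

import requests

# Upload a generated graph to the newly created shared folder via
# HTTPS, authenticating with the dependent token obtained earlier
with open('graph.png', 'rb') as f:
    r = requests.put(https_base + '/graphs/graph.png', data=f,
                     headers={'Authorization': 'Bearer ' + http_token})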
This example shows how Globus Auth allows you to outsource all identity management and authentication functions. Identities can be provided by federated identity providers, such as InCommon and Google. All REST API security functions, including consent and token issuance, validation, and revocation, are provided by Globus Auth. Your service needs only to provide service-specific authorization, which can be performed on the basis of identity or group membership. And because all interactions are compliant with OAuth2 and OIDC standards, any application that speaks these protocols can use your service as it would any other; your service can seamlessly leverage other services; and other services can leverage your service. You can easily build a service to be made available to others as part of the national cyberinfrastructure; equally, you can build a service that dispatches requests to other elements of that cyberinfrastructure.
11.5 The Research Data Portal Design Pattern
To further illustrate the use of Globus platform services in scientific applications and workflows, we describe how they may be used to realize a design pattern that Eli Dart calls the research data portal. In this design pattern, specialized properties of modern research networks are exploited to enable high-speed, secure delivery of data to remote users. In particular, the control logic used to manage data access and delivery is separated from the machinery used to deliver data over high-speed networks. In this way, order-of-magnitude performance improvements can be achieved relative to traditional portal architectures, in which control logic and data servers are co-located behind performance-limiting firewalls and on low-performance web servers.
11.5.1 The Vital Role of Science DMZs and DTNs
A growing number of research universities and laboratories worldwide are connected in a network fabric that links data stores, scientific instruments, and computational facilities at unprecedented speeds: 10 or even 100 gigabits per second (Gb/s). Increasingly, research networks are themselves connected to cloud providers at comparable speeds. Thus, in principle, it should be possible to move data between any element of science and engineering infrastructure with great rapidity.

In practice, real transfers often achieve nothing like these theoretically achievable peak speeds. One common reason for poor performance is firewalls or other bottlenecks in the network connection between the outside world and the device from/to which data are to be transferred: the so-called “last mile” (or, outside the US, the last kilometer). The firewalls are often there for a good reason, such as protecting the sensitive data contained on the administrative computers that are also connected to the global Internet. But they get in the way of high-bandwidth science and engineering traffic. The other common reason for poor performance is using tools not designed for performance, like secure copy (SCP).
Two concepts, the Science DMZ and the Data Transfer Node, are now being widely deployed to overcome this problem. The Science DMZ overcomes the challenges associated with multipurpose enterprise network architectures by placing resources that need high-performance connectivity in a special subnetwork that is close (from a network architecture perspective) to the border router that connects the institution to the high-speed wide area network. Traffic between those resources and the outside world can then bypass internal firewalls.

Note that the point here is not to circumvent security by putting it outside the firewall. Rather, it is about recognizing that there is certain traffic for which firewalls not only slow things down, but are not needed. The Science DMZ uses alternative network security approaches that are appropriate for such traffic. For example, the DTN is not wide open: the Science DMZ router blocks most ports. But the ports necessary for secure, high-performance data transfer are open, and avoid the packet-inspecting firewalls.
A Data Transfer Node (DTN) is a specialized device dedicated to data transfer functions. These devices are typically Linux servers constructed with high quality components, configured for both high-speed wide area data transfer and high-speed access to local storage resources. DTNs run the high-performance Globus Connect data transfer software, introduced in section 3.6 on page 51, to connect their storage to the Globus cloud and thus to the rest of the world. General-purpose computing and business productivity applications, such as email clients and document editors, are not installed; this restriction produces more consistent data transfer behavior and makes security policies easier to enforce.
The Science DMZ design pattern also includes other elements, such as integrated perfSONAR monitoring devices [244] for performance debugging, specialized security configurations, and variants used to integrate supercomputers and other resources. But this brief description covers the essentials. The U.S. Department of Energy's Energy Sciences Network (ESnet) has produced detailed configuration and tuning guides for Science DMZs and DTNs at fasterdata.es.net.
11.5.2 The Research Data Portal Application
Eli Dart coined the term research data portal to indicate a web service designed primarily to serve data to remote users. (The variant that accepts data from remote users, for example for analysis or publication, has similar properties.) A research data portal must be able to authenticate and authorize remote users, allow those users to browse and query a potentially large collection of data, and return selected data (perhaps after subsetting) to remote users. In other words, a research data portal is like a web server, except that the data that it serves may be orders of magnitude larger than typical web pages.
Figure 11.7 shows how research data portals have often been architected in the past. A single data portal server both runs portal logic and serves data from local storage. This architecture is simple but cannot easily achieve high performance. The problem is that the control logic, being concerned with sensitive topics such as authentication and authorization, needs to sit behind the enterprise firewall. But this arrangement means that all data served by the portal also pass through the firewall, which typically means that they are delivered at a small fraction of the theoretical peak performance of the available networks.
As figure 11.8 on the next page shows, Science DMZs and DTNs allow for new architectural approaches that combine high-speed access and secure operations. The basic idea is to separate what we may call the portal's control channel communications (i.e., those concerned with such tasks as user authentication and data search) from its data channel communications (i.e., those concerned with data upload and download). The former can be located on a modestly sized web server computer protected by the institution's firewall, with modest capacity networks, while the latter can be performed via high-speed DTNs and can use specialized protocols such as GridFTP. The research data portal design pattern thus defines distinct roles for the web server, which manages who is allowed to do what, and the Science DMZ, where authorized operations are performed.
Figure 11.7: A legacy data portal, in which both control traffic (queries, etc.) and data traffic must pass through the enterprise firewall. Figure courtesy Eli Dart.
Figure 11.8: A modern research data portal, showing the high-speed data path through
the border router and to the DTN in green and the control path through the enterprise
firewall to the portal server in red. Multiple DTNs provide for high-speed transfer between
network and storage. Figure courtesy Eli Dart.
11.5.3 Implementing the Design Pattern with Globus
We now need mechanisms to allow research code running on the portal server to manage access to, and drive transfers to and from, the DTNs. This is where Globus SDKs come in, as we discuss next. We consider a use case similar to the NCAR Research Data Archive example that follows. A user requests data for download; the portal makes the data available via four steps: (1) create a shared endpoint; (2) copy the requested data to that shared endpoint; (3) set permissions on the shared endpoint to enable access by the requesting user, and email the user a URL that they can use to retrieve data from the shared endpoint; and ultimately (perhaps after several days or weeks), (4) delete the new shared endpoint.
The NCAR Research Data Archive (RDA) [48] (rda.ucar.edu), operated by the U.S. National Center for Atmospheric Research, illustrates some of the issues that can arise when implementing a research data portal. This system contains more than 600 data collections, ranging in size from gigabytes to tens of terabytes, and including meteorological and oceanographic observations, operational and reanalysis model outputs, and remote sensing datasets, along with ancillary datasets, such as topography/bathymetry, vegetation, and land use.

The RDA data portal allows users to browse and search catalogs of environmental datasets, place datasets that they wish to download into a “shopping basket,” and then download selected datasets to their personal computer or other location. (RDA users are primarily researchers at federal and academic research laboratories. In 2014 alone, more than 11,000 people downloaded more than 1.1 petabytes.) The portal must thus implement a range of different functions, some totally domain-independent (e.g., user identities, authentication, and data transfer) and others more domain-specific (e.g., a catalog of environmental data collections). As we see later in the chapter, the beauty of the Globus approach is that much of the domain-independent logic (in particular, that associated with identity management, authentication, data movement, and data sharing) can be outsourced to cloud services.
We present in figure 11.9 a function rdp that implements these actions. As shown in the following, this function takes as arguments the identifier for the endpoint on which the shared endpoint is to be created; the folder on that endpoint for which sharing is to be enabled (here, Share123, or shared_dir in figure 11.1 on page 228); the folder on that endpoint from which the contents of the shared folder are to be copied; the identifier for the user to be granted access to the new endpoint; and an email address to send a notification of the new share.
rdp('b0254878-6d04-11e5-ba46-22000b92c6ec',
'Share123',
'~/TEST/',
'cce13ca1-493a-46e1-a1f0-08bc219638de',
'foster@anl.gov')
As noted in section 3.6 on page 51 and shown in this example, each Globus endpoint and user is named by a universally unique identifier (UUID). An endpoint's identifier can be determined via the Globus web client or programmatically; a user's identifier can be determined programmatically, as we show in notebook 8.
The code in figure 11.9 proceeds as follows. In steps 1 and 2, we obtain Transfer and Auth client references and use endpoint_autoactivate, a Globus SDK function, to ensure that the research data portal admin has a credential that permits access to the endpoint identified by host_id. (See section 3.6.1 on page 52 for more discussion of endpoint_autoactivate.)
In step 3, we call the function create_share of figure 11.2 on page 229, passing as parameters the Transfer client reference, the identifier for the endpoint on which the shared endpoint is to be created, and the path for the folder that is to be shared: in our example call, the directory /~/Share123. As discussed earlier, that function creates a shared endpoint for the new directory. At this point, the new shared endpoint exists and is associated with this directory. However, only the creating user has access to the new shared endpoint.
In step 4, we use a Globus transfer to copy the contents of the folder source_path to the new shared endpoint. (The transfer here is from the endpoint on which the new shared endpoint has been created, but it could be from any Globus endpoint that the research data portal admin is authorized to access.) We have already introduced the Globus Transfer SDK functions used here in section 3.6 on page 51.
In step 5, we call the grant_access function defined in figure 11.2 to grant our user access to the new shared endpoint. The function call specifies the type of access to be granted ('r': read only) and the message to be included in a notification email: 'Your data are available'. The invitation email sent to the user by the Globus SDK function add_endpoint_acl_rule is shown in figure 11.10.
The user is now authorized to download data from the new shared endpoint. That endpoint will typically be left operational for some period, after which it can be deleted, as shown in step 6. Note that deleting a shared endpoint does not delete the data that it contains: the research data portal administrator may want to retain the data for other purposes. If the data are not to be retained, we can use the Globus SDK function submit_delete to delete the folder.
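As a sketch of that cleanup step, assuming tc and host_id as in figure 11.9:

from globus_sdk import DeleteData

# Delete the shared folder and its contents from the host endpoint
ddata = DeleteData(tc, host_id, recursive=True)
ddata.add_item('/~/Share123/')
r = tc.submit_delete(ddata)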
from globus_sdk import TransferClient, TransferData, AuthClient

def rdp(host_id,      # Endpoint on which to create shared endpoint
        shared_dir,   # Directory name for shared endpoint
        source_path,  # Directory to copy shared data from
        user_id,      # User to share with
        email):       # Email address for the new share notification
    # (1) Obtain Transfer and Auth client references
    tc = TransferClient()
    ac = AuthClient()
    # (2) Activate host endpoint
    tc.endpoint_autoactivate(host_id)
    # (3) Create shared endpoint
    share_id = create_share(tc, host_id, '/~/' + shared_dir + '/')
    # (4) Copy data into the shared endpoint
    tc.endpoint_autoactivate(share_id)
    tdata = TransferData(tc, host_id, share_id,
                         label='Copy to share', sync_level='checksum')
    tdata.add_item(source_path, '/~/', recursive=True)
    r = tc.submit_transfer(tdata)
    tc.task_wait(r['task_id'], timeout=1000, polling_interval=10)
    # (5) Set access control to enable access by user
    grant_access(tc, ac, share_id, user_id, 'r',
                 'Your data are available')
    # (6) Ultimately, delete the shared endpoint
    tc.delete_endpoint(share_id)
Figure 11.9: Globus code to implement the research data portal design pattern.
From: Globus Notification <noreply@globus.org>
To: Portal server user <user@user.org>
Subject: Portal server admin (admin@therdp.org) shared folder "/" on "Share123" with you
Globus user Portal server admin (admin@therdp.org) shared the folder "/" on the endpoint
"Share123" (endpoint id: 698062fa-88ed-11e6-b029-22000b92c261) with user@user.org,
with the message:
Your data are available.
Use this URL to access the share:
https://www.globus.org/app/transfer?&origin_id=698062fa-88ed-11e6-b029-
22000b92c261&origin_path=/&add_identity=cce13ca1-493a-46e1-a1f0-08bc219638de
The Globus Team
support@globus.org
Figure 11.10: Invitation email sent by the program in figure 11.9.
A variant of this approach, with certain administrative advantages, is as follows. Rather than having the portal server create a new shared endpoint for each request, a single shared endpoint is created once and the portal is given the access manager role on the shared endpoint so that it can set ACL rules. Then, for each request it creates a folder on the shared endpoint, puts the data in that location, and sets an ACL rule to manage access. Cleanup is then simpler: the portal just removes the ACL rule and deletes the folder. A sketch of this variant follows.
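The following is a minimal sketch of this variant, assuming tc is an authenticated TransferClient, share_id names the long-lived shared endpoint, and the portal holds the access manager role on it; the helper names are illustrative:

import uuid
from globus_sdk import DeleteData

def publish_request(tc, share_id, user_id):
    # Create a per-request folder on the long-lived shared endpoint
    folder = '/' + str(uuid.uuid4()) + '/'
    tc.operation_mkdir(share_id, path=folder)
    # ... copy the requested data into 'folder', as in figure 11.9 ...
    # Grant the user read access to this folder only
    r = tc.add_endpoint_acl_rule(share_id, {
        'DATA_TYPE': 'access',
        'principal_type': 'identity',
        'principal': user_id,
        'path': folder,
        'permissions': 'r'
    })
    return folder, r['access_id']

def cleanup_request(tc, share_id, folder, access_id):
    # Remove the ACL rule, then delete the per-request folder
    tc.delete_endpoint_acl_rule(share_id, access_id)
    ddata = DeleteData(tc, share_id, recursive=True)
    ddata.add_item(folder)
    tc.submit_delete(ddata)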
11.6 The Portal Design Pattern Revisited
The preceding example shows the essentials of a Globus implementation of the research data portal design pattern. We provide in figure 11.11 a more abstract picture of the architecture that makes clear the components involved and their relationships. To recap, the portal web server at the center of the figure is where all custom logic associated with the research data portal sits. This portal server acts as a client, in the Globus Auth/OAuth2 sense, to the other services that it uses to handle the heavy lifting of authentication and authorization (Globus Auth), data transfer and sharing (Globus Transfer), and other computations (Other services). The user accesses portal capabilities via a web browser, and data transfers occur between Globus Connect servers at various locations.
Many variants of this basic research data portal design pattern can be imagined. A minor variant is to prompt the user for where they want their data placed; the portal then submits a transfer on the user's behalf to copy the data to the specified endpoint and path, hence automating yet another step (see the sketch after this paragraph). Or, the data that users access may come from experimental facilities rather than a data archive, in which case data may be deleted after successful download. Access may be granted to groups of users rather than individuals. A portal may allow its users to upload datasets for analysis and then retrieve analysis results. A data publication portal may accept data submissions from users, and load data that pass quality control procedures into a public archive. We give examples of several such variants in the following, and show that each can naturally be expressed in terms of the same basic design pattern.
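As an illustration of the first of these variants, here is a minimal sketch of submitting a transfer on the user's behalf. It assumes that user_tc is a TransferClient authorized with a transfer token obtained from the user during login, and that dest_ep and dest_path were supplied by the user via a web form:

from globus_sdk import TransferData

tdata = TransferData(user_tc, share_id, dest_ep,
                     label='Deliver requested data')
tdata.add_item('/', dest_path, recursive=True)
user_tc.submit_transfer(tdata)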
Similarly, while we have described the research data portal in the context of an institutional Science DMZ, in which (as shown in figure 11.7) the portal server and data store both sit within the research institution, other distributions are also possible and can have advantages. For example, the portal can be deployed on the public cloud for high availability, while the data sit in the Science DMZ to enable direct access from high-speed research networks and/or to avoid public cloud storage charges. Alternatively, the portal can be in the research institution and data in cloud storage. Or both components can be run on cloud resources.
Regardless of the specifics, a research data portal typically needs to perform mundane but important tasks such as determining the identity of a user who wants to access the service; controlling which users are able to access different data and other services within the portal; uploading data reliably, securely, and efficiently from a variety of locations to storage systems within the Science DMZ; downloading data reliably, securely, and efficiently from storage systems within the Science DMZ to a variety of locations; dispatching requests to other services on behalf of users; and logging all actions performed for purposes of audit, accounting, and reporting. Each task is modestly complex to implement and operate reliably and well. Building on top of existing services can not only greatly reduce development costs, but also increase code quality and interoperability via use of standards.
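The first of these tasks, determining user identity, reduces to a few SDK calls once login has completed. A minimal sketch, assuming auth_token is an access token with the openid scope obtained during the OAuth2 login flow:

from globus_sdk import AuthClient, AccessTokenAuthorizer

ac = AuthClient(authorizer=AccessTokenAuthorizer(auth_token))
info = ac.oauth2_userinfo()
print(info['sub'], info.get('preferred_username'))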
As figure 11.9 shows, the benefits of this approach lie not only in the separation of concerns between control logic and data movement. In addition, the portal developer and admin both benefit from the ability to hand off the management of file access and transfers to the Globus service. The use of Globus APIs makes it easy to implement a wide range of behaviors via simple programs; Globus handles the heavy lifting of high-quality, reliable, and secure authentication, authorization, and data management.
Figure 11.11: The research data portal architecture, showing principal components. Only
the portal web server logic needs to be provided by the portal developer. Not shown are
other applications that, like the Browser on the left, may access the portal server: for
example, command line, thick client, or mobile applications.
It is these capabilities that made it easy to realize the example systems mentioned in this chapter: the NCAR Research Data Archive, which provides high-speed delivery of research data to geoscientists; the DMagic data sharing system for data distribution from light sources; and the Sanger Imputation Service (described on the next page), which supports online analysis of user-provided genomic data.
11.7 Closing the Loop: From Portal to Graph Service
We have already shown in section 11.4 on page 240 how to use the Globus Auth SDK to implement a service that responds to requests from a portal server: the arrow labeled REST from the Portal web server to Other services in figure 11.11. Such calls might be used in a research data portal for several reasons. You might want to organize your portal as a lightweight front end (e.g., pure Javascript) that interacts with one or more remote backend services. Another reason is that you might want to provide a public REST API for the main portal machinery, so that other app and service developers can integrate with and build on your portal.
Now we look at the logic and code involved in generating such requests. Our research data service skeleton illustrates this capability. When a user selects the Graph option to request that datasets be graphed, the portal does not perform those graphing operations itself but instead sends a request to a separate Graph service. The request provides the names of the datasets to be graphed. The Graph service retrieves these datasets from a specified location, runs the graphing program, and uploads the resulting graphs to a dynamically created shared endpoint for subsequent retrieval. We describe in the following both the portal server and Graph server code used to implement this behavior.
Figure 11.12 shows a slightly simplified version of the portal code that sets up, sends, and processes the response from the graph request, using the Python Requests library [225]. The code (1) retrieves the access tokens obtained during authentication and extracts the access token for the graph service. (The graph service scope is requested during this flow.) It then (2) assembles the URL, (3) header (containing the Graph service access token), and (4) data for the REST call (including information about the requesting user), and (5) dispatches the call. The remainder of the code (6) checks for a valid response, (7) extracts the location of the newly created graph files from the response, and (8) directs the user to a Globus transfer browser to access the files.
Sanger Institute Imputation Service imputation.sanger.ac.uk. Operated by the Sanger Institute in the UK, this service allows you to upload files containing genome wide association study (GWAS) data from the 23andMe genotyping service and receive back the results of imputation and other analyses that identify genes that you are likely to possess based on those data. The service uses Globus APIs to implement a variant of the research data service design pattern, as follows.
A user who wants to use the service first registers an imputation job. As part of this process, they are prompted for their name, email address, and Globus identity, as well as the type of analysis to be performed. The Sanger service then requests Globus to create a shared endpoint, share that endpoint with the Globus identity provided by the user, and email a link to this endpoint to the user. The user clicks on that link to upload their GWAS data file, and the corresponding imputation task is added to the imputation queue at the Sanger Institute. Once the imputation task is completed, the Sanger service requests Globus to create a second shared endpoint to contain the output and to email the user a link to that new endpoint for download. The overall process differs from that of figure 11.9 only in that a shared endpoint is used for data upload as well as download.
# (1) Get access tokens for the Graph service
tokens = get_portal_tokens()
gs_token = tokens.get('Graph Service')['token']
# (2) Assemble URL for REST call
gs_url = '{}/{}'.format(app.config['SERVICE_URL_BASE'], 'api/doit')
# (3) Assemble request headers
req_headers = dict(Authorization='Bearer {}'.format(gs_token))
# (4) Assemble request data. Note retrieval of user info.
req_data = dict(datasets=selected_ids,
                year=selected_year,
                user_identity_id=session.get('primary_identity'),
                user_identity_name=session.get('primary_username'))
# (5) Post request to the Graph service
resp = requests.post(gs_url,
                     headers=req_headers,
                     data=req_data,
                     verify=False)
# (6) Check for valid response
resp.raise_for_status()
# (7) Extract information from response
resp_data = resp.json()
dest_ep = resp_data.get('dest_ep')
dest_path = resp_data.get('dest_path')
# (8) Show Globus endpoint browser for new data
return redirect(url_for('browse', endpoint_id=dest_ep,
                        endpoint_path=dest_path.lstrip('/')))
Figure 11.12: Slightly simplified version of the function graph() in file portal/view.py at github.com/globus/globus-sample-data-portal.
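On the other side of this call, the Graph service must verify the bearer token it receives before doing any work, using the approach of section 11.4. A minimal sketch, assuming the service is registered as a confidential client with Globus Auth and that client_id, client_secret, and access_token are supplied by its configuration and request handling:

from globus_sdk import ConfidentialAppAuthClient

ac = ConfidentialAppAuthClient(client_id, client_secret)
token_info = ac.oauth2_token_introspect(access_token)
if not token_info.get('active'):
    raise PermissionError('Invalid or expired access token')
# The token's 'sub' and 'scope' fields identify the caller and
# confirm that the Graph service scope was granted
user_id = token_info['sub']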
11.8 Summary
In the distributed, collaborative, data-rich world of modern science, the abilities to transfer, share, and analyze data regardless of location, and to navigate complex security regimes in so doing, are frequently essential to progress. We have described cloud-hosted platform services, Globus Auth, Transfer, and Sharing, to which developers of applications and tools that must operate in this world can outsource responsibility for such tasks. We have used the example of a research data portal to illustrate their use, but they can be used in many different configurations.
11.9 Resources
Globus has extensive online documentation for their REST APIs, Python SDK, and command line interface at dev.globus.org. Dart et al. [106] provide additional information on the Science DMZ concept and design pattern.