Chapter 11
The Globus Research Data
Management Platform
“Give me where to stand, and I will move the earth.”
We have seen how powerful cloud-based data storage and analysis services can
simplify working with large data. But not all science and engineering data live in
the cloud. Research is highly collaborative and distributed, and frequently requires
specialized resources: data stores, supercomputers, instruments. Thus data are
created, consumed, and stored in a variety of locations, including specialized
scientific laboratories, national facilities, and institutional computer centers.
Data movement and sharing, and authentication and authorization, are perennial
challenges that can impose considerable friction on research and collaboration.
We describe in this chapter a set of platform services that address these
challenges. The Globus cloud service provides data movement, data sharing, and
credential and identity management capabilities. We described briefly in section 3.6
on page 51 how these services can be accessed as software as a service, via web
interfaces. Here, we introduce more details on these services and describe the
Python SDKs that permit their use from within applications. We focus in particular
on how the Globus Auth service makes it straightforward to build science services
that can accept identities from different identity providers, use standard protocols
for authentication and authorization, and thus integrate naturally into a global
ecosystem of service providers and consumers. As a use case, we show how these
capabilities can be used to build research data portals.
11.1 Challenges and Opportunities of Distributed Data
Data movement is central to many research activities, including analysis, collaboration,
publication, and data preservation. Yet despite its importance and
ubiquity, this task remains surprisingly challenging in practice: storage systems
have different security configurations, achieving good transfer performance is nontrivial,
and as data sizes increase the likelihood of errors increases. Scientists and
engineers frequently struggle with such mundane tasks as authenticating and authorizing
user access to storage systems, establishing high-speed data connections,
and recovering from faults while a transfer proceeds.
Authentication and authorization are similarly central to science and engineering,
and for related reasons. Researchers often find themselves needing to navigate
a complex world of different identities, authentication methods, and credentials
as they access resources in different locations. For example, say that you need to
transfer data repeatedly from two sites, A and B, to a storage system at your home
institution, H. You have accounts at A and B, with a separate identity at each;
H will accept your home institution identity, thanks to the InCommon identity
management federation [ ]. You will commonly need to authenticate once for
each transfer: a painful process and one that prevents scripting. You would prefer
to instead authenticate just once, and then perform subsequent
transfers from A and B to H without further authentications.
The data sharing problem sits at the intersection of these two challenges. Say
you want to grant a collaborator access to data at your home institution. Setting
up an account just for that purpose is typically a time consuming process, if it is
possible at all. And it forces your collaborator to deal with yet another username
and password. You need to be able to enable access without a local account.
Globus services address these and other related challenges that arise when our
work requires the integration of resources across different locations. As well as an
easy-to-use, web-browser based interface, Globus provides REST APIs and Python
SDKs to enable the integration of Globus solutions into applications in ways that
reduce development costs and increase security, performance, and reliability.
11.2 The Globus Platform
Globus was first introduced in 2010 as a software-as-a-service solution to the
problem of moving data between pairs of storage systems or endpoints.
(An endpoint is a storage system that has been connected to the Globus cloud
services by using software called Globus Connect.) The Amazon-hosted Globus
software handles the complexity involved in transfers, such as authenticating and
authorizing user access to endpoints, creating a high-speed data connection between
endpoints, and recovering from faults while a transfer proceeds. Importantly, it
implements a third-party transfer model in which no data are transferred via
the Globus service: instead, data are transferred directly between endpoint pairs
by using a protocol called GridFTP that provides specialized support for high
performance and reliability [
]. Globus can also perform rsync-like updates when
doing repeated transfers, allowing transfer of only new or modified files from the
source to the destination. Direct HTTPS transfers to and from endpoints are also
supported, allowing web browser access to data stored on Globus endpoints.
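The rsync-like update can be pictured as a comparison of source and destination listings. The following is a minimal sketch of that idea in plain Python; it is not how Globus implements the feature. The function name files_to_transfer and the listing format (a dict that maps a file name to a (size, mtime) pair) are assumptions made purely for illustration.

```python
def files_to_transfer(source, destination):
    """Return names of files that are new or modified at the source.

    Both arguments map a file name to a (size_in_bytes, mtime) tuple.
    This listing format is hypothetical, chosen only for illustration.
    """
    selected = []
    for name, (size, mtime) in sorted(source.items()):
        if name not in destination:
            selected.append(name)  # new file: not yet at the destination
            continue
        dest_size, dest_mtime = destination[name]
        if size != dest_size or mtime > dest_mtime:
            selected.append(name)  # modified since the last transfer
    return selected


source = {"a.dat": (100, 10.0), "b.dat": (200, 20.0), "c.dat": (300, 30.0)}
destination = {"a.dat": (100, 10.0), "b.dat": (200, 15.0)}
print(files_to_transfer(source, destination))  # ['b.dat', 'c.dat']
```

In the Globus Python SDK, comparable behavior is requested with the sync_level option of a TransferData object, which selects the comparison used ("exists", "size", "mtime", or "checksum").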
The Globus team has subsequently built on this initial Globus Transfer
service by adding Globus Auth for identity and credential management, Globus Groups
for group management, Globus Sharing for data sharing, and Data Search for
data management. Importantly, the Globus team also created REST APIs and
Python SDKs to allow these capabilities to be used programmatically, from within
applications. It is these platform capabilities that we describe in this chapter,
building on the introductory material in section 3.6.1 on page 52, where we showed
how to use the Globus Python SDK to initiate, monitor, and control data transfers.
We first provide additional details on the programmatic use of Globus Sharing
capabilities, then introduce the use of Globus Auth, and finally present illustrative
examples of the use of these capabilities.
11.2.1 Globus Transfer and Sharing
We introduced Globus Sharing capabilities in section 3.6.2 on page 54. Here we
show how to use the Python SDK to manage sharing programmatically. Recall
that Globus Sharing allows a user to make a specified folder on a Globus endpoint
accessible to other Globus users. Figure 11.1 shows the idea. Bob has enabled
sharing of folder ~/shared_dir on Regular endpoint by creating Shared endpoint,
and then granting Jane access to that shared endpoint. Jane can then use Globus
Transfer to read and/or write files in the shared folder, depending on what rights
she has been granted.
As is the case with the Globus Transfer service presented in chapter 3, all data
sharing capabilities offered by the Globus web interface are also accessible via
the Python SDK. The code in figure 11.2 on page 229 illustrates their use. We
explain each of the two functions in the figure in turn. We use both functions in
section 11.5.3 on page 247 as part of a research data portal implementation.
Figure 11.1: The Globus shared endpoint construct allows an authorized administrator of
an endpoint (say Bob) to create a shared endpoint granting access to a folder
within that endpoint, to which they can then authorize access by others (say Jane).
First, the create_share function: We assume that we have previously initiated
a transfer client object, tc, in the manner illustrated in the first lines of figure 3.9
on page 55, and that this object is passed to the function, along with the endpoint
identifier and path for the folder that is to be shared (“Regular endpoint”
and ~/shared_dir, respectively, in figure 11.1). The function uses the Globus
SDK function operation_mkdir to request creation of the specified folder on
the specified endpoint. It then creates a parameter structure, calls the Globus
SDK function create_shared_endpoint to create a shared endpoint for the new
directory, and finally returns the identifier for the new endpoint.
Second, the grant_access function: This function requires both tc and a
Globus Auth client reference, ac (we introduce Auth in the next section); identifiers
for the shared endpoint (share_id) for which sharing is to be enabled and the user
(user_id: a UUID, as with endpoint identifiers) to whom access is to be granted;
the type of access to be granted (atype: can be 'r' or 'rw'); and a message
to be emailed to the user upon completion. The function uses the Globus Auth
SDK function get_identities to determine the identities that are associated with
the user for whom sharing is to be enabled, and extracts from this list an email
address. It then uses the Globus Transfer SDK function add_endpoint_acl_rule
to add an access control rule to the shared endpoint, granting the specified access
type to the specified user.
11.2.2 The rule_data Structure
Our example program passes a rule_data structure to the add_endpoint_acl_rule
function. The various elements specify, among other things:
'principal_type': the type of principal to which the rule applies;
# Create a shared endpoint on specified 'endpoint' and 'folder';
# return the endpoint id for the new endpoint.
# Supplied 'tc' is a Globus Transfer client reference.
def create_share(tc, endpoint, folder):
    # Create directory to be shared
    tc.operation_mkdir(endpoint, path=folder)
    # Create the shared endpoint on specified folder
    shared_ep_data = {
        'DATA_TYPE': 'shared_endpoint',
        'host_endpoint': endpoint,
        'host_path': folder,
        'display_name': 'Share ' + folder,
        'description': 'New shared endpoint'
    }
    r = tc.create_shared_endpoint(shared_ep_data)
    # Return identifier of the newly created shared endpoint
    return r['id']

# Grant 'user_id' access 'atype' on 'share_id'; email 'message'.
# Supplied 'tc' and 'ac' are Globus Transfer and Auth client refs.
def grant_access(tc, ac, share_id, user_id, atype, message):
    # (1) Look up the user's identities to obtain an email address
    r = ac.get_identities(ids=user_id)
    email = r['identities'][0]['email']
    rule_data = {
        'DATA_TYPE': 'access',
        'principal_type': 'identity',  # To whom is access granted?
        'principal': user_id,          # To an individual user
        'path': '/',                   # Grant access to this path
        'permissions': atype,          # Grant specified access
        'notify_email': email,         # Email invite to this address
        'notify_message': message      # Include this message in email
    }
    # (2) Add the access control rule to the shared endpoint
    tc.add_endpoint_acl_rule(share_id, rule_data)
Figure 11.2: Functions that use the Globus Python SDK to create a shared endpoint
and to grant a user access to it.
'principal': as the 'principal_type' is 'identity', this is the user id
with whom sharing is to be enabled;
'permissions': the type of access being granted: in this case read-only
('r'), but could also be read and write ('rw');
'notify_email': an email address to which an invitation to access the
shared endpoint should be sent; and
'notify_message': a message to include in the invitation email.
The 'principal_type' element can also take the value 'group', in which case
the 'principal' element must be a group id. Alternatively, it can take the values
'all_authenticated_users' or 'anonymous', in which cases the 'principal'
element must be an empty string.
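The constraints on these elements can be captured in a small helper that builds the structure and rejects inconsistent combinations before any request is sent. This is a sketch only: the helper name make_rule_data is hypothetical, and the set of principal types it accepts ('identity', 'group', 'all_authenticated_users', 'anonymous') reflects our reading of the Globus Transfer access rule format.

```python
VALID_PRINCIPAL_TYPES = {
    "identity", "group", "all_authenticated_users", "anonymous"
}


def make_rule_data(principal_type, principal, permissions, path="/",
                   notify_email=None, notify_message=None):
    """Build an access rule document like rule_data in figure 11.2."""
    if principal_type not in VALID_PRINCIPAL_TYPES:
        raise ValueError("unknown principal_type: %r" % principal_type)
    if permissions not in ("r", "rw"):
        raise ValueError("permissions must be 'r' or 'rw'")
    if principal_type in ("all_authenticated_users", "anonymous"):
        # These types name no particular principal.
        if principal != "":
            raise ValueError("principal must be '' for " + principal_type)
    elif not principal:
        raise ValueError("principal must be an identity or group id")
    rule = {
        "DATA_TYPE": "access",
        "principal_type": principal_type,
        "principal": principal,
        "path": path,
        "permissions": permissions,
    }
    # The notification fields are optional, so add them only when supplied.
    if notify_email is not None:
        rule["notify_email"] = notify_email
    if notify_message is not None:
        rule["notify_message"] = notify_message
    return rule


# Read-only access for every authenticated Globus user.
print(make_rule_data("all_authenticated_users", "", "r"))
```

The returned dict can be passed to add_endpoint_acl_rule in place of the hand-built structure in figure 11.2.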
11.3 Identity and Credential Management
We noted above the challenges that users face when authenticating to different
sites and services in the course of their work. Similarly, service developers need
mechanisms for establishing the identity of a requesting user and for determining
what that user is authorized to do. Figure 11.3 illustrates some of the concepts
and issues involved. An end user wants to run an application that makes requests
to remote services on her behalf. Those remote services may themselves want to
make further calls to other dependent services. For consistency with commonly
used terminology, we refer to the user as the resource owner, the application as
the client, and each remote and dependent service as a resource server.
Two interrelated problems frequently arise in such contexts. The first concerns
the use of alternative identity providers. A resource server frequently wants
to establish the identity of the user (i.e., resource owner) who issued an incoming
request, often to determine whether to grant access and sometimes simply to log who
is using its service. In the past, developers of resource servers often implemented
their own username-password authentication systems, but such approaches are
inconvenient and insecure. Instead, we want to allow a resource server to accept
credentials from other identity providers: for example, the one associated with a user’s
home institution. Furthermore, different resource servers may require different
credentials. For example, to transfer a file from the University of Chicago to
Lawrence Berkeley National Laboratory, I must authenticate with both my Chicago
and my Berkeley identities to establish my credentials to access file systems at
Chicago and Berkeley, respectively.
Figure 11.3: A schematic of the entities and interactions that engage in distributed
resource accesses, using the terminology of OAuth2.
The second problem concerns (restricted) delegation. A resource server
may need to perform actions on behalf of a requesting user. For example, it may
need to transfer files or perform computations. It may then need credentials that
allow it to establish its authority to perform such actions. (This requirement
is especially important if the resource server needs to operate in an unattended
manner, for example so that it can continue file transfers or computations while
the user eats lunch.) However, users may not want to grant unlimited rights to a
remote service to perform actions on their behalf, due to the potential for harm
if a credential is compromised. Thus, the ability to restrict the rights that are
delegated is important. For example, you might be ok with a service reading, but
not writing, files on a certain server. And you certainly do not want a compromised
service to be able to use other services that you have not authorized.
As we describe in the following, the cloud-hosted Globus Auth service addresses
these and other related concerns.
11.3.1 Globus Auth Is an Authorization Service
Globus Auth leverages two widely used web standards, the OAuth 2.0 Authorization
Framework (OAuth2) [ ] and OpenID Connect Core 1.0 (OIDC) [ ], to
implement solutions to these problems. OAuth2 is a widely used protocol that
applications can use to provide client applications with secure delegated access:
the delegation that we spoke about above. It works over HTTP and uses
access tokens to authorize servers, applications, and other entities. OIDC is a simple
identity layer on top of the OAuth2 protocol.
The cloud-hosted Globus Auth service is what OAuth2 calls an authorization
server. As such, it can issue access tokens to a client after successfully authenticating
the resource owner and obtaining authorization from that resource
owner for the client to access resources provided by a resource server. (This
Figure 11.4: A Globus Auth consent request, in this case for the Globus web application.
authorization process typically involves a request for consent, such as that shown
in figure 11.4.) The resource owner in this scenario is typically an end user, who
authenticates to a Globus Auth-managed Globus account using an identity issued
by one of an extensible set of (federated) identity providers supported by Globus
Auth. A resource owner could also be a robot, agent, or service acting on its own
behalf, rather than on behalf of a user; the client may be either an application
(e.g., web, mobile, desktop, command line) or another service acting as a client, as
we explain in subsequent discussion.
Having obtained an access token, the client can then present that token as part
of a request to the resource server for which the token applies, to demonstrate that
it is authorized to make the request. The token is included in the request via the
HTTPS Authorization header.
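As a concrete illustration, the following sketch builds (but does not send) a request that carries a token in its Authorization header, using only the Python standard library. The URL and token value are placeholders, and the helper name build_authorized_request is our own.

```python
import urllib.request


def build_authorized_request(url, access_token):
    """Return a request object that carries the token as a Bearer credential."""
    return urllib.request.Request(
        url, headers={"Authorization": "Bearer " + access_token}
    )


# Hypothetical URL and token, for illustration only; nothing is sent.
req = build_authorized_request(
    "https://resource.example.org/api/items", "TOKEN123"
)
print(req.get_header("Authorization"))  # Bearer TOKEN123
```

Because a bearer token grants access to whoever presents it, such requests should be sent only over HTTPS.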
Access tokens are thus the key to OAuth2 and Globus Auth. An access token
represents an authorization issued by a resource owner to a client, authorizing
the client to request access to a specified resource server on the resource owner’s
behalf. As we describe later, the resource server can then ask the Globus Auth
authorization service for details on what rights have been granted: a process that
is referred to as “introspection.” For example, if the resource owner in figure 11.3
wants to allow a client (e.g., a web portal) to access a remote service but only for
purposes of reading during the next hour, introspection of the associated token can
reveal those restrictions. Globus Auth thus addresses the problems of (restricted)
delegation. It also supports the linking of multiple identities, as we discuss below,
to address the problem of alternative identity providers.
A resource server receiving a token from a client can thus determine that the
resource owner has authorized it to perform certain actions on the resource owner’s
behalf. What if the resource server then wants to reach out to other resource servers,
for example to Globus Transfer to request a data transfer? A problem arises: the
resource server has a token that authorizes it to perform actions itself, but it has
no token that it can present to the Globus Transfer service to demonstrate that
the resource owner (the end user in our example) has authorized transfers.
This is where dependent services come in. When a resource server is
registered with Globus Auth, it can specify services that it needs to access to
perform its functions: its dependent services. A request to
Globus Auth for authorization to use that resource server then causes Globus Auth
to request consent from the user not only for the resource server itself but also
for its dependent services. We see an
example of this scenario in figure 11.4: the Globus web application has registered
Globus Transfer and Globus Groups as dependent services, and thus you see the
user being asked to consent to those uses. Once consent has been granted, the
resource server can request additional dependent access tokens, as required, that
it can then include in requests to other services that it makes on the authorizing
resource owner’s behalf.
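A resource server might obtain its dependent tokens along the following lines. This is a sketch under stated assumptions: auth_client is taken to be a globus_sdk.ConfidentialAppAuthClient created with the resource server's own client id and secret, and the helper name get_dependent_access_tokens is hypothetical.

```python
def get_dependent_access_tokens(auth_client, request_token):
    """Exchange the token received from a client for dependent access tokens.

    Returns a dict that maps each dependent resource server to an access
    token usable on the resource owner's behalf.
    """
    response = auth_client.oauth2_get_dependent_tokens(request_token)
    return {
        resource_server: token_info["access_token"]
        for resource_server, token_info in response.by_resource_server.items()
    }
```

Each returned token can then be presented to the corresponding dependent service, for example by wrapping it in a globus_sdk.AccessTokenAuthorizer and passing that to a TransferClient.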
OAuth2 and Globus Auth incorporate various complexities and subtleties,
but the basic steps are simple. A user accesses an application; Globus Auth
authenticates and requests consents from the end user; Globus Auth provides
access tokens to the application; the application uses access tokens to access other
services; a service receiving an access token can validate it and use it to request
dependent access tokens to access other services. Importantly, different actors can
play different roles at different times: your web browser can be a client to a web
service, which itself can act as a client to other services, and so on.
11.3.2 A Typical Globus Auth Workflow
We use figure 11.5 on page 235 to illustrate how Globus Auth works. The figure looks
complicated, but please bear with us: the underlying concepts are straightforward.
We describe each of the 12 steps shown in the figure in turn.
1. The end user accesses the application to make a request to a remote service.
The application might be a Web client or, alternatively, an application
running on the user’s desktop or some other computer.
2. The application contacts Globus Auth to request authorization for the use
of a set of scopes. A scope represents a set of capabilities provided by a
resource server for which an access token is to be granted. In this case, the
application requests two scopes: one for access to login information and one
for HTTPS/REST API access.
3. Globus Auth arranges for authentication of the user, using an identity provider
that is mutually acceptable to the user and the application. Because the user
only authenticates with the authorization server, the user’s credentials are
never shared with the client or with Globus Auth.
4. Globus Auth returns an authorization code to the user.
5. The user requests access tokens from Globus Auth, passing the previously
acquired authorization code to establish their right to obtain these tokens.
6. Access tokens are returned, one per requested scope. The issuance of multiple
tokens enhances security by limiting the impact of a compromise.
7. The client can now use the access token in an HTTPS/REST request to a
resource server, by setting an HTTPS Authorization: Bearer header with
the appropriate token. (For concreteness, the remote service is here shown
as Globus Transfer, but it could be anything.)
8. Using a recent OAuth2 extension [ ], the resource server can contact
Globus Auth to “introspect” the token and thus obtain answers to questions
such as “is the token valid?,” “which resource owner is it for?,” “what client is
making the request?,” and “which scope is it for?”
9. Globus Auth responds to the introspection request. The resource server can
use the provided information to make an authorization decision as to how it
responds to the client request.
10. The resource server can also use its access token to request dependent access
tokens for any dependent services. For example, Globus Transfer can retrieve
an access token for the Globus Groups resource server, so that it can check
if the requesting user is a member of a particular group before taking some
action like allowing access to a shared endpoint.
11. Globus Auth returns requested dependent tokens.
Figure 11.5: Entities and interactions involved in Globus Auth-mediated distributed
resource requests. Details are provided in the text.
12. The resource server uses a newly issued dependent access token in an
HTTPS/REST request to the second resource server.
There are other OAuth2 and Globus Auth details that are not covered here:
for example, refresh tokens (because an access token’s lifetime may be less than
that of an application) and the somewhat different methods used in the case of a
long-lived application rather than a web browser. Also, an alternative protocol is
used for rich clients such as the JavaScript-based Globus Transfer client that avoids
the need for steps 4 and 5; a variant of this flow supports mobile, command line,
and desktop applications: “native apps.” But we have covered the essentials.
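Steps 2 through 6 of the workflow can be sketched with the Globus Python SDK's authorization code calls. In this sketch, client is assumed to be a globus_sdk.ConfidentialAppAuthClient for a registered application; the function name run_login_flow and the obtain_code callback (which stands in for sending the user to Globus Auth and receiving the code at the redirect URI) are hypothetical.

```python
def run_login_flow(client, scopes, redirect_uri, obtain_code):
    """Run an OAuth2 authorization code flow and return access tokens.

    'obtain_code' is a caller-supplied function that takes the authorization
    URL, lets the user authenticate and consent there, and returns the
    authorization code delivered to 'redirect_uri'.
    """
    # Step 2: declare the scopes for which authorization is requested.
    client.oauth2_start_flow(redirect_uri, requested_scopes=scopes)
    authorize_url = client.oauth2_get_authorize_url()
    # Steps 3 and 4: the user authenticates; a code comes back.
    auth_code = obtain_code(authorize_url)
    # Steps 5 and 6: swap the code for access tokens.
    tokens = client.oauth2_exchange_code_for_tokens(auth_code)
    return {
        resource_server: info["access_token"]
        for resource_server, info in tokens.by_resource_server.items()
    }
```

The result holds one access token per resource server, which the application can then present in requests as in step 7.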
11.3.3 Globus Auth Identities
Globus Auth maintains information about the identities that its users may use to
authenticate. A Globus Auth identity has a unique, case-insensitive username (for
), issued by an identity provider (e.g., a university,
research laboratory, or Google), for which a user or client can prove possession via
an authentication process (e.g., presenting a password to the identity provider).
Globus Auth manages the use of identities (e.g., to login to clients and services),
their properties (e.g., contact information), and relationships among identities
(e.g., allowing login to an identity by using another linked, “federated” identity).
Globus Auth neither defines its own identity usernames nor verifies authentication
(e.g., via passwords) with identities. Rather, it acts as an intermediary
between external identity providers, on the one hand, and clients and services that
want to leverage identities issued by those providers, on the other. Globus Auth
assigns each identity that it encounters an identifier: a UUID that is guaranteed to
be unique among all Globus Auth identities, and that will never be reused. This
ID is what resource servers and clients should use as the canonical identifier for a
Globus Auth identity. Associated with this ID are an identity provider, a username
given to the identity by the provider, and other provider-supplied information such
as display name and contact email address.
An example Globus Auth identity. The following is an example of the information
that may be associated with a Globus Auth identity:
username : ro
id : de305d54-75b4-431b-adb2-eb6b9e546014
identity_provider :
display_name : Rocket J. Squirrel
email :
Globus supports more than 100 identity providers, and more are being added all
the time. Examples include the many US and international universities and other
institutions that support InCommon; various identity providers that support the
OpenID Connect protocol; Google; and the Open Researcher and Contributor ID
(ORCID). The process of integrating a new identity provider is beyond the scope
of this book, but it is a straightforward process. See the Globus documentation
for more information.
11.3.4 Globus Accounts
An identity can be used with Globus Auth to create a Globus account. A Globus
account has a primary identity, but can also have any number of other identities
linked to it as well. Thus, for example, Mr. Squirrel may create a Globus account
with the identity above and then link to that account a Google identity, his ORCID,
and an identity provided by a scientific facility to which he has access.
A Globus account is not an identity itself. It does not have its own name or
identifier. Rather, a Globus account is identified by its primary identity. Similarly,
profile information and other metadata are tied to identities, not to accounts. A
Globus account is simply a set of identities comprising the primary identity and
all identities linked to that primary identity.
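This definition maps naturally onto a small data structure. The sketch below is our own illustration, not a Globus SDK type: the class name GlobusAccount and its fields are hypothetical, and the linked identity ids are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlobusAccount:
    """An account is just a primary identity plus the identities linked to it."""
    primary_id: str
    linked_ids: frozenset = frozenset()

    def identity_ids(self):
        """Return every identity id in the account, primary included."""
        return frozenset({self.primary_id}) | self.linked_ids


account = GlobusAccount(
    primary_id="de305d54-75b4-431b-adb2-eb6b9e546014",
    linked_ids=frozenset({"placeholder-google-id", "placeholder-orcid-id"}),
)
print(len(account.identity_ids()))  # 3
```

Note that the account carries no name or identifier of its own: it is referred to by its primary identity, as described above.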
11.3.5 Using Globus Auth Identities
Clients and resource servers should always use the Globus Auth-provided identity
ID when referring to an identity, for example in access control lists, and when
referring to identities in a REST API. Clients and resource servers can use the
Globus Auth REST API to map any identity username to its (current) identity
ID, and request information about an identity ID (e.g., username, display_name,
provider, email), for example as follows:
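import globus_sdk
# Obtain reference to Globus Auth client. (In practice the client must be
# authorized, for example with an access token, before these calls succeed.)
ac = globus_sdk.AuthClient()
# Get identities associated with username 'globus@globus.org'
r = ac.get_identities(usernames='globus@globus.org')
# Extract zero or more UUIDs
ids = [identity['id'] for identity in r['identities']]
# Get identities associated with those UUIDs
r = ac.get_identities(ids=ids)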
The last command returns a JSON document containing a list of identities,
such as the following. (This example document contains just one identity.)
{'identities':
  [{'email': None,
    'id': '46bd0f56-e24f-11e5-a510-131bef46955c',
    'identity_provider': '7daddf46-70c5-45ee-9f0f-7244fe7c8707',
    'name': None,
    'organization': None,
    'status': 'unused',
    'username': ''}]}
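Since identity IDs are what should be stored in access control lists, a common follow-on step is to turn such a document into a username-to-ID mapping. The helper below is a minimal sketch; its name is hypothetical, and the sample document simply reuses the ID shown above together with the username queried earlier.

```python
def usernames_to_ids(identities_doc):
    """Map each username in a get_identities result to its identity ID."""
    return {
        identity["username"]: identity["id"]
        for identity in identities_doc["identities"]
    }


# Sample document in the shape returned by get_identities.
doc = {
    "identities": [
        {
            "id": "46bd0f56-e24f-11e5-a510-131bef46955c",
            "username": "globus@globus.org",
            "status": "unused",
        }
    ]
}
print(usernames_to_ids(doc))
```

The function uses only item access, so it works equally on a plain dict and on the response object returned by the SDK.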
11.3.6 Use of Globus Auth by Resource Servers
Having introduced various details of the Globus Auth server, Globus Auth identities,
and Globus accounts, we can now turn to the practical question of what we can
do with these mechanisms. In particular, we describe how resource servers can
use Globus Auth as an authorization server and thus both support sophisticated
OAuth2 and OpenID Connect functionality, and leverage other resource servers
that use Globus Auth.
Let us consider, for example, a research data service that accepts user requests
to analyze genomic sequence data. (We describe an example of such a system,
Globus Genomics, in section 14.4 on page 303.) This service is basically a data
and code repository with a REST API, which other applications can leverage to
access this repository programmatically.
This service is a resource server in the Globus Auth context. It needs to be able
to authenticate users, validate user requests, and make requests to other services
(e.g., to cloud or institutional storage to retrieve sequence data and store results,
and to computing facilities to perform computations) on a user’s behalf. Globus
Auth allows us to program each of these capabilities via manipulation of identities,
access tokens, and OAuth2 protocol messages.
Assume that some client to this service has already followed steps 1–6 in
figure 11.5 on page 235 and thus possesses the necessary access tokens. (The “client”
may be a web client to the data server, or some other web, mobile, desktop, or
command line application.) Interactions may then proceed as follows.
1. The client makes an HTTPS request to the resource server (the research
data service proper: in the following we refer to it as the “data service”) with
an Authorization: Bearer header containing an access token. (Step 7 in
figure 11.5.)
2. The data service calls the function oauth2_token_introspect provided by
the Globus Auth SDK, authorized by the data service’s client identifier
and client secret (see below), to validate the request access token, and
obtain additional information related to that token (scopes, effective identity,
identities set, etc.). If the token is not valid, or is not intended for use with
this resource server, Globus Auth returns an error.
3. The data service verifies that the request from its client conforms to the
scopes associated with the request access token.
4. The data service verifies the identity of the resource owner (typically an end
user) on whose behalf the client is acting. The data service may use this
identity as its local account identifier for this user.
5. The data service uses the set of identities associated with the account referred
to by the request access token to determine what the request is allowed to
do. For example, if the request is to access a resource that is shared with
particular identities, the data service should compare all of the account’s
identities (primary and linked identity ids) with the resource access control
permissions to determine if the request should be granted.
6. The data service may need to act as a client to other (dependent) resource
servers, as discussed above. In that case, the data service uses the Globus SDK
function oauth2_get_dependent_tokens to get dependent access tokens for
use with downstream resource servers, based on the request access token that
it received from the client.
7. The data service uses a dependent access token to make a request to a
dependent resource server.
8. The data service responds to its client with an appropriate response.
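The validation steps in this list (token introspection, scope check, and comparison of the account's identities against an access control list) might be combined as in the following sketch. It assumes that auth_client is a globus_sdk.ConfidentialAppAuthClient, that introspection is called with include="identity_set", and that the response then carries the fields active, scope, sub, and identity_set. The function name and the allowed_ids argument are hypothetical.

```python
def authorize_request(auth_client, authorization_header, required_scope,
                      allowed_ids):
    """Return the resource owner's identity id if the request is allowed.

    Returns None when the header is malformed, the token is inactive, the
    required scope is missing, or no identity in the account is permitted.
    """
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return None
    # Validate the token with Globus Auth.
    info = auth_client.oauth2_token_introspect(token, include="identity_set")
    if not info["active"]:
        return None
    # Confirm the request conforms to the token's scopes.
    if required_scope not in info["scope"].split():
        return None
    # Compare all of the account's identities with the access list.
    account_ids = set(info["identity_set"]) | {info["sub"]}
    if account_ids & set(allowed_ids):
        return info["sub"]
    return None
```

A data service would call this at the top of each request handler and return an HTTP 401 or 403 response when the result is None.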
A note regarding the client identifier and client secret mentioned in Step 2:
Each client and resource server must register with Globus Auth and obtain a
client id and client secret, which it can subsequently use with Globus Auth
to prove who it is in the various OAuth2 messages: for example, when swapping an
authorization code for an access token, calling token introspect, calling dependent
token grant, and using a refresh token to obtain a new access token.
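One standard way for a registered client to prove who it is, defined by the OAuth2 specification, is HTTP Basic authentication with the client id as the username and the client secret as the password. The sketch below shows how such a header is formed. The credentials are placeholders, the helper name is our own, and in practice the Globus SDK's confidential client object constructs this header for you.

```python
import base64


def client_basic_auth_header(client_id, client_secret):
    """Return an HTTP Basic Authorization header value for client credentials.

    Simplification: assumes the id and secret contain no characters that
    would need URL-encoding before being joined.
    """
    raw = "{}:{}".format(client_id, client_secret).encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


# Placeholder credentials, for illustration only.
header = client_basic_auth_header("my-client-id", "my-client-secret")
print(header.split(" ")[0])  # Basic
```

The client secret should be treated like a password: kept out of source code and supplied through configuration or the environment.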
11.3.7 Other Globus Capabilities
Globus also supports a growing set of other capabilities beyond those described
here. For example, table 11.1 lists additional functions supported by the Globus
Transfer Python SDK.
Table 11.1: Some of the close to 50 functions supported by the Globus Transfer Python
SDK. (Others mostly implement endpoint administration functions.)
Type         Function                   Description
             endpoint_search            Search on name, keywords, etc.
             get_endpoint               Get endpoint information
             my_shared_endpoint_list    Get endpoints that I manage
File system  operation_mkdir            Create a folder on endpoint
File system  operation_ls               List contents of endpoint
File system  operation_rename           Rename file or directory
             submit_transfer            Submit a transfer request
             submit_delete              Submit a delete request
             cancel_task                Cancel submitted request
             task_wait                  Wait for task to complete
             task_list                  Get information about tasks
             get_task                   Get information about a task
             task_event_list            Get event info for a task
             task_successful_transfers  Get successful transfers for task
             task_pause_info            Get info on why task paused
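Several of the task functions in table 11.1 are typically used together. The following sketch waits a bounded time for a task and cancels it if it has not finished. Here tc is assumed to be a Globus transfer client as in figure 11.2; the function name wait_or_cancel and the 'CANCEL_REQUESTED' return value are our own conventions, not part of the SDK.

```python
def wait_or_cancel(tc, task_id, timeout=60, polling_interval=10):
    """Wait up to 'timeout' seconds for a task; cancel it if still running.

    Returns the task's final status string, or 'CANCEL_REQUESTED' (a label
    local to this sketch) if the task had to be cancelled.
    """
    finished = tc.task_wait(
        task_id, timeout=timeout, polling_interval=polling_interval
    )
    if not finished:
        tc.cancel_task(task_id)
        return "CANCEL_REQUESTED"
    # The task document's 'status' field reports how the task ended.
    return tc.get_task(task_id)["status"]
```

A caller would pass the task identifier returned by submit_transfer or submit_delete.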
Other Globus services provide other