Part II

Computing in the Cloud

[Figure: roadmap of the book's five parts. Part I: Managing data in the cloud. Part II: Computing in the cloud. Part III: The cloud as platform. Part IV: Building your own cloud. Part V: Security and other topics.]

While scientists were first attracted to the cloud by the ability to store and
share data, it was the introduction of cheap on-demand computing that created a
paradigm shift. In this second part of the book, we follow the pattern established
in the preceding one: we first introduce principles and then show how you can use
both cloud portals and Python SDKs to compute on various cloud platforms.

Computing in the cloud has gone through a fascinating evolution. It started
with virtualization, an old computing technology invented in the context of
mainframe computers and later adopted within data centers as a means of allowing
customers to create environments and services that are uniquely tailored to their
needs. Virtual machines can be started and stopped easily, and the customer is
charged only for the time that the machine instance is running. In chapter 5, we
describe how to create and manage virtual machines on cloud platforms.

A second stage of the evolution of computing in the cloud was the introduction
of containers as a means of encapsulating software. Container technologies allow
researchers to package and share applications that can be deployed rapidly on any
cloud and then run with a single command. In chapter 6, we show you how to
create and deploy containers based on a technology called Docker.

Scale has always been a critical cloud capability and a major requirement of
scientists. By “scale” we mean the ability of computation to be spread over multiple
cloud servers to exploit parallelism in the application. In chapter 7, we consider
four types of parallel application execution:

• SPMD clusters in the cloud, for traditional HPC-style computing.

• Many-task, high-throughput parallel computation, characterized by a
  large bag of tasks that have few or no dependencies and thus can be executed
  in parallel.

• MapReduce- and BSP-style parallelism, in which a single thread of control
  applies parallel operators over distributed data. In the cloud, such
  computations often involve executing a directed graph of data parallel operations.
  This model is used in tools such as Spark, many of the open source data
  analytics tools, and most of the deep learning systems that we discuss in
  part III.

• Microservices, the most “cloud native” computational model. It uses
  frameworks such as Mesos and Kubernetes to allow applications to be composed
  of swarms of dockerized, mostly stateless, small communicating services.
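The difference between the many-task and MapReduce styles can be made concrete with a small sketch. The example below is only an illustration under stated assumptions: it uses Python's standard library on a single machine, with a thread pool standing in for a set of cloud servers, and the names `simulate`, `run_bag_of_tasks`, `count_words`, `merge_counts`, and `word_count` are hypothetical. It is not how Spark or any other tool named above is implemented.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def simulate(task_id):
    """Hypothetical independent task; it shares nothing with other tasks."""
    return task_id * task_id


def run_bag_of_tasks(task_ids, workers=4):
    """Many-task style: hand out independent tasks and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(simulate, task_ids))


def count_words(partition):
    """Map operator: count the words in one partition of the data."""
    counts = {}
    for word in partition.split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def merge_counts(left, right):
    """Reduce operator: merge two partial word counts into a new dict."""
    merged = dict(left)
    for word, n in right.items():
        merged[word] = merged.get(word, 0) + n
    return merged


def word_count(partitions, workers=4):
    """MapReduce style: one thread of control applies a parallel map
    over the partitions, then reduces the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, partitions))
    return reduce(merge_counts, partials, {})


if __name__ == "__main__":
    print(run_bag_of_tasks(range(5)))
    print(word_count(["a b a", "b c"]))
```

In the bag-of-tasks function the tasks never interact, so they could run anywhere in any order. In `word_count`, a single driver decides which operator runs next, and only the operators themselves are parallel.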

We also include a short discussion of serverless computing, a relatively new
capability of the big public clouds that, like many computational ideas, has deep
roots in operating system design. Briefly stated, it allows a programmer to define
the code for an application plus the events that should cause that code to execute,
and then release the code to the cloud in such a way that its execution does not
require any cloud resource deployment by the user or programmer.
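The serverless idea of pairing code with the events that trigger it can be sketched in a few lines. This is a toy, in-process illustration and not the API of any public cloud: the registry `_handlers`, the decorator `on_event`, the event name `"file_uploaded"`, and the `dispatch` function (which plays the role of the cloud platform) are all assumptions made for this example.

```python
# Hypothetical registry mapping an event type to the functions it triggers.
_handlers = {}


def on_event(event_type):
    """Decorator that binds a function to an event type."""
    def register(func):
        _handlers.setdefault(event_type, []).append(func)
        return func
    return register


@on_event("file_uploaded")
def summarize_upload(event):
    """The programmer writes only this function and names its trigger."""
    return f"received {event['name']} ({event['size']} bytes)"


def dispatch(event_type, event):
    """Stand-in for the platform: run every handler bound to the event."""
    return [handler(event) for handler in _handlers.get(event_type, [])]


if __name__ == "__main__":
    print(dispatch("file_uploaded", {"name": "run1.csv", "size": 10}))
```

The programmer supplies only `summarize_upload` and its trigger. Everything inside `dispatch`, including where and when the code runs, is the platform's concern.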