Chapter 10

Machine Learning in the Cloud

“Learning is any change in a system that produces a more or less

permanent change in its capacity for adapting to its environment.”

—Herbert Simon, The Sciences of the Artiﬁcial

Machine learning has become central to applications of cloud computing. While machine learning is considered part of the field of artificial intelligence, it has roots in statistics and mathematical optimization theory and practice. In recent years it has grown in importance as a number of critical application breakthroughs have taken place. These include human-quality speech recognition [144] and real-time automatic language translation, computer vision accurate and fast enough to propel self-driving cars, and applications of reinforcement learning that allow machines to master some of the most complex human games, such as Go [234].

What has enabled these breakthroughs has been a convergence of the availability of big data plus algorithmic advances and faster computers that have made it possible to train even deep neural networks. The same technology is now being applied to scientific problems as diverse as predicting protein structure [180], predicting the pharmacological properties of drugs, and identifying new materials with desired properties [264].

In this chapter we introduce some of the major machine learning tools that are

available in public clouds, as well as toolkits that you can install on a private cloud.

We begin with our old friend Spark and its machine learning (ML) package, and

then move to Azure ML. We progress from the core “classical” ML tools, including

logistic regression, clustering, and random forests, to take a brief look at deep

learning and deep learning toolkits. Given our emphasis on Python, the reader may expect us to cover the excellent Python library scikit-learn. However, scikit-learn is well covered elsewhere [253], and we introduced several of its ML methods in our microservice-based science document classifier example in chapter 7. We describe the same example, but using different technology, later in this chapter.

10.1 Spark Machine Learning Library (MLlib)

Spark MLlib [198], sometimes referred to as Spark ML, provides a set of high-level APIs for creating ML pipelines. It implements four basic concepts.

• DataFrames are containers created from Spark RDDs to hold vectors and other structured types in a manner that permits efficient execution. Spark DataFrames are similar to Pandas DataFrames and share some operations. They are distributed objects that are part of the execution graph. You can convert them to Pandas DataFrames to access them in Python.

• Transformers are operators that convert one DataFrame to another. Since they are nodes on the execution graph, they are not evaluated until the entire graph is executed.

• Estimators encapsulate ML and other algorithms. As we describe in the following, you can use the fit(...) method to pass a DataFrame and parameters to a learning algorithm to create a model. The model is now represented as a Transformer.

• A Pipeline (usually linear, but can be a directed acyclic graph) links Transformers and Estimators to specify an ML workflow. Pipelines inherit the fit(...) method from the contained estimator. Once the estimator is trained, the pipeline is a model and has a transform(...) method that can be used to push new cases through the pipeline to make predictions.
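The contract among these four concepts can be sketched in plain Python, without Spark. Everything in the sketch is an assumption made for illustration: the class names (Doubler, MeanEstimator, MeanModel, ToyPipeline, FittedPipeline) are hypothetical, a "DataFrame" is just a list of dicts, and evaluation is eager rather than deferred to an execution graph. It shows only how fit(...) turns an Estimator into a Transformer and how a fitted pipeline exposes transform(...).

```python
# Toy illustration of the Transformer / Estimator / Pipeline contract.
# This is NOT Spark code: all class names are hypothetical and a
# "DataFrame" is simply a list of dicts.

class Doubler:
    """Transformer: maps one table to another by adding a column."""
    def transform(self, rows):
        return [dict(r, doubled=2 * r["x"]) for r in rows]

class MeanEstimator:
    """Estimator: fit() learns from data and returns a model."""
    def fit(self, rows):
        mean = sum(r["doubled"] for r in rows) / len(rows)
        return MeanModel(mean)

class MeanModel:
    """Fitted model (a Transformer): predicts 1.0 above the learned mean."""
    def __init__(self, mean):
        self.mean = mean
    def transform(self, rows):
        return [dict(r, prediction=float(r["doubled"] > self.mean))
                for r in rows]

class ToyPipeline:
    """Chains stages; fit() returns a pipeline made only of Transformers."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # Estimator: train it, keep the model
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # output feeds the next stage
            fitted.append(stage)
        return FittedPipeline(fitted)

class FittedPipeline:
    def __init__(self, stages):
        self.stages = stages
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

model = ToyPipeline([Doubler(), MeanEstimator()]).fit(
    [{"x": 1}, {"x": 2}, {"x": 6}])
predictions = model.transform([{"x": 1}, {"x": 5}])
```

The fitted pipeline holds no Estimators, which is why it can be applied to new cases with transform(...) alone.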

Many transformers exist, for example to turn text documents into vectors of real numbers, convert columns of a DataFrame from one form to another, or split DataFrames into subsets. There are also various kinds of estimators, ranging from those that transform vectors by projecting them onto principal component vectors, to n-gram generators that take text documents and return strings of n consecutive words. Classification models include logistic regression, decision tree classifiers, random forests, and naive Bayes. The family of clustering methods includes k-means and latent Dirichlet allocation (LDA). The MLlib online documentation provides much useful material on these and related topics [29].


10.1.1 Logistic Regression

The example that follows employs a method called logistic regression [103], which we introduce here. Suppose we have a set of feature vectors $x_i \in \mathbb{R}^n$ for $i$ in $[0, m]$. Associated with each feature vector is a binary outcome $y_i$. We are interested in the conditional probability $P(y = 1 \mid x)$, which we approximate by a function $p(x)$. Because $p(x)$ is between 0 and 1, it is not expressible as a linear function of $x$, and thus we cannot use regular linear regression. Instead, we look at the "odds" expression $p(x)/(1 - p(x))$ and guess that its log is linear. That is:

$$\ln\left(\frac{p(x)}{1 - p(x)}\right) = b_0 + b \cdot x,$$

where the offset $b_0$ and the vector $b = [b_1, b_2, \ldots, b_n]$ define a hyperplane for linear regression. Solving this expression for $p(x)$ we obtain:

$$p(x) = \frac{1}{1 + e^{-(b_0 + b \cdot x)}}$$

We then predict $y = 1$ if $p(x) > 0.5$ (equivalently, if $b_0 + b \cdot x > 0$) and zero otherwise. Unfortunately, finding the best $b_0$ and $b$ is not as easy as in the case of linear regression. However, simple Newton-like iterations converge to good solutions if we have a sample of the feature vectors and known outcomes.

(We note that the logistic function $\sigma(t)$ is defined as follows:

$$\sigma(t) = \frac{1}{1 + e^{-t}}$$

It is used frequently in machine learning to map a real number into a probability range [0, 1]; we use it for this purpose later in this chapter.)
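The relationship between the logistic function, the probability estimate, and the prediction can be checked numerically. The following is a minimal sketch in plain Python; the offset and weight values are hypothetical, and the 0.5 decision threshold is the usual convention, assumed here rather than taken from the notebook.

```python
import math

def logistic(t):
    """Logistic function sigma(t) = 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def probability(x, b0, b):
    """p(x) for offset b0 and weight vector b."""
    return logistic(b0 + sum(bi * xi for bi, xi in zip(b, x)))

def predict(x, b0, b):
    """Predict y = 1 when p(x) > 0.5, that is, when b0 + b.x > 0."""
    return 1 if probability(x, b0, b) > 0.5 else 0

# Hypothetical parameters for a two-feature model.
b0, b = -1.0, [2.0, 0.5]
p = probability([1.0, 0.0], b0, b)     # logistic(-1.0 + 2.0) = logistic(1.0)
log_odds = math.log(p / (1.0 - p))     # recovers b0 + b.x
```

Taking the log of the odds recovers the linear expression, which is exactly the assumption that logistic regression makes.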

10.1.2 Chicago Restaurant Example

To illustrate the use of Spark MLlib, we apply it to an example from the Azure HDInsight tutorial [195], namely predicting whether restaurants pass or fail health inspections based on the free text of an inspector's comments. We provide two versions of this example in notebook 18: the HDInsight version and a version that runs on any generic Spark deployment. We present the second here.

The data, from the City of Chicago Data Portal (data.cityofchicago.org), are a set of restaurant health inspection reports. Each inspection report contains a report number, the name of the owner of the establishment, the name of the establishment, the address, and an outcome ("Pass," "Fail," or some alternative such as "out of business" or "not available"). It also contains the (free-text) English comments from the inspector.

We first read the data. If we are using Azure HDInsight, we can load it from blob storage as follows. We use a simple function csvParse that takes each line in the CSV file and parses it using Python's csv.reader() function.

    inspections = spark.sparkContext.textFile(
        'wasb:///HdiSamples/HdiSamples/FoodInspectionData/Food_Inspections1.csv'
    ).map(csvParse)

The version of the program in notebook 18 uses a slightly reduced dataset. We have eliminated the address fields and some other data that we do not use here.

    inspections = spark.sparkContext.textFile(
        '/path-to-reduced-data/Food_Inspections1.csv').map(csvParse)
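The csvParse helper is not listed in the text. One plausible way to write it, assuming Python 3 and only the standard library, is shown below; the sample line is made up for illustration and is not taken from the Chicago dataset.

```python
import csv
from io import StringIO

def csvParse(line):
    """Parse one line of CSV text into a list of field strings."""
    # csv.reader handles quoted fields that contain commas.
    return next(csv.reader(StringIO(line)))

# Hypothetical record: id, name, result, comments.
fields = csvParse('1,"Example Cafe, Inc.",Pass,no violations')
```

Because Spark's textFile splits the input on newlines, a line-at-a-time parser such as this one assumes that no field contains an embedded line break.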

We want to create a training set from a set of inspection reports that contain outcomes, for use in fitting our logistic regression model. We first convert the RDD containing the data, inspections, to create a DataFrame, df, with four fields: record id, restaurant name, inspection result, and any recorded violations.

    from pyspark.sql.types import (StructType, StructField,
                                   IntegerType, StringType)

    schema = StructType([StructField("id", IntegerType(), False),
                         StructField("name", StringType(), False),
                         StructField("results", StringType(), False),
                         StructField("violations", StringType(), True)])
    # Keep fields 0, 2, 3, and 4 of each parsed record.
    df = spark.createDataFrame(
        inspections.map(lambda l: (int(l[0]), l[2], l[3], l[4])), schema)
    df.registerTempTable('CountResults')

If we want to look at the first few elements, we can apply the show() function to return values to the Python environment.

    df.show(5)

    +-------+--------------------+---------------+--------------------+
    +-------+--------------------+---------------+--------------------+
    +-------+--------------------+---------------+--------------------+
    only showing top 5 rows


Fortunately for the people of Chicago, it seems that the majority of the inspections result in passing grades. We can use some DataFrame operations to count the passing and failing grades.

    print("Passing = %d" % df[df.results == 'Pass'].count())
    print("Failing = %d" % df[df.results == 'Fail'].count())

    Passing = 61204
    Failing = 20225

To train a logistic regression model, we need a DataFrame with a binary label and feature vector for each record. We do not want to use records associated with "out of business" or other special cases, so we map "Pass" and "Pass with conditions" to 1, "Fail" to 0, and all others to -1, which we filter out.

    from pyspark.sql.functions import UserDefinedFunction
    from pyspark.sql.types import DoubleType

    def labelForResults(s):
        if s == 'Fail':
            return 0.0
        elif s == 'Pass w/ Conditions' or s == 'Pass':
            return 1.0
        else:
            return -1.0

    label = UserDefinedFunction(labelForResults, DoubleType())
    labeledData = df.select(label(df.results).alias('label'),
                            df.violations).where('label >= 0')

We now have a DataFrame with two columns, label and violations, and we are ready to create and run the Spark MLlib pipeline that we will use to train our logistic regression model, which we do with the following code.

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import HashingTF, Tokenizer

    # 1) Define pipeline components
    # a) Tokenize 'violations' and place result in new column 'words'
    tokenizer = Tokenizer(inputCol="violations", outputCol="words")
    # b) Hash 'words' to create new column of 'features'
    hashingTF = HashingTF(inputCol="words", outputCol="features")
    # c) Create instance of logistic regression
    lr = LogisticRegression(maxIter=10, regParam=0.01)

    # 2) Construct pipeline: tokenize, hash, logistic regression
    pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

    # 3) Run pipeline to create model
    model = pipeline.fit(labeledData)

We first (1) define our three pipeline components, which (a) tokenize each violations entry (a text string) by reducing it to lower case and splitting it into a vector of words; (b) convert each word vector into a vector in $R^n$ for some $n$ by applying a hash function to map each word token to an index in that vector, whose entry counts the occurrences of the tokens that hash there (the new vectors have a fixed length $n$, set by the numFeatures parameter of HashingTF, and are stored as sparse vectors); and (c) create an instance of logistic regression. We then (2) put everything into a pipeline and (3) fit the model with our labeled data.

Recall that Spark implements a graph execution model. Here, the pipeline created by the Python program is the graph; this graph is passed to the Spark execution engine by calling the fit(...) method on the pipeline. Notice that the tokenizer component adds a column words to our working DataFrame, and hashingTF adds a column features; thus, the working DataFrame has columns label, violations, words, features when logistic regression is run. The names are important, as logistic regression looks for columns label, features, which it uses for training to build the model. The trainer is iterative; we give it 10 iterations (maxIter) and a regularization parameter (regParam) of 0.01.
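To make the tokenize-and-hash steps concrete, here is a small stand-alone sketch of the hashing trick in plain Python. It is an illustration under stated assumptions, not Spark's implementation: the function names are hypothetical, zlib.crc32 stands in for the hash function that Spark actually uses, the vector length of 16 is chosen only to keep the example small, and the comment text is made up.

```python
import zlib

def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()

def hashing_tf(words, num_features=16):
    """Map words to a sparse term-frequency vector {index: count}."""
    vector = {}
    for word in words:
        # crc32 is a deterministic stand-in; it is not Spark's hash function.
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vector[index] = vector.get(index, 0.0) + 1.0
    return vector

# Hypothetical inspector comment.
features = hashing_tf(tokenize("Food not protected food not labeled"))
```

Two different words can hash to the same index; a larger num_features makes such collisions rarer at the cost of longer (but still sparse) vectors.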

We can now test the model with a separate test collection as follows.

    testData = spark.sparkContext.textFile(
        '/data_path/Food_Inspections2.csv') \
        .map(csvParse) \
        .map(lambda l: (int(l[0]), l[2], l[3], l[4]))
    testDf = spark.createDataFrame(testData, schema).where(
        "results = 'Fail' OR results = 'Pass' OR "
        "results = 'Pass w/ Conditions'")
    predictionsDf = model.transform(testDf)

The logistic regression model has appended several new columns to the DataFrame, including one called prediction. To test our prediction success rate, we compare the prediction column with the results column.

    numSuccesses = predictionsDf.where(
        """(prediction = 0 AND results = 'Fail') OR
           (prediction = 1 AND (results = 'Pass' OR
                                results = 'Pass w/ Conditions'))""").count()
    numInspections = predictionsDf.count()
    print("There were %d inspections and there were %d predictions"
          % (numInspections, numSuccesses))
    print("This is a %2.2f%% success rate"
          % (float(numSuccesses) / float(numInspections) * 100))

We see the following output:

    There were 30694 inspections and there were 27774 predictions
    This is a 90.49% success rate


Before getting too excited about this result, we examine other measures of success, such as precision and recall, that are widely used in ML research. When applied to our ability to predict failure, recall is the probability that we predicted as failing a randomly selected inspection from those with failing grades. As detailed in notebook 18, we find that our recall probability is only 67%. Our ability to predict failure is thus well below our ability to predict passing. The reason may be that other factors involved with failure are not reflected in the report.
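Precision and recall for the failing class can be computed directly from (prediction, label) pairs. The sketch below uses plain Python and a handful of made-up pairs; it is not the code from notebook 18, and the function name is hypothetical.

```python
def precision_recall(pairs, positive=0.0):
    """pairs: iterable of (prediction, label); positive is the class of interest."""
    pairs = list(pairs)
    tp = sum(1 for p, y in pairs if p == positive and y == positive)
    fp = sum(1 for p, y in pairs if p == positive and y != positive)
    fn = sum(1 for p, y in pairs if p != positive and y == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical (prediction, label) pairs; 0.0 means "Fail", 1.0 means "Pass".
sample = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0),
          (0.0, 1.0), (1.0, 1.0), (1.0, 1.0)]
precision, recall = precision_recall(sample)
```

Here recall is the fraction of truly failing inspections that were predicted to fail, and precision is the fraction of predicted failures that really failed.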

10.2 Azure Machine Learning Workspace

Azure Machine Learning is a cloud portal for designing and training machine learning cloud services. It is based on a drag-and-drop component composition model, in which you build a solution to a machine learning problem by dragging parts of the solution from a palette of tools and connecting them together into a workflow graph. You then train the solution with your data. When you are satisfied with the results, you can ask Azure to convert your graph into a running web service using the model you trained. In this sense Azure ML provides customized machine learning as an on-demand service. This is another example of serverless computation. It does not require you to deploy and manage your own VMs; the infrastructure is deployed as you need it. If your web service needs to scale up because of demand, Azure scales the underlying resources automatically.

To illustrate how Azure ML works, we return to an example that we first considered in chapter 7. Our goal is to train a system to classify scientific papers, based on their abstracts, into one of five categories: physics, math, computer science, biology, or finance. As training data we take a relatively small sample of abstracts from the arXiv online library (arxiv.org). Each sample consists of a triple: a classification from arXiv, the paper title, and the abstract. For example, the following is the record for a 2015 paper in physics [83].

    ['Physics',
     'A Fast Direct Sampling Algorithm for Equilateral Closed Polygons. (arXiv:1510.02466v1 [cond-mat.stat-mech])',
     'Sampling equilateral closed polygons is of interest in the statistical study of ring polymers. Over the past 30 years, previous authors have proposed a variety of simple Markov chain algorithms (but have not been able to show that they converge to the correct probability distribution) and complicated direct samplers (which require extended-precision arithmetic to evaluate numerically unstable polynomials). We present a simple direct sampler which is fast and numerically stable.'
    ]


This example also illustrates one of the challenges of the classification problem: science has become wonderfully multidisciplinary. The topic given for this sample paper in arXiv is "condensed matter," a subject in physics. Of the four authors, however, two are in mathematics institutes and two are from physics departments, and the abstract refers to algorithms that are typically part of computer science. A human reader might reasonably consider the abstract to be describing a topic in mathematics or computer science. (In fact, multidisciplinary physics papers were so numerous in our dataset that we removed them in the experiment below.)

Let us start with a solution in Azure ML based on a multiclass version of the logistic regression algorithm. Figure 10.1 shows the graph of tasks. To understand this workflow, start at the top, which is where the data source comes into the picture. Here we take the data from Azure blob storage, where we have placed a large subset of our arXiv samples in a CSV file. Clicking the Import Data box opens the window that allows us to identify the URL for the input file.

Figure 10.1: Azure ML graph used to train a multiclass logistic regression model.

The second box down, Feature Hashing, builds a vectorizer based on the vocabulary in the document collection. This version comes from the Vowpal Wabbit library. Its role is to convert each document into a numerical vector corresponding to the key words and phrases in the document collection. This numeric representation is essential for the actual ML phase. To create the vector, we tell the feature hasher to look only at the abstract text. What happens in the output is that the vector of numeric values for the abstract text is appended to the tuple for each document. Our tuple now has a large number of columns: class, title, abstract, and vector[0], ..., vector[n-1], where n is the number of features. To configure the algorithm, we select two parameters, a hashing bin size and an n-gram length.

Before sending the example to ML training, we remove the English text of the abstract and the title, leaving only the class and the vector for each document. We accomplish this with a Select Columns in Dataset module. Next we split the data into two subsets: a training subset and a test subset. (We specify that Split Data should use 75% of the data for training and the rest for testing.)

Azure ML provides a good number of the standard ML modules. Each such module has various parameters that can be selected to tune the method. For all the experiments described here, we just used the default parameter settings. The Train Model component accepts as one input a binding to an ML method (recall this is not a dataflow graph); the other input is the projected training data. The output of the Train Model task is not data per se but a trained model that may also be saved for later use. We can now use this trained model to classify our test data. To this end, we use the Score Model component, which appends another new column to our table, Scored Label, providing the classification predicted by the trained model for each row.

To see how well we did, we use the Evaluate Model component, which computes a confusion matrix. Each row of the matrix tells us how the documents in that class were classified. Table 10.1 shows the confusion matrix for this experiment. Observe, for example, that a fair number of biology papers are classified as math. We attribute this to the fact that most biology papers in the archive are related to quantitative methods, and thus contain a fair amount of mathematics. To access the confusion matrix, or for that matter the output of any stage in the graph, click on the output port (the small circle) on the corresponding box to access a menu. Selecting visualize in that menu brings up useful information.

Table 10.1: Confusion matrix with only math, computer science, biology, and finance.

             bio    compsci  finance  math
    bio      51.3   19.9     4.74     24.1
    compsci  10.5   57.7     4.32     27.5
    finance  6.45   17.2     50.4     25.8
    math     6.45   16.0     5.5      72
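A confusion matrix of this kind, with each row expressed as percentages of the documents in that class, can be computed from paired lists of actual and predicted labels. The following plain-Python sketch uses made-up labels; it is not the code behind the Evaluate Model component, and the function name is hypothetical.

```python
def confusion_matrix(actual, predicted, classes):
    """Return {actual_class: {predicted_class: percent of that row}}."""
    counts = {a: {p: 0 for p in classes} for a in classes}
    for a, p in zip(actual, predicted):
        counts[a][p] += 1
    matrix = {}
    for a in classes:
        total = sum(counts[a].values())
        matrix[a] = {p: (100.0 * counts[a][p] / total if total else 0.0)
                     for p in classes}
    return matrix

classes = ["bio", "compsci", "finance", "math"]
# Hypothetical labels for six documents.
actual = ["bio", "bio", "bio", "bio", "math", "math"]
predicted = ["bio", "bio", "math", "compsci", "math", "math"]
cm = confusion_matrix(actual, predicted, classes)
```

Each row sums to 100 when the class has at least one document, which is a quick sanity check to apply to any such table.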


Now that we have trained the model, we can click the Set Up Web Service button (not visible, but at the bottom of the page) to turn the model into a web service. The Azure ML portal rearranges the graph by eliminating the split-train-test parts and leaves just the feature hashing, column selection, and the scoring based on the trained model. Two new nodes have been added: a web service input and a web service output. The result, with one exception, is shown in figure 10.2. The exception is that we have added a new Select Columns node so that we can remove the vectorized document columns from the output of the web service. We retain the original class, the predicted class, and the probabilities computed for the document being in a class.

Figure 10.2: Web service graph generated by Azure ML, with an additional node to

remove the vectorized document.

You can now try additional ML classifier algorithms simply by replacing the box Multiclass Logistic Regression with, for example, Multiclass Neural Network or Random forest classifier. Or, you can incorporate all three methods into a single web service that uses a majority vote ("consensus") method to pick the best classification for each document. As shown in figure 10.3, the construction of this consensus method is straightforward: we simply edit the web service graph for the multiclass logistic regression to add the trained models for the other two methods and then call a Python script to tie the three results together.


Figure 10.3: Modiﬁed web service graph based on a consensus model, showing three

models and a Python script component, used to determine the consensus.

The Python script can simply compare the outputs from the three classifiers. If any two agree, then it selects that classification as a first choice and the classification that does not agree as a second choice. The results for the first choice, shown in table 10.2, are only modestly better than in the logistic regression case, but if we consider both the first and second choices, we reach 65% for biology, 72% for computer science, 60% for finance, and 88% for math. Notebook 19 contains this Python script as well as the code used to test and invoke the services and to compute the confusion matrices.
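The voting rule can be sketched in a few lines of Python. This is not the script from notebook 19. The function name is hypothetical, and the behavior when all three classifiers disagree is not specified in the text, so the fallback used here (take the classifiers in the order given) is an assumption.

```python
from collections import Counter

def consensus(predictions):
    """Return (first_choice, second_choice) from three classifier outputs."""
    counts = Counter(predictions)
    first, votes = counts.most_common(1)[0]
    if votes == 3:                  # unanimous: there is no dissenting label
        return first, None
    if votes == 2:                  # majority label first, dissenter second
        second = next(p for p in predictions if p != first)
        return first, second
    # Assumption: with no agreement, fall back to the order of the inputs.
    return predictions[0], predictions[1]
```

For example, consensus(["math", "bio", "math"]) selects math as the first choice and bio as the second.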

Table 10.2: Confusion matrix for the three-way classifier.

             bio    compsci  finance  math
    bio      50.3   20.9     0.94     27.8
    compsci  4.9    62.7     1.54     30.9
    finance  5.6    9.9      47.8     36.6
    math     3.91   13.5     2.39     80.3


10.3 Amazon Machine Learning Platform

The Amazon platform provides an impressive array of ML services. Each is designed

to allow developers to integrate cloud ML into mobile and other applications. Three

of the four are based on the remarkable progress that has been enabled by the

deep learning techniques that we discuss in more detail in the next section.

Amazon Lex allows users to incorporate voice input into applications. This service is an extension of Amazon's Echo product, a small networked device with a speaker and a microphone to which you can pose questions about the weather and make requests to schedule events, play music, and report on the latest news. With Lex as a service, you can build specialized tools that allow a specific voice command to Echo to launch an Amazon Lambda function to execute an application in the cloud. For example, NASA has built a replica of the NASA Mars rover that can be controlled by voice commands, and has integrated Echo into several other applications around their labs [189].

Amazon Polly is the opposite of Lex: it turns text into speech. It can speak in 27 languages with a variety of voices. Using the Speech Synthesis Markup Language, you can carefully control pronunciation and other aspects of intonation. Together with Lex, Polly makes a first step toward conversational computing. Polly and Lex do not do real-time, voice-to-voice language translation the way Skype does, but together they provide a great platform to deliver such a service.

Amazon Rekognition is at the cutting edge of deep learning applications. It takes an image as input and returns a textual description of the items that it sees in that image. For example, given an image of a scene with people, cars, bicycles, and animals, Rekognition returns a list of those items, with a measure of certainty associated with each. The service is trained with many thousands of captioned images in a manner not unlike the way natural language translation systems are trained: it considers a million images containing a cat, each with an associated caption that mentions "cat," and a model association is formed. Rekognition can also perform detailed facial analysis and comparisons.

The Amazon Machine Learning service, like Azure ML, can be used to create a predictive model based on training data that you provide. However, it requires much less understanding of ML concepts than does Azure ML. The Amazon Machine Learning dashboard presents the list of experiments, models, and data sources from your previous Amazon Machine Learning work. From the dashboard you can define data sources and ML models, create evaluations, and run batch predictions.


Using Amazon Machine Learning is easy. For example, we used it to build a predictive model from our collection of scientific articles in under an hour. One reason that it is so easy to use is that the options are simple. You can build only three types of models: regression, binary classification, or multiclass classification. In each case, Amazon Machine Learning provides a single model. In the case of multiclass classification, it is multinomial logistic regression with a stochastic gradient descent optimizer. And it works well. Using the same test and training data as earlier, we obtained the results shown in table 10.3. Although the trained Amazon Machine Learning classifier failed to recognize any computational finance papers, it beat our other classifiers in the other categories. Amazon Labs has additional excellent examples [44].

Table 10.3: Confusion matrix for the science document classifier using Amazon ML.

             bio    compsci  finance  math
    bio      62.0   19.9     0.0      18.0
    compsci  3.8    78.6     0.0      17.8
    finance  6.8    2.5      0.0      6.7
    math     3.5    11.9     0.0      84.6

Amazon Machine Learning is also fully accessible from the Amazon REST interface. For example, you can create an ML model using Python as follows.

    import boto3

    client = boto3.client('machinelearning')
    # Request syntax: each 'string' is a placeholder, and MLModelType
    # takes exactly one of the three listed values.
    response = client.create_ml_model(
        MLModelId='string',
        MLModelName='string',
        MLModelType='REGRESSION'|'BINARY'|'MULTICLASS',
        Parameters={
            'string': 'string'
        },
        TrainingDataSourceId='string',
        Recipe='string',
        RecipeUri='string'
    )

The parameter MLModelId is a required, user-supplied, unique identifier; other parameters specify, for example, the maximum allowed size of the model, the maximum number of passes over the data in building the model, and a flag to tell the learners to shuffle the data. The training data source identifier names the data source that holds the training data, and the recipe is supplied either inline or as the URI of a recipe stored in S3. A recipe is a JSON-like document that describes how to transform the datasets for input while building the model. (Consult the Amazon Machine Learning documents for more details.) For our science document example, we used the default recipe generated by the portal.
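As a concrete illustration, the request for a multiclass model might be assembled as follows. This is a sketch under stated assumptions: the helper name, the model and data source identifiers, and the default values are hypothetical; the three Parameters keys are the training parameter names as we understand them from the Amazon Machine Learning documentation and should be checked there; and no call to AWS is made.

```python
def multiclass_model_request(model_id, name, datasource_id,
                             max_passes=10, max_model_bytes=33554432,
                             shuffle=True):
    """Build keyword arguments for create_ml_model (no AWS call is made)."""
    return {
        "MLModelId": model_id,
        "MLModelName": name,
        "MLModelType": "MULTICLASS",
        "Parameters": {
            # Parameter values are passed as strings.
            "sgd.maxPasses": str(max_passes),
            "sgd.maxMLModelSizeInBytes": str(max_model_bytes),
            "sgd.shuffleType": "auto" if shuffle else "none",
        },
        "TrainingDataSourceId": datasource_id,
    }

# Hypothetical identifiers, for illustration only.
request = multiclass_model_request("ml-arxiv-demo", "arXiv classifier",
                                   "ds-arxiv-train")
# With credentials configured, the call would look like this:
#   import boto3
#   client = boto3.client("machinelearning")
#   response = client.create_ml_model(**request)
```

Omitting Recipe and RecipeUri, as this sketch does, is intended to fall back on a default recipe, matching the choice made for the science document example.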


10.4 Deep Learning: A Shallow Introduction

The use of artificial neural networks for machine learning tasks has been common for at least 40 years. Mathematically, a neural network is a method of approximating a function. For example, consider the function that takes an image of a car as input and produces the name of a manufacturer as output. Or, consider the function that takes the text of a scientific abstract and outputs the most likely scientific discipline to which it belongs. In order to be a computational entity, our function and its approximation need a numerical representation. For example, suppose our function takes as input a vector of three real numbers and returns a vector of length two. Figure 10.4 is a diagram of a neural net with one hidden layer representing such a function.

Figure 10.4: Neural network with three inputs, one hidden layer, and two outputs.

In this schematic representation, the lines represent numerical weights connecting the inputs to the $n$ interior neurons, and the terms $b$ are offsets. Mathematically the function is given by the following equations.

$$a_j = f\Big(\sum_{i=0}^{2} x_i W^1_{i,j} + b^1_j\Big) \quad \text{for } j = 1, \ldots, n$$

$$y_j = f_2\Big(\sum_{i=1}^{n} a_i W^2_{i,j} + b^2_j\Big) \quad \text{for } j = 0, 1$$

The functions $f$ and $f_2$ are called the activation functions for the neurons. Two commonly used activation functions are the logistic function $\sigma(t)$ that we introduced at the beginning of this chapter and the rectified linear function:

$$\mathrm{relu}(x) = \max(0, x).$$

Another common case is the hyperbolic tangent function

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.$$

An advantage of $\sigma(t)$ and $\tanh(x)$ is that they map values in the range $(-\infty, \infty)$ to a bounded range, $(0, 1)$ for $\sigma$ and $(-1, 1)$ for $\tanh$, which corresponds to the idea that a neuron is either on or off (not-fired or fired). When the function represents the probability that an input corresponds to one of the outputs, we use a version of the logistic function that ensures that the probabilities all sum to one.

$$\mathrm{softmax}(x)_j = \frac{1}{1 + \sum_{k \neq j} e^{x_k - x_j}} = \frac{e^{x_j}}{\sum_k e^{x_k}}$$

This formulation is commonly used in multiclass classification, including in several of the examples we have studied earlier.
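The pieces introduced so far (weights, offsets, activation functions, and softmax) fit together into a forward pass through a small network. The sketch below uses plain Python lists and hypothetical weights, with two hidden neurons to keep the arithmetic easy to follow; it illustrates the equations and is not code from any of the toolkits discussed later.

```python
import math

def relu(x):
    return max(0.0, x)

def softmax(xs):
    """Normalize scores into probabilities that sum to one."""
    m = max(xs)                                  # subtract max for stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def layer(inputs, weights, offsets, activation):
    """One layer; weights[i][j] connects input i to neuron j."""
    return [activation(sum(inputs[i] * weights[i][j]
                           for i in range(len(inputs))) + offsets[j])
            for j in range(len(offsets))]

# Hypothetical weights: 3 inputs -> 2 hidden neurons -> 2 outputs.
W1 = [[0.5, -1.0], [0.25, 0.0], [-0.5, 1.0]]
b1 = [0.0, 0.5]
W2 = [[1.0, -1.0], [0.5, 0.5]]
b2 = [0.0, 0.0]

hidden = layer([1.0, 2.0, 3.0], W1, b1, relu)
scores = layer(hidden, W2, b2, lambda v: v)      # identity before softmax
y = softmax(scores)
```

Subtracting the maximum score before exponentiating does not change the result of softmax but avoids overflow for large inputs.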

The trick to making the neural net truly approximate our desired function is picking the right values for the weights. There is no closed form solution for finding the best weights, but if we have a large number of labeled examples $(x_i, y_i)$, we can try to minimize the cost function

$$C(x_i, y_i) = \sum_i \| y_i - y(x_i) \|,$$

where $y(x_i)$ is the network's output for input $x_i$.

The standard approach is to use a variation of gradient descent, specifically back propagation. We do not provide details on this algorithm here but instead refer you to two outstanding mathematical treatments of deep learning [143, 210].
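Although back propagation itself is beyond our scope, the underlying idea of gradient descent can be shown on the simplest possible model: a single linear neuron $y = wx + b$ fitted by minimizing squared error. The sketch below is an illustration only; the function name, learning rate, step count, and data are assumptions. Back propagation applies the same update rule to every weight in a multilayer network, using the chain rule to obtain the gradients.

```python
def fit_line(points, rate=0.05, steps=2000):
    """Minimize mean squared error for y = w*x + b by gradient descent."""
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
        # Step against the gradient to reduce the cost.
        w -= rate * grad_w
        b -= rate * grad_b
    return w, b

# Hypothetical data generated from y = 2x + 1.
w, b = fit_line([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)])
```

With these settings the fitted values approach w = 2 and b = 1; a learning rate that is too large would cause the iteration to diverge instead.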

10.4.1 Deep Networks

An interesting property of neural networks is that we can stack them in layers as

illustrated in ﬁgure 10.5 on the next page. Furthermore, using the deep learning

toolkits that we discuss in the remainder of this chapter, we can construct such

networks with just a few lines of code. In this chapter, we introduce three deep

learning toolkits. We ﬁrst illustrate how each can be used to deﬁne some of the

standard deep networks and then, in later sections, describe how to deploy and

apply them in the cloud.

MXNet (github.com/dmlc/mxnet) is the first deep learning toolkit that we consider. Using MXNet, the network in figure 10.5 would look as follows.

Figure 10.5: Neural network with three inputs, two hidden layers, and two outputs.

    import mxnet as mx

    data = mx.symbol.Variable('x')
    layr1 = mx.symbol.FullyConnected(data=data, name='W1', num_hidden=7)
    act1 = mx.symbol.Activation(data=layr1, name='relu1', act_type="relu")
    layr2 = mx.symbol.FullyConnected(data=act1, name='W2', num_hidden=4)
    act2 = mx.symbol.Activation(data=layr2, name='relu2', act_type="relu")
    layr3 = mx.symbol.FullyConnected(data=act2, name='W3', num_hidden=2)
    Y = mx.symbol.SoftmaxOutput(data=layr3, name='softmax')

The code creates a stack of fully connected networks and activations that

exactly describe our diagram. In the following section we return to the code needed

to train and test this network.

The term

deep neural network

generally refers to networks with many layers.

Several special case networks also have proved to be of great value for certain types

of input data.

10.4.2 Convolutional Neural Networks

Data with a regular spatial geometry such as images or one-dimensional streams are often analyzed with a special class of network called a convolutional neural network or CNN. To explain CNNs, we use our second example toolkit, TensorFlow (tensorflow.org), which was open sourced by Google in 2015. We consider a classic example that appears in many tutorials and is well covered in that provided with TensorFlow (tensorflow.org/tutorials).

Suppose you have thousands of 28×28 black and white images of handwritten digits and you want to build a system that can identify each. Images are strings of bits, but they also have a lot of local two-dimensional structure such as edges and holes. In order to find these patterns, we examine each of the many 5×5 windows in each image individually. To do so, we train the system to build a 5×5 template array $W_1$ and a scalar offset $b$ that together can be used to reduce each 5×5 window to a point in a new array conv by the following formula.

$$\mathrm{conv}_{p,q} = \sum_{i,k=-2}^{2} W_{i,k}\,\mathrm{image}_{p-i,\,q-k} + b$$

(The image is padded near the boundary points in the formula above so that none of the indices are out of bounds.) We next modify the conv array by applying the relu function to each element of the conv array so that it has no negative values. The final step, max pooling, simply computes the maximum value in each 2×2 block and assigns it to a smaller 14×14 array. The most interesting part of the convolutional network is that we do not use one 5×5 template $W_1$ but 32 of them in parallel, producing 32 14×14 results, pool1, as illustrated in figure 10.6.
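The three steps just described (template sum, relu, and 2×2 max pooling) can be sketched directly in NumPy. This is an illustrative sketch only, not the TensorFlow implementation: the function names, the zero padding, and the random templates are assumptions, and a trained network would learn the templates rather than draw them at random.

```python
import numpy as np

def conv2d_same(image, W, b):
    # conv[p, q] = sum over i, k in -r..r of W[i, k] * image[p - i, q - k] + b,
    # with the image zero-padded so that no index falls out of bounds
    r = W.shape[0] // 2
    n_rows, n_cols = image.shape
    padded = np.pad(image, r, mode="constant")
    out = np.full(image.shape, b, dtype=float)
    for i in range(-r, r + 1):
        for k in range(-r, r + 1):
            # padded[p - i + r, q - k + r] holds image[p - i, q - k]
            window = padded[r - i:r - i + n_rows, r - k:r - k + n_cols]
            out += W[i + r, k + r] * window
    return out

def relu(a):
    return np.maximum(a, 0.0)

def max_pool_2x2(a):
    # Maximum over each non-overlapping 2x2 block
    rows, cols = a.shape
    return a.reshape(rows // 2, 2, cols // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((28, 28))             # stand-in for one digit image
templates = rng.normal(size=(32, 5, 5))  # 32 templates applied in parallel
offsets = np.zeros(32)

pool1 = np.stack([max_pool_2x2(relu(conv2d_same(image, W, b)))
                  for W, b in zip(templates, offsets)])
print(pool1.shape)
```

Each of the 32 templates yields one 14×14 array, so `pool1` has shape (32, 14, 14).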

Figure 10.6: Schematic of how a convolutional neural net processes an image.

When the network is fully trained, each of the 32 5×5 templates in $W_1$ is somehow different, and each selects for a different set of features in the original image. One can think of the resulting stack of 32 14×14 arrays (called pool1) as a type of transform of the original image, which works much like a Fourier transform to separate a signal in space and time and transform it into frequency space. This is not what is going on here; but if you are familiar with these transforms, the analogy may be helpful.

We next apply a second convolutional layer to pool1, but this time we apply 64 sets of 5×5 filters to each of the 32 pool1 layers and sum the results to obtain 64 new 14×14 arrays. We then reduce these with max pooling to 64 7×7 arrays called pool2. From there we use a dense “all-to-all” layer and finally reduce it to 10 values, each representing the likelihood that the image corresponds to a digit 0 to 9. The TensorFlow tutorial defines two ways to build and train this network; figure 10.7 is from the community-contributed library called layers.

input_layer = tf.reshape(features, [-1, 28, 28, 1])
conv1 = tf.layers.conv2d(
    inputs=input_layer,
    filters=32,
    kernel_size=[5, 5],
    padding="same",
    activation=tf.nn.relu)
pool1 = tf.layers.max_pooling2d(inputs=conv1,
                                pool_size=[2, 2], strides=2)
conv2 = tf.layers.conv2d(
    inputs=pool1,
    filters=64,
    kernel_size=[5, 5],
    padding="same",
    activation=tf.nn.relu)
pool2 = tf.layers.max_pooling2d(inputs=conv2,
                                pool_size=[2, 2], strides=2)
pool2_flat = tf.reshape(pool2, [-1, 7 * 7 * 64])
dense = tf.layers.dense(inputs=pool2_flat,
                        units=1024, activation=tf.nn.relu)
logits = tf.layers.dense(inputs=dense, units=10)

Figure 10.7: TensorFlow two convolutional layer digit recognition network
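As a quick sanity check on the network in figure 10.7, the following plain Python sketch traces the tensor shapes and counts the trainable parameters layer by layer. The helper names are hypothetical, and the counts assume one bias per filter or output unit, which is the usual default for these layer types.

```python
def conv_params(kernel, in_channels, out_channels):
    # One kernel x kernel filter per (input channel, output channel) pair,
    # plus one bias per output channel
    return kernel * kernel * in_channels * out_channels + out_channels

def dense_params(n_in, n_out):
    # Full weight matrix plus one bias per output unit
    return n_in * n_out + n_out

# (height, width, channels) after each stage; "same" padding keeps height and
# width, and each 2x2 pooling step with stride 2 halves them
shapes = {
    "input": (28, 28, 1),
    "conv1": (28, 28, 32),
    "pool1": (14, 14, 32),
    "conv2": (14, 14, 64),
    "pool2": (7, 7, 64),
}
flat = 7 * 7 * 64

params = {
    "conv1": conv_params(5, 1, 32),
    "conv2": conv_params(5, 32, 64),
    "dense": dense_params(flat, 1024),
    "logits": dense_params(1024, 10),
}
total = sum(params.values())

for name, count in params.items():
    print(name, count)
print("total", total)
```

Almost all of the parameters sit in the dense layer that follows pool2, not in the convolutional layers, because each convolutional template is shared across every window of the image.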

As you can see, these operators explicitly describe the features of our CNNs. The full program is in the TensorFlow examples tutorial layers directory in file cnn_mnist.py. If you would rather see a version of the same program using lower-level TensorFlow operators, you can find an excellent Jupyter notebook version in the Udacity deep learning course material [ ]. CNNs have many applications in image analysis. One excellent science example is the solution to the Kaggle Galaxy Zoo Challenge, which asked participants to predict how Galaxy Zoo users would classify images of galaxies from the Sloan Digital Sky Survey. Dieleman [111] describes the solution, which uses four convolutional layers and three dense layers.


10.4.3 Recurrent Neural Networks

Recurrent neural networks (RNNs) are widely used in language modeling problems, such as predicting the next word to be typed when texting or in automatic translation systems. RNNs can learn from sequences that have repeating patterns. For example, they can learn to “compose” text in the style of Shakespeare [168] or even music in the style of Bach [183]. They have also been used to study forest fire area coverage [94] and cycles of drought in California [178].

The input to the RNN is a word or signal, along with the state of the system based on words or signals seen so far; the output is a predicted list and a new state of the system, as shown in figure 10.8.

Figure 10.8: Basic RNN with input stream x and output stream h.

Many variations of the basic RNN exist. One challenge for RNNs is ensuring that the state tensors retain enough long-term memory of the sequence so that patterns are remembered. Several approaches have been used for this purpose. One popular method is the Long Short-Term Memory (LSTM) version that is defined by the following equations, where the input sequence is $x$, the output is $h$, and the state vector is the pair $[c, h]$.

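$$
\begin{aligned}
i_t &= \sigma\left(W^{(xi)} x_t + W^{(hi)} h_{t-1} + W^{(ci)} c_{t-1} + b^{(i)}\right) \\
f_t &= \sigma\left(W^{(xf)} x_t + W^{(hf)} h_{t-1} + W^{(cf)} c_{t-1} + b^{(f)}\right) \\
c_t &= f_t \cdot c_{t-1} + i_t \cdot \tanh\left(W^{(xc)} x_t + W^{(hc)} h_{t-1} + b^{(c)}\right) \\
o_t &= \sigma\left(W^{(xo)} x_t + W^{(ho)} h_{t-1} + W^{(co)} c_t + b^{(o)}\right) \\
h_t &= o_t \cdot \tanh(c_t)
\end{aligned}
$$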

Olah provides an excellent explanation of how RNNs work [213]. We adapt one of his illustrations to show in figure 10.9 how information flows in our network. Here we use the vector concatenation notation concat as follows to compose the various $W$ matrices and thus obtain a more compact representation of the equations.

$$\sigma(\mathrm{concat}(x, h, c)) = \sigma(W[x, h, c] + b) = \sigma\left(W^{(x)} x + W^{(h)} h + W^{(c)} c + b\right).$$
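The compact notation relies on a simple block-matrix identity, which the following NumPy sketch checks numerically. The dimension `n`, the random values, and the variable names are assumptions for illustration; here every $W$ block is a full matrix, whereas the Python transcription of the LSTM later in this section applies its c weights elementwise, which corresponds to a diagonal $W^{(c)}$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n = 4  # illustrative vector length

# Column vectors x, h, c and one weight matrix per vector
x, h, c = (rng.normal(size=(n, 1)) for _ in range(3))
Wx, Wh, Wc = (rng.normal(size=(n, n)) for _ in range(3))
b = rng.normal(size=(n, 1))

# Stack the weight matrices side by side and the vectors on top of each other
W = np.hstack([Wx, Wh, Wc])        # shape (n, 3n)
concat = np.vstack([x, h, c])      # shape (3n, 1)

compact = sigmoid(W @ concat + b)
expanded = sigmoid(Wx @ x + Wh @ h + Wc @ c + b)
print(np.allclose(compact, expanded))
```

The two forms agree because multiplying the side-by-side matrix by the stacked vector is the same as summing the three separate products.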

Figure 10.9: LSTM information flow, adapted from Olah [213] to fit equations in the text.

We use a third toolkit, the Microsoft Cognitive Toolkit (formerly known as the Computational Network Toolkit, CNTK), to illustrate the application of RNNs. Specifically, we consider a sequence-to-sequence LSTM from the Microsoft Cognitive Toolkit distribution that is trained with input from financial news items such as “shorter maturities are considered a sign of rising rates because portfolio managers can capture higher rates sooner” and “j. p. <unk> vice chairman of grace and co. which holds an interest in this company was elected a director.” Following training, this network can be used to generate other sentences with a similar structure.

To illustrate what this network can do, we saved the trained $W$ and $b$ arrays and two other arrays that together define its structure. We then loaded these arrays into the Python version of the RNN shown below, which we created by transcribing the equations above.


def rnn(word, old_h, old_c):
    # Look up the embedding vector for the input word
    Xvec = getvec(word, E)
    # Input and forget gates
    i = Sigmoid(np.matmul(WXI, Xvec) +
                np.matmul(WHI, old_h) + WCI * old_c + bI)
    f = Sigmoid(np.matmul(WXF, Xvec) +
                np.matmul(WHF, old_h) + WCF * old_c + bF)
    # New cell state
    c = f * old_c + i * (np.tanh(np.matmul(WXC, Xvec) +
                                 np.matmul(WHC, old_h) + bC))
    # Output gate and new output
    o = Sigmoid(np.matmul(WXO, Xvec) +
                np.matmul(WHO, old_h) + (WCO * c) + bO)
    h = o * np.tanh(c)
    # Extract ordered list of five best possible next words
    q = h.copy()
    q.shape = (1, 200)
    output = np.matmul(q, W2)
    outlist = getwordsfromoutput(output)
    return h, c, outlist

As you can see, this code is almost a literal translation of the equations. The only difference is that the code has as input a text string for the input word, while the equations take a vector encoding of the word as input. The RNN training generated the encoding matrix $E$, which has the nice property that the $i$th column of the matrix corresponds to the word in the $i$th position in the vocabulary list. The function getvec(word, E) takes the embedding tensor $E$, looks up the position of the word in the vocabulary list, and returns the column vector of $E$ that corresponds to that word. The output of one pass through the LSTM cell is the vector $h$. This is a compact representation of the words likely to follow the input text to this point. To convert this back into “vocabulary” space, we multiply it by another trained matrix, W2. The size of our vocabulary is 10,000, and the vector output is that length. The $i$th element of the output represents the relative likelihood that the $i$th word is the next word to follow the input so far. Our addition, getwordsfromoutput, simply returns the top five candidate words, in order of likelihood.
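The helper functions used by the RNN code are not listed in the text. The following sketch shows one way they could be written; it is an assumption, not the notebook's code. In particular, the explicit `vocab` argument and the parameter `k` are hypothetical additions that keep the example self-contained, whereas the calls in the text rely on a vocabulary list defined elsewhere.

```python
import numpy as np

def Sigmoid(x):
    # Logistic function applied elementwise
    return 1.0 / (1.0 + np.exp(-x))

def getvec(word, E, vocab):
    # Column i of E is the embedding of the i-th vocabulary word;
    # indexing with [i] inside a list keeps the result a column vector
    i = vocab.index(word)
    return E[:, [i]]

def getwordsfromoutput(output, vocab, k=5):
    # Indices of the k largest scores, most likely first
    top = np.argsort(output.ravel())[::-1][:k]
    return [vocab[j] for j in top]

# Tiny illustrative vocabulary and embedding matrix (2 rows, 6 words)
vocab = ["a", "b", "c", "d", "e", "f"]
E = np.arange(12).reshape(2, 6)

vec = getvec("c", E, vocab)
scores = np.array([[0.1, 0.9, 0.3, 0.5, 0.2, 0.8]])
best = getwordsfromoutput(scores, vocab, k=3)
print(vec.ravel(), best)
```

For the network in the text, the embedding matrix would have one column per vocabulary word (10,000 columns) and the score vector would have 10,000 entries.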

To see whether this LSTM is truly a recurrent network, we provide the network with a starting word, let it suggest the next word, and repeat this process to construct a “sentence.” In the code below, we randomly pick one of the top three words suggested by the network as the next word.


c = np.zeros(shape=(200, 1))
h = np.zeros(shape=(200, 1))
output = np.zeros(shape=(10000, 1))
word = 'my'
sentence = word
for _ in range(40):
    h, c, outlist = rnn(word, h, c)
    # np.random.randint excludes the upper bound, so this picks index 0, 1, or 2
    word = outlist[np.random.randint(0, 3)]
    sentence = sentence + " " + word
print(sentence + ".")

Testing this code with the start word “my” produced the following output.

my new rules which would create an interest position here unless

there should prove signs of such things too quickly although the

market could be done better toward paying further volatility where

it would pay cash around again if everybody can .

Using “the” as our start word produced the following.

the company reported third-quarter results reflecting a number
compared between N barrels including pretax operating loss
from a month following fiscal month ending july earlier
compared slightly higher while six-month cds increased
sharply tuesday after an after-tax loss reflecting a strong.

This RNN is hallucinating financial news. The sentences are obviously nonsense, but they are excellent examples of mimicry of the patterns that the network was trained with. The sentences end rather abruptly because of the 40-word limit in the code. If you let it go, it runs until the state vector for the sentence seems to break down. Try this yourself. To make it easy to play with this example, we have put the code in notebook 20 along with the 50 MB model data.

10.5 Amazon MXNet Virtual Machine Image

MXNet [ ] (github.com/dmlc/mxnet) is an open source library for distributed parallel machine learning. It was originally developed at Carnegie Mellon, the University of Washington, and Stanford. MXNet can be programmed with Python, Julia, R, Go, Matlab, or C++ and runs on many different platforms, including clusters and GPUs. It is also now the deep learning framework of choice for Amazon [256]. Amazon has also released the Amazon Deep Learning AMI [ ], which includes not only MXNet but also CNTK and TensorFlow, plus other good toolkits that we have not discussed here, including Caffe, Theano, and Torch. Jupyter and the Anaconda tools are there, too.

Configuring the Amazon AMI to use Jupyter is easy. Go to the Amazon Marketplace on the EC2 portal and search for “deep learning”; you will find the Deep Learning AMI. Then select the server type. This AMI is tuned for the p2.16xlarge instances (64 virtual cores plus 16 NVIDIA K80 GPUs). This is an expensive option. If you simply want to experiment, it works well with a no-GPU eight-core option such as m4.2xlarge. When the VM comes up, log in with ssh and configure Jupyter for remote access as follows.

> cd .jupyter
> openssl req -x509 -nodes -days 365 -newkey rsa:1024 \
    -keyout mykey.key -out mycert.pem
> ipython
[1]: from notebook.auth import passwd
[2]: passwd()
Enter password:
Verify password:
Out[2]: 'sha1:----long string-----------'

Remember your password, and copy the long sha1 string. Next create the file .jupyter/jupyter_notebook_config.py, and add the following lines.

c = get_config()
c.NotebookApp.password = u'sha1:----long string-----------'
c.NotebookApp.ip = '*'
c.NotebookApp.port = 8888
c.NotebookApp.open_browser = False

Now invoke Jupyter as follows.

jupyter notebook --certfile=.jupyter/mycert.pem \
    --keyfile=.jupyter/mykey.key

Then, go to https://ipaddress:8888 in your browser, where ipaddress is the external IP address of your virtual machine. Once you have accessed Jupyter within your browser, visit src/mxnet/example/notebooks to run MXNet.

Many excellent MXNet tutorials are available in addition to those in the AMI notebooks file, for example on the MXNet community site. To illustrate MXNet’s use, we examine a particularly deep neural network trained on a dataset with 10 million images. Resnet-152 [152] is a network with 152 layers based on a concept called deep residual learning, which solved an important problem with training deep networks called the vanishing gradient problem. Put simply, the problem is that training by gradient descent methods fails for deep networks because, as the network grows in depth, the computable gradients become numerically so small that there is no stable descent direction. Deep residual training approaches the problem differently by adding an identity mapping from one layer to the next so that one solves a residual problem rather than the original. It turns out that the residual is much easier for stochastic gradient descent to solve.
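The idea of adding an identity mapping can be shown with a small NumPy sketch. It is a simplification under stated assumptions: real Resnet blocks use convolutions and batch normalization, while this sketch uses two small dense layers, and the names `plain_block` and `residual_block` are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def plain_block(x, W1, W2):
    # Two stacked layers that must learn the whole mapping F(x)
    return W2 @ relu(W1 @ x)

def residual_block(x, W1, W2):
    # The same two layers learn only the residual; the input is added back
    # through the identity mapping
    return x + plain_block(x, W1, W2)

x = np.array([1.0, -2.0, 3.0])
W_zero = np.zeros((3, 3))

# With all weights at zero, the plain block wipes out the signal, while the
# residual block passes it through unchanged
plain_out = plain_block(x, W_zero, W_zero)
residual_out = residual_block(x, W_zero, W_zero)
print(plain_out, residual_out)

# Stacking many residual blocks with zero weights still returns the input
deep = x
for _ in range(50):
    deep = residual_block(deep, W_zero, W_zero)
print(deep)
```

Because each block starts from the identity and only has to learn a correction to it, the signal (and, during training, the gradient) has a direct path through the whole stack.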

Resnets of various sizes have been built with each of the toolkits mentioned here. Here we describe one that is part of the MXNet tutorials [ ]. (Notebook 21 provides a Jupyter version.) The network has 150 convolutional layers with a softmax output on a fully connected layer, with 11,221 nodes representing the 11,221 image labels that it is trained to recognize. The input is a 3×224×224 RGB format image that is loaded into a batch normalization function and then sent to the first convolutional layer. The example first fetches the archived data for the model. There are three main files.

• resnet-152-symbol.json, a complete description of the network as a large json file

• resnet-152-0000.params, a binary file containing all parameters for the trained model

• synset.txt, a text file containing the 11,221 image labels, one per line

You can then load the pretrained model data, build a model from the data, and apply the model to a JPEG image. (The image must be converted to 3×224×224 RGB format: see notebook 21.)

import mxnet as mx
import numpy as np
from collections import namedtuple

# forward() expects a batch object with a .data attribute
Batch = namedtuple('Batch', ['data'])

# 1) Load the pretrained model data
with open('full-synset.txt', 'r') as f:
    synsets = [l.rstrip() for l in f]
sym, arg_params, aux_params = mx.model.load_checkpoint('full-resnet-152', 0)

# 2) Build a model from the data
mod = mx.mod.Module(symbol=sym, context=mx.gpu())
mod.bind(for_training=False, data_shapes=[('data', (1, 3, 224, 224))])
mod.set_params(arg_params, aux_params)

# 3) Send JPEG image to network for prediction
# (img is the image already converted to a 1 x 3 x 224 x 224 array; see notebook 21)
mod.forward(Batch([mx.nd.array(img)]))
prob = mod.get_outputs()[0].asnumpy()
prob = np.squeeze(prob)
a = np.argsort(prob)[::-1]
for i in a[0:5]:
    print('probability=%f, class=%s' % (prob[i], synsets[i]))
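The conversion of an image into the layout that the network expects is done in notebook 21. As a rough sketch of the final layout step only, assuming the image has already been decoded and resized to a 224×224×3 (height, width, channel) array by some image library, the axis reordering could look like this. The function name `to_network_input` is hypothetical.

```python
import numpy as np

def to_network_input(rgb_image):
    # Expect a height x width x channel array that is already 224 x 224 RGB
    if rgb_image.shape != (224, 224, 3):
        raise ValueError("expected a 224 x 224 x 3 image")
    # Move the channel axis to the front: (3, 224, 224)
    chw = np.transpose(rgb_image, (2, 0, 1))
    # Add a leading batch axis of size 1: (1, 3, 224, 224)
    return chw[np.newaxis, ...].astype(np.float32)

# Synthetic all-red image standing in for a decoded JPEG
fake = np.zeros((224, 224, 3), dtype=np.uint8)
fake[..., 0] = 255

img = to_network_input(fake)
print(img.shape)
```

The result has the shape (1, 3, 224, 224) that is declared in the `data_shapes` argument of the `bind` call.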


You will find that the accuracy of the network in recognizing images is excellent. We selected the four images in figure 10.10 from the Bing image pages, with a focus on biology. You can see from the results in table 10.4 that the top choice of the network was correct for each example, although the confidence was lower for yeast and seahorse. These results illustrate the potential for automatic image recognition in aiding scientific tasks.

Figure 10.10: Four sample images that we have fed to the MXNet Resnet-152 network.

Table 10.4: Identiﬁcation of images in ﬁgure 10.10 along with estimated probabilities.

Yeast | Streptococcus | Amoeba | Seahorse
p=0.26, yeast | p=0.75, streptococcus, streptococci, strep | p=0.70, ameba, amoeba | p=0.33, seahorse
p=0.21, microorganism | p=0.08, staphylococcus, staph | p=0.15, microorganism | p=0.12, marine animal, marine creature, sea animal
p=0.21, cell | p=0.06, yeast | p=0.05, ciliate, ciliated protozoan, ciliophoran | p=0.12, benthos
p=0.06, streptococcus, strep | p=0.04, microorganism, micro-organism | p=0.04, paramecium, paramecia | p=0.05, invertebrate
p=0.05, eukaryote, eucaryote | p=0.01, cytomegalovirus, CMV | p=0.03, photomicrograph | p=0.04, pipefish, needlefish

10.6 Google TensorFlow in the Cloud

Google’s TensorFlow is a frequently discussed and used deep learning toolkit. If you have installed the Amazon Deep Learning AMI, then you already have TensorFlow installed, and you can begin experimenting right away. While we have already introduced TensorFlow when discussing convolutional neural networks, we need to look at some core concepts before we dive more deeply.

Let us start with tensors, which are generalizations of arrays to dimensions beyond 1 and 2. In TensorFlow, tensors are created and stored in container objects that are one of three types: variables, placeholders, and constants. To illustrate the use of TensorFlow, we build a logistic regression model of some (fake) graduate school admissions decisions. Our data, shown in table 10.5, consist of a GRE exam score in the (pre-2012) range of 0 to 800; a grade point average (GPA) in the range 0.0 to 4.0; and the rank of the student’s undergraduate institution from 4 to 1 (top). The admission decision is binary.

Table 10.5: (Fake) graduate school admission data.

GRE  GPA  Rank  Decision
800  4.0  4     0
339  2.0  1     1
750  3.9  1     1
800  4.0  2     0

To build the model, we first initialize TensorFlow for an interactive session and define two variables and two placeholders, as follows.

import tensorflow as tf
import numpy as np
import csv

sess = tf.InteractiveSession()
x = tf.placeholder(tf.float32, shape=(None, 3))
y = tf.placeholder(tf.float32, shape=(None, 1))
# Set model weights
W = tf.Variable(tf.zeros([3, 1]))
b = tf.Variable(tf.zeros([1]))

The placeholder tensor $x$ represents the triple [GRE, GPA, Rank] from our data and the placeholder $y$ holds the corresponding admissions decision. $W$ and $b$ are the learned variables that minimize the cost function.

$$\mathrm{cost} = \sum_{i=0}^{N-1} \left(y_i - \sigma(W \cdot x_i + b)\right)^2$$

In this equation, $W \cdot x$ is the dot product, but the placeholders are of shape (None, 3) and (None, 1), respectively. In TensorFlow, this means that they can hold an array of size $N \times 3$ and $N \times 1$, respectively, for any value of $N$. The minimization step in TensorFlow now takes the following form, defining a graph with inputs $x$ and $y$ feeding into a cost function, which is then passed to the optimizer to select the $W$ and $b$ that minimize the cost. (The code takes the square root of this sum divided by the batch size; that root-mean-square form has the same minimizer.)


batch_size = 100  # needed by the cost; also set with the training parameters
pred = tf.sigmoid(tf.matmul(x, W) + b)
cost = tf.sqrt(tf.reduce_sum((y - pred)**2 / batch_size))
opt = tf.train.AdamOptimizer()
optimizer = opt.minimize(cost)
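To see what this graph computes, here is a NumPy sketch of the same prediction and cost, evaluated on the four rows of table 10.5. Several things are assumptions rather than part of the TensorFlow example: the features are scaled to the range 0 to 1, the optimizer is plain gradient descent on the mean squared error instead of Adam, and the learning rate and step count are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Rows of table 10.5: GRE, GPA, Rank (scaled to [0, 1]) and the decision
raw = np.array([[800, 4.0, 4],
                [339, 2.0, 1],
                [750, 3.9, 1],
                [800, 4.0, 2]], dtype=float)
x = raw / np.array([800.0, 4.0, 4.0])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

def cost(W, b):
    # Same form as the TensorFlow cost: root of the mean squared error
    pred = sigmoid(x @ W + b)
    return np.sqrt(np.sum((y - pred) ** 2 / len(x)))

W = np.zeros((3, 1))
b = np.zeros(1)
initial_cost = cost(W, b)  # every prediction is 0.5, so the cost is 0.5

learning_rate = 0.5
for _ in range(2000):
    pred = sigmoid(x @ W + b)
    # Gradient of the mean squared error with respect to W . x + b
    delta = 2.0 * (pred - y) * pred * (1.0 - pred) / len(x)
    W -= learning_rate * (x.T @ delta)
    b -= learning_rate * delta.sum(axis=0)

final_cost = cost(W, b)
print(initial_cost, final_cost)
```

With all weights at zero, every prediction is 0.5 and the cost is exactly 0.5; the gradient steps then lower it.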

The standard way to train a system in TensorFlow (and indeed in the other packages that we discuss here) is to run the optimizer with successive batches of training data. To do this, we need to initialize the TensorFlow variables with the current interactive session. We use a Python function get_batch() that pulls a batch of values from the train_data and train_label arrays.
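The text does not list get_batch(); the version used in the experiments is in notebook 23. The following is a hypothetical sketch that matches the call signature used in figure 10.11 and assumes that `train_data` and `train_label` are NumPy arrays with one row per training example. Sampling rows at random with replacement is also an assumption, as is the optional `rng` argument.

```python
import numpy as np

def get_batch(batch_size, train_data, train_label, rng=np.random.default_rng()):
    # Draw batch_size row indices at random (with replacement) and return
    # the matching feature rows and label rows
    idx = rng.integers(0, len(train_data), size=batch_size)
    return train_data[idx], train_label[idx]

# Synthetic data: 1,000 examples with 3 features; the label copies feature 0
# so that the pairing of rows and labels is easy to check
train_data = np.arange(3000, dtype=float).reshape(1000, 3)
train_label = train_data[:, :1].copy()

batch_xs, batch_ys = get_batch(100, train_data, train_label)
print(batch_xs.shape, batch_ys.shape)
```

The returned arrays have shapes (100, 3) and (100, 1), which fit the placeholders x and y defined earlier.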

training_epochs = 100000
batch_size = 100
display_step = 1000
# Later TensorFlow 1.x releases deprecate tf.initialize_all_variables()
# in favor of tf.global_variables_initializer()
init = tf.initialize_all_variables()
sess.run(init)

# Training cycle
for epoch in range(training_epochs):
    avg_cost = 0.
    total_batch = int(len(train_data) / batch_size)
    # Loop over all batches
    for i in range(total_batch):
        batch_xs, batch_ys = get_batch(batch_size, train_data, train_label)
        # Fit training using batch data
        _, c = sess.run([optimizer, cost],
                        feed_dict={x: batch_xs, y: batch_ys})
        # Compute average loss
        avg_cost += c / total_batch
    # Display logs per epoch step
    if (epoch + 1) % display_step == 0:
        print("Epoch:", '%04d' % (epoch + 1), "cost=", str(avg_cost))

Figure 10.11: TensorFlow code for training the simple logistic regression function.

The code segment in figure 10.11 illustrates how data are passed to the computation graph for evaluation with the sess.run() function via a Python dictionary, which binds the data to the specific TensorFlow placeholders. Notebook 23 provides additional details, including analysis of the results. You will see that training on the fake admissions dataset led to a model in which the decision to admit is based solely on the student graduating from the top school. In this case the training rapidly converged, since this rule is easy to “learn.” The score was 99.9% accurate. If we base the admission decision on the equally inappropriate policy of granting admission only to those students who either scored an 800 on the GRE or came from the top school, the learning does not converge as fast and the best we could achieve was 83% accuracy.

A good exercise for you to try would be to convert this model to a neural

network with one or more hidden layers. See whether you can improve the result!

10.7 Microsoft Cognitive Toolkit

We introduced the Microsoft Cognitive Toolkit in section 10.4.3, when discussing recurrent neural networks. The CNTK team has made this software available for download in a variety of formats so that deep learning examples can be run on Azure as clusters of Docker containers, in the following configurations:

• CNTK-CPU-InfiniBand-IntelMPI for execution across multiple InfiniBand RDMA VMs

• CNTK-CPU-OpenMPI for multi-instance VMs

• CNTK-GPU-OpenMPI for multiple GPU-equipped servers such as the NC class, which have 24 cores and 4 K80 NVIDIA GPUs

These deployments each use the Azure Batch Shipyard Docker model, part of Azure Batch [ ]. (Shipyard also provides scripts to provision Dockerized clusters for MXNet and TensorFlow with similar configurations.)

You also can deploy CNTK on your Windows 10 PC or in a VM running in any cloud. We provide detailed deployment instructions in notebook 22, along with an example that we describe below. The style of computing is similar to Spark, TensorFlow, and others that we have looked at. We use Python to build a flow graph of computations that we invoke with data using an eval operation. To illustrate the style, we create three tensors to hold the input values to a graph and then tie those tensors to the matrix-multiply operator and vector addition.

import numpy as np
import cntk

X = cntk.input_variable((1, 2))
M = cntk.input_variable((2, 3))
B = cntk.input_variable((1, 3))
Y = cntk.times(X, M) + B

X is a 1×2-dimensional tensor, that is, a vector of length 2; M is a 2×3 matrix; and B is a vector of length 3. The expression Y=X*M+B yields a vector of length 3. However, no computation has taken place at this point: we have only constructed a graph of the computation. To execute the graph, we input values for X, M, and B, and then apply the eval operator on Y. We use NumPy arrays to initialize the tensors and, in a manner identical to TensorFlow, supply a dictionary of bindings to the eval operator as follows.

x = [[np.asarray([[40, 50]])]]
m = [[np.asarray([[1, 2, 3], [4, 5, 6]])]]
b = [[np.asarray([1., 1., 1.])]]
print(Y.eval({X: x, M: m, B: b}))
----- output -------------
array([[[[ 241., 331., 421.]]]], dtype=float32)
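The arithmetic behind this output can be checked with ordinary NumPy, without building a CNTK graph. This is only a cross-check of the numbers, not a substitute for the eval call.

```python
import numpy as np

X = np.array([[40.0, 50.0]])            # 1 x 2
M = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])         # 2 x 3
B = np.array([[1.0, 1.0, 1.0]])         # 1 x 3

# 40*1 + 50*4 + 1 = 241, 40*2 + 50*5 + 1 = 331, 40*3 + 50*6 + 1 = 421
Y = X @ M + B
print(Y)
```

The result matches the three values returned by CNTK; the extra nesting in the CNTK output comes from the batch and sequence axes that it adds around each input.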

CNTK also supports several other tensor container types, such as Constant, for a scalar, vector, or other multidimensional tensor with values that do not change, and ParameterTensor, for a tensor variable whose value is to be modified during network training.

Many more tensor operators exist, and we cannot discuss them all here. However, one important class is the set of operators that can be used to build multilevel neural networks. Called the layers library, they form a critical part of CNTK. One of the most basic is the Dense(dim) layer, which creates a fully connected layer of output dimension dim. Many other standard layer types exist, including Convolutional, MaxPooling, AveragePooling, and LSTM. Layers can also be stacked with a simple operator called Sequential. We show two examples taken directly from the CNTK documentation [ ]. The first is a standard five-level image recognition network based on convolutional layers.

with default_options(activation=relu):
    conv_net = Sequential([
        # 3 layers of convolution and dimension reduction by pooling
        Convolution((5, 5), 32, pad=True), MaxPooling((3, 3), strides=(2, 2)),
        Convolution((5, 5), 64, pad=True), MaxPooling((3, 3), strides=(2, 2)),
        # 2 dense layers for classification
        Dense(64),
        Dense(10, activation=None)
    ])

The second example is a recurrent LSTM network that takes words embedded in a vector of size 150, passes them to the LSTM, and produces output through a dense network of dimension labelDim.

model = Sequential([
    Embedding(150),         # Embed into a 150-dimensional vector
    Recurrence(LSTM(300)),  # Forward LSTM
    Dense(labelDim)         # Word-wise classification
])

You use word embeddings when your inputs are sparse vectors of size equal to the word vocabulary (i.e., if item $i$ in the vector is 1, then the word is the $i$th element of the vocabulary), in which case the embedding matrix has size vocabulary-size by number of inputs. For example, if there are 10,000 words in the vocabulary and you have 150 inputs, then the matrix is 10,000 rows of length 150, and the $i$th word in the vocabulary corresponds to the $i$th row. The embedding matrix may be passed as a parameter or learned as part of training. We illustrate its use with a detailed example later in this chapter.
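The correspondence between a sparse one-hot input and a row of the embedding matrix can be verified with a short NumPy sketch. The random matrix and the chosen word index are assumptions for illustration; in CNTK the matrix would be passed in or learned during training.

```python
import numpy as np

vocab_size, embed_dim = 10000, 150
rng = np.random.default_rng(0)

# One row of length 150 per vocabulary word
E = rng.normal(size=(vocab_size, embed_dim))

# Sparse input vector for the word at position i in the vocabulary
i = 42
one_hot = np.zeros(vocab_size)
one_hot[i] = 1.0

# Multiplying the one-hot vector by E selects row i of E
embedded = one_hot @ E
print(embedded.shape, np.allclose(embedded, E[i]))
```

This is why an embedding layer can be implemented as a table lookup rather than a full matrix product.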

The Sequential operator used in the same code can be thought of as a concatenation of the layers in the given sequence. The Recurrence operator is used to wrap the correct LSTM output back to the input for the next input to the network. For details, we refer you to the tutorials provided by CNTK. One example of particular interest concerns reinforcement learning, a technique that allows networks to use feedback from dynamical systems, such as games, in order to learn how to control them. We reference a more detailed discussion online [134].

Azure also provides a large collection of pretrained machine learning services similar to those provided by the Amazon Machine Learning platform: the Cortana cognitive services. Specifically, these include web service APIs for speech and language understanding; text analysis; language translation; face recognition and attitude analysis; and search over Microsoft’s academic research database and graph. Figure 10.12 shows an example of their use.

10.8 Summary

We have introduced a variety of cloud and open source machine learning tools. We began with a simple logistic regression demonstration that used the machine learning tools in Spark running in an Azure HDInsight cluster. We next turned to the Azure Machine Learning workspace Azure ML, a portal-based tool that provides a drag-and-drop way to compose, train, and test a machine learning model and then convert it automatically into a web service. Amazon also provides a portal-based tool, Amazon Machine Learning, that allows you to build and train a predictive model and deploy it as a service. In addition, both Azure and Amazon provide pre-trained models for image and text analysis, in the Cortana services and the Amazon ML platform, respectively.

We devoted the remainder of this chapter to looking at deep learning and the TensorFlow, CNTK, and MXNet toolkits. The capabilities of these tools can sometimes seem almost miraculous, but as Oren Etzioni [118] observes, “Deep learning isn’t a dangerous magic genie. It’s just math.” We presented a modest introduction to the topic and described two of the most commonly used networks: convolutional and recurrent. We described the use of the Amazon virtual machine image (AMI) for machine learning, which includes MXNet, Amazon’s preferred deep learning toolkit, as well as deployments of all the other deep learning frameworks. We illustrated MXNet with the Resnet-152 image recognition network first designed by Microsoft Research. Resnet-152 consists of 152 layers, and we demonstrated how it can be used to help classify biological samples. This type of image recognition has been used successfully in scientific studies ranging from protein structure to galaxy classification [180, 60, 264, 111].

[
  {
    "faceRectangle": {
      "left": 45,
      "top": 48,
      "width": 62,
      "height": 62
    },
    "scores": {
      "anger": 0.0000115756638,
      "contempt": 0.00005204394,
      "disgust": 0.0000272641719,
      "fear": 9.037577e-8,
      "happiness": 0.998033762,
      "neutral": 0.00184232311,
      "sadness": 0.0000301841555,
      "surprise": 0.00000277762956
    }
  }
]

Figure 10.12: Cortana face recognition and attitude analysis web service. When applied to an image of a person on a sailboat, it returns the JSON document shown here. Cortana determines that there is one extremely (99.8%!) happy face in the picture.

We also used the Amazon ML AMI to demonstrate TensorFlow, Google’s open source contribution to the deep learning world. We illustrated how one defines a convolutional neural network in TensorFlow as part of our discussion of that topic, and we provided a complete example of using TensorFlow for logistic regression. Microsoft’s Cognitive Toolkit (CNTK) was the third toolkit that we presented. We illustrated some of its basic features, including its use for deep learning. CNTK also provides an excellent environment for Jupyter, as well as many good tutorials.

We have provided in this chapter only a small introduction to the subject of machine learning. In addition to the deep learning toolkits mentioned here, Theano [ ] and Caffe [161] are widely used. Keras (keras.io) is another interesting Python library that runs on top of Theano and TensorFlow. We also have not discussed the work done by IBM with their impressive Watson services, or systems such as Torch (torch.ch).

Deep learning has had a profound impact on the technical directions of each of the major cloud vendors. The role of deep neural networks in science is still small, but we expect it to grow.

Another topic that we have not addressed in this chapter is the performance of ML toolkits for various tasks. In chapter 7 we discussed the various ways by which a computation can be scaled to solve bigger problems. One approach is the SPMD model of communicating sequential processes, using the Message Passing Interface (MPI) standard (see section 7.2). Another is the graph execution dataflow model (see chapter 9), used in Spark, Flink, and the deep learning toolkits described here.

Clearly we can write ML algorithms using either MPI or Spark. We should therefore be concerned about understanding the relative performance and programmability of the two approaches. Kamburugamuve et al. [166] address this topic and demonstrate that MPI implementations of two standard ML algorithms perform much better than the versions in Spark and Flink. Often the differences in execution time were orders of magnitude. They also acknowledge that the MPI versions were harder to program than the Spark versions. The same team has released a library of MPI tools called SPIDAL, designed to perform data analytics on HPC clusters [116].

10.9 Resources

The classic Data Mining: Concepts and Techniques [148], recently updated, provides a strong introduction to data mining and knowledge discovery. Deep Learning [143] is an exceptional treatment of that technology.


For those interested in learning more of the basics of machine learning with Python and Jupyter, two good books are Python Machine Learning [224] and Introduction to Machine Learning with Python: A Guide for Data Scientists [207]. All the examples in this chapter, with the exception of k-means, involve supervised learning. These books treat the subject of unsupervised learning in more depth.

On the topic of deep learning, each of the three toolkits covered in this chapter (CNTK, TensorFlow, and MXNet) provides extensive tutorials in its standard distribution, when downloaded and installed.

We also mention the six notebooks introduced in this chapter.

• Notebook 18 demonstrates the use of Spark machine learning for logistic regression.

• Notebook 19 can be used to send data to an AzureML web service.

• Notebook 20 demonstrates how to load and use the RNN model originally built with CNTK.

• Notebook 21 shows how to load and use the MXNet Resnet-152 model to classify images.

• Notebook 22 discusses the installation and use of CNTK.

• Notebook 23 illustrates simple logistic regression using TensorFlow.