Chapter 16
History, Critiques, Futures
“I’ve looked at clouds from both sides now / From up and down and
still somehow / It’s cloud’s illusions I recall / I really don’t know clouds
at all”
—Joni Mitchell
We have devoted 14 chapters to how you can use the cloud for scientific research.
We now spend some time on context, covering, in turn, the historical context
from which today’s cloud emerged; contemporary critiques of cloud computing;
and some important directions in which cloud technologies are developing. This
material is brief, but we hope it will stimulate thought and discussion.
16.1 Historical Perspectives
The idea of computing as a utility is far from new. Artificial intelligence pioneer
Professor John McCarthy, speaking at MIT’s centennial celebration in 1961, opined
that: “Computing may someday be organized as a public utility just as the telephone
system is a public utility.” He went on to predict a future in which:
“Each subscriber needs to pay only for the capacity he actually uses,
but he has access to all programming languages characteristic of a
very large system ... Certain subscribers might offer service to other
subscribers ... The computer utility could become the basis of a new
and important industry.”
McCarthy’s words were inspired by what he saw as the possibilities of time
sharing, recently demonstrated in project Multics [100]. If many people could
run on the same computer at the same time, then why not leverage economies
of scale and use one computer to serve the needs of many people? This concept
led to the mainframe, but it seems that McCarthy had something more ambitious
in mind: perhaps a single computing utility to serve an entire nation? (At a
similar talk at Stanford, McCarthy was apparently challenged by a physicist who
observed that “this idea will never work: a simple back-of-the-envelope calculation
shows that the amount of copper wire required to connect users to the computing
utility would be impossible.” This exchange provides a useful warning of the
difficulties inherent in technological predictions, when new developments—in this
case optical fiber—can upend fundamental assumptions. But it was also accurate:
the large-scale realization of computing utilities was for a long time hindered by
network limitations.)
These ideas continued to percolate in the imaginations of researchers. In 1966,
Parkhill produced a prescient book-length analysis [217] of the challenges and
opportunities of utility computing, and in 1969, when UCLA turned on the first
node of the ARPANET, Internet pioneer Leonard Kleinrock claimed that “as
[computer networks] grow up and become more sophisticated, we will probably see
the spread of ‘computer utilities’ which, like present electric and telephone utilities,
will service individual homes and offices across the country” [248].
The large-scale realization of computing utilities had to wait until networks
were faster. In the early 1990s, various groups started to deploy then-new optical
networking technologies for research purposes. In the US, gigabit testbeds
linked a number of universities and research laboratories. Inspired by what might
be possible now that computers were connected at speeds close to the memory
bandwidth, researchers started to talk about metacomputers [237]—virtual
computational systems created by linking components at different sites. Out of
these discussions grew the idea of a computational grid, which “by analogy to
the electric power grid provides access to power on demand, achieves economies
of scale by aggregation of supply, and depends on large-scale federation of many
suppliers and consumers for its effective operation” [126]. Software and protocols
were developed for remote access to storage and computing, and many scientific
communities leveraged these developments to federate computing facilities on local,
national, and even global scales. For example, high energy physicists designing
the Large Hadron Collider (LHC) realized that they needed to federate computing
systems at hundreds of sites if they were to analyze the many petabytes of data to
be produced by LHC experiments; in response, they developed the LHC Computing
Grid (LCG) [175].
Grid computing enabled on-demand access to computing, storage, and other
services, but its impact was primarily limited to science [127]. (One exception
was within the enterprise, where “enterprise Grids” were widely deployed. These
deployments are today often called “private clouds,” with the principal difference
being the use of virtualization to facilitate dynamic resource provisioning.) The
emergence of cloud computing around 2006 is a fascinating story of marketing,
business model, and technological innovation. A cynic could observe, with some
degree of truth, that many articles from the 1990s and early 2000s on grid computing
could be—and often were—republished by replacing every occurrence of “grid”
with “cloud.” But this is more a comment on the fashion- and hype-driven nature
of technology journalism (and, we fear, much academic research in computer
science) than on cloud itself. In practice, cloud is about the effective realization
of the economies of scale to which early grid work aspired but which it could not
achieve because of inadequate supply and demand. The success of cloud is due to profound
transformations in these and other aspects of the computing ecosystem.
Cloud is driven, first and foremost, by a transformation in demand. It is
no accident that the first successful infrastructure-as-a-service business emerged
from an e-commerce provider. As Amazon CTO Werner Vogels tells the story,
Amazon realized, after its first dramatic expansion, that it was building out
literally hundreds of similar work-unit computing systems to support the different
services that contributed to Amazon’s online e-commerce platform. Each such
system needed to be able to scale its capacity rapidly to queue requests, store
data, and acquire computers for data processing. Refactoring across the different
services produced services like Amazon’s Simple Queue Service, Simple Storage
Service, and Elastic Compute Cloud. Those services (and other similar services
from other cloud providers, as described in previous chapters) have in turn been
successful in the marketplace because many other e-commerce businesses need
similar capabilities, whether to host simple e-commerce sites or to provide more
sophisticated services such as video on demand.
Cloud is also enabled by a transformation in transmission. While the U.S. and
Europe still lag behind broadband leaders such as South Korea and Japan, the
number of households with megabits per second or faster connections is large and
growing. One consequence is the widespread adoption of data-intensive services
such as YouTube and Netflix. Another is that businesses feel increasingly able to
outsource business processes such as email, customer relationship management,
and accounting to software-as-a-service (SaaS) vendors.
Finally, cloud is enabled by a transformation in supply. Both IaaS vendors and
companies offering consumer-facing services (e.g., search: Google, auctions: eBay,
social networking: Facebook, Twitter) require enormous quantities of computing
and storage. Leveraging advances in commodity computer technologies, these
and other companies have learned how to meet those needs cost effectively within
enormous data centers themselves [69] or, alternatively, have outsourced this aspect
of their business to IaaS vendors. The commoditization of virtualization [67, 227]
has facilitated this transformation, making it far easier than before to allocate
computing resources on demand, with a precisely defined software stack installed.
In our opinion, it is the transformational changes in demand, transmission, and
supply, and the resulting virtuous circle of increased use, better networks, and
reduced costs, that account for the tremendous success of cloud technologies. It
will be interesting to see where the next set of disruptive changes will occur, a
topic that we consider in section 16.3.
16.2 Critiques
The reader will by now have realized that we are great fans of the power of the
outsourcing and automation that cloud computing provides. We believe that by
enabling users facing mundane or challenging computational tasks to focus on
their problem, rather than the task of acquiring and operating computational
infrastructure, cloud computing can frequently increase productivity and thus
discovery and innovation.
Nevertheless, it is also important to be aware of the various critiques that have
been levied against cloud, some of which, in our opinion, speak to real or potential
limitations and some to misunderstandings or differences of opinion. We review
some of those critiques in the following. (As we have already discussed security
concerns in chapter 15, we do not revisit them here.)
16.2.1 Cost
A common critique of cloud is that it is too expensive. We do not dismiss the
importance of such concerns, particularly in academic settings where personnel
and equipment spending may not be fungible. But without getting into the details
of cost comparisons between on-premises and commercial cloud providers, we point
out that when performing such comparisons, it is important to consider all costs,
including personnel, space, and power. See, for example, Burt Holzman’s 2016
analysis of in-house vs. public cloud computing costs for high energy physics [155].
He found that when power, cooling, and staff costs were included, on-premises
computing in the Fermilab data center cost 0.9 cents per core hour under the
assumption of 100% utilization, while off-premises computing on Amazon cost 1.4
cents per core hour. The observed computational speeds for their application were
close to identical. Experience suggests that, depending on the specifics of your
institutional computing environment and workload, cloud costs can be insignificant,
greater than local costs, or less than local costs.
16.2.2 Lock In
Free software evangelist Richard Stallman [162] has argued that cloud computing
is “simply a trap aimed at forcing more people to buy into locked, proprietary
systems that [will] cost them more and more over time.” He expands upon this
point in an article in the Boston Review [238].
This is a common critique of cloud computing. At issue is the risk that arises
when we become dependent for our computing on a third-party provider. What
if that provider goes out of business, discontinues services on which we depend,
fails to meet desired quality of service commitments, or raises prices? What if
they lose our data? These are real risks that any potential cloud user needs
to evaluate, balancing them against the benefits that cloud brings. One partial
hedge is to use only services for which equivalents exist from other providers,
and to develop applications that use those services so that they can easily be
retargeted. One way to do this is to build applications in containers, such as
Docker, that allow them to run without modification on any commercial cloud.
However, if the application in the container invokes a special platform service, such
as a cloud-specific NoSQL service or stream broker, then changes are required.
Good design and encapsulation of these dependencies in microservices can mitigate
this problem. A more fundamental issue is the data stored in the cloud. Moving
data can be difficult if they are large. The best solution may be to maintain an
archive of the data elsewhere.
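One way to encapsulate such a dependency is to hide it behind a provider-neutral interface, so that only a thin adapter must change when moving between clouds. The Python sketch below illustrates the idea; the `ObjectStore` interface, the `InMemoryStore` backend, and `archive_results` are hypothetical names invented for illustration, and a real port would add, say, an S3- or Azure-backed class implementing the same two methods via the provider's SDK.

```python
from abc import ABC, abstractmethod


class ObjectStore(ABC):
    """Provider-neutral interface: application code depends only on this."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class InMemoryStore(ObjectStore):
    """Stand-in backend for local testing. A hypothetical S3Store or
    AzureBlobStore would implement the same two methods, so swapping
    providers touches only this adapter, not the application."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]


def archive_results(store: ObjectStore, run_id: str, results: bytes) -> None:
    # Application logic sees only the ObjectStore interface.
    store.put(f"runs/{run_id}", results)


store = InMemoryStore()
archive_results(store, "exp-001", b"temperature,3.14")
print(store.get("runs/exp-001"))
```

The same pattern applies to queues, NoSQL tables, and other platform services: isolate each behind one small interface, and lock-in becomes a bounded porting cost rather than a rewrite.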
16.2.3 Education
We have heard people critique the use of cloud computing in education on the basis
that students who rely on cloud services for storage and computing will not gain
the hands-on knowledge that is gained from, for example, installing and operating
Linux on a computer cluster. (We have both been asked variants of this rather
disturbing question: “How will graduate students gain employment if they cannot
perform systems administration tasks?”)
It is easy to dismiss such concerns as Luddite misunderstandings of new
technologies, but we feel that an important point is being made. We should rejoice
in the capabilities that cloud computing provides, but we are the poorer if in
seizing those benefits we lose understanding of the technologies that we are using.
We should be educating students not simply how to use simple cloud services to
perform simple tasks, but how cloud can be a platform for new approaches to
science. We hope that this book can help in that process.
16.2.4 Black Box Algorithms
Another critique of cloud computing concerns the impact of handing off various
aspects of your work to proprietary software developed and operated by third parties.
If one cannot read the source code for a software component, obtain accurate
documentation of the methods that it uses, or even test it comprehensively, then
one has presumably lost the ability to determine the precise provenance of any
results obtained with that software [205]. A related concern is that software on
which one depends may be updated by a cloud provider without one’s knowledge,
in ways that turn out to affect one’s results.
These concerns appear to us to be quite real in the case of, for example,
proprietary machine learning, data analytics, or computational modeling packages
operated by cloud providers: in such cases, the result of a computation may
indeed depend on decisions, changes, or errors made deep within complex software
packages. We see fewer concerns in the case of systems software: while we may
lack knowledge of how exactly a cloud provider implements a particular data
management function, for example, the range of people using the software is larger
and thus undetected errors are less likely.
These concerns are by no means new to cloud computing: they arise whenever
results derive from software that cannot easily be studied or understood. (Microsoft
Excel, for example, while simple to use, is a complex black box.) The high
complexity and frequent updates associated with cloud software packages do
arguably raise new challenges, but we suggest that simple approaches can be
adopted. Use signals such as peer opinion and documentation to evaluate software
quality. Test with problems for which you know the answers. Use cloud services for
which source code is available—as it often is, as we have detailed in other chapters.
In the case of machine learning methods, seek methods that yield models that
are interpretable by human readers, so that the implications of a model can be
understood and reviewed for hidden biases.
16.2.5 Hardware Limitations
A common critique of cloud computing, at least in the early days, was that cloud
provided only limited hardware choices: it was fine if you wanted a vanilla x86
box, but not if you wanted something special. Today, the range of available
hardware options is surely far greater than exists in any laboratory. Amazon,
Azure and Google provide dozens of machine types, with varying quantities of
CPU cores, memory, GPUs, and other capabilities, as we have described
in section 7.2.2 on page 98.
16.3 Futures
The cloud that we have described in this book is the cloud of 2017. We believe
that many of the technologies and the principles presented here will have a long
life, but we also know that cloud technologies are evolving with great rapidity.
(Amazon, Azure and Google all regularly announce dozens of new services and
capabilities.) Thus we spend some time prognosticating about areas in which we
believe cloud computing is likely to evolve in the next several years.
16.3.1 Cloud-native Applications
What does it mean to develop an application for the cloud? As we saw in chapter 4,
it is straightforward to take many existing applications, package them so that they
run in a virtual machine, and deploy that virtual machine onto a cloud compute
service. But in so doing, all you have done is eliminate (or at least shift) hardware
costs. You have not changed the essential nature of your applications in ways
that take advantage of cloud features such as fault tolerant storage, elasticity, and
powerful services such as those described in part III.
The term cloud native is used to describe applications that are written to
take advantage of the powerful collections of services provided by cloud platforms.
The Cloud Native Computing Foundation (www.cncf.io) writes that: “[c]loud
native computing [deploys] applications as microservices, packaging each part into
its own container, and dynamically orchestrating those containers to optimize
resource utilization.” They describe open source software packages available on
Amazon, Azure, and Google, such as Kubernetes and Prometheus, that can be used
to support such applications. We described microservice architecture in section 7.6
on page 110 and illustrated it with our simple scientific document analyzer. The
cloud-native concept is more than just microservice implementations [120]. Cloud-native
applications have a clear separation between persistent state, such as a
database, and logic that runs in ephemeral virtual machines or containers, as shown
in figure 16.1. The Globus service described in section 14.5 on page 307 has these
characteristics.
The tools that you use to deploy such applications (Kubernetes, Mesos, etc.)
also allow you to easily monitor and manage them. Such applications scale effortlessly
and can be partitioned so that new versions of an application’s microservices
can be deployed and tested alongside the current “active” deployment. If the new
versions work as planned, the old versions can be scaled back and no interruption
in external service is seen.
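The side-by-side deployment pattern just described can be sketched as weighted routing between two versions of a microservice. The sketch below is illustrative only: `make_router` and the toy handlers are invented names, and in practice the traffic split would be performed by the deployment tool or a load balancer rather than in application code.

```python
import random


def make_router(handlers, weights, seed=0):
    """Return a dispatch function that routes each request to one of several
    deployed versions, chosen according to the given traffic weights."""
    rng = random.Random(seed)  # seeded for reproducibility in this sketch
    versions = list(handlers)

    def route(request):
        version = rng.choices(versions, weights=weights)[0]
        return handlers[version](request)

    return route


# Two versions of a microservice running side by side.
handlers = {
    "v1": lambda req: ("v1", req.upper()),
    "v2": lambda req: ("v2", req.upper()),
}

# Canary phase: send roughly 10% of traffic to the new version.
route = make_router(handlers, weights=[90, 10])
results = [route("ping")[0] for _ in range(1000)]
print(results.count("v2"))  # roughly 100 of the 1000 requests hit v2
```

If v2 misbehaves, setting its weight to zero drains it without interrupting service; if it works as planned, the weights are gradually reversed until v1 can be scaled back entirely.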
Figure 16.1: On the left, the conventional deployment approach: each application is
deployed in a virtual machine or container that contains all application state. On the right,
the cloud-native approach: state is maintained in cloud data services and computation is
performed by ephemeral service instances.
As we discussed in section 4.3 on page 67, serverless computing is about having
the cloud manage collections of your functions, to be executed when conditions
that you define occur. This concept is closely related to cloud-native design. Unlike
traditional scientific computations, which run from start to finish, cloud-native apps
run until you scale their implementation back to zero. Even in that quiescent state, they can
be restarted simply by telling the deployment tool to scale the replica count up
from zero. One can arrange for an external event to trigger a serverless responder
such as Amazon Lambda, which invokes the deployment system to scale up the
application.
So what does cloud-native have to do with the future of science? Consider the
following scenario. Suppose you have a network of experimental instruments that
produce data in large volumes and bursts, which you need to analyze as they arrive
in real time. This application can naturally be structured as a set of interacting
microservices. One microservice receives and scans data. If something interesting
is spotted, it invokes other microservices to perform additional processing, each
of which may need to scale up to take on these tasks. These various components
all send results to cataloging microservices that push data to a persistent data
repository. A second category of events may be triggered by users making queries
concerning the data that has been gathered. Such events can also cause additional
analysis tasks to be performed, or may simply involve access to the data repository.
The resulting cloud-native experiment management system may have dozens of
individual microservice types, all interacting and scaling according to demand.
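The core of such a system can be sketched as a chain of functions, one per microservice type. Everything here is an invented stand-in: the detection threshold, the record format, and the in-memory `repository`, which in a real cloud-native deployment would be separately deployed, independently scaled services connected by message queues and a persistent data service.

```python
# Each function stands in for one microservice type in the pipeline.
repository = []  # stand-in for a persistent cloud data repository


def catalog(record):
    """Cataloging microservice: push processed results to storage."""
    repository.append(record)


def process(reading):
    """Analysis microservice, invoked only when data is interesting;
    the doubling here is a placeholder for real processing."""
    catalog({"id": reading["id"], "score": reading["value"] * 2})


def scan(reading, threshold=50):
    """Ingest microservice: receive and scan data, triggering further
    processing when something interesting is spotted."""
    if reading["value"] > threshold:
        process(reading)


# A burst of instrument readings arrives; only some merit processing.
for i, value in enumerate([10, 72, 33, 95]):
    scan({"id": i, "value": value})

print(len(repository))  # 2 readings exceeded the threshold
```

In the deployed system, each stage would scale independently: a burst of interesting readings spins up more `process` instances without touching the ingest or catalog tiers.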
16.3.2 Architectural Evolution
Once upon a time, cloud data centers were built with racks of off-the-shelf servers
from companies like Dell and HP. The relentless economics of competing in the
cloud marketplace has completely changed the way data centers are designed. The
first thing to go was off-the-shelf servers. Google was early in moving to cheap
blade servers packed densely into racks. Amazon followed this practice and was
soon building its own servers in collaboration with companies like Taiwan’s Quanta.
Traditional servers were just too expensive.
Big changes came around 2005 when those building massive data centers were
forced to confront the fact that energy consumption was a major cost of doing
business. Amazon, Google and Microsoft were experimenting with a variety of
ideas to reduce the energy footprint of their data centers. This included tapping
into renewable sources of energy such as geothermal, wind and wave action. Data
center designs began to adopt supercomputer-style hot-cold aisle air conditioning.
Microsoft was able to move to a system in which 2000 servers were packaged into
a large shipping container that could be deployed outside.
The next phase of design evolution involved the server and not just its packaging
and cooling. By 2010, many cloud vendors were designing their own servers.
In 2011, Facebook started the Open Compute Project [35] to create an open
source design for the server itself. Facebook, Google, and Microsoft also began
experimenting with ARM processors as a lower power alternative to the traditional
Intel processor. As it became clear that different cloud workloads required different
resources, the variety of server configurations began to explode.
The original data center designs used conventional commercial networking gear
at the top of each rack and between racks. As these centers grew, their networking
needs became more demanding. Institutions demanded ways to extend their private
network directly into the cloud through scalable, virtual private networks. By 2012,
the Azure network was all based on software defined networks [228]; the same is
true for Amazon and Google.
The most recent architectural changes in the cloud are being driven by the
performance requirements of search, analytics, and machine learning. In 2010,
Microsoft Research began a study of how to optimize the Bing search algorithms.
This work evolved into a major redesign of server architecture around Field
Programmable Gate Arrays (FPGAs) that have been added to the Azure
servers [85]. The FPGAs are situated between the network switches and the
servers so that this programmable logic lies in a plane allowing FPGA-to-FPGA
direct communication. This architecture, called Catapult, allows applications
needing special acceleration to group together a set of FPGAs and servers into a
special purpose mesh. This configuration is used for applications like high speed
encryption and accelerating deep learning [216]. Microsoft is not the only cloud that
is deploying custom hardware. Google recently announced the Tensor Processing
Unit [164], which is designed to be a better accelerator for TensorFlow than GPUs.
These examples of cloud data center evolution illustrate that the designs are
moving rapidly toward a possible convergence with supercomputer technology.
While the cloud will always have a different use model than the largest supercomputers,
we expect the value of the cloud for science to only increase.
16.3.3 Edge Computing
Cloud computing has become synonymous with massive, hyper-connected data
centers, within which storage and computation are allocated fluidly in response
to user demand. This highly centralized architecture has been central to cloud
computing’s success, permitting both economies of scale in terms of operations
costs and innovative applications that depend on the aggregation and analysis
of large quantities of data. And as cloud provider services continue to increase
in sophistication, and as businesses, homes, and people become increasingly well
connected, it can easily seem that there is no limit to the applications that can
be moved from personal computers to the cloud. Perhaps, we may think, all
computing will soon occur elsewhere.
Yet at the same time as cloud data centers become more powerful and people
become more connected to those data centers, other important trends are pushing
towards decentralization. Increasingly powerful sensors generate vast quantities of
data that often cannot be cost effectively transferred to cloud data centers but must
be processed locally. Increasing demands for computer-in-the-loop control make
latency increasingly critical. Consider, for example, an automated observation
system that is to detect migrating birds and then zoom in to obtain high-resolution
images that can be used to identify individual animals. It is likely not practical
to stream real-time video from thousands of cameras to the cloud, process the
data, and return results in time to zoom the cameras. But an inexpensive local
processing unit, perhaps running algorithms configured based on large-scale offline
machine learning, can easily perform such tasks.
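The local-processing idea can be sketched as a filter that runs an inexpensive, locally deployed detector on every frame and forwards only the interesting ones to the cloud. The frame format, the motion-score detector, and the `upload` callback below are all invented for illustration; in a real system the detector would be a model trained offline in the cloud and pushed to the edge node.

```python
def edge_filter(frames, detector, upload):
    """Edge-node sketch: apply a cheap detector to every frame and forward
    only interesting frames, so full-rate video never leaves the camera."""
    uploaded = 0
    for frame in frames:
        if detector(frame):
            upload(frame)  # e.g., an HTTPS POST to a cloud ingest service
            uploaded += 1
    return uploaded


# Hypothetical stand-ins: frames carry a precomputed motion score, and the
# detector is a simple threshold chosen by offline training in the cloud.
frames = [{"id": i, "motion": m}
          for i, m in enumerate([0.1, 0.9, 0.2, 0.8, 0.05])]
sent = []
n = edge_filter(frames, detector=lambda f: f["motion"] > 0.5,
                upload=sent.append)
print(n, len(frames))  # 2 of the 5 frames are forwarded to the cloud
```

The bandwidth saving scales with the selectivity of the detector: thousands of cameras can stream continuously while only the rare interesting frames consume the link to the data center.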
For applications such as these, computing needs to occur “at the edge” of
the network: hence edge computing [232]; the term “fog computing,” another
nebulous neologism, is sometimes also used [75]. Of course, that is where computing
]. Of course, that is where computing
has always been performed, at least since the PC era. But a new question being
considered is how the edge and the cloud may be connected. Will we see cloud
providers start to engineer cloud services that extend out to the edge? What will
this mean for what we choose to outsource to the cloud? It will be fascinating to
see how these questions are answered over the next decade and beyond.
We can already see early examples of cloud providers extending the reach of their
services beyond their primary data centers. Content distribution networks
(e.g., Akamai, Amazon CloudFront, Azure CDN) run edge servers distributed
worldwide (68 such servers for Amazon CloudFront, as of 2017) to cache content
(e.g., web pages) that is to be made available rapidly to clients. More intriguing
are developments in serverless computing. As we saw in section 4.3 on page 67,
services such as Amazon Lambda, Azure Functions, and Google Cloud Functions
allow users to define functions to be performed when certain events occur. While
these services make it possible to implement powerful reactive applications, their
responsiveness will be limited if every event notification and subsequent response
have to travel from the origin site to a cloud data center. Thus, Amazon provides
Lambda@Edge, which allows functions to run on Amazon CloudFront content
delivery network nodes. Intriguingly, they have also announced plans to allow
Lambda functions to “execute on hardware that isn’t a part of Amazon’s cloud or
doesn’t have a consistent connection to the internet” [132]: perhaps, for example,
on computers associated with experimental apparatus in a scientific laboratory or
on Internet of Things components such as the Array of Things nodes described in
section 9.1.2 on page 163.
16.4 Resources
The History of the Grid [126] reviews many developments relevant to utility, grid,
and cloud computing.