Chapter 9
Streaming Data to the Cloud
“Panta rhei [everything flows].”
—Heraclitus
While batch analysis of big data collections is important, real-time or near-real-time analysis of data is becoming increasingly critical. Consider, for example, data from instruments that control complex systems such as the sensors onboard an autonomous vehicle or a power grid: here, the data analysis is critical to driving the system. In some cases, the value of the results diminishes rapidly as they get older. For example, trending topics and hashtags in a Twitter stream may be uninteresting after the fact. In other cases, such as some large science experiments, the volume of data that arrives each second is so large that it cannot be retained, and real-time analysis or data reduction is the only way to handle it.
We refer to the activity of analyzing data coming from unbounded streams as data stream analytics. While many people think this is a new topic, it dates back to basic research on complex event processing in the 1990s at places including Stanford, Caltech, and Cambridge [87]. That research created part of the intellectual foundation for today's systems. In the paragraphs that follow, we describe some recent approaches to data stream analytics from the open source community and public cloud providers. These approaches include Spark Streaming spark.apache.org/streaming, derived from the Spark parallel data analysis system described in the next chapter; Twitter's Storm system storm.apache.org, redesigned by Twitter as Heron [173]; Apache Flink flink.apache.org from the German Stratosphere project; and Google's Cloud Dataflow [58], becoming Apache Beam beam.apache.org, which runs on top of Flink, Spark, and Google's Cloud.
University projects include Borealis from Brandeis, Brown, and MIT, and the Neptune and Granules projects at Colorado State. Other commercially developed systems include Amazon Kinesis aws.amazon.com/kinesis, Azure Event Hubs [31], and IBM Stream Analytics [25].
The analysis of instrument data streams sometimes needs to move closer to the source. Tools are emerging to perform preanalysis, with the goal of identifying the data subsets that should be sent to the cloud for deeper analysis. For example, the Apache Edgent edge-analytics tools edgent.apache.org are designed to run in small systems such as the Raspberry Pi. Kamburugamuve and Fox [165] survey many of these stream-processing technologies, covering issues not discussed here.
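To convey the flavor of such edge preanalysis, consider the following minimal Python sketch. It is ours, not Apache Edgent code; the sensor source, threshold, and upload function are hypothetical stand-ins. Readings are sampled locally, and only those that cross a threshold are forwarded to the cloud.

import random

THRESHOLD = 75.0  # hypothetical alert level

def read_sensor():
    # Stand-in for a real device driver on, e.g., a Raspberry Pi.
    return random.uniform(0.0, 100.0)

def send_to_cloud(record):
    # Stand-in for an upload to a cloud ingestion endpoint.
    print("sending", record)

def edge_filter(n_samples=1000):
    # Forward only the interesting subset; discard the rest locally,
    # reducing the volume that must travel over the network.
    sent = 0
    for _ in range(n_samples):
        value = read_sensor()
        if value > THRESHOLD:
            send_to_cloud({"value": value})
            sent += 1
    return sent

Real deployments apply richer logic (windowing, aggregation, model-based anomaly detection), but the data-reduction pattern is the same.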
In the rest of this chapter, we use examples to motivate the need for data stream analytics in science, discuss challenges that data stream analytics systems must address, and describe the features and illustrate the use of a range of open source and cloud provider systems.
9.1 Scientific Stream Examples
Many interesting examples of data stream processing in science have been examined
in recent workshops [130]. We review a few representative cases here.
9.1.1 Wide-area Geophysical Sensor Networks
Many geophysical phenomena, from climate change to earthquakes and inland flooding, require vast data gathering and analytics if we are to both understand the underlying scientific processes and mitigate damage to life and property. Increasingly, scientists are operating sensor networks that provide large quantities of streaming data from many geospatial locations. We give three examples here.
Studies of ground motion conducted by the Southern California Earthquake Center scec.org can involve thousands of sensors that record continuously for months at high sample rates [208]. One purpose for collecting such data is to improve earthquake models, in which case data delivery rates are not of great concern. Another purpose is real-time earthquake detection, in which ground movement data are collected and analyzed to determine likely location and hazard level, and alerts are generated that may, for example, be used to stop trains [174]. In this case, delivery and processing speed are vital.
The Geodesy Advancing Geosciences and EarthScope (GAGE) Global Positioning System (GPS) [220] network manages data streams from approximately
2,000 continuously operating GPS receivers spanning an area covering the Arctic, North America, and the Caribbean. These data are used for studies of seismic, hydrological, and other phenomena, and if available with low latency can also be used for earthquake and tsunami warnings.
The U.S. National Science Foundation-funded Ocean Observatories Initiative (OOI) [102, 34] operates an integrated set of science-driven platforms and sensor systems at multiple locations worldwide, including cabled seafloor devices, tethered buoys, and mobile assets. Its 1,227 instruments of 75 different types collect more than 200 different kinds of data regarding physical, chemical, geological, and biological properties and processes, from the seafloor to the air-sea interface. Once acquired, raw data (consisting mostly of tables of raw instrument values: counts, volts, etc.) are transmitted to one of three operations centers and then to mirrored data repositories at east and west coast cyberinfrastructure sites, where numerous other derived data are computed. The resulting data are used to study issues such as climate change, ecosystem variability, ocean acidification, and carbon cycling. Rapid data delivery is important because scientists want to use near-real-time data to detect and monitor events as they are happening.
9.1.2 Urban Informatics
Cities are complex, dynamic systems that are growin g rapidly and becoming
increasingly dense. The global urban population in 2014 accounted for 54% of the
total global population, up from 34% in 1960. In the U.S., 62.7% of the people live
in cities, although cities constitute only 3.5% of the land area [
98
]. Cities consume
large quantities of water and power, and contribute significantly to greenhouse
gas emissio ns. Maki ng urban centers safe, healthy, sustainable, and ecient is of
critical global importance.
Understanding how cities work and how they respond to changing environmental
conditions is now part of the emerging discipline called
urban info rmatics
. This
new discipline brings together data analytics experts, sociologists, economists,
environmental health specialists, city planners, and safety experts. In order to
capture and understand the dynamics of a city, many municipalities have begun
to install instrumentation that helps monitor energy use, air and water quali ty,
the transportation system, crime rates, an d weather conditions. The aim of this
real-time data gathering and analysis is to help municipalities avert citywide or
neighborhood crises and more intelligently plan future expansion.
Array of Things. A project in the city of Chicago called the Array of Things arrayofthings.github.io is leading the way in defining how data can be collected and analyzed from instruments placed within cities. The project is led by Charlie Catlett and Peter Beckman from the Urban Center for Computation and Data, a joint initiative of the University of Chicago and Argonne National Laboratory. The Array of Things team has designed the hardware and software for a sensor pod that can be placed on utility poles throughout a city, gather local data, and push the data back as a stream to a recording and analysis point in the cloud. As shown in figure 9.1 on the following page, the sensor package contains numerous instruments, a data-processing engine, and a communications package. Because the sensor package is intended to be installed on poles and infrequently accessed, reliability is an important feature. Consequently, the system contains a sophisticated reliability management subsystem that monitors the instruments, computing, and communications, and reboots failed systems if needed and possible.
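To suggest how such a monitor-and-recover loop might behave, here is a minimal Python sketch. It is ours, not the Array of Things implementation; the component names and the probe and reboot functions are hypothetical.

import time

COMPONENTS = ["instruments", "computing", "communications"]

def is_alive(component):
    # Hypothetical health probe; a real pod would query hardware here.
    return True

def reboot(component):
    # Hypothetical recovery action; a real pod might power-cycle here.
    print("rebooting", component)

def watchdog(period=60.0):
    # Periodically probe each subsystem and reboot any that fail,
    # mirroring the monitor-and-recover behavior described above.
    while True:
        for component in COMPONENTS:
            if not is_alive(component):
                reboot(component)
        time.sleep(period)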
Sensors in the sensor pod can measure temperature and humidity; physical shock and vibration; magnetic fields; infrared, visible, and ultraviolet light; sound; and atmospheric contents such as carbon monoxide, hydrogen sulfide, nitrogen dioxide, ozone, sulfur dioxide, and air particles. The pod also contains a camera, but it cannot transmit identifiable images of individual people. The data from these instruments are uploaded approximately every 30 minutes as a JSON record, which takes approximately the form shown below.
{
    "NodeID": string,
    "timestamp": "2016-11-10 00:00:20.143000",
    "sensor": string,
    "parameter": string,
    "value": string
}
Note that a data stream from the pod may have more than one sensor and that individual sensors can take multiple different measurements. These measurements are distinguished by the parameter keyword. We use this structure in an example later in this chapter.
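To illustrate, the short Python sketch below (ours; the sample field values are invented) parses one such record and groups values by the (sensor, parameter) pair, the natural key when one pod stream carries many sensors reporting several parameters each.

import json
from collections import defaultdict

# An invented sample record in the schema shown above.
message = """{
    "NodeID": "node-001",
    "timestamp": "2016-11-10 00:00:20.143000",
    "sensor": "HTU21D",
    "parameter": "temperature",
    "value": "23.5"
}"""

readings = defaultdict(list)
record = json.loads(message)
key = (record["sensor"], record["parameter"])
readings[key].append(float(record["value"]))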
9.1.3 Large-scale Science Data Flows
At the other end of the application spectrum, massively parallel simulation models running on supercomputers can generate vast amounts of data. Every few simulated time steps, the program may generate large (50 GB or more) datasets distributed over thousands of processing elements. While you can save a few such datasets to