Chapter 13
Building Your Own Cloud with
OpenStack (with Stig Telfer)
“If we as a society do not understand ‘the cloud,’ in all its aspects, what data it holds, how it works, what the bargains are we make as we engage with it, we’ll all be the poorer for it, I believe.”
—John Battelle
OpenStack is an open source cloud operating system used by a broad and growing
global community [54]. The OpenStack software has a twice-yearly release cycle,
with versions named in an incrementing alphabetical sequence. The latest release
at the time of writing, OpenStack Newton [38], includes code contributions from
2,581 developers and 309 organizations. OpenStack’s development model embodies
the state of the art in distributed, open source software development.
While OpenStack’s origins are in the orchestration of virtual machines, the
project has diversified to become a versatile coordinator of virtualization, containerization,
and bare metal compute. OpenStack now provides a unified foundation for
the management of many forms of storage, network, and compute resources. User
surveys have identified four dominant use cases [80]:
Enterprise private cloud
Public cloud
Telecom and network functions virtualization
Research and big data, including high-performance computing
OpenStack’s popularity as a choice for research computing infrastructure management
is underlined by its use in academic clouds in the U.S., such as Chameleon
(chameleoncloud.org), Bridges [51], and Jetstream (jetstream-cloud.org), and in
international projects such as NeCTAR (cloud.nectar.org.au) in Australia and
at CERN [37] in Europe. The scientific computing use cases of OpenStack are
served by a dedicated area (openstack.org/science) within the OpenStack website.
13.1 OpenStack Core Services
OpenStack control planes are formed from a number of intercommunicating (mostly)
stateless services and a set of stateful data stores. OpenStack services are de-
scribed in depth in an online resource called the OpenStack project navigator
(openstack.org/software). Table 13.1 lists core components.
The cloud infrastructure ecosystem is evolving rapidly. The OpenStack strat-
egy for adapting to a rapid pace of innovation is referred to as the “Big Tent.”
New projects, often experimental or exploratory, can easily be created and are
encouraged to progress toward the OpenStack conventions for project governance
as their functionality develops. Competing or conflicting projects are supported,
to enable constructive competition within the ecosystem.
13.2 HPC in an OpenStack Environment
Table 13.1: Six core components likely to be found in any OpenStack deployment.

Name      Function        Description
Keystone  Identity        Provides authentication and authorization for all OpenStack services
Nova      Compute         Provides virtual servers on demand
Neutron   Networking      Provides “network connectivity as a service” between interface devices
Glance    VM images       Supports discovery, registration, and retrieval of VM images
Cinder    Block storage   Provides persistent block storage to guest VMs
Swift     Object storage  Object storage interface

OpenStack can be configured to orchestrate compute resources in three different
ways: via virtualization, containerization, or bare metal. (Recall that we described
the difference between virtual machines and containers in chapter 4; in a bare
metal deployment, software is run directly on the underlying hardware.)
Outside HPC, OpenStack is most frequently used as an orchestrator of virtual-
ization, in order to realize the full flexibility and advantages of software-defined
infrastructure. In contrast, bare metal deployments are most common in HPC
settings. While HPC workloads can be run on OpenStack systems configured in
any one of the three forms, many administrators choose to trade off flexibility for
reduced runtime overhead. However, because of rapid technological evolution at
all levels, this trade-off is a continually shifting balance.
One emerging trend is to use specialized hardware and virtualization optimiza-
tions to support virtualized HPC workloads in ways that deliver the flexibility
of software-defined infrastructure but avoid associated performance overheads.
Hardware technologies such as Single-Root I/O Virtualization (SR-IOV), described
in a later section, integrate with virtualized OpenStack compute to deliver HPC
networking with minimized overhead. Virtualization optimizations such as CPU
pinning and non-uniform memory access (NUMA) passthrough, both also described
later, enable hardware-aware scheduling and placement optimizations. This ability
to reconfigure cloud infrastructure via programming offers advantages throughout
the software development life cycle. You can, for example, develop and test an
application or workload on a standard OpenStack system and then apply hardware
and virtualization optimizations.
We focus in this chapter on OpenStack deployments that use virtualization.
Information available online describes implementation of containerized [12] and
bare metal [9] use cases.
13.3 Considerations for Scientific Workloads
OpenStack’s configuration and services can be adapted to support a range of
requirements particular to scientific workloads. We describe some examples here.
13.3.1 Network-intensive Ingest or Egress
An OpenStack workload may involve the ingest and/or egress of large volumes
of data involving external sources: for example, data ingested from a scientific
instrument or a public dataset not hosted within the cloud. In these circumstances,
the external network bandwidth of compute instances can become a bottleneck.
A typical OpenStack configuration may deploy software gateway routers for
networking between compute instances and the external world. However, while
such software gateway routers implement the rich feature set of software-defined
networking, they struggle to deliver high bandwidth and low latencies. When
operating in extremis, software switches are observed to discard packets instead of
exerting back pressure. The following alternative configurations can be used to
improve performance for external network connectivity.
Provider networks. A provider network is a pre-existing network in the
data center. It is not created or controlled by OpenStack Neutron, the
OpenStack networking controller, but Neutron can be made aware of it and
connect compute instances with it. This approach bypasses OpenStack-
controlled routing and gateways (see the sketch after this list).
Router gateways in silicon. Some switch vendors are able to offload
software-defined networking (SDN) capabilities into switch port configura-
tions, enabling OpenStack-defined Layer-3 Internet protocol routing oper-
ations to be performed at full speed in the switch ports. Similarly, some
network interface cards (NICs) support hardware offloading of large por-
tions of SDN, greatly reducing load on control plane network nodes. Router
gateways in silicon do not support rich networking features such as network
address translation (NAT, required for supporting floating IP addresses),
although these features may not be necessary for private cloud use cases.
Distributed virtual routers. Distributing network node functions over
components within each hypervisor produces a scalable external router
implementation. (The hypervisor is the software component responsible
for creating and running virtual machines. Typically there is one hypervisor
instance per compute node.) One drawback of this approach is the increased
hypervisor CPU overhead for networking. Furthermore, the approach raises
security concerns by making every hypervisor externally reachable.
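As an illustration of the provider network approach, the following sketch registers an existing data-center VLAN with Neutron so that compute instances can attach to it directly. The network and subnet names, physical network label, VLAN ID, and addresses are assumptions for illustration; DHCP and gateway settings depend on the site.

# Register an existing data-center VLAN (ID 100) as a Neutron provider network
neutron net-create datacentre_net --shared \
    --provider:network_type vlan \
    --provider:physical_network datacentre \
    --provider:segmentation_id 100
# Describe the pre-existing IP subnet so that Neutron can allocate addresses on it
neutron subnet-create --name datacentre_subnet --gateway 10.10.0.1 \
    --disable-dhcp datacentre_net 10.10.0.0/24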
13.3.2 Tightly Coupled Compute
In a generic OpenStack configuration, networking for compute instances passes
through one or more software virtual switches in the hypervisor, providing flexi-
bility in configuration but leading to higher latency and reduced bandwidth for
applications due to the additional data copies and context switches. Such software
switches can also introduce higher levels of jitter and packet loss. The performance
of tightly coupled application workloads, such as some bulk synchronous parallel
applications, can be strongly influenced by communication latencies between in-
stances. Consequently, virtualized networking can have an adverse impact on the
performance of tightly coupled workloads.
The overheads introduced by virtualized networking can be bypassed through
use of Single-Root I/O Virtualization (SR-IOV), although this feature is not
supported by all NICs. This PCI hardware capability specifies how a PCI device
can be shared, through creation of a number of shadow devices referred to as
virtual functions. A hypervisor that supports SR-IOV enables the passthrough
of a virtual function device into a compute instance, giving the virtual instance
direct access to the underlying physical network device’s hardware interface. The
resulting direct path from compute instances to physical networks circumvents
the software-defined networking implementation by the hypervisor. While this
approach delivers high levels of performance, it also bypasses the security group
firewall rules that OpenStack applies to an instance. Consequently, SR-IOV
networking should only be used on internal (trusted) networks and should be
configured in conjunction with conventional network configurations for managing
connectivity with untrusted networks.
The performance of latency-sensitive workloads can be improved further through
smarter process scheduling. Pinning virtual processor cores to physical cores im-
proves cache locality. Memory access performance is improved by leveraging affinity
between physical processor cores and memory regions. We briefly explain these
concepts. Modern system architectures tend to incorporate multiple processors,
each with an integrated memory controller and with memory directly attached to
each processor. A single-memory system is constructed with coherent access to all
memory from all CPU cores, with hardware buses between processors to ensure
consistency. A consequence of this design is that memory and CPUs are unevenly
coupled: what is referred to as non-uniform memory access. Making the virtual
compute instances aware of the NUMA topology of the physical host enables better
scheduling and placement decisions by the guest kernel. The compute hypervisors
and OpenStack services can be configured to enable these optimizations.
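To see the NUMA layout that these optimizations work with, the host topology can be inspected with standard Linux tools; a minimal sketch (output format and package availability vary by distribution):

# Show NUMA nodes, their CPU cores, and attached memory (requires the numactl package)
numactl --hardware
# Summarize the CPU-to-NUMA-node assignment
lscpu | grep -i numa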
13.3.3 Hierarchical Storage and Parallel File Systems
A workload may require high-performance coupling with a data source rather than
between compute hosts. OpenStack can support storage services of multiple types,
including types suitable for different tiers in a storage hierarchy, concurrently.
However, OpenStack itself does not have a native implementation of hierarchical
storage management. The HPC data-movement protocol iSCSI Extensions for
RDMA (iSER) is supported for serving data for OpenStack block storage (Cinder).
iSER-enabled Cinder storage requires an RDMA-capable NIC in both the compute
hypervisors and the block storage servers exporting volumes of this type.
When using RDMA and SR-IOV-enabled NICs in an OpenStack private cloud,
high levels of performance can be achieved from virtualized clients to a parallel
file system. The multitenancy model of cloud infrastructure differs from the
conventional multitenancy model of HPC parallel file systems, and this difference
should be taken into account when connecting OpenStack compute instances
with parallel file systems across the data center intranet, as we now explain.
Conventional HPC platforms are multiuser environments, and user privileges and
permissions are controlled through their user ID. In cloud-hosted infrastructure, it
is standard practice to grant the tenant trivial access to root within their instances.
When exporting file systems to cloud-hosted instances, provision should be made
for potentially hostile clients with superuser privileges. Recent developments in
Lustre have introduced Kerberos-based authentication for clients, which aims to
resolve this issue. An alternative approach is to provision the creation of scratch
parallel file systems for a tenant within their tenancy on the OpenStack cloud. In
this way, a tenant is isolated from other cloud users and unable to subvert their
access to a shared file system resource.
13.4 OpenStack Deployment
Even a default deployment of OpenStack is large and complex and is not normally
performed manually. The OpenStack project’s rich software ecosystem includes
a diverse range of automated systems for deployment and configuration. The
OpenStack market has (broadly) converged on four approaches.
Turnkey systems. Rack-scale appliances provide integrated private cloud
compute and management.
Vendor-supported. Linux distribution vendors are becoming dominant as
OpenStack vendors. There are commercially supported distributions of
OpenStack from Canonical, Red Hat, and SUSE, and from OpenStack
specialists such as Rackspace and Mirantis.
Community-packaged. Freely available Linux distributions such as CentOS
and Ubuntu have community-supported OpenStack packages.
Upstream code. Some OpenStack deployments are assembled by using source
code pulled directly from upstream source repositories, often deployed as
containerized services.
Thus, users have considerable choice when selecting the means of deploying
OpenStack. This choice provides flexibility in matching an organization’s
requirements, budget, and skill set to a suitable method of OpenStack deployment.
Deployment begins with several dedicated servers, networks, and disks. A
conventional OpenStack deployment classifies these servers into distinct roles. On
small-scale deployments, several roles may be combined to enable a scaled-down
footprint, even down to a single node. These roles are listed below.
Compute hypervisors. These are servers that run the client workloads in a
virtualized environment. In addition to virtualized compute, services are
usually required for implementing software-defined networking and storage.
OpenStack controllers. Centralized OpenStack control services and data
stores typically run on separate servers. These servers should be configured
for supporting database IO patterns and a highly concurrent, high-throughput
transactional workload.
Storage. Many OpenStack deployments are underpinned by a Ceph storage
cluster (although this is not a requirement). A wide range of vendors offer
OpenStack connectivity for commercial storage products.
Networking. Several approaches are available for implementing tenant net-
working in OpenStack, but the conventional and most established approach
involves managing router gateways to tenant networks using software vir-
tual switches on controller nodes. Since this can quickly become highly
CPU-intensive (and a performance bottleneck for tenants), OpenStack de-
ployments using Neutron IP routers often include dedicated servers for scaling
up networking performance. See section 13.3 on page 285 for other strategies.
13.5 Example Deployment
We next describe a deployment of OpenStack carried out at the University of Cam-
bridge research computing services. It incorporates many of the considerations
outlined in section 13.3 on page 285 to provide a flexible but performant resource.
13.5.1 Hardware Components
As illustrated in figure 13.1 on the next page, our example deployment is on current-
generation Dell Xeon-based servers equipped with current-generation Mellanox
Ethernet NICs that support remote direct memory access (RDMA) (using RoCE:
RDMA over converged Ethernet) and SR-IOV. The servers are connected by a
50G Ethernet high-speed data network and a 100G multipath layer-2 Ethernet
fabric, using multichassis link aggregation (MLAG) to achieve multipathing. The
high-speed network uses Mellanox Ethernet switches. Separate 1G networks are
used for power management and server provisioning and control. The system also
uses a separate 10G network for the OpenStack control plane.
Figure 13.1: Hardware configuration used in our example OpenStack for science deployment.
Storage services are delivered by using a range of components. A high-speed
storage service is implemented by using iSER and NVMe devices. A moderate-
scale Ceph cluster is used to provide a tier of storage with greater capacity and
resilience. Enterprise storage is provided by Nexenta. Outside of the OpenStack
infrastructure, Intel Enterprise Edition Lustre is delivered to compute instances by
using a data-center provider network.
13.5.2 OpenStack Components
We describe here a freely available, community-supported OpenStack deployment
that uses the Community ENTerprise Operating System (CentOS) Linux distri-
bution with OpenStack packages from the Red Hat Distribution for OpenStack
(RDO). The servers are deployed with CentOS and OpenStack using TripleO, a
tool for automated OpenStack deployment and management. The TripleO online
documentation [49] provides a comprehensive guide to using the tool for Open-
Stack deployment. We focus here on adapting an OpenStack configuration to
improve support for scientific computing workloads. To retain generality with other
methods of deployment, we also describe the key components of the OpenStack
configuration and how TripleO is used to realize that configuration.
13.5.3 Enabling Block Storage via RDMA
In order to use the iSER protocol, all associated storage servers and hypervisor
clients must have both RDMA-capable NICs and the OpenFabrics (OFED) stack
installed. The storage server manages the Cinder block storage volumes using
LVM. The iSCSI protocol is configured as iSER in the OpenStack Cinder Volume
driver configuration (/etc/cinder/cinder.conf) on the storage server, as follows.
[hpc_storage]
volume_driver=cinder.volume.drivers.lvm.LVMVolumeDriver
volumes_dir=/var/lib/cinder/volumes
iscsi_protocol=iser
iscsi_ip_address=10.4.99.3
volume_backend_name=hpc_storage
iscsi_helper=lioadm
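For a multi-backend Cinder configuration, this backend section also needs to be activated from the [DEFAULT] section of the same file; a minimal sketch (the rest of the [DEFAULT] section is assumed to exist already):

[DEFAULT]
# Activate the backend defined in the [hpc_storage] section above
enabled_backends=hpc_storage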
To enable this using TripleO (which supports iSER configuration from the Open-
Stack Ocata release onward), the following configuration is required.
parameter_defaults:
  CinderEnableIscsiBackend: true
  CinderIscsiProtocol: 'iser'
  CinderISCSIHelper: 'lioadm'
Once this configuration is deployed, iSER-enabled block storage volumes can
be created and attached to compute instances, using the same interface as any
other kind of Cinder volume.
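For example, a volume type can be mapped to this backend and then used when creating and attaching volumes; a sketch using the Cinder and Nova command-line clients (the type name, volume name, size, and UUIDs are illustrative, and the exact client options vary between releases):

# Create a volume type tied to the hpc_storage backend defined above
cinder type-create hpc-storage
cinder type-key hpc-storage set volume_backend_name=hpc_storage
# Create a 100 GB volume of that type and attach it to a running instance
cinder create --volume-type hpc-storage --name scratch-vol 100
nova volume-attach <instance uuid> <volume uuid>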
13.5.4 Enabling SR-IOV Networking
Because of its circumvention of OpenStack security groups, SR-IOV is not suitable
for use on externally accessible networks. SR-IOV requires hardware support in
the NIC; this can be checked in the product specifications or with lspci -v. The
PCI Vendor ID and Device ID of the NIC are needed for hypervisor configuration
(the Mellanox ConnectX4-LX NICs used in this example configuration have IDs
0x15b3 and 0x1016, respectively). SR-IOV must be enabled in both the BIOS and
the Linux kernel. The following additional kernel command-line boot parameters
enable SR-IOV support in the Linux kernel.
intel_iommu=on iommu=pt
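These parameters are usually made persistent by appending them to GRUB_CMDLINE_LINUX in /etc/default/grub and regenerating the GRUB configuration; a sketch for a BIOS-booted CentOS host (paths are assumptions about the local setup):

# Append the IOMMU parameters to the kernel command line and rebuild the GRUB config
sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 intel_iommu=on iommu=pt"/' /etc/default/grub
grub2-mkconfig -o /boot/grub2/grub.cfg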
SR-IOV networking requires configuration of both Nova and Neutron. Addition-
ally, several SR-IOV virtual functions (VFs) must be created in advance, typically
during system startup. On compute hypervisors, permission for PCI-Passthrough
of the NIC VFs must be declared in Nova’s configuration /etc/nova/nova.conf:
pci_passthrough_whitelist = [{"vendor_id": "15b3", \
"device_id": "1015", \
"physical_network": "hpc_network"}]
On OpenStack controller nodes, the Nova configuration file /etc/nova/nova.conf
needs to be edited to configure the scheduler with an additional filter for scheduling
instances according to availability of SR-IOV capable devices:
scheduler_default_filters = RetryFilter, AvailabilityZoneFilter, \
    RamFilter, DiskFilter, ComputeFilter, ComputeCapabilitiesFilter, \
    ImagePropertiesFilter, ServerGroupAntiAffinityFilter, \
    ServerGroupAffinityFilter, PciPassthroughFilter
On OpenStack controller nodes, the SR-IOV network driver is configured in
Neutron’s configuration /etc/neutron/plugins/ml2/ml2_conf.ini with the PCI
details of the NIC.
[ml2_sriov]
supported_pci_vendor_devs=15b3:1016
You must also edit /etc/neutron/plugins/ml2/ml2_conf.ini to set the
VLAN range from which to allocate tenant networks, as follows:
[ml2_type_vlan]
network_vlan_ranges = hpc_network:1001:4000, external
parameter_defaults:
  NeutronBridgeMappings: "hpc_network:br50g,external:br10g"
  NeutronNetworkType: "vxlan,vlan"
  NeutronMechanismDrivers: "sdnmechdriver,openvswitch,sriovnicswitch"
  NeutronNetworkVLANRanges: "hpc_network:1001:4000,external"
  NovaComputeExtraConfig:
    neutron::agents::ml2::ovs::bridge_mappings: ['external:br10g']
    nova::compute::pci_passthrough: '"[{\"vendor_id\":\"15b3\",\"device_id\":\"1015\",\"physical_network\":\"hpc_network\"}]"'
    compute_classes:
      - ::neutron::agents::ml2::sriov
  controllerExtraConfig:
    nova::scheduler::filter::scheduler_default_filters:
      - RetryFilter
      - AvailabilityZoneFilter
      - RamFilter
      - DiskFilter
      - ComputeFilter
      - ComputeCapabilitiesFilter
      - ImagePropertiesFilter
      - ServerGroupAntiAffinityFilter
      - ServerGroupAffinityFilter
      - PciPassthroughFilter
    neutron::config::plugin_ml2_config:
      ml2_sriov/supported_pci_vendor_devs:
        value: '15b3:1016'
Figure 13.2: TripleO configuration for enabling SR-IOV support. As described in the text,
PCI device details are configured, networking mechanisms are defined, and physical network
connectivity is mapped. The Nova scheduler is extended with PciPassthroughFilter.
This TripleO configuration enables SR-IOV support for an internal network
named hpc_network, using a defined range of VLANs. It also provides for Open
vSwitch and VXLAN networking, for general-purpose and externally connected
networking, as in figure 13.2.
Once OpenStack has been deployed with this configuration, SR-IOV network
ports can be easily created either by using the OpenStack command-line interface
or through an orchestrated deployment via OpenStack Heat. OpenStack refers to
SR-IOV network ports as direct-bound ports.
When using the command line, four steps are needed in order to create a
network for use with SR-IOV.
1. Create a VLAN network.
2. Assign an IP subnet to it.
3. Attach direct-bound (SR-IOV) ports to the network.
4. Create compute instances connected to those ports.
neutron net-create hpc_net --provider:network_type vlan
neutron subnet-create --name hpc_net --gateway <gw> \
    --dns-nameserver <dns> --enable-dhcp hpc_net <cidr>
neutron port-create <net vlan uuid> --binding:vnic_type direct
nova boot --flavor <flavor> --image <image> \
    --nic port-id=<sriov port uuid> <name>
Enabling CPU pinning
Associating virtual CPUs with physical cores can
improve cache performance through improved locality. This kernel command-
line boot parameter excludes a given range of CPUs from scheduling, effectively
reserving those CPUs for virtualized workloads. On a 24-core system, we might
assign four cores for hypervisor activities and reserve 20 cores for guest workloads.

isolcpus=4-23
Once the CPUs have been isolated from scheduling, they can be assigned in
Nova for CPU pinning. In /etc/nova/nova.conf insert the following.
vcpu_pin_set = 4-23
The Nova configuration can also be specified in TripleO-driven deployments.
parameter_defaults:
  NovaComputeExtraConfig:
    nova::compute::vcpu_pin_set: 4-23
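CPU pinning is then requested per flavor rather than globally; a minimal sketch, assuming a hypothetical flavor named m1.hpc:

# Instances of this flavor receive dedicated (pinned) host CPU cores
nova flavor-key m1.hpc set hw:cpu_policy=dedicated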
Enabling NUMA passthrough
Further performance gains can be achieved
by exposing the physical topology of processors, memory, and hardware devices. A
guest operating system that is NUMA-aware can exploit this awareness to improve
the efficiency of virtualized application workloads, in the same manner as for
workloads running in a bare metal environment. Passthrough of NUMA topology
is supported for current versions of KVM and libvirt, and (since OpenStack Juno)
OpenStack. The release of KVM that ships with CentOS (7.3) requires updating
to the version available in the CentOS-virt KVM repository. KVM 2.1.0 is the
minimum required version for supporting NUMA passthrough.
The NUMA topology requested by a compute instance is defined by using ad-
ditional properties of either a compute flavor or a software image. The OpenStack
documentation on CPU topologies [36] describes how flavors and images may be
configured to specify the underlying NUMA resources desired for supporting the
workload. In a similar manner to SR-IOV support, NUMA awareness requires a fil-
ter in the Nova compute scheduler to ensure that available resources meet the stated
requirements of the flavor of compute instance being scheduled. Scheduler filters are
specified in /etc/nova/nova.conf by setting the scheduler_default_filters
property, as in the following example.
scheduler_default_filters = RetryFilter, AvailabilityZoneFilter, \
    DiskFilter, ComputeFilter, ComputeCapabilitiesFilter, \
    ImagePropertiesFilter, ServerGroupAntiAffinityFilter, \
    ServerGroupAffinityFilter, PciPassthroughFilter, \
    RamFilter, NUMATopologyFilter
This TripleO config file deploys the Nova compute scheduler with NUMA
awareness enabled.
parameter_defaults:
  controllerExtraConfig:
    nova::scheduler::filter::scheduler_default_filters:
      - RetryFilter
      - AvailabilityZoneFilter
      - RamFilter
      - DiskFilter
      - ComputeFilter
      - ComputeCapabilitiesFilter
      - ImagePropertiesFilter
      - ServerGroupAntiAffinityFilter
      - ServerGroupAffinityFilter
      - PciPassthroughFilter
      - NUMATopologyFilter
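As noted above, the NUMA topology requested by an instance is expressed through flavor or image properties; a minimal sketch, again assuming a hypothetical m1.hpc flavor:

# Ask Nova to expose two NUMA nodes to guests booted from this flavor
nova flavor-key m1.hpc set hw:numa_nodes=2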
13.6 Summary
We have presented the key architectural components of the OpenStack system
and illustrated the basic concepts required to deploy the OpenStack software. In
keeping with the theme of this book, we have focused on scientific use cases and
provided tips on how to optimize performance.
OpenStack provides the benefits of software-defined infrastructure, while min-
imizing associated performance overheads. Diverse approaches to OpenStack
configuration and deployment yield a range of trade-offs between flexibility and
performance. The rapid pace of OpenStack’s evolution seems likely to ensure that
as the platform matures, its value for scientific compute infrastructure management
will become increasingly compelling.
13.7 Resources
Many OpenStack private clouds are deployed for research computing, and a
significant proportion of research computing private clouds are deployed using
freely available versions of OpenStack. The operators of these clouds depend on
an open community for support.
OpenStack operators of all forms exchange information and experiences through
the OpenStack Operators mailing list [39] and through regular meetups. The
OpenStack Foundation established the Scientific Working Group [40] as a focal
point for the specific interests of the research computing community. The Scientific
Working Group is free to join and draws its membership from a global network
of research institutions that are using OpenStack infrastructure to meet the
requirements of modern research computing.