Availability
The DT Cloud Services infrastructure cloud (IC) is based on OpenStack and has availability zones implemented as defined by the OpenStack standard.
Regions and availability zones
An OpenStack availability zone is a logical partitioning of the infrastructure visible to the user. The partitioning does not need to correspond to the physical distribution of infrastructure, but is rather a representation based on the level of service resilience that can be ensured.
Availability zones are based on aggregates, which are specific sets of hosts with labels (metadata) attached, and give the user freedom to allocate instances to chosen zones or sets of hosts. A host can only be in one availability zone, and any host is part of a default availability zone even if it does not belong to an aggregate.
DT Cloud Services uses a modification of the OpenStack availability concepts. The OpenStack concepts are illustrated in Figure 1 and comprise:
Region (or Deployment)
Sites (or Data Centers)
Availability Zone
Each Region has its own OpenStack deployment, including API endpoints, networks and compute resources. Different regions share Keystone and Horizon services, providing access control and the web interface, respectively.
Within each region, compute nodes (hosts) are logically grouped into Availability Zones (AZ). When an instance is created, the user can specify the AZ where it should be instantiated. It may even be possible to specify the particular node inside an AZ on which the instance should run.
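As an illustration of how a zone is selected at instance creation, the following minimal sketch builds the JSON body of an OpenStack Compute "create server" request. The availability_zone field and the ZONE:HOST form are part of the OpenStack Compute API, but requesting a specific host is normally restricted to administrators. The helper name build_server_request, the zone name saz-1 and the placeholder IDs are assumptions made for this example, not DT Cloud Services values.

```python
import json


def build_server_request(name, image_id, flavor_id, network_id,
                         zone=None, host=None):
    """Build the JSON body for a Compute API 'create server' request.

    zone and host are optional. When both are given they are combined
    as 'ZONE:HOST', which is usually allowed for administrators only.
    """
    if host and not zone:
        raise ValueError("a host can only be requested together with a zone")
    server = {
        "name": name,
        "imageRef": image_id,
        "flavorRef": flavor_id,
        "networks": [{"uuid": network_id}],
    }
    if zone:
        server["availability_zone"] = f"{zone}:{host}" if host else zone
    return {"server": server}


# Hypothetical zone name and placeholder IDs, for illustration only.
body = build_server_request(
    "web-1", "IMAGE_ID", "FLAVOR_ID", "NETWORK_ID", zone="saz-1"
)
print(json.dumps(body, indent=2))
```

If no zone is given, the field is omitted and the scheduler chooses a zone according to the deployment's defaults.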
In the DT Cloud Services cloud, an enhanced availability concept is used, as illustrated in Figure 2. The terminology is defined as follows:
Regions and Regional Availability Zones
Sites (or Data Centers)
Site Availability Zones
In the DT Cloud Services IC, each region consists of at least two geographically redundant data centers, which constitute the regional availability zones. Each site (that is, data center) has three site availability zones with physically separate compute resources but a shared control plane. Services requiring high availability should be deployed using both regional and site availability zones.
A region corresponds to a country or part of a country. It represents a cluster of geographically close sites constituting Regional Availability Zones (RAZ). A site's membership in a RAZ is driven by
Country geography and site distances (to ensure low intra-region latency)
Population density and service demand (for load distribution)
DT Cloud Services IC control plane scalability limits
A Site is an OpenStack deployment. At present, each DT Cloud Services data center has its own OpenStack deployment, so a site can be considered synonymous with a data center.
A Site Availability Zone (SAZ) is the same as an OpenStack availability zone of a single environment or deployment.
Anti-affinity rules
The anti-affinity (AA) rule is a setting that ensures that VMs supporting the same application are deployed on different hardware platforms (that is, compute nodes). This provides spatial diversity in the implementation, which in turn increases the reliability of the application. An AA rule can be set independently of, or together with, a deployment across different availability zones.
The procedure for applying anti-affinity rules by creating server groups is described in the how-to "Create (anti-)affinity group".
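For orientation, the sketch below shows the two request bodies involved when anti-affinity is applied through the OpenStack Compute API: one that creates a server group, and one that attaches a new server to that group via a scheduler hint. The "policies" list form shown here is the one used before Compute API microversion 2.64 (newer microversions use a single "policy" field). The helper names and the placeholder group ID are assumptions for this example. The how-to referenced above remains the authoritative procedure.

```python
def build_server_group_request(name, policy="anti-affinity"):
    """Body for POST /os-server-groups (pre-2.64 'policies' form)."""
    if policy not in ("affinity", "anti-affinity"):
        raise ValueError(f"unsupported policy: {policy}")
    return {"server_group": {"name": name, "policies": [policy]}}


def with_server_group(server_request, group_id):
    """Return a copy of a 'create server' body that joins the given group."""
    request = dict(server_request)
    request["os:scheduler_hints"] = {"group": group_id}
    return request


group_body = build_server_group_request("app-aa-group")

# GROUP_ID stands for the UUID returned when the group is created.
server_body = with_server_group(
    {"server": {"name": "app-1", "imageRef": "IMAGE_ID", "flavorRef": "FLAVOR_ID"}},
    "GROUP_ID",
)
print(group_body)
print(server_body)
```

Every server created with the same group hint is scheduled onto a different compute node, or the request fails if no suitable node is left.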
DT Cloud Services cloud availability
The DT Cloud Services cloud topology allows the design of high-availability applications when they are distributed across availability zones. The purpose of having availability zones is to be able to avoid or mitigate service impact by identifying single points of failure. These are points where a single event causes the service to fail (leading to server downtime and non-availability).
Critical faults and operational events can be classified as
Random failures - component or link failures due to technical malfunction or accidents. The effect of this type of failure is mitigated by implementing the service with geographical redundancy, for example in different availability zones.
Planned maintenance - scheduled maintenance can be mitigated by proper operational countermeasures. Each availability zone has a defined maintenance plan, so distributing a service across availability zones increases server uptime.
By using regional availability zones, the DT Cloud Services cloud targets a service availability of 99.99% for any region. Based on this target figure, zone availability figures can be computed by assuming independent failures of each logical component.
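With two (or more) sites per region, a region is unavailable only when all of its RAZs are unavailable at the same time. For two sites, the availability requirement A per RAZ is therefore given by
1 - (1 - A)*(1 - A) = 0.9999
which yields 1 - A = 0.01, or A = 99% for any single RAZ.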
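The relation above can be solved for A for any number of redundant zones. The following minimal sketch does this under the same assumption of independent failures. The function name required_zone_availability is chosen for this example only.

```python
def required_zone_availability(target, zones):
    """Per-zone availability A that satisfies 1 - (1 - A)**zones == target.

    Assumes that the zones fail independently of each other.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if zones < 1:
        raise ValueError("at least one zone is required")
    return 1.0 - (1.0 - target) ** (1.0 / zones)


# Two RAZs per region and a regional target of 99.99%.
print(f"{required_zone_availability(0.9999, 2):.4f}")  # 0.9900
```

With more than two zones, the requirement on each individual zone becomes lower for the same regional target.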
Repeating the same argument for two (or more) VMs per site, that is, requiring 1 - (1 - a)*(1 - a) = 0.99 for a VM availability a, we see that the required VM availability is at least 90%. The nominal availability figures for DT Cloud Services cloud components are listed in Table 1.
Zone | Target availability |
Region | 99.99% |
RAZ | 99.01% |
SAZ | > Single instance VM |
Table 1. Availability of zones.
With a VM availability of at least 90% and a proper distribution of VMs across availability zones, highly resilient applications can be built with an availability similar to that of a region, that is, 99.99%. For still higher availability, a multi-regional design must be used. These principles are illustrated in Figure 3.
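The way the figures build up from VM to RAZ to region can be checked with a short calculation. The sketch below treats redundant components as independent and working in parallel, as assumed in the derivation above. The function name and the layout of two VMs in each of two RAZs are illustrative assumptions.

```python
def parallel_availability(availabilities):
    """Availability of redundant, independently failing components.

    The group is unavailable only when every component is unavailable.
    """
    unavailable = 1.0
    for a in availabilities:
        unavailable *= 1.0 - a
    return 1.0 - unavailable


vm = 0.90                                    # single VM
raz = parallel_availability([vm, vm])        # two VMs in one RAZ
region = parallel_availability([raz, raz])   # two RAZs in one region
print(f"RAZ: {raz:.4f}, region: {region:.4f}")  # RAZ: 0.9900, region: 0.9999
```

The same function can be applied one level up, to two regions, to estimate the effect of a multi-regional design.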
Availability SLA
The Region SLA for Compute applies to deployments with a minimum of 4 VM instances distributed over a minimum of 2 RAZs, where the VM instances in each RAZ are placed in separate SAZs and/or an anti-affinity rule is applied.
The Regional Availability Zone SLA for Compute applies to deployments with a minimum of 2 VM instances in that RAZ, where the VM instances are placed in separate SAZs and/or an anti-affinity rule is applied.
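The two conditions above can be expressed as a simple check over a planned deployment. The sketch below is one possible reading only: it assumes that "separate SAZ and/or anti-affinity" is satisfied when, within a RAZ, either every VM is in a different SAZ or every VM is covered by an anti-affinity rule. The class, the function names and the zone labels are hypothetical, and the SLA text itself remains authoritative.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    raz: str                     # regional availability zone (site)
    saz: str                     # site availability zone within that RAZ
    anti_affinity: bool = False  # True if covered by an anti-affinity rule


def placement_ok(group):
    """Assumed reading of 'separate SAZ and/or anti-affinity rule'."""
    separate_saz = len({vm.saz for vm in group}) == len(group)
    all_anti_affinity = all(vm.anti_affinity for vm in group)
    return separate_saz or all_anti_affinity


def raz_sla_applies(instances, raz):
    """At least 2 VMs in the given RAZ, with acceptable placement."""
    group = [vm for vm in instances if vm.raz == raz]
    return len(group) >= 2 and placement_ok(group)


def region_sla_applies(instances):
    """At least 4 VMs over at least 2 RAZs, acceptable placement in each."""
    by_raz = defaultdict(list)
    for vm in instances:
        by_raz[vm.raz].append(vm)
    if len(instances) < 4 or len(by_raz) < 2:
        return False
    return all(placement_ok(group) for group in by_raz.values())


deployment = [
    Instance("app-1", "raz-a", "saz-1"),
    Instance("app-2", "raz-a", "saz-2"),
    Instance("app-3", "raz-b", "saz-1"),
    Instance("app-4", "raz-b", "saz-2"),
]
print(region_sla_applies(deployment))        # True
print(raz_sla_applies(deployment, "raz-a"))  # True
```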
Downtime means:
For VM Instances: Loss of external connectivity or persistent disk access for all running VM instances.
Downtime does not include loss of external connectivity as a result of
Failure of NatCo-specific private peering; that sort of downtime is addressed exclusively in the MPLS NNI SLA
Failures outside DT Cloud Services network responsibility.
The availability period is one month.
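To relate an availability figure to the monthly availability period, it can be converted into a downtime budget. The sketch below assumes a 30-day month (43,200 minutes). The actual length of the measurement period is not specified here, so the default is an assumption, as are the function names.

```python
def allowed_downtime_minutes(availability, days_in_month=30):
    """Downtime budget in minutes for one availability period."""
    period_minutes = days_in_month * 24 * 60
    return (1.0 - availability) * period_minutes


def measured_availability(downtime_minutes, days_in_month=30):
    """Availability achieved in one period for a given total downtime."""
    period_minutes = days_in_month * 24 * 60
    return 1.0 - downtime_minutes / period_minutes


# Regional target of 99.99% over an assumed 30-day month.
print(f"{allowed_downtime_minutes(0.9999):.2f} min")  # 4.32 min
```

Only downtime as defined above counts against this budget. Events covered by the exclusions are left out of the total.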