DT Cloud Services cloud maintenance
In the DT Cloud Services cloud, maintenance of Network Function Virtualization Infrastructure (NFVI), comprising compute, storage and networking resources, is performed by automation and scheduled to minimize VM disruption.
Maintenance principles
The maintenance is based on the following principles:
VMs are evacuated or, if supported by the application, live migrated (LM) to another compute node (depending on the granularity of the automation).
By default, necessary maintenance is performed relying on the application redundancy that has been designed and implemented.
If possible, maintenance is performed whenever a SAZ or compute node is free.
After successful maintenance compute nodes are brought back into operation.
This maintenance process is automated and transparent to the user. Users whose VMs do not support live migration are informed about the planned maintenance hours. This does not mean that these VMs will be out of service during the whole maintenance period; DT Cloud Services is constantly optimizing maintenance operations to minimize VM downtime.
Maintenance of compute nodes is performed per SAZ, one node at a time, with a two-day grace period between SAZs. Within a SAZ, compute nodes (CN) are maintained sequentially, with a half-hour pause after each CN update. The maintenance process is illustrated in Figure 1.
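As a rough illustration of this sequencing, the sketch below walks through SAZs and their compute nodes in order, pausing half an hour between node updates and two days between SAZs. The inventory, function names (drain_and_update, bring_back_into_operation) and timings are hypothetical placeholders for illustration only, not part of the DT Cloud Services tooling.

```python
import time

# Hypothetical inventory: SAZs mapped to their compute nodes (illustration only).
INVENTORY = {
    "saz-1": ["cn-101", "cn-102", "cn-103"],
    "saz-2": ["cn-201", "cn-202"],
}

CN_PAUSE_SECONDS = 30 * 60            # half-hour pause after each CN update
SAZ_GRACE_SECONDS = 2 * 24 * 60 * 60  # two-day grace period between SAZs


def drain_and_update(cn: str) -> None:
    """Placeholder: evacuate or live migrate VMs off the node, then patch it."""
    print(f"maintaining {cn}")


def bring_back_into_operation(cn: str) -> None:
    """Placeholder: return the compute node to the scheduling pool."""
    print(f"{cn} back in operation")


def run_maintenance(inventory: dict[str, list[str]]) -> None:
    sazs = list(inventory)
    for i, saz in enumerate(sazs):
        for cn in inventory[saz]:
            drain_and_update(cn)
            bring_back_into_operation(cn)
            time.sleep(CN_PAUSE_SECONDS)   # sequential CNs within the SAZ
        if i < len(sazs) - 1:
            time.sleep(SAZ_GRACE_SECONDS)  # grace period before the next SAZ


if __name__ == "__main__":
    run_maintenance(INVENTORY)
```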
When an application is deployed on two or more VM instances in a RAZ with an anti-affinity rule applied, maintenance will never take down both (or all) VMs of the deployment at the same time.
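The small sketch below illustrates why this holds: with anti-affinity, no two VMs of a deployment share a compute node, so maintaining a single node affects at most one VM. The placement data and function names are hypothetical, used only to make the reasoning concrete.

```python
from collections import Counter

# Hypothetical placement report: VM name -> compute node hosting it (illustration only).
placement = {
    "app-vm-1": "cn-101",
    "app-vm-2": "cn-102",
    "app-vm-3": "cn-103",
}


def anti_affinity_holds(placement: dict[str, str]) -> bool:
    """True if no two VMs of the deployment share a compute node."""
    counts = Counter(placement.values())
    return all(n == 1 for n in counts.values())


def vms_affected_by_maintenance(placement: dict[str, str], node: str) -> list[str]:
    """VMs impacted while a single compute node is under maintenance."""
    return [vm for vm, cn in placement.items() if cn == node]


assert anti_affinity_holds(placement)
# With anti-affinity in place, at most one VM is affected per node maintenance.
assert len(vms_affected_by_maintenance(placement, "cn-102")) <= 1
```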
Availability design
When designing an architecture for a target availability, a good starting point is to look at how many failure domains (that is, what redundancy factor) the application has.
For most applications this is either two (primary and backup), three (for example, a database or other quorum-based system), or simply one (a legacy application allowed to have single points of failure). Accordingly, most VPCs use two, three or just a single availability zone.
As a general principle, the number of availability zones should be kept to a minimum: partitioning the capacity loses some economy of scale and may lead to routing problems, uneven load, or additional effort to track and maintain capacity in each availability zone.
Table 1 can be used to design an architecture that meets your availability requirements. The columns show the VM configuration (deployment model), the RAZ multiplicity, the AA and SAZ placement settings, the states causing application performance degradation, the states causing application downtime, and the minimum application availability. The figures are based on the following assumptions:
Geo-redundancy has been properly implemented
Disasters do not affect two RAZs at the same time
VM config | RAZ | AA | SAZ | App degradation | App down | App min availability
1 x VM | Single | No | No | - | CN maint., SAZ healing, RAZ down | 90%
2 x VMs | Single | Yes | No | CN maint. | SAZ healing, RAZ down | 99%
2 x VMs | Single | No | Yes | CN maint. | SAZ healing, RAZ down | 99%
3 x VMs | Single | Yes | No | CN maint., SAZ healing | RAZ down | 99%
3 x VMs | Single | No | Yes | CN maint., SAZ healing | RAZ down | 99%
(N >= 2) x VMs | Single | No | No | - | CN maint. (if VMs share the same node), SAZ healing, RAZ down | 90%
(N >= 3) x VMs | Single | Yes | Yes | CN maint., SAZ healing | RAZ down | 99%
2 x VMs (1 x RAZ) | Dual | - | - | CN maint., SAZ healing, RAZ down | - | 99%
4 x VMs (2 x RAZ) | Dual | Yes | No | CN maint., SAZ healing, RAZ down | - | 99.99%
4 x VMs (2 x RAZ) | Dual | No | Yes | CN maint., SAZ healing, RAZ down | - | 99.99%
2(N > 2) x VMs (N x RAZ) | Dual | Yes | No | CN maint., SAZ healing, RAZ down | - | 99.99%
2(N > 2) x VMs (N x RAZ) | Dual | No | Yes | CN maint., SAZ healing, RAZ down | - | 99.99%
2(N > 2) x VMs (N x RAZ) | Dual | Yes | Yes | CN maint., SAZ healing, RAZ down | - | 99.99%
Table 1. Application deployment models and corresponding availability.
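One way to read the dual-RAZ figures, given the geo-redundancy assumptions above and the single-RAZ availabilities in the table, is that the combined availability is one minus the product of the per-RAZ unavailabilities (independent failures). The short calculation below illustrates that reading; it is a sketch under those assumptions, not a statement of the DT Cloud Services availability model.

```python
def combined_availability(per_raz_availabilities: list[float]) -> float:
    """Availability of a geo-redundant deployment, assuming the application
    is down only when every RAZ is unavailable at the same time
    (independent RAZ failures)."""
    unavailability = 1.0
    for a in per_raz_availabilities:
        unavailability *= (1.0 - a)
    return 1.0 - unavailability


# 1 VM per RAZ: each RAZ alone offers ~90%, two RAZs together give ~99%.
print(combined_availability([0.90, 0.90]))   # ~0.99
# 2 VMs with anti-affinity per RAZ: each RAZ offers ~99%, two give ~99.99%.
print(combined_availability([0.99, 0.99]))   # ~0.9999
```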