DT Cloud Services cloud maintenance

In the DT Cloud Services cloud, maintenance of Network Function Virtualization Infrastructure (NFVI), comprising compute, storage and networking resources, is performed by automation and scheduled to minimize VM disruption.

Maintenance principles

The maintenance is based on the following principles:

  • VMs are evacuated or, if supported by the application, live migrated (LM) to another compute node (depending on the granularity of the automation).

  • By default, necessary maintenance is performed relying on the redundancy designed and implemented at the application level.

  • If possible, maintenance is performed whenever a SAZ or compute node is free.

  • After successful maintenance, compute nodes are brought back into operation.

This maintenance process is automated and transparent to the user. Users whose VMs do not support live migration are informed about the planned maintenance hours; this does not mean that those VMs will be out of service for the whole maintenance period. DT Cloud Services continuously optimizes maintenance operations to minimize VM downtime.

The maintenance of compute nodes is done per SAZ, node by node. There is a two-day grace period between the maintenance of consecutive SAZs. Within a SAZ, compute nodes (CN) are maintained sequentially, with a half-hour pause after each CN update. The maintenance process is illustrated in Figure 1.

Figure 1. DT Cloud Services NFVI maintenance process.
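As a rough illustration of how the sequencing described above adds up, the sketch below estimates the length of one full maintenance round from the number of SAZs, the number of compute nodes per SAZ, and an assumed per-node maintenance time. The per-node duration and the counts are placeholder assumptions, not published DT Cloud Services figures.

```python
from datetime import timedelta

def estimate_maintenance_window(
    saz_count: int,
    nodes_per_saz: int,
    per_node_maintenance: timedelta,               # assumed value, not a published figure
    pause_after_node: timedelta = timedelta(minutes=30),
    grace_between_saz: timedelta = timedelta(days=2),
) -> timedelta:
    """Rough upper bound for a full NFVI maintenance round.

    Within each SAZ, compute nodes are maintained sequentially with a pause
    after each node; a grace period separates consecutive SAZs.
    """
    per_saz = nodes_per_saz * (per_node_maintenance + pause_after_node)
    return saz_count * per_saz + (saz_count - 1) * grace_between_saz

# Example: 3 SAZs with 20 compute nodes each, assuming 1 hour per node.
print(estimate_maintenance_window(3, 20, timedelta(hours=1)))  # 7 days, 18:00:00
```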

When an application is deployed on two or more VM instances in a RAZ with an anti-affinity (AA) rule applied, maintenance will never take all of the deployment's VMs down at the same time.
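The benefit of the anti-affinity rule is precisely this placement guarantee: no two VMs of the deployment share a compute node, so node-level maintenance can only affect one of them at a time. The sketch below is a minimal, self-contained illustration of that invariant; the data structures and names are hypothetical and not part of any DT Cloud Services or OpenStack API.

```python
from collections import Counter

def violates_anti_affinity(placement: dict[str, str]) -> list[str]:
    """Return the compute nodes that host more than one VM of a deployment.

    `placement` maps VM name -> compute node. With an anti-affinity rule
    applied, the returned list should always be empty, which is what keeps
    sequential per-node maintenance from taking the whole deployment down.
    """
    hosts = Counter(placement.values())
    return [node for node, count in hosts.items() if count > 1]

# Two VMs spread over two compute nodes in the same RAZ: rule satisfied.
print(violates_anti_affinity({"app-0": "cn-07", "app-1": "cn-12"}))   # []
# Both VMs on one node: maintaining that single node stops the whole app.
print(violates_anti_affinity({"app-0": "cn-07", "app-1": "cn-07"}))   # ['cn-07']
```

In OpenStack-based infrastructures such a constraint is typically requested at deployment time via an anti-affinity server group; the check above only illustrates what that rule guarantees.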

Availability design

When designing an architecture for a targeted availability, a good starting point is to look at how many failure domains (that is, what redundancy factor) the application has.

For most applications this is either two (primary and backup), three (for example, a database or another quorum-based system) or simply one (a legacy application allowed to have single points of failure). Therefore, most VPCs use one, two or three availability zones.

As a general principle, the number of availability zones should be kept to a minimum: partitioning the capacity sacrifices some economy of scale and can lead to routing problems, uneven load or additional effort to track and maintain capacity in each availability zone.
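For a first feasibility check, the usual independence approximation can be applied: if the application stays up as long as at least one of its n failure domains is up, its availability is roughly 1 - (1 - a)^n for a per-domain availability a. The sketch below applies that textbook formula; it illustrates the design reasoning and is not the model behind the figures in Table 1.

```python
def redundant_availability(per_domain_availability: float, domains: int) -> float:
    """Availability of a deployment that survives as long as at least one of
    `domains` independent failure domains is available.

    Uses the standard independence approximation 1 - (1 - a)^n; real-world
    availability also depends on failover behaviour, correlated failures
    and maintenance scheduling.
    """
    return 1.0 - (1.0 - per_domain_availability) ** domains

# A single 99% failure domain vs. two and three redundant 99% domains.
for n in (1, 2, 3):
    print(f"{n} failure domain(s): {redundant_availability(0.99, n):.6%}")
```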

Table 1 can be used to design an architecture that meets your availability requirements. The columns show the VM configuration (deployment model), the RAZ multiplicity, the anti-affinity (AA) and SAZ placement settings, the states causing application performance degradation, the states causing application downtime, and the resulting minimum availability. The figures are based on the following assumptions:

  • Geo-redundancy has been properly implemented

  • Disasters do not affect two RAZs at the same time

| VM config | RAZ | AA | SAZ | App degradation | App down | App min availability |
|---|---|---|---|---|---|---|
| 1 x VM | Single | No | - | - | CN maint., SAZ healing, RAZ down | 90% |
| 2 x VMs | Single | Yes | No | CN maint. | SAZ healing, RAZ down | 99% |
| 2 x VMs | Single | No | Yes | CN maint. | SAZ healing, RAZ down | 99% |
| 3 x VMs | Single | Yes | No | CN maint., SAZ healing | RAZ down | 99% |
| 3 x VMs | Single | No | Yes | CN maint., SAZ healing | RAZ down | 99% |
| (N >= 2) x VMs | Single | No | No | - | CN maint. (if VMs share same node), SAZ healing, RAZ down | 90% |
| (N >= 3) x VMs | Single | Yes | Yes | CN maint., SAZ healing | RAZ down | 99% |
| 2 x VMs (1 x RAZ) | Dual | - | - | CN maint., SAZ healing, RAZ down | - | 99% |
| 4 x VMs (2 x RAZ) | Dual | Yes | No | CN maint., SAZ healing, RAZ down | - | 99.99% |
| 4 x VMs (2 x RAZ) | Dual | No | Yes | CN maint., SAZ healing, RAZ down | - | 99.99% |
| 2(N > 2) x VMs (N x RAZ) | Dual | Yes | No | CN maint., SAZ healing, RAZ down | - | 99.99% |
| 2(N > 2) x VMs (N x RAZ) | Dual | No | Yes | CN maint., SAZ healing, RAZ down | - | 99.99% |
| 2(N > 2) x VMs (N x RAZ) | Dual | Yes | Yes | CN maint., SAZ healing, RAZ down | - | 99.99% |

Table 1. Application deployment models and corresponding availability.
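
In practice, Table 1 works as a lookup: given a target availability, pick the simplest deployment model whose minimum availability meets it. The sketch below encodes a few representative rows (values as listed in the table) purely for illustration; it is not an official sizing tool, and the row labels are shortened.

```python
# A few representative rows of Table 1 (minimum availability as listed above).
DEPLOYMENT_MODELS = [
    ("1 x VM, single RAZ, no AA", 0.90),
    ("2 x VMs, single RAZ, AA applied", 0.99),
    ("3 x VMs, single RAZ, AA applied", 0.99),
    ("4 x VMs, dual RAZ, AA applied", 0.9999),
]

def simplest_model_for(target_availability: float) -> str | None:
    """Return the first (simplest) deployment model meeting the target."""
    for model, availability in DEPLOYMENT_MODELS:
        if availability >= target_availability:
            return model
    return None

print(simplest_model_for(0.995))   # -> '4 x VMs, dual RAZ, AA applied'
print(simplest_model_for(0.99))    # -> '2 x VMs, single RAZ, AA applied'
```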