Motivation for OpenStack High Availability
ICCLab’s MobileCloud Networking solution is supposed to offer private cloud services to end users. MobileCloud is based on OpenStack. Since our OpenStack installation is supposed to be used mainly by end users, it is necessary to provide High Availability.
As mobile end users we all know that we want our IT services to be available everytime and everywhere – 24 hours per day, 7 days per week, 365 days per year. End users normally don’t reflect that this requirement is challenge for system architects, developers and engineers who offer the IT services. Cloud components must be kept under regular maintenance to remain stable and secure. While performing maintenance changes, engineers have to shut down components. At the same time the service should still remain available for the end user. Achieving High Availability in a cloud environment is a very complex and challenging task.
Requirements for OpenStack High Availability
For delivering High Availability on an OpenStack environment there are different requirements:
- Availability of a cloud service is the result of the availability of all its participating components. An app hosted in the cloud is only available if its supporting OS is available. The OS is only available if its underlying virtual or physical server is available. And everything breaks down if the network devices between service user and service provider break down. If one crucial component participating in the service fails, the whole service becomes unavailable. Therefore “High Availability on OpenStack” means High Availability on all components managed by OpenStack.
- To maintain availability of service componenets, it is necessary to implement redundancy. If a crucial service component fails, a redundant component must take over its function to maintain availability of the service.
- There’s a trade off between redundancy and costs: if you establish redundancy of MobileCloud service by doubling its components you double the overall availability of the service, but you also double the costs of the service.
- 100%-Availability is an illusion since no service component can be available all the time. A better solution is to define availability levels or classes of availability for every component that define the possible idle time of service components. Availability classes have to be assigned to service components according to their importance to the total availability of the service.
- High Availability is related to the concept of Event Management. An event is Service components must be able to react to events that could lead to outages in order to maintain their stability.
- High Availability closely depends on monitoring tools. High Availability can only be implemented if outages and events which are harmful to availability of components can be monitored. The High Availability on OpenStack project depends on Monitoring on OpenStack project.
- The High Availability solution for the OpenStack installation must contain the following parts: architecturial overview of all components (virtual and physical servers, network devices, operating systems, subsystems and software) that are crucial for service operation, assignment of availability levels for all those components, redundant components, a monitoring tool that captures events (traffic, load, outage etc.) and an event management systems that reacts to events.
- Availability information of the monitored resources must be assignable to its tenant.
- The metered values must be collected and correlated automatically.
- The collection of values must be able to trigger events.
- The event management system must be able to drive changes (e. g. switch traffic to a redundant device) in the service architecture and reconfigure components automatically.
- Monitoring tool and event management system must be as generic as possible to ensure support of any device.
- The monitoring tool and event management system must offer an API.
Currently an extended version of the Ceilometer monitoring tool is used for the OpenStack environment of the ICCLab. An evaluation of possible Event Management functionality is currently performed. There is also an ongoing evaluation on solutions that implement redundancy in OpenStack.