The core components of any HA strategy

In his excellent article in Linux Technical Review #04, Jens-Christoph Brendel proposes a new way to implement High Availability (HA) in current IT architectures. According to Brendel, modern IT architectures continually gain in complexity, which makes it difficult to guarantee a given level of availability. Nevertheless, High Availability is not merely a competitive advantage: for many companies, keeping availability above 99.999% per year is a matter of survival. A few systematic steps should therefore help in planning and implementing High Availability in your IT environment. This article shows a possible strategy for planning High Availability in the MobileCloud environment.

Redundancy vs. Complexity

According to Brendel, every HA strategy starts with an evaluation of the degree of availability that each architecture component requires. Basically, availability can be increased by adding redundant components (as mentioned in my previous article). On the other hand, every new component makes the overall system more complex and adds another component that can fail. In short: there is always a trade-off between avoiding component outages and adding complexity (and possible points of failure) to the overall architecture through redundant components. For the OpenStack environment this means one has to classify the different OpenStack components according to the availability an OpenStack user requires.

AEC-classification proposal for OpenStack

One possible classification for IT components is the AEC classification developed by the Harvard Research Group. The AEC classes range from AEC-0 (non-critical systems, typically 90% availability) to AEC-5 (disaster-tolerant systems, 99.999% or “Five-Nines” availability). OpenStack basically consists of the following components: Nova (including Nova-Compute, Nova-Volume and Nova-Network), Horizon, Swift (ObjectStore), Glance, Cinder, Quantum and Keystone. A typical OpenStack end user has to deal with these components in order to handle their cloud installation. One has to think about the targeted availability levels of these components in order to know more about the overall stability of the OpenStack cloud environment. Some components need not be AEC-5, but for others AEC-5 is a must. The following table is a proposal of AEC classes for each of the OpenStack components.
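To make the availability percentages more tangible, the following minimal sketch converts an availability level into the downtime it allows per year. The function name and the list of levels are illustrative assumptions, not part of the AEC classification itself; the calculation ignores leap years.

```python
# Illustrative only: converts an availability percentage into allowed downtime.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, leap years ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the maximum downtime (minutes per year) for a given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical sample levels, from "one nine" up to "five nines".
    for level in (90.0, 99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(level)
        print(f"{level:>7}% -> {minutes:10.2f} min/year")
```

At 90% availability a component may be down for about 36.5 days per year, while “Five-Nines” leaves only a little more than five minutes.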

[Table: proposed AEC class for each OpenStack component]

Of course, the real availability architecture of a production OpenStack implementation also depends on how many OpenStack nodes are used and on the underlying virtual and even physical infrastructure, but this proposal serves as a good starting point for thinking about adequate levels of availability in production OpenStack architectures. How do we secure critical components like Nova or Keystone against failures? Any OpenStack HA strategy must focus on this question first.

Risk Management and the “Chaos Monkey”

The next steps towards developing an OpenStack HA strategy are risk identification and risk management. The risk of a component failure obviously depends on the underlying physical and virtual infrastructure of the current OpenStack implementation and also on the requirements of the end users. To investigate risk probabilities and impacts, however, we need to test what happens to the OpenStack cloud if some components fail. One such test is the “Chaos Monkey” test developed by Netflix. A “Chaos Monkey” is a service which identifies groups of systems in an IT architecture environment and randomly terminates some of them. The random termination shows what happens if some systems in a complex IT environment fail unexpectedly. The risk of component failures in an OpenStack implementation could be tested by using such Chaos Monkey services. By running multiple tests on multiple OpenStack configurations, one can learn whether the current architecture is able to reach the required availability level or not.
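The idea behind such a test can be sketched as a small simulation. This is not the Netflix tool and it does not touch any real system: it only models groups of redundant instances, terminates each instance with a given probability, and counts how often every group still has a survivor. The group names follow the components named in this article; the instance counts, the kill probability and all function names are assumptions made for illustration.

```python
import random

# Hypothetical inventory: service group -> number of redundant instances.
GROUPS = {"keystone": 2, "nova-compute": 3, "glance": 1}


def run_chaos_round(groups, kill_probability, rng):
    """Terminate each instance independently with the given probability.

    Returns the number of surviving instances per group.
    """
    survivors = {}
    for name, count in groups.items():
        survivors[name] = sum(
            1 for _ in range(count) if rng.random() >= kill_probability
        )
    return survivors


def estimate_availability(groups, kill_probability, rounds, seed=0):
    """Fraction of rounds in which every group keeps at least one instance."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    ok = 0
    for _ in range(rounds):
        survivors = run_chaos_round(groups, kill_probability, rng)
        if all(alive > 0 for alive in survivors.values()):
            ok += 1
    return ok / rounds


if __name__ == "__main__":
    share = estimate_availability(GROUPS, kill_probability=0.1, rounds=10_000)
    print(f"service survived {share:.1%} of the simulated rounds")
```

Running the simulation with different instance counts per group gives a first impression of which group limits the overall result; in this made-up inventory that would be the single Glance instance.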

Further thoughts

Should OpenStack improve in terms of availability and redundancy? According to TechTarget, the OpenStack Grizzly release should become more scalable and reliable than previous releases. A Chaos Monkey test could reveal whether the decentralization of components like Keystone or Cinder leads to enhanced availability levels.

Pacemaker: clusters to allow HA in OpenStack

OpenStack’s capabilities to support High Availability are very limited. If a virtual machine crashes, there is no automatic recovery. Clustering software seems to be a good workaround to provide redundancy and implement High Availability (HA).

Pacemaker is a scalable cluster resource manager developed by Clusterlabs. Its advantages are:

  • Support of many different deployment scenarios
  • Monitoring of resources
  • Recovery from outages

According to the OpenStack documentation website, the OpenStack HA environment builds on Pacemaker and Corosync. Corosync is Pacemaker’s messaging layer: it distributes cluster messages and keeps track of cluster membership. Pacemaker uses resource agents that manage the different resources and communicate via Corosync. One such resource is DRBD, which provides virtual block devices layered on top of the nodes’ own storage devices (hard disks etc.) and replicates their data between nodes. DRBD thus keeps the data synchronized across the cluster, while Pacemaker resource agents control the DRBD devices and Corosync carries the cluster communication. Together they are able to organize high availability of machine nodes in an OpenStack environment.
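The monitor-and-recover idea behind a cluster resource manager can be illustrated with a toy model. The sketch below is not Pacemaker’s API and does not talk to Corosync or DRBD; `Node`, `Resource` and `ToyCluster` are hypothetical names, and the placement rule (move to the first online node) is a deliberate simplification.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    online: bool = True


@dataclass
class Resource:
    name: str
    node: Node  # node currently running the resource


@dataclass
class ToyCluster:
    nodes: list
    resources: list
    log: list = field(default_factory=list)

    def monitor(self):
        """Return the resources whose current node is offline."""
        return [r for r in self.resources if not r.node.online]

    def recover(self):
        """Move each failed resource to the first online node, if any."""
        for res in self.monitor():
            target = next((n for n in self.nodes if n.online), None)
            if target is None:
                self.log.append(f"{res.name}: no online node left")
                continue
            self.log.append(f"{res.name}: {res.node.name} -> {target.name}")
            res.node = target


if __name__ == "__main__":
    node_a, node_b = Node("node-a"), Node("node-b")
    cluster = ToyCluster([node_a, node_b], [Resource("keystone", node_a)])
    node_a.online = False  # simulated outage
    cluster.recover()
    print(cluster.log)
```

A real resource manager additionally has to deal with quorum, fencing and data consistency, which is where the messaging layer and replicated storage come in.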

Integration of Pacemaker into OpenStack is a major step towards creating an HA cloud environment. There’s an ongoing evaluation of how Pacemaker fits into the MobileCloud environment, but it is clear that there should be a test procedure to evaluate the availability of cloud resources in different integration scenarios. Follow-up information on this subject will be posted in a later blog post.

High Availability on OpenStack

Motivation for OpenStack High Availability

ICCLab’s MobileCloud Networking solution is supposed to offer private cloud services to end users. MobileCloud is based on OpenStack. Since our OpenStack installation is supposed to be used mainly by end users, it is necessary to provide High Availability.

As mobile end users we all know that we want our IT services to be available anytime and anywhere: 24 hours per day, 7 days per week, 365 days per year. End users rarely consider that this requirement is a challenge for the system architects, developers and engineers who offer the IT services. Cloud components must be kept under regular maintenance to remain stable and secure. While performing maintenance changes, engineers have to shut down components. At the same time, the service should remain available for the end user. Achieving High Availability in a cloud environment is a very complex and challenging task.

Requirements for OpenStack High Availability

Delivering High Availability in an OpenStack environment comes with several requirements:

  • Availability of a cloud service is the result of the availability of all its participating components. An app hosted in the cloud is only available if its supporting OS is available. The OS is only available if its underlying virtual or physical server is available. And everything breaks down if the network devices between service user and service provider break down. If one crucial component participating in the service fails, the whole service becomes unavailable. Therefore “High Availability on OpenStack” means High Availability on all components managed by OpenStack.
  • To maintain availability of service components, it is necessary to implement redundancy. If a crucial service component fails, a redundant component must take over its function to maintain availability of the service.
  • There’s a trade-off between redundancy and costs: if you establish redundancy for a MobileCloud service by doubling its components, you roughly double the costs of the service, while the availability gain depends on how likely it is that both copies fail at the same time.
  • 100% availability is an illusion, since no service component can be available all the time. A better solution is to define availability levels or classes of availability for every component that specify the tolerable downtime of service components. Availability classes have to be assigned to service components according to their importance to the total availability of the service.
  • High Availability is related to the concept of Event Management. Service components must be able to react to events that could lead to outages in order to maintain their stability.
  • High Availability closely depends on monitoring tools. High Availability can only be implemented if outages and events which are harmful to the availability of components can be monitored. The High Availability on OpenStack project depends on the Monitoring on OpenStack project.
  • The High Availability solution for the OpenStack installation must contain the following parts: an architectural overview of all components (virtual and physical servers, network devices, operating systems, subsystems and software) that are crucial for service operation, an assignment of availability levels for all those components, redundant components, a monitoring tool that captures events (traffic, load, outage etc.) and an event management system that reacts to events.
  • Availability information of the monitored resources must be assignable to their tenant.
  • The metered values must be collected and correlated automatically.
  • The collection of values must be able to trigger events.
  • The event management system must be able to drive changes (e.g. switch traffic to a redundant device) in the service architecture and reconfigure components automatically.
  • The monitoring tool and event management system must be as generic as possible to ensure support of any device.
  • The monitoring tool and event management system must offer an API.
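The first requirements in the list above (availability as the result of all participating components, and redundancy as the way to raise it) can be put into numbers with a minimal sketch. It assumes that components fail independently of each other, which real systems only approximate, and the function names and the 0.99 figures are chosen purely for illustration.

```python
from math import prod


def serial_availability(availabilities):
    """All components are required, so their availabilities multiply."""
    return prod(availabilities)


def redundant_availability(availability, copies):
    """At least one of `copies` independent, identical components must be up."""
    return 1.0 - (1.0 - availability) ** copies


if __name__ == "__main__":
    # Hypothetical chain: app, OS, server and network, each 99% available.
    chain = [0.99, 0.99, 0.99, 0.99]
    print(f"chain of four:     {serial_availability(chain):.6f}")
    print(f"one component x 2: {redundant_availability(0.99, 2):.6f}")
```

Under these assumptions a chain of four components at 99% each ends up at roughly 96.1%, while duplicating a single 99% component raises it to 99.99%. Redundancy therefore reduces the expected downtime sharply, but availability itself can never double.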

Architecture

[Figure: OpenStack High Availability Architecture]

As-is state

Currently, an extended version of the Ceilometer monitoring tool is used for the OpenStack environment of the ICCLab. An evaluation of possible Event Management functionality is under way. There is also an ongoing evaluation of solutions that implement redundancy in OpenStack.
