Use pacemaker and corosync on Illumos (OmniOS) to run an HA active/passive cluster

In the Linux world, a popular approach to building highly available clusters is a set of software tools that includes pacemaker (as the resource manager) and corosync (as the group communication system), plus the libraries they depend on and some configuration utilities.

On Illumos (and in our particular case, OmniOS), the ihac project is abandoned and I couldn't find any other mature, platform-specific open source clustering framework. Porting pacemaker to OmniOS is an option, and this post is about our experience with this task.

The objective of the post is to describe how to get an active/passive pacemaker cluster running on OmniOS and to test it with a Dummy resource agent. The specific use case (or test case) is not relevant; what matters in a correctly configured cluster is that, if the node running the Dummy resource (the active node) fails, that resource fails over and is started on the other node (high availability). A minimal sketch of the test configuration follows the outline below.

I will assume you are starting from a fresh installation of OmniOS 151012 with a working network configuration (and ssh, for your comfort!). Check the general administration guide if needed.

This is what we will cover:

  • Configuring the machines
  • Patching and compiling the tools
  • Running pacemaker and corosync from SMF
  • Running an active/passive cluster with two nodes to manage the Dummy resource
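
As a preview of where we will end up, the test configuration can be sketched roughly as follows. This assumes the crm shell has been ported alongside pacemaker (the resource name and intervals are only examples), and it disables STONITH and quorum handling, which is acceptable only for a simple two-node test:

    # Hypothetical two-node test configuration, entered via the crm shell
    crm configure property stonith-enabled=false
    crm configure property no-quorum-policy=ignore
    crm configure primitive p_dummy ocf:pacemaker:Dummy \
        op monitor interval=10s timeout=20s

If the node currently running p_dummy fails, corosync notices the membership change and pacemaker starts the resource on the surviving node.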


OpenStack HA: why is Pacemaker such a slow recovery tool?

If you have ever tried to implement High Availability in OpenStack using Pacemaker, you might be disappointed by Pacemaker's extremely slow recovery speed. Pacemaker recovers OpenStack at a very low pace, and even worse: it sometimes reports outages that have not actually occurred. As a result, Pacemaker starts unnecessary, computationally intensive recovery actions which are very slow and decrease OpenStack's availability. This article describes why Pacemaker recovery actions are sometimes slow and what can be done about it.

Pacemaker is distributed software that monitors and controls the execution of programs or services on the different computers of a cluster. The controlled services are called "resources", and Pacemaker needs a "resource agent" interface in order to be able to manage a resource. Resource management actions are performed by programs that run locally on each computer of the cluster: the "Local Resource Management Daemons" (LRMDs). LRMDs can monitor the execution of services and restart them in case of failure. The LRMD actions are orchestrated by the "Cluster Resource Manager" (CRM). LRMDs know how to manage resources (from the resource agent specifications), but they do not monitor, stop or restart local services autonomously: the CRM tells them when and at what interval to perform failover actions. The CRM is configured through a distributed XML file: the "Cluster Information Base" (CIB). The CIB contains all the information needed to orchestrate the LRMD actions. Communication between the CRM and the LRMDs is handled by a "Cluster Communication Manager" (CCM); typical CCMs used in combination with Pacemaker are Corosync and Heartbeat.
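
To make this concrete, a resource definition inside the CIB might look like the following fragment; the id, provider and timing values are made up for illustration:

    <!-- Hypothetical CIB fragment: one resource with a 30-second monitor operation -->
    <primitive id="p_keystone" class="ocf" provider="openstack" type="keystone">
      <operations>
        <op id="p_keystone-monitor-30s" name="monitor" interval="30s" timeout="30s"/>
      </operations>
    </primitive>

The CRM reads such definitions from the CIB and instructs the LRMD on each node when to run the corresponding monitor, start and stop actions.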

Fig. 1: OpenStack HA with Pacemaker.

OpenStack can be made highly available by installing redundant OpenStack services (Keystone, Nova, Glance etc.) on different machines and letting Pacemaker control the execution of these services. Custom resource agents must be installed so that the LRMDs can manage OpenStack resources, and the CIB must then be configured so the CRM can orchestrate the LRMD actions. An example of such an OpenStack HA architecture using Pacemaker is shown in Fig. 1.

Why is Pacemaker slow?

Pacemaker failover actions can sometimes be very slow. There are several possible reasons why Pacemaker recovery of OpenStack is such a time-consuming task; the most common ones are these:

  • Suboptimal initialization scripts: OpenStack services do not write their process ID (pid) to a pid file by default. Pacemaker is therefore not able to identify OpenStack services as manageable entities or resources, and some hacking is necessary to make them Pacemaker-compliant.
  • Missing resource agents: no OCF-compliant OpenStack resource agents are delivered out of the box, so Pacemaker's Local Resource Management Daemons (LRMDs) are not able to manage OpenStack services.
  • Bad Cluster Information Base (CIB) configuration: the worst offender is a messy CIB configuration. If, for example, resources are kept in large groups and monitoring intervals are too long to discover outages quickly, recovery will be very slow, because Pacemaker has to recover large resource groups and recovery actions are started late.

What can be done to make Pacemaker faster?

The first and most important step to make Pacemaker recovery faster is to identify the cause of the slowness. Once you have done that, you can take one of the following actions:

  • Optimize initialization scripts: depending on your init system (SysV init, Upstart, systemd), you must customize how services are started so that they generate pid files, which help Pacemaker identify the service on the system. On Ubuntu, OpenStack services are started as Upstart jobs; if you run OpenStack on Ubuntu, you must customize those jobs so they generate pid files automatically. This is done by changing the configuration files in /etc/init. For the quantum server, for example, you have to change /etc/init/quantum-server.conf so that it tells Upstart to create a pid file and place it in a specified folder (typically /var/run). Creation of pid files can be performed with start-stop-daemon; see its manpage for more information (a sketch follows this list).
  • Create custom resource agents: there are no OpenStack resource agents delivered out of the box, but you can write them yourself. Resource agents must be placed under /usr/lib/ocf/resource.d/ and must implement methods to start, stop and monitor a service as well as report its execution status. Some good examples of OpenStack resource agents can be found on the Hastexo website (a skeleton is sketched after this list).
  • Improve the Cluster Information Base (CIB) configuration: most improvements can be made in the CIB. Ideally, OpenStack services should run redundantly at the same time on two different OpenStack nodes which are reached through a shared virtual IP. In case of a service failure on one node, Pacemaker then only has to route traffic to the node where the service is still running. If the service is not already running on the fallback node before the failure occurs, Pacemaker has to start it there first, and a small switch-over is usually much faster than starting a whole service. Redundant nodes should therefore always keep redundant OpenStack services up and running, and it is really important to configure this parallel execution (cloned resources) in the CIB (see the last sketch after this list).
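
For the first point, an Upstart job that writes a pid file might look roughly like the excerpt below; the paths and options depend on the distribution and OpenStack release, so treat it as a sketch rather than a drop-in file:

    # /etc/init/quantum-server.conf (excerpt, hypothetical)
    # Let start-stop-daemon write a pid file so Pacemaker can find the process.
    exec start-stop-daemon --start --chuid quantum \
        --make-pidfile --pidfile /var/run/quantum/quantum-server.pid \
        --exec /usr/bin/quantum-server -- --config-file /etc/quantum/quantum.conf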
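
For the second point, an OCF resource agent is essentially a shell script that answers start, stop and monitor calls with the standard OCF exit codes. The skeleton below is a deliberately minimal, hypothetical example; real agents such as the Hastexo ones also implement a full meta-data description and validate-all:

    #!/bin/sh
    # Minimal, hypothetical OCF resource agent skeleton,
    # e.g. installed as /usr/lib/ocf/resource.d/openstack/keystone-demo
    : ${OCF_ROOT=/usr/lib/ocf}
    . ${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs

    PIDFILE=/var/run/keystone-demo.pid

    case "$1" in
      start)
        start-stop-daemon --start --background --make-pidfile \
            --pidfile $PIDFILE --exec /usr/bin/keystone-all
        exit $OCF_SUCCESS ;;
      stop)
        start-stop-daemon --stop --pidfile $PIDFILE
        exit $OCF_SUCCESS ;;
      monitor)
        kill -0 "$(cat $PIDFILE 2>/dev/null)" 2>/dev/null \
            && exit $OCF_SUCCESS || exit $OCF_NOT_RUNNING ;;
      meta-data)
        # A real agent returns a complete resource-agent XML description here.
        echo '<?xml version="1.0"?><resource-agent name="keystone-demo"/>'
        exit $OCF_SUCCESS ;;
      *)
        exit $OCF_ERR_UNIMPLEMENTED ;;
    esac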
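
For the third point, the CIB entries for a cloned service plus a shared virtual IP can be written with the crm shell roughly as follows; the resource names, the agent provider and the IP address are illustrative assumptions:

    # Run the service on both controllers and keep the virtual IP with a running copy
    crm configure primitive p_vip ocf:heartbeat:IPaddr2 \
        params ip=192.168.56.100 cidr_netmask=24 \
        op monitor interval=10s
    crm configure primitive p_keystone ocf:openstack:keystone \
        op monitor interval=30s timeout=30s
    crm configure clone cl_keystone p_keystone
    crm configure colocation col_vip_with_keystone inf: p_vip cl_keystone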

If you improve OpenStack initialization scripts, optimize OpenStack resource agents and improve the CIB configuration, Pacemaker should be a great tool to make OpenStack services highly available.

Automated Vagrant installation of MySQL HA using DRBD, Corosync and Pacemaker

Fig. 1: Redundant MySQL Server nodes using Pacemaker, Corosync and DRBD.

If automation is required, Vagrant and Puppet seem to be the most adequate tools to implement it. What about the automatic installation of highly available database servers? As part of our Cloud Dependability efforts, the ICCLab works on the automatic installation of High Availability systems. One such HA system is a MySQL server combined with DRBD, Corosync and Pacemaker.
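
The general shape of such an automated setup, assuming Vagrant with Puppet provisioning, is sketched below; the box name, IP addresses and manifest names are placeholders rather than the exact values used in our script:

    # Vagrantfile sketch: two nodes that will form the MySQL/DRBD cluster
    Vagrant.configure("2") do |config|
      config.vm.box = "precise64"
      { "db1" => "192.168.56.11", "db2" => "192.168.56.12" }.each do |name, ip|
        config.vm.define name do |node|
          node.vm.hostname = name
          node.vm.network :private_network, ip: ip
          node.vm.provision :puppet do |puppet|
            puppet.manifests_path = "puppet/manifests"
            puppet.manifest_file  = "mysql_ha.pp"
          end
        end
      end
    end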

In this system the server logic of the MySQL server runs locally on the different virtual machine nodes, while all database files are stored on a DRBD device which is replicated across the nodes. The DRBD resource is managed by Pacemaker, with Corosync acting as the cluster communication layer. If one of the nodes fails, Pacemaker automagically restarts the MySQL server on another node, and DRBD ensures that the data on the device is already in sync there. This combined DRBD and Pacemaker approach is best practice in the IT industry.
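
On the Pacemaker side, a setup of this kind is typically expressed with a crm configuration along the following lines. This is a sketch assuming a DRBD resource named mysql backing /var/lib/mysql, not a verbatim excerpt from our installation script:

    # DRBD master/slave resource, filesystem and MySQL grouped on the active node
    crm configure primitive p_drbd_mysql ocf:linbit:drbd \
        params drbd_resource=mysql op monitor interval=15s
    crm configure ms ms_drbd_mysql p_drbd_mysql \
        meta master-max=1 clone-max=2 notify=true
    crm configure primitive p_fs_mysql ocf:heartbeat:Filesystem \
        params device=/dev/drbd0 directory=/var/lib/mysql fstype=ext4
    crm configure primitive p_mysql ocf:heartbeat:mysql \
        op monitor interval=20s timeout=30s
    crm configure group g_mysql p_fs_mysql p_mysql
    crm configure colocation col_mysql_on_drbd inf: g_mysql ms_drbd_mysql:Master
    crm configure order ord_drbd_before_mysql inf: ms_drbd_mysql:promote g_mysql:start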

At ICCLab we have developed an automatic installation script which creates 2 virtual machines and configures MySQL, DRBD, Corosync and Pacemaker on both machines. The automated installation script can be downloaded from Github.

Dependability Modeling: Testing Availability from an End User’s Perspective

In a former article we spoke about testing High Availability in OpenStack with the Chaos Monkey. While the Chaos Monkey is a great tool to test what happens if some system components fail, it does not reveal anything about the general strengths and weaknesses of different system architectures. In order to determine whether an architecture with 2 redundant controller nodes and 2 compute nodes offers a higher availability level than an architecture with 3 compute nodes and only 1 controller node, a framework for testing different architectures is required. The "Dependability Modeling Framework" is a promising way to evaluate different system architectures on their ability to achieve the availability levels required by end users.

Overcome biased design decisions

The Dependability Modeling Framework is a hierarchical modeling framework for the dependability evaluation of system architectures. Its purpose is to model different alternative architectural solutions for one IT system and then calculate the dependability characteristics of each realization. The calculated dependability values can help IT architects to rate system architectures before they are implemented and to choose the "best" approach from the possible alternatives. Design decisions based on the Dependability Modeling Framework have the potential to be more reflective and less biased than purely intuitive design decisions, since no particular architectural design is preferred over the others: the fit of a particular solution is tested against previously defined criteria before any decision is taken.

Build models on different levels

The Dependability Models are built on four levels: the user level, the function level, the service level and the resource level. The levels reflect the method: first identify the user interactions as well as the system functions and services that are provided to users, then find the resources which contribute to the accomplishment of the required functions. Once all user interactions, system functions, services and resources are identified, models are built on each of the four levels to assess the impact of component failures on the quality of the service delivered to end users. The models are connected in a dependency graph to show the dependencies between user interactions, system functions, services and system resources. Once all dependencies are clear, the impact of a system resource outage on user functions can be calculated in a straightforward way: if the failing resource was the only resource delivering functions that were critical to the end user, the impact of the outage is very high; if there are redundant resources, services or functions, the impact is much less severe.
The dependency graph below demonstrates how end user interactions depend on functions, services and resources.
Fig. 1: Dependency Graph

The Dependability Model makes the impact of resource outages calculable. One can easily see that a Chaos Monkey test can verify such dependability graphs, since the Chaos Monkey effectively tests the outage of system resources by randomly unplugging devices. The less obvious part of the Dependability Modeling Framework is the calculation of resource outage probabilities. The probability of an outage can only be obtained by regularly measuring the unavailability of resources over a long time frame. Since no such data is available yet, one must estimate the probabilities and use this estimation as a parameter to calculate the dependability characteristics of the resources. A sensitivity analysis can then reveal whether the proposed architecture offers a reliable and highly available solution.


Dependability Modeling on OpenStack HA Environment

Dependability Modeling could also be performed on the OpenStack HA environment we use at ICCLab. It is obvious that High Availability could be realized in many different ways: we could, for example, use a distributed DRBD device to store all data used in OpenStack and let Pacemaker manage the DRBD device. Another possible solution is to build Ceph clusters and again use Pacemaker as the cluster manager. An alternative to Pacemaker is keepalived, which also offers synchronization and control mechanisms for load balancing and High Availability. And of course one could also think of using HAProxy for load balancing instead of Ceph or DRBD.
In short: different architectures can be modelled. How this is done will be the subject of a further blog post.

Evaluation of HA technologies for OpenStack

As proposed in a former article, different technologies must be evaluated in order to make the current MobileCloud environment meet High Availability (HA) requirements. The following article lists a basic evaluation of the different technologies that could be used.

Basically there are four technologies which allow building a reliable HA infrastructure for OpenStack:

  1. Build OpenStack on top of Corosync and use the Pacemaker cluster resource manager to replicate OpenStack services over multiple redundant nodes (a minimal corosync.conf sketch follows this list).
  2. For clustering of storage, a DRBD block storage solution can be used. DRBD is software that replicates block storage (hard disks etc.) over multiple nodes.
  3. Object storage services can be clustered via Ceph. Ceph is a clustered storage solution which is able to cluster not only block devices but also data objects and filesystems. Obviously Swift ObjectStore could be made highly available by using Ceph.
  4. OpenStack has MySQL as an underlying database system which is used to manage the different OpenStack services. Instead of a standalone MySQL database server, one could use a MySQL Galera cluster to make MySQL highly available too.
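
For the first item, the Corosync membership layer is configured through /etc/corosync/corosync.conf. A minimal, hypothetical example for a dedicated cluster network (Corosync 1.x with the Pacemaker plugin) could look like this; addresses and options must of course be adapted to the actual MobileCloud network:

    # /etc/corosync/corosync.conf (sketch)
    totem {
        version: 2
        secauth: off
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.56.0
            mcastaddr: 226.94.1.1
            mcastport: 5405
        }
    }
    service {
        name: pacemaker
        ver: 1
    }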

The different technologies have been evaluated according to their ability to make the different OpenStack components highly available. The following table shows which technologies could be used to make the OpenStack services used in MobileCloud meet High Availability requirements.


Table 1.1: OpenStack Services and Clustering Technologies which make them suitable to HA requirements.

The different technologies can obviously be used in different architectural setups, but it is equally obvious that they must be used in a multi-node OpenStack architecture. An architecture proposal will follow in a further article.

Pacemaker: clusters to allow HA in OpenStack

OpenStack's capabilities to support High Availability are very limited. If a virtual machine crashes, there is no automatic recovery. Clustering software seems to be a great workaround to allow redundancy and implement High Availability (HA).

Pacemaker is a scalable cluster resource manager developed by Clusterlabs. Its advantages are:

  • Support of many different deployment scenarios
  • Monitoring of resources
  • Recovery from outages

According to the OpenStack documentation website, the OpenStack HA environment builds on Pacemaker and Corosync. Corosync is Pacemaker's messaging layer and is responsible for distributing cluster messages between the nodes. Pacemaker uses resource agents that manage the different resources and communicates with the other nodes via Corosync. One of the resources Pacemaker typically manages is DRBD: DRBD block devices are virtual devices layered on top of the machine nodes' own devices (hard disks etc.), and DRBD keeps the data on these devices replicated across the nodes. The DRBD layer thus allows clustering of the storage of different machine nodes, while the Pacemaker resource agents, coordinated through Corosync, decide which node holds the active copy. Together they make it possible to organize high availability of machine nodes in an OpenStack environment.
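
A DRBD device of this kind is defined per resource, for example in a file under /etc/drbd.d/. The sketch below uses made-up node names, backing disks and addresses purely for illustration:

    # /etc/drbd.d/r0.res (hypothetical)
    resource r0 {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        meta-disk internal;
        on node1 { address 192.168.56.101:7788; }
        on node2 { address 192.168.56.102:7788; }
    }

Pacemaker then promotes the DRBD resource to primary on exactly one node and starts the services that use it there.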

Integration of Pacemaker into OpenStack is a major step towards creating an HA cloud environment. There is an ongoing evaluation of how Pacemaker fits into the MobileCloud environment, but it is obvious that there should be a test procedure to evaluate the availability of cloud resources in different integration scenarios. Follow-up information on this subject will be posted in a further blog post.