Cloud High Availability: how to select the right technologies

There are many technologies that can increase the availability of a cloud infrastructure. In our newest Techscouting paper we evaluate several HA technologies in order to define an HA architecture for an OpenStack deployment that is part of the XiFi project. HA technologies can be grouped into the following classes:

  • Resource monitors that check if IT-services are alive and (sometimes automatically) recover them in case of failure.
  • Load balancers that direct end user requests to those resources that are still alive and show reasonable performance.
  • Distributed disks and file systems that increase redundancy of data and help to prevent data loss in case of failure.
  • Distributed databases which help to prevent loss of database records.

Every OpenStack component exists to deliver a service to an end user. The availability of a cloud instance depends on the availability of the delivered services as perceived by end users. If we want to use an HA technology to increase the availability of OpenStack, we have to analyze how end user services depend on IT and infrastructure components. We therefore created a dependability model of the provided IT services and the business services consumed by end users.

dependencies

As availability always depends on the requirements defined by end users, we surveyed several OpenStack end users about the importance of each business service. End users tended to rate “Infrastructure Management” and “Security Management” as the most important services. Therefore we had to ensure that these services have high availability levels.
By linking the importance of the service to the IT components that provide it, we can assign a target availability level to each component. Furthermore we can compare several HA architectures to each other and check the availability levels they can achieve. We built several fault tree diagrams that represent the link of component failures to service outages:

fta

A simulation of service outages for given failure rates revealed that adding HA technologies to OpenStack can add up to 7 to 8 percentage points to the average availability level of the provided services.
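To illustrate how a fault tree links component failures to service availability, here is a minimal Python sketch. It is not the simulation used in the paper: it assumes independent failures, and the availability figures are made-up illustrative values. Components that are all required combine like a series system, redundant components like a parallel system.

```python
from math import prod

def series(*availabilities):
    """Service needs every component: any single failure causes an outage."""
    return prod(availabilities)

def parallel(*availabilities):
    """Redundant components: the service fails only if all of them fail."""
    return 1.0 - prod(1.0 - a for a in availabilities)

# Illustrative values only (assumptions, not measurements from the paper).
database = 0.99
api_service = 0.98

# Without HA: one database and one API service, both required.
single = series(database, api_service)

# With HA: each component is duplicated on a second node.
ha = series(parallel(database, database), parallel(api_service, api_service))

print(f"without HA: {single:.4f}, with HA: {ha:.4f}")
```

Comparing `single` and `ha` for different architectures is the kind of calculation that makes alternative HA designs comparable before they are built.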

We tested several technologies from each of the HA technology classes. Our evaluation covered the opportunities and risks associated with implementing each technology as well as its technological maturity. We assigned each technology an opportunity score, a risk score and a maturity score.

ha_tech_assess_results

The result of our evaluation is that we prefer keepalived, HAProxy, Ceph/RADOS and MySQL Galera as HA technologies to improve the availability of our OpenStack installation. These technologies are all open source. We preferred them because their performance is not significantly lower than that of commercial products, yet they are available for free. The final HA architecture is able to increase the availability levels of all OpenStack services up to three nines (99.9 %), which is a very high availability level in cloud computing.

Another organization would certainly come to other conclusions when selecting a concrete implementation of an HA technology. The evaluation methodology used in our paper nevertheless shows how to make more reasoned technology choices: link end user requirements to system architecture characteristics, then rate several architectural alternatives by the availability levels they can reasonably achieve.

Cloud High Availability

Overview

Cloud computing means:

  • On-demand self service
  • Virtualization
  • Elastic resource provisioning

A cloud computing service is comparable to public utility services such as gas, telephone or water supply.

The economic value of a cloud computing service is determined by its reliability, availability and maintainability (RAM) characteristics.

Availability impacts the value of cloud computing as it is perceived by end users. High Availability systems increase the guaranteed availability of a cloud computing service. Therefore they increase the economic value of a cloud computing service.

Objectives

The Cloud HA initiative has the following objectives:

  • To provide a service to analyze problems related to the reliability and availability of cloud computing systems
  • To provide systems and services that increase the reliability and availability of cloud computing systems

Research Challenges

The following challenges exist currently:

  • Measuring and analyzing availability: how can we experimentally determine reliability of cloud computing systems (VMs, storage etc.)? Design of adequate reliability measurement experiments is difficult, since we often have to rely on simulation of an outage.

  • Adapting reliability engineering methods to cloud computing: many reliability analysis and engineering techniques exist (Fault Tree Analysis, FME(C)A, HAZOP, Markov Chains). How can we apply them to the area of cloud computing?

  • Analytic and monitoring systems: build systems that automatically monitor reliability of cloud resources and analyze problems.

  • Failure recovery and intelligent event management systems: build systems that intelligently detect and react to failures.

Currently there is almost no data available on reliability of different virtualization technologies like OpenStack or Docker.

Cloud vendors and manufacturers simply claim that their systems operate reliably without providing data to prove their claims. Think about an engineering company such as ABB or Siemens. Would they still be on the market if they were not able to tell their customers the exact hazard rates and MTBFs of their products? The IT industry is lagging behind other engineering industries. IT reliability engineering could be an interesting discipline that adds value to IT products and services.

Relevance to current and future markets

Business impact

Existing High Availability solutions:

  • Pacemaker: resource monitor that automatically detects failures and recovers failed components. Highly configurable, but also heavyweight. System administrators notoriously complain about its bad configuration interface. A bad configuration can make the system 7-8 times slower than a good configuration.

  • Keepalived: lightweight resource monitor. Unclear if this tool is well supported by its community.

  • IBM Tivoli: extremely heavyweight resource monitor and configuration management tool.

  • HAProxy: lightweight load balancer. Great for web applications (HTTP mode), and it can also balance generic TCP-based services in TCP mode.

  • DRBD: disk replication technology. Fast and lightweight. Suitable for small disk networks.

  • Ceph: distributed storage and file system. Highly decentralized and great scalability.

  • GlusterFS: distributed storage and file system. Better scalability, but sometimes has problems with partition tolerance.

  • Galera: MySQL cluster. True multimaster solution.

  • MySQL NDB Cluster: maps MySQL to a simple key-value store. Requires adaptation of applications to the database interface.

  • Nagios: great monitoring system. Extensible, with many plugins available.

  • Elasticsearch, Logstash, Kibana (ELK): log file monitoring system.

There are many HA systems available on the market, but almost no tools that analyze the reliability of OpenStack and allow for automated, intelligent recovery from failure.

Results

Presentation

HA_initiative_factsheet

Contact

Konstantin Benz
Obere Kirchgasse 2
CH-8400 Winterthur
Mail: benn__(at)__zhaw.ch

Use pacemaker and corosync on Illumos (OmniOS) to run a HA active/passive cluster

In the Linux world, a popular approach to build highly available clusters is with a set of software tools that include pacemaker (as resource manager) and corosync (as the group communication system), plus other libraries on which they depend and some configuration utilities.

On Illumos (and in our particular case, OmniOS), the ihac project is abandoned and I couldn’t find any other platform-specific open source and mature framework for clustering. Porting pacemaker to OmniOS is an option and this post is about our experience with this task.

The objective of the post is to describe how to get an active/passive pacemaker cluster running on OmniOS and to test it with a Dummy resource agent. The use case (or test case) is not relevant, but what should be achieved in a correctly configured cluster is that, if the node of the cluster running the Dummy resource (active node) fails, then that resource should fail-over and be started on the other node (high availability).

I will assume that we start from a fresh installation of OmniOS 151012 with a working network configuration (and ssh, for your comfort!). Check the general administration guide, if needed.

This is what we will cover:

  • Configuring the machines
  • Patching and compiling the tools
  • Running pacemaker and corosync from SMF
  • Running an active/passive cluster with two nodes to manage the Dummy resource


OpenStack HA: why is Pacemaker such a slow recovery tool?

If you have ever tried to implement High Availability in OpenStack by using Pacemaker, you might have been disappointed by Pacemaker’s extremely slow recovery speed. Pacemaker recovers OpenStack at a very slow pace and, even worse, it sometimes detects outages that did not occur. As a result Pacemaker starts unnecessary, computationally intensive recovery actions which are very slow and decrease OpenStack’s availability. This article describes why Pacemaker recovery actions are sometimes slow and what we can do about it.

Pacemaker is distributed software that monitors and controls the execution of programs or services on different computers in a cluster. The controlled services are called “resources” and Pacemaker needs a “resource agent” interface in order to be able to manage a resource. Resource management actions are performed by programs that run locally on each computer of the cluster: the “Local Resource Management Daemons” (LRMDs). LRMDs are programs that can monitor the execution of services and restart them in case of failure. The LRMD actions are orchestrated by the “Cluster Resource Manager” (CRM). LRMDs know how to manage resources (from the resource agent specifications), but they do not monitor, stop or restart local IT services autonomously: the CRM has to tell them when and at what time interval they have to perform these actions. The CRM is configured by a distributed XML file: the “Cluster Information Base” (CIB). The CIB contains all information that is necessary to orchestrate the LRMD actions. The communication between CRM and LRMDs is performed by a “Cluster Communication Manager” (CCM). Typical CCMs that are used in combination with Pacemaker are Corosync or Heartbeat.

Fig. 1: OpenStack HA with Pacemaker.


OpenStack can be made highly available by installing redundant OpenStack services (Keystone, Nova, Glance etc.) on different machines and letting Pacemaker control the execution of the OpenStack services. Custom resource agents must be installed in order to allow the LRMDs to manage OpenStack resources. Then the CIB must be configured so the CRM can orchestrate the LRMD actions. An example of such an OpenStack HA architecture using Pacemaker is shown in Fig. 1.

Why is Pacemaker slow?

Sometimes one can experience that Pacemaker failover actions are very slow. There could be several reasons why the Pacemaker recovery of OpenStack is such a time-consuming task. The most common ones are these:

  • Suboptimal initialization scripts: OpenStack services do not write their process ID (pid) to a pid file by default. Therefore Pacemaker is not able to identify OpenStack services as manageable entities or resources. Some hacking is necessary in order to make OpenStack services Pacemaker-compliant.
  • Custom resource agents: there are no OCF-compliant OpenStack resource agents delivered out of the box. Pacemaker’s Local Resource Management Daemons (LRMDs) are therefore not able to manage OpenStack services.
  • Bad Cluster Information Base (CIB) configuration: The worst thing is a messy CIB configuration. If, for example, resources are kept in large groups and monitoring intervals are too long to discover outages quickly, Pacemaker recovery will be very slow, because Pacemaker has to recover large resource groups and recovery actions are started late.

What can be done to make Pacemaker faster?

The first and most important step to make Pacemaker recovery faster is to identify the cause of the slowness. Once you have done that, you can take one of the following actions:

  • Optimize initialization scripts: Depending on your init system (SysV init, Upstart, systemd), you must customize the startup of services so that they generate pid files, which help Pacemaker to identify the service on the system. The configuration files described here live in /etc/init, which is where Upstart keeps its job definitions, so this applies to Ubuntu installations that start OpenStack services via Upstart. If you run OpenStack on such a system, you must customize the Upstart job files so that they generate pid files automatically. For the quantum server, for example, you have to change the /etc/init/quantum-server.conf file to contain several lines which tell the Upstart daemon to create a pid file and place it in a specified folder (typically /var/run). Pid files can be created using start-stop-daemon. For more information on start-stop-daemon, read its manpage.
  • Create custom resource agents: there are no OpenStack resource agents delivered out of the box, but you can create them if you want. Resource agents must be placed in the /usr/lib/ocf/resource.d/ folder. They must contain methods to monitor, start and stop services as well as a method to report the execution status of the service. Some good examples of OpenStack resource agents can be found on the Hastexo website.
  • Improve Cluster Information Base (CIB) configuration: Most improvements can be achieved by changing the CIB configuration. Ideally OpenStack services should run redundantly at the same time on two different OpenStack nodes which can be reached via a shared virtual IP. In case of a service failure on one node, Pacemaker just has to route traffic to the node where the service is still running. If the service is not already running on the fallback node before the failure occurs, Pacemaker has to start the service on at least one of the nodes. Switching traffic to a running instance is usually faster than starting whole services. Therefore redundant nodes must always keep redundant OpenStack services up and running. It is really important to ensure that parallel execution of redundant services is configured in the CIB file.
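The relationship between pid files and resource agents can be sketched as follows. This is a minimal illustration in Python, not one of the Hastexo agents: the pid file path is a hypothetical placeholder, only the monitor action is implemented, and a complete OCF agent would also have to provide start, stop and meta-data actions. The exit codes follow the OCF convention (0 for success, 7 for "not running").

```python
import os
import sys

# Exit codes defined by the OCF resource agent convention.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_ERR_UNIMPLEMENTED = 3
OCF_NOT_RUNNING = 7

def monitor(pidfile):
    """Return OCF_SUCCESS if the pid stored in pidfile belongs to a live process."""
    try:
        with open(pidfile) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        return OCF_NOT_RUNNING  # no usable pid file: treat the service as stopped
    try:
        os.kill(pid, 0)  # signal 0 only checks that the process exists (POSIX)
    except ProcessLookupError:
        return OCF_NOT_RUNNING
    except PermissionError:
        pass  # the process exists but belongs to another user
    return OCF_SUCCESS

def main(argv, pidfile="/var/run/example-service.pid"):
    """Dispatch the action that the cluster manager passes as first argument."""
    action = argv[1] if len(argv) > 1 else ""
    if action == "monitor":
        return monitor(pidfile)
    # start, stop and meta-data are deliberately left out of this sketch.
    return OCF_ERR_UNIMPLEMENTED

# A real agent script would end with: sys.exit(main(sys.argv))
```

The sketch shows why the missing pid file matters: without it, the monitor action has nothing reliable to check and must report the service as not running.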

If you improve OpenStack initialization scripts, optimize OpenStack resource agents and improve the CIB configuration, Pacemaker should be a great tool to make OpenStack services highly available.

Automated OpenStack High Availability installation now available

The ICCLab developed a new High Availability solution for OpenStack which relies on DRBD and Pacemaker. OpenStack services are installed on top of a redundant 2-node MySQL database. The 2-node MySQL database stores its data tables on a DRBD device which is replicated across the 2 nodes. OpenStack can be reached via a virtual IP address, so users feel that they are dealing with only one OpenStack node. All OpenStack services are monitored by Pacemaker. When a service fails, Pacemaker restarts it on either node.

Fig. 1: Architecture of OpenStack HA.


The 2 node OpenStack solution can be installed automatically using Vagrant and Puppet. The automated OpenStack HA installation is available on a Github repository.

Dependability Modeling on OpenStack: Part 3

In this part of the Dependability Modeling article series we explain how a test framework on an OpenStack architecture can be established. The test procedure has 4 steps: in the first step, we implement the OpenStack environment following the planned system architecture. In the second step we calculate the probabilities of component outages during a given timeframe (e. g. 1 year). Then we start a Chaos Monkey script which “attacks” (randomly disables) the components of the system environment using the calculated probabilities as a base for the attack. As a last step we measure the impact of the Chaos Monkey attack according to the table of failure impact sizes we created in part 2. The impact of the attack should be stored as a dataset in a database. Steps 1-4 form one test run. Multiple test runs can be performed on multiple architectures to create empirical data which allows us to rate the different OpenStack architectures according to their availability.

Step 1: Implement system architecture

An OpenStack architecture can be implemented quite easily by using the Vagrant-Devstack installation. Each OpenStack node can be set up as a Vagrant-Devstack system. First install Virtualbox, then install Vagrant and then install Vagrant-Devstack. Configure Devstack to support a multi-node environment. As a next step you should create an SSH tunnel between the different nodes using Vagrant. Once the different VM nodes are ready, you can start to test the architecture. Fig. 1 shows a typical OpenStack architecture for a single OpenStack node.

Fig. 1: Typical OS architecture for a single OpenStack node.


High availability is usually only possible in a multi-node environment, because redundant nodes are needed in case of node failures and consequent failovers. Therefore your architecture must be distributed or clustered over several redundant nodes. An example of such an architecture is shown in (Fig. 2). Once the architecture is defined, you have to implement it by using Vagrant, Puppet and Devstack.

Fig. 2: Sample 2-node architecture using DRBD, Corosync and Pacemaker.


Step 2: Calculate outage probability

Availability is usually measured over a given time period (e. g. one year). It is the ratio of uptime to total time. If we want to calculate the risk/probability of outages in the observed period, we must know at least two values: the total downtime of a component (which can be derived when the availability is known) and the average recovery time. Dividing the total downtime by the average recovery time gives an estimate of the number of outages in the observed time period. In (Tab. 1) we have a list of all OpenStack components which are present in one node of the OpenStack installation. Availability is observed for a time period of one year (= 31’536’000 seconds). If we assign each component an availability value and an average recovery time, we can calculate the downtime and the number of outages per year. Because we are interested in the outage risk, we calculate the risk by dividing the total number of outages by the number of days per year. The calculated outage risks can now be used to simulate a typical operational day of the observed OpenStack system.
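The calculation described above can be sketched in a few lines of Python. The function name and the example values (99.9 % availability, 10 minutes average recovery time) are illustrative assumptions and are not taken from Tab. 1.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds
DAYS_PER_YEAR = 365

def outage_estimate(availability, mean_recovery_seconds):
    """Estimate yearly downtime, number of outages and outage risk per day.

    availability: fraction of uptime in the observed year (0..1)
    mean_recovery_seconds: average time needed to recover from one outage
    """
    downtime = (1.0 - availability) * SECONDS_PER_YEAR
    outages_per_year = downtime / mean_recovery_seconds
    risk_per_day = outages_per_year / DAYS_PER_YEAR
    return downtime, outages_per_year, risk_per_day

# Hypothetical component: 99.9 % available, 10 minutes average recovery time.
downtime, outages, risk = outage_estimate(0.999, 600)
print(f"downtime: {downtime:.0f} s, outages/year: {outages:.2f}, risk/day: {risk:.3f}")
```

Note that the daily value is an expected number of outages per day; for components with very short recovery times it can exceed 1 and would then have to be capped before it is used as a probability.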

Tab. 1: Outage risk estimation of OpenStack components.


Step 3: Run Chaos Monkey attack

Although Chaos Monkey disables devices randomly, a realistic test assumes that outages do not occur completely at random. A Chaos Monkey attack should be executed only with a certain probability, not with certainty. Therefore we must create a script which disables the OpenStack services with the probabilities we defined in (Tab. 1). Such a script could be written in Python, as shown in (Fig. 3). The most important part of the shutdown mechanism is that probabilities can be assigned to the services we want to disable. The probabilities are taken from the values we calculated in (Tab. 1). The other part is that the execution of Chaos Monkey attacks follows a random procedure. This can be achieved by using a simple random number generator which generates a number between 0 and 1. If the random number is smaller than the probability, the Chaos Monkey attack is executed (otherwise nothing is done). This way we can simulate the random occurrence of outages as it would happen in a real OpenStack installation that runs in operational mode.
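A minimal sketch of such a probabilistic attack loop could look as follows. It is not the script from the figure: the service names and probabilities are illustrative placeholders, and the function that actually disables a service is injected so that the default only logs what it would do.

```python
import random

# Hypothetical per-day outage probabilities; real values would come from Tab. 1.
OUTAGE_PROBABILITIES = {
    "nova-api": 0.05,
    "keystone": 0.02,
    "glance-api": 0.03,
}

def log_only(service):
    """Default action: report the attack instead of stopping anything."""
    print("would stop", service)

def chaos_monkey(probabilities, disable=log_only, rng=None):
    """Disable each service with its assigned probability.

    Returns the list of services that were attacked in this run.
    """
    rng = rng or random.Random()
    attacked = []
    for service, probability in probabilities.items():
        # rng.random() is uniform in [0, 1): attack only if it falls below
        # the outage probability of this service.
        if rng.random() < probability:
            disable(service)
            attacked.append(service)
    return attacked

attacked_today = chaos_monkey(OUTAGE_PROBABILITIES, rng=random.Random(42))
```

In a real test run, `disable` would be replaced by a function that stops the service on the target node; keeping it as a parameter makes the random selection logic testable without touching a running system.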

Fig. 3: Excerpt of a Python script which serves to shutdown OpenStack services.


Step 4: Poll impact of failure

Once the Chaos Monkey attack has been performed, one has to check the impact size of the outage. The failure impact sizes are listed in the table of failure impact sizes (Tab. 2), which is derived from Dependability Modeling (as explained in article 2 of this series). The task at hand is now to poll which user interactions are still available after the Chaos Monkey attack. This can be done by performing the use cases which are affected by an outage of a component. The test tool must be a script which programmatically runs the use cases as tests. If a test fails, the failure impact size is raised according to the weight of the use case. The result of such a test run is a failure impact size after the Chaos Monkey attack.
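The polling step can be sketched as a small Python function. The use case names, weights and test functions below are hypothetical stand-ins; real tests would exercise the OpenStack environment, for example by calling its APIs.

```python
def poll_failure_impact(use_case_tests):
    """Run every use case test and add up the weights of the failed ones.

    use_case_tests maps a use case name to a (weight, test) pair, where
    test is a callable that returns True if the use case still works.
    """
    impact = 0
    failed = []
    for name, (weight, test) in use_case_tests.items():
        try:
            works = bool(test())
        except Exception:
            works = False  # a crashing test counts as an unavailable use case
        if not works:
            impact += weight
            failed.append(name)
    return impact, failed

# Hypothetical use cases with stubbed tests (True = still available).
tests = {
    "Telco authentication": (3, lambda: True),
    "Measure SLAs": (2, lambda: False),
    "Create VM node": (1, lambda: False),
}
impact, failed = poll_failure_impact(tests)
print(impact, failed)
```

The returned impact value is what would be stored in the database together with the architecture and the assumed availabilities of the test run.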

Tab. 2: Failure impact sizes and use cases affected by component failure.


Cleanup and re-run the test

Test results should be stored in a database. This database should contain failure impact sizes, assumed availabilities and average recovery times as well as information about the system architecture that has been used. When a test run has been completed, the results of the Chaos Monkey attacks have to be reverted in order to be able to re-run the test. With each test-run the database is filled up and one can be more certain about the test results.

Further test runs can be started either with the same architectural setup or with another one: instead of a one-node installation one could use a two-node OpenStack installation, one could use Ceph and Pacemaker as HA clustering software and try different technologies. If we perform steps 1-4 repeatedly, we can rate different OpenStack architectures according to their resistance against outages and find out which architecture fits best to High Availability goals.

If the test framework is applied to an OpenStack environment such as Mobile Cloud Network, High Availability characteristics can be ensured more confidently. Dependability modeling is a useful recipe to test OpenStack architectures from an end user’s perspective. The capabilities of the explained method have not been explored in detail yet, but more will follow soon.

 

DRBD-Test environment for Vagrant available

There is always room to test different HA technologies in a simulated VM environment. At ICCLab we have created such a DRBD test environment for PostgreSQL databases. This environment is now available on Github.

The test environment installation uses Vagrant as tool to install VMs, Virtualbox as VM runtime environment and Puppet as VM configurator. It includes a Vagrant installation script (usually called a “Vagrantfile”) which sets up two virtual machines which run a clustered highly available PostgreSQL database.

In order to use the environment, you have to download it and then run the Vagrant installation script. The Vagrant installation script of the test environment essentially does the following things:

  • It creates two virtual machines, each with 1 GB RAM, one 80 GB hard drive and an extra 5 GB hard drive (which is used as DRBD device).
  • It creates an SSH tunnel between the two VM nodes which is used for DRBD synchronization.
  • It installs, configures and runs the DRBD device on both machines.
  • It installs, configures and runs Corosync and Pacemaker on both machines.
  • It creates a distributed PostgreSQL  database which runs on the DRBD device and which is managed by the Corosync/Pacemaker software.

This environment can easily be installed and then be used for testing of the DRBD technology. It can be downloaded from the following Github repository:

https://github.com/kobe6661/dependability_test_fw.git

Installation instructions can be found here.

Dependability Modeling on OpenStack: Part 2

In the previous article we defined use cases for an OpenStack implementation according to the usage scenario in which the OpenStack environment is deployed. In this part of the Dependability Modeling article series we will show how these use cases relate to functions and services provided by the OpenStack environment and create a set of dependencies between use cases, functions, services and system components. From this set we will draw the dependency graph and make the impact of component outages computable.

Construct dependency table

The dependency graph can be constructed if we define which functions, services and components allow provision of a use case. In the example below (Fig. 1) we defined the system architecture components, services and functions which make it possible to create, delete or update details of a Telco Account (account of a mobile end user). Since these operations are provided within virtual machines, the VM User Management and VM Security Management functions provide availability of this use case. Therefore we draw a column which contains these functions. Because these functions need a User Management, SSH & Password Management service in each VM in order to operate, we draw a second column which contains the required services. Another column is constructed which lists the system components required in order to deliver the required services.

Fig. 1: Dependency Graph Construction.


The procedure mentioned above is repeated for all use cases. As a result you get a table like the one in (Tab. 1). This dependency table is the starting point for the production of the dependency graph.

Tab. 1: Dependencies between Use Cases, Services, Functions and Components.


Construct dependency graph

For each component that is listed in the table you have to model the corresponding services, functions and use cases. This is done as in the example in (Fig. 2). We start from the right of the graph with the Ceilometer component and the VM plugin and look at which services are provided by those components: in this case it is the “Ceilometer Monitoring” service. Therefore we draw an icon that represents this service and draw arrows from the Ceilometer and VM plugin components to the service icon (1). In the next step we look at which function is provided by the Ceilometer Monitoring service. This is the “Monitoring of VM” function. Therefore we paste an icon for the function and draw an arrow to this function (2). Then we look for the use cases provided by the Monitoring of VM function. Since this is, for example, “Measure SLAs”, we paste an icon for this use case and draw another arrow to “Measure SLAs” (3). The first path between a use case and the components on which it depends is now drawn. This procedure is repeated for all components in (Tab. 1).

Fig. 2: Dependency Graph Construction from Dependency Table.


The result is the dependency graph shown below (Fig. 3).

Fig. 3: Dependency Graph of OpenStack Environment.


Add weight factors to use cases

Once the dependency graph is constructed, we can calculate the “impact” of component outages. When a component fails, you can simply follow the arrows in the dependency graph to see which user interactions (use cases) stop being available for end users. If, for example, the Ceilometer component fails, you would not be able to measure SLAs, meter usage of Telco services or monitor the VM infrastructure.
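Following the arrows can be automated with a simple graph traversal. The sketch below encodes only the single path described in the construction example (Ceilometer and VM plugin, Ceilometer Monitoring, Monitoring of VM, Measure SLAs); the full graph would contain many more edges. It also assumes that there is no redundancy, so every failure propagates along every outgoing arrow.

```python
from collections import deque

# Arrows of the dependency graph: node -> nodes that depend on it.
DEPENDENCIES = {
    "Ceilometer": ["Ceilometer Monitoring"],
    "VM plugin": ["Ceilometer Monitoring"],
    "Ceilometer Monitoring": ["Monitoring of VM"],
    "Monitoring of VM": ["Measure SLAs"],
}
USE_CASES = {"Measure SLAs"}

def affected_use_cases(failed_component, edges, use_cases):
    """Follow the arrows from a failed component and collect reachable use cases."""
    seen = {failed_component}
    queue = deque([failed_component])
    while queue:
        node = queue.popleft()
        for dependent in edges.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen & set(use_cases)

print(affected_use_cases("Ceilometer", DEPENDENCIES, USE_CASES))
```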

But it would be too simplistic to say that each use case is equally important to the end user. Some user interactions, such as the creation of new VM nodes, need not be available all the time (or at least it depends on the OLAs of the Telco). Other actions, such as Telco authentication, must be available all the time. Therefore, we have to add weight factors to use cases. This can be done by adding another column to the dependency table and naming it “Weight factor”. The weight factor should be a score measuring the “importance” of a user interaction in terms of business need. In a production OpenStack environment, financial values (which correspond to the business value of the user interaction) could be assigned as weight factors to each use case. For reasons of simplicity we take the ordinal values 1, 2 and 3 as weight factors (whereby 1 signifies the least important user interaction and 3 the most important one). For each use case row in the dependency table we add the corresponding weight factor (Fig. 4).

Fig. 4: Assignment of weight factors.


As a next step, we create a pivot table containing the components and use cases as consecutive row fields and the weight factors as data field. In order to avoid duplicate counts (of use cases) we use the maximum function instead of the sum function. As a result we get the pivot table in (Tab. 2).

Tab. 2: Pivot Table of Component/Use Case dependencies.


Calculate outage impacts

Calculating the impact of system component outages is now quite straightforward. Just look at the pivot table and calculate the sum of the weight factors for each component. As a result we have a table of failure impact sizes (Tab. 3).
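The pivot step and the summation can be reproduced in Python. The rows below are hypothetical examples and not the contents of the dependency table; the weights use the ordinal scale 1 to 3. As in the pivot table, the maximum is taken per component and use case to avoid duplicate counts, and the results are then summed per component.

```python
from collections import defaultdict

def failure_impact_sizes(rows):
    """Compute the failure impact size of each component.

    rows: iterable of (component, use_case, weight) taken from the
    dependency table; the same pair may appear several times.
    """
    # Pivot: one entry per (component, use case), using max to avoid
    # counting a use case twice for the same component.
    pivot = {}
    for component, use_case, weight in rows:
        key = (component, use_case)
        pivot[key] = max(pivot.get(key, 0), weight)

    # Failure impact size: sum of the weights per component.
    impact = defaultdict(int)
    for (component, _use_case), weight in pivot.items():
        impact[component] += weight
    return dict(impact)

# Hypothetical dependency table rows (component, use case, weight factor).
rows = [
    ("MySQL", "Telco authentication", 3),
    ("MySQL", "Telco authentication", 3),  # reached via a second service
    ("MySQL", "Create VM node", 1),
    ("Ceilometer", "Measure SLAs", 2),
]
print(failure_impact_sizes(rows))
```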

Tab. 3: OpenStack Components and Failure Impact Sizes.


This table reveals which components are very important for the overall reliability of the OpenStack environment and which are not. It is an operationalization of the measurement of “failure impact” for a given IT environment (failure impacts can be measured as number). The advantage of this approach is that we can build a test framework for OpenStack availability based on the failure impact sizes.

Most obviously, components with strong support functionality, such as MySQL or the Keystone component, have high failure impact sizes and should be strongly protected against outages. VM-internal components seem to be less important because VMs can be easily cloned and recovered in a cloud environment.

In a further article we will show how availability can be tested with the given failure impact size values on a given OpenStack architecture.

 

Dependability Modeling: Testing Availability from an End User’s Perspective

In an earlier article we spoke about testing High Availability in OpenStack with the Chaos Monkey. While the Chaos Monkey is a great tool to test what happens if some system components fail, it does not reveal anything about the general strengths and weaknesses of different system architectures. In order to determine if an architecture with 2 redundant controller nodes and 2 compute nodes offers a higher availability level than an architecture with 3 compute nodes and only 1 controller node, a framework for testing different architectures is required. The “Dependability Modeling Framework” offers a way to evaluate different system architectures on their ability to achieve the availability levels required by end users.

Overcome biased design decisions

The Dependability Modeling Framework is a hierarchical modeling framework for dependability evaluation of system architectures. Its purpose is to model different alternative architectural solutions for one IT system and then calculate the dependability characteristics of each different IT system realization. The calculated dependability values can help IT architects to rate system architectures before they are implemented and to choose the “best” approach from different possible alternatives. Design decisions which are based on Dependability Modeling Framework have the potential to be more reflective and less biased than purely intuitive design decisions, since no particular architectural design is preferred to others. The fit of a particular solution is tested versus previously defined criteria before any decision is taken.

Build models on different levels

The Dependability Models are built on four levels: the user level, the function level, the service level and the resource level. The levels reflect the method: first identify the user interactions as well as the system functions and services which are provided to users, and then find the resources which contribute to the accomplishment of the required functions. Once all user interactions, system functions, services and resources are identified, models are built (on each of the four levels) to assess the impact of component failures on the quality of the service delivered to end users. The models are connected in a dependency graph to show the dependencies between user interactions, system functions, services and system resources. Once all dependencies are clear, the impact of a system resource outage on user functions can be calculated in a straightforward way: if the failing resource was the only resource delivering functions that are critical to the end user, the impact of the resource outage is very high. If there are redundant resources, services or functions, the impact is much less severe.
The dependency graph below demonstrates how end user interactions depend on functions, services and resources.
Dependability Graph

Fig. 1: Dependency Graph
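The idea of such a dependency graph can be sketched in a few lines of Python. The graph below is a made-up example, not the graph from Fig. 1: all node names, the data structure and the function names are assumptions for illustration. Each node lists its requirement groups, and each group lists redundant alternatives, so a node is up if every group has at least one alternative that is up.

```python
from typing import Dict, List, Set

# node -> list of requirement groups; each group lists redundant alternatives.
# Nodes that do not appear as keys are resources (leaves of the graph).
DependencyGraph = Dict[str, List[List[str]]]

GRAPH: DependencyGraph = {
    # user level -> function level
    "launch VM": [["provision instance"]],
    "log in": [["authenticate"]],
    # function level -> service level
    "provision instance": [["compute service"], ["identity service"]],
    "authenticate": [["identity service"]],
    # service level -> resource level
    "compute service": [["compute node 1", "compute node 2"], ["controller"]],
    "identity service": [["controller"]],
}

USER_INTERACTIONS = ["launch VM", "log in"]


def is_up(node: str, failed: Set[str], graph: DependencyGraph) -> bool:
    """A node is up if it has not failed and every requirement group is satisfied."""
    if node in failed:
        return False
    groups = graph.get(node, [])
    return all(any(is_up(alt, failed, graph) for alt in group) for group in groups)


def affected_interactions(failed: Set[str], graph: DependencyGraph) -> List[str]:
    """User interactions that stop working when the given resources fail."""
    return [i for i in USER_INTERACTIONS if not is_up(i, failed, graph)]


print(affected_interactions({"compute node 1"}, GRAPH))  # redundant node, no impact
print(affected_interactions({"controller"}, GRAPH))      # single point of failure
```

In this toy graph the loss of one compute node affects no user interaction because a redundant node exists, while the loss of the single controller affects every interaction.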

The Dependability Model makes the impact of resource outages calculable. A Chaos Monkey test can verify such dependency graphs, since the Chaos Monkey effectively tests the outage of system resources by randomly unplugging devices. The less obvious part of the Dependability Modeling Framework is the calculation of resource outage probabilities. The probability of an outage can only be obtained by regularly measuring the unavailability of resources over a long time frame. Since no such data is available yet, one must estimate the probabilities and use these estimates as parameters to calculate the dependability characteristics of resources. A sensitivity analysis can then reveal how strongly the result depends on these estimates and whether the proposed architecture offers a reliable and highly available solution.
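A very simple form of sensitivity analysis is to vary one estimated outage probability at a time and observe how much the calculated result changes. The Python sketch below illustrates this under stated assumptions: the model (one controller in series with three redundant compute nodes, independent failures), the estimates of 0.01 and all names are invented for the example and are not taken from the framework.

```python
from typing import Callable, Dict


def unavailability(p: Dict[str, float]) -> float:
    """Assumed model: one controller in series with three redundant compute nodes."""
    controller_down = p["controller"]
    all_computes_down = p["compute"] ** 3
    return 1.0 - (1.0 - controller_down) * (1.0 - all_computes_down)


def one_at_a_time(model: Callable[[Dict[str, float]], float],
                  estimates: Dict[str, float],
                  factor: float = 2.0) -> Dict[str, float]:
    """Scale each estimate by `factor` in turn and report the change in the result."""
    baseline = model(estimates)
    changes = {}
    for name, value in estimates.items():
        perturbed = dict(estimates)
        perturbed[name] = min(1.0, value * factor)  # keep it a valid probability
        changes[name] = model(perturbed) - baseline
    return changes


# Estimated outage probabilities per resource type (pure assumptions).
ESTIMATES = {"controller": 0.01, "compute": 0.01}
SENSITIVITY = one_at_a_time(unavailability, ESTIMATES)

for name, change in SENSITIVITY.items():
    print(f"doubling the {name} estimate changes unavailability by {change:.8f}")
```

In this example the result reacts far more strongly to the controller estimate than to the compute estimate. An architect would therefore want a well-founded estimate for the controller, or a redundant controller, before trusting the calculated availability.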


Dependability Modeling on OpenStack HA Environment

Dependability Modeling could also be performed on the OpenStack HA Environment we use at ICCLab. High Availability could be realized in many different ways: we could, for example, use a replicated DRBD device to store all data used in OpenStack and manage the DRBD device with Pacemaker. Another possible solution is to build Ceph clusters and again use Pacemaker as the cluster management tool. An alternative to Pacemaker is keepalived, which also offers failover and control mechanisms for Load Balancing and High Availability. And of course one could also think of adding HAProxy for Load Balancing.
In short: different architectures can be modeled. How this is done will be the subject of a further blog post.

Evaluation of HA technologies for OpenStack

As proposed in a former article, different technologies must be evaluated in order to make the current MobileCloud environment meet High Availability (HA) requirements. This article gives a basic evaluation of the different technologies that could be used.

Basically there are four technologies that can be used to build a reliable HA infrastructure for OpenStack:

  1. Build OpenStack on top of Corosync and use the Pacemaker cluster resource manager to replicate OpenStack services over multiple redundant nodes.
  2. For clustering of storage, a DRBD block storage solution can be used. DRBD is software that replicates block storage (hard disks etc.) over multiple nodes.
  3. Object storage services can be clustered via Ceph. Ceph is a clustered storage solution which is able to cluster not only block devices but also data objects and filesystems. The Swift object store could therefore be made highly available by using Ceph.
  4. OpenStack uses MySQL as its underlying database system to manage the different OpenStack services. Instead of a standalone MySQL database server, one could use a MySQL Galera cluster to make MySQL highly available too.

The different technologies have been evaluated according to their ability to make the different OpenStack components highly available. The following table shows which technologies could be used to make the OpenStack services used in MobileCloud meet High Availability requirements.

table_ha_evaluation

Table 1.1: OpenStack services and the clustering technologies that can make them meet HA requirements.

The different technologies can be used in different architectural setups, but all of them require a multi-node OpenStack architecture. An architecture proposal will follow in a further article.