Specification of data to be collected in Dependability Modeling

In part 3 of our article series “Dependability Modeling on OpenStack” we have discussed that we should run Chaos Monkey tests on an OpenStack HA installation and then collect data about the impact of the attack. While we did say that we want to collect data about the implemented OpenStack HA architecture, we were not specific about which data we should actually collect. This article gives some hints what is important when collecting data about HA system architectures.

What should be measured?

A very interesting question is what should be measured during a Chaos Monkey test run. The Dependability Modeling Framework is used to measure the capability of a system architecture to deliver “low” impacts of system outages. Therefore we should measure the impact of outages. The impact is a score which is derived from the Dependability graph. It should be measured as a result of a test run.

What is analysed in Dependability Modeling?

In Dependability Modeling we are interested in correlations between the system architecture and the outage impact. The system architecture data is mainly categorical data (replication technology used, clustering technology etc.) and the impact is a number. All variables that describe the system architecture are meant to be “explanatory” or “independent” variables, i. e. variables that can be chosen freely in the simulation, while the impact of outages is the “explained” (or “dependent”) variable, because the impact is assumed to be the result of the chosen architecture. In order to find significant correlations between system architecture properties and impact, we must collect values for all explanatory variables and then use a dimensionality reduction method to find which properties are interesting.

How much data should be collected?

First we must say that it is not a bad practice to collect “too much” data in a test or a scientific experiment. In classical statistics it is usually said that we should use small samples. The reason why this is said is because the science of classical statistics was developed in the 19th century – a time where measurements were expensive and statements on data sets had to be derived from small sample sets. Nowadays we can collect data automatically, therefore we are not forced to use small sample sets. We can simulate the whole life cycle of a cloud service, e. g. we could say that an OpenStack service will run for about 8 years which is 8 x 365 = 2’920 days and take one Chaos Monkey test for each day. The advantage of the automation is that we do not need to rely on samples.
Of course there is a limitation in terms of computational power: a Chaos Monkey test takes about 0.5-1.5 seconds. If we run 2920 Chaos Monkey tests, the whole simulation run can take up to > 4’300 seconds, which is more than 1 hour. Therefore you either run a simulation as an overnight batch job or you must choose to limit the simulation to a sample size which should adequately represent the overall population. To determine the optimal sample size you could use variance estimation. The sample size can be obtained using the statistical formula for calculation of sample sizes.

With that specification, we can proceed in developing our test framework. A further article will show a sample data set.

 

 

Dependability Modeling on OpenStack: Part 3

In this part of the Dependability Modeling article series we explain how a test framework on an OpenStack architecture can be established. The test procedure has 4 steps: in a first step, we implement the OpenStack environment following the planned system architecture. In the second step we calculate the probabilities of component outages during a given timeframe (e. g. 1 year). Then we start a Chaos Monkey script which “attacks” (randomly disables) the components of the system environment using the calculated probabilities as a base for the attack. As a last step we measure the impact of the Chaos Monkey attack according to the table of failure impact sizes we created in part 2. The impact of the attack should be stored as dataset in a database. Steps 1-4 form one test run. Multiple test runs can be performed on multiple architectures to create a empirical data which allows us to rate the different OpenStack architectures according to their availability.

 Step 1: Implement system architecture

Implementation of an OpenStack architecture can be achieved quite straightforward by using the Vagrant-Devstack installation. Each OpenStack node can be set up as Vagrant-Devstack system. First install Virtualbox, then install Vagrant and then install Vagrant-Devstack. Configure Devstack to support a Multi-node environment. As a next step you should create an SSH Tunnel between the different nodes using Vagrant. Once the different VM nodes are ready, you can start to test the architecture. (Fig.1) includes a typical OpenStack architecture for a single OpenStack node.

Fig. 1: Typical OS architecture for a single OpenStack node.

Fig. 1: Typical OS architecture for a single OpenStack node.

High availability is usually only possible in a multi-node environment, because redundant nodes are needed in case of node failures and consequent failovers. Therefore your architecture must be an architecture which is distributed or clustered over several redundant nodes. An example of such an architecture is shown in (Fig. 2). Once the architecture is defined, you have to implement it by using Vagrant, Puppet and Devstack.

Fig. 2: Sample 2-node architecture using DRBD, Corosync and Pacemaker.

Fig. 2: Sample 2-node architecture using DRBD, Corosync and Pacemaker.

Step 2: Calculate outage probability

Availability is usually measured during a given time period (e. g. one year). It is the fraction of uptime divided by total time. If we want to calculate the risk/probability of outages in the observed period, we must know at least two values: the total downtime of a component (which can be evaluated when the availability is known)  and the average recovery time. Both values are parameters which are needed to estimate the number of outages in the observed time period. In (Tab. 1) we have a list of all OpenStack components which are present in one node of the OpenStack installation. Availability is observed for a time period of one year (= 31’535’000 seconds). If we assign each component an availability value and an average recovery time, we can calculate the downtime and the number of outages per year. Because we are interested in the outage risk, we calculate the risk by dividing the number of total outages by the number of days per year. The calculated outage risks can be used now to simulate a typical operational day of the observed OpenStack system.

Tab. 1: Outage risk estimation of OpenStack components.

Tab. 1: Outage risk estimation of OpenStack components.

Step 3: Run Chaos Monkey attack

Although Chaos Monkey disables devices randomly, a realistic test assumes that outages do not occur completely randomly. A Chaos Monkey attack should be executed only with probability – not with certainty. Therefore we must create a script which disables the OpenStack services with probabilities we defined in (Tab. 1). Such a script could be written in Python – as shown in (Fig. 2). The most important part of the shutdown mechanism is that probabilities should be assignable to the services we want to disable. The probabilities will be taken from the values we have calculated in (Tab. 1). The other part should be that execution of Chaos Monkey attacks follows a random procedure. This can be achieved by using a simple random number generator which generates a number between 0 and 1. If the random number is smaller than the probability, the Chaos Monkey attack is execeuted (otherwise nothing is performed). This way we can simulate random occurence of outages as if it would be the case in a real OpenStack installation that runs in operational mode.

Fig. 3: Excerpt of a Python script which serves to shutdown OpenStack services.

Fig. 3: Excerpt of a Python script which serves to shutdown OpenStack services.

Step 4: Poll impact of failure

Once the Chaos Monkey attack has been performed, one has to check the impact size of the outage. Failure impact size equals the values in the table of failure impact sizes (Tab. 2). The table of failure impact sizes is derived from the execution of Dependability Modeling (as explained in article 2 of this series). The task at hand is now to poll which user interactions are still available after the Chaos Monkey attack. This can be done by performing the use cases which are affected by an outage of a component. The test tool must be a script which programmatically runs the use cases as tests. If a test fails, the failure impact size is raised according of the weight of the use case. The result of such a test run is a failure impact size after the Chaos Monkey attack.

Tab. 2: Failure impact sizes and use cases affected by component failure.

Tab. 2: Failure impact sizes and use cases affected by component failure.

Cleanup and re-run the test

Test results should be stored in a database. This database should contain failure impact sizes, assumed availabilities and average recovery times as well as information about the system architecture that has been used. When a test run has been completed, the results of the Chaos Monkey attacks have to be reverted in order to be able to re-run the test. With each test-run the database is filled up and one can be more certain about the test results.

Further test runs can be started either with the same architectural setup or with another one: instead of a one-node installation one could use a two-node OpenStack installation, one could use Ceph and Pacemaker as HA clustering software and try different technologies. If we perform steps 1-4 repeatedly, we can rate different OpenStack architectures according to their resistance against outages and find out which architecture fits best to High Availability goals.

If the test framework is applied to an OpenStack environment like e. g. Mobile Cloud Network, High Availability characteristics can be ensured more confidently. Dependability modeling is a useful recipe to test OpenStack architectures from an end users’ perspective. The capabilities of the explained method have not been explored in detail yet, but more will follow soon.

 

Dependability Modeling: Testing Availability from an End User’s Perspective

In a former article we spoke about testing High Availability in OpenStack with the Chaos Monkey. While the Chaos Monkey is a great tool to test what happens if some system components fail, it does not reveal anything about the general strengths and weaknesses of different system architectures.  In order to determine if an architecture with 2 redundant controller nodes and 2 compute nodes offers a higher availability level than an architecture with 3 compute nodes and only 1 controller node, a framework for testing different architectures is required. The “Dependability Modeling Framework” seems to be a great opportunity to evaluate different system architectures on their ability to achieve availability levels required by end users.

Overcome biased design decisions

The Dependability Modeling Framework is a hierarchical modeling framework for dependability evaluation of system architectures. Its purpose is to model different alternative architectural solutions for one IT system and then calculate the dependability characteristics of each different IT system realization. The calculated dependability values can help IT architects to rate system architectures before they are implemented and to choose the “best” approach from different possible alternatives. Design decisions which are based on Dependability Modeling Framework have the potential to be more reflective and less biased than purely intuitive design decisions, since no particular architectural design is preferred to others. The fit of a particular solution is tested versus previously defined criteria before any decision is taken.

Build models on different levels

The Dependability Models are built on four levels: the user level, the function level, the service level and the resource level. The levels reflect the method to first identify user interactions as well as system functions and services which are provided to users and then find resources which are contributing to accomplishment of the required functions. Once all user interactions, system functions, services and resources are identified, models are built (on each of the four levels) to assess the impact of component failures on the quality of the service delivered to end users. The models are connected in a dependency graph to show the different dependencies between user interactions, system functions, services and system resources. Once all dependencies are clear, the impact of a system resource outage to user functions can be calculated straightforward: if the failing resource was the only resource which delivered functions which were critical to the end user, the impact of the resource outage is very high. If there are redundant resources, services or functions, the impact is much less severe.
The dependency graph below demonstrates how end user interactions depend on functions, services and resources.
Dependability Graph

Fig. 1: Dependency Graph

The Dependability Model makes the impact of resource outages calculable. One could easily see that a Chaos Monkey test can verify such dependability graphs, since the Chaos Monkey effectively tests outage of system resources by randomly unplugging devices.  The less obvious part of the Dependability Modelling Framework is the calculation of resource outage probabilities. The probability of an outage could only be obtained by regularly measuring unavailability of resources over a long time frame. Since there is no such data available, one must estimate the probabilities and use this estimation as a parameter to calculate the dependability characteristics of resources so far. A sensitivity analysis can reveal if the proposed architecture offers a reliable and highly available solution.


Dependability Modeling on OpenStack HA Environment

Dependability Modeling could also be performed on the OpenStack HA Environment we use at ICCLab. It is obvious that we High Availability could be realized in many different ways: we could use e. g. a distributed DRBD device to store all data used in OpenStack and synchronize the DRBD device with Pacemaker. Another possible solution is to build Ceph clusters and again use Pacemaker as synchronization tool. An alternative to Pacemaker is keepalived which also offers synchronization and control mechanisms for Load Balancing and High Availability. And of course one could also think of using HAProxy for Load Balancing instead of Ceph or DRBD.
In short: different architectures can be modelled. How this is done will be subject of a further blog post.

The core components of any HA strategy

In his excellent article in Linux Technical Review #04 Jens-Christoph Brendel proposes a new way how to implement High Availability (HA) in current IT architectures. According to Bendel, modern IT architectures continually gain in complexity. This fact makes it difficult to guarantee availability on a certain level. Nevertheless High Availability is not merely a competitional advantage: for many companies keeping availability levels above 99,999 % per year is a matter of existence. Therefore a few systematic steps should help in planning and implementing high availability in your IT environment. This article shows a possible strategy on how to plan High Availability in the Mobile Cloud environment.

Redundancy vs. Complexity

According to Brendel, every HA-strategy starts with an evaluation of necessary degrees of availability each architecture component requires. Basically availability can be increased by adding redundant components (as mentioned in my former article). On the other hand, every new component makes the overall system more complex and increases the risk of component failures.  In short: there is always a trade off between avoiding system component outages and adding complexity (and possible points of failure) to the overall architecture by adding redundant components to an IT architecture. For the OpenStack environment this means one has to classify the different OpenStack components according to the availability an OpenStack user requires.

AEC-classification proposal for OpenStack

One possible classification for IT components is the AEC-classification developed by the Harvard Research Group. The AEC-classes reach from AEC-0 (non-critical systems, typically 90% availability) to AEC-5 (disaster-tolerant systems, 99.99999% or “Five-Nines” availability). OpenStack basically consists in the following components: Nova (including Nova-Compute, Nova-Volume and Nova-Network), Horizon, Swift (ObjectStore), Glance, Cinder, Quantum and Keystone. A typical OpenStack end user has to deal with these components in order to be able to handle his cloud installation. One has to think about the targeted availability levels of these components in order to know more about the overall stability of the OpenStack cloud environment. Some components need not be AEC-5, but for others AEC-5 is a must. The following table is a proposal of AEC-classes for each of the OpenStack components.

table_aec

Of course the real availability architecture of a productive OpenStack implementation also depends on how many OpenStack nodes are used and on the underlying virtual and even physical infrastructure, but this proposal serves as a good starting point to think about adequate levels of availability in productive OpenStack architectures. How do we secure critical components like Nova or Keystone against failures? Any OpenStack HA strategy must focus on this question first.

Risk Management and the “Chaos Monkey”

The next steps towards developing an OpenStack HA strategy are risk identification and risk management. It is obvious that the risk of a component failure depends on the underlying physical and virtual infrastructure of the current OpenStack implementation and also on the requirements of the end users, but to investigate risk probabilities and impacts, we must have a test on what happens to the OpenStack cloud if some components fail. One such test is the “Chaos Monkey” test developed by Netflix. A “Chaos Monkey” is a service which identifies groups of systems in an IT architecture environment and randomly terminates some of the systems. The random termination of some components serves as a test on what happens if some systems in a complex IT environment randomly fail. The risk of component failures in an OpenStack implementation could be tested by using such Chaos Monkey services. By running multiple tests on multiple OpenStack configurations one can easily learn if the current architecture is able to reach the required availability level or not.

Further toughts

Should OpenStack increase in terms of availability and redundancy? According to TechTarget, the OpenStack Grizzly release should become more scalable and reliable than former releases. A Chaos Monkey test could reveal if the decentralization of components like Keystone or Cinder can lead to enhanced availability levels.