VM Reliability Tester for OpenStack

vmrelt

“Measure and benchmark reliability of your OpenStack virtual machines.”

“VM Reliability Tester” is a tool that tests the performance and reliability of virtual machines hosted on an OpenStack cloud platform. It evaluates the failure rate of VMs by performing a stress test on them: VM Reliability Tester installs OpenStack virtual machines, uploads a test program to them, runs this test program remotely and then captures the program execution times to determine the reliability of the virtual machines. If a test program run takes significantly longer to complete than the other runs in its batch, this is counted as a VM failure. Such deviations in execution time are an important benchmark for the performance and reliability of your OpenStack environment.

Why VM Reliability Tester?

Cloud computing (or, more precisely, virtualization) provides virtual resources instead of physical ones. The performance of virtual resources is hidden from the user, because virtual resources abstract from the physical hardware layer. As a system administrator you still might want to know how your virtual machines react under heavy load, and you want true performance measurements rather than promises from your cloud vendor. It is therefore worthwhile to test the virtual machines you have created in your OpenStack cloud and measure their performance before you build a productive infrastructure and deploy productive applications on them. VM Reliability Tester delivers estimates of how your VMs perform when they are running applications. With the data produced by VM Reliability Tester you will be able to:

  • Check whether your VMs perform well enough to meet your performance requirements.
  • Benchmark VM images in terms of application performance.
  • Benchmark OpenStack platforms from different vendors.
  • Acquire data that helps you to shape SLAs and underpinning contracts.

How does it work?

VM Reliability Tester uses a “master” VM which creates the test VMs and uploads the test programs to them. The master VM first configures the test VMs and then runs the uploaded test programs. Test program runs are repeated in a (configurable) batch of several runs: the test program executes the configured number of times on the test VMs, and the execution time of each run is logged. After a batch of test program runs has finished, the master VM captures the logged execution times and calculates the mean and standard deviation of the execution times in the batch. If a test program run took longer than the batch mean plus 3 standard deviations, it is considered a failure and is logged by the master VM in a file called “f_rates.csv”.

Based on the number of batches and test program runs per batch, as well as the number of failures, VM Reliability Tester computes a failure rate sample. This sample is then used to predict failure rate estimates for productive VM infrastructures.
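To illustrate the failure criterion, here is a minimal Python sketch (not the actual vm-reliability-tester code) that flags a run as a failure when its execution time exceeds the batch mean by more than three standard deviations and derives a failure rate for the batch:

import statistics

def count_failures(execution_times):
    """Count runs in one batch that exceed mean + 3 standard deviations."""
    mean = statistics.mean(execution_times)
    stdev = statistics.pstdev(execution_times)  # spread within this batch
    threshold = mean + 3 * stdev
    return sum(1 for t in execution_times if t > threshold)

def failure_rate(execution_times):
    """Failures divided by the number of runs in the batch."""
    return count_failures(execution_times) / len(execution_times)

# Example batch: 19 normal runs of about 12 s and one slow run of 25 s
batch = [12.0] * 19 + [25.0]
print(failure_rate(batch))  # 0.05 -- the 25 s run exceeds the threshold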

Setup and Installation

Prerequisites for installation of VM Reliability Tester are:

  • You must have valid OpenStack authentication credentials and provide them in the setup file “openrc.py”.
  • You have to provide a private/public keypair for authentication with the VMs that you own. The local paths to your public and private key files must be added to the “config.ini” and “remote_config.ini” files.
  • You must have a PC or laptop with Python and the required Python libraries installed.

The tool is installed simply by cloning the GitHub repository and editing the files openrc.py, config.ini and remote_config.ini. Once you have cloned the VM Reliability Tester repository and adjusted the configuration files, you only need to run vm-reliability-tester.py. The script creates several CSV files that contain the failure rates of the VMs and the possible distributions of the failure rate.
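As a rough illustration, openrc.py is a plain Python file holding your OpenStack credentials. The exact variable names expected by the tool may differ, so treat the following as an assumed sketch and compare it with the sample files in the repository:

# Hypothetical openrc.py sketch -- the variable names follow the usual
# OpenStack OS_* credentials and may not match what vm-reliability-tester expects
OS_AUTH_URL = "http://controller.example.com:5000/v2.0"
OS_TENANT_NAME = "my-project"
OS_USERNAME = "my-user"
OS_PASSWORD = "my-secret-password"
OS_REGION_NAME = "RegionOne"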

GitHub page

VM Reliability Tester is available on the following GitHub page:

https://github.com/icclab/vm-reliability-tester

Reliability Analysis of OpenStack VMs using Python, fabric and R – Part 2: Reliability Measurements

After having completed part 1 of our series about reliability analysis, we now start with our first reliability measurement experiment. According to reliability theory there are three things we could measure: survival probability, hazard rate and failure rate. The last one is the easiest to measure in practice. Therefore we design an experiment to measure the failure rate of OpenStack VMs under heavy load.

Failure rates can be constant, increasing or declining over time. In order to capture the general tendency of a failure rate we have to perform a time series analysis. We start several OpenStack VMs, put them under stress by running a certain task on them and then count how many of the VMs are still alive after a certain amount of time. The stress task is performed several times on the same VMs and the number of machines that are still alive is counted repeatedly in order to obtain a time series of failure rates.
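The following Python sketch shows the idea of such a measurement loop. It is illustrative only; run_stress_task() and is_alive() are hypothetical helpers (e.g. a fabric task and an SSH reachability probe):

def failure_rate_series(vms, rounds, run_stress_task, is_alive):
    """Return one failure-rate observation per stress round."""
    series = []
    alive = list(vms)
    for _ in range(rounds):
        if not alive:
            break  # all VMs have failed already
        for vm in alive:
            run_stress_task(vm)  # put the VM under load
        survivors = [vm for vm in alive if is_alive(vm)]
        series.append((len(alive) - len(survivors)) / len(alive))
        alive = survivors
    return series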


Dependability Modeling on OpenStack: Part 2

In the previous article we defined use cases for an OpenStack implementation according to the usage scenario in which the OpenStack environment is deployed. In this part of the Dependability Modeling article series we show how these use cases relate to the functions and services provided by the OpenStack environment and create a set of dependencies between use cases, functions, services and system components. From this set we draw the dependency graph and make the impact of component outages computable.

Construct dependency table

The dependency graph can be constructed once we define which functions, services and components are needed to provide a use case. In the example below (Fig. 1) we defined the system architecture components, services and functions that allow creating, deleting or updating the details of a Telco account (the account of a mobile end user). Since these operations are provided within virtual machines, the VM User Management and VM Security Management functions provide the availability of this use case. Therefore we draw a column which contains these functions. Because these functions need a User Management and an SSH & Password Management service in each VM in order to operate, we draw a second column which contains the required services. Another column lists the system components required to deliver those services.

Fig. 1: Dependency Graph Construction.

The procedure described above is repeated for all use cases. As a result you get a table like the one in (Tab. 1). This dependency table is the starting point for producing the dependency graph.

Tab. 1: Dependencies between Use Cases, Services, Functions and Components.
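In code, each row of such a dependency table can be kept as a simple tuple of use case, function, service and component. The rows below are an illustrative excerpt based on the examples in the text, not the complete Tab. 1 (the component entries for the Telco account rows are assumed here):

# (use case, function, service, component) -- illustrative excerpt only
dependency_table = [
    ("Manage Telco Account", "VM User Management",     "User Management in VM",     "VM"),
    ("Manage Telco Account", "VM Security Management", "SSH & Password Management", "VM"),
    ("Measure SLAs",         "Monitoring of VM",       "Ceilometer Monitoring",     "Ceilometer"),
    ("Measure SLAs",         "Monitoring of VM",       "Ceilometer Monitoring",     "VM plugin"),
]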

Construct dependency graph

For each component listed in the table you have to model the corresponding services, functions and use cases. This is done as in the example in (Fig. 2). We start from the right of the graph with the Ceilometer component and the VM plugin and check which services are provided by those components: here it is the “Ceilometer Monitoring” service. Therefore we draw an icon that represents this service and draw arrows from the Ceilometer and VM plugin components to the service icon (1). In the next step we check which function is provided by the Ceilometer Monitoring service: this is the “Monitoring of VM” function. Therefore we paste an icon for the function and draw an arrow to it (2). Then we look for the use cases provided by the Monitoring of VM function. Since this includes e. g. “Measure SLAs”, we paste an icon for this use case and draw another arrow to “Measure SLAs” (3). The first path between a use case and the components it depends on is drawn. This procedure is repeated for all components in (Tab. 1).
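The same traversal can be sketched in a few lines of Python. The edges below encode only the Ceilometer example from the text; following them from a component yields the use cases that depend on it:

from collections import defaultdict

edges = defaultdict(set)
for src, dst in [
    ("Ceilometer", "Ceilometer Monitoring"),        # component -> service (1)
    ("VM plugin", "Ceilometer Monitoring"),
    ("Ceilometer Monitoring", "Monitoring of VM"),  # service -> function (2)
    ("Monitoring of VM", "Measure SLAs"),           # function -> use case (3)
]:
    edges[src].add(dst)

def reachable_use_cases(component, use_cases):
    """All use cases reachable from a component in the dependency graph."""
    seen, stack = set(), [component]
    while stack:
        for nxt in edges[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen & use_cases

print(reachable_use_cases("Ceilometer", {"Measure SLAs"}))  # {'Measure SLAs'}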

Fig. 2: Dependency Graph Construction from Dependency Table.

The result is the dependency graph shown below (Fig. 3).

Fig. 3: Dependency Graph of OpenStack Environment.

Add weight factors to use cases

Once the dependency graph is constructed, we can calculate the “impact” of component outages. When a component fails, you can simply follow the arrows in the dependency graph to see which user interactions (use cases) are no longer available to end users. If e. g. the Ceilometer component fails, you would not be able to measure SLAs, meter the usage of Telco services or monitor the VM infrastructure.

However, it would be too simplistic to treat every use case as equally important to the end user. Some user interactions, e. g. the creation of new VM nodes, need not be available all the time (or at least this depends on the OLAs of the Telco). Other actions, e. g. Telco authentication, must be available at all times. Therefore we have to add weight factors to the use cases. This can be done by adding another column named “Weight factor” to the dependency table. The weight factor should be a score measuring the “importance” of a user interaction in terms of business need. In a productive OpenStack environment, financial values (which correspond to the business value of the user interaction) could be assigned as weight factors to each use case. For reasons of simplicity we take the ordinal values 1, 2 and 3 as weight factors (where 1 signifies the least important user transaction and 3 the most important). For each use case row in the dependency table we add the corresponding weight factor (Fig. 4).

Fig. 4: Assignment of weight factors.

As a next step, we create a pivot table containing the components and use cases as consecutive row fields and the weight factors as the data field. In order to avoid counting use cases twice, we use the maximum function instead of the sum function as the aggregation. As a result we get the pivot table in (Tab. 2).

Tab. 2: Pivot Table of Component/Use Case dependencies.
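Using pandas, this pivot step can be sketched as follows. The rows are a tiny illustrative excerpt with assumed weight factors, and the max aggregation keeps duplicate use-case rows from being counted twice:

import pandas as pd

rows = pd.DataFrame([
    {"component": "Ceilometer", "use_case": "Measure SLAs",         "weight": 2},
    {"component": "Ceilometer", "use_case": "Measure SLAs",         "weight": 2},  # duplicate row
    {"component": "VM plugin",  "use_case": "Measure SLAs",         "weight": 2},
    {"component": "Keystone",   "use_case": "Telco Authentication", "weight": 3},
])

# one row per (component, use case) with the maximum weight factor
pivot = rows.pivot_table(index=["component", "use_case"],
                         values="weight", aggfunc="max")
print(pivot)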

Calculate outage impacts

Calculating the impact of system component outages is now quite straightforward: look at the pivot table and compute, for each component, the sum of the weight factors assigned to it. As a result we get a table of failure impact sizes (Tab. 3).
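Continuing the pandas sketch (again with illustrative numbers only), the failure impact size of a component is the sum of the deduplicated weight factors of the use cases that depend on it:

import pandas as pd

# pivot as produced in the previous sketch: one (component, use case) row
# holding the maximum weight factor
pivot = pd.DataFrame(
    {"weight": [2, 3, 2]},
    index=pd.MultiIndex.from_tuples(
        [("Ceilometer", "Measure SLAs"),
         ("Keystone", "Telco Authentication"),
         ("VM plugin", "Measure SLAs")],
        names=["component", "use_case"]),
)

impact = pivot.groupby(level="component")["weight"].sum().sort_values(ascending=False)
print(impact)  # higher value = larger failure impact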

Tab. 3: OpenStack Components and Failure Impact Sizes.

This table reveals which components are very important for the overall reliability of the OpenStack environment and which are not. It is an operationalization of the measurement of “failure impact” for a given IT environment: failure impacts can be measured as numbers. The advantage of this approach is that we can build a test framework for OpenStack availability based on the failure impact sizes.

Unsurprisingly, components with strong supporting functionality, e. g. MySQL or the Keystone component, have high failure impact sizes and should be strongly protected against outages. VM-internal components appear to be less critical, because VMs can easily be cloned and recovered in a cloud environment.

In a further article we will show how availability can be tested on a given OpenStack architecture using these failure impact size values.