VM Reliability Tester for OpenStack

vmrelt

“Measure and benchmark reliability of your OpenStack virtual machines.”

“VM Reliability Tester” is a tool that tests the performance and reliability of virtual machines hosted on an OpenStack cloud platform. It evaluates the failure rate of VMs by running a stress test on them. VM Reliability Tester creates OpenStack virtual machines, uploads a test program to them, runs the test program remotely and then captures the program execution times to determine the reliability of the virtual machines. If a run of the test program takes significantly longer than the other runs, it is counted as a VM failure. Such deviations in execution time are an important benchmark for the performance and reliability of your OpenStack environment.

Why VM Reliability Tester?

Cloud computing (or, to be more precise, virtualization) provides virtual resources instead of physical ones. The performance of virtual resources is hidden from the user, because virtual resources abstract from the physical hardware layer. As a system administrator, you might still want to know how your virtual machines react under heavy load, and you want real performance measurements instead of promises from your cloud vendor. It can therefore be an advantage to test how the virtual machines you have created in your OpenStack cloud react, and to measure their performance, before you build a production infrastructure and deploy production applications on it. VM Reliability Tester gives you estimates of how your VMs perform when they are running applications. With the data produced by VM Reliability Tester you will be able to:

  • Check if your VM is performing well enough to serve your performance requirements.
  • Benchmark VM images in terms of application performance.
  • Benchmark OpenStack platforms from different vendors.
  • Acquire data that helps you to shape SLAs and underpinning contracts.

How does it work?

VM Reliability Tester uses a “master” VM that creates the test VMs and uploads the test programs to them. The master VM first configures the test VMs and then runs the uploaded test programs. Test program runs are grouped into batches, and the number of runs per batch is configurable. The test program executes the configured number of times on the test VMs and logs the execution time of every run. After a batch has finished, the master VM collects the logged execution times and calculates their mean and standard deviation. If a run took longer than the batch mean plus 3 standard deviations, it is counted as a failure and logged by the master VM in a file called “f_rates.csv”.
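The failure criterion described above (a run slower than the batch mean plus 3 standard deviations) can be sketched in a few lines of Python. This is only an illustration and not the tool's actual code: the function name `find_failures` and the use of the sample standard deviation are assumptions.

```python
import statistics


def find_failures(execution_times, sigmas=3.0):
    """Return the indices of runs slower than mean + sigmas * stdev.

    Hypothetical helper; assumes the sample standard deviation is used.
    """
    if len(execution_times) < 2:
        # A standard deviation needs at least two measurements.
        return []
    mean = statistics.mean(execution_times)
    stdev = statistics.stdev(execution_times)
    threshold = mean + sigmas * stdev
    return [i for i, t in enumerate(execution_times) if t > threshold]


# Example batch: 19 normal runs of 1.0 s and one slow run of 5.0 s.
batch = [1.0] * 19 + [5.0]
failures = find_failures(batch)
print(failures)
```

Note that with very small batches a single slow run can never exceed the 3-sigma threshold, because the outlier itself inflates the standard deviation. Batches therefore need a reasonable number of runs (more than about ten) for this criterion to detect anything.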

Based on the number of batches, the number of test program runs per batch and the number of failures, VM Reliability Tester computes a failure rate sample. This sample is then used to estimate failure rates in production VM infrastructures.
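One plausible way to turn failure counts into a failure rate sample is to divide the number of failures in each batch by the number of runs in that batch. The sketch below assumes this definition; the function name `failure_rate_sample` and the example numbers are hypothetical and not taken from the tool.

```python
def failure_rate_sample(failures_per_batch, runs_per_batch):
    """Return one failure rate per batch (failures divided by runs).

    Hypothetical helper; the tool's exact formula is not given in the text.
    """
    if runs_per_batch <= 0:
        raise ValueError("runs_per_batch must be positive")
    return [failures / runs_per_batch for failures in failures_per_batch]


# Example: four batches of 20 runs each, with 0, 1, 0 and 2 failures.
sample = failure_rate_sample([0, 1, 0, 2], runs_per_batch=20)
overall_rate = sum([0, 1, 0, 2]) / (4 * 20)
print(sample, overall_rate)
```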

Setup and Installation

Prerequisites for installation of VM Reliability Tester are:

  • You must have valid OpenStack authentication credentials and provide them in the setup file “openrc.py”.
  • You have to provide a private/public keypair for authentication with the VMs that you own. The local paths to your public and private key files must be added to the “config.ini” and “remote_config.ini” files.
  • You need a PC or laptop with Python and some Python libraries installed.

Installing the tool is easy: clone the GitHub repository and edit the files openrc.py, config.ini and remote_config.ini. Once you have cloned the VM Reliability Tester repository and made the configuration changes, you only need to run vm-reliability-tester.py. The script creates several CSV files that contain the failure rates of the VMs and the possible distributions of the failure rate.

GitHub page

VM Reliability Tester is available on the following GitHub page:

https://github.com/icclab/vm-reliability-tester

Reliability Analysis of OpenStack VMs using Python, fabric and R – Part 2: Reliability Measurements

Having completed part 1 of our series about reliability analysis, we now start with our first reliability measurement experiment. According to reliability theory, there are three things we could measure: survival probability, hazard rate and failure rate. The last one is the easiest to measure in practice. Therefore we design an experiment to measure the failure rate of OpenStack VMs under heavy load.

Failure rates can be constant, ascending or declining over time. In order to measure the general tendency of a failure rate we have to perform a time series analysis. We start up several OpenStack VMs, put them under stress by running a certain task on them and then count how many of the VMs are still alive after a certain amount of time. The stress task is performed several times on the same VMs and the number of machines that are still alive is counted repeatedly in order to get a time series of failure rates.
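To illustrate how repeated counts of surviving VMs become a time series of failure rates, here is a minimal Python sketch. It assumes that the failure rate of a round is the fraction of VMs alive before the round that are no longer alive after it. The function name `interval_failure_rates` and the sample counts are hypothetical.

```python
def interval_failure_rates(alive_counts):
    """Return the failure rate for each round of the stress task.

    alive_counts[i] is the number of VMs still alive after round i,
    with alive_counts[0] being the number of VMs started.
    """
    rates = []
    for before, after in zip(alive_counts, alive_counts[1:]):
        if before == 0:
            # No VMs left, so no further rates can be computed.
            break
        rates.append((before - after) / before)
    return rates


# Example: 20 VMs started; 18, 18 and 15 alive after three rounds.
rates = interval_failure_rates([20, 18, 18, 15])
print(rates)
```

A rising, falling or flat sequence of these rates corresponds to an ascending, declining or constant failure rate.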


Reliability Analysis of OpenStack VMs using Python, fabric and R – Part 1: Reliability Concepts

How reliable are your OpenStack VMs? How many outages do you expect to occur during 8 months of operation? Do your VMs crash regularly or randomly, or do VM outages increase over time? These questions can only be answered if we perform a reliability analysis of the virtual machines that we manage. In this small guide we show you how to check the reliability of VMs in your OpenStack environment. In part 1 of this four-part series we explain the basic concepts of reliability engineering.

Reliability engineering is a vast field that has been used widely in engineering disciplines such as aircraft design, civil engineering, electricity management and product management. Although reliability engineering has proven to help in building high-quality engineering products, it has almost never been used in cloud computing so far. Programmers might distrust these scientifically proven reliability analysis methods, since they involve math and statistical exploration. With a little introduction, however, this is not a serious obstacle.

Reliability engineering deals with analyzing and measuring the outage behavior of engineered systems, testing improvements that make the system more reliable, implementing those improvements and validating whether they have reduced the occurrence of outages. The first step is the analysis of outage behavior. How can outages be analyzed?
