Reliability Analysis of OpenStack VMs using Python, fabric and R – Part 2: Reliability Measurements

After having completed part 1 of our series about reliability analysis, we now start with our first reliability measurement experiment. According to reliability theory there are three things we could measure: survival probability, hazard rate and failure rate. The last one is the easiest to measure in practice. Therefore we design an experiment to measure the failure rate of OpenStack VMs under heavy load.

Failure rates can be constant, ascending or declining over time. In order to measure the general tendency of a failure rate we have to perform a time series analysis. We start up several OpenStack VMs, put them under stress by running a certain task on them and then count how many of the VMs are still alive after a certain amount of time. The stress task is performed several times on the same VMs and the number of machines that are still alive is counted repeatedly in order to get a time series of failure rates.

Design of experiment

As in most experiments, we cannot simply observe the real-world behavior of VMs, since that would take far too long. Under normal conditions a server used in a production environment should run for 5-8 years, and we normally don’t have time to run an experiment over such a long time frame. How, then, can we say something about the failure rate or the number of outages that a server is expected to have during that time frame? We replace the time frame parameter by a smart grouping of tested items. Instead of running one VM for 60 days (about 2 months) to create an experiment that sufficiently reflects conditions in the real world, we could run 60 VMs for just one day. And instead of counting the days until the VM faces an outage, we count the number of VMs that are still alive after a certain amount of time.

The same shift from the time dimension to an increase of the number of tested items is done in other engineering disciplines. In light bulb testing the experimenters usually do not run one light bulb for 100’000 hours to evaluate its expected life time. They test 100 light bulbs and run them for 1’000 hours. Then they count how many light bulbs are still alive after 1’000 hours and deduce the expected life time of a single light bulb from the percentage of light bulbs that did not survive the 1’000 hours of the experiment.
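
As a rough illustration of this deduction, the following sketch estimates a failure rate and an expected lifetime from the number of survivors. It assumes a constant failure rate (exponentially distributed lifetimes), which the light bulb example itself does not state. The function name `estimate_failure_rate` and the figure of 95 survivors are made up for illustration.

```python
import math

def estimate_failure_rate(items_tested, survivors, test_hours):
    """Estimate a constant failure rate (failures per item-hour).

    Assumes exponentially distributed lifetimes, so that the surviving
    fraction after t hours is exp(-rate * t).
    """
    if not 0 < survivors <= items_tested:
        raise ValueError("survivors must be between 1 and items_tested")
    surviving_fraction = survivors / float(items_tested)
    return -math.log(surviving_fraction) / test_hours

# Hypothetical numbers: 100 light bulbs, 95 still alive after 1000 hours.
rate = estimate_failure_rate(100, 95, 1000)
expected_lifetime = 1.0 / rate  # mean time to failure under the assumption
print("failure rate per hour: %.6f" % rate)
print("expected lifetime in hours: %.0f" % expected_lifetime)
```

The same calculation applies to VMs if "items" are VMs and "hours" is the duration of one test run.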

The first part of our experiment is getting several VMs in OpenStack up and running. The bottleneck in OpenStack VM creation is usually the number of public IPs, since the world is running out of public IPv4 addresses and OpenStack admins usually severely restrict the number of floating IPs that are made available to OpenStack users. For the sake of experiment we will create VMs that don’t have a public IP. In order to coordinate the experiment, they must be reachable from a single OpenStack VM that can be accessed from the outside world. Thereby we can run multiple VMs without having to worry about the number of public IPs that we have to assign to them.

Creating OpenStack VMs

The best way to programmatically create and run OpenStack VMs is the Python OpenStack API. A manual on how to install the Python OpenStack API can be found here. We have prepared a Python script that can be downloaded from Github and that can be used for OpenStack VM creation. The essential piece of code for creating the VMs is the following:

vm_list = []
for i in range(5):
    vm_name = 'Test_VM%s' % i
    # Reuse a VM with this name if it already exists, otherwise create it.
    existing_vms = VM_MANAGER.findall(name=vm_name)
    if not existing_vms:
        vm = VM_MANAGER.create(name=vm_name,
                               image=image.id,
                               flavor=flavor.id,
                               security_groups=[sec_group.human_id],
                               key_name=pk.name,
                               nics=nics,
                               availability_zone='nova')
    else:
        vm = existing_vms[0]
    # Poll once per second until the VM is ACTIVE or has failed.
    while vm.status not in ('ACTIVE', 'ERROR'):
        print("VM ID: %s name: %s in status: %s" % (vm.id, vm.name, vm.status))
        time.sleep(1)
        vm = VM_MANAGER.findall(name=vm_name)[0]
    if vm.status == 'ERROR':
        print("VM ID: %s name: %s CREATION FAILED!!" % (vm.id, vm.name))
        continue  # do not add failed VMs to the list
    print("VM ID: %s name: %s CREATION SUCCESSFUL." % (vm.id, vm.name))
    vm_list.append(vm)

Each VM is created and the script then polls the API until the VM’s status is ‘ACTIVE’. Remember to change the details in the ‘config.ini’ file to match your OpenStack credentials.
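
The polling loop in the listing waits without an upper time limit. A minimal sketch of a polling helper with a timeout could look like the following. The name `wait_for_status` and its parameters are hypothetical and not part of the OpenStack API; the status source is passed in as a plain callable.

```python
import time

def wait_for_status(fetch_status, wanted='ACTIVE', failed='ERROR',
                    timeout=300, interval=1, sleep=time.sleep):
    """Poll fetch_status() until it returns `wanted`.

    Returns True on success, False if the `failed` status is seen or
    the timeout (in seconds) expires first.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = fetch_status()
        if status == wanted:
            return True
        if status == failed:
            return False
        sleep(interval)
    return False

# Usage with a stand-in status source (no OpenStack connection needed).
statuses = iter(['BUILD', 'BUILD', 'ACTIVE'])
print(wait_for_status(lambda: next(statuses), sleep=lambda seconds: None))
```

With the creation script, one could pass `lambda: VM_MANAGER.findall(name=vm_name)[0].status` as `fetch_status`.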

Test program

Once we have created our test VMs, we are ready to upload our test program. The test program runs some tasks on the VMs to put them under stress. Thereby we simulate real-world situations such as many users accessing the same machine at the same time and creating heavy load on the VM. This could happen, for example, if you run a shopping website and announce a special “Black Friday Sale” offer. All shoppers go to your website and buy items, using resources of the VM that hosts the website. As a result of this resource usage, the VM becomes unresponsive and finally dies. We want to find out how probable such a situation is for OpenStack VMs. Therefore we run a Python program on each VM and let it create heavy load on that VM.

We will use a simple Python program that calculates many different Fibonacci numbers in parallel by spawning multiple parallel processes.

The program could be like the following:

import csv
import logging
import random
import time
from multiprocessing import Process, Queue, cpu_count, current_process

try:
    from queue import Empty  # Python 3
except ImportError:
    from Queue import Empty  # Python 2

logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(message)s')

ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)
ch.setFormatter(formatter)  # attach the formatter so timestamps are logged
logger.addHandler(ch)

# Note: every process works on its own copy of this dictionary.
fibo_dict = {}
number_of_cpus = cpu_count()

data_queue = Queue()

def producer_task(q, fibo_dict):
    # Fill the queue with 15 random Fibonacci indices to compute.
    for i in range(15):
        value = random.randint(10000, 50000)
        fibo_dict[value] = None
        logger.info("Producer [%s] putting value [%d] into queue... " % (current_process().name, value))
        q.put(value)

def consumer_task(q, fibo_dict):
    # Take values from the queue until it is drained.
    while True:
        try:
            value = q.get(True, 0.05)
        except Empty:
            # The queue is empty or another consumer took the last item.
            break
        a, b = 0, 1
        for item in range(value):
            a, b = b, a + b
        fibo_dict[value] = a
        logger.info("Consumer [%s] getting value [%d] from queue... " % (current_process().name, value))

if __name__ == "__main__":
    start = time.time()
    producer = Process(target=producer_task, args=(data_queue, fibo_dict))
    producer.start()
    producer.join()
    # Start one consumer process per CPU to create load on all cores.
    consumer_list = []
    for i in range(number_of_cpus):
        consumer = Process(target=consumer_task, args=(data_queue, fibo_dict))
        consumer.start()
        consumer_list.append(consumer)
    for consumer in consumer_list:
        consumer.join()
    runtime = time.time() - start
    print("Runtime: %s" % runtime)
    # Append the measured runtime to the CSV file and close the file properly.
    with open('/opt/response_time.csv', 'a') as data_file:
        data_writer = csv.writer(data_file, delimiter=';', quotechar='|')
        data_writer.writerow((runtime,))

 

Thereby we measure the time it takes to completely execute the task. The time value is stored on the VM where it is executed.
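
The stored values can later be read back with the same delimiter settings. The following is a minimal sketch; the helper name `read_runtimes` is hypothetical, and an in-memory buffer with made-up values stands in for the file /opt/response_time.csv.

```python
import csv
import io

def read_runtimes(lines):
    """Parse rows written by the test program into a list of floats."""
    reader = csv.reader(lines, delimiter=';', quotechar='|')
    return [float(row[0]) for row in reader if row]

# Made-up sample content: one runtime (in seconds) per test run.
sample = io.StringIO(u"12.5\n13.1\n47.9\n")
runtimes = read_runtimes(sample)
print(runtimes)
```

To read a real file, an open file object can be passed instead of the buffer, e.g. `read_runtimes(open(path))`.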

Test runner

The test program is executed from a remote VM using fabric. A fabric task to install and run the test program could look like this:

from fabric.api import env, execute, task, parallel, get
import cuisine

@task
def update(package=None):
    cuisine.package_update(package)
    
@task
def upgrade(package=None):
    cuisine.package_upgrade(package)

@task
def install(package):
    cuisine.package_install(package)
    cuisine.package_ensure(package)
    
@task
def pip_install(package):
    cuisine.package_ensure('python-pip')
    command = str('pip install %s' % package)
    cuisine.sudo(command)

@task
def upload_file(remote_location, local_location, sudo=False):
    cuisine.file_upload(remote_location, local_location, sudo=sudo)
    cuisine.file_ensure(remote_location)

@task
@parallel
def run_python_program(program=None, sudo=False):
    cuisine.file_ensure('/usr/bin/python')
    if sudo:
        cuisine.sudo(('/usr/bin/python %s' % program))
    else:
        cuisine.run(('/usr/bin/python %s' % program))

@task
def collect_response_times():
    get('/opt/response_time.csv','/home/ubuntu/response_time_'+env.host+'.csv')  

env.hosts = <VM_LIST>
env.user = <SSH_USERNAME>
env.password = <SSH_PASSWORD>
env.key_filename = <SSH_KEYFILE>

execute(upload_file, '/opt/testprogram.py', 
                               'test_program.py', sudo=True)
execute(run_python_program, program='/opt/testprogram.py', sudo=True)
execute(collect_response_times)
 

Note: Don’t forget to replace <VM_LIST> with the list of IPs of the VMs where you run the test program and <SSH_USERNAME>, <SSH_PASSWORD> and <SSH_KEYFILE> with the SSH credentials that allow you to create an SSH connection to the VMs.

This test runner program executes the test program that calculates the Fibonacci numbers from a remote location. The test program on each VM performs the calculation and appends its execution time to a .csv file. The .csv files on the VMs are then collected by the test runner program running on the remote VM.

That way we generate a sample of execution times and store them on the remote VM. How can we turn this into reliability data? We simply treat particularly long execution times as “failures”. The threshold above which an execution is considered failed must be set after a first series of test runs. By repeatedly running the test runner program above we can gather reliability data about the VMs.
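
The classification of long execution times as failures can be sketched as follows. The function name `failure_fractions`, the threshold of 25 seconds and all execution times are made up for illustration; a real threshold has to be derived from measured data.

```python
def failure_fractions(runs, threshold):
    """Return the fraction of failed VMs for each test run.

    `runs` is a list of test runs; each run is a non-empty list with one
    execution time (in seconds) per VM. An execution counts as a
    failure if it took longer than `threshold` seconds.
    """
    fractions = []
    for run in runs:
        failures = sum(1 for runtime in run if runtime > threshold)
        fractions.append(failures / float(len(run)))
    return fractions

# Made-up execution times for 4 VMs over 3 consecutive test runs.
runs = [
    [12.5, 13.1, 12.9, 13.4],
    [12.8, 31.0, 13.0, 13.2],
    [30.2, 33.5, 13.1, 29.8],
]
print(failure_fractions(runs, threshold=25.0))
```

The resulting list is a small time series: one failure fraction per test run, in the order in which the runs were executed.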

In the next article of this series we will learn how to analyze reliability data collected with the programs mentioned in this article.

15 Comments

  1. Thank you for this great tutorial. It is very well written. Could you provide us with the next one please?

  2. benn

    23. March 2015 at 10:40

    Sure. It takes some time as I am writing a Python software that automates these things (and some scientific papers about it too… ;-)) Next article will be about statistical analysis of reliability data with R.

  3. Thanks for your valuable posting. It was very informative. I am working in web design in Chennai.

  4. I would like to thank you for this article. But I have some Questions:
    1- Is it compulsory to have a public IP address for the remote VM ?
    2- How to find the <VM_LIST>? I must mention that I am working with devstack.

    • How to find the VM_LIST : the IP addresses of the VMs where the test program is executed.

    • benn

      30. March 2015 at 11:20

      1- No. But if you want to execute a test program remotely on the VM, you must have remote access to it. A public IP assigned as floating IP to a VM is just the simplest way to achieve this. If you don’t want to consume public IPs, you could upload the test program to your VM, ssh into your VM and execute it locally inside your VM.
      2- The fabric library needs a list of IPs to create ssh connections to VMs. In this example code these IPs are the floating IPs associated to VMs. The list of floating IPs can be found in the OpenStack dashboard under the “security” tab. Alternatively you could use the command “nova floating-ip-list” from your shell, if you authenticate to OpenStack.

      • Thank you for your answer benn. So if I don’t have a public IP, I cannot execute the test program from a remote VM?

        • benn

          30. March 2015 at 12:59

          In this case (executing the test program from a remote VM) you can use the fixed IP of the VM where the test program runs. You could e. g. run the test runner in one VM and the test program in another one. In this case you don’t need a public IP. You can use the fixed IP of the VM where the test program runs. The fixed IP is normally visible in the table if you type the “nova list” command in a shell. If you can’t get the fixed IP, log into the VM and type “ifconfig | grep inet” and check the output. The fixed IP in OpenStack is normally something like “10.x.x.x”.

          • Thank you very much for your help. I still have other questions.

            1- In the config.ini, where to find the
            os.username , os.password , os.tenant
            os.auth_url = http://your_os_url:35357/v2.0
            2- Isn’t the os.vms_number equal to 5 because it is indicated in openstack VMs creation vm_list = []
            for i in range(5):
            3- In the test runner, where are saved the SSH_username, the SSH_password and the SSH_keyfile ?

            Please excuse me but I am asking all these questions because I am a beginner in both OpenStack and Python. This makes the task difficult for me.

  5. Thank you benn for this tutorial.
    I would like to ask you about the network topology; can you specify if the remote vm is on a public network and connected to the other networks over a private network. Am I right?
    Also, I think there is no longer tenant but rather project ?

    • benn

      14. April 2015 at 14:19

      Hello,

      You mean that part of the code?

      network = NETWORK_MANAGER.findall(label='public-net')[0]
      nics = [{'net-id': network.id}]

      The remote VM is on a public network (the VM has a “floating IP”) and it is connected to a private network (the “fixed IP” in OpenStack talk).
      In newer versions of OpenStack the tenant is called “project”. Some Python client API versions use the term “project”, others use the term “tenant”. Unfortunately there is no unique naming scheme in the Python OpenStack API which confuses everything a little bit.

  6. I would like to thank you for this tutorial but I don’t understand why we need to specify SSH password when we use public and private keys.

    • benn

      10. April 2015 at 15:52

      You don’t have to use password-based authentication. By “password” I mean the SSH password that has been used to generate the private/public keypair. This password is generally useless without knowing the private key. Of course it is optional to use a SSH password to secure your SSH key. It is only that an attacker has to get the content of your private key file and crack the SSH password too in order to succeed. So it is a little bit less convenient for the attacker.

      • Thanks. I think that SSH_username is the username of the image I am going to run on the instance. Isn’t it?

        • benn

          14. April 2015 at 14:08

          If you login via ssh into a machine, you normally type in the following line:

          $ ssh -i /path/to/private_key_file user@host.com

          If your private key file has no passphrase, you can directly log into the server. If your private key file has a passphrase, then you are prompted to enter the password for your private key and the following is displayed in your shell:

          Enter passphrase for key '/path/to/private_key_file':

          Then you enter the password for your private key file and log in.

          Thereby you need:

          Username of the account from which you log in the server. In this example: user
          Address or name of the server to which you log in. In this example: host.com
          Full path to private key file. In this example: /path/to/private_key_file
          A password to your private key file.

          In the article the parameters are the following:

          env.hosts = <VM_LIST>
          env.user = <SSH_USERNAME>
          env.password = <SSH_PASSWORD>
          env.key_filename = <SSH_KEYFILE>
          VM_LIST: corresponds to host.com in the above example (‘$ ssh -i /path/to/private_key_file user@host.com’). The IP or URL of the server you want to log in to.
          SSH_USERNAME: corresponds to user in the above example (‘$ ssh -i /path/to/private_key_file user@host.com’). The user name of the user on the server.
          SSH_PASSWORD: corresponds to the password you enter in the prompt above (‘Enter passphrase for key '/path/to/private_key_file':’). The password for the private key file.
          SSH_KEYFILE: corresponds to /path/to/private_key_file in the above example (‘$ ssh -i /path/to/private_key_file user@host.com’). The path to your private key file.

          I hope this should clarify the most things here.
