ElasTest Passes European Commission’s Review Successfully!

On July 18th in Brussels, the project partners presented ElasTest results and progress to a panel of three independent experts appointed by the European Commission and to the EC Project Officer. The key project objective is to improve the efficiency of testing large-scale, complex software systems. The ElasTest project is coordinated by URJC. ZHAW’s ICCLab is a key project partner delivering research and technology in the area of service delivery, monitoring and billing.

The objective of this review was to evaluate the project's progress over its first 18 months, covering both the technical evolution and the administrative coordination. To assess the project, the three reviewers analysed all the public and private information related to it.

The evaluation meeting lasted 8 hours, during which we presented the progress made in research, innovation, demos, exploitation plans and sustainability, as well as coordination issues. The most challenging part was the demonstration of the software developed by the different project partners: a one-hour session in which all the software artifacts were successfully demonstrated, including the ZHAW work. All of these efforts were welcomed by the reviewers. Finally, after an initial deliberation, the reviewers communicated their decision to approve the project and congratulated the team on a successful review!

The project is now focused on its second phase: now that the initial platform has been developed, integrated and is up and running, most of our efforts will be dedicated to research and to creating a community of users around ElasTest.

For more information on ElasTest, check out our site and code repositories.

ElasTest KickOff Meeting

The most limiting factor in development today is software validation, which typically requires very costly and complex testing processes. The ElasTest project aims to develop an elastic platform for testing large, complex distributed software systems. It will develop a novel test orchestration theory and toolbox enabling the creation of complex test suites as the composition of simple testing units.

The ElasTest kickoff meeting took place in January in Madrid. The ElasTest consortium comprises 10 partners, including IBM, ATOS and the Technical University of Berlin, and is coordinated by Universidad Rey Juan Carlos.

Consortium:

  • Universidad Rey Juan Carlos (URJC)
  • Fraunhofer Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. (Fraunhofer)
  • Technische Universitaet Berlin (TUB)
  • Consiglio Nazionale Delle Ricerche (CNR)
  • Fundación Imdea Software (IMDEA)
  • Atos Spain S.A. (ATOS)
  • Zürcher Hochschule Für Angewandte Wissenschaften (ZHAW)
  • Tikal Technologies S.L. (NAEVATEC)
  • IBM Ireland Limited (IBM IRE)
  • Production Trade And Support of Machinable Products of Software and Informatics – Relational Technology A.E. (RELATIONAL)

For more information on the ElasTest project visit our ElasTest section!

ElasTest on the EU Portal

Testing PyMongo applications with MockupDB

In one of our projects, we needed to test some MongoDB-based backend functionality: we wrote a small application comprising a MongoDB backend and a Python app which communicated with the backend via PyMongo. We like the flexibility of MongoDB in a rapid prototyping context and did not want to go with a full-fledged ORM model for this app. Here we describe how we used MockupDB to perform some unit testing on this app.

VM Reliability Tester for OpenStack

“Measure and benchmark reliability of your OpenStack virtual machines.”

“VM Reliability Tester” is a tool that tests the performance and reliability of virtual machines hosted on an OpenStack cloud platform. It evaluates the failure rate of VMs by performing a stress test on them. VM Reliability Tester installs OpenStack virtual machines, uploads a test program to them, runs this test program remotely and then captures program execution times to determine the reliability of the virtual machines. If a test program run takes significantly longer than the other runs, it is counted as a VM failure. Such deviations in execution time are an important benchmark for testing the performance and reliability of your OpenStack environment.

Why VM Reliability Tester?

Cloud computing (or to be more precise: virtualization) provides virtual resources instead of physical ones. The performance of virtual resources is hidden from the user, because virtual resources abstract from the physical hardware layer. As a system administrator you might still want to know how your virtual machines react under heavy load, and you want true performance measurements rather than promises from your cloud vendor. It can therefore be an advantage to test how the virtual machines you have created in your OpenStack cloud react, and to measure VM performance, before creating a production infrastructure and deploying production applications on it. VM Reliability Tester gives you estimates of how your VM performs when it is running applications. With the data produced by VM Reliability Tester you will be able to:

  • Check if your VM is performing well enough to serve your performance requirements.
  • Benchmark VM images in terms of application performance.
  • Benchmark OpenStack platforms from different vendors.
  • Acquire data that helps you to shape SLAs and underpinning contracts.

How does it work?

VM Reliability Tester uses a “master” VM which creates the test VMs and uploads test programs to them. The master VM first configures the test VMs and then runs the uploaded test programs. Test program runs are repeated in batches of a configurable size. The test program executes the configured number of times on each test VM and logs the execution time of each run. After a batch of test program runs has finished, the master VM captures the logged execution times and calculates the mean and standard deviation of the execution times in the batch. If a test program run took longer than the batch mean plus 3 standard deviations, it is considered a failure and logged by the master VM in a file called “f_rates.csv”.

Based on the number of batches, the number of test program runs per batch and the number of failures, VM Reliability Tester computes a failure rate sample. This sample is then used to estimate failure rates in production VM infrastructures.
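
The failure criterion described above can be sketched in a few lines of Python. This is a minimal illustration and not the tool's actual code: the function names, the sample execution times and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def find_failures(times, k=3.0):
    """Return the indices of runs slower than batch mean + k standard deviations."""
    threshold = mean(times) + k * pstdev(times)
    return [i for i, t in enumerate(times) if t > threshold]


def failure_rate(batches, k=3.0):
    """Number of failed runs divided by the total number of runs in all batches."""
    total_runs = sum(len(batch) for batch in batches)
    failures = sum(len(find_failures(batch, k)) for batch in batches)
    return failures / total_runs


# Hypothetical execution times in seconds: one slow run in the first batch.
batches = [[1.0] * 19 + [10.0], [1.0] * 20]
print(find_failures(batches[0]))
print(failure_rate(batches))
```

With the population standard deviation, a single slow run in a batch of n runs lies at most sqrt(n - 1) standard deviations above the mean, so under this assumption a batch needs more than 10 runs before the 3-sigma rule can flag anything.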

Setup and Installation

Prerequisites for installation of VM Reliability Tester are:

  • You must have valid OpenStack authentication credentials and provide them in the setup file “openrc.py”.
  • You have to provide a private/public keypair for authentication with the VMs that you own. The local paths to your public and private key files must be added to the “config.ini” and “remote_config.ini” files.
  • You need a PC or laptop with Python and some Python libraries installed.

Installing the tool is easy: clone the GitHub repository and edit the files openrc.py, config.ini and remote_config.ini. Once you have cloned the VM Reliability Tester repository and made the configuration changes, you only need to run vm-reliability-tester.py. The script will create some CSV files that contain the failure rates of the VMs and the possible distributions of the failure rate.

GitHub page

VM Reliability Tester is available on the following GitHub page:

https://github.com/icclab/vm-reliability-tester

Specification of data to be collected in Dependability Modeling

In part 3 of our article series “Dependability Modeling on OpenStack” we discussed running Chaos Monkey tests on an OpenStack HA installation and then collecting data about the impact of the attack. While we did say that we want to collect data about the implemented OpenStack HA architecture, we were not specific about which data we should actually collect. This article gives some hints on what is important when collecting data about HA system architectures.

What should be measured?

A very interesting question is what should be measured during a Chaos Monkey test run. The Dependability Modeling Framework is used to measure the capability of a system architecture to deliver “low” impacts of system outages. Therefore we should measure the impact of outages. The impact is a score which is derived from the Dependability graph. It should be measured as a result of a test run.

What is analysed in Dependability Modeling?

In Dependability Modeling we are interested in correlations between the system architecture and the outage impact. The system architecture data is mainly categorical data (replication technology used, clustering technology etc.) and the impact is a number. All variables that describe the system architecture are meant to be “explanatory” or “independent” variables, i. e. variables that can be chosen freely in the simulation, while the impact of outages is the “explained” (or “dependent”) variable, because the impact is assumed to be the result of the chosen architecture. In order to find significant correlations between system architecture properties and impact, we must collect values for all explanatory variables and then use a dimensionality reduction method to find which properties are interesting.

How much data should be collected?

First we must say that it is not bad practice to collect “too much” data in a test or a scientific experiment. Classical statistics usually works with small samples, because it was developed in the 19th century, a time when measurements were expensive and statements about data sets had to be derived from small samples. Nowadays we can collect data automatically, so we are not forced to use small samples. We can simulate the whole life cycle of a cloud service: e. g. we could say that an OpenStack service will run for about 8 years, which is 8 x 365 = 2’920 days, and run one Chaos Monkey test for each day. The advantage of automation is that we do not need to rely on samples.

Of course there is a limitation in terms of computational power: a Chaos Monkey test takes about 0.5-1.5 seconds. If we run 2’920 Chaos Monkey tests, the whole simulation run can take up to about 4’380 seconds, which is more than 1 hour. Therefore you either run the simulation as an overnight batch job or limit it to a sample size which adequately represents the overall population. To determine the optimal sample size you could use variance estimation. The sample size can then be obtained using the statistical formula for calculating sample sizes.
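
The runtime estimate can be checked with a short calculation. The function name and parameters below are illustrative assumptions; the figures (8 years, one test per day, 0.5 to 1.5 seconds per test) are the ones used in the text.

```python
def simulation_runtime(years, min_seconds, max_seconds, days_per_year=365):
    """Return (number of tests, shortest runtime, longest runtime) in seconds."""
    tests = years * days_per_year  # one Chaos Monkey test per simulated day
    return tests, tests * min_seconds, tests * max_seconds


tests, fastest, slowest = simulation_runtime(8, 0.5, 1.5)
print(tests, fastest, slowest)
print(slowest / 3600)  # longest runtime in hours
```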

With that specification, we can proceed in developing our test framework. A further article will show a sample data set.

Dependability Modeling on OpenStack: Part 3

In this part of the Dependability Modeling article series we explain how a test framework can be established on an OpenStack architecture. The test procedure has 4 steps. In the first step, we implement the OpenStack environment following the planned system architecture. In the second step we calculate the probabilities of component outages during a given timeframe (e. g. 1 year). In the third step we start a Chaos Monkey script which “attacks” (randomly disables) the components of the system environment, using the calculated probabilities as a basis for the attack. In the last step we measure the impact of the Chaos Monkey attack according to the table of failure impact sizes we created in part 2. The impact of the attack should be stored as a data set in a database. Steps 1-4 form one test run. Multiple test runs can be performed on multiple architectures to create empirical data which allows us to rate the different OpenStack architectures according to their availability.

Step 1: Implement system architecture

An OpenStack architecture can be implemented quite straightforwardly using the Vagrant-Devstack installation. Each OpenStack node can be set up as a Vagrant-Devstack system. First install Virtualbox, then install Vagrant and then install Vagrant-Devstack. Configure Devstack to support a multi-node environment. As a next step you should create an SSH tunnel between the different nodes using Vagrant. Once the different VM nodes are ready, you can start to test the architecture. (Fig. 1) shows a typical OpenStack architecture for a single OpenStack node.

Fig. 1: Typical OS architecture for a single OpenStack node.

High availability is usually only possible in a multi-node environment, because redundant nodes are needed in case of node failures and consequent failovers. Therefore your architecture must be distributed or clustered over several redundant nodes. An example of such an architecture is shown in (Fig. 2). Once the architecture is defined, you have to implement it using Vagrant, Puppet and Devstack.

Fig. 2: Sample 2-node architecture using DRBD, Corosync and Pacemaker.

Step 2: Calculate outage probability

Availability is usually measured over a given time period (e. g. one year). It is the uptime divided by the total time. If we want to calculate the risk/probability of outages in the observed period, we must know at least two values: the total downtime of a component (which can be derived once the availability is known) and the average recovery time. Both values are needed to estimate the number of outages in the observed time period: the number of outages is the total downtime divided by the average recovery time. In (Tab. 1) we have a list of all OpenStack components which are present in one node of the OpenStack installation. Availability is observed for a time period of one year (= 31’536’000 seconds). If we assign each component an availability value and an average recovery time, we can calculate the downtime and the number of outages per year. Because we are interested in the daily outage risk, we divide the total number of outages by the number of days per year. The calculated outage risks can now be used to simulate a typical operational day of the observed OpenStack system.

Tab. 1: Outage risk estimation of OpenStack components.
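
The calculation behind such a table can be sketched as follows. This is a minimal illustration: the function name and the example values (99.9 % availability, one hour average recovery time) are assumptions and are not taken from Tab. 1.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds
DAYS_PER_YEAR = 365


def outage_risk(availability, avg_recovery_seconds):
    """Return (downtime per year in seconds, outages per year, outage risk per day)."""
    downtime = (1.0 - availability) * SECONDS_PER_YEAR
    outages_per_year = downtime / avg_recovery_seconds
    daily_risk = outages_per_year / DAYS_PER_YEAR
    return downtime, outages_per_year, daily_risk


# Hypothetical component: 99.9 % availability, 1 hour average recovery time.
downtime, outages, risk = outage_risk(0.999, 3600)
print(downtime, outages, risk)
```

The daily value can only be read as a probability as long as a component is expected to fail less than once per day.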

Step 3: Run Chaos Monkey attack

Although Chaos Monkey disables devices randomly, a realistic test assumes that outages do not occur completely randomly. A Chaos Monkey attack should be executed only with a certain probability, not with certainty. Therefore we must create a script which disables the OpenStack services with the probabilities we defined in (Tab. 1). Such a script could be written in Python, as shown in (Fig. 3). The most important part of the shutdown mechanism is that probabilities can be assigned to the services we want to disable. The probabilities are taken from the values we calculated in (Tab. 1). The other part is that the execution of Chaos Monkey attacks follows a random procedure. This can be achieved with a simple random number generator which generates a number between 0 and 1. If the random number is smaller than the probability, the Chaos Monkey attack is executed (otherwise nothing is done). This way we can simulate the random occurrence of outages as it would happen in a real OpenStack installation running in operational mode.

Fig. 3: Excerpt of a Python script which serves to shutdown OpenStack services.
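
The probabilistic attack mechanism described above can be sketched like this. It is not the script from Fig. 3: the function name, the service names, the probabilities and the `disable` callback (which stands in for whatever actually stops a service) are all assumptions made for illustration.

```python
import random


def chaos_monkey_run(risks, disable, rng=None):
    """Disable each service with its own outage probability.

    risks   -- dict mapping a service name to its daily outage probability
    disable -- callback that actually shuts the service down (hypothetical)
    rng     -- optional random.Random instance, useful for repeatable runs
    """
    if rng is None:
        rng = random.Random()
    attacked = []
    for service, probability in risks.items():
        # rng.random() returns a number in [0, 1); attack only if it is
        # smaller than the outage probability of this service.
        if rng.random() < probability:
            disable(service)
            attacked.append(service)
    return attacked


# Hypothetical probabilities; a real run would take them from Tab. 1.
risks = {"keystone": 0.024, "mysql": 0.01, "ceilometer": 0.05}
disabled = []
chaos_monkey_run(risks, disabled.append)  # the stub only records the name
print(disabled)
```

Passing the shutdown action as a callback keeps the random selection separate from the way a service is stopped, so the same loop can be tested without touching a real installation.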

Step 4: Poll impact of failure

Once the Chaos Monkey attack has been performed, one has to check the impact size of the outage. The failure impact size corresponds to the values in the table of failure impact sizes (Tab. 2). This table is derived from the execution of Dependability Modeling (as explained in part 2 of this series). The task at hand is now to poll which user interactions are still available after the Chaos Monkey attack. This can be done by performing the use cases which are affected by an outage of a component. The test tool must be a script which programmatically runs the use cases as tests. If a test fails, the failure impact size is raised according to the weight of the use case. The result of such a test run is a failure impact size after the Chaos Monkey attack.

Tab. 2: Failure impact sizes and use cases affected by component failure.
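
A polling script of this kind could look roughly like the sketch below. The use case names, the weights and the stub checks are assumptions; in a real test each check would exercise the OpenStack environment.

```python
def failure_impact(use_case_tests, weights):
    """Run every use case test and add up the weights of those that fail.

    use_case_tests -- dict mapping a use case name to a callable returning
                      True when the use case is still available
    weights        -- dict mapping a use case name to its weight factor
    """
    impact = 0
    for name, test in use_case_tests.items():
        try:
            available = bool(test())
        except Exception:
            # A test that crashes counts as an unavailable use case.
            available = False
        if not available:
            impact += weights[name]
    return impact


def broken_check():
    raise RuntimeError("service unreachable")


# Hypothetical use cases and weights, with stubs instead of real checks.
tests = {
    "Measure SLAs": lambda: False,
    "Telco authentication": lambda: True,
    "Create VM node": broken_check,
}
weights = {"Measure SLAs": 2, "Telco authentication": 3, "Create VM node": 1}
print(failure_impact(tests, weights))
```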

Cleanup and re-run the test

Test results should be stored in a database. This database should contain failure impact sizes, assumed availabilities and average recovery times as well as information about the system architecture that has been used. When a test run has been completed, the results of the Chaos Monkey attacks have to be reverted in order to be able to re-run the test. With each test run the database grows and one can be more certain about the test results.

Further test runs can be started either with the same architectural setup or with another one: instead of a one-node installation one could use a two-node OpenStack installation, one could use Ceph and Pacemaker as HA clustering software and try different technologies. If we perform steps 1-4 repeatedly, we can rate different OpenStack architectures according to their resistance against outages and find out which architecture fits best to High Availability goals.

If the test framework is applied to an OpenStack environment such as Mobile Cloud Network, High Availability characteristics can be ensured more confidently. Dependability modeling is a useful recipe for testing OpenStack architectures from an end user’s perspective. The capabilities of the explained method have not been explored in detail yet, but more will follow soon.

DRBD-Test environment for Vagrant available

There is always room to test different HA technologies in a simulated VM environment. At ICCLab we have created such a DRBD test environment for PostgreSQL databases. This environment is now available on Github.

The test environment installation uses Vagrant as the tool to install VMs, Virtualbox as the VM runtime environment and Puppet as the VM configurator. It includes a Vagrant installation script (usually called a “Vagrantfile”) which sets up two virtual machines running a clustered, highly available PostgreSQL database.

In order to use the environment, you have to download it and then run the Vagrant installation script. The Vagrant installation script of the test environment essentially does the following things:

  • It creates two virtual machines with 1 GB RAM, one 80 GB harddrive and an extra 5 GB harddrive (which is used as DRBD device).
  • It creates an SSH tunnel between the two VM nodes which is used for DRBD synchronization.
  • It installs, configures and runs the DRBD device on both machines.
  • It installs, configures and runs Corosync and Pacemaker on both machines.
  • It creates a distributed PostgreSQL database which runs on the DRBD device and which is managed by the Corosync/Pacemaker software.

This environment can easily be installed and then be used for testing of the DRBD technology. It can be downloaded from the following Github repository:

https://github.com/kobe6661/dependability_test_fw.git

Installation instructions can be found here.

Dependability Modeling on OpenStack: Part 2

In the previous article we defined use cases for an OpenStack implementation according to the usage scenario in which the OpenStack environment is deployed. In this part of the Dependability Modeling article series we will show how these use cases relate to functions and services provided by the OpenStack environment and create a set of dependencies between use cases, functions, services and system components. From this set we will draw the dependency graph and make the impact of component outages computable.

Construct dependency table

The dependency graph can be constructed once we define which functions, services and components enable a use case. In the example below (Fig. 1) we defined the system architecture components, services and functions that make it possible to create, delete or update the details of a Telco Account (the account of a mobile end user). Since these operations are provided within virtual machines, the VM User Management and VM Security Management functions make this use case available. Therefore we draw a column which contains these functions. Because these functions need a User Management, SSH & Password Management service in each VM in order to operate, we draw a second column which contains the required services. A third column lists the system components required to deliver these services.

Fig. 1: Dependency Graph Construction.

The procedure mentioned above is repeated for all use cases. As a result you get a table like the one in (Tab. 1). This dependency table is the starting point for the production of the dependency graph.

Tab. 1: Dependencies between Use Cases, Services, Functions and Components.

Construct dependency graph

For each component that is listed in the table you have to model the corresponding services, functions and use cases. This is done as in the example in (Fig. 2). We start from the right of the graph with the Ceilometer component and the VM plugin and look at which services are provided by those components: e. g. the “Ceilometer Monitoring” service. Therefore we draw an icon that represents this service and draw arrows from the Ceilometer and VM plugin components to the service icon (1). In the next step we look at which function is provided by the Ceilometer Monitoring service. This is the “Monitoring of VM” function. Therefore we add an icon for the function and draw an arrow to this function (2). Then we look for the use cases provided by the Monitoring of VM function. Since this is e. g. “Measure SLAs”, we add an icon for this use case and draw another arrow to “Measure SLAs” (3). The first path between a use case and the components on which it depends is now drawn. This procedure is repeated for all components in (Tab. 1).

Fig. 2: Dependency Graph Construction from Dependency Table.

The result is the dependency graph shown below (Fig. 3).

Fig. 3: Dependency Graph of OpenStack Environment.

Add weight factors to use cases

Once the dependency graph is constructed, we can calculate the “impact” of component outages. When a component fails, you can simply follow the arrows in the dependency graph to see which user interactions (use cases) stop being available for end users. If e. g. the Ceilometer component fails, you would not be able to measure SLAs, meter usage of Telco services or monitor the VM infrastructure.
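
Following the arrows is a plain graph traversal. The sketch below encodes only the single path described earlier (Ceilometer and VM plugin, then “Ceilometer Monitoring”, then “Monitoring of VM”, then “Measure SLAs”); the dictionary layout and the function name are assumptions, and a full graph would contain every row of the dependency table.

```python
# Arrows point from a node to the nodes that depend on it.
GRAPH = {
    "Ceilometer": ["Ceilometer Monitoring"],
    "VM plugin": ["Ceilometer Monitoring"],
    "Ceilometer Monitoring": ["Monitoring of VM"],
    "Monitoring of VM": ["Measure SLAs"],
}
USE_CASES = {"Measure SLAs"}


def affected_use_cases(graph, use_cases, failed_component):
    """Follow all arrows from the failed component and collect the use cases reached."""
    seen = set()
    stack = [failed_component]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return sorted(seen & use_cases)


print(affected_use_cases(GRAPH, USE_CASES, "Ceilometer"))
```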

However, it would be too simplistic to treat each use case as equally important to the end user. Some user interactions, such as the creation of new VM nodes, need not be available all the time (or at least it depends on the OLAs of the Telco). Other actions, such as Telco authentication, must be available all the time. Therefore, we have to add weight factors to use cases. This can be done by adding another column to the dependency table and naming it “Weight factor”. The weight factor should be a score measuring the “importance” of a user interaction in terms of business need. In a production OpenStack environment, financial values (which correspond to the business value of the user interaction) could be assigned as weight factors to each use case. For reasons of simplicity we take the ordinal values 1, 2 and 3 as weight factors (whereby 1 signifies the least important user transaction and 3 the most important user transaction). For each use case row in the dependency table we add the corresponding weight factor (Fig. 4).

Fig. 4: Assignment of weight factors.

As a next step, we create a pivot table containing the components and use cases as consecutive row fields and the weight factors as data field. In order to avoid duplicate counts (of use cases) we use the maximum function instead of the sum function. As a result we get the pivot table in (Tab. 2).

Tab. 2: Pivot Table of Component/Use Case dependencies.

Calculate outage impacts

Calculating the impact of system component outages is now quite straightforward. Just look at the pivot table and sum the weight factors of each component. As a result we have a table of failure impact sizes (Tab. 3).

Tab. 3: OpenStack Components and Failure Impact Sizes.
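
The two steps (pivot with the maximum function, then the sum per component) can be reproduced in a few lines of Python. The rows and weights below are made-up examples and do not come from the tables in this article; the function name is an assumption as well.

```python
from collections import defaultdict


def failure_impact_sizes(rows):
    """Compute the failure impact size of each component.

    rows -- iterable of (component, use_case, weight) tuples, one per row of
            the dependency table; a use case may appear several times for the
            same component, e. g. via different services.
    """
    pivot = defaultdict(dict)
    for component, use_case, weight in rows:
        # Maximum instead of sum, so a use case is counted once per component.
        current = pivot[component].get(use_case, 0)
        pivot[component][use_case] = max(current, weight)
    # The impact size is the sum of the weights of all affected use cases.
    return {component: sum(cases.values()) for component, cases in pivot.items()}


# Hypothetical dependency table rows.
rows = [
    ("Ceilometer", "Measure SLAs", 2),
    ("Ceilometer", "Measure SLAs", 2),  # same use case via a second service
    ("Ceilometer", "Meter usage", 3),
    ("Keystone", "Telco authentication", 3),
]
print(failure_impact_sizes(rows))
```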

This table reveals which components are very important for the overall reliability of the OpenStack environment and which are not. It is an operationalization of the measurement of “failure impact” for a given IT environment (failure impacts can be measured as a number). The advantage of this approach is that we can build a test framework for OpenStack availability based on the failure impact sizes.

Most obviously, components with strong support functionality such as MySQL or the Keystone component have high failure impact sizes and should be strongly protected against outages. VM-internal components seem to be less important because VMs can be easily cloned and recovered in a cloud environment.

In a further article we will show how availability can be tested with the given failure impact size values on a given OpenStack architecture.

How to Test your OpenStack Deployment?

Like us in the ICCLab, you have likely spent lots of time researching the best means to deploy OpenStack and you’ve decided upon a particular method (at the ICCLab we use foreman and puppet). You’ve implemented OpenStack with your chosen deployment plan and technologies and you now have an operational OpenStack cluster. The question you now have to ask is:

“How do I test that all functionality is operating correctly?”

You could certainly take the time to write a suite of tests using the various OpenStack Python clients and maintain those. However, there is an OpenStack project already available that can save you a lot of time. OpenStack Tempest is a project and suite that comprises a set of integration tests. Tempest is used to validate the OpenStack code base through its integration with Jenkins (continuous integration server). Tempest makes calls against OpenStack service API endpoints and uses the Python unittest2 and nosetest frameworks at its core.

If you wish to experiment with Tempest locally, try it out with devstack, which automatically configures Tempest for use with it. To ease things, simply use vagrant-devstack (README here) and do the following:

  1. Install VirtualBox
  2. Install vagrant
  3. git clone https://github.com/dizz/vagrant-devstack.git
  4. vagrant up
  5. vagrant ssh
  6. cd /opt/stack/tempest
  7. ./run_tests.sh

You will now see a large number of tests being run against your devstack installation. It will take time! If you wish to integrate Tempest with your Jenkins CI server, see the information on devstack gate. There is also a Tempest Jenkins plugin. Finally, if you wish to run Tempest against a “real” installation of OpenStack, you will need to edit the Tempest configuration file (etc/tempest.conf) and change the relevant information (more here).