Specification of data to be collected in Dependability Modeling

In part 3 of our article series “Dependability Modeling on OpenStack” we discussed that we should run Chaos Monkey tests on an OpenStack HA installation and then collect data about the impact of the attack. While we did say that we want to collect data about the implemented OpenStack HA architecture, we were not specific about which data we should actually collect. This article gives some hints on what is important when collecting data about HA system architectures.

What should be measured?

A very interesting question is what should be measured during a Chaos Monkey test run. The Dependability Modeling Framework is used to measure the capability of a system architecture to keep the impact of system outages low. Therefore we should measure the impact of outages. The impact is a score which is derived from the dependability graph and should be recorded as the result of each test run.

What is analysed in Dependability Modeling?

In Dependability Modeling we are interested in correlations between the system architecture and the outage impact. The system architecture data is mainly categorical data (replication technology used, clustering technology etc.) and the impact is a number. All variables that describe the system architecture are meant to be “explanatory” or “independent” variables, i. e. variables that can be chosen freely in the simulation, while the impact of outages is the “explained” (or “dependent”) variable, because the impact is assumed to be the result of the chosen architecture. In order to find significant correlations between system architecture properties and impact, we must collect values for all explanatory variables and then use a dimensionality reduction method to find which properties are interesting.
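As an illustrative sketch of this analysis, one could one-hot encode the categorical architecture properties and rank them by correlation with the impact score. The data values below are invented, and the simple correlation ranking stands in for a full dimensionality reduction method:

```python
# Sketch: one-hot encode architecture properties and rank them by
# absolute correlation with the measured outage impact.
# The test-run records are illustrative, not real measurements.

def one_hot(records, key):
    """Return {category: [0/1 per record]} for one categorical field."""
    cats = sorted({r[key] for r in records})
    return {f"{key}={c}": [1 if r[key] == c else 0 for r in records] for c in cats}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

test_runs = [
    {"replication": "drbd", "clustering": "pacemaker", "impact": 12},
    {"replication": "ceph", "clustering": "pacemaker", "impact": 7},
    {"replication": "drbd", "clustering": "none",      "impact": 25},
    {"replication": "ceph", "clustering": "none",      "impact": 18},
]

impacts = [r["impact"] for r in test_runs]
features = {}
for key in ("replication", "clustering"):
    features.update(one_hot(test_runs, key))

# Architecture properties ordered by strength of association with impact
ranking = sorted(features, key=lambda f: -abs(pearson(features[f], impacts)))
```

In this toy data set the clustering technology correlates most strongly with the impact, which is exactly the kind of result the analysis is meant to surface.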

How much data should be collected?

First of all, it is not bad practice to collect “too much” data in a test or a scientific experiment. Classical statistics usually recommends small samples, a convention that dates back to the 19th century, when measurements were expensive and statements about data sets had to be derived from small sample sets. Nowadays we can collect data automatically, so we are not forced to use small sample sets. We can simulate the whole life cycle of a cloud service: e. g. we could assume that an OpenStack service runs for about 8 years, which is 8 x 365 = 2’920 days, and run one Chaos Monkey test for each day. The advantage of this automation is that we do not need to rely on samples.
Of course there is a limitation in terms of computational power: a Chaos Monkey test takes about 0.5-1.5 seconds, so running 2’920 tests can take more than 4’300 seconds, which is over an hour. Therefore you either run the simulation as an overnight batch job or you limit it to a sample size which adequately represents the overall population. To determine the optimal sample size you can use variance estimation together with the standard statistical formula for the calculation of sample sizes.
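The sample-size calculation can be sketched as follows, using the classical formula with finite-population correction. The variance estimate and error margin below are assumed values for illustration:

```python
import math

# Sketch of the classical sample-size formula with finite-population
# correction: n0 = z^2 * s^2 / e^2 and n = n0 / (1 + (n0 - 1) / N).
# The variance estimate and error margin are assumptions.

def sample_size(population, variance, margin_of_error, z=1.96):
    """Sample size for a 95% confidence level (z = 1.96 by default)."""
    n0 = (z ** 2) * variance / (margin_of_error ** 2)
    return math.ceil(n0 / (1 + (n0 - 1) / population))

# 8 simulated years of daily Chaos Monkey tests = 2920 candidate runs;
# assume an impact variance of 25 and a tolerated error of +/- 1 impact point.
n = sample_size(population=2920, variance=25.0, margin_of_error=1.0)
```

Under these assumptions fewer than a hundred test runs suffice, which brings the simulation well under the one-hour mark.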

With that specification, we can proceed in developing our test framework. A further article will show a sample data set.



Future trends and technologies in Mobile and Internet communications @ CFIC 2013

Future trends of Mobile and Internet Communications were revealed at the Conference on Future Internet Communications 2013 in Coimbra, Portugal. The many different speeches and talks showed that Cloud Computing could play a major role in future Mobile Communication networks.

Alexander Sayenko explains future trends in 3GPP standardization.


The first keynote speech of the conference was held by Alexander Sayenko, researcher at Nokia Siemens Networks, where he is responsible for standardization activities of the 3GPP specification. In his keynote speech he presented the new HetNet multicarrier solution for the enhancement of current mobile communication traffic. While mobile communication traffic is expected to grow exponentially over the next decade, very diverse requirements concerning the reliability and mobility of IT services pose a major challenge to the telecommunication industry. In order to handle the growing mobile traffic, the capacity of current mobile networks should be enhanced by widening the available radio spectrum, improving spectral efficiency and offloading communication data to clusters of mobile base stations. HetNet offers a solution to enhance the radio spectrum and use it more efficiently by meshing multiple heterogeneous access networks. The future trend in mobile communication goes towards managing heterogeneous network infrastructures, since new standards like LTE and HSPA+ are still not used broadly and will not replace older technologies as fast as mobile end users expect. While the number of mobile devices and applications grows rapidly, changes in the infrastructure of mobile communication providers happen much more slowly. New standards in mobile communications are a necessity in order to avoid a situation where the network infrastructure becomes a bottleneck for the mobile communication market.

Bottleneck: low efficiency of current access networks

The message is clear: mobile networks should be used more efficiently. An efficiency gain could be provided by the use of Cloud Computing in mobile networks. Andreas Kassler, researcher at Karlstad University in Sweden, presented CloudMAC, a new approach which allows location-independent routing of mobile devices in wireless networks without introducing additional routing protocol overhead like e. g. in the Mobile IP protocol. The solution is to outsource the routing logic from Wireless Termination Endpoints into a virtualized infrastructure like e. g. an OpenStack cloud. Such an approach shows that Cloud Computing could become very important for the development of more efficient mobile networks. Therefore projects like e. g. the Mobile Cloud Network at ICCLab can make mobile communication ready for the challenges of the next decade.

ICCLab: enhance Quality of Cloud Services

The ICCLab also had the chance to present the benefits of Cloud Services for future Internet communications. Konstantin Benz, researcher at ICCLab, presented different technologies for OpenStack which should enable High Availability. He also showed how the Chaos Monkey tool could be turned into a test framework which can assess the HA readiness of OpenStack architectures. The ongoing research on Cloud Automation, Cloud Dependability, Cloud Interoperability, Cloud Monitoring and Cloud Performance at ICCLab improves the overall quality of Cloud Computing as a service.

Rating, Charging, Billing


Financial accounting is a very critical process in the monetization of any service. In the telecommunication world, these processes have long been documented, used, and standardized. Cloud computing, being a relatively new paradigm, is still undergoing a transition phase. Many new services are being defined and there is still a huge untapped potential to be exploited.

Rating, Charging, and Billing (RCB) are key activities that allow a service provider to fix monetary values for the resources and services it offers, and to bill the customers consuming those services.

Problem Statement

Given a general service scenario, how can the key metrics be identified? The identification of measurable metrics is essential for determining a useful pricing function to attach to each metric. The challenges we are trying to address under this initiative are multi-dimensional. Is it possible to come up with a general enough RCB model that can address the needs of multiple cloud services – IaaS, PaaS, SaaS, and many more that will be defined in the future?

Where is the correct boundary between a real-time charging strategy, which could be very resource intensive, and a periodic strategy, which carries the risk of over-utilization of resources by the consumers between two cycles? Can a viable middle-path strategy be established for cloud-based services? Can a pre-paid pricing model be adapted for the cloud?
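As an illustration of the periodic strategy, a minimal rating and billing sketch could look like this. The metric names and unit prices are invented for the example and are not those of any real RCB model:

```python
# Illustrative rating function: map metered usage records to a charge.
# Metric names and unit prices are invented, not real CYCLOPS values.

PRICE_PER_UNIT = {
    "cpu_hours": 0.05,         # currency units per CPU hour
    "gb_storage_days": 0.002,  # per GB stored per day
    "gb_network_out": 0.01,    # per GB of outbound traffic
}

def rate(usage_record):
    """Return the charge for one usage record {metric: quantity}."""
    return sum(PRICE_PER_UNIT[m] * q for m, q in usage_record.items())

def bill(usage_records):
    """Aggregate the charges of a whole billing period."""
    return round(sum(rate(r) for r in usage_records), 2)

invoice = bill([
    {"cpu_hours": 240, "gb_storage_days": 300},
    {"cpu_hours": 120, "gb_network_out": 50},
])
```

A real-time strategy would call `rate` on every incoming record instead of batching them, which is exactly the resource trade-off discussed above.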

Simplified workflow



  • User Data Records: https://github.com/icclab/cyclops-udr
  • Rating & Charging: https://github.com/icclab/cyclops-rc


  • rule engine and pricing strategies
  • prediction engine and alarming
  • revenue sharing and SLAs
  • usage collectors
  • scalability


  • vBrownBag Talk, OpenStack Summit, Paris, 2014

  • Swiss Open Cloud Day, Bern, 2014

  • CYCLOPS Demo


  • OpenStack Meetup, Winterthur, 2014


  • icclab-rcb-cyclops[at]dornbirn[dot]zhaw[dot]ch


ICCLab Colloquium: Byte-Code

Many thanks to Davide Panelli and Raffaele Cigni (Solutions Architects and co-founders) from Byte-Code for their visit and talk about a scalable e-commerce platform.

Byte-Code is a SME based in Milan, Italy, providing IT consulting services to customers around the world. The company has a strong focus on Open Source solutions and strategic partnerships with leaders of different markets.

The presentation (slides) introduced the company as well as the challenges of bringing Cloud Computing into the Enterprise world. Davide then went on to introduce a novel e-commerce platform capable of scaling dynamically thanks to features offered by technologies such as MongoDB and Amazon Web Services.

About Davide & Raffaele

Davide Panelli is a Solutions Architect and Scrum Master at Byte-Code. He’s responsible for creating Enterprise Architecture based on open source products.

Raffaele Cigni is a Solutions Architect and Groovy Specialist at Byte-Code. He worked for many years on mission-critical J2EE/JEE projects, constantly researching better technologies and methodologies to develop enterprise-class software; for those reasons he started working with Groovy and Grails. Recently he has focused on developing data processing systems based on DSLs created with Groovy.



Solidna is a project that is funded by the Commission for Technology and Innovation. Solidna will develop a core strategic cloud-based storage product and service area for a major Infrastructure as a Service provider (CloudSigma). The three key innovations that will be developed in Solidna are:

1. Upgraded Compute Storage Performance: this will focus on stability and dependability, guaranteeing a minimum performance level for critical systems. On many IaaS platforms today, customers directly affect each other’s performance; this is the most critical problem for a public cloud provider to solve. By storing a virtual drive in a public cloud across hundreds of physical drives, that performance limitation can be reduced significantly.

Solidna will develop the means to deliver a cloud storage solution with a high level of stability and dependability and to guarantee a minimum performance level for critical systems. This innovation will be delivered through the following technical innovations:

  • Mechanisms to guarantee a minimum expected performance
  • Reliable clients that ensure the data is read/written consistently
  • Definition of specific performance critical system metrics and reporting of those metrics
  • Optimisation of the system based on system metrics (e.g. variable block sizing based on data stored)
  • Data segmentation optimisations including block-size optimisation and distributed striping
  • On-demand performance guarantees which can grow as requested by the user

2. Advanced Storage Management Functionality will be another focus in Solidna. This will enable a number of abilities, including the creation of live snapshots, the backup of virtual drives and the geo-replication of a drive to one or more additional locations. The project will deliver the same rich feature set as a high-end commercial SAN product, but using standard low-cost commodity hardware and a new, upgraded software storage system. These new features will form the basis of new revenue streams. Key features include:

  • The ability to create live snapshots and backups of virtual drives. This allows data from drives to be backed up and kept as separate copies, which is important for data resilience and security reasons,
  • The capability to geo-replicate a drive to one or more additional locations.

This innovation will be delivered through the following technical innovations:

  • System agents to watch for and discover failed or potentially failing system nodes
  • Mechanisms and algorithms for deregistration, recreation and associated redistribution and rebalancing of the storage nodes
  • Active reliability automated testing of the cloud storage service
  • Logically centralised control centre for the entire system
  • Storage system with the ability to rebalance the storage nodes
  • Expansion of the ICCLab framework to accommodate the DFS
  • Functionality of policy-defined geo-replication
  • Functionality of volume migration

3. Object-based Storage Environment: massive-capacity cloud storage and multi-modal API access to reliable storage. The scalability of the storage offered to customers is currently limited by the maximum drive size of 2 TB per drive. Although a server can mount multiple drives to form a larger storage volume, the practical maximum size per server can be estimated at around 20-30 TB, and even with multiple drives this becomes difficult to manage. As well as the usual API interface allowing access to virtual drives, the proposed work aims to expose directories of files in the object storage as network mount points to the compute cloud. In effect this gives customers two access points to their storage, based on usage needs: a network drive API and an object storage API interface. This innovation will be delivered through the following technical innovations:

  • Accessing stored data using POSIX and HTTP from within the VM with implementation of file system drivers and HTTP API
  • Review of existing storage APIs and recommendation

ICCLab will host the 5th European Future Internet Summit in 2014

The 4th European Future Internet Summit is coming up and features a talk by Thomas M. Bohnert on Cloud Computing and the Future Internet. This is now the third invited talk by our lab in this event series.

We are therefore particularly proud to have the honor of hosting the 5th European Future Internet Summit on 5 and 6 June 2014 in Winterthur.

ICCLab Presents OCCI @ Future Internet Assembly

The ICCLab presented the latest developments (PDF) in the Open Cloud Computing Interface at the Future Internet Assembly in Dublin. The session was organised by Cloud4SOA and the main theme was MultiCloud. In this regard, OCCI figures in many projects, including EGI FedCloud, CompatibleOne and BonFire. The presentation also outlined some future points of work that will be carried out in Mobile Cloud Networking, which took the audience’s interest.

OpenStack Grizzly installation for the lazy

As a kind of advertisement for the new OpenStack Grizzly release we have created an automated single-node OpenStack Grizzly installation which uses Vagrant and Puppet. The automated installation can be downloaded from Github using the following URL: https://github.com/kobe6661/vagrant_grizzly_install.git

Please feel free to install it on your machine and test the new release.

Dependability Modeling on OpenStack: Part 3

In this part of the Dependability Modeling article series we explain how a test framework for an OpenStack architecture can be established. The test procedure has 4 steps: in the first step, we implement the OpenStack environment following the planned system architecture. In the second step we calculate the probabilities of component outages during a given timeframe (e. g. 1 year). Then we start a Chaos Monkey script which “attacks” (randomly disables) the components of the system environment, using the calculated probabilities as a base for the attack. As a last step we measure the impact of the Chaos Monkey attack according to the table of failure impact sizes we created in part 2. The impact of the attack should be stored as a dataset in a database. Steps 1-4 form one test run. Multiple test runs can be performed on multiple architectures to create empirical data which allows us to rate the different OpenStack architectures according to their availability.

Step 1: Implement system architecture

Implementation of an OpenStack architecture is quite straightforward using the Vagrant-Devstack installation. Each OpenStack node can be set up as a Vagrant-Devstack system. First install VirtualBox, then Vagrant and then Vagrant-Devstack. Configure Devstack to support a multi-node environment. As a next step you should create an SSH tunnel between the different nodes using Vagrant. Once the different VM nodes are ready, you can start to test the architecture. (Fig. 1) shows a typical OpenStack architecture for a single OpenStack node.

Fig. 1: Typical OS architecture for a single OpenStack node.


High availability is usually only possible in a multi-node environment, because redundant nodes are needed in case of node failures and consequent failovers. Therefore your architecture must be distributed or clustered over several redundant nodes. An example of such an architecture is shown in (Fig. 2). Once the architecture is defined, you implement it using Vagrant, Puppet and Devstack.

Fig. 2: Sample 2-node architecture using DRBD, Corosync and Pacemaker.


Step 2: Calculate outage probability

Availability is usually measured during a given time period (e. g. one year). It is the fraction of uptime divided by total time. If we want to calculate the risk/probability of outages in the observed period, we must know at least two values: the total downtime of a component (which can be derived when the availability is known) and the average recovery time. Both parameters are needed to estimate the number of outages in the observed time period. In (Tab. 1) we have a list of all OpenStack components which are present in one node of the OpenStack installation. Availability is observed for a time period of one year (= 31’536’000 seconds). If we assign each component an availability value and an average recovery time, we can calculate the downtime and the number of outages per year. Because we are interested in the outage risk, we calculate the risk by dividing the number of total outages by the number of days per year. The calculated outage risks can then be used to simulate a typical operational day of the observed OpenStack system.

Tab. 1: Outage risk estimation of OpenStack components.

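The calculation of step 2 can be sketched in a few lines; the availability and recovery values below are illustrative assumptions, not the measured values of Tab. 1:

```python
# Step 2 as code: derive yearly downtime, the expected number of outages
# and the daily outage risk from an availability figure and a mean
# recovery time. The input values are illustrative assumptions.

YEAR_SECONDS = 365 * 24 * 3600  # one year of observation

def outage_risk(availability, recovery_seconds):
    """Return (downtime in seconds, outages per year, daily outage risk)."""
    downtime = (1 - availability) * YEAR_SECONDS
    outages_per_year = downtime / recovery_seconds
    return downtime, outages_per_year, outages_per_year / 365

# e.g. a component with 99.9% availability and a 10-minute mean recovery time
downtime, outages, daily_risk = outage_risk(0.999, 600)
```

With these assumed inputs the component is down roughly 31’536 seconds per year, which yields about 52 outages and a daily outage risk of roughly 14%, exactly the kind of per-day probability the Chaos Monkey script needs.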

Step 3: Run Chaos Monkey attack

Although Chaos Monkey disables devices randomly, a realistic test assumes that outages do not occur completely at random. A Chaos Monkey attack should be executed only with a certain probability, not with certainty. Therefore we must create a script which disables the OpenStack services with the probabilities we defined in (Tab. 1). Such a script could be written in Python, as shown in (Fig. 3). The most important part of the shutdown mechanism is that probabilities should be assignable to the services we want to disable; these probabilities are taken from the values we calculated in (Tab. 1). The other part is that the execution of Chaos Monkey attacks follows a random procedure. This can be achieved with a simple random number generator which generates a number between 0 and 1: if the random number is smaller than the probability, the Chaos Monkey attack is executed (otherwise nothing is performed). This way we can simulate the random occurrence of outages as it would be the case in a real OpenStack installation running in operational mode.

Fig. 3: Excerpt of a Python script which serves to shutdown OpenStack services.

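A minimal, self-contained sketch of such a shutdown mechanism could look as follows. The service names and risk values are invented, and the actual shutdown command (e.g. a `service <name> stop` subprocess call) is passed in as a stub so the probabilistic logic stays visible:

```python
import random

# Sketch of the probabilistic shutdown mechanism described above.
# Service names and daily risks are invented; the real script in Fig. 3
# may differ in detail.

OUTAGE_RISKS = {
    "nova-api": 0.14,
    "glance-api": 0.05,
    "rabbitmq-server": 0.03,
}

def attack(risks, stop_service, rng=random):
    """Disable each service with its assigned daily outage probability."""
    disabled = []
    for service, risk in risks.items():
        if rng.random() < risk:   # uniform random number in [0, 1)
            stop_service(service)  # real run: stop the OS service here
            disabled.append(service)
    return disabled

# Dry run: record which services would be stopped on one simulated day.
killed = attack(OUTAGE_RISKS, stop_service=lambda s: None, rng=random.Random(7))
```

Injecting the `stop_service` callable keeps the same logic usable both for dry runs and for the real attack against a running OpenStack node.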

Step 4: Poll impact of failure

Once the Chaos Monkey attack has been performed, one has to check the impact size of the outage. The failure impact size equals the values in the table of failure impact sizes (Tab. 2), which is derived from the execution of Dependability Modeling (as explained in article 2 of this series). The task at hand is now to poll which user interactions are still available after the Chaos Monkey attack. This can be done by performing the use cases which are affected by an outage of a component. The test tool must be a script which programmatically runs the use cases as tests. If a test fails, the failure impact size is raised according to the weight of the use case. The result of such a test run is a failure impact size after the Chaos Monkey attack.

Tab. 2: Failure impact sizes and use cases affected by component failure.

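The polling logic of step 4 can be sketched like this. The use cases and weights are invented stand-ins for the entries of Tab. 2, and the use-case runner is passed in as a callable so that real API tests can be plugged in:

```python
# Sketch of the step-4 polling script: run every use-case test and add
# its weight to the failure impact size when the test fails. Use cases
# and weights are invented stand-ins for the entries of Tab. 2.

USE_CASE_WEIGHTS = {
    "start_vm": 40,
    "attach_volume": 20,
    "list_images": 10,
}

def failure_impact(run_use_case):
    """Total impact: the sum of the weights of all failed use cases."""
    impact = 0
    for use_case, weight in USE_CASE_WEIGHTS.items():
        if not run_use_case(use_case):  # use case broken by the attack
            impact += weight
    return impact

# Example: after an attack only image listing still works.
impact = failure_impact(lambda uc: uc == "list_images")
```

The returned score is exactly the failure impact size that gets stored in the results database after each test run.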

Cleanup and re-run the test

Test results should be stored in a database. This database should contain failure impact sizes, assumed availabilities and average recovery times as well as information about the system architecture that was used. When a test run has been completed, the effects of the Chaos Monkey attacks have to be reverted in order to re-run the test. With each test run the database fills up and the test results become more reliable.

Further test runs can be started either with the same architectural setup or with a different one: instead of a one-node installation one could use a two-node OpenStack installation, use Ceph and Pacemaker as HA clustering software, or try other technologies. If we perform steps 1-4 repeatedly, we can rate different OpenStack architectures according to their resistance against outages and find out which architecture best fits our High Availability goals.

If the test framework is applied to an OpenStack environment such as the Mobile Cloud Network, High Availability characteristics can be ensured more confidently. Dependability Modeling is a useful recipe for testing OpenStack architectures from an end user’s perspective. The capabilities of the explained method have not been explored in detail yet, but more will follow soon.


DRBD-Test environment for Vagrant available

There is always room to test different HA technologies in a simulated VM environment. At ICCLab we have created such a test environment for DRBD with PostgreSQL databases. This environment is now available on Github.

The test environment uses Vagrant as the tool to install VMs, VirtualBox as the VM runtime environment and Puppet as the VM configurator. It includes a Vagrant installation script (usually called a “Vagrantfile”) which sets up two virtual machines that run a clustered, highly available PostgreSQL database.

In order to use the environment, you have to download it and then run the Vagrant installation script. The Vagrant installation script of the test environment essentially does the following things:

  • It creates two virtual machines with 1 GB RAM, one 80 GB hard drive and an extra 5 GB hard drive (which is used as the DRBD device).
  • It creates an SSH tunnel between the two VM nodes which is used for DRBD synchronization.
  • It installs, configures and runs the DRBD device on both machines.
  • It installs, configures and runs Corosync and Pacemaker on both machines.
  • It creates a distributed PostgreSQL database which runs on the DRBD device and which is managed by the Corosync/Pacemaker software.

This environment can easily be installed and then be used for testing of the DRBD technology. It can be downloaded from the following Github repository:


Installation instructions can be found here.