On November 4th 2015 Konstantin Benz, researcher at ICCLab, presented an adaptive cloud application in the “Complex Adaptive Systems” conference in San Jose, California. “Complex Adaptive Systems” is a conference organized by Missouri University of Science and Technology (MST) which takes place every year and which includes topics like machine learning, data analytics and smart system architectures. Conference proceedings are published in the Procedia Computer Science journal by Elsevier.
The 5th Conference about “Complex Adaptive Systems” is dedicated to technologies that provide solutions to complex problems we face in everyday life. Complexity is everywhere. A complex system may be the traffic system in California which produces unforeseen traffic jams. Another complex system may be the power grid that delivers electric power to every household every day without any interruption. Or a complex system may be just the order of your favorite cereals that land in your bowl for breakfast. Complex systems are more than just systems which are a little bit complicated to observe. Continue reading →
“The autoscaling cloud monitoring system that requires no manual reconfiguration”
“Nagios OS autoconfigurator” (N_O_conf) is a cloud monitoring system that automatically adapts its monitoring behavior to the current user-initiated VM infrastructure. N_O_conf works by installing a cloud environment change listener daemon which is repeatedly polling the OpenStack API for changes in the VM infrastructure. As soon as a VM destruction is detected, it initiates a reconfiguration of the Nagios monitoring server. Nagios OS autoconfigurator can be installed on top of every OpenStack-based cloud environment without interfering with the cloud providers infrastructure, because it can be installed inside virtual machines so cloud consumers can use it as their own monitoring system. N_O_conf monitoring system monitors all VMs that are owned by the user that installed it.
“Measure and benchmark reliability of your OpenStack virtual machines.”
“VM Reliability Tester” is a software that tests performance and reliability of virtual machines that are hosted in an OpenStack cloud platform. It evaluates the failure rate of VMs by performing a stress test on them. VM Reliability Tester installs OpenStack virtual machines, uploads a test program to them, runs this test program remotely and then captures program execution times to determine reliability of the virtual machines. If the test program takes a significant amount of time to complete, this is considered to result in a VM failure. Such deviations in execution time are an important benchmark for testing performance and reliability of your OpenStack environment.
Why VM Reliability Tester?
Cloud computing (or to be more precise: virtualization) is providing virtual resources instead of physical ones. The performance of virtual resources is hidden from the user, because virtual resources are abstracting form the physical hardware layer. As a system administrator you still might want to know how your virtual machines react under heavy load and you want true performance measurements – instead of promises by your cloud vendor. Therefore it might be an advantage to test the reaction of virtual machines that you have created in your OpenStack cloud and measuring the VM performance before creating a productive infrastructure and deploy productive applications on them. VM Reliability Tester delivers you estimates on how your VM performs when it is running applications. With the data produced by VM Reliability Tester you will be able to:
Check if your VM is performing well enough to serve your performance requirements.
Benchmark VM images in terms of application performance.
Benchmark OpenStack platforms from different vendors.
Acquire data that helps you to shape SLAs and underpinning contracts.
How does it work?
VM Reliability Tester uses a “master” VM which serves to create test VMs and upload test programs to them. The master VM first configures the test VMs and then runs the uploaded test programs. Test program runs are repeated in a (configurable) batch of several program runs. The test programs executes for the configured number of times on the test VMs and logs execution time of each test program run. After a batch of test program runs has finished, the master VM captures the logged execution times and calculates the mean and standard deviation of execution times in the batch. If a test program run took longer than the batch mean plus 3 standard deviations, it is considered as a failure and logged by the master VM in a file called “f_rates.csv”.
Based on the numbers of batches and test program runs per batch as well as the number of failures, VM Reliability Tester computes a failure rate sample. This sample is then used to predict failure rate estimates in productive VM infrastructures.
Setup and Installation
Prerequisites for installation of VM Reliability Tester are:
You must have valid OpenStack authentication credentials and provide them in the setup file “openrc.py”.
You have to provide a private/public keypair for authentication with the VMs that you own. Local path to your public and private key file must be added to a “config.ini” and “remote_config.ini” file.
You must own a PC or labtop and have Python and some Python libraries installed on top of it.
Installation of the tool is done easily by cloning the Github repository and changing the contents of the files openrc.py, config.ini and remote_config.ini. Once you have cloned VM Reliability Tester repository and performed the configuration file changes, you must only run vm-reliability-tester.py. The script will create some csv files that contain failure rates of the VMs and the possible distributions of the failure rate.
VM Reliability Tester is available on the following Github-Page:
There are many different technologies which can increase availability of a cloud infrastructure. In our newest Techcouting paper we evaluate several HA technologies in order to define a HA architecture for an OpenStack deployment which is part of the XiFi project. HA technologies can be grouped in the following classes:
Resource monitors that check if IT-services are alive and (sometimes automatically) recover them in case of failure.
Load balancers that direct end user requests to those resources that are still alive and show reasonable prformance.
Distributed disks and file systems that increase redundancy of data and help to prevent data loss in case of failure.
Distributed databases which help to prevent loss of database records.
Every OpenStack component has the purpose to deliver a service to an end user. Availability of a cloud instance is dependent on the availability of the delivered end users services as perceived by end users. If we want to use a HA technology to increase availability of OpenStack we have to analyze dependencies of end user services on IT and infrastructure components. Therefore we created a dependability model of the provided IT services and the business services consumed by end users.
As availability always depends on the requirements that are defined by end users we asked several OpenStack end users in a survey on the importance of each business service. The result is that end users tended to rate “Infrastructure Management” and “Security Management” as the most important services. Therefore we had to ensure that these services have high availability levels.
By linking the importance of the service to the IT components that provide it, we can assign a target availability level to each component. Furthermore we can compare several HA architectures to each other and check the availability levels they can achieve. We built several fault tree diagrams that represent the link of component failures to service outages:
A simulation of service outages by given inputs of failure rates revealed that adding HA technologies to OpenStack can add up to 7-8 percent points to the average availability level of the provided services.
We tested several technologies that belong to one of the HA technology classes. Our evaluation included chances and risks associated with implementing the technology and technological maturity. We assigned each technology a chances, risks and maturity score.
The result of our evaluation is that we prefer to use keepalived, HAProxy, Ceph/RADOS and MySQL Galera as HA technologies to improve availability of our OpenStack installation. These technologies are all open-source. They have been preferred because their performance is not significantly lower than the performance of commercial products, but they are available for free, while commercial products are not. The final HA architecture is able to increase availability levels of all OpenStack services up to three nines – which is a very high availability level in cloud computing.
It is clear that another organization would come to other conclusions when the concrete implementation of a HA technology has to be selected, but the evaluation methodology used in our paper shows how to make more reasonable technology choice decisions by linking end user requirements with system architecture characteristics and rate several architectural alternatives by the availability levels that are reasonably achievable.
After having completed part 1 of our series about reliability analysis, we now start with our first reliability measurement experiment. According to reliabili theory there are three things we could measure: survival probability, hazard rate and failure rate. The last one is the easiest one in practice. Therefore we design an experiment to measure the failure rate of OpenStack VMs under heavy load.
Failure rates can be constant, ascending or declining over time. In order to measure the general tendency of a failure rate we have to perform a time series analysis. We start up several OpenStack VMs, put them under stress by running a certain task on them and then count how many of the VMs are still alive after a certain amount of time. The stress task is performed several times on the same VMs and the number of machines that are still alive is counted repeatedly in order to get a time series of failure rates.
Cloud computing service is comparable to public utility services like gas, telephone or water supply.
Economical value of cloud computing service is determined by reliability, availability and maintainability (RAM) characteristics.
Availability impacts the value of cloud computing as it is perceived by end users. High Availability systems increase guaranteed availability of a cloud computing service. Therefore they increase the economical value of a cloud computing service.
Cloud HA initiative has the objectives:
To provide a service to analyze problems related with reliability and availability of cloud computing systems
To provide systems and services that increase reliability and availability of cloud computing systems
The following challenges exist currently:
Measuring and analyzing availability: how can we experimentally determine reliability of cloud computing systems (VMs, storage etc.)? Design of adequate reliability measurement experiments is difficult, since we often have to rely on simulation of an outage.
Adapt reliability engineering methods to cloud computing: many reliability analysis and engineering techniques do exist (Fault Tree Analysis, FME(C)A, HAZOP, Markov Chains). How can we apply them to the area of cloud computing?
Analytic and monitoring systems: build systems that automatically monitor reliability of cloud resources and analyze problems.
Failure recovery and intelligent event management systems: build systems that intelligently detect and react to failures.
Currently there is almost no data available on reliability of different virtualization technologies like OpenStack or Docker.
Cloud vendors and manufacturers simply claim that their systems operate reliably without providing data to prove their claims. Think about an engineering company (like e. g. ABB or Siemens). Would they still be on the market if they were not able to tell their customers the exact hazard rates and MTBFs of their products? The IT industry is lagging behind other engineering industries. IT reliability engineering could be an interesting discipline that adds value to IT products and services.
Relevance to current and future markets
Existing High Availability solutions:
Pacemaker: resource monitor that automatically detects failures and recovers failed components. Highly configurable, but also heavyweight. System administrators notoriously complain about its bad configuration interface. A bad configuration can make the system 7-8 times slower than a good configuration.
Keepalived: lightweight resource monitor. Unclear if this tool is well supported by its community.
IBM Tivoli: extremely heavyweight resource monitor and configuration management tool.
HAProxy: light load balancer. Great for web applications, but only applicable to HTTP-based services.
DRBD: disk replication technology. Fast and lightweight. Suitable for small disk networks.
Ceph: distributed storage and file system. Highly decentralized and great scalability.
GlusterFS: distributed storage and file system. Better scalability, but sometimes problem with partition tolerance.
Galera: MySQL cluster. True multimaster solution.
MySQL NDB Cluster: maps MySQL to simple key,value store. Requires adaption of applications to database interface.
Nagios: great monitoring system. Extendability and many plugins available.
How reliable are your OpenStack VMs? How many outages do you expect to occur during 8 months of operation? Do your VMs crash regularily, randomly or do VM outages increase over time? These questions can only be answered if we perform a reliability analysis of the virtual machines that we manage. In this small guide we show you how to check reliability of VMs in your OpenStack environment. In part 1 of this 4 part series we explain the basic concepts of reliability engineering.
The vast field of reliability engineering has been used widely in various engineering disciplines like aircraft design, civil engineering, electricity management or product management. Though reliability engineering has proven to help in successfully building high quality engineering products, it has almost never been used in cloud computing so far. There might be some distrust among programmers in these scientifically proven reliability analysis methods, since they involve math and statistical exploration. But with a little introduction this is not a severe problem that we should worry about.
Reliability engineering simply deals with analyzing and measuring the outage behavior of engineered systems, trying out and testing system improvements that make the system more reliable, implementing system improvements and validating if the system improvements have reduced the occurence of outages or not. The first step is the analysis of outage behavior. How can outages be analyzed?
There are many tools available which can be used to monitor operation of the Opentack infrastructure, but as OpenStack user you might not be interested in monitoring OpenStack itself. Your primary interest should be the operation of the VMs that are hosted on OpenStack. Nagios OpenStack Installer is a tool for exactly that purpose: it uses a Nagios VM inside the OpenStack environment and configures it to monitor all VMs that you own.
Nagios OpenStack Installer configures your OpenStack monitoring environment remotely from your desktop PC or labtop. In order to use Nagios OpenStack Installer you need to fulfil the following prerequisites.
You must have an SSH Key for securely accessing the Nagios VM and the VMs you own and you must know the SSH credentials to access the VMs.
You must know your OpenStack user account (name and id), your OpenStack password, the OpenStack Keystone authentication URL and the OpenStack tenant (“project”) (name and id) you work with.
You must be able to create a VM that serves as Nagios VM and you must own a publicly available IP (“floating IP”) to make the Nagios dashboard accessible to the outside world.
Nagios OpenStack Installer is a Python tool and requires some Python packages. Make sure to install Python 2.7 on your desktop. Additionally you need the following packages:
pip: The package manager to install Python packages from the PyPI repository (Windows users should refer to the pip developer’s “get pip” manual to install pip, Cygwin users are recommended to follow these guidelines in atbrox blog).
fabric: This package is used to access OpenStack VMs via SSH and remotely execute tasks on the VMs.
python-keystoneclient: To access the OpenStack Keystone API and authenticate to your OpenStack environment.
python-novaclient: To manage VMs which are hosted on OpenStack.
cuisine: This is a configuration management tool and lightweight alternative to configuration managers like Puppet or Chef. cuisine is required to manage the packages and configuration files on the Nagios VM and the monitored VMs.
pickle: pickle is a object serialization tool that can store objects and their current state in a file dump. Object serilaization is used to get the list of VMs which should be monitored.
We recommend to use pip for installation of the required packages, since pip automatically installs package dependencies.
You must have Git downloaded and installed.
After having installed the prerequisites on your local PC or labtop, you can use Nagios OpenStack Installer by performing the following steps.
Create a new directory and clone the Nagios OpenStack Installer Github repository in it.git clone https://github.com/icclab/kobe6661-nagios-openstack-installer.git
Edit the credentials in install_autoconfig.py, remote.py, remote_server_config.py and vm_list_extractor.py to match your OpenStack and SSH credentials.
Run remote_server_config.py from Python console. This installs and configures Nagios server on your Nagios VM. After installation you should be able to access the Nagios Dashboard by pointing your webbrowser to “http://<your_nagios_public_ip>/nagios” and providing your Nagios login credentials.
Run vm_list_extractor.py from Python console. This will extract the list of VMs on OpenStack that should be monitored and save the list as pickle file dump on your computer.
Run install_autoconfig.py from Python console. This will upload the Python scripts required to automatically update the Nagios configuration in case of changes in the OpenStack VM environment (nagios_config_updater.py, config_transporter.py, config_generator.py, vm_list_extractor.py). Additionally it will run these Python scripts on the Nagios VM to let Nagios capture the VMs which should be monitored, install and run the required Nagios and NRPE plugins on these VMs and reconfigure and restart Nagios server to monitor these VMs remotely.
Now the Nagios environment is installed and you should be able to monitor your VMs. Nagios OpenStack Installer is available on ICCLab’s Github repository. Feel free to try it out and give feedback about future improvements.
ICCLab Cloud HA initiative Leader Konstantin Benz explains the OpenStack Nagios integration to the interested audience.
The icclab participated on the Nagios World Conference 2014 which took place Oct 13th-16th, 2014 in St. Paul, MN, USA. Icclab’s Cloud High Availability-initiative leader Konstantin Benz presented an approach on how to use Nagios Core to monitor utilization of OpenStack resources. The key point he mentioned was that Nagios has to be reconfigured elastically in order to monitor virtual machines in an OpenStack environment. Depending on implementation requirements, it can be useful to exploit configuration management tools like Puppet or Chef to automatically reconfigure the Nagios server as soon as new VMs are commissioned or decommissioned by cloud users. Another approach could be to exploit OpenStack’s Ceilometer component though an integration of Nagios with Ceilometer could lead to data duplication which can be problematic for some systems, said Benz. Besides the Nagios-Ceilometer plugin Benz was able to show how elastic Nagios reconfiguration could work with Python fabric and the Cuisine library. This approach seems to be a lightweight solution to monitor VM utilization in OpenStack with Nagios. Benz also discussed a similar approach which has been chosen in the XIFI-project. The eXtensible Infrastructures for Future Internet cloud project uses Nagios as main monitoring tool to monitor OpenStack instances and resources provided by OpenStack.
Nagios Founder Ethan Galstad presents Nagios Log Server to the audience.
A highlight of the Nagios conference was a demo presentation of Nagios Log Server which was announced by Nagios Founder Ethan Galstad. Nagios Log Server allows for scalable and fast querying of log files – fully replacing “ELK”-Stack (ElasticSearch, LogStash, Kibana) solutions. Nagios Log Server is available under a perpetual licence that costs $995. Compared to commercial solutions this is a very modest price. In contrast to ELK-Stack solutions, Nagios Log Server offers user authentication to protect sensitive data in logfiles to be viewable by unauthorized website visitors. Another advantage are customizable visual dashboards that show log file findings. Visualization makes the task of reporting incidents to higher management a lot easier and allows for better monitoring.
We are conducting research in order to find out which features of the XIFI platform are most important to end users. The results will be used in order to improve the platform. If you are an application developer interested in XIFI, please feel free to participate in the survey which can be found following this link: