Tag: monitoring

ElasTest Passes European Commission’s Review Successfully!

On July 18th in Brussels, project partners presented ElasTest results and progress to a panel of three independent experts appointed by the European Commission and to the EC Project Officer. The key project objective is to improve the efficiency of testing large-scale complex software systems. The ElasTest project is coordinated by URJC. ZHAW’s ICCLab is a key project partner delivering research and technology in the area of service delivery, monitoring and billing.

The objective of this review was to evaluate the project’s progress, to show the technical evolution and, of course, to check on the administrative coordination of the first 18 months. To assess the project, the three reviewers analysed all the public and private information related to it.

During the eight-hour evaluation meeting we were able to show the progress made in research, innovation, demos, exploitation plans and sustainability; coordination issues were, of course, also presented. The most challenging part was the demonstration of the software developed by the different project partners: a one-hour session in which all the software artifacts, including the ZHAW work, were successfully demonstrated. All of these efforts were welcomed by the reviewers. Finally, after an initial deliberation, the reviewers communicated their decision to approve the project and congratulated the team on a successful review!

The project is now focused on the second phase: now that the initial platform has been developed, integrated and is up and running, most of our efforts will be dedicated to research and to building a community of users around ElasTest.

For more information on ElasTest, check out our site and code repositories.

A Tool for Understanding OpenStack Cloud Performance using Stacktach and the OpenStack Notification System

In one of our projects, FICORE, the continuation of FIWARE, we need to offer an OpenStack-based service. One aspect of service operations is understanding the performance of the system, and one particular aspect of this is understanding how long basic operations take; it is interesting to see how this evolves over time as, for example, the system gets more and more loaded. To address this, we first looked at an approach based on log files, but it was not workable because the information regarding a single operation is spread across multiple hosts and services. An alternative approach is to use the OpenStack notification system, where many of the key events occurring within the system are published – a single point for all the information we need. We then used Stacktach to consume, filter and store this data and built a web application on top of it. In this blog post we give a brief overview of the OpenStack notification system, the Stacktach filtering tool and the basic web tool we developed.
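To give a feel for how these notifications can be consumed, here is a minimal sketch that listens to the notification topic on RabbitMQ using kombu. It is not our actual tooling: the broker URL, exchange name ('nova') and routing key ('notifications.info') are assumptions that depend on how your services are configured.

# Hedged sketch: consume OpenStack notifications from RabbitMQ with kombu.
# Broker URL, exchange name and routing key are assumptions -- adjust them
# to your deployment.
import json
from kombu import Connection, Exchange, Queue

exchange = Exchange('nova', type='topic', durable=False)
queue = Queue('perf-monitoring', exchange=exchange,
              routing_key='notifications.info')

def on_message(body, message):
    # Each notification carries an event_type and timestamps, which is
    # enough to measure how long basic operations take.
    event = body if isinstance(body, dict) else json.loads(body)
    print(event.get('event_type'), event.get('timestamp'))
    message.ack()

with Connection('amqp://guest:guest@controller:5672//') as conn:
    with conn.Consumer(queue, callbacks=[on_message]):
        while True:
            conn.drain_events()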
Continue reading

N_O_conf: Nagios-based monitoring of OpenStack made easy


“The autoscaling cloud monitoring system that requires no manual reconfiguration”

“Nagios OS autoconfigurator” (N_O_conf) is a cloud monitoring system that automatically adapts its monitoring behavior to the current user-initiated VM infrastructure. N_O_conf works by installing a cloud environment change listener daemon which repeatedly polls the OpenStack API for changes in the VM infrastructure. As soon as a VM destruction is detected, it initiates a reconfiguration of the Nagios monitoring server. Nagios OS autoconfigurator can be installed on top of any OpenStack-based cloud environment without interfering with the cloud provider’s infrastructure, because it can be installed inside virtual machines; cloud consumers can therefore use it as their own monitoring system. The N_O_conf monitoring system monitors all VMs that are owned by the user that installed it.
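The core loop of such a change listener can be sketched roughly as follows. This is not the actual N_O_conf code: the credentials, polling interval and Nagios reload command are placeholders, and the client constructor matches older python-novaclient releases.

# Hedged sketch of a change-listener loop (not the actual N_O_conf code).
# Credentials, polling interval and the Nagios reload command are placeholders.
import time
import subprocess
from novaclient import client as nova_client

nova = nova_client.Client('2', 'demo', 'secret', 'demo',
                          'http://controller:5000/v2.0')

def current_vms():
    return {server.id: server.name for server in nova.servers.list()}

known = current_vms()
while True:
    time.sleep(30)                      # polling interval (assumption)
    now = current_vms()
    if now != known:                    # a VM was created or destroyed
        known = now
        # Regenerate the Nagios host definitions from 'known' here,
        # then reload Nagios so it picks up the new configuration.
        subprocess.call(['service', 'nagios3', 'reload'])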

Continue reading

Full-stack Monitoring for OpenStack

by Josef Spillner

Introduction

The Ceilometer project is a core OpenStack service that provides collection of metering data on managed virtual resources (e.g. compute, storage, networking). Previously it was only possible to collect data from the virtual resources running on an OpenStack deployment. The work presented hereafter addresses this issue and outlines a means to monitor both virtual and physical resources in a common, integrated approach. This work will appear in the upcoming OpenStack release named “Icehouse”.

Continue reading

Ceilometer Performance issues

Update: This does not apply to Icehouse. The flag described below activates an experimental feature and no longer exists in Icehouse (it does exist in Havana, however).

There have been some criticisms of the implementation of Ceilometer (or Telemetry as of Icehouse) – however, it’s still the main show in town for understanding what’s going on inside your OpenStack deployment.

We’ve been doing a bit of work with it in multiple projects. In one of our efforts – pulling in energy info via kwapi – we noticed that Ceilometer really crawls to a halt, with the API taking around 20s to respond when trying to enter just a single energy consumption data point. (Yes, it might make more sense to batch these up…) For our simple scenario, this performance was completely unworkable.
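For reference, the kind of call we were timing can be sketched roughly like this; it is a simplified illustration, and the endpoint, token, meter name and sample fields are placeholders that should be checked against the v2 API of your release.

# Hedged sketch: timing a single sample POST against the Ceilometer v2 API.
# Endpoint, token, meter name and field values are placeholders/assumptions.
import time
import requests

sample = [{
    "counter_name": "energy",          # placeholder meter name
    "counter_type": "cumulative",
    "counter_unit": "kWh",
    "counter_volume": 0.42,
    "resource_id": "host-01",
}]

start = time.time()
resp = requests.post("http://controller:8777/v2/meters/energy",
                     json=sample,
                     headers={"X-Auth-Token": "ADMIN_TOKEN"})
print(resp.status_code, "took %.1fs" % (time.time() - start))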

Our Ceilometer installation was deployed with the basic Mirantis Fuel v4.0, which installed a variant of Havana. The db backend was MySQL (chosen by Fuel) and we just went with the default configuration parameters.

There are known performance issues with Ceilometer (issue, presentation mentioning it, mailing list discussion) and it seems that Icehouse has made some significant strides in improving performance of Ceilometer/Telemetry; however, we have not managed to perform the upgrade as yet – maybe some of these issues have already been fixed.

For our work, we were able to significantly improve the performance of the Ceilometer API by activating (experimental!) thread pooling on the db: this reduced the time to enter a single energy consumption data point to less than one second (down from 20s), and a larger query listing the available meters took 5s compared to 34s previously. It just involved setting

use_tpool=true

in /etc/ceilometer/ceilometer.conf and bingo – a significant uptick in performance (for our small, experimental system).

We’re not sure how widely applicable this is, or whether it’s realistic for production environments – but for our experimental system it turned an unworkable setup into something which is usable (though certainly not speedy!).

 

Nagios / Ceilometer integration: new plugin available

The famous Nagios open source monitoring system has become a de facto standard in recent years. Unlike commercial monitoring solutions, Nagios does not come as a one-size-fits-all monitoring system with thousands of monitoring agents and monitoring functions. Nagios is rather a small, lightweight monitoring system reduced to the bare essentials of monitoring: an event management and notification engine. Nagios is very lightweight and flexible, but it must be extended in order to become a solution which is valuable for your organization. Plugins are a very important part of setting up a Nagios environment. Though Nagios is extremely customizable, there are no plugins that capture OpenStack-specific metrics like the number of floating IPs or network packets entering a virtual machine (even if there are some Nagios plugins to check that OpenStack services are up and running).

Ceilometer is the OpenStack component that captures these metrics. OpenStack measures typical performance indicators like CPU utilization, memory allocation, disk space used etc. for all VM instances within OpenStack. When an OpenStack environment has to be metered and monitored, Ceilometer is the right tool to do the job. Though Ceilometer is a quite powerful and flexible metering tool for OpenStack, it lacks capabilities to visualize the collected data.

It can easily be seen that Nagios and Ceilometer are complementary products which can be used in an integrated solution. There are no Nagios plugins that integrate the Ceilometer API with the Nagios monitoring environment (though Enovance has developed plugins to check that OpenStack components are alive) and therefore allow Nagios to monitor not only the OpenStack components, but also all the hosted VMs and other services.

The ICCLab has developed a Nagios plugin which can be used to capture metrics through the Ceilometer API. The plugin is available for download on GitHub. The ceilometer-call plugin can be used to capture a Ceilometer metric and to define thresholds for employing the Nagios alerting system.
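To give an idea of what such a check does internally, here is a minimal, hedged sketch of a Ceilometer-backed Nagios check. It is not the actual ICCLab plugin: the endpoint, token handling and statistics query are simplified assumptions, while the exit codes follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

#!/usr/bin/env python
# Hedged sketch of a Ceilometer-backed Nagios check -- not the actual
# ICCLab plugin. Endpoint, token and query details are assumptions.
import sys
import requests

METER = "cpu_util"
WARN, CRIT = 50.0, 80.0

resp = requests.get("http://controller:8777/v2/meters/%s/statistics" % METER,
                    headers={"X-Auth-Token": "ADMIN_TOKEN"})
stats = resp.json()
if not stats:
    print("UNKNOWN - no samples for %s" % METER)
    sys.exit(3)

value = stats[-1]["avg"]               # latest aggregated average
if value >= CRIT:
    print("CRITICAL - %s = %.1f" % (METER, value))
    sys.exit(2)
if value >= WARN:
    print("WARNING - %s = %.1f" % (METER, value))
    sys.exit(1)
print("OK - %s = %.1f" % (METER, value))
sys.exit(0)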

In order to use the plugin, simply copy it into your Nagios plugins folder (e. g. /usr/lib/nagios/plugins/) and define a Nagios command in your commands.cfg file (in /etc/nagios/objects/commands.cfg). Don’t forget to make the plugin executable (chmod u+x).

A command to monitor the CPU utilization could look like this:

define command {
    command_name    check_ceilometer-cpu-util
    command_line    /usr/lib/nagios/plugins/ceilometer-call -s "cpu_util" -t 50.0 -T 80.0
}

Then you have to define a service that uses this command.

define service {
    check_command           check_ceilometer-cpu-util
    host_name
    normal_check_interval   1
    service_description     OpenStack instances CPU utilization
    use                     generic-service
}

Now Nagios can employ the Ceilometer API to monitor VMs inside OpenStack.

When do you need to scale up?

A big issue in cloud computing is knowing when you should start up more VMs or switch to a more powerful virtual machine in order to process user requests efficiently. Monitoring system utilization is very important for detecting whether VM utilization is too high to guarantee stable and high-performing IT services. But how can one determine if upscaling of a VM infrastructure is required? Part of the answer lies in trend detection algorithms. This article describes two of the most popular ones that can be applied to VM infrastructures.

Autocorrelations and moving averages

If a series of measurements is correlated with the time of measurement, the series is said to be “autocorrelated”. If you measure VM utilization several times you might discover that utilization increases or decreases over time. A (linear) regression of the measured values will reveal growth trends. If such a trend appears, the average utilization changes over time: it is a “moving average”. The movement of the average causes the regression to produce errors, because regression models are computed on constant average values. Therefore one has to account for the errors produced by the moving average of the measured values.

Moving average and autocorrelation are combined in the “AutoRegressive Integrated Moving Average” (ARIMA) model. The ARIMA model has two advantages: on the one hand the autocorrelation function of a set of values is computed, on the other hand the errors that are produced by performing this calculation are minimized. ARIMA integrates aspects of autocorrelation and moving averages, and is therefore quite a feasible model for predicting trends.

When ARIMA is applied to VM utilization, one can predict (with a certain probability) that some utilization threshold will be reached in the future. Defining acceptance criteria for the probabilities of growth trends and for reaching a threshold in the future is a major step towards determining the “ideal” point in time when an upscaling of a VM infrastructure is required.

Two things must be done:

  1. Define threshold values for VM utilization metrics that tell when a VM is overutilized. One could e. g. say that if the mean CPU utilization of the last 5 minutes is over 90%, the VM is unacceptably overutilized; such a value is then a threshold for VM utilization.
  2. Define a threshold for ARIMA growth trends that results in VM overutilization (i.e. in reaching the threshold for VM utilization). For this purpose you have to measure values for the VM utilization metrics and repeatedly calculate growth trends following the ARIMA model. If such a calculation predicts that a VM utilization threshold will be reached, an upscaling of the VM infrastructure is required.

With threshold values for VM utilization metrics and ARIMA growth trends, one can construct an event management system that catches problems of VM overutilization by repeatedly measuring metrics and calculating growth trends.
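As a rough illustration (a sketch, not a production implementation), the following example fits an ARIMA model to a short series of CPU utilization measurements with statsmodels and checks whether the forecast crosses a 90% threshold; the model order, sampling interval, horizon and threshold are assumptions.

# Hedged sketch: ARIMA-based growth-trend check on CPU utilization.
# Assumptions: one sample per minute, model order (1, 1, 1), 15-minute
# forecast horizon and a 90% overutilization threshold.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

cpu_util = np.array([62, 64, 63, 67, 70, 72, 71, 75, 78, 80,
                     82, 85, 84, 87, 88], dtype=float)    # % per minute

model = ARIMA(cpu_util, order=(1, 1, 1)).fit()
forecast = model.forecast(steps=15)                       # next 15 minutes

THRESHOLD = 90.0
if (forecast >= THRESHOLD).any():
    minutes = int(np.argmax(forecast >= THRESHOLD)) + 1
    print("Growth trend: ~%.0f%% expected within %d min -> scale up"
          % (forecast.max(), minutes))
else:
    print("No upscaling needed; forecast peak is %.0f%%" % forecast.max())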

The advantages of the ARIMA model are:

  • It gives an extrapolated estimation of the future growth trend and tries to assign a value to predicted VM utilization.
  • It takes the fact that average VM utilization changes over time into account by repeatedly calculating a moving average.

The drawbacks of the ARIMA model are:

  • The model states a prediction which appears to be “exact” to the statistically inexperienced viewer, but in fact there is only a probability that the future values will lie in the neighbourhood of the predicted values. Given any ARIMA prediction, it is still possible that growth trends will discontinue in the future. Therefore predicted values can never be seen as “guaranteed” future values.

Control charts

Another model which can be used to predict upscaling utilizes Shewhart control charts. These charts are used in business process management for controlling process quality based on statistical measurements. The idea behind control charts is the following: we take n repeated samples of i measurements each and then calculate the range and the average of each sample. The ranges are then put as data points in an “R-chart” and the averages are filled into an “X-chart”. Then we calculate the average μ and the standard deviation σ of all n data points in the R- and the X-chart. Next we define some upper and lower bound for the data points which are considered as “natural” process limits and check whether there are data points lying above or below these “control limits”. The upper and lower control limit (UCL and LCL) are proportional to the standard error, which is σ divided by the square root of n. As a rule of thumb the UCL is defined as the average of all data points plus two times the standard error, while the LCL is the average minus two times the standard error. By calculating the UCL and LCL for the X- and R-chart, we can check whether there are data points above the UCL or below the LCL.

Control charts assume that if all data points lie within the UCL and LCL limits, the process will most likely continue as it is. The process is then said to be “in control”. The interesting thing about control charts is that data points lying outside the UCL or LCL can be used as indicators of process changes. If multiple points lie above the UCL, a growth trend is indicated.

When control charts are applied to VM utilization, one must first define the sample size i and the number of data points n. Let us say that we want to measure the average CPU utilization of the last 5 minutes. One could e. g. measure CPU utilization at 20 random points in time (i=20) in the interval between 0 and 5 minutes. Then one can calculate the average of the sample as well as the range, which is the difference between the maximum and minimum of the 20 values. As a result we get one data point for the X-chart and one for the R-chart. Then one should take n samples to populate the X- and R-charts. If we choose n=5, we can then compute the standard deviation, standard error and average of all samples. These values can be used to define the UCL and LCL for the process. As a next step we must define a decision criterion for when we say that a process will result in a growth or decline trend. We could e. g. say that if 2 or more points lie above the UCL, a growth trend will occur in the future.

Upscaling is necessary when either a process contains 2 or more data points above the UCL and the average is near some critical threshold (where the low-performance VM reaches its maximum capacity), or when a process is in control but the UCL lies above the critical threshold. In both cases an upscaling is necessary: either because the next data points will probably lie above the threshold as a result of some growth trend, or because future data points can reach the threshold even if the process continues as it is.
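The decision logic described above can be illustrated with a short sketch; the sample data, the choices i=20 and n=5, the two-points-above-UCL rule and the critical threshold are all assumptions.

# Hedged sketch: X-bar/R control chart check for CPU utilization.
# Assumptions: i=20 measurements per sample, n=5 samples, UCL/LCL set to
# mean +/- 2 * standard error, growth trend if 2 or more points exceed the UCL.
import numpy as np

rng = np.random.default_rng(0)
samples = [rng.uniform(55, 90, size=20) for _ in range(5)]    # n=5, i=20

xbar = np.array([s.mean() for s in samples])                  # X-chart points
ranges = np.array([s.max() - s.min() for s in samples])       # R-chart points

def control_limits(points):
    stderr = points.std(ddof=1) / np.sqrt(len(points))
    return points.mean() - 2 * stderr, points.mean() + 2 * stderr

lcl, ucl = control_limits(xbar)
above_ucl = int((xbar > ucl).sum())

CRITICAL = 90.0                                               # max capacity (%)
if above_ucl >= 2 and xbar.mean() > CRITICAL - 10:
    print("Growth trend near the critical threshold -> scale up")
elif ucl > CRITICAL:
    print("Process in control, but UCL above the threshold -> scale up")
else:
    print("No upscaling required (points above UCL: %d)" % above_ucl)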

Control charts are quite a simple means to predict process changes. They have the following advantages:

  • Control charts are a relatively reliable indicator for future growth trends and can therefore indicate possibly critical growth trends very early.
  • They do not bias viewers towards giving “exact” predictions for future VM utilization values.

Despite these advantages, control charts also have the following drawbacks:

  • They require a lot of parameter estimation (e. g. the choice of n or i). If those parameters are not chosen well, control charts lead to many “false alarms” that indicate overutilization where there is none.
  • Control charts can predict growth trends, but they do not tell anything about the strength of the growth trend. Therefore they tend to either overestimate small process changes or underestimate large changes. They are insensitive to trend sizes.

Both models, ARIMA and control charts, have advantages and drawbacks. Like many tools, they are only as good as the person using them. Often it is advisable to test both tools first and then decide which instrument should be used for VM utilization prediction. But predicting future growth trends is still more an art than a craft. Therefore it cannot be decided which method is “better”, but it is clear that both of them are better than doing nothing with VM performance measurements.

 

High Availability on OpenStack

Motivation for OpenStack High Availability

ICCLab’s MobileCloud Networking solution is supposed to offer private cloud services to end users. MobileCloud is based on OpenStack. Since our OpenStack installation will be used mainly by end users, it is necessary to provide High Availability.

As mobile end users we all know that we want our IT services to be available anytime and anywhere – 24 hours per day, 7 days per week, 365 days per year. End users normally don’t realize that this requirement is a challenge for the system architects, developers and engineers who offer those IT services. Cloud components must be kept under regular maintenance to remain stable and secure. While performing maintenance changes, engineers have to shut down components; at the same time the service should still remain available for the end user. Achieving High Availability in a cloud environment is therefore a very complex and challenging task.

Requirements for OpenStack High Availability

For delivering High Availability in an OpenStack environment, there are several requirements:

  • Availability of a cloud service is the result of the availability of all its participating components. An app hosted in the cloud is only available if its supporting OS is available. The OS is only available if its underlying virtual or physical server is available. And everything breaks down if the network devices between service user and service provider fail. If one crucial component participating in the service fails, the whole service becomes unavailable. Therefore “High Availability on OpenStack” means High Availability of all components managed by OpenStack.
  • To maintain the availability of service components, it is necessary to implement redundancy. If a crucial service component fails, a redundant component must take over its function to maintain the availability of the service.
  • There’s a trade-off between redundancy and costs: if you establish redundancy of the MobileCloud service by doubling its components, you increase the overall availability of the service, but you also roughly double its costs.
  • 100% availability is an illusion since no service component can be available all the time. A better solution is to define availability levels or classes of availability for every component, which define the permissible downtime of each service component. Availability classes have to be assigned to service components according to their importance to the total availability of the service.
  • High Availability is related to the concept of Event Management. An event is any occurrence that is significant for the operation of a service component. Service components must be able to react to events that could lead to outages in order to maintain their stability.
  • High Availability closely depends on monitoring tools: it can only be implemented if outages and events which are harmful to the availability of components can be monitored. The High Availability on OpenStack project therefore depends on the Monitoring on OpenStack project.
  • The High Availability solution for the OpenStack installation must contain the following parts: an architectural overview of all components (virtual and physical servers, network devices, operating systems, subsystems and software) that are crucial for service operation, an assignment of availability levels to all those components, redundant components, a monitoring tool that captures events (traffic, load, outages etc.) and an event management system that reacts to events.
  • Availability information of the monitored resources must be assignable to its tenant.
  • The metered values must be collected and correlated automatically.
  • The collection of values must be able to trigger events.
  • The event management system must be able to drive changes (e. g. switch traffic to a redundant device) in the service architecture and reconfigure components automatically.
  • The monitoring tool and event management system must be as generic as possible to ensure support of any device.
  • The monitoring tool and event management system must offer an API.

Architecture


OpenStack High Availability Architecture

As-is state

Currently an extended version of the Ceilometer monitoring tool is used for the OpenStack environment of the ICCLab. An evaluation of possible Event Management functionality is currently being performed. There is also an ongoing evaluation of solutions that implement redundancy in OpenStack.

ICCLab Presents on Ceilometer at 2nd Swiss OpenStack User Group Meeting

On the 19th of February the 2nd Swiss OpenStack User Group Meeting took place. One of the presentations, on Ceilometer, was given by Toni and Lucas from the ICCLab. They talked about the history, the current and future features, the architecture and the requirements of Ceilometer, and explained how to use and extend it. You can take a look at the presentation here:

A video of the presentation is available here.

Monitoring and OpenStack

Motivation for OpenStack monitoring

Many people think it may be an unnecessary burden to set up a monitoring system for their infrastructure. However, when it comes to an OpenStack installation, monitoring should be considered indispensable. Knowing which resources are used by which VMs (and tenants) is crucial for cloud computing providers as well as for their customers, from both billing and usage perspectives.

Customers want to be sure they get what they pay for at any time, whereas the cloud provider needs the information for his billing and rating system. Furthermore, this information can be useful when it comes to dimensioning and scalability questions.

Requirements for OpenStack monitoring

For monitoring an OpenStack environment there are different requirements:

  • An OpenStack monitoring tool must be able to monitor not only physical machines but also virtual machines or network devices.
  • The information of the monitored resources must be assignable to its tenant.
  • The metered values must be collected and correlated automatically.
  • The monitoring tool must be as generic as possible to ensure support of any device.
  • The monitoring tool must offer an API.

 


Architecture Monitoring Tool

 

As-is state

There are many tools for network and server monitoring, such as Nagios, Zabbix and Munin. Most of them do not easily support OpenStack monitoring.

Zenoss is one of the few monitoring tools that supports integration with OpenStack. It is possible to download and install an extension for OpenStack monitoring (https://github.com/zenoss/ZenPacks.zenoss.OpenStack). Unfortunately, the latest version of this extension only supports OpenStack API version 1.1, while the Folsom release ships with OpenStack API version 2.0. In addition, the extension allows Zenoss to collect data from a single tenant only. That is not good enough, because we need more data to do rating and billing.

Another promising monitoring tool, known as Ceilometer, will be included in the upcoming OpenStack release Grizzly (March 2013). It will be part of the OpenStack core. Ceilometer makes it easy to monitor the VMs belonging to a tenant. What Ceilometer cannot offer at the moment is physical device monitoring.

After an evaluation, we decided to extend Ceilometer to monitor the physical devices as well. With this extension, Ceilometer will be able to monitor the whole OpenStack environment of the ICCLab and provide the data for further systems, such as the billing module.
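To give an impression of what such an extension involves, here is a heavily simplified sketch of a custom Ceilometer pollster that publishes a gauge sample for a physical host. The base class, method signature, meter name and Sample fields reflect the plugin interface as we recall it from that era and should be treated as assumptions, since the interface has changed across releases.

# Hedged sketch of a custom Ceilometer pollster for a physical-host metric.
# Base class, method signature and Sample fields are assumptions based on the
# plugin interface of that era; they have changed across OpenStack releases.
import psutil  # assumption: the metric is read locally via psutil

from ceilometer import plugin, sample
from ceilometer.openstack.common import timeutils


class PhysicalCPUPollster(plugin.PollsterBase):
    """Publish the CPU utilization of the physical host as a gauge meter."""

    def get_samples(self, manager, cache, resources=None):
        yield sample.Sample(
            name='hardware.cpu.util',          # assumed meter name
            type=sample.TYPE_GAUGE,
            unit='%',
            volume=psutil.cpu_percent(interval=1),
            user_id=None,
            project_id=None,
            resource_id='physical-host-01',    # placeholder host id
            timestamp=timeutils.utcnow().isoformat(),
            resource_metadata={},
        )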