Author: benn (page 2 of 4)

ICCLab @ Nagios World Conference 2014: Nagios Ceilometer Integration

The ICCLab will be present at the Nagios World Conference 2014, which takes place Oct. 13th-16th in St. Paul, MN, USA. Konstantin Benz will speak about the use of Nagios as a tool to monitor OpenStack clouds. While Nagios is the de facto open source standard for monitoring IT systems, the OpenStack community uses Ceilometer to monitor VMs and other resources in the cloud. The main reason for this lies in the special requirements of rating, charging and billing in the cloud: VM usage must be stored persistently, even when a VM has been shut down and deleted by an end user. In standard monitoring contexts it is not necessary to store data about resources which are no longer present in the system. Ceilometer does quite a good job of monitoring virtual resources. On the other hand, a system administrator might not be interested only in monitoring the virtual resources provided by OpenStack: monitoring OpenStack itself is also a major task in administering an OpenStack cloud. While Ceilometer is mainly used for monitoring resources provided by OpenStack, it does not monitor availability and performance of the servers that run the OpenStack services. Nagios is the industry standard for monitoring physical IT infrastructures. Therefore we will discuss how to integrate Nagios with Ceilometer at the Nagios World Conference.

If you want to know more about the conference, follow this link:
Nagios World Conference

Benchmarking OpenStack by using Rally – part 1

For system administrators it is difficult to gather performance data before a system goes into production. Benchmarking tools offer a convenient way to gather performance data by simulating usage of a productive system. In the OpenStack world we can employ the Mirantis Rally tool to benchmark VM performance of our cloud environment.

Rally comes with some predefined benchmarking tasks, e. g. booting new VMs, starting VMs and running shell scripts on them, concurrently building new VMs and many more. The chart below shows the performance of booting VMs in an OpenStack instance in a Shewhart control chart (often called “X-chart” or “X-bar chart”). As you can see, it takes almost 7.2 seconds on average to boot a VM, and sometimes the boot process falls outside the usual six-sigma range. For a system administrator this can be quite useful data.

An X-chart of VM boot performance in OpenStack.

The data above was collected with the Rally benchmarking software. The Python-based Rally tool is free, open source and extremely easy to deploy. First you have to download Rally from this Github link.

Rally comes with an install script: just clone the Github repository into a folder of your choice, cd into that folder and run:

$ ./rally/install_rally.sh

Then create a Rally deployment by filling in your OpenStack credentials in a JSON file:
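The exact layout of this file depends on your Rally version; a minimal sketch for an existing cloud (saved as existing.json to match the command below, with placeholder credentials and Keystone URL) might look roughly like this:

{
    "type": "ExistingCloud",
    "auth_url": "http://<keystone-host>:5000/v2.0/",
    "admin": {
        "username": "admin",
        "password": "secret",
        "tenant_name": "admin"
    }
}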

And then type:

$ rally deployment create --filename=existing.json --name=existing
+----------+----------------------------+----------+-----------------+
|   uuid   |         created_at         |   name   |      status     |
+----------+----------------------------+----------+-----------------+
|   UUID   | 2014-04-15 11:00:28.279941 | existing | deploy-finished |
+----------+----------------------------+----------+-----------------+
Using deployment : UUID 

Remember to use the UUID you got after running the previous command.
Then type:

$ rally use deployment --deploy-id=UUID
Using deployment : UUID

Then you are ready to use Rally. Rally comes with some pre-configured test scenarios in its doc folder. Just copy a file such as rally/doc/samples/tasks/nova/boot-and-delete.json to a location of your choice, e. g. /etc/rally/mytask.json:


$ cp rally/doc/samples/tasks/nova/boot-and-delete.json /etc/rally/mytask.json

Before you can run a Rally task, you have to configure it. This can be done either via JSON or via YAML files; the Rally API can deal with both file formats.
If you edit the JSON file mytask.json, you will see something like the following:


{
    "NovaServers.boot_and_delete_server": [
        {
            "args": {
                "flavor_id": 1,
                "image_id": "Glance UUID"
            },
            "runner": {
                "type": "constant",
                "times": 10,
                "concurrency": 2
            },
            "context": {
                "users": {
                    "tenants": 3,
                    "users_per_tenant": 2
                }
            }
        }
    ]
}

You have to add the correct UUID of a Glance image in order to configure the test run properly. The UUID can be retrieved by typing:


$ rally show images
+--------------------------------------+--------+----------+
|                 UUID                 |  Name  | Size (B) |
+--------------------------------------+--------+----------+
| d3db863b-ebff-4156-a139-5005ec34cfb7 | Cirros | 13147648 |
| d94f522f-008a-481c-9330-1baafe4933be | TestVM | 14811136 |
+--------------------------------------+--------+----------+

Update the mytask.json file with the UUID of the Glance image.

To run the task, simply type (the “-v” flag gives “verbose” output):


$ rally -v task start /etc/rally/mytask.json

=================================================================
Task  ... is started
------------------------------------------------------------------
2014-05-12 11:54:07.060 . INFO rally.benchmark.engine [-] Task ... 
2014-05-12 11:54:07.864 . INFO rally.benchmark.engine [-] Task ... 
2014-05-12 11:54:07.864 . INFO rally.benchmark.engine [-] Task ... 
...
+--------------------+-------+---------------+---------------+---------------+---------------+---------------+
|       action       | count |   max (sec)   |   avg (sec)   |   min (sec)   | 90 percentile | 95 percentile |
+--------------------+-------+---------------+---------------+---------------+---------------+---------------+
|  nova.boot_server  |   10  | 8.28417992592 | 5.87529754639 | 4.68817186356 | 7.33927609921 | 7.81172801256 |
| nova.delete_server |   10  | 6.39436888695 | 4.54159021378 | 4.31421685219 | 4.61614284515 | 5.50525586605 |
+--------------------+-------+---------------+---------------+---------------+---------------+---------------+

+---------------+---------------+---------------+---------------+
|   max (sec)   |   avg (sec)   |   min (sec)   | 90 percentile |
+---------------+---------------+---------------+---------------+
| 13.6288781166 | 10.4170130491 | 9.01177096367 | 12.7189923525 |
+---------------+---------------+---------------+---------------+
...

The statistical output is of major interest: it shows how long it takes to boot a VM instance in OpenStack and gives some useful information about the performance of your current OpenStack deployment. It can also be viewed as a sample for a Shewhart control chart. Rally performs 10 test runs and measures the runtime of each run; averaging over such repeated measurements is called statistical sampling. So each Rally task run can be viewed as a sample which is represented as one data point in a control chart.

But how did we get our data into a Shewhart Control chart? This will be explained further in part 2.

Nagios / Ceilometer integration: new plugin available

The famous Nagios open source monitoring system has become a de facto standard in recent years. Unlike commercial monitoring solutions, Nagios does not come as a one-size-fits-all monitoring system with thousands of monitoring agents and monitoring functions. Nagios is rather a small, lightweight monitoring system reduced to the bare essentials of monitoring: an event management and notification engine. Nagios is very lightweight and flexible, but it must be extended in order to become a solution which is valuable for your organization. Plugins are therefore a very important part of setting up a Nagios environment. Though Nagios is extremely customizable, there are no plugins that capture OpenStack-specific metrics like the number of floating IPs or the network packets entering a virtual machine (even though there are some Nagios plugins to check that OpenStack services are up and running).

Ceilometer is the OpenStack component that captures these metrics. Ceilometer measures typical performance indicators like CPU utilization, memory allocation, disk space used etc. for all VM instances within OpenStack. When an OpenStack environment has to be metered and monitored, Ceilometer is the right tool for the job. Though Ceilometer is a quite powerful and flexible metering tool for OpenStack, it lacks capabilities to visualize the collected data.

It can easily be seen that Nagios and Ceilometer are complementary products which can be used in an integrated solution. So far there are no Nagios plugins that integrate the Ceilometer API with the Nagios monitoring environment (though eNovance has developed plugins to check that OpenStack components are alive) and therefore allow Nagios to monitor not only the OpenStack components, but also all the hosted VMs and other services.

The ICCLab has developed a Nagios plugin which can be used to capture metrics through the Ceilometer API. The plugin is available for download on Github. The ceilometer-call plugin can be used to capture a Ceilometer metric and to define thresholds which trigger the Nagios alerting system.

In order to use the plugin, simply copy it into your Nagios plugins folder (e. g. /usr/lib/nagios/plugins/) and define a Nagios command in your commands.cfg file (in /etc/nagios/objects/commands.cfg). Don’t forget to make the plugin executable for the Nagios user (chmod u+x).

A command to monitor the CPU utilization could look like this:

define command {
command_name    check_ceilometer-cpu-util
command_line    /usr/lib/nagios/plugins/ceilometer-call -s "cpu_util" -t 50.0 -T 80.0
}

Then you have to define a service that uses this command.

define service {
check_command check_ceilometer-cpu-util
host_name
normal_check_interval 1
service_description OpenStack instances CPU utilization
use generic-service
}

Now Nagios can employ the Ceilometer API to monitor VMs inside OpenStack.
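For illustration, a heavily stripped-down check written in Python might look roughly like the sketch below. This is not the actual ceilometer-call plugin from the ICCLab repository; it only shows the general pattern of querying a Ceilometer meter and mapping the result to Nagios exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL). The python-ceilometerclient call and the environment-variable credentials are assumptions for illustration.

#!/usr/bin/env python
# Minimal sketch of a Nagios check against the Ceilometer API.
# Not the ICCLab ceilometer-call plugin; the client call and credential
# handling are illustrative assumptions.
import os
import sys

from ceilometerclient import client  # python-ceilometerclient

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_meter(meter, warn, crit):
    cclient = client.get_client(
        '2',
        os_username=os.environ.get('OS_USERNAME'),
        os_password=os.environ.get('OS_PASSWORD'),
        os_tenant_name=os.environ.get('OS_TENANT_NAME'),
        os_auth_url=os.environ.get('OS_AUTH_URL'))
    stats = cclient.statistics.list(meter_name=meter)
    if not stats:
        print("UNKNOWN - no samples for meter %s" % meter)
        return UNKNOWN
    value = stats[-1].avg  # average of the most recent statistics period
    if value >= crit:
        print("CRITICAL - %s = %.2f" % (meter, value))
        return CRITICAL
    if value >= warn:
        print("WARNING - %s = %.2f" % (meter, value))
        return WARNING
    print("OK - %s = %.2f" % (meter, value))
    return OK

if __name__ == '__main__':
    # e.g.: ./check_ceilometer.py cpu_util 50 80
    sys.exit(check_meter(sys.argv[1], float(sys.argv[2]), float(sys.argv[3])))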

When do you need to scale up?

A big issue in cloud computing is knowing when you should start up more VMs or switch to a more powerful virtual machine in order to process user requests efficiently. Monitoring system utilization is very important for detecting whether VM utilization is too high to guarantee stable and high-performing IT services. But how can one determine if upscaling of a VM-infrastructure is required? Part of the answer lies in trend detection algorithms. This article describes two of the most popular ones that can be applied to VM-infrastructures.

Autocorrelations and moving averages

If a series of measurements is correlated with earlier measurements of the same series, the series is said to be “autocorrelated”. If you measure VM utilization several times you might discover that utilization increases or decreases over time. A (linear) regression of the measured values will reveal growth trends. If such a trend appears, the average utilization changes over time: it is a “moving average”. The movement of the average causes the regression to produce errors, because regression models assume a constant average value. Therefore one has to account for the errors produced by the moving average of the measured values.

Moving average and autocorrelation can be combined in the “AutoRegressive Integrated Moving Average” (ARIMA) model. The ARIMA model has two advantages: on the one hand the autocorrelation function of a set of values is computed, on the other hand the errors that are produced by performing this calculation are minimized. ARIMA integrates aspects of both autocorrelation and moving averages. Therefore it is quite a suitable model for predicting trends.

When ARIMA is applied to VM utilization, one can predict (with a certain probability) that some utilization threshold will be reached in the future. Defining acceptance criteria for the probability of a growth trend and for reaching a threshold in the future is a major step towards determining the “ideal” point in time at which an upscaling of a VM-infrastructure is required.

Two things must be done:

  1. Define threshold values for VM utilization metrics that tell when a VM is overutilized. One could e. g. say that if the mean CPU utilization of the last 5 minutes is over 90%, the VM with that CPU is unacceptably overutilized and therefore such a value is a threshold for VM utilization.
  2. Define a threshold for ARIMA growth trends that result in VM overutilization (i. e. that reach the threshold for VM utilization). For this purpose you have to measure values for the VM utilization metrics and repeatedly calculate growth trends following the ARIMA model. If such a calculation results in reaching a threshold for VM utilization, an upscaling of the VM-infrastructure is required.

With threshold values for VM utilization metrics and ARIMA growth trends one can construct an event management system that catches problems of VM overutilization by repeatedly measuring metrics and calculating growth trends.
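To make this concrete, the following Python sketch fits an ARIMA model to a series of utilization samples with statsmodels and flags a scale-up if the forecast crosses a utilization threshold. The model order (1, 1, 1), the 90% threshold and the forecast horizon are illustrative assumptions that would have to be tuned to real measurement data.

# Sketch: flag a scale-up when an ARIMA forecast of CPU utilization
# crosses a threshold. Model order, threshold and horizon are assumptions.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA  # recent statsmodels versions

THRESHOLD = 90.0   # percent CPU utilization considered overutilized
HORIZON = 6        # number of forecast steps (e.g. 6 x 5-minute intervals)

def scale_up_needed(utilization_series):
    """utilization_series: equally spaced CPU utilization samples (1-D array)."""
    model = ARIMA(np.asarray(utilization_series, dtype=float), order=(1, 1, 1))
    fitted = model.fit()
    forecast = fitted.forecast(steps=HORIZON)
    return bool(np.any(forecast >= THRESHOLD))

# Example with a synthetic, slowly growing load:
samples = 60 + np.cumsum(np.random.normal(0.5, 1.0, size=50))
print(scale_up_needed(samples))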

The advantages of the ARIMA model are:

  • It gives an extrapolated estimation of the future growth trend and tries to assign a value to predicted VM utilization.
  • It takes the fact that average VM utilization changes over time into account by repeatedly calculating a moving average.

The drawbacks of the ARIMA model are:

  • The model states a prediction which appears to be “exact” to the statistically inexperienced viewer, but in fact there is only a probability that the future values will be most likely in the neighbourhood of the predicted values. Given any ARIMA prediction it is still possible that growth trends will discontinue in the future. Therefore predicted values can never be seen as “guaranteed” future values.

Control charts

Another model which can be used to predict upscaling utilizes Shewhart control charts. These charts are used in business process management for controlling process quality based on statistical measurements. The idea behind control charts is the following: we take n repeated samples of i measurements each and then calculate the range and the average of each sample. The ranges are then plotted as data points in an “R-chart” and the averages in an “X-chart”. Next we calculate the average μ and the standard deviation σ of all n data points in the R- and the X-chart. Then we define upper and lower bounds for the data points which are considered the “natural” process limits and check whether there are data points lying above or below these “control limits”. The upper and lower control limits (UCL and LCL) are proportional to the standard error, which is σ divided by the square root of n. As a rule of thumb, the UCL is defined as the average of all data points plus two times the standard error, while the LCL is the average minus two times the standard error. By calculating the UCL and LCL for the X- and R-chart, we can check whether there are data points above the UCL or below the LCL.

Control charts assume that if all data points lie within the UCL and LCL, the process will most likely continue as it is. The process is then said to be “in control”. The interesting thing about control charts is that data points which lie outside the control limits can be used as indicators of process changes. If multiple points lie above the UCL, this can indicate a growth trend.

When control charts are applied to VM utilization, one must first define the sample size i and the number of data points n. Let us say that we want to measure average CPU utilization over the last 5 minutes. One could e. g. measure CPU utilization at 20 random points (i=20) in the time interval between 0 and 5 minutes. Then one can calculate the average of the sample as well as the range, which is the difference between the maximum and minimum of the 20 values. As a result we get one data point for the X-chart and one for the R-chart. Then one should take n samples to populate the X- and R-charts. If we choose n=5, we can then compute the standard deviation, standard error and average of all samples. These values can be used to define the UCL and LCL for the process. As a next step we must define a decision criterion for when we say that a process will result in a growth or decline trend. We could e. g. say that if 2 or more points lie above the UCL, a growth trend will occur in the future.

Upscaling is necessary when either a process contains 2 or more data points above the UCL and the average is near some critical threshold (where the low-performance VM reaches its maximum capacity), or when a process is in control but the UCL lies above the critical threshold. In both cases an upscaling is necessary: either because the next data points will probably lie above the threshold as a result of some growth trend, or because future data points can reach the threshold even if the process continues as it is.
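A minimal sketch of this decision rule in Python, using the parameters from the example above (i = 20 measurements per sample, n = 5 samples, control limits at the mean plus or minus two standard errors), could look like this. The two-sigma limits and the “2 or more points above the UCL” rule follow the text; the notion of “near the threshold” (90% of it) is an arbitrary assumption.

# Sketch: X-chart based scale-up check following the rule described above.
import math

def xchart_scale_up(samples, critical_threshold):
    """samples: list of n samples, each a list of i utilization measurements."""
    n = len(samples)
    xbars = [sum(s) / len(s) for s in samples]            # one X-chart point per sample
    mean = sum(xbars) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in xbars) / n)
    std_err = sigma / math.sqrt(n)                        # standard error = sigma / sqrt(n)
    ucl = mean + 2 * std_err                              # upper control limit
    points_above_ucl = sum(1 for x in xbars if x > ucl)
    growth_trend = points_above_ucl >= 2                  # decision criterion from the text
    near_threshold = mean >= 0.9 * critical_threshold     # "near" = within 90%, an assumption
    in_control = points_above_ucl == 0
    return (growth_trend and near_threshold) or (in_control and ucl >= critical_threshold)

samples = [[62, 65, 61, 70, 66] * 4,   # n = 5 samples with i = 20 measurements each
           [64, 66, 63, 71, 68] * 4,
           [67, 69, 65, 72, 70] * 4,
           [70, 73, 68, 75, 72] * 4,
           [74, 76, 71, 78, 75] * 4]
print(xchart_scale_up(samples, critical_threshold=90))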

Control charts are a quite simple means to predict process changes. They have the following advantages:

  • Control charts are a relatively reliable indicator for future growth trends and can therefore indicate possibly critical growth trends very early.
  • They do not bias viewers towards giving “exact” predictions for future VM utilization values.

Despite these advantages, control charts also have the following drawbacks:

  • They need a lot of parameter estimations (e. g. choice of n or i). If those parameters are not chosen well, control charts lead to many “false alarms” that indicate overutilization when there is none.
  • Control charts can predict growth trends, but they do not tell anything about the strength of the growth trend. Therefore they tend to either overestimate small process changes or underestimate large changes. They are insensitive to trend sizes.

Both models, ARIMA and control charts, have some advantages and some drawbacks. Like many tools they are only as good as the person who uses them. Often it is advisable to test both tools first and then decide which instrument should be used for VM utilization prediction. Predicting future growth trends is still more an art than a craft. Therefore it cannot be decided which method is “better”, but it is clear that both of them are better than doing nothing with VM performance measurements.

 

How to model service quality in the cloud

Why is service quality important?

A cloud can be seen as a service which is provided by a cloud provider and consumed by an end user. The cloud provider has the goal of maximizing profit by providing cloud services to end users. Usually there are no fixed prices for using cloud services: users pay a variable price that depends on their consumption of cloud services. Service quality is a constraint on the cloud provider’s optimization goal of profit maximization: the cloud provider should deliver cloud services with sufficiently good performance, capacity, security and availability while maximizing profit. Since quality costs money, a low-quality cloud service seems preferable to a high-quality service, simply because it costs less. So why should profit-oriented cloud providers bother with quality at all?

A new view of service quality

In the new view, we see service quality not as a restriction to profit maximization. Cloud service quality is an enabler of further service consumption and therefore a force that increases profit of cloud providers. If we think of cloud computing as a low quality service with low degrees of availability (many outages), running slowly and in an insecure environment, one can easily see that cloud consumers will stop using the cloud service as soon as there are alternatives to it. But there is another argument in favour of using clouds with a high degree of quality of service (QoS): if cloud service consumption is performing well, it can be used more often and by more users at once. Therefore an operator of a quality cloud service can handle more user requests and at lower costs than a non-quality-oriented cloud provider.

What is quality in the cloud?

Quality can have different meanings: for us it must be measured in terms of availability, performance, capacity and security. For each of these four terms we have to define metrics that measure quality. The following metrics are used in service management practice:

  1. Availability: Availability can be calculated only indirectly by measuring the downtime, because outages are directly observable while normal operation of a system is not. When an outage occurs, the downtime is reported as the time difference between discovery of an outage and restoration of the service. Availability is then the ratio of total operating time minus downtime to the total operating time (a small worked example of this calculation follows the list below). Availability of a system can be tested by using the Dependability Modeling Framework, i. e. a series of simulated random outages which tell system operators how stable their system is.
  2. Performance: Performance is usually tested by measuring the time it takes to perform a set of sample queries in a computer program. Such a time measurement is called a benchmark test. Performance of a cloud service can be measured by running multiple standard user queries and then measuring their execution time.
  3. Capacity: By capacity we mean storage which is free for service consumption. Capacity on disks can be measured directly by checking how much storage is used and how much storage is free. If we want to know how much working memory must be available, the whole measurement becomes a little bit more complicated: we must measure memory consumption during certain operations. Usually this can be done by profiling the system: like in benchmarking we run a set of sample queries and measure how much memory is consumed. Then we calculate the memory which is necessary to operate the cloud service.
  4. Security: Security is the most abstract quality indicator, because it cannot be measured directly. A common practice is to create a vector of potential security threats, estimate the probability that a threat will lead to an attack and estimate the potential damage in case of an attack. Threats can then be rated as the product of the attack probability and the potential damage. The goal should be to mitigate the biggest risks with a given budget. A risk is mitigated when there are countermeasures against identified security threats (risk avoidance), minimization measures for potential damages (damage minimization), transfer of security risks to other organizations (e. g. insurances) or (authorized) risk acceptance. Because nobody can know all potential threats in advance, there is always an unknown residual risk which cannot be avoided. Security management of a cloud service is good when the security threat vector is regularly updated and the worst risks are mitigated.
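As a small worked example of the availability and security metrics above (referenced in item 1), the following Python lines compute availability from a measured downtime and a simple risk score as probability times potential damage; the numbers are invented for illustration:

# Worked example for the availability and security metrics (illustrative numbers).
total_operating_time_min = 30 * 24 * 60      # one month of operation, in minutes
downtime_min = 43                            # sum of reported outage durations
availability = (total_operating_time_min - downtime_min) / float(total_operating_time_min)
print("Availability: %.3f%%" % (availability * 100))    # ~99.900%

# Security: risk per threat = attack probability * potential damage
threats = {"data breach": (0.02, 500000), "denial of service": (0.10, 50000)}
risks = {name: p * damage for name, (p, damage) in threats.items()}
print(sorted(risks.items(), key=lambda kv: kv[1], reverse=True))  # mitigate the biggest risks first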

The given metrics are a good starting point for modelling service quality. In optimization there are two types of models: descriptive models and optimization models.

A descriptive model of service quality in the cloud

Descriptive models describe how a process is performed and are used to explore how the process works. Usually descriptive models answer “What if?” questions. They consist of an input of variables, a function that transforms the input into output, and a set of (unchangeable) parameters that influence the transformation function. A descriptive model of cloud service quality describes how a particular configuration of service components (service assets like hardware, software etc. and the management of those assets) delivers a particular set of outputs in terms of service quality metrics. If we e. g. increase availability of the cloud service by using a recovery tool like Pacemaker, a descriptive model is able to tell us how the quality of the cloud service changes.

Sets are all possible resources we can use in our model to produce an outcome. In OpenStack we use hardware, software and labour. Parameters are attributes of the set entities which are not variable, e. g. labour cost, price of hardware assets etc. All other attributes are called variables: the goal of the modeler is to change these variables and see what comes out. The outcomes are called consequences.

A descriptive model of the OpenStack service could be described as follows:

  • Sets:
    • Technology used in the OpenStack environment
      • Hardware (e. g. physical servers, CPU, RAM, harddisks and storage,  network devices, cables, routers)
      • Operating system (e. g. Ubuntu, openSUSE)
      • Services used in OpenStack (e. g. Keystone, Glance, Quantum, Nova, Cinder, Horizon, Heat, Ceilometer)
      • HA Tools (e. g. Pacemaker, Keepalive, HAProxy)
      • Monitoring tools
      • Benchmark tools
      • Profiling tools
      • Security Tools (e. g. ClamAV)
    • Management of the OpenStack environment
      • Interval of availability tests.
      • Interval of performance benchmark tests.
      • Interval of profiling and capacity tests.
      • Interval of security tests.
      • Interval of Risk Management assessments (reconsideration of threat vector).
  • Parameters:
    • Budget to run the OpenStack technology and service management actions
      • Hardware costs
      • Energy costs
      • Software costs (you don’t have to pay licence fees in the Open Source world, but you still have maintenance costs)
      • Labor cost to handle tests
      • Labor costs to install technologies
      • Labor costs to maintain technologies
    • Price of technology installation, maintenance and service management actions
      • Price of tangible assets (hardware) and intangible assets (software, energy consumption)
      • Salaries, wages
    • Quality improvement by operation of particular technology or by performing service management actions
      • Price of tangible assets (hardware) and intangible assets (software, energy consumption)
      • Salaries, wages
  • Variables:
    • Quantities of a particular technology which should be installed and maintained:
      • Hardware (e. g. quantity of physical servers, CPU speed, RAM size, harddisks and storage size, number of network devices, speed of cables, routers)
      • Operating system of each node (e. g. Ubuntu, openSUSE)
      • OpenStack services per node (e. g. Keystone, Glance, Quantum, Nova, Cinder, Horizon, Heat, Ceilometer)
      • HA Tools per node (e. g. Pacemaker, Keepalive, HAProxy)
      • Monitoring tools
      • Benchmark tools
      • Profiling tools
      • Security Tools (e. g. ClamAV)
  • Consequences:
    • Costs for installation and maintenance of the OpenStack environment:
      • Infrastructure costs
      • Labour costs
    • Quality of the OpenStack service in terms of:
      • Availability
      • Performance
      • Capacity
      • Security

In the following picture we show a generic descriptive model for optimization of quality of an IT service:

Fig. 1: Descriptive model of service quality of an IT service.

Such a descriptive model is good for exploring the quality improvements delivered by different system architectures and service management operations. The input variables form a vector of systems and operations: hardware, network architecture, operating systems, OpenStack services, HA tools, benchmark tools, profiling monitors, security software and the service operations performed by system administrators. One can experiment with different systems and operations and then check the outcomes. The outcomes are the costs (as a product of prices and systems) and the service quality. The service quality is then measured by the metrics we have defined.

Even if the descriptive model is quite useful, it is very hard to actually optimize service quality with it. Therefore the descriptive model has to be extended into an optimization model.

An optimization model of service quality in the cloud

Optimization models enhance descriptive models by adding constraints to the inputs of the descriptive model and by defining an objective function. Optimization models answer “What’s best?” questions. Like descriptive models, they consist of an input of variables, a function that transforms the input into output, and a set of (unchangeable) parameters that influence the transformation function. Additionally they contain constraints that restrict the number of possible inputs, and an objective function which tells the model user what output should be achieved.

An optimization model of the OpenStack service could be described as follows:

  • Sets:
    • Technology used in the OpenStack environment
      • Hardware (e. g. physical servers, CPU, RAM, harddisks and storage, network devices, cables, routers)
      • Operating system (e. g. Ubuntu, openSUSE)
      • Services used in OpenStack (e. g. Keystone, Glance, Quantum, Nova, Cinder, Horizon, Heat, Ceilometer)
      • HA Tools (e. g. Pacemaker, Keepalive, HAProxy)
      • Monitoring tools
      • Benchmark tools
      • Profiling tools
      • Security Tools (e. g. ClamAV)
    • Management of the OpenStack environment
      • Interval of availability tests.
      • Interval of performance benchmark tests.
      • Interval of profiling and capacity tests.
      • Interval of security tests.
      • Interval of Risk Management assessments (reconsideration of threat vector).
  • Parameters:
    • Budget to run the OpenStack technology and service management actions
      • Hardware costs
      • Energy costs
      • Software costs (you don’t have to pay licence fees in the Open Source world, but you still have maintenance costs)
      • Labor cost to handle tests
      • Labor costs to install technologies
      • Labor costs to maintain technologies
    • Price of technology installation, maintenance and service management actions
      • Price of tangible assets (hardware) and intangible assets (software, energy consumption)
      • Salaries, wages
    • Quality improvement by operation of particular technology or by performing service management actions
      • Price of tangible assets (hardware) and intangible assets (software, energy consumption)
      • Salaries, wages
  • Variables:
    • Quantities of a particular technology which should be installed and maintained:
      • Hardware (e. g. quantity of physical servers, CPU speed, RAM size, harddisks and storage size, number of network devices, speed of cables, routers)
      • Operating system of each node (e. g. Ubuntu, openSUSE)
      • OpenStack services per node (e. g. Keystone, Glance, Quantum, Nova, Cinder, Horizon, Heat, Ceilometer)
      • HA Tools per node (e. g. Pacemaker, Keepalive, HAProxy)
      • Monitoring tools
      • Benchmark tools
      • Profiling tools
      • Security Tools (e. g. ClamAV)
  • Constraints:
    • Budget limitation for installation and maintenance of the OpenStack environment:
      • Infrastructure costs
      • Labour costs
    • Technological constraints:
      • Incompatible technologies
      • Limited knowledge of system administrators
  • Objective Function:
    • Maximization of service quality in terms of:
      • Availability
      • Performance
      • Capacity
      • Security

The following picture shows a generic optimization model for an IT service:

Fig. 2: Service quality optimization model for an IT service.

With such an optimization model at hand we are able to optimize service quality of an OpenStack environment. What we need are clearly defined values for the sets, parameters, constraints and objective functions. We must be able to create a formal notation for all model elements.

What further investigations are required?

The formal model can be created once we know all the information required to assign concrete values to all model elements. This information is:

  • List of all set items (OpenStack system environment plus regular maintenance operations): First we must know all possible values for the systems and operations used in our OpenStack environment. We must know which hardware, OS and software we can use to operate OpenStack and which actions (maintenance) must be performed regularly in order to keep OpenStack up and running.
  • List of all parameters (costs of OpenStack system environment elements, labour costs for maintenance operations and quality improvement per set item): In a second step we must obtain all prices for our set items. This means we must know how much it costs to install a particular piece of hardware, OS or software, and we must know how much the maintenance operations cost in terms of salaries. Additionally we must know the quality improvement which is delivered per set item: this can be determined by testing the environment with and without the item (additional system or service operation) and using our quality metrics.
  • List of constraints (budget limit and technical constraints): In a third step we must get to know the constraints, i. e. budget limits and technical constraints. A technical constraint can be a restriction such as being able to use only one profiling tool.
  • Required outcomes (targeted quality metric value maximization): Once we know the sets, parameters and constraints, we must define how quality is measured in a function. Again we can use our quality metrics for that.
  • Computation of optimal variable values (which items should be bought): Once we know all model elements, we can compute the optimal variable values. Since we will not get a strict mathematical formula for the target function, and since we may also be working with incomplete information, it is obvious that we should use a metaheuristic (like e. g. evolutionary algorithms) to find out how to optimize service quality; a rough sketch of such a search loop follows below.
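A very rough sketch of such an evolutionary search loop is given below. The functions random_config, mutate, satisfies_constraints and measured_quality are placeholders standing for the sets, constraints and quality metrics described above; they would have to be implemented against the real OpenStack environment and are not part of any existing tool.

# Sketch of a metaheuristic (evolutionary) search over OpenStack configurations.
import random

def evolve(random_config, mutate, satisfies_constraints, measured_quality,
           population_size=20, generations=50):
    # a configuration could e.g. be a dict mapping set items to quantities
    population = [random_config() for _ in range(population_size)]
    for _ in range(generations):
        feasible = [c for c in population if satisfies_constraints(c)]
        ranked = sorted(feasible, key=measured_quality, reverse=True)
        parents = ranked[:population_size // 2] or population
        offspring = [mutate(random.choice(parents))
                     for _ in range(population_size - len(parents))]
        population = parents + offspring
    feasible = [c for c in population if satisfies_constraints(c)]
    return max(feasible or population, key=measured_quality)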

We have seen that creating a model for service quality optimization in the cloud requires a lot of investigation. Some details about it will be revealed in further articles.

 

CLEEN 2014 – Call for Papers

The Future Internet will consist of more flexible radio access networks which are less centralized than the network infrastructure we have today. Integration of flexible heterogeneous radio access networks into HetNets will allow further social diffusion of the mobile internet. Therefore the IEEE is exploring novel concepts to allow for flexibly centralised radio access networks using cloud processing based on open IT platforms. The Second International Workshop on “Cloud Technologies and Energy Efficiency in Mobile Communication Networks” (CLEEN) 2014 is scheduled for April 2014 in Istanbul, Turkey. The goal of the workshop is research and discussion of technologies which enable cloud-based radio access networks that allow for high-quality networking in terms of energy efficiency and cost-effectiveness. Building the Future Internet as a cloud-based Internet requires new concepts for the design, operation, and optimization of radio access networks and backhaul networks as well as a tight integration of networks into cloud-processing IT infrastructures. Therefore a call for papers has been issued.

Paper submission deadline is: October 15th 2013

Acceptance Notification: December 15th 2013
Camera-ready: January 10th 2014

Further information about CLEEN 2014 can be found here:

http://www.ict-ijoin.eu/cleen2014/

VTC Fall 2013: The future of Cloud Operating Systems

Thomas M. Bohnert explains his view on the future of Cloud Computing to the audience.

Telecommunication and IT industries must work much more closely together than they do nowadays in order to successfully manage the Internet in the future, especially if we consider that the Internet has become mobile. The future challenges for the mobile internet lie in increasing numbers of mobile end users, rapidly growing data traffic and massive consumption of energy and network resources. Technologies like HetNets and “small cell” networks must be used in combination to efficiently manage data traffic flows and energy consumption. The most promising approach for creating HetNets is “Software Defined Networking” (SDN) integrated into a Cloud environment. Cloud-based SDN provides the elasticity required to integrate the heterogeneous network devices that form a HetNet. Since SDN makes network management more understandable to software developers, this disruptive technology has the potential to build a bridge between IT architects and network managers. This could be the summary of the “First International Workshop on Cloud Technologies and Energy Efficiency in Mobile Communication Networks” (CLEEN), which took place on September 2 in conjunction with the IEEE Vehicular Technology Conference (VTC).

Cloud-based SDN will be a key technology for facing the challenges of the future Internet, said keynote speaker Artur Hecker, a researcher working at the Chinese telecommunication company Huawei. Though this is a positive message, Hecker admits that Cloud Computing still has some difficulties in delivering the required levels of availability, performance and security. Therefore Cloud Operating Systems must be made more secure and reliable than they are at the moment. A major technical challenge will be carrier-grade availability (99.999% availability) for Cloud environments. Telcos need high degrees of availability when using Cloud software for delivering telecommunication services due to strong restrictions in their SLA requirements.

The ICCLab was also present with a presentation about the “Dependability Modeling Framework”. Konstantin Benz explained the framework which is used to test availability capabilities of Cloud environments. The audience was particularly interested in such methods to make Cloud Computing ready to be used as an essential part of the architecture of the “Future Internet”. Cloud-based SDN should follow strict carrier grade requirements.

The CLEEN workshop closed with a panel discussion about the future of Cloud Computing, Cloud-based SDN and its role in the “Future Internet”. Thomas M. Bohnert stated that Cloud Computing should be more than just a marketing buzzword: it should simplify delivery of telecommunication services rather than become a redesign of current telecommunication architectures, especially because access to the Internet has become a commodity in recent years. Cloud Computing can enhance telecommunication services by adding elasticity to (currently) inflexible network architectures, but network architects should be aware that it is not the goal of cloud-based telecommunication technology to simply mimic the behaviour of conventional networks. An advantage could be the combination of cloud-based SDN with data collection applications, which combines the Cloud Computing approach with Big Data and turns telecommunication services into a data-driven product. This could turn networking, which has become a commodity, into a promising new product for the Future Internet.

 

VTC Fall 2013: can vehicles be intelligent?

VTC Fall 2013 Conference

Making vehicles intelligent was one of the major topics this Wednesday. Transportation is an influential factor in economic growth, but it faces many challenges in the areas of safety, mobility (traffic jams) and protection of the environment. Building intelligent traffic surveillance systems is a major goal of US Department of Transportation researchers James Pol and Walton Fehr.

Intelligent traffic surveillance systems rely heavily on in-vehicle information systems which are connected to each other. The information systems in cars can help to collect the transportation data which is required to predict traffic jams, avoid car accidents and measure pollution. As long as vehicles are not equipped with information systems, data-driven analysis of traffic is not possible. James Pol refers to this issue as a “chicken-and-egg problem”: we need intelligent vehicles in order to have functioning traffic surveillance, but the individual driver might require intelligent traffic surveillance to be in place before investing in (possibly expensive) in-vehicle information systems.

There is much research activity going on in both areas: intelligent cars with in-vehicle information systems are developed as well as networks that connect the different in-vehicle information systems.
Walton Fehr presented some of the ongoing US Department of Transportation “Research and Innovative Technology Administration” (RITA) projects that address the current challenges in building intelligent traffic and transportation systems. Connected Vehicle Technology is a project which has the goal of developing and deploying a communication platform to fully connect all vehicles in a transportation system. Vehicles (or more precisely: in-vehicle information systems) form mobile ad-hoc networks (MANETs) which allow vehicle drivers to collect and exchange data on the overall traffic as well as on other vehicles. This technology offers opportunities to researchers who can develop applications for the connected-vehicle MANETs.

Future research directions will be:

  • Interoperability of different in-vehicle information systems,
  • Automation of traffic data collection and
  • Management of the traffic data.

For more information about the ongoing projects visit the following site:

http://www.its.dot.gov/

OpenStack HA: why is Pacemaker such a slow recovery tool?

If you have ever tried to implement High Availability in OpenStack by using Pacemaker, you might be disappointed by Pacemaker’s extremely slow recovery speed. Pacemaker recovers OpenStack at a very slow pace, and even worse: it sometimes detects outages when none have occurred. As a result Pacemaker starts unnecessary, computationally intensive recovery actions which are very slow and decrease OpenStack’s availability. This article describes why Pacemaker recovery actions are sometimes slow and what can be done about it.

Pacemaker is distributed software that monitors and controls the execution of programs or services on different computers in a cluster. The controlled services are called “resources” and Pacemaker needs a “resource agent” interface in order to be able to manage a resource. Resource management actions are performed by programs that run locally on each computer of the cluster: the “Local Resource Management Daemons” (LRMDs). LRMDs are programs that can monitor the execution of services and restart them in case of failure. The LRMD actions are orchestrated by the “Cluster Resource Manager” (CRM). LRMDs know how to manage resources (from the resource agent specifications), but they do not monitor, stop or restart local IT services autonomously: the CRM has to tell them when and at what time interval they have to perform failover actions. The CRM is configured by a distributed XML file: the “Cluster Information Base” (CIB). The CIB contains all the information that is necessary to orchestrate the LRMD actions. The communication between the CRM and the LRMDs is handled by a “Cluster Communication Manager” (CCM). Typical CCMs used in combination with Pacemaker are Corosync or Heartbeat.

Fig. 1: OpenStack HA with Pacemaker.

OpenStack can be made highly available by installing redundant OpenStack services (Keystone, Nova, Glance etc.) on different machines and letting Pacemaker control the execution of the OpenStack services. Custom resource agents must be installed in order to allow the LRMDs to manage OpenStack resources. Then the CIB must be configured so that the CRM can orchestrate the LRMD actions. An example of such an OpenStack HA architecture using Pacemaker is shown in Fig. 1.

Why is Pacemaker slow?

Sometimes you may find that Pacemaker failover actions are very slow. There are several possible reasons why the Pacemaker recovery of OpenStack is such a time-consuming task. The most common ones are these:

  • Suboptimal initialization scripts: OpenStack services do not write their process identifier (pid) to a pid file by default. Therefore Pacemaker is not able to identify OpenStack services as manageable entities or resources. Some hacking is necessary in order to make the OpenStack services Pacemaker-compliant.
  • Custom resource agents: there are no OCF-compliant OpenStack resource agents delivered out of the box. Pacemaker’s Local Resource Management Daemons (LRMDs) are therefore not able to manage OpenStack services.
  • Bad Cluster Information Base (CIB) configuration: The worst thing is a messy CIB configuration. If e. g. recovery tasks are kept in large groups and monitoring intervals are too long to discover outages quickly, Pacemaker recovery will act very slowly, because Pacemaker has to recover large resource groups and recovery actions are started late.

What can be done to make Pacemaker faster?

The first and most important step to make Pacemaker recovery faster is to identify the cause of the slowness. Once you have done that, you can take one of the following actions:

  • Optimize initialization scripts: Depending on your initialization system (SysV init, Upstart, systemd), you must customize the startup of services in order to generate pid files which help Pacemaker to identify the service on the system. OpenStack services on Ubuntu are started by the Upstart system. If you run OpenStack on Ubuntu, you must customize the Upstart scripts so that they generate pid files automatically. This can be done by changing the configuration files in /etc/init. For the quantum server e. g. you have to change the /etc/init/quantum-server.conf file to contain several lines which tell the Upstart daemon to create a pid file and place it in a specified folder (typically /var/run); a sketch of such a stanza follows this list. Creation of pid files can be performed using the start-stop-daemon. For more information on the start-stop-daemon read its manpage.
  • Create custom resource agents: there are no OpenStack resource agents delivered out of the box, but you can create them if you want. Resource agents must be placed in the /usr/lib/ocf/resource.d/ folder. They must contain methods to monitor, start and stop services as well as a method to control the execution status of the service. Some good examples for OpenStack resource agents can be found on the Hastexo website.
  • Improve Cluster Information Base (CIB) configuration: Most improvements can be made by changing the CIB configuration. Ideally OpenStack services should run redundantly at the same time on two different OpenStack nodes which can be reached via a shared virtual IP. In case of a service failure on one node, Pacemaker just has to route traffic to the node where the service is still running. If the service is not already running on the fallback node before the failure occurs, Pacemaker has to start the service on at least one of the nodes. A small context switch is usually much faster than starting up whole services. Therefore redundant nodes must always keep redundant OpenStack services up and running. It is really important to ensure that parallel execution of redundant services is configured in the CIB file.
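As an illustration of the first point, an Upstart stanza for the quantum server could look roughly like the sketch below. The paths, the service user and the configuration flags are assumptions and should be checked against the stock /etc/init/quantum-server.conf of your distribution before changing anything:

# /etc/init/quantum-server.conf (sketch; paths and options are assumptions)
description "Quantum API server"
start on runlevel [2345]
stop on runlevel [!2345]
respawn
exec start-stop-daemon --start --chuid quantum \
    --make-pidfile --pidfile /var/run/quantum/quantum-server.pid \
    --exec /usr/bin/quantum-server -- \
    --config-file /etc/quantum/quantum.conf \
    --log-file /var/log/quantum/server.log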

If you improve OpenStack initialization scripts, optimize OpenStack resource agents and improve the CIB configuration, Pacemaker should be a great tool to make OpenStack services highly available.

ICCLab @ CLEEN 2013 in Las Vegas

The “Dependability Modeling Framework” (DMF) becomes famous: Konstantin Benz and Thomas M. Bohnert will present their newest paper about the Dependability Modeling Framework at the First International Workshop on “Cloud Technologies and Energy Efficiency in Mobile Communication Network” (CLEEN) which takes place from September 2-5 in Las Vegas. The ICCLab researchers will show a methodology on how to test system architectures for their ability to implement High Availability characteristics in the cloud. Thomas M. Bohnert will also present a poster which shows how the DMF is applied to the Mobile Cloud Networking (MCN) project.


The CLEEN workshop is the first IEEE workshop dedicated to the topic of energy efficiency in mobile communication. It is a joint initiative of three ICT projects funded by the European Commission under the Seventh Framework Programme (FP7). The CLEEN workshop is organized in conjunction with the VTC 2013-Fall conference.

 
