Hardware Extension for Ceilometer

Ceilometer Introduction

Ceilometer is a monitoring tool for OpenStack cloud environments. In the next OpenStack release, Havana, it will become a core component; it is also available for the earlier releases Folsom and Grizzly. Currently Ceilometer only collects data about the OpenStack core components and the virtual machines of the cloud. For this reason the ICCLab decided to extend Ceilometer so that it can collect data from hardware devices as well.

Ceilometer Extension – Concept

The collection of data from hardware devices should be independent and extensible. Therefore a new Ceilometer agent with a modular structure is needed; we call this agent the Hardware Agent. The picture below shows the conceptual architecture.


Conceptual Architecture Ceilometer Hardware Extension

As part of Ceilometer, the Hardware Agent is installed on nearly every physical server. The Hardware Agent should be able to poll data from various sources such as IPMI, SMART or SNMP through Inspectors, and it can do so for different devices. This also allows the Hardware Agent to gather data from devices like switches or routers.
It should be possible to deactivate each of these sources globally or per host. Which data is extracted from a source should also be configurable globally or per host. The structured data is held in Pollsters.
The Hardware Agent sends the collected data to the Ceilometer Event Bus, which is in general a RabbitMQ message queue. The central Ceilometer Collector takes the messages and stores them in a database. Through the Ceilometer API the data can then be read by other systems such as a billing and rating system or a graphical front end.
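
To make the modular structure more concrete, here is a minimal Python sketch of how an Inspector and a Pollster could interact. All class and method names are illustrative assumptions and do not reflect the actual Ceilometer code base.

import abc


class Inspector(metaclass=abc.ABCMeta):
    """Reads raw values from one data source (e.g. SNMP) on a device."""

    @abc.abstractmethod
    def inspect(self, host, config):
        """Yield (metric_name, value) pairs for the given host."""


class SNMPInspector(Inspector):
    def inspect(self, host, config):
        port = config.get("port", 161)                    # default SNMP port
        community = config.get("securityName", "public")
        # A real implementation would query the SNMP agent on host:port with
        # the given community string; here we only emit a placeholder value.
        yield ("network.incoming.bytes", 123456)


class NetworkPollster:
    """Turns raw Inspector values into samples for the Ceilometer event bus."""

    def get_samples(self, inspector, host, config):
        for name, value in inspector.inspect(host, config):
            if name.startswith("network."):
                yield {"meter": name, "volume": value, "resource": host}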

Ceilometer Extension – Configuration Example

The configuration of the Hardware Agent allows the administrator to deactivate Pollsters and Inspectors globally or per device/host. The host-specific settings only take effect when no global settings are set.
To access the sources it might be necessary to set additional information such as a username or password. This configuration can be set globally or per host. If there is no host configuration, the Hardware Agent uses the global configuration; if neither is set, the Inspector falls back to default values to access the sources.
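
This fallback order (host configuration first, then the global configuration, then defaults) can be sketched roughly in Python as follows; the function and the dictionary layout are hypothetical and merely mirror the configuration file example shown further below.

DEFAULTS = {"snmp": {"port": 161, "securityName": "public"}}


def inspector_config(inspector, host, global_conf, host_conf):
    """Return the settings an Inspector should use for the given host."""
    per_host = (host_conf.get(host, {})
                         .get("inspector_configurations", {})
                         .get(inspector))
    if per_host is not None:            # 1. host-specific configuration
        return per_host
    if inspector in global_conf:        # 2. global configuration
        return global_conf[inspector]
    return DEFAULTS.get(inspector, {})  # 3. built-in default values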


Configuration Sequence of the Hardware Agent

These configurations can be set in /etc/ceilometer/hardware-agent.conf as follows:

disabled_hardware_pollsters=network
disabled_hardware_inspectors=ipmi
hardware_inspector_configurations = {"snmp" :
{"securityName": "public", "port": 161}}

hardware_hosts={"10.0.0.1" :
{"disabled_pollsters": ["cpu"],
"disabled_inspectors": ["smart"],
"inspector_configurations":
{"snmp" : {"port": 163}}},
"10.0.0.2" :{…}}

Ceilometer Extension – Status and Prospect

With the implemented extension it is possible to collect data on CPU (1, 5 and 15 minute load in %), network (incoming/outgoing traffic in bytes, number of errors), storage (used/total space in bytes) and memory (used/total space in bytes) from any device over SNMP. With new Inspectors it is conceivable to gather more data from additional sources such as IPMI or SMART.
The base of the Ceilometer extension is currently being reviewed by other Ceilometer developers (Review). If the review succeeds, the extension will be part of the Havana release of OpenStack in October 2013.

ICCLab @ CLEEN 2013 in Las Vegas

The “Dependability Modeling Framework” (DMF) is gaining recognition: Konstantin Benz and Thomas M. Bohnert will present their latest paper on the Dependability Modeling Framework at the First International Workshop on “Cloud Technologies and Energy Efficiency in Mobile Communication Network” (CLEEN), which takes place from September 2-5 in Las Vegas. The ICCLab researchers will present a methodology for testing system architectures for their ability to provide High Availability characteristics in the cloud. Thomas M. Bohnert will also present a poster showing how the DMF is applied to the Mobile Cloud Networking (MCN) project.


The CLEEN workshop is the first conference of the IEEE dedicated to the topic of energy efficiency in mobile communication. It is a joint initiative of three ICT projects funded by the European Commission under the Seventh Framework Programme (FP7). The CLEEN workshop is organized in conjunction with the VTC 2013-Fall conference.

SmartOS Series: A SmartOS Primer

Some time back we introduced a piece of work we have been pursuing: OpenStack on SmartOS. Today we start a new blog post series digging into SmartOS and its features, beginning with a quick introduction to get everyone started with this platform.

SmartOS is an open source live operating system mainly dedicated to offering a virtualisation platform. It is based on illumos, which in turn is derived from OpenSolaris, and thus inherits many Solaris-like features such as zones, ZFS and DTrace. Joyent, the company behind SmartOS, further enhanced the illumos platform by adding a port of KVM and features like I/O throttling. The core features of SmartOS will be the topic of the next posts in this series. Thanks to these features, SmartOS is a perfect candidate for a true Cloud OS.

The following presentation will walk you through the basic tasks to setup, configure and administer SmartOS:

In the next posts we will cover SmartOS virtualisation (Zones and KVM), SmartOS storage (ZFS), SmartOS networking (Crossbow) and SmartOS observability (DTrace).

Distributed File Systems Series: Ceph Introduction

With this post we are going to start a new series on Distributed File Systems. We are going to start with an introduction to a file system that is enjoying a good amount of success: Ceph.

Ceph is a distributed, parallel, fault-tolerant file system that can offer object, block, and file storage from a single cluster. Ceph’s objective is to provide an open source storage platform that is highly available, highly scalable and has no single point of failure.

A Ceph Cluster has three main components:

  • OSDs: Ceph Object Storage Devices (OSDs) are the core of a Ceph cluster and are in charge of storing data, handling data replication and recovery, and rebalancing data. A Ceph Cluster requires at least two OSDs. OSDs also check other OSDs for a heartbeat and provide this information to the Ceph Monitors.
  • Monitors: A Ceph Monitor keeps the state of the Ceph Cluster using maps, e.g. the monitor map, the OSD map and the CRUSH map. Ceph also maintains a history, called an epoch, of each state change in the Ceph Cluster components.
  • MDSs: A Ceph MetaData Server (MDS) stores metadata for the Ceph FileSystem clients. Thanks to Ceph MDSs, POSIX file system users are able to execute basic commands such as ls and find without overloading the OSDs. Ceph MDSs can provide both metadata high availability (multiple MDS instances, at least one in standby) and scalability (multiple MDS instances, all active and managing different directory subtrees).

Ceph Architecture (Source: docs.openstack.org)

One of the key features of Ceph is the way data is managed. Ceph clients and OSDs compute data locations using a pseudo-random algorithm called Controlled Replication Under Scalable Hashing (CRUSH). The CRUSH algorithm distributes the work amongst clients and OSDs, which frees them from depending on a central lookup table to retrieve location information and allows for a high degree of scaling. CRUSH also uses intelligent data replication to guarantee resiliency.
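
As a rough illustration of this client-side placement, the following minimal sketch uses the python-rados binding to write and read an object; the pool name "data" and the path to ceph.conf are assumptions for illustration.

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

ioctx = cluster.open_ioctx('data')        # open an I/O context on a pool
try:
    # The client library computes the object's placement with CRUSH and
    # talks to the responsible OSDs directly; no central lookup is involved.
    ioctx.write_full('hello-object', b'hello ceph')
    print(ioctx.read('hello-object'))
finally:
    ioctx.close()
    cluster.shutdown()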

Ceph allows clients to access data through different interfaces:

  • Object Storage: The RADOS Gateway (RGW), the Ceph Object Storage component, provides RESTful APIs compatible with Amazon S3 and OpenStack Swift. It sits on top of the Ceph Storage Cluster and has its own user database, authentication, and access control. The RADOS Gateway uses a unified namespace, which means that you can write data using one API, e.g. the Amazon S3-compatible API, and read it with another, e.g. the OpenStack Swift-compatible API (a short access sketch follows this list). Ceph Object Storage doesn’t make use of the Ceph MetaData Servers.

Ceph Clients (Source: ceph.com)

  • Block Devices: The RADOS Block Device (RBD), the Ceph Block Device component, provides resizable, thin-provisioned block devices. The block devices are striped across multiple OSDs in the Ceph cluster for high performance. The Ceph Block Device component also provides image snapshotting and snapshot layering, i.e. cloning of images. Ceph RBD supports QEMU/KVM hypervisors and can easily be integrated with OpenStack and CloudStack (or any other cloud stack that uses libvirt).
  • Filesystem: CephFS, the Ceph Filesystem component, provides a POSIX-compliant filesystem layered on top of the Ceph Storage Cluster, meaning that files get mapped to objects in the Ceph cluster. Ceph clients can mount the Ceph Filesystem either as a kernel object or as a Filesystem in User Space (FUSE). CephFS separates the metadata from the data, storing the metadata in the MDSs and the file data in one or more OSDs in the Ceph cluster. Thanks to this separation the Ceph Filesystem can provide high performance without stressing the Ceph Storage Cluster.
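
As mentioned in the Object Storage item above, here is a minimal sketch of S3-compatible access to the RADOS Gateway using the boto library (version 2); the endpoint, the credentials and the bucket name are made-up placeholders.

import boto
import boto.s3.connection

conn = boto.connect_s3(
    aws_access_key_id='RGW_ACCESS_KEY',
    aws_secret_access_key='RGW_SECRET_KEY',
    host='rgw.example.com',                       # RADOS Gateway endpoint
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

bucket = conn.create_bucket('demo-bucket')
key = bucket.new_key('hello.txt')
key.set_contents_from_string('written via the S3 API')
# Thanks to the unified namespace, the same object could now also be read
# back through the Swift-compatible API.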

Our next topic in the Distributed File Systems Series will be an introduction to GlusterFS.

30th Birthday of the Swiss Informatics Society

30th birthday of the Swiss Informatics Society

The 30th birthday of the Swiss Informatics Society (SI), held on Tuesday 25 June in Fribourg (CH), concluded successfully with more than 200 participants, who in total attended the thematic workshops in the morning, the inaugural meeting of the Swiss AIS Chapter and the plenary in the afternoon.

Below we summarize the Cloud Computing workshop, moderated by ZHAW ICCLab, and the award ceremony.

Workshop: Cloud Computing in Switzerland

Cloud Computing is transforming the IT industry, and this concerns a high-tech country like Switzerland in particular. The resulting potentials and risks need to be well understood in order to fully leverage the technical as well as economic advantages. This workshop provided an overview of current technological and economic trends with a particular focus on Switzerland and its Federal Cloud Computing strategy.

8:45 – 9:00  Intro by Christof Marti (ZHAW)
Workshop introduction, goals and activities on Cloud Computing at ZHAW.

The Cloud Computing Special Interest Group (SIG), whose formation is coordinated by ZHAW ICCLab, was introduced with its overall goal of stimulating the knowledge, implementation and development of Cloud Computing in industry, research, SMEs and education. The kick-off meeting is foreseen for September (watch si-cc-sig or the LinkedIn group for more details). Further information was presented on the InIT Cloud Computing Lab (ICCLab), a research lab dedicated to Cloud Computing in the focus area of Service Engineering, covering important research themes and cloud initiatives such as Automation, Interoperability, Dependability, SDN for Clouds, Monitoring, Rating, Charging, Billing and Future Internet platforms.

9:00-09:20  Peter Kunszt  (SystemsX)
Cloud computing services for research – first steps and recommendations

The view of the scientific community on technological trends and the opportunities offered by Cloud Computing infrastructures. An interesting start to the workshop by the project leader of SyBIT (SystemsX.ch Biology IT: SyBIT), with an overview of possible cloud services for science and education, recommendations concerning commercial vs. self-made clouds and possible pricing & billing models for science.

9:20-09:40 Markus Brunner (Swisscom)
Cloud/SDN in Service Provider Networks

Markus illustrated “why a new network architecture” with a feature comparison of aging (static) network technology and the current (dynamic) trend against global needs such as cost effectiveness, agility and service orientation. The proposal was to look at new infrastructures based on SDN (Software Defined Networking) and NFV (Network Function Virtualisation). NFV is concerned with porting network or telecommunications applications, which today typically run on dedicated and specialized hardware platforms, to virtualized Cloud platforms. Some basic architectures and the interplay of NFV and SDN were discussed. The presentation concluded with an analysis of today's challenges for Cloud technologies in communication-oriented applications: real-time behaviour, security, predictable performance, fault management in virtualized systems and fixed/mobile differences.

9:40-10:00  Sergio Maffioletti (University of Zurich)
A roadmap for an Academic Cloud 

“The view of the scientific community on how cloud technology could be used as a foundation for building a national research support infrastructure”. An interesting and innovative presentation by Sergio, starting from a “why and what’s wrong” analysis and moving through the initiatives in place (new platforms, cloud utilisation and long-term competitiveness objectives). The presentation also gave an overview of how this is implemented with the National Research Infrastructure program (the Swiss Academic Compute Cloud project) and innovative management systems (a mechanism to collect community requirements and implement technical services and solutions). The presentation concluded with objectives and targets such as interoperation, intra/inter access to institutional infrastructure, cloud enablement, research clustering and national computational resources.

10:00-10:20 Michèal Higgins  (CloudSigma) – remote
CloudSigma and the Challenges of Big Science in the Cloud

Switzerland-based CloudSigma is a pure-cloud IaaS service provider, offering highly available, flexible, enterprise-class cloud servers in Europe and the U.S. It offers innovative services such as all-SSD storage, high-performance solutions and firewall/VPN services. Helping to build a federated cloud platform (Helix Nebula) that addresses the needs of big science, CloudSigma sees the biggest challenge and value in having huge data sets available close to the computing instances. In conclusion, CloudSigma offers the science community free storage of common big data sets close to their compute instances, reducing the cost and time needed to transfer the data.

10:20-10:40 Muharem Hrnjadovic (RackSpace)

An overview of key capabilities of cloud-based infrastructures such as OpenStack, together with challenging scenarios, was presented during this session.

10:40-10:45 All
Q&A session

Swiss Informatics Competition 2013

Aside from speakers and panel discussions, captivating student projects (Bachelor's & Master's in Computer Science) from universities and specialized high schools were presented to illustrate the diversity of computing technologies. Projects selected by a team of experts were also given awards. The details on the student projects are available here.

Some photos taken at the cloud computing workshop, the plenary and the closing award ceremony:


How to apply the 7-Step Continual Service Improvement Process in a Cloud Computing Environment

How good was my last configuration change? The following article shows how to implement the “7-Step Continual Service Improvement”-Process for a cloud computing environment.

Why is Continual Service Improvement important?

Delivering an IT service (such as a cloud) is not a project. A project is something out of the ordinary and has a clearly defined beginning and an end. Running and operating an IT service is a continuous task. Service consumers expect the IT service to be available whenever they need it. An IT service is supposed to be used regularly for regular business tasks and for an indefinite time frame. Even if IT services have a limited lifetime, they are expected to run as long as it is possible to maintain them.

No IT service operates without errors. Bad user behaviour or misconfiguration can cause operating failures. Therefore IT services must be maintained regularly. Because IT services are used continuously, such improvement and maintenance tasks must be performed repeatedly. While the usual operation of the IT service is expected to be continuous (or at least very close to it), service interruptions occur unexpectedly and maintenance tasks are performed step by step. Therefore service improvement is called “continual” rather than “continuous”.

The “7-Step Continual Service Improvement” process is a best practice for improving IT services. It follows the steps outlined in Fig. 1. If we want to establish the 7-Step process for a cloud service, we must describe how each of the steps can be applied to it. We must define what we do at each step and what outcomes we expect from it. These definitions are given in the following seven sections.


Fig. 1: 7-Step Continual Service Improvement Process.

Step 1: What should be measured?

In order to compare your system configuration before and after the change, you must measure the configuration. But what should be measured? A good approach is to use “critical success factors” and deduce “key performance indicators” from them. Critical success factors are important goals for the organization running the system. Should the cloud provider company be seen as a very reliable provider? Then reliability is a critical success factor. Key performance indicators are aspects of the observed system that indicate success of the organization which operates the system. They can be deduced directly from the critical success factors. If e. g. the provider must be reliable, the cloud system should be highly available.

As a first step in the process you should create a list of key performance indicators: they are the important aspects you want to measure in your cloud. Such aspects could be:

  • Availability
  • Performance
  • Capacity

At this stage you should not be too specific about the metrics you want to use, because otherwise you would start to confuse key performance indicators with performance metrics. While key performance indicators tell you what should be measured, performance metrics define how it can be measured. Do not confuse the what and the how of measurement.

You should only state general performance indicators you want to know more about, e.g. that you want your cloud operating system to be highly available or that you want service consumers to work with high-performing instances. The result is always a list of (positive) aspects you want your system to have.

For this example we say that the cloud operating system should be:

  • Highly available: we want low downtimes and small outage impacts.
  • High performing: we want a fast responding cloud operating system.
  • Highly receptive: we want the cloud operating system to have enough free disk space to manage virtual machines, disk images, files etc.

Step 2: What can be measured?

Once you have a list of aspects, you should consider how they can be measured. You should define performance metrics for the key performance indicators. Availability could e. g. be measured indirectly by measuring downtime during a time period. Performance can be measured by performing some queries and measuring the response time. As we can see, not all performance indicators can be measured directly. We must construct metrics for every indicator in order to measure a system.

In this example we define the following metrics:

  • Availability: We regularly poll the cloud operating system. If an outage occurs, we measure the downtime and the impact of the outage.
  • Performance: A test query (like e.g. upload some data to an instance) should be sent regularly to the cloud operating system. The response time of the query should be measured.
  • Capacity: The disk utilization of cloud operating system nodes can be measured regularly.
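
The mapping between the indicators from step 1 and the metrics defined here can be written down compactly; the structure below is only an illustrative sketch, not a prescribed format.

METRICS = {
    # key performance indicator -> how we will measure it (the metric)
    "availability": "downtime and impact per outage, found by polling the cloud",
    "performance": "response time of a periodic test query against the cloud",
    "capacity": "disk utilization of the cloud operating system nodes",
}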

Step 3: Gather data

Once we have defined the performance metrics, we should think about how we can collect data to assign values to the metrics. This step is about developing tools for service monitoring. We should think about how we can measure things like downtime, outages, response time etc. Two techniques for data gathering are very important:

  • Polling of IT services: Data is gathered by regularly polling the IT service and checking for occurrence of some events (like e. g. server is not available). Polling mechanisms must run periodically during a given time frame.
  • Direct Measurement: Data is gathered directly by checking some system configuration values. The check runs only once at a given point in time.

An important aspect is choosing the time when we measure something and the frequency of measurements. Should we measure something once per day or should we rather measure something per hour or even per minute? And once we have chosen our frequency we must define the time frame on which measurements should take place.

In this example we gather data for three months and we measure everything according to the following frequencies:

  • Impact and Downtime: We could poll every 100 seconds if an outage occurred. If an outage is detected, the impact can be measured directly as a predefined value that follows our dependability model.
  • Response Time: Every hour we could start a script which runs some test queries. Thereby we measure the time for completion of the query. The response time value is then stored as data.
  • Disk utilization: This metric need not be polled very often. It can be measured daily using a direct measurement technique. We simply check the used disk space and divide it by the total space of the available disks.

By using the data gathering techniques described above, we collect values for impact, downtime, response time and disk utilization.
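
A minimal data-gathering sketch could look as follows, assuming the cloud API is reachable as a plain HTTP endpoint and disk utilization is read from the local filesystem; the URL, the path and the output file are placeholders, and a cron job or scheduler would invoke the script at the chosen frequencies.

import csv
import shutil
import time
import urllib.request

CLOUD_API = "http://10.0.0.1:8774/"   # hypothetical endpoint to poll


def poll_availability():
    """Return (reachable, response_time_in_seconds) for one poll."""
    start = time.monotonic()
    try:
        urllib.request.urlopen(CLOUD_API, timeout=10)
        return True, time.monotonic() - start
    except OSError:
        return False, None


def measure_disk_utilization(path="/var/lib/nova"):
    usage = shutil.disk_usage(path)    # direct measurement, taken once
    return usage.used / usage.total


with open("samples.csv", "a", newline="") as f:
    writer = csv.writer(f)
    up, response_time = poll_availability()
    writer.writerow([time.time(), up, response_time, measure_disk_utilization()])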

Step 4: Process the data

The collected data must now be processed in order to be analysed. In this step we must think about aggregation functions and how to aggregate, so that we can make meaningful statements about the gathered data.

When we collect data for three months, we can’t do anything useful with it if we do not aggregate the data somehow. We can either sum up the collected data or calculate an average. The aggregation function depends on the scale of the data we collected: if we collected e. g. only categorical data, we can only count occurrence of values. If the data can be brought into some meaningful order, we can sum up values. If the data is at least interval-scaled we can calculate averages. Other important aggregation functions are the maximum and minimum of a value.

For this example we chose the following aggregation functions:

  • Total Impact and total downtime: Every time we detect an outage, the impact and the suffered downtime is recorded. In order to aggregate the data, we sum up all downtime suffered and impacts of outages.
  • Average response time: We poll the response time of the cloud regularly, but in order to get an aggregated result we should calculate the average of the response time.
  • Maximum disk utilization:  It is better to measure the maximum disk utilization instead of the average utilization since we must find out if a critical threshold is reached. If a disk is full, additional data can not be saved. Therefore the peak disk utilization is the value we want to monitor.

In order to make the data analysable we must also think about the dispersion of the data. Aggregate functions are very sensitive to extreme values and outliers. Rare outliers can distort average values. If we have e. g. a lot of short response times and then suddenly an extremely large value (e. g. a CPU-intensive batch procedure), the average will get a large value which is not very representative of the actual measurements. Dispersion functions such as variance and standard deviation measure how far the data is spread around an average value. They are quite useful when we want to know more about the meaningfulness of an average value. Therefore we must also define the dispersion functions we want to measure.

For this example we chose the following dispersion measurements:

  • Range of impact and downtime: Since impact and downtime are measured in terms of frequencies (impact and downtime increase when an outage occurs), we must choose the range (difference between minimum and maximum downtime) as the dispersion measure. By gathering this data we can find out e. g. whether we have many small outages or few large outages.
  • Standard deviation of response time: Since the response time is measured as a continuous number, we chose the standard deviation as our dispersion measurement function.
  • Standard deviation of disk utilization: Disk utilization will grow continually over time, but sometimes disk utilization is reduced due to maintenance work and other activities. Disk utilization growth is not linear and continuous. Therefore we should measure changes of disk utilization and take the standard deviation as our dispersion function.
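
Putting the chosen aggregation and dispersion functions together, a small sketch with Python's statistics module might look like this; the sample values are placeholders for the data gathered in step 3.

import statistics

downtimes = [120, 45, 300]              # seconds per detected outage
response_times = [0.8, 1.1, 0.9, 4.2]   # seconds per hourly test query
disk_utilizations = [0.61, 0.64, 0.66]  # daily maximum share of used space

total_downtime = sum(downtimes)                        # aggregation: sum
downtime_range = max(downtimes) - min(downtimes)       # dispersion: range

avg_response = statistics.mean(response_times)         # aggregation: average
response_stdev = statistics.stdev(response_times)      # dispersion: std deviation

max_disk = max(disk_utilizations)                      # aggregation: maximum
disk_deltas = [b - a for a, b in zip(disk_utilizations, disk_utilizations[1:])]
disk_delta_stdev = statistics.stdev(disk_deltas)       # dispersion of changes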

Step 5: Analyse the data

The data we gathered can be seen as a statistical sample with average values, standard deviations etc. In order to do something useful with the data, we must define tests we can apply to the collected statistical samples.
It is obvious that we should not just analyse the data somehow. All data analysis should be performed according to the requirements that we define for our IT infrastructure. These requirements are commonly called “service levels”. Typically we want our infrastructure to achieve some critical values in terms of performance, capacity and availability. Another important aspect of analysis is measuring the impact of changes.

In this step we want to find out if we achieved the values we wanted to achieve. This is done by testing the aggregated data. The most common methods to check data are statistical methods. Statistical methods can be descriptive or inferential. Descriptive statistics reveal characteristics of the data distribution in a statistical sample. In descriptive statistics we check where the average of data is situated and how the data is distributed around the average. In inferential statistics we compare different samples to each other and we induce the value distribution of the population from the value distribution of samples.

Descriptive statistics are needed to check whether the required service level has been achieved or not. Inferential statistics are useful to check whether the achievement came about accidentally or was the result of maintenance work and configuration changes.

For the example of our cloud operating system the following descriptive analytic methods are proposed:

  • Check Availability Level: The availability can be calculated by subtracting the total downtime from the length of the measurement period, multiplying the result by 100 and then dividing it by the length of the measurement period. The result is a percentage value which should be above the availability level required by service consumers. In order to check how availability is distributed, one should also look at the dispersion of the downtime values (a small calculation sketch follows this list).
  • Check Outage Impact: The total impact of outages should be below a certain level. One can also calculate the mean impact size and variance of impacts in order to see if we have many small outages or few severe outages.
  • Check Average Response Time: In order to check the response time one should calculate the average of the average response time as well as the variance of the response time.
  • Check Maximum Disk Utilization: The maximum disk utilization should be checked to see whether it exceeds some critical value. In order to see if the disk utilization grows rapidly or slowly, one should also check the average of the maximum disk utilizations as well as the variance of the maximum disk utilization.
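
For the availability check mentioned in the list above, a minimal calculation sketch could look like this; the three-month measurement period and the 99.5 % target are example values.

PERIOD = 90 * 24 * 3600        # three months of observation, in seconds
REQUIRED_AVAILABILITY = 99.5   # percent, from the service level agreement

total_downtime = 4200          # seconds, aggregated in step 4

availability = (PERIOD - total_downtime) * 100.0 / PERIOD
print("availability: %.3f %%" % availability)
print("service level met:", availability >= REQUIRED_AVAILABILITY)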

Descriptive analytics only reveal whether you kept the required service level during the observed timeframe. In order to see whether this result was achieved by chance or not, further statistical tests are needed. The following steps must be performed too:

  • Test distribution of values: As a first step you should check the distribution of the values like outage time, impact, response time and disk utilization. If the values follow a normal distribution, you should choose other statistical tests than you would take if they were not normally distributed. Tests for distribution of values are the “Anderson-Darling-Test“, the “Cramer-von Mises-Test” and the “Kolmogorov-Smirnov-Test”. In order to use these tests you must use the data you gathered in step 3.
  • Check if average differs from critical value: In inferential statistics we want to know if the measured average value was achieved as a result of our efforts to maintain the IT infrastructure or if it is a random value generated by accident. For this reason we compare the average value to either the value we expected from previously defined service levels or to the average value of another sample which is a data set we gathered from previous iterations of the 7-Step process. If it is your first iteration you can only make comparisons between the current data set and the service level. Otherwise you are able to compare data of your previous iteration with data of the current iteration. The goal of this analysis is to see if there is a significant difference between the average and a critical value (either the average value required in the service level agreement or the average value of the previous sample). “Significant” means that if we assume that the difference is not equal to zero there is only a small error probability α (usually below a previously chosen significance threshold of 5 percent) that the difference is in fact equal to zero. There are quite a few statistical tests to prove if differences between average values are significant. Once we know the distribution of values, we are able to test if the difference between the measured average value and the critical value is significant. If sampled values are not normally distributed, you should choose a non-parametric test like e. g. the “Wilcoxon-Signed-Rank-Test” to test the significance of the difference. If the samples follow a normal distribution, you should rather choose a parametric test like the “Student’s t-Test“. Parametric tests are generally more powerful than non-parametric tests, but they rely on the assumption that values follow a particular distribution. Therefore they are not always applicable. The interpretation of such a test is quite straightforward: when we measure a negative difference between measurement and critical value, we have to take corrective actions. If the difference between the measured average and the critical value is positive and significant, the difference can be considered as a result of our efforts. Otherwise it could mean that we achieved the better-than-required value only by chance. In that case we should think about corrective actions too, because the “better” value does not necessarily mean that the infrastructure is well-maintained.
  • Check if variance differs from critical value: Generally speaking lower variance is preferable to high variance in a cloud computing environment. Low variance means that you have rather few extreme values and therefore your cloud computing environment is more scalable. Though variance should be kept low, it is not always possible to really do so. If your cloud environment is e. g. used to generate shopping websites, you have unavoidably varying traffic which makes response time, disk utilization and even availability vary too. But even in such cases it is always better to know variance well than not knowing anything at all. Knowledge about increasing variance makes you aware of imminent performance problems or other risks. For this reason variance should be compared to previously collected data of former iterations of the 7-Step process. As is the case when you check average values, you might want to know the difference between variances and you might want to know if this difference occurred by accident or if it has another cause. Another interesting thing is to know if the variance of different samples did not change over time (their difference is equal to 0). The property of different samples having homogeneous variances is called “homoscedasticity”. There are also quite a few statistical tests to prove if two samples have equal variances. It depends on the distribution of the sample data which test should preferably be taken. If your sample follows a normal distribution, you should take an “F-Test”. If data is distributed non-normally, you should take more robust tests like “Levene’s Test” or the “Brown-Forsythe-Test“. These tests decide if the difference in variances is significant. The interpretation of a test result is that if there is a significant difference, then you should consider why it occurred. You should also prepare some corrective actions. (A small SciPy-based sketch of these checks follows this list.)
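
A small sketch of these checks with SciPy might look as follows; the sample data, the 2-second response-time target and the 5 % significance level are example values.

import numpy as np
from scipy import stats

current = np.array([0.8, 1.1, 0.9, 1.4, 1.0, 1.2])   # this iteration (seconds)
previous = np.array([1.6, 1.9, 1.7, 2.3, 1.8, 2.0])  # previous iteration
TARGET = 2.0      # required average response time from the service level
ALPHA = 0.05      # significance threshold

# 1. Test the distribution of the values (here: Kolmogorov-Smirnov).
_, p_normal = stats.kstest(current, 'norm', args=(current.mean(), current.std()))

# 2. Does the average differ significantly from the critical value?
if p_normal > ALPHA:                     # looks normal -> parametric test
    _, p_diff = stats.ttest_1samp(current, TARGET)
else:                                    # not normal -> non-parametric test
    _, p_diff = stats.wilcoxon(current - TARGET)

# 3. Has the variance changed compared to the previous iteration?
_, p_var = stats.levene(current, previous)

print("difference from target significant:", p_diff < ALPHA)
print("variance change significant:", p_var < ALPHA)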

Even if you have found significant changes in variance or average values, you are still not done. You have to explain why the change occurred. Statistical results must be interpreted to become more than just data.

Step 6: Present and use the information

According to the ITIL V3 standard, information is nothing else than data which is interpreted in some meaningful way. Because we add meaning to the data, we transform it into information. The big advantage of information is that it can be used to take corrective actions and to change something in an IT service.

Let’s say that we have performed a data analysis on our cloud services and we have found that our average response time is significantly worse than what we expected it to be. In that case we have gathered some information out of the data. With that information we can now decide to improve the average response time somehow.

At this stage it is important to interpret the information correctly. There are two ways to interpret information:

  • Reasoning: In order to interpret the results of your statistical tests, you should be able to identify what implications they have for your IT service. If you e. g. discover a significant increase in disk utilization, the logical implication is that you should either try to limit disk utilization or add more storage to your system. Reasoning cannot be fully automated since we must have some common sense knowledge about the IT services we use. There are some approaches to assist people in reasoning tasks though. You could e. g. use so-called “expert systems” to find good logical implications from your analysed data. An “expert system” is software which must be fed with data and uses formal logic to calculate logical implications. These systems could be used as tools to support you in taking decisions about your IT architecture and other aspects of your IT service.
  • Root Cause Analysis: Sometimes the cloud provider might discover a problem by analysing the data. The response time of the system could degrade significantly or there could be a growing trend in the occurrence of outages. As soon as such a problem is discovered one should identify the underlying cause of the problem. This procedure is called “root cause analysis”. In root cause analysis you repeatedly ask yourself what caused the problem and then what caused the cause of the problem. The recursion involved in root cause analysis has a limited depth. Root cause analysis is also a process which can only be performed manually.

By reasoning about the results of your statistical tests you create valuable information. This information serves as a base to take corrective actions.

Step 7: Implement corrective action

The last step in the 7-Step Continual Service Improvement process is to take corrective actions. The corrective actions depend on the information generated in step 6. Since the 7-Step process is a continual improvement process, the last step has also the goal to close the improvement cycle.  This is done by aligning the IT improvements to the business strategy.

The first part of this step is to immediately correct errors in the IT environment. These could be programming errors, bad configuration of hardware or software, bad network designs or even bad architectural decisions. It is important to document what actions we take to correct the problems and what changes are performed in the IT infrastructure. It is also important to reflect the results of the performed actions in the documentation. If we face problems in the implementation of changes, we should document them. Otherwise it could be that our “corrective” actions destabilize our cloud service and we do not know why everything is getting worse. Therefore we should create a report containing the corrective actions and the results of the performed actions.

The second part of this step is to close the process cycle. This is done by reporting the results of the 7-Step process cycle to business decision makers – usually the managers of the cloud provider. In order to restart the process it must also be defined when and how to start the next cycle. This task is also something which must be coordinated with decision makers. For practical reasons a plan for the next cycle must be created and approved.

Continual Repetition of the Improvement Cycle

Once an Improvement Process cycle is finished, it must start over again. Therefore new goals for the Business Strategy must be defined. At this stage we have to take business decisions, and a plan for the next cycle must be drawn up. The goal of this business strategy redefinition is to redefine critical success factors in order to prepare the next measurement and improvement iteration. The Business Strategy is both the main input and the main output of the 7-Step Continual Service Improvement process. Fig. 2 shows the whole process as well as the results we get at each step. It also shows that the 7-Step Improvement process influences and is influenced mainly by the cloud provider’s business strategy.


Fig. 2: Results of the 7-Step Continual Service Improvement Process.

Irena Trajkovska

Irena joined the ICCLab in February 2015 as a researcher. She is working on the Software Defined Networking initiative with a particular focus on architectures and use-case solutions for cloud-based datacenters. As a visiting student at the lab in 2013 Irena was working on SDN QoS applications inside the FI-PPP project KIARA. Currently she is involved in the EU projects T-Nova, Sesame and FIWARE as part of the FI-PPP program, developing SDN tools and libraries that facilitate the creation of value-added networking applications such as: optimized tenant isolation in datacenter networks, resilience (on-demand configuration of physical OpenFlow switches), Service Function Chaining, etc.

Irena received her bachelor’s degree in Computer Engineering, Informatics and Automatic Control at the Faculty of Electrical Engineering and Information Technologies in Skopje, Macedonia. She did her Master Thesis in Networks Engineering and Telematic Services at the ETSI de Telecomunicación faculty of the Universidad Politécnica de Madrid, where she is currently enrolled in a PhD research program. Her PhD research involves the evaluation of architectures and protocols for multimedia streaming based on hybrid cloud computing and P2P solutions, the economic aspects and incentives of P2P-cloud streaming services, and QoS-adaptive streaming solutions using software defined networking.

Contact: traj /at/ zhaw.ch

ICCLab hosting the FI-PPP Architecture Board Meeting

With Phase II getting into full force it is a particular honor for us to host the first Phase II face-to-face Architecture Board meeting of the Future Internet Public-Private Partnership.

Important topics were addressed, such as the previous SB meeting, the status of the current FI-PPP projects, first architecture discussions, the status of FI-PPP capacities, and many more topics related to technical management and coordination.

The FI-PPP is a pioneering endeavor and brings together very different communities and their respective stakeholders. A trustworthy partnership is key to the success of such a research program. It is thus all the more delightful to see the very open and constructive spirit that drove the Phase I Architecture Board continue seamlessly into Phase II with all the new partners. The meeting provides further evidence of the well-functioning operation of the FI-PPP AB. Great thanks to all the participants.


Automated OpenStack High Availability installation now available

The ICCLab developed a new High Availability solution for OpenStack which relies on DRBD and Pacemaker. The OpenStack services are installed on top of a redundant 2-node MySQL database. This 2-node MySQL database stores its data tables on a DRBD device which is replicated across the 2 nodes. OpenStack can be reached via a virtual IP address, which makes users feel as if they were dealing with only one OpenStack node. All OpenStack services are monitored by the Pacemaker tool. When a service fails, Pacemaker restarts it on either node.


Fig. 1: Architecture of OpenStack HA.

The 2-node OpenStack solution can be installed automatically using Vagrant and Puppet. The automated OpenStack HA installation is available in a GitHub repository.