
How to apply the 7-Step Continual Service Improvement Process in a Cloud Computing Environment

How good was my last configuration change? The following article shows how to implement the “7-Step Continual Service Improvement” process in a cloud computing environment.

Why is Continual Service Improvement important?

Delivering an IT service (e. g. a cloud) is not a project. A project is something exceptional with a clearly defined beginning and end. Running and operating an IT service is a continuous task. Service consumers expect the IT service to be available whenever they need it. An IT service is supposed to be used regularly for regular business tasks and for an indefinite time frame. Even if IT services have a limited lifetime, they are expected to run as long as they can be maintained.

No IT service operates without errors. Bad user behaviour or misconfiguration can cause operating failures. Therefore IT services must be maintained regularly. Because IT services are used continuously, such improvement and maintenance tasks must be performed repeatedly. While the usual operation of the IT service is expected to be continuous (or at least very close to it), service interruptions occur unexpectedly and maintenance tasks are performed step by step. This is why service improvement is called “continual” rather than “continuous”.

The “7-Step Continual Service Improvement” process is a best practice for improving IT services. It follows the steps outlined in (Fig. 1). If we want to establish the 7-Step process for a cloud service, we must describe how each of the steps can be applied to a cloud service: we must define what we do at each step and what outcomes we expect from it. These definitions are given in the following seven sections.

Fig. 1: 7-Step Continual Service Improvement Process.

Step 1: What should be measured?

In order to compare your system configuration before and after the change, you must measure the configuration. But what should be measured? A good approach is to use “critical success factors” and deduce “key performance indicators” from them. Critical success factors are important goals for the organization running the system. Should the cloud provider company be seen as a very reliable provider? Then reliability is a critical success factor. Key performance indicators are aspects of the observed system that indicate success of the organization which operates the system. They can be deduced directly from the critical success factors. If e. g. the provider must be reliable, the cloud system should be highly available.

As a first step in the process you should create a list of key performance indicators: they are the important aspects you want to measure in your cloud. Such aspects could be:

  • Availability
  • Performance
  • Capacity

At this stage you should be not too specific on the metrics you want to use, because otherwise you would start to confuse key performance indicators and performance metrics. While key performance indicators tell you what should be measured, performance metrics define how it can be measured. Do not confuse the what and how of measurement.

You should only state general performance indicators you want to know more about, e. g. that you want your cloud operating system to be highly available or that service consumers should work with high-performing instances. The result is always a list of (positive) aspects you want your system to have.

For this example we say that the cloud operating system should be:

  • Highly available: we want low downtimes and small outage impacts.
  • High performing: we want a fast responding cloud operating system.
  • Highly receptive: we want the cloud operating system to have enough free disk space to manage virtual machines, disk images, files etc.

Step 2: What can be measured?

Once you have a list of aspects, you should consider how they can be measured. You should define performance metrics for the key performance indicators. Availability could e. g. be measured indirectly by measuring downtime during a time period. Performance can be measured by performing some queries and measuring the response time. As we can see, not all performance indicators can be measured directly. We must construct metrics for every indicator in order to measure a system.

In this example we define the following metrics:

  • Availability: We regularly poll the cloud operating system. If an outage occurs, we measure the downtime and the impact of the outage.
  • Performance: A test query (like e.g. upload some data to an instance) should be sent regularly to the cloud operating system. The response time of the query should be measured.
  • Capacity: The disk utilization of cloud operating system nodes can be measured regularly.

Step 3: Gather data

Once we have defined the performance metrics, we should think about how we can collect data to assign values to the metrics. This step is about developing tools for service monitoring. We should think about how we can measure things like downtime, outages, response time etc. Two techniques for data gathering are very important:

  • Polling of IT services: Data is gathered by regularly polling the IT service and checking for occurrence of some events (like e. g. server is not available). Polling mechanisms must run periodically during a given time frame.
  • Direct Measurement: Data is gathered directly by checking some system configuration values. The check runs only once at a given point in time.

An important aspect is choosing the time when we measure something and the frequency of measurements. Should we measure something once per day or should we rather measure something per hour or even per minute? And once we have chosen our frequency we must define the time frame on which measurements should take place.

In this example we gather data for three months and we measure everything according to the following frequencies:

  • Impact and Downtime: We could poll every 100 seconds if an outage occurred. If an outage is detected, the impact can be measured directly as a predefined value that follows our dependability model.
  • Response Time: Every hour we could start a script which runs some test queries. Thereby we measure the time for completion of the query. The response time value is then stored as data.
  • Disk utilization: This metric need not be polled very often. It can be measured daily using a direct measurement technique: we simply check the used disk space and divide it by the total space of the available disks.

By using the data gathering techniques described above, we collect values for impact, downtime, response time and disk utilization.
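
The following minimal Python sketch illustrates the polling technique: it probes an assumed health-check URL every 100 seconds and accumulates downtime whenever the endpoint does not answer. The endpoint address, the interval and the simple timeout check are illustrative assumptions, not part of the monitoring setup described above.

    # Minimal polling sketch (illustration only): probe an assumed endpoint
    # every 100 seconds and accumulate downtime when it does not respond.
    import time
    import urllib.request

    ENDPOINT = "http://cloud.example.org:5000/"   # assumed health-check URL
    INTERVAL = 100                                # polling interval in seconds

    def is_available(url):
        """Return True if the endpoint answers within 10 seconds."""
        try:
            urllib.request.urlopen(url, timeout=10)
            return True
        except Exception:
            return False

    downtime = 0
    while True:
        if not is_available(ENDPOINT):
            downtime += INTERVAL          # approximate the downtime by one interval
            print("Outage detected, total downtime: %d s" % downtime)
        time.sleep(INTERVAL)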

Step 4: Process the data

The collected data must now be processed in order to be analysed. In this step we must think about aggregation functions and how to aggregate the data in order to make meaningful statements about it.

When we collect data for three months, we can’t do anything useful with it if we do not aggregate the data somehow. We can either sum up the collected data or calculate an average. The aggregation function depends on the scale of the data we collected: if we collected e. g. only categorical data, we can only count occurrence of values. If the data can be brought into some meaningful order, we can sum up values. If the data is at least interval-scaled we can calculate averages. Other important aggregation functions are the maximum and minimum of a value.

For this example we chose the following aggregation functions:

  • Total Impact and total downtime: Every time we detect an outage, the impact and the suffered downtime is recorded. In order to aggregate the data, we sum up all downtime suffered and impacts of outages.
  • Average response time: We poll the response time of the cloud regularly, but in order to get an aggregated result we should calculate the average of the response time.
  • Maximum disk utilization: It is better to measure the maximum disk utilization instead of the average utilization, since we must find out whether a critical threshold is reached. If a disk is full, additional data cannot be saved. Therefore the peak disk utilization is the value we want to monitor.

In order to make the data analysable we must also think about the dispersion of the data. Aggregation functions are very sensitive to extreme values and outliers. Rare outliers can distort average values: if we have a lot of short response times and then suddenly an extremely large value (e. g. caused by a CPU-intensive batch procedure), the average becomes large and is no longer representative of the actual measurements. Dispersion functions such as variance and standard deviation measure how far the data is spread around an average value. They are quite useful when we want to know how meaningful an average value is. Therefore we must also define the dispersion functions we want to use. A short computation sketch illustrating these aggregation and dispersion functions follows the list below.

For this example we chose the following dispersion measurements:

  • Range of impact and downtime: Since impact and downtime are measured in terms of frequencies (impact and downtime increase when an outage occurs), we choose the range (the difference between the minimum and the maximum downtime) as the dispersion measure. With this data we can find out e. g. whether we have many small outages or a few large outages.
  • Standard deviation of response time: Since the response time is measured as a continuous number, we chose the standard deviation as our dispersion measurement function.
  • Standard deviation of disk utilization: Disk utilization will grow continually over time, but sometimes disk utilization is reduced due to maintenance work and other activities. Disk utilization growth is not linear and continuous. Therefore we should measure changes of disk utilization and take the standard deviation as our dispersion function.
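
A minimal Python sketch of these aggregation and dispersion functions, using a handful of assumed sample values (the numbers are invented for illustration; only the functions matter):

    # Aggregation and dispersion of gathered measurements (assumed sample values).
    from statistics import mean, stdev

    downtimes = [120, 300, 60]               # seconds per detected outage
    response_times = [0.8, 1.1, 0.9, 4.2]    # seconds per test query
    disk_utilizations = [0.55, 0.58, 0.61]   # daily ratio of used to total space

    total_downtime = sum(downtimes)                   # Step 4: sum
    avg_response_time = mean(response_times)          # Step 4: average
    max_disk_utilization = max(disk_utilizations)     # Step 4: maximum

    downtime_range = max(downtimes) - min(downtimes)  # dispersion: range
    response_time_stdev = stdev(response_times)       # dispersion: standard deviation

    print(total_downtime, avg_response_time, max_disk_utilization,
          downtime_range, response_time_stdev)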

Step 5: Analyse the data

The data we gathered can be seen as a statistical sample with average values, standard deviations etc. In order to do something useful with the data, we must define tests we can apply to the collected statistical samples.
It is obvious that we should not just analyse data somehow. All data analysis should be performed according to the requirements that we define for our IT infrastructure. These requirements are commonly called “service levels”. Typically we want our infrastructure to achieve certain critical values in terms of performance, capacity and availability. Another important aspect of the analysis is measuring the impact of changes.

In this step we want to find out if we achieved the values we wanted to achieve. This is done by testing the aggregated data. The most common methods to check data are statistical methods. Statistical methods can be descriptive or inferential. Descriptive statistics reveal characteristics of the data distribution in a statistical sample. In descriptive statistics we check where the average of data is situated and how the data is distributed around the average. In inferential statistics we compare different samples to each other and we induce the value distribution of the population from the value distribution of samples.

Descriptive statistics are needed to check whether the required service level has been achieved or not. Inferential statistics are useful to check whether the achievement occurred accidentally or whether it was the result of maintenance work and configuration changes.

For the example of our cloud operating system the following descriptive analytic methods are proposed:

  • Check Availability Level: The availability can be calculated by subtracting the total downtime from the length of the measurement period, dividing the result by the length of the measurement period and multiplying it by 100. The result is a percentage value which should be above the availability level required by service consumers (a small calculation sketch follows this list). In order to check how availability is distributed, one should also look at the range of the measured downtimes.
  • Check Outage Impact: The total impact of outages should be below a certain level. One can also calculate the mean impact size and variance of impacts in order to see if we have many small outages or few severe outages.
  • Check Average Response Time: In order to check the response time one should calculate the average of the average response time as well as the variance of the response time.
  • Check Maximum Disk Utilization: The maximum disk utilization should be checked against a critical value. In order to see whether the disk utilization grows rapidly or slowly, one should also check the average of the maximum disk utilizations as well as their variance.
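
As a small illustration of the availability check described in the first bullet point, the following sketch compares the measured availability against an assumed service level of 99.9 percent over a three-month measurement period; the downtime value is an invented example.

    # Availability check sketch (assumed example values).
    measurement_period = 90 * 24 * 3600   # three months in seconds
    total_downtime = 1800                 # summed downtime from step 4 (assumed)
    required_availability = 99.9          # required service level in percent (assumed)

    availability = (measurement_period - total_downtime) / measurement_period * 100
    print("Measured availability: %.4f %%" % availability)
    print("Service level met:", availability >= required_availability)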

Descriptive statistics only reveal whether you kept the required service level during the observed timeframe. In order to see whether this result was achieved by chance or not, further statistical tests are needed. The following steps must be performed as well:

  • Test distribution of values: As a first step you should check the distribution of values such as outage time, impact, response time and disk utilization. If the values follow a normal distribution, you should choose different statistical tests than if they were not normally distributed. Tests for the distribution of values are the “Anderson-Darling-Test”, the “Cramer-von Mises-Test” and the “Kolmogorov-Smirnov-Test”. In order to use these tests you must use the data you gathered in step 3.
  • Check if average differs from critical value: In inferential statistics we want to know whether the measured average value was achieved as a result of our efforts to maintain the IT infrastructure or whether it is a random value generated by accident. For this reason we compare the average value either to the value we expected from previously defined service levels or to the average value of another sample, i. e. a data set we gathered in a previous iteration of the 7-Step process. If it is your first iteration you can only compare the current data set to the service level; otherwise you can compare the data of your previous iteration with the data of the current iteration. The goal of this analysis is to see whether there is a significant difference between the average and a critical value (either the average value required in the service level agreement or the average value of the previous sample). “Significant” means that if we assume that the difference is not equal to zero, there is only a small error probability α (usually below a previously chosen significance threshold of 5 percent) that the difference is in fact equal to zero. There are many statistical tests to check whether differences between average values are significant. Once we know the distribution of the values, we can test whether the difference between the measured average value and the critical value is significant. If the sampled values are not normally distributed, you should choose a non-parametric test such as the “Wilcoxon Signed-Rank Test” to test the significance of the difference. If the samples follow a normal distribution, you should rather choose a parametric test such as “Student’s t-Test”. Parametric tests are generally more powerful than non-parametric tests, but they rely on the assumption that the values follow a particular distribution and are therefore not always applicable. The interpretation of such a test is quite straightforward: when we measure a significant negative difference between the measurement and the critical value, we have to take corrective actions. If the difference between the measured average and the critical value is positive and significant, the difference can be considered a result of our efforts. Otherwise it could mean that we achieved the better-than-required value only by chance. In that case we should think about corrective actions too, because the “better” value does not necessarily mean that the infrastructure is well maintained.
  • Check if variance differs from critical value: Generally speaking, low variance is preferable to high variance in a cloud computing environment. Low variance means that you have few extreme values and therefore your cloud computing environment is more scalable. Though variance should be kept low, this is not always possible. If your cloud environment is e. g. used to serve shopping websites, you unavoidably have varying traffic, which makes response time, disk utilization and even availability vary too. But even in such cases it is better to know the variance well than to know nothing at all. Knowledge about increasing variance makes you aware of imminent performance problems or other risks. For this reason the variance should be compared to previously collected data from former iterations of the 7-Step process. As with the average values, you might want to know the difference between variances and whether this difference occurred by accident or has another cause. Another interesting question is whether the variances of different samples did not change over time (their difference is equal to 0). The property of different samples having homogeneous variances is called “homoscedasticity”. There are also quite a few statistical tests to check whether two samples have equal variances; which test should be preferred depends on the distribution of the sample data. If your sample follows a normal distribution, you should use an “F-Test”. If the data is distributed non-normally, you should use more robust tests such as “Levene’s Test” or the “Brown-Forsythe-Test”. These tests decide whether the difference in variances is significant. If there is a significant difference, you should consider why it occurred and prepare some corrective actions.
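
A minimal sketch of such tests using scipy.stats is shown below. The two response-time samples and the 1.0-second service level are invented for illustration; in practice you would feed in the data gathered in step 3.

    # Illustrative statistical tests on two assumed response-time samples (seconds).
    from scipy import stats

    previous = [0.9, 1.1, 1.0, 1.3, 0.8, 1.2, 1.0, 0.9]   # previous iteration (assumed)
    current  = [1.4, 1.6, 1.2, 1.5, 1.7, 1.3, 1.6, 1.4]   # current iteration (assumed)

    # 1. Test the distribution of values (Kolmogorov-Smirnov against a fitted normal).
    print(stats.kstest(current, "norm", args=(stats.tmean(current), stats.tstd(current))))

    # 2. Compare the average to a critical value from the service level (assumed 1.0 s).
    print(stats.ttest_1samp(current, popmean=1.0))       # parametric (normally distributed data)
    print(stats.wilcoxon([x - 1.0 for x in current]))    # non-parametric alternative

    # 3. Check whether the variances of the two samples differ (homoscedasticity).
    print(stats.levene(previous, current))               # robust against non-normal data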

Even if you have found significant changes in variance or average values, you are still not done. You have to explain why the change occurred. Statistical results must be interpreted to become more than just data.

Step 6: Present and use the information

According to the ITIL V3 standard, information is nothing else than data which is interpreted in some meaningful way. Because we add meaning to the data, we transform it into information. The big advantage of information is that it can be used to take corrective actions and to change something in an IT service.

Let’s say that we have performed a data analysis on our cloud services and found that our average response time is significantly higher than we expected it to be, i. e. the system responds more slowly than required. In that case we have gained information from the data. With that information we can now decide to improve the average response time somehow.

At this stage it is important to interpret the information correctly. There are two ways to interpret information:

  • Reasoning: In order to interpret the results of your statistical tests, you should be able to identify what implications they have for your IT service. If you e. g. discover a significant increase in disk utilization, the logical implication is that you should either try to limit disk utilization or add more storage to your system. Reasoning cannot be fully automated, since it requires some common-sense knowledge about the IT services we use. There are, however, approaches to assist people in reasoning tasks. You could e. g. use so-called “expert systems” to derive logical implications from your analysed data. An expert system is a piece of software which is fed with data and uses formal logic to compute logical implications. Such systems can support you in taking decisions about your IT architecture and other aspects of your IT service.
  • Root Cause Analysis: Sometimes the cloud provider might discover a problem by analysing the data. The response time of the system could degrade significantly or there could be a growing trend in the occurrence of outages. As soon as such a problem is discovered, one should identify the underlying cause of the problem. This procedure is called “root cause analysis”. In root cause analysis you repeatedly ask what caused the problem and then what caused the cause of the problem. The recursion involved in root cause analysis has a limited depth. Like reasoning, root cause analysis is a process which can only be performed manually.

By reasoning about the results of your statistical tests you create valuable information. This information serves as a base to take corrective actions.

Step 7: Implement corrective action

The last step in the 7-Step Continual Service Improvement process is to take corrective actions. The corrective actions depend on the information generated in step 6. Since the 7-Step process is a continual improvement process, the last step also has the goal of closing the improvement cycle. This is done by aligning the IT improvements with the business strategy.

The first part of this step is to immediately correct errors in the IT environment. These could be programming errors, bad configuration of hardware or software, bad network design or even bad architectural decisions. It is important to document which actions we take to correct the problems and which changes are performed in the IT infrastructure. It is also important to reflect the results of the performed actions in the documentation. If we face problems while implementing changes, we should document them as well. Otherwise our “corrective” actions could destabilize the cloud service without us knowing why everything is getting worse. Therefore we should create a report containing the corrective actions and the results of the performed actions.

The second part of this step is to close the process cycle. This is done by reporting the results of the 7-Step process cycle to business decision makers – usually the managers of the cloud provider. In order to restart the process, it must also be defined when and how to start the next cycle. This task must be coordinated with the decision makers as well. For practical reasons a plan for the next cycle must be created and approved.

Continual Repetition of the Improvement Cycle

Once an improvement process cycle is finished, it must start over again. Therefore new goals for the business strategy must be defined. At this stage business decisions have to be taken and a plan for the next cycle must be drawn up. The goal of this business strategy redefinition is to redefine the critical success factors in order to start the next measurement and improvement iteration. The business strategy is the main output and input of the 7-Step Continual Service Improvement process. Fig. 2 shows the whole process as well as the results we get at each step. It also shows that the 7-Step Improvement process influences and is influenced mainly by the cloud provider’s business strategy.

Fig. 2: Results of the 7-Step Continual Service Improvement Process.

Automated OpenStack High Availability installation now available

The ICCLab developed a new High Availability solution for OpenStack which relies on DRBD and Pacemaker. The OpenStack services are installed on top of a redundant 2-node MySQL database. The 2-node MySQL database stores its data tables on a DRBD device which is replicated between the 2 nodes. OpenStack can be reached via a virtual IP address, so users feel that they are dealing with a single OpenStack node. All OpenStack services are monitored by Pacemaker. When a service fails, Pacemaker restarts it on either node.

Fig. 1: Architecture of OpenStack HA.

The 2 node OpenStack solution can be installed automatically using Vagrant and Puppet. The automated OpenStack HA installation is available on a Github repository.

Automated Vagrant installation of MySQL HA using DRBD, Corosync and Pacemaker

Fig. 1: Redundant MySQL Server nodes using Pacemaker, Corosync and DRBD.

If automation is required, Vagrant and Puppet seem to be the most adequate tools to implement it. What about automatic installation of High Availability database servers? As part of  our Cloud Dependability efforts, the ICCLab works on automatic installation of High Availability systems. One such HA system is a MySQL Server – combined with DRBD, Corosync and Pacemaker.

In this system the server logic of the MySQL Server runs locally on different virtual machine nodes, while all database files are stored on a clustered DRBD device which is replicated across the nodes. The DRBD resource is managed by Pacemaker, which uses Corosync as its cluster communication layer. If one of the nodes fails, Pacemaker automagically restarts the MySQL server on another node and synchronizes the data on the DRBD device. This combined DRBD and Pacemaker approach is best practice in the IT industry.

At ICCLab we have developed an automatic installation script which creates 2 virtual machines and configures MySQL, DRBD, Corosync and Pacemaker on both machines. The automated installation script can be downloaded from Github.

Specification of data to be collected in Dependability Modeling

In part 3 of our article series “Dependability Modeling on OpenStack” we have discussed that we should run Chaos Monkey tests on an OpenStack HA installation and then collect data about the impact of the attack. While we did say that we want to collect data about the implemented OpenStack HA architecture, we were not specific about which data we should actually collect. This article gives some hints what is important when collecting data about HA system architectures.

What should be measured?

A very interesting question is what should be measured during a Chaos Monkey test run. The Dependability Modeling Framework is used to measure the capability of a system architecture to deliver “low” impacts of system outages. Therefore we should measure the impact of outages. The impact is a score which is derived from the Dependability graph. It should be measured as a result of a test run.

What is analysed in Dependability Modeling?

In Dependability Modeling we are interested in correlations between the system architecture and the outage impact. The system architecture data is mainly categorical data (replication technology used, clustering technology etc.) and the impact is a number. All variables that describe the system architecture are meant to be “explanatory” or “independent” variables, i. e. variables that can be chosen freely in the simulation, while the impact of outages is the “explained” (or “dependent”) variable, because the impact is assumed to be the result of the chosen architecture. In order to find significant correlations between system architecture properties and impact, we must collect values for all explanatory variables and then use a dimensionality reduction method to find which properties are interesting.

How much data should be collected?

First we must say that it is not bad practice to collect “too much” data in a test or a scientific experiment. Classical statistics usually recommends small samples, but this recommendation stems from the 19th century, when measurements were expensive and statements about data sets had to be derived from small samples. Nowadays we can collect data automatically, so we are not forced to use small sample sets. We can simulate the whole life cycle of a cloud service: e. g. we could say that an OpenStack service will run for about 8 years, which is 8 x 365 = 2’920 days, and take one Chaos Monkey test for each day. The advantage of this automation is that we do not need to rely on samples.
Of course there is a limitation in terms of computational power: a Chaos Monkey test takes about 0.5-1.5 seconds. If we run 2’920 Chaos Monkey tests, the whole simulation can take more than 4’300 seconds, which is over an hour. Therefore you either run the simulation as an overnight batch job or you limit it to a sample size which adequately represents the overall population. To determine an adequate sample size you can use variance estimation; the sample size can then be obtained with the standard statistical formula for calculating sample sizes.
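
As an illustration, the common sample-size formula n = (z * sigma / e)^2 could be applied as follows; the estimated standard deviation of the impact score and the acceptable margin of error are assumed placeholder values.

    # Sample-size estimation sketch (assumed example values).
    import math

    z = 1.96        # z-score for a 95 % confidence level
    sigma = 2.5     # estimated standard deviation of the impact score (assumed)
    error = 0.5     # acceptable margin of error (assumed)

    n = math.ceil((z * sigma / error) ** 2)
    print("Required number of Chaos Monkey test runs:", n)   # 97 in this example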

With that specification, we can proceed in developing our test framework. A further article will show a sample data set.

 

 

Future trends and technologies in Mobile and Internet communications @ CFIC 2013

Future trends of Mobile and Internet Communications are revealed at the Conference on Future Internet Communications 2013 in Coimbra, Portugal. The many different speeches and talks show that Cloud Computing could play a major role in future Mobile Communication networks.

Alexander Sayenko explains future trends in 3GPP standardization.

The first keynote speech of the conference was held by Alexander Sayenko, researcher at Nokia Siemens Networks, where he is responsible for standardization activities of the 3GPP specification. In his keynote he presented the new HetNet multicarrier solution for the enhancement of current mobile communication traffic. While mobile communication traffic is expected to grow exponentially over the next decade, very diverse requirements concerning reliability and mobility of IT services pose a major challenge to the telecommunication industry. In order to handle the growing mobile traffic, the capacity of current mobile networks should be enhanced by widening the available radio spectrum, improving spectral efficiency and offloading communication data to clusters of mobile base stations. HetNet offers a solution to enhance the radio spectrum and use it more efficiently by meshing up multiple heterogeneous access networks. The future trend in mobile communication goes towards managing heterogeneous network infrastructures, since new standards like LTE and HSPA+ are still not broadly used and will not replace older technologies as fast as mobile end users expect. While the number of mobile devices and applications grows rapidly, changes in the infrastructure of mobile communication providers are performed much more slowly. New standards in mobile communications are a necessity in order to avoid a situation where the network infrastructure becomes a bottleneck for the mobile communication market.

Bottleneck: low efficiency of current access networks

The message is clear: mobile networks should be used more efficiently. An efficiency gain could be provided by the use of Cloud Computing in mobile networks. Andreas Kassler, researcher at Karlstad University in Sweden, presented CloudMAC – a new approach that allows location-independent routing of mobile devices in wireless networks without introducing additional routing protocol overhead as e. g. in the Mobile IP protocol. The solution is to move the routing logic from the Wireless Termination Endpoints into a virtualized infrastructure such as an OpenStack cloud. Such an approach shows that Cloud Computing could become very important for the development of more efficient mobile networks. Therefore projects like the Mobile Cloud Network at the ICCLab can make mobile communication ready for the challenges of the next decade.

ICCLab: enhance Quality of Cloud Services

The ICCLab also had the chance to present the benefits of cloud services for future Internet communications. Konstantin Benz, researcher at the ICCLab, presented different technologies for OpenStack which should enable High Availability. He also showed how the Chaos Monkey tool could be turned into a test framework which assesses the HA readiness of OpenStack architectures. The ongoing research on Cloud Automation, Cloud Dependability, Cloud Interoperability, Cloud Monitoring and Cloud Performance at the ICCLab improves the overall quality of Cloud Computing as a service and helps prepare it for the future of mobile and Internet communications.

OpenStack Grizzly installation for the lazy

As a kind of advertisement for the new OpenStack Grizzly release we have created an automated single-node OpenStack Grizzly installation which uses Vagrant and Puppet. The automated installation can be downloaded from Github using the following URL: https://github.com/kobe6661/vagrant_grizzly_install.git

Please feel free to install it on your machine and test the new release.

Dependability Modeling on OpenStack: Part 3

In this part of the Dependability Modeling article series we explain how a test framework for an OpenStack architecture can be established. The test procedure has 4 steps: in the first step, we implement the OpenStack environment following the planned system architecture. In the second step we calculate the probabilities of component outages during a given timeframe (e. g. 1 year). Then we start a Chaos Monkey script which “attacks” (randomly disables) the components of the system environment, using the calculated probabilities as a base for the attack. As a last step we measure the impact of the Chaos Monkey attack according to the table of failure impact sizes we created in part 2. The impact of the attack is stored as a dataset in a database. Steps 1-4 form one test run. Multiple test runs can be performed on multiple architectures to create empirical data which allows us to rate the different OpenStack architectures according to their availability.

Step 1: Implement system architecture

Implementing an OpenStack architecture is quite straightforward with the Vagrant-Devstack installation. Each OpenStack node can be set up as a Vagrant-Devstack system. First install Virtualbox, then Vagrant and then Vagrant-Devstack. Configure Devstack to support a multi-node environment. As a next step you should create an SSH tunnel between the different nodes using Vagrant. Once the different VM nodes are ready, you can start to test the architecture. (Fig. 1) shows a typical OpenStack architecture for a single OpenStack node.

Fig. 1: Typical OS architecture for a single OpenStack node.

High availability is usually only possible in a multi-node environment, because redundant nodes are needed in case of node failures and subsequent failovers. Therefore your architecture must be distributed or clustered over several redundant nodes. An example of such an architecture is shown in (Fig. 2). Once the architecture is defined, you have to implement it using Vagrant, Puppet and Devstack.

Fig. 2: Sample 2-node architecture using DRBD, Corosync and Pacemaker.

Step 2: Calculate outage probability

Availability is usually measured during a given time period (e. g. one year). It is the fraction of uptime divided by the total time. If we want to calculate the risk/probability of outages in the observed period, we must know at least two values: the total downtime of a component (which can be derived when the availability is known) and the average recovery time. Both values are parameters which are needed to estimate the number of outages in the observed time period. In (Tab. 1) we have a list of all OpenStack components which are present in one node of the OpenStack installation. Availability is observed for a time period of one year (= 31’536’000 seconds). If we assign each component an availability value and an average recovery time, we can calculate the downtime and the number of outages per year. Because we are interested in the outage risk, we calculate the risk by dividing the total number of outages by the number of days per year. The calculated outage risks can then be used to simulate a typical operational day of the observed OpenStack system.
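
A minimal sketch of this calculation for a single component; the availability and recovery time are assumed example values, not measurements from the actual installation.

    # Outage risk estimation sketch for one component (assumed example values).
    PERIOD = 31_536_000          # observed period: one year in seconds
    availability = 0.999         # assumed availability of the component
    recovery_time = 600          # assumed average recovery time in seconds

    downtime = (1 - availability) * PERIOD          # total downtime per year
    outages_per_year = downtime / recovery_time     # estimated number of outages
    daily_outage_risk = outages_per_year / 365      # probability used in the simulation

    print(downtime, outages_per_year, round(daily_outage_risk, 4))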

Tab. 1: Outage risk estimation of OpenStack components.

Step 3: Run Chaos Monkey attack

Although Chaos Monkey disables devices randomly, a realistic test assumes that outages do not occur completely at random. A Chaos Monkey attack should be executed only with a certain probability – not with certainty. Therefore we must create a script which disables the OpenStack services with the probabilities we defined in (Tab. 1). Such a script could be written in Python – as shown in (Fig. 3). The most important part of the shutdown mechanism is that probabilities can be assigned to the services we want to disable. The probabilities are taken from the values we calculated in (Tab. 1). The other part is that the execution of Chaos Monkey attacks follows a random procedure. This can be achieved by using a simple random number generator which generates a number between 0 and 1. If the random number is smaller than the probability, the Chaos Monkey attack is executed (otherwise nothing is performed). This way we can simulate the random occurrence of outages as it would be the case in a real OpenStack installation running in operational mode.

Fig. 3: Excerpt of a Python script which serves to shutdown OpenStack services.
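
Since the actual script is only shown as a figure (Fig. 3), here is a minimal sketch of the probabilistic shutdown mechanism. The service names, the outage risks and the shutdown command are assumptions made for illustration; they would be replaced by the values from (Tab. 1) and the commands of the concrete installation.

    # Probabilistic service shutdown sketch (assumed service names and risks).
    import random
    import subprocess

    outage_risks = {              # daily outage risk per service (assumed values)
        "nova-api": 0.14,
        "keystone": 0.05,
        "glance-api": 0.08,
    }

    for service, risk in outage_risks.items():
        if random.random() < risk:                          # attack only with probability
            print("Chaos Monkey disables", service)
            subprocess.call(["service", service, "stop"])   # assumed shutdown command
        else:
            print(service, "survives this round")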

Step 4: Poll impact of failure

Once the Chaos Monkey attack has been performed, one has to check the impact size of the outage. The failure impact size corresponds to the values in the table of failure impact sizes (Tab. 2). The table of failure impact sizes is derived from the execution of Dependability Modeling (as explained in article 2 of this series). The task at hand is now to poll which user interactions are still available after the Chaos Monkey attack. This can be done by performing the use cases which are affected by an outage of a component. The test tool must be a script which programmatically runs the use cases as tests. If a test fails, the failure impact size is raised according to the weight of the use case. The result of such a test run is the failure impact size after the Chaos Monkey attack.
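
A minimal sketch of this polling step is shown below. The use-case tests and their weight factors are placeholders; a real test tool would run one test per use case of the dependency table and use the weights from part 2.

    # Failure impact polling sketch (assumed use-case tests and weights).
    def create_vm_instance():        # placeholder test, returns True if the use case works
        return True

    def authenticate_telco_user():   # placeholder test for a failing use case
        return False

    use_case_tests = [               # (test function, weight factor from the dependency table)
        (create_vm_instance, 2),
        (authenticate_telco_user, 3),
    ]

    failure_impact = sum(weight for test, weight in use_case_tests if not test())
    print("Failure impact size after the attack:", failure_impact)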

Tab. 2: Failure impact sizes and use cases affected by component failure.

Cleanup and re-run the test

Test results should be stored in a database. This database should contain failure impact sizes, assumed availabilities and average recovery times as well as information about the system architecture that has been used. When a test run has been completed, the results of the Chaos Monkey attacks have to be reverted in order to be able to re-run the test. With each test-run the database is filled up and one can be more certain about the test results.

Further test runs can be started either with the same architectural setup or with another one: instead of a one-node installation one could use a two-node OpenStack installation, one could use Ceph and Pacemaker as HA clustering software and try different technologies. If we perform steps 1-4 repeatedly, we can rate different OpenStack architectures according to their resistance against outages and find out which architecture fits best to High Availability goals.

If the test framework is applied to an OpenStack environment like e. g. the Mobile Cloud Network, High Availability characteristics can be ensured with more confidence. Dependability modeling is a useful recipe for testing OpenStack architectures from an end user’s perspective. The capabilities of the explained method have not been explored in detail yet, but more will follow soon.

 

DRBD-Test environment for Vagrant available

There is always room to test different HA technologies in a simulated VM environment. At ICCLab we have created such a DRBD test environment for PostgreSQL databases. This environment is now available on Github.

The test environment installation uses Vagrant as tool to install VMs, Virtualbox as VM runtime environment and Puppet as VM configurator. It includes a Vagrant installation script (usually called a “Vagrantfile”) which sets up two virtual machines which run a clustered highly available PostgreSQL database.

In order to use the environment, you have to download it and then run the Vagrant installation script. The Vagrant installation script of the test environment essentially does the following things:

  • It creates two virtual machines with 1 GB RAM, one 80 GB harddrive and an extra 5 GB harddrive (which is used as DRBD device).
  • It creates an SSH tunnel between the two VM nodes which is used for DRBD synchronization.
  • It installs, configures and runs the DRBD device on both machines.
  • It installs, configures and runs Corosync and Pacemaker on both machines.
  • It creates a distributed PostgreSQL  database which runs on the DRBD device and which is managed by the Corosync/Pacemaker software.

This environment can easily be installed and then be used for testing of the DRBD technology. It can be downloaded from the following Github repository:

https://github.com/kobe6661/dependability_test_fw.git

Installation instructions can be found here.

Dependability Modeling on OpenStack: Part 2

In the previous article we defined use cases for an OpenStack implementation according to the usage scenario in which the OpenStack environment is deployed. In this part of the Dependability Modeling article series we will show how these use cases relate to functions and services provided by the OpenStack environment and create a set of dependencies between use cases, functions, services and system components. From this set we will draw the dependency graph and make the impact of component outages computable.

Construct dependency table

The dependency graph can be constructed if we define which functions, services and components allow the provision of a use case. In the example below (Fig. 1) we defined the system architecture components, services and functions which allow one to create, delete or update details of a Telco account (the account of a mobile end user). Since these operations are provided within virtual machines, the VM User Management and VM Security Management functions provide the availability of this use case. Therefore we draw a column which contains these functions. Because these functions need a User Management and an SSH & Password Management service in each VM in order to operate, we draw a second column which contains the required services. Another column is constructed which lists the system components required to deliver these services.

Fig. 1: Dependency Graph Construction.

The procedure mentioned above is repeated for all use cases. As a result you get a table like the one in (Tab. 1). This dependency table is the starting point for the production of the dependency graph.

Tab. 1: Dependencies between Use Cases, Services, Functions and Components.

Construct dependency graph

For each component listed in the table we have to model the corresponding services, functions and use cases. This is done as in the example in (Fig. 2). We start at the right of the graph with the Ceilometer component and the VM plugin and look at which services are provided by those components: it is e. g. the “Ceilometer Monitoring” service. Therefore we draw an icon that represents this service and draw arrows from the Ceilometer and VM plugin components to the service icon (1). In the next step we look at which function is provided by the Ceilometer Monitoring service. This is the “Monitoring of VM” function. Therefore we add an icon for the function and draw an arrow to it (2). Then we look for the use cases provided by the Monitoring of VM function. Since this is e. g. “Measure SLAs”, we add an icon for this use case and draw another arrow to “Measure SLAs” (3). The first path between a use case and the components on which it depends is drawn. This procedure is repeated for all components in (Tab. 1).

Fig. 2: Dependency Graph Construction from Dependency Table.

The result is the dependency graph shown below (Fig. 3).

Fig. 3: Dependency Graph of OpenStack Environment.

Add weight factors to use cases

Once the dependency graph is constructed, we can calculate the “impact” of component outages. When a component fails, you can simply follow the arrows in the dependency graph to see which user interactions (use cases) stop being available to end users. If e. g. the Ceilometer component fails, you will not be able to measure SLAs, meter the usage of Telco services or monitor the VM infrastructure.

But it would not be very sophisticated to say that each use case is equally important to the end user. Some user interactions, like e. g. the creation of new VM nodes, need not be available all the time (or at least it depends on the OLAs of the Telco). Other actions, like e. g. Telco authentication, must be available all the time. Therefore we have to add weight factors to the use cases. This can be done by adding another column named “Weight factor” to the dependency table. The weight factor should be a score measuring the “importance” of a user interaction in terms of business need. In a productive OpenStack environment, financial values (corresponding to the business value of the user interaction) could be assigned as weight factors to each use case. For reasons of simplicity we take the ordinal values 1, 2 and 3 as weight factors (whereby 1 signifies the least important user transaction and 3 the most important one). For each use case row in the dependency table we add the corresponding weight factor (Fig. 4).

Fig. 4: Assignment of weight factors.

As a next step, we create a pivot table containing the components and use cases as consecutive row fields and the weight factors as data field. In order to avoid duplicate counts (of use cases) we use the maximum function instead of the sum function. As a result we get the pivot table in (Tab. 2).

Tab. 2: Pivot Table of Component/Use Case dependencies.

Calculate outage impacts

Calculating the impact of system component outages is now quite straightforward: look at the pivot table and calculate the sum of the weight factors for each component. As a result we get a table of failure impact sizes (Tab. 3).
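
To make the pivot and summation step concrete, here is a small pandas sketch over an assumed excerpt of the dependency table; the component names, use cases and weights are illustrative only.

    # Pivot and failure impact calculation sketch (assumed dependency rows).
    import pandas as pd

    dependencies = pd.DataFrame([
        {"component": "Keystone",   "use_case": "Authenticate Telco user",    "weight": 3},
        {"component": "Keystone",   "use_case": "Manage governance policies", "weight": 2},
        {"component": "Ceilometer", "use_case": "Measure SLAs",               "weight": 2},
        {"component": "Ceilometer", "use_case": "Measure SLAs",               "weight": 2},  # duplicate
    ])

    # The maximum function avoids counting the same use case twice ...
    pivot = dependencies.pivot_table(index=["component", "use_case"],
                                     values="weight", aggfunc="max")
    # ... and the failure impact size is the sum of the weights per component.
    impact_sizes = pivot.groupby(level="component")["weight"].sum()
    print(impact_sizes)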

Tab. 3: OpenStack Components and Failure Impact Sizes.

This table reveals which components are very important for the overall reliability of the OpenStack environment and which are not. It operationalizes the measurement of “failure impact” for a given IT environment (failure impacts can be measured as a number). The advantage of this approach is that we can build a test framework for OpenStack availability based on the failure impact sizes.

Most obviously, components with strong supporting functionality, like e. g. MySQL or the Keystone component, have high failure impact sizes and should be strongly protected against outages. VM-internal components seem to be less important, because VMs can easily be cloned and recovered in a cloud environment.

In a further article we will show how availability can be tested with the given failure impact size values on a given OpenStack architecture.

 

Dependability Modeling on OpenStack: Part 1

Dependability Modeling is carried out in 4 steps: model the user interactions, model the system functions, model the system services and then model the system components which make the system services available. In this first part we will define which interactions can be expected from end users of the OpenStack cloud platform and construct the first part of the dependability graph. Once the dependability model is constructed, a Dependability Analysis will be performed and several OpenStack HA architectures will be rated according to their outage risk.

Before we can define use cases for an OpenStack HA environment, we must first think about its deployment model. According to the Use Cases Whitepaper of the Open Cloud Manifesto, every cloud has its own use case scenario which depends on its “Cloud Deployment Model”. A Cloud Deployment Model describes the way a cloud is deployed in an organizational context. The US National Institute of Standards and Technology (NIST) has published a definition paper which describes essential characteristics of cloud computing as well as possible types of Service and Deployment Models for cloud environments. According to the NIST definition of Cloud Computing, there are four types of Cloud Deployment Models:

  • Private Cloud: The cloud infrastructure is operated for one single organization inside that organization’s firewall. All data and processes are managed within the organization and are therefore not exposed to security issues, network bandwidth limitations or legal restrictions (in contrast to a Public Cloud).
  • Community Cloud: The cloud infrastructure is shared by several organizations and has the purpose to support a specific community of end users who have shared concerns. Typical Community Clouds are e. g. Googledocs, Facebook, Dropbox.
  • Public Cloud: The cloud infrastructure is made available to the general public and is owned by a cloud provider organization.
  • Hybrid Cloud: The cloud infrastructure is a composition of multiple other clouds (private, community or public) that remain unique entities but are bound together by technology that enables interoperability.

According to this definition, the MobileCloud Networking (MCN) infrastructure is rather a Hybrid Cloud. On the one hand MCN is used as a Private Cloud for the Telcos to manage their infrastructure environment and handle peak loads or infrastructure-based network issues. On the other hand, the MCN is a Public Cloud for the mobile end users: they request communication services from the Telco sites, register and authenticate themselves and consume the communication services offered by the Telco. Mobile end users produce the load on the Telco-managed infrastructure. The MCN is deployed in an “Enterprise to Cloud to End User” scenario (Fig. 1).

Fig. 1: Enterprise to Cloud to End User

Typically the Enterprise to Cloud to End User Scenario requires the following features:

  • Identity Management: This is performed by the authentication services provided by the Telco. Authentication services run inside the virtual machines provided by OpenStack.
  • Use of an open client: Management of the cloud should not depend on a particular platform/technology. In OpenStack this is guaranteed by using the Horizon Dashboard.
  • Federated Identity Management: Identity of Telco users should also be managed in parallel to end users. In OpenStack Telco users are managed by the Keystone component. End users are authenticated in the virtual machines provided by the Telco.
  • Location awareness: Depending on the legal restrictions in the Telco industry, data of end users must be stored on particular physical servers. Therefore the cloud service must provide awareness of the location of end users.
  • Metering and monitoring: All cloud services must be metered for chargeback and provisioning. MCN uses a provisioning facility for this task.
  • Management and Governance: It is up to the Telcos to define Governance policies for the VMs managed by OpenStack. Policies and rules can be configured via Keystone.
  • Security: The OpenStack cloud network should be secured against unauthorized access. Security is a typical Keystone task.
  • Common File Format for VMs: The infrastructure of Telco organizations might be heterogeneous. For reasons of interoperability the file format of the VMs used in the MCN cloud should be interchangeable. Nova is the computation component of the OpenStack framework. Nova is technology-agnostic and therefore offers VM interoperability between many different hypervisors such as KVM, Xen, Virtualbox etc.
  • Common APIs for Cloud Storage and Middleware: OpenStack offers a common API for Cloud Storage: Images are stored and managed by the Glance component. All objects managed in the cloud are stored with the Swift API. Block storage is managed by Cinder.
  • Data Application and Federation: All cloud data must be federated in order to manage the cloud infrastructure. In OpenStack cloud data is managed by a MySQL server.
  • SLAs and Benchmarks: The OpenStack environment must fulfil SLAs with the end users as well as OLAs with the Telco itself. SLAs can be metered by the MCN provisioning facility.
  • Lifecycle Management: The lifecycle of VMs must be managed also in the MCN infrastructure. Lifecycle Management is also a task of Nova component.

If we follow the list of requirements we can define use cases for the OpenStack environment of the MobileCloud Network (Tab.1). The result is a list of use cases which define the user interactions with the OpenStack cloud.

Tab. 1: Use Cases for an OpenStack environment.

Modeling the user interactions is the first step in Dependability Modeling. In order to get a full Dependability Model of the OpenStack environment we must investigate the functions and services which make the user interactions available. A further post will show how this is done.
