How good was my last configuration change? The following article shows how to implement the “7-Step Continual Service Improvement”-Process for a cloud computing environment.
Why is Continual Service Improvement important?
Delivering an IT service (e. g. a cloud) is not a project. A project is something unusual and has a clearly defined beginning and end. Running and operating an IT service is a continuous task. Service consumers expect the IT service to be available whenever they need it. An IT service is supposed to be used regularly for regular business tasks and for an indefinite time frame. Even if IT services have a limited lifetime, they are expected to run as long as it is possible to maintain them.
No IT service operates without errors. Bad user behaviour or misconfiguration can cause operating failures. Therefore IT services must be maintained regularly. Because IT services are used continuously, such improvement and maintenance tasks must be performed repeatedly. While the usual operation of the IT service is expected to be continuous (or at least very close to it), service interruptions occur unexpectedly and maintenance tasks are performed step by step. Therefore service improvement is called “continual” rather than “continuous”.
The “7-Step Continual Service Improvement” process is a best practice for improving IT services. It follows the steps outlined in Fig. 1. If we want to establish the 7-Step process for a cloud service, we must describe how each of the steps can be applied to a cloud service. We must define what we do at each step and what outcomes we expect from each step. These definitions are described in the following seven sections.
Fig. 1: 7-Step Continual Service Improvement Process.
Step 1: What should be measured?
In order to compare your system configuration before and after the change, you must measure the configuration. But what should be measured? A good approach is to use “critical success factors” and deduce “key performance indicators” from them. Critical success factors are important goals for the organization running the system. Should the cloud provider company be seen as a very reliable provider? Then reliability is a critical success factor. Key performance indicators are aspects of the observed system that indicate success of the organization which operates the system. They can be deduced directly from the critical success factors. If e. g. the provider must be reliable, the cloud system should be highly available.
As a first step in the process you should create a list of key performance indicators: they are the important aspects you want to measure in your cloud. Such aspects could be availability, performance or capacity.
At this stage you should not be too specific about the metrics you want to use, because otherwise you would start to confuse key performance indicators and performance metrics. While key performance indicators tell you what should be measured, performance metrics define how it can be measured. Do not confuse the what and the how of measurement.
You should only state general performance indicators you want to know more about, e. g. that you want your cloud operating system to be highly available or that you want service consumers to work with high performing instances. The result is always a list of (positive) aspects you want to have in your system.
For this example we say that the cloud operating system should be:
- Highly available: we want low downtimes and small outage impacts.
- High performing: we want a fast responding cloud operating system.
- Highly receptive: we want the cloud operating system to have enough free disk space to manage virtual machines, disk images, files etc.
Step 2: What can be measured?
Once you have a list of aspects, you should consider how they can be measured. You should define performance metrics for the key performance indicators. Availability could e. g. be measured indirectly by measuring downtime during a time period. Performance can be measured by performing some queries and measuring the response time. As we can see, not all performance indicators can be measured directly. We must construct metrics for every indicator in order to measure a system.
In this example we define the following metrics:
- Availability: We regularly poll the cloud operating system. If an outage occurs, we measure the downtime and the impact of the outage.
- Performance: A test query (e. g. uploading some data to an instance) should be sent regularly to the cloud operating system. The response time of the query should be measured.
- Capacity: The disk utilization of cloud operating system nodes can be measured regularly.
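The separation between the "what" (key performance indicators) and the "how" (performance metrics) can be written down as a simple mapping. The following is a minimal sketch; the class name `Metric`, the field names and the units are assumptions made for illustration and not part of the 7-Step process itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """How a key performance indicator is measured (illustrative schema)."""
    name: str
    unit: str
    method: str  # short description of the measurement procedure


# Key performance indicator (the "what") -> performance metrics (the "how").
# Names and units are assumptions for this example.
METRICS_BY_INDICATOR = {
    "highly available": [
        Metric("downtime", "seconds", "poll the cloud operating system"),
        Metric("outage impact", "impact value", "predefined value per outage"),
    ],
    "high performing": [
        Metric("response time", "seconds", "time a test query"),
    ],
    "highly receptive": [
        Metric("disk utilization", "fraction of total space",
               "read disk usage of the nodes"),
    ],
}


def indicators_without_metrics(indicators, mapping):
    """Return the indicators for which no metric has been constructed yet."""
    return [name for name in indicators if not mapping.get(name)]
```

A helper like `indicators_without_metrics` makes it easy to spot indicators from step 1 that cannot be measured yet because no metric has been defined for them.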
Step 3: Gather data
Once we have defined the performance metrics, we should think about how we can collect data to assign values to the metrics. This step is about developing tools for service monitoring. We should think about how we can measure things like downtime, outages, response time etc. Two techniques for data gathering are very important:
- Polling of IT services: Data is gathered by regularly polling the IT service and checking for the occurrence of some events (e. g. a server is not available). Polling mechanisms must run periodically during a given time frame.
- Direct Measurement: Data is gathered directly by checking some system configuration values. The check runs only once at a given point in time.
An important aspect is choosing the time when we measure something and the frequency of measurements. Should we measure something once per day or should we rather measure something per hour or even per minute? And once we have chosen our frequency we must define the time frame on which measurements should take place.
In this example we gather data for three months and we measure everything according to the following frequencies:
- Impact and Downtime: We could poll every 100 seconds to check whether an outage has occurred. If an outage is detected, the impact can be measured directly as a predefined value that follows our dependability model.
- Response Time: Every hour we could start a script which runs some test queries. Thereby we measure the time for completion of the query. The response time value is then stored as data.
- Disk utilization: This metric need not be polled very often. It can be measured daily by using a direct measurement technique. We just check the used disk space and divide it by the total space of the available disks.
By using the data gathering techniques described above, we collect values for impact, downtime, response time and disk utilization.
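The two data gathering techniques could look roughly like the following sketch. It assumes that the caller supplies a probe function `is_available` (hypothetical; how the cloud operating system is actually probed depends on the product in use). The clock and sleep functions are parameters only so that the polling loop can be tried out without waiting.

```python
import shutil
import time


def poll_for_outages(is_available, interval_s=100, duration_s=300,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll a service and return a list of (start, end) outage intervals.

    `is_available` is a caller-supplied probe returning True or False.
    """
    outages = []
    outage_start = None
    started = clock()
    while clock() - started < duration_s:
        now = clock()
        if not is_available():
            if outage_start is None:
                outage_start = now               # outage begins
        elif outage_start is not None:
            outages.append((outage_start, now))  # outage has ended
            outage_start = None
        sleep(interval_s)
    if outage_start is not None:
        outages.append((outage_start, clock()))  # still down at the end
    return outages


def disk_utilization(path="/"):
    """Direct measurement: used space divided by total space (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total
```

Note that polling can only detect an outage with the resolution of the polling interval: with a 100 second interval, the measured downtime of each outage is a multiple of 100 seconds.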
Step 4: Process the data
The collected data must now be processed in order to be analysed. In this step we must think about aggregation functions and how to aggregate in order to be able to make meaningful statements about the gathered data.
When we collect data for three months, we cannot do anything useful with it if we do not aggregate the data somehow. We can either sum up the collected data or calculate an average. The aggregation function depends on the scale of the data we collected: if we collected e. g. only categorical data, we can only count occurrences of values. If the data can be brought into some meaningful order, we can also determine the minimum, the maximum and the median. If the data is at least interval-scaled, we can sum up values and calculate averages.
For this example we chose the following aggregation functions:
- Total Impact and total downtime: Every time we detect an outage, the impact and the suffered downtime are recorded. In order to aggregate the data, we sum up all suffered downtime and all impacts of outages.
- Average response time: We poll the response time of the cloud regularly, but in order to get an aggregated result we should calculate the average of the response time.
- Maximum disk utilization: It is better to measure the maximum disk utilization instead of the average utilization, since we must find out if a critical threshold is reached. If a disk is full, additional data cannot be saved. Therefore the peak disk utilization is the value we want to monitor.
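The three aggregation functions chosen above can be sketched in a few lines. The input layout (a list of `(downtime, impact)` tuples and plain lists of measurements) is an assumption for this example.

```python
from statistics import mean


def aggregate(outages, response_times, disk_utilizations):
    """Aggregate raw samples into one value per metric.

    outages: list of (downtime_seconds, impact) tuples, one per outage
    response_times: list of response times in seconds
    disk_utilizations: list of daily utilization fractions (0.0 to 1.0)
    """
    return {
        "total_downtime": sum(downtime for downtime, _ in outages),
        "total_impact": sum(impact for _, impact in outages),
        "average_response_time": mean(response_times),
        "maximum_disk_utilization": max(disk_utilizations),
    }
```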
In order to make the data analysable we must also think about the dispersion of the data. Aggregate functions are very sensitive to extreme values and outliers. Rare outliers can distort average values. If we have e. g. a lot of short response times and then suddenly an extremely large value (e. g. caused by a CPU-intensive batch procedure), the average will become large and will not be very representative of the actual measurements. Dispersion functions like variance and standard deviation measure how far the data is spread around an average value. They are quite useful when we want to know more about the meaningfulness of an average value. Therefore we must also define the dispersion functions we want to measure.
For this example we chose the following dispersion measurements:
- Range of impact and downtime: Since impact and downtime are measured in terms of frequencies (impact and downtime increase when an outage occurs), we must choose the range (difference between minimum downtime and maximum downtime) as the dispersion measure. By gathering this data we can find out e. g. if we have many small outages or only a few large outages.
- Standard deviation of response time: Since the response time is measured as a continuous number, we chose the standard deviation as our dispersion measurement function.
- Standard deviation of disk utilization: Disk utilization will grow continually over time, but sometimes disk utilization is reduced due to maintenance work and other activities. Disk utilization growth is not linear and continuous. Therefore we should measure changes of disk utilization and take the standard deviation as our dispersion function.
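The dispersion measures from the list above could be computed as in the following sketch, which uses the sample standard deviation from Python's standard library. The response time numbers are made up to illustrate how a single outlier distorts the average.

```python
from statistics import mean, stdev


def value_range(values):
    """Difference between the largest and the smallest value."""
    return max(values) - min(values)


def stdev_of_changes(series):
    """Sample standard deviation of the step-to-step changes in a series."""
    changes = [b - a for a, b in zip(series, series[1:])]
    return stdev(changes)


# Illustrative numbers only: nine fast responses and one slow batch job.
response_times = [0.2] * 9 + [20.0]
print(mean(response_times))   # pulled far above the typical 0.2 seconds
print(stdev(response_times))  # a large spread hints at the outlier
```

`stdev_of_changes` is meant for the disk utilization: it looks at the day-to-day changes rather than at the utilization values themselves, so steady growth yields a low value and erratic growth a high one.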
Step 5: Analyse the data
The data we gathered can be seen as a statistical sample with average values, standard deviations etc. In order to do something useful with the data, we must define tests we can apply to the collected statistical samples.
It is obvious that we should not just analyse data somehow. All data analysis should be performed according to the requirements that we define for our IT infrastructure. These requirements are commonly called “service levels”. Typically we want our infrastructure to achieve some critical values in terms of performance, capacity and availability. Another important aspect of analysis is measurement of the impact of changes.
In this step we want to find out if we achieved the values we wanted to achieve. This is done by testing the aggregated data. The most common methods to check data are statistical methods. Statistical methods can be descriptive or inferential. Descriptive statistics reveal characteristics of the data distribution in a statistical sample. In descriptive statistics we check where the average of the data is situated and how the data is distributed around the average. In inferential statistics we compare different samples to each other and we infer the value distribution of the population from the value distribution of the samples.
Descriptive statistics are needed to check if the required service level has been achieved or not. Inferential statistics are useful to check if the achievement came about by accident or if it was the result of maintenance work and configuration changes.
For the example of our cloud operating system the following descriptive analytic methods are proposed:
- Check Availability Level: The availability can be calculated by subtracting the total downtime from the length of the measurement period, multiplying the result by 100 and then dividing it by the length of the measurement period. The result is a percentage value which should be above the availability level required by service consumers. In order to check how availability is distributed one should check
- Check Outage Impact: The total impact of outages should be below a certain level. One can also calculate the mean impact size and variance of impacts in order to see if we have many small outages or few severe outages.
- Check Average Response Time: In order to check the response time one should calculate the average response time as well as the variance of the response time.
- Check Maximum Disk Utilization: One should check whether the maximum disk utilization is above some critical value. In order to see if the disk utilization grows rapidly or slowly, one should also check the average of the maximum disk utilizations as well as their variance.
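The availability calculation described above, together with a simple service level check, can be sketched as follows. The required availability of 99.9 percent and the downtime figures are assumptions chosen for illustration.

```python
def availability_percent(total_downtime_s, period_s):
    """Availability in percent: (period - downtime) * 100 / period."""
    if period_s <= 0:
        raise ValueError("measurement period must be positive")
    return (period_s - total_downtime_s) * 100 / period_s


def meets_service_level(measured, required, higher_is_better=True):
    """Compare an aggregated value with its required service level."""
    return measured >= required if higher_is_better else measured <= required


# Illustrative numbers: 90 days of measurement, 2 hours of total downtime.
period = 90 * 24 * 3600
availability = availability_percent(2 * 3600, period)
print(round(availability, 3))                   # 99.907
print(meets_service_level(availability, 99.9))  # True
```

For metrics where lower values are better, such as average response time or maximum disk utilization, the same check can be used with `higher_is_better=False`.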
Descriptive analytics only reveal whether you kept the required service level during the observed timeframe. In order to see if this result was achieved by chance or not, further statistical tests are needed. The following steps must be performed too:
- Test distribution of values: As a first step you should check the distribution of values like outage time, impact, response time and disk utilization. If the values follow a normal distribution, you should choose different statistical tests than if they are not normally distributed. Tests for the distribution of values are the “Anderson-Darling-Test”, the “Cramér-von Mises-Test” and the “Kolmogorov-Smirnov-Test”. In order to use these tests you must use the data you gathered in step 3.
- Check if average differs from critical value: In inferential statistics we want to know whether the measured average value is the result of our efforts to maintain the IT infrastructure or whether it arose by chance. For this reason we compare the average value either to the value required by previously defined service levels or to the average value of another sample, i. e. a data set gathered in a previous iteration of the 7-Step process. In your first iteration you can only compare the current data set with the service level. In later iterations you can also compare the data of the previous iteration with the data of the current iteration. The goal of this analysis is to see if there is a significant difference between the average and a critical value (either the average value required in the service level agreement or the average value of the previous sample). “Significant” means that, if the true difference were equal to zero, a difference at least as large as the measured one would be observed only with a small probability, namely one below a previously chosen significance level α (usually 5 percent). There are quite a few statistical tests to check whether differences between average values are significant. Once we know the distribution of values, we can test whether the difference between the measured average value and the critical value is significant. If the sampled values are not normally distributed, you should choose a non-parametric test, e. g. the “Wilcoxon-Signed-Rank-Test”. If the samples follow a normal distribution, you should rather choose a parametric test like “Student’s t-Test”. Parametric tests are generally more powerful than non-parametric tests, but they rely on the assumption that values follow a particular distribution. Therefore they are not always applicable.
The interpretation of such a test is quite straightforward: when the measured value is worse than the critical value, we have to take corrective actions. If the measured average is better than the critical value and the difference is significant, the difference can be considered a result of our efforts. Otherwise it could mean that we achieved the better-than-required value only by chance. In that case we should think about corrective actions too, because the “better” value does not necessarily mean that the infrastructure is well-maintained.
- Check if variance differs from critical value: Generally speaking, low variance is preferable to high variance in a cloud computing environment. Low variance means that you have rather few extreme values and therefore your cloud computing environment is more scalable. Though variance should be kept low, it is not always possible to do so. If your cloud environment is e. g. used to host shopping websites, you unavoidably have varying traffic, which makes response time, disk utilization and even availability vary too. But even in such cases it is better to know the variance well than to know nothing at all. Knowledge about increasing variance makes you aware of imminent performance problems or other risks. For this reason variance should be compared to data collected in former iterations of the 7-Step process. As with average values, you might want to know the difference between variances and whether this difference occurred by accident or has another cause. It is also interesting to know whether the variance of different samples stayed the same over time (their difference is equal to 0). The property of different samples to have homogeneous variances is called “homoscedasticity”. There are also quite a few statistical tests to check whether two samples have equal variances. Which test should be preferred depends on the distribution of the sample data. If your sample follows a normal distribution, you should take an “F-Test”. If the data is not normally distributed, you should take more robust tests like “Levene’s Test” or the “Brown-Forsythe-Test”. These tests decide if the difference in variances is significant. If there is a significant difference, you should consider why it occurred. You should also prepare some corrective actions.
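To make the two checks more concrete, here is a minimal sketch that computes the test statistics of a one-sample Student's t-test (average against a critical value) and of the Brown-Forsythe test (equality of variances) using only Python's standard library. It deliberately stops at the test statistic: the decision is made by comparing it with a critical value from a statistical table, which is an assumption of this sketch. The sample data is made up. Statistics libraries such as SciPy provide complete versions of these tests including p-values.

```python
from math import sqrt
from statistics import mean, median, stdev


def one_sample_t(sample, critical_value):
    """t statistic for the difference between sample mean and a fixed value."""
    n = len(sample)
    return (mean(sample) - critical_value) / (stdev(sample) / sqrt(n))


def brown_forsythe(*groups):
    """Brown-Forsythe statistic for equality of variances across groups.

    Each value is replaced by its absolute deviation from the group median;
    the statistic is the one-way ANOVA F ratio of those deviations.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    deviations = [[abs(x - median(g)) for x in g] for g in groups]
    group_means = [mean(zs) for zs in deviations]
    grand_mean = mean([z for zs in deviations for z in zs])
    between = sum(len(zs) * (m - grand_mean) ** 2
                  for zs, m in zip(deviations, group_means))
    within = sum((z - m) ** 2
                 for zs, m in zip(deviations, group_means) for z in zs)
    return (n_total - k) / (k - 1) * between / within


# Illustrative data: ten response times (seconds) from the current iteration,
# compared with an assumed service level of 2.0 seconds.
current = [1.6, 1.7, 1.5, 1.8, 1.6, 1.7, 1.6, 1.5, 1.7, 1.6]
t = one_sample_t(current, critical_value=2.0)

# Two-sided 5 percent critical value of the t distribution with
# 9 degrees of freedom, taken from a standard table.
T_CRIT_DF9 = 2.262
print(abs(t) > T_CRIT_DF9)  # True means the difference is significant
```

The Brown-Forsythe statistic is compared in the same way with a critical value of the F distribution with k - 1 and N - k degrees of freedom, where k is the number of samples and N the total number of values.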
Even if you have found significant changes in variance or average values, you are still not done. You have to explain why the change occurred. Statistical results must be interpreted to become more than just data.
Step 6: Present and use the information
According to the ITIL V3 standard, information is nothing other than data which is interpreted in some meaningful way. Because we add meaning to the data, we transform it into information. The big advantage of information is that it can be used to take corrective actions and to change something in an IT service.
Let’s say that we have performed a data analysis on our cloud services and we have found that our average response time is significantly higher than what we expected it to be. In that case we have gained some information from the data. With that information we can now decide to improve the average response time somehow.
At this stage it is important to interpret the information correctly. There are two ways to interpret information:
- Reasoning: In order to interpret the results of your statistical tests, you should be able to identify what implications they have for your IT service. If you e. g. discover a significant increase in disk utilization, the logical implication is that you should either try to limit disk utilization or add more storage to your system. Reasoning cannot be fully automated, since we must have some common sense knowledge about the IT services we use. There are some approaches to assist people in reasoning tasks though. You could e. g. use so-called “expert systems” to find good logical implications from your analysed data. An “expert system” is software which is fed with data and uses formal logic to calculate logical implications. These systems could be used as tools to support you in taking decisions about your IT architecture and other aspects of your IT service.
- Root Cause Analysis: Sometimes the cloud provider might discover a problem by analysing the data. The response time of the system could increase significantly or there could be a growth trend in the occurrence of outages. As soon as such a problem is discovered, one should identify the underlying cause of the problem. This procedure is called “root cause analysis”. In root cause analysis you repeatedly ask yourself what caused the problem and then what caused the cause of the problem. The recursion involved in root cause analysis has a limited depth. Root cause analysis is also a process which can be performed only manually.
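The idea behind an expert system, namely calculating logical implications from known facts, can be illustrated with a very small forward-chaining sketch. The rule base below is hypothetical and only mirrors the disk utilization example; a real expert system would use a far richer rule language.

```python
def forward_chain(facts, rules):
    """Derive all conclusions reachable from `facts`.

    rules: list of (premises, conclusion) pairs; a rule fires when all of
    its premises are known facts.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known - set(facts)


# Hypothetical rule base for the cloud example.
RULES = [
    (["disk utilization increased significantly"],
     "storage is running short"),
    (["storage is running short"],
     "add storage or limit disk usage"),
    (["response time increased significantly", "storage is running short"],
     "check whether full disks slow down queries"),
]

findings = ["disk utilization increased significantly"]
print(forward_chain(findings, RULES))
```

Such a tool only proposes implications. Deciding whether they make sense for the IT service at hand remains a manual task.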
By reasoning about the results of your statistical tests you create valuable information. This information serves as a base to take corrective actions.
Step 7: Implement corrective action
The last step in the 7-Step Continual Service Improvement process is to take corrective actions. The corrective actions depend on the information generated in step 6. Since the 7-Step process is a continual improvement process, the last step also has the goal of closing the improvement cycle. This is done by aligning the IT improvements to the business strategy.
The first part of this step is to immediately correct errors in the IT environment. These could be programming errors, bad configuration of hardware or software, poor network design or even bad architectural decisions. It is important to document what actions we take to correct the problems and what changes are performed in the IT infrastructure. It is also important to reflect the results of the performed actions in the documentation. If we face problems in the implementation of changes, we should document them. Otherwise it could be that our “corrective” actions destabilize our cloud service and we do not know why everything is getting worse. Therefore we should create a report containing the corrective actions and the results of the performed actions.
The second part of this step is to close the process cycle. This is done by reporting the results of the 7-Step process cycle to business decision makers – usually the managers of the cloud provider. In order to restart the process it must also be defined when and how to start the next cycle. This task is also something which must be coordinated with decision makers. For practical reasons a plan for the next cycle must be created and approved.
Continual Repetition of the Improvement Cycle
Once an improvement process cycle is finished, it must start over again. Therefore new goals for the business strategy must be defined. At this stage we have to take business decisions. A plan for the next cycle must be drawn up. The goal of this business strategy redefinition is to redefine critical success factors as the basis for the next measurement and improvement iteration. The business strategy is the main output and input of the 7-Step Continual Service Improvement process. Fig. 2 shows the whole process as well as the results we get at each step. It also shows that the 7-Step Improvement process influences and is influenced mainly by the cloud provider’s business strategy.
Fig. 2: Results of the 7-Step Continual Service Improvement Process.