When do you need to scale up?

A big issue in cloud computing is knowing when to start more VMs or switch to a more powerful virtual machine so that user requests are processed efficiently. Monitoring system utilization is essential for detecting whether VM utilization is too high to guarantee stable, high-performing IT services. But how can one determine whether a VM infrastructure needs to be scaled up? Part of the answer lies in trend detection algorithms. This article describes two popular ones that can be applied to VM infrastructures.

Autocorrelations and moving averages

If the values in a series of measurements are correlated with earlier values of the same series, the series is said to be “autocorrelated”. If you measure VM utilization several times, you may find that utilization increases or decreases over time. A (linear) regression of the measured values against time will reveal growth trends. If such a trend appears, the average utilization is not constant but moves over time (a “moving average”). The movement of the average causes an autoregressive model to produce errors, because such models assume that the average stays constant. One therefore has to account for the errors produced by the moving average of the measured values.
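The two quantities mentioned above can be computed in a few lines. The following is a minimal sketch in plain Python; the function names `autocorrelation` and `linear_trend` and the sample values are made up for illustration, and the series is assumed to be measured at equally spaced points in time.

```python
from statistics import mean

def autocorrelation(values, lag=1):
    """Sample autocorrelation of a series with itself shifted by `lag` steps."""
    m = mean(values)
    denom = sum((v - m) ** 2 for v in values)
    if denom == 0:
        return 0.0  # a constant series carries no correlation information
    num = sum((values[t] - m) * (values[t - lag] - m)
              for t in range(lag, len(values)))
    return num / denom

def linear_trend(values):
    """Least-squares slope and intercept of the values against 0, 1, 2, ..."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = mean(values)
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    slope = sxy / sxx
    return slope, v_mean - slope * t_mean

# Hypothetical CPU utilization (percent), one value per measurement interval.
cpu = [40, 42, 41, 45, 47, 46, 50, 52, 51, 55]
slope, intercept = linear_trend(cpu)
print("lag-1 autocorrelation:", autocorrelation(cpu))
print("growth per interval:", slope)
```

A positive slope together with a high lag-1 autocorrelation suggests that the average is drifting upwards rather than fluctuating around a fixed level.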

Moving average and autocorrelation are combined in the “AutoRegressive Integrated Moving Average” (ARIMA) model. ARIMA has two advantages: it models the autocorrelation of the series, and it reduces the errors that a changing average would otherwise introduce, by differencing the series (the “integrated” part) and by modelling past forecast errors (the “moving average” part). This makes it a practical model for predicting trends.

When ARIMA is applied to VM utilization, one can predict (with a certain probability) that some utilization threshold will be reached in the future. Defining acceptance criteria for the probability of a growth trend and for reaching a threshold in the future is a major step towards determining the “ideal” point in time at which a VM infrastructure should be scaled up.

Two things must be done:

  1. Define threshold values for VM utilization metrics that tell when a VM is overutilized. One could, e.g., say that if the mean CPU utilization of the last 5 minutes is over 90%, the VM with that CPU is unacceptably overutilized; 90% is then a threshold for VM utilization.
  2. Define a threshold for ARIMA growth trends that lead to VM overutilization (that is, to crossing the utilization threshold from step 1). For this purpose you have to measure the VM utilization metrics and repeatedly calculate growth trends with the ARIMA model. If such a calculation predicts that a utilization threshold will be reached, the VM infrastructure needs to be scaled up.

With threshold values for VM utilization metrics and for ARIMA growth trends, one can construct an event management system that catches VM overutilization problems by repeatedly measuring metrics and calculating growth trends.
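The two steps can be sketched as follows. This is not a full ARIMA implementation: it is a hand-rolled ARIMA(1,1,0), meaning the series is differenced once and a first-order autoregression is fitted to the differences by least squares, and it returns point forecasts without probabilities. A real setup would more likely use a statistics library. All function names, the 90% limit and the forecast horizon are assumptions made for illustration.

```python
from statistics import mean

def is_overutilized(window, limit=90.0):
    """Step 1: mean utilization of the recent window exceeds the limit."""
    return mean(window) > limit

def fit_ar1_on_differences(series):
    """Fit d[t] = c + phi * d[t-1] by least squares on the first differences.

    Needs at least three measurements.
    """
    d = [b - a for a, b in zip(series, series[1:])]
    x, y = d[:-1], d[1:]
    x_mean, y_mean = mean(x), mean(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    if sxx == 0:
        phi = 0.0  # constant differences: pure drift, no autoregression
    else:
        phi = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    c = y_mean - phi * x_mean
    return c, phi, d[-1]

def forecast(series, steps):
    """Point forecasts for the next `steps` measurement intervals."""
    c, phi, last_diff = fit_ar1_on_differences(series)
    level, out = series[-1], []
    for _ in range(steps):
        last_diff = c + phi * last_diff  # predicted next difference
        level += last_diff               # integrate back to the original scale
        out.append(level)
    return out

def steps_until_threshold(series, threshold, horizon):
    """Step 2: first forecast step that reaches the threshold, or None."""
    for step, value in enumerate(forecast(series, horizon), start=1):
        if value >= threshold:
            return step
    return None

# Hypothetical mean CPU utilization (percent) per 5-minute interval.
cpu_history = [50, 52, 54, 56, 58, 60]
print(steps_until_threshold(cpu_history, threshold=90, horizon=30))
```

An event management system could run `steps_until_threshold` after every new measurement and raise an event when the returned number of intervals drops below the time needed to provision additional capacity.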

The advantages of the ARIMA model are:

  • It gives an extrapolated estimation of the future growth trend and tries to assign a value to predicted VM utilization.
  • It takes the fact that average VM utilization changes over time into account by repeatedly calculating a moving average.

The drawbacks of the ARIMA model are:

  • The model states a prediction that appears “exact” to a statistically inexperienced viewer, but in fact it only says that future values will probably lie in the neighbourhood of the predicted values. For any ARIMA prediction it is still possible that the growth trend will not continue. Predicted values can therefore never be regarded as “guaranteed” future values.

Control charts

Another model that can be used to predict the need for upscaling is based on Shewhart control charts. These charts are used in business process management to control process quality on the basis of statistical measurements. The idea behind control charts is the following: we take n repeated samples of i measurements each and calculate the range and the average of each sample. The ranges are plotted as data points in an “R-chart” and the averages in an “X-chart”. For each chart we then calculate the average μ and the standard deviation σ of its n data points. Next we define upper and lower bounds that are regarded as the “natural” limits of the process, and check whether any data points lie above or below these “control limits”. The upper and lower control limits (UCL and LCL) are proportional to the standard error of the plotted values. For the X-chart, the standard error of a sample average is the standard deviation of the individual measurements divided by the square root of the sample size i; the standard deviation σ of the plotted averages estimates it directly. As a rule of thumb, the UCL is the average of all data points plus two times the standard error, and the LCL is the average minus two times the standard error (classical Shewhart charts use three). Having calculated the UCL and LCL for the X- and R-charts, we can check whether any data points lie above the UCL or below the LCL.

Control charts assume that if all data points lie within the UCL and LCL, the process will most likely continue as it is. The process is then said to be “in control”. The interesting thing about control charts is that data points lying outside the UCL or LCL can be used as indicators of process changes. If multiple points lie above the UCL, this can indicate a growth trend.

When control charts are applied to VM utilization, one must first define the sample size i and the number of data points n. Let us say that we want to measure the average CPU utilization of the last 5 minutes. One could, e.g., measure CPU utilization at 20 random points (i=20) in the time interval between 0 and 5 minutes. Then one can calculate the average of the sample as well as the range, which is the difference between the maximum and the minimum of the 20 values. As a result we get one data point for the X-chart and one for the R-chart. One then takes n such samples to populate the X- and R-charts. If we choose n=5, we can compute the standard deviation, standard error and average of all samples. These values can be used to define the UCL and LCL for the process. As a next step we must define a decision criterion for when a process is said to show a growth or decline trend. We could, e.g., say that if 2 or more points lie above the UCL, a growth trend will occur in the future.
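The procedure above can be sketched in plain Python. Two choices in this sketch are assumptions rather than something the text prescribes. First, the limits are placed at the chart average plus or minus k times the standard deviation of the plotted points (k=2 as in the rule of thumb), because the spread of the sample averages already acts as the standard error of a single sample average. Second, the limits are computed from a baseline set of samples and later samples are checked against them, since a point from a set of only five can never lie more than about 1.8 sample standard deviations from that set's own average. The function names and data are hypothetical, and the samples are shortened to 4 measurements for readability.

```python
from statistics import mean, stdev

def chart_points(samples):
    """One X-chart point (average) and one R-chart point (range) per sample."""
    x_points = [mean(s) for s in samples]
    r_points = [max(s) - min(s) for s in samples]
    return x_points, r_points

def control_limits(points, k=2.0):
    """Return (LCL, center, UCL) for one chart from its baseline points."""
    center = mean(points)
    spread = stdev(points)  # sample standard deviation of the plotted points
    return center - k * spread, center, center + k * spread

def points_above(points, ucl):
    """Data points lying above the upper control limit."""
    return [p for p in points if p > ucl]

# Baseline: n=5 samples of CPU utilization (percent), i=4 measurements each.
baseline = [
    [60, 62, 61, 63],
    [59, 61, 64, 60],
    [62, 63, 61, 62],
    [60, 58, 61, 61],
    [63, 62, 60, 64],
]
x_base, r_base = chart_points(baseline)
x_lcl, x_center, x_ucl = control_limits(x_base)

# Later samples are compared against the baseline limits.
later = [[64, 66, 65, 67], [68, 69, 67, 70], [61, 60, 62, 61]]
x_later, _ = chart_points(later)
print("X-chart UCL:", x_ucl)
print("points above UCL:", points_above(x_later, x_ucl))
```

With the decision criterion from the text, two or more entries in the list returned by `points_above` would be read as a growth trend.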

Upscaling is necessary in two cases: when a process contains 2 or more data points above the UCL and the average is near some critical threshold (where the low-performance VM reaches its maximum capacity), or when a process is in control but the UCL lies above the critical threshold. In the first case the next data points will probably lie above the threshold as a result of a growth trend; in the second case future data points can reach the threshold even if the process continues as it is.
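The two conditions can be written as a small decision function. This is only a sketch: the function name `needs_upscaling` is hypothetical, and because the text does not define how close to the threshold counts as “near”, the `margin` parameter (in percentage points) is an assumption.

```python
def needs_upscaling(points, lcl, center, ucl, threshold,
                    min_points=2, margin=5.0):
    """Apply the two upscaling conditions to the points of an X-chart.

    points    -- plotted sample averages
    lcl, ucl  -- lower and upper control limits
    center    -- average of the plotted points
    threshold -- critical utilization where the VM reaches its capacity
    margin    -- assumed meaning of "near": within this distance of the threshold
    """
    above = sum(1 for p in points if p > ucl)
    in_control = all(lcl <= p <= ucl for p in points)

    # Case 1: a growth trend is indicated and the average is near the threshold.
    growth_near_limit = above >= min_points and center >= threshold - margin
    # Case 2: the process is stable, but its natural range reaches the threshold.
    stable_but_too_high = in_control and ucl > threshold

    return growth_near_limit or stable_but_too_high

# Hypothetical values in percent CPU utilization.
print(needs_upscaling([80, 92, 93], lcl=70, center=86, ucl=90, threshold=90))
```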

Control charts are a fairly simple means of predicting process changes. They have the following advantages:

  • Control charts are a relatively reliable indicator of future growth trends and can therefore flag potentially critical growth trends very early.
  • They do not tempt viewers into reading “exact” predictions of future VM utilization values into the results.

Despite these advantages, control charts also have the following drawbacks:

  • They require many parameters to be chosen (e.g. n or i). If those parameters are not chosen well, control charts lead to many “false alarms” that indicate overutilization when there is none.
  • Control charts can predict growth trends, but they do not tell anything about the strength of the growth trend. Therefore they tend to either overestimate small process changes or underestimate large changes. They are insensitive to trend sizes.

Both models, ARIMA and control charts, have advantages and drawbacks. Like many tools, they are only as good as the person who uses them. It is often advisable to test both first and then decide which one to use for predicting VM utilization. Predicting future growth trends is still more an art than a science, so it cannot be decided in general which method is “better”. It is clear, however, that both are better than doing nothing with VM performance measurements.

 

