In part 3 of our article series “Dependability Modeling on OpenStack” we discussed that we should run Chaos Monkey tests on an OpenStack HA installation and then collect data about the impact of each attack. While we said that we want to collect data about the implemented OpenStack HA architecture, we were not specific about which data we should actually collect. This article gives some hints on what is important when collecting data about HA system architectures.

What should be measured?

The first question is what should be measured during a Chaos Monkey test run. The Dependability Modeling Framework is used to measure how well a system architecture keeps the impact of system outages low. Therefore we should measure the impact of outages. The impact is a score derived from the dependability graph, and it should be recorded as the result of each test run.

What is analysed in Dependability Modeling?

In Dependability Modeling we are interested in correlations between the system architecture and the outage impact. The system architecture data is mainly categorical (replication technology used, clustering technology etc.), while the impact is a number. All variables that describe the system architecture are treated as “explanatory” or “independent” variables, i.e. variables that can be chosen freely in the simulation. The impact of outages is the “explained” (or “dependent”) variable, because it is assumed to be the result of the chosen architecture. In order to find significant correlations between system architecture properties and impact, we must collect values for all explanatory variables and then use a dimensionality reduction method to find out which properties are relevant.
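To make the distinction between explanatory and dependent variables concrete, one observation per test run could be stored as a set of categorical architecture properties plus a numeric impact score. The following is only a minimal sketch: the property names (`replication`, `clustering`), their values and the impact scores are hypothetical and invented for illustration, not measured data. It one-hot encodes the categorical variables so that they can be handed to a numerical analysis method.

```python
# Hypothetical observations: property names, values and impact scores
# are illustrative assumptions, not real measurements.
observations = [
    {"replication": "drbd", "clustering": "pacemaker", "impact": 12.0},
    {"replication": "galera", "clustering": "pacemaker", "impact": 7.5},
    {"replication": "galera", "clustering": "keepalived", "impact": 9.0},
]


def one_hot_encode(rows, dependent="impact"):
    """Turn categorical explanatory variables into 0/1 columns.

    Returns the column names, the encoded matrix X (one row per
    observation) and the vector y holding the dependent variable.
    """
    # Collect every (variable, value) pair that occurs in the data.
    columns = sorted(
        {(key, value) for row in rows for key, value in row.items() if key != dependent}
    )
    names = [f"{key}={value}" for key, value in columns]
    # A cell is 1 if the observation has that value for that variable.
    X = [[1 if row.get(key) == value else 0 for key, value in columns] for row in rows]
    y = [row[dependent] for row in rows]
    return names, X, y


names, X, y = one_hot_encode(observations)
```

Each row of `X` then describes one architecture variant and the matching entry of `y` its impact, which is the shape most dimensionality reduction and regression tools expect.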

How much data should be collected?

First we must say that it is not bad practice to collect “too much” data in a test or a scientific experiment. Classical statistics usually works with small samples. The reason is that classical statistics was developed in the 19th century, a time when measurements were expensive and statements about data sets had to be derived from small samples. Nowadays we can collect data automatically, so we are not forced to use small samples. We can simulate the whole life cycle of a cloud service, e.g. we could say that an OpenStack service will run for about 8 years, which is 8 x 365 = 2’920 days, and run one Chaos Monkey test for each day. The advantage of automation is that we do not need to rely on samples.
Of course there is a limitation in terms of computational power: a Chaos Monkey test takes about 0.5-1.5 seconds. If we run 2’920 Chaos Monkey tests, the whole simulation run can take up to 2’920 x 1.5 = 4’380 seconds, which is more than 1 hour. Therefore you either run the simulation as an overnight batch job or limit it to a sample whose size adequately represents the overall population. To determine the optimal sample size you could use variance estimation. The sample size can then be obtained from the standard statistical formula for sample size calculation, which relates the estimated variance to the desired margin of error and confidence level.
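The runtime estimate and the sample size calculation can be sketched in a few lines. This is only one possible approach: the article does not say which sample size formula is meant, so the sketch assumes the common formula for estimating a mean, n0 = z² · s² / E², with an optional finite population correction. The pilot impact scores and the margin of error are hypothetical values chosen for illustration.

```python
import math
from statistics import NormalDist, variance


def runtime_bounds(n_tests, min_seconds=0.5, max_seconds=1.5):
    """Return the (lower, upper) bound of the total runtime in seconds."""
    return n_tests * min_seconds, n_tests * max_seconds


def required_sample_size(pilot_impacts, margin_of_error,
                         confidence=0.95, population_size=None):
    """Estimate how many test runs are needed to estimate the mean impact.

    Assumes the formula n0 = z^2 * s^2 / E^2, where s^2 is the sample
    variance of a small pilot run and E the accepted margin of error.
    """
    s2 = variance(pilot_impacts)  # sample variance of the pilot run
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided z-score
    n0 = z ** 2 * s2 / margin_of_error ** 2
    if population_size is not None:
        # Finite population correction: the population is limited,
        # e.g. to the 2'920 simulated days.
        n0 = n0 / (1.0 + (n0 - 1.0) / population_size)
    return max(1, math.ceil(n0))


# Hypothetical impact scores from a short pilot run (not real data).
pilot = [8.0, 10.0, 12.0, 9.0, 11.0]

lower, upper = runtime_bounds(2920)
n_unbounded = required_sample_size(pilot, margin_of_error=0.5)
n_corrected = required_sample_size(pilot, margin_of_error=0.5, population_size=2920)
```

If the required sample size turns out to be close to the full 2’920 runs, sampling saves little and the overnight batch job is the simpler choice.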

With that specification, we can proceed to develop our test framework. A further article will show a sample data set.