After having completed part 1 of our series about reliability analysis, we now start with our first reliability measurement experiment. According to reliabili theory there are three things we could measure: survival probability, hazard rate and failure rate. The last one is the easiest one in practice. Therefore we design an experiment to measure the failure rate of OpenStack VMs under heavy load.

Failure rates can be constant, ascending or declining over time. In order to measure the general tendency of a failure rate we have to perform a time series analysis. We start up several OpenStack VMs, put them under stress by running a certain task on them and then count how many of the VMs are still alive after a certain amount of time. The stress task is performed several times on the same VMs and the number of machines that are still alive is counted repeatedly in order to get a time series of failure rates.

Continue reading