How reliable are your OpenStack VMs? How many outages do you expect to occur during 8 months of operation? Do your VMs crash regularily, randomly or do VM outages increase over time? These questions can only be answered if we perform a reliability analysis of the virtual machines that we manage. In this small guide we show you how to check reliability of VMs in your OpenStack environment. In part 1 of this 4 part series we explain the basic concepts of reliability engineering.
The vast field of reliability engineering has been used widely in various engineering disciplines like aircraft design, civil engineering, electricity management or product management. Though reliability engineering has proven to help in successfully building high quality engineering products, it has almost never been used in cloud computing so far. There might be some distrust among programmers in these scientifically proven reliability analysis methods, since they involve math and statistical exploration. But with a little introduction this is not a severe problem that we should worry about.
Reliability engineering simply deals with analyzing and measuring the outage behavior of engineered systems, trying out and testing system improvements that make the system more reliable, implementing system improvements and validating if the system improvements have reduced the occurence of outages or not. The first step is the analysis of outage behavior. How can outages be analyzed?