[This post originally appeared on the XiFi blog – ICCLab@ZHAW is a partner in XiFi and is responsible for operating the Zurich node.]
As with any open compute system, security is a serious issue that cannot be taken lightly. XiFi takes security seriously and regularly reviews the security issues that arise during node operations.
As well as being reactive to specific incidents, proper security processes require regular upgrading and patching of systems. The Venom threat which was announced in April is real for many of the systems in XiFi as the KVM hypervisor is quite widely used. Consequently, it was necessary to upgrade systems to secure them against this threat. Here we offer a few points on our experience with this quite fundamental upgrade.
The Venom vulnerability exploits a weakness in the virtual Floppy Disk Controller in qemu. Securing systems against Venom requires upgrading to a newer version of qemu, terminating any existing qemu processes and typically restarting the host. In an operational KVM-based system, the VMs run inside qemu processes, so simply upgrading qemu without terminating the existing processes does not remove the vulnerability; for this reason, upgrading the system with minimal user impact is a little complex.
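To illustrate why the package upgrade alone is not enough: any qemu process that was started before the upgrade is still executing the old code. The following is a minimal sketch, assuming we already know the start time of each qemu process and the time of the package upgrade. The `QemuProcess` type, the PIDs, the VM names and the dates are all hypothetical and are not output from any real tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class QemuProcess:
    pid: int
    vm_name: str
    started: datetime


def stale_processes(processes, upgraded_at):
    """Return the processes started before the upgrade.

    These are still running the old (vulnerable) qemu binary, so the
    VMs they host must be restarted or migrated away.
    """
    return [p for p in processes if p.started < upgraded_at]


# Illustrative data only.
upgrade_time = datetime(2015, 6, 1, 12, 0)
procs = [
    QemuProcess(4101, "vm-a", datetime(2015, 3, 14, 9, 30)),
    QemuProcess(4102, "vm-b", datetime(2015, 6, 1, 12, 45)),
]

for p in stale_processes(procs, upgrade_time):
    print(f"{p.vm_name} (pid {p.pid}) must be restarted or migrated")
```

Here only `vm-a` is reported, because `vm-b` was started after the upgrade and already runs the new binary.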
Our basic approach to performing the upgrade involved evacuating a single host (moving all VMs on that host to other hosts in the system) and then performing the upgrade on that host. As OpenStack is not yet a bulletproof platform, we did this with caution, moving VMs one by one and ensuring that they were not affected by the move: we checked network connectivity for those that had a public IP address and checked the console for a sample of the remainder. We used the block migration mechanism supported by OpenStack; even though this can be somewhat less efficient (depending on configuration), it is more widely applicable and does not require setting up NFS shares between hosts. Overall, this part of the process was quite time-consuming.
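The evacuation step can be thought of as a small placement problem: every VM on the host being emptied needs a target host with enough spare capacity. Below is a rough sketch of such a plan, assuming RAM is the only constraint and that free capacity per host is known up front. The function name, host names and sizes are hypothetical, and this is not the procedure we actually scripted.

```python
def plan_evacuation(vms, source_host, free_ram_mb):
    """Assign each VM on source_host to another host.

    vms:          list of (vm_name, ram_mb) tuples on source_host
    free_ram_mb:  dict mapping host name -> free RAM in MB
    Returns a list of (vm_name, target_host) in migration order.
    Raises RuntimeError if some VM does not fit on any other host.
    """
    # Never place a VM back on the host we are emptying.
    free = {h: ram for h, ram in free_ram_mb.items() if h != source_host}
    if not free:
        raise RuntimeError("no other host available")

    plan = []
    # Place the largest VMs first so they are not left without room.
    for name, ram in sorted(vms, key=lambda vm: vm[1], reverse=True):
        target = max(free, key=free.get)  # host with the most free RAM
        if free[target] < ram:
            raise RuntimeError(f"no host has room for {name}")
        free[target] -= ram
        plan.append((name, target))
    return plan


# Illustrative data only.
vms_on_node1 = [("web", 4096), ("db", 8192), ("cache", 2048)]
capacity = {"node-1": 0, "node-2": 10240, "node-3": 8192}

for vm, host in plan_evacuation(vms_on_node1, "node-1", capacity):
    print(f"migrate {vm} -> {host}, then verify connectivity or console")
```

Each step of the plan would then be issued as one block migration followed by a manual check before moving on to the next VM, for example with the nova client's `live-migration` command and its `--block-migrate` option.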
Once all VMs had been moved from a host, it was relatively straightforward to upgrade qemu. As we had deployed our node using Mirantis Fuel, we followed the instructions provided by Mirantis to perform the upgrade. For us, a couple of points were missing from this documentation: there were additional package dependencies (about 10) which we had to install manually from the Mirantis repo. Also, for a deployment with Fuel 5.1.1 (which we had), the documentation erroneously omits the upgrade of one important package, qemu-kvm. Once we had downloaded and installed the packages manually (using dpkg), we could reboot the system and it was then secure.
In this manner, we upgraded all of our hosts, and service to the users was not impacted (as far as we know)… and now we wait for the next vulnerability to be discovered!
There are many different technologies which can increase the availability of a cloud infrastructure. In our newest Techscouting paper we evaluate several HA technologies in order to define an HA architecture for an OpenStack deployment which is part of the XiFi project. HA technologies can be grouped into the following classes:
- Resource monitors that check if IT-services are alive and (sometimes automatically) recover them in case of failure.
- Load balancers that direct end user requests to those resources that are still alive and show reasonable performance.
- Distributed disks and file systems that increase redundancy of data and help to prevent data loss in case of failure.
- Distributed databases which help to prevent loss of database records.
Every OpenStack component exists to deliver a service to an end user. The availability of a cloud instance therefore depends on the availability of the delivered services as perceived by end users. If we want to use an HA technology to increase the availability of OpenStack, we have to analyze how end user services depend on IT and infrastructure components. To do this, we created a dependability model of the provided IT services and the business services consumed by end users.
As availability always depends on the requirements defined by end users, we surveyed several OpenStack end users about the importance of each business service. End users tended to rate “Infrastructure Management” and “Security Management” as the most important services, so we had to ensure that these services have high availability levels.
By linking the importance of each service to the IT components that provide it, we can assign a target availability level to each component. Furthermore, we can compare several HA architectures with each other and check the availability levels they can achieve. We built several fault tree diagrams that represent how component failures lead to service outages.
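As a simplified illustration of this linking step, the sketch below gives every component at least the target availability of the most demanding service that depends on it. The target values, the "Reporting" service and the service-to-component mapping are hypothetical and are not taken from our survey or our dependability model. Note also that this is only a lower bound: a service that needs several components at once requires each of them to be more available than the service itself.

```python
# Hypothetical target availability per business service.
service_targets = {
    "Infrastructure Management": 0.999,
    "Security Management": 0.999,
    "Reporting": 0.99,
}

# Hypothetical mapping from business service to the components it needs.
service_components = {
    "Infrastructure Management": ["nova-api", "mysql"],
    "Security Management": ["keystone", "mysql"],
    "Reporting": ["ceilometer", "mysql"],
}


def component_targets(service_targets, service_components):
    """Give each component the highest target of any service using it."""
    targets = {}
    for service, components in service_components.items():
        for component in components:
            current = targets.get(component, 0.0)
            targets[component] = max(current, service_targets[service])
    return targets


for component, target in sorted(
    component_targets(service_targets, service_components).items()
):
    print(f"{component}: at least {target:.3f}")
```

A shared component such as the database inherits the strictest target of all the services that rely on it, which is why shared components are natural first candidates for HA measures.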
A simulation of service outages for given component failure rates revealed that adding HA technologies to OpenStack can add up to 7 to 8 percentage points to the average availability level of the provided services.
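The basic arithmetic behind such a fault tree can be sketched in a few lines. This is an analytic toy example rather than the simulation from the paper: it assumes that component failures are independent, and the availability figures are made up for illustration.

```python
from math import prod


def all_required(*availabilities):
    """Availability of a service that needs every input to be up.

    In fault tree terms this is an OR gate on failures: any single
    component failure causes the outage.
    """
    return prod(availabilities)


def redundant(*availabilities):
    """Availability of a group where one working replica is enough.

    In fault tree terms this is an AND gate on failures: the group
    fails only if all replicas fail at the same time.
    """
    return 1 - prod(1 - a for a in availabilities)


# Assumed component availabilities, for illustration only.
api, db = 0.98, 0.97

single = all_required(api, db)
with_ha = all_required(redundant(api, api), redundant(db, db))

print(f"without redundancy: {single:.4f}")
print(f"with two replicas:  {with_ha:.4f}")
print(f"gain: {(with_ha - single) * 100:.1f} percentage points")
```

With these assumed inputs, duplicating both components lifts the service from roughly 95% to roughly 99.9% availability. The gain in a real deployment depends entirely on the actual failure rates and on how independent the replicas really are.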
We tested several technologies from each of the HA technology classes. Our evaluation covered the opportunities and risks associated with implementing each technology, as well as its technological maturity. We assigned each technology an opportunity, a risk and a maturity score.
The result of our evaluation is that we prefer to use keepalived, HAProxy, Ceph/RADOS and MySQL Galera as HA technologies to improve the availability of our OpenStack installation. These technologies are all open source. We preferred them because their performance is not significantly lower than that of commercial products, and they are available for free. The final HA architecture is able to increase the availability levels of all OpenStack services to up to three nines (99.9%), which is a very high availability level in cloud computing.
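To put "three nines" into perspective, an availability level can be converted into the downtime it allows per year. The short calculation below is plain arithmetic and assumes a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability):
    """Hours per year a service may be down at the given availability."""
    return (1 - availability) * HOURS_PER_YEAR


for label, availability in [("two nines", 0.99), ("three nines", 0.999)]:
    hours = downtime_hours_per_year(availability)
    print(f"{label} ({availability:.1%}): {hours:.2f} hours of downtime per year")
```

Three nines therefore allows about 8.76 hours of downtime per year, compared with about 87.6 hours at two nines.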
Another organization may well come to different conclusions when selecting a concrete implementation of an HA technology. However, the evaluation methodology used in our paper shows how to make better-founded technology choices by linking end user requirements to system architecture characteristics and by rating several architectural alternatives according to the availability levels that are reasonably achievable.
The ICCLab now has a new testbed for its work and research in the cloud computing field, located at none other than the Zurich datacenters of Equinix, one of our collaboration partners and the generous donor of the rack space.