Up to now, we have published several blog posts focusing on the live migration performance in our experimental Openstack deployment – performance analysis of post-copy live migration in Openstack and an analysis of the performance of live migration in Openstack. While we analyzed the live migration behaviour using different live migration algorithms (read our previous blog posts regarding pre-copy and post-copy (hybrid) live migration performance) we observed that both live migration algorithms can easily saturate our 1Gb/s infrastructure and that is not fast enough, not for us! Fortunately, our friends Robayet Nasim and Prof. Andreas Kassler from Karlstad University, Sweden also like their live migrations as fast and reliable as possible, so they kindly offered their 10 Gb/s infrastructure for further performance analysis. Since this topic is very much in line with the objectives of the COST ACROSS action which both we (ICCLab!) and Karlstad are participants of, this analysis was carried out under a 2-week short term scientific mission (STSM) within this action.
This blog post presents a short wrap-up of the results obtained focusing on the evaluation of post-copy live migration in OpenStack using 10Gb/s interfaces and comparing them with the performance of the 1Gb/s setup. The full STSM report can be found here. Continue reading
[This post originally appeared on the XiFi blog – ICCLab@ZHAW is a partner in XiFi and is responsible for operating the Zurich node.]
As with any open compute systems, security is a serious issue which cannot be taken lightly. XiFi takes security seriously and has regular reviews of security issues which arise during node operations.
As well as being reactive to specific incidents, proper security processes require regular upgrading and patching of systems. The Venom threat which was announced in April is real for many of the systems in XiFi as the KVM hypervisor is quite widely used. Consequently, it was necessary to upgrade systems to secure them against this threat. Here we offer a few points on our experience with this quite fundamental upgrade.
The Venom vulnerability exploits a weakness in the Floppy Disk Controller in qemu. Securing systems against Venom requires upgrading to a newer version of qemu (terminating any existing qemu processes and typically restarting the host). In an operational KVM-based system, the VMs are running in qemu environments so a simple qemu upgrade without terminating existing qemu process does not remove the vulnerability; for this reason, upgrading the system with minimal user impact is a little complex.
Our basic approach to perform the upgrade involved evacuating a single host – moving all VMs on that host to other hosts in the system – and then performing the upgrade on that system. As Openstack is not a bulletproof platform as yet, we did this with caution, moving VMs one by one, ensuring that VMs were not affected by the move (by checking network connectivity for those that had public IP address and checking the console for a sample of the remainder). We used the block migration mechanism supported by Openstack – even though this can be somewhat less efficient (depending on configuration), it is more widely applicable and does not require setup of NFS shares between hosts. Overall, this part of the process was quite time-consuming.
Once all VMs had been moved from a host, it was relatively straightforward to upgrade qemu. As we had deployed our node using Mirantis Fuel, we followed the instructions provided by Mirantis to perform the upgrade. For us, there were a couple of points missing in this documentation – there were more package dependencies (not so many – about 10) which we had to install manually from the Mirantis repo. Also, for a deployment with Fuel 5.1.1 (which we had), the documentation erroneously omits an upgrade to one important process – qemu-kvm. Once we had downloaded and installed the packages manually (using dpkg), we could reboot the system and it was then secure.
In this manner, we upgraded all of our hosts and service to the users was not impacted (as far as we know)…and now we wait for the next vulnerability to be discovered!
In our previous blog posts we mostly focused on virtual machine live migration performance comparing pre-copy, post-copy and hybrid approaches in an Openstack context rather than exploring other live migration features. Libvirt together with the Qemu hypervisor provides many migration configuration options. One of these options is a possibility to use tunneled live migration. Recently we found that the current libvirt tunneling implementation is not supported in post-copy migration. Consequently, In order to make the post-copy patch more production ready we decided to support the community and add support for post-copy tunneled live migration to libvirt on our own. This blog post describes the whole story of immersing ourselves into the open source community and hacking an established open source project since we believe this experience can be generalized. Continue reading
We have shed blood, sweat and tears trying to get post-copy live migration working in OpenStack over the last month: this blogpost explains all the necessary steps to make it work and hopefully it will save future post-copy pioneers some pain. In our previous blog post we focused on setting it up in QEMU. This time we consider the bigger picture spanning all the levels of Openstack from system kernel requirements to setting up the right flags in the Nova configuration file in the OpenStack environments running QEMU / KVM with Libvirt. Continue reading
Hurray! We have finally deployed QEMU 2.1.5 with post-copy live migration support on our servers! But before we get to that, a little bit of context… in our previous blog posts we focused on performance analysis of pre-copy live migration in Openstack. So far all of our experiments were done using QEMU version 1.2 with KVM acceleration. As we were keen to do some experimentation with post-copy live migration, we had to upgrade to the very new QEMU 2.1.5 which provides post-copy live migration support in one of its branches. (More generally, there have been significant enhancements in QEMU since version 1.2 – of November 2012 – and hence we expected better performance and reliability in pre-copy as well). This blog post focuses on our first hands-on experience with post-copy live migration in QEMU.