We continue our recent work analysing the performance of live migration in OpenStack Icehouse. Our previous results focused on block live migration in OpenStack, i.e. without shared storage configured between the compute nodes. In this post we focus on the performance of live migration with a shared file system configured, compare it with block live migration, and try to determine which scenarios suit each approach. [Note that throughout this post, "live migration" means shared-storage-based live migration; block live migration is always named explicitly, following the OpenStack literature on this topic.]
[Also note that all VMs in the following experiments use 5 GB of disk space.]
All system configuration and testing aspects (except the amended flavors) remain the same as in our previous block live migration tests (see here for more details). The only infrastructure change in our small 3-node configuration is the shared file system: an NFS server running on the controller node exports its ../nova/instances folder, which the compute nodes mount.
Let’s start with the time taken by live migration of unloaded VMs. In Table 1 we see a small increase in migration time with larger VM flavors, which is not surprising. More interestingly, the downtime stays almost constant (< 1 s) and, unlike in the block live migration case, does not grow with the migration time.
|Migration time (unloaded) [s]|4.9|5.6|7.0|9.6|15.3|
|Downtime (unloaded) [s]|0.7|0.7|0.8|0.8|0.9|
Table 1 – Live migration duration & downtime
In all cases with shared storage, live migration of unloaded VMs exhibits significantly lower migration time and downtime than block live migration. More precisely, the NFS-based live migration time is on average only 60% of the block live migration time (see Table 2), and the VM downtime is on average 17% of the downtime with block migration (see Table 3), staying below 1 s in our experiments.
|Live migration (unloaded) [s]|4.9|5.6|7.0|9.6|15.3|
|Block live migration (unloaded) [s]|9.3|10.5|11.7|14.6|21.8|
Table 2 – Migration time (unloaded) – live migration vs. block live migration
|Live migration (unloaded) [s]|0.7|0.7|0.8|0.8|0.9|
|Block live migration (unloaded) [s]|2.4|3.2|4.4|7.0|21.5|
Table 3 – Downtime (unloaded) – live migration vs. block live migration
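As a quick sanity check, the average ratios quoted above can be reproduced from the data in Tables 2 and 3 (a standalone calculation, not part of our measurement setup):

```python
# Measured values from Tables 2 and 3 (seconds), smallest to largest flavor.
nfs_time = [4.9, 5.6, 7.0, 9.6, 15.3]
block_time = [9.3, 10.5, 11.7, 14.6, 21.8]
nfs_down = [0.7, 0.7, 0.8, 0.8, 0.9]
block_down = [2.4, 3.2, 4.4, 7.0, 21.5]

def mean_ratio(a, b):
    """Average of the per-flavor ratios a[i] / b[i]."""
    return sum(x / y for x, y in zip(a, b)) / len(a)

print(f"migration time ratio: {mean_ratio(nfs_time, block_time):.0%}")  # ~60%
print(f"downtime ratio:       {mean_ratio(nfs_down, block_down):.0%}")  # ~17%
```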
In the unloaded VM scenario, the amount of data transferred over the network increases with the VM size; however, even these values are significantly smaller than in the block live migration case: network traffic is 11–30% lower with shared-storage-based migration than with block migration.
|Live migration (unloaded) [MB]|175|190|288|315|480|
|Block live migration (unloaded) [MB]|239|271|324|404|669|
Table 4 – Data transferred (unloaded) – live migration vs. block live migration
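The 11–30% range quoted above can be checked against the per-flavor values in the table (again a standalone calculation, not a measurement):

```python
# Data transferred (MB) from the table above, smallest to largest flavor.
nfs_mb = [175, 190, 288, 315, 480]
block_mb = [239, 271, 324, 404, 669]

# Fractional reduction in network traffic for each flavor.
reductions = [1 - n / b for n, b in zip(nfs_mb, block_mb)]
print([f"{r:.0%}" for r in reductions])
print(f"range: {min(reductions):.0%} - {max(reductions):.0%}")
```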
The behaviour of live migration for loaded machines was a little more complex.
Using block live migration, we successfully migrated VMs across all flavors even when they were quite heavily loaded – in our tests we focused on memory load (as described here) and achieved successful migrations even at approximately 75% load. Shared-storage-based live migration, however, was much more sensitive to memory load. In fact, we couldn’t migrate even a tiny-flavor VM with a relatively small memory load – 100 MB of stressed memory (out of 512 MB in total). The migration was initiated but did not complete until the load was stopped. Consequently, we can’t present results for the loaded case that are directly comparable to block live migration.
To better understand why migration time is non-deterministic, we need to dive a bit deeper into the migration mechanism. In our configuration of OpenStack with QEMU/KVM and libvirt, live migration uses the pre-copy memory approach, which consists of the following steps:
1. Copy the source VM’s memory to the destination. This takes some time, during which the VM is still running (and some memory pages keep changing) at the source.
2. Repeat step 1 for the pages dirtied in the meantime, until the amount of dirty memory is small enough to be transferred within a very short VM downtime.
3. Stop the VM at the source and copy the remaining dirty memory.
4. Start the VM at the destination.
The problem arises at step 2: if the VM’s memory is dirtied faster than it can be transferred to the destination, this step never terminates. Whether a migration succeeds or fails therefore depends on the available network capacity and the VM’s current memory load.
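This convergence condition can be illustrated with a toy model of the pre-copy loop. The downtime threshold, round structure, and dirty rates below are illustrative assumptions, not measured values; only the ~110 MB/s bandwidth corresponds to our network:

```python
def precopy_rounds(ram_mb, dirty_mb_per_s, bw_mb_per_s,
                   downtime_threshold_mb=50, max_rounds=30):
    """Simulate pre-copy iterations; return the number of rounds until the
    remaining dirty memory fits in the allowed downtime, or None if the
    migration never converges."""
    remaining = ram_mb
    for rnd in range(1, max_rounds + 1):
        transfer_time = remaining / bw_mb_per_s    # one copy pass (steps 1-2)
        remaining = dirty_mb_per_s * transfer_time # pages dirtied meanwhile
        if remaining <= downtime_threshold_mb:     # small enough for step 3
            return rnd
    return None                                    # dirtying outpaces the link

# Dirtying slower than the link: each pass shrinks the dirty set, so the
# migration converges after a few rounds.
print(precopy_rounds(512, dirty_mb_per_s=60, bw_mb_per_s=110))
# Dirtying faster than the link: each pass leaves *more* dirty memory.
print(precopy_rounds(512, dirty_mb_per_s=150, bw_mb_per_s=110))  # None
```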
To explore this further, we ran VMs with different memory loads and examined how the load affects migration time and the amount of transferred data. Up to a stress level of 96 MB, all migrations succeeded with a relatively small, constant downtime (< 1 s) and a short migration duration (mostly < 20 s). Beyond this point, however, the VM got stuck in the migrating state and did not stop migrating until the migration was killed or the memory load reduced. The measured throughput of our network was approximately 110 MB/s, which is why we could migrate VMs with memory loads below this value and failed with VMs whose memory stress rate exceeded this threshold.
|Stressed memory [MB]|0|16|32|48|64|80|96|
|Migration time [s]|15.0|15.3|16.3|17.3|18.5|20.8|20.8|
Table 5 – Migration time with different levels of memory stress
|Stressed memory [MB]|0|16|32|48|64|80|96|
|Data transferred [MB]|472|510|541|573|635|770|943|
Table 6 – Data transferred with different levels of memory stress
OpenStack live migration is a fast technique for transferring a running VM from one host to another with small downtime. But since its duration is non-deterministic and there is no guarantee that the migration process finishes successfully, care must be taken when employing this approach: your particular network configuration and load, as well as the VM’s activity, must be considered.
Key takeaways from our experiments are:
- Live migration can be faster, with lower downtime, than block live migration, but it can be unreliable when the VM has memory-intensive activity. In our experiments, downtime was low (< 1 s) and migration time ranged from seconds to a few tens of seconds.
- Block live migration exhibited better reliability but longer migration times and VM downtimes (up to some hundreds of seconds in both cases);
- In both cases, a significant network load can be imposed on the system. This needs to be borne in mind: the results presented above do not take into account typical network activity within a cluster, and when that is taken into account, live migration is likely to be even less reliable.
Live migration is a useful tool in operating and optimizing your cloud infrastructure; however, it is inherently complex and must be used with care.
In future work, we will test live migration in further scenarios that are closer to real-world conditions. We are also investigating the use of post-copy migration and how it may perform.