For the Zurich FIWARE node, we’re setting up a Kilo High Availability (HA) deployment – we’re transitioning from our current Icehouse (non-HA) deployment.

Kilo HA is recommended as there is a general understanding within the project that the HA capabilities are now ready for production use. However, there is no single Kilo HA – there are many different configurations which can be called HA – and in this post, we describe some of the points we encountered while setting up our HA node.

We deployed Mirantis OpenStack v7.0 using the Fuel deployment tool, which is the tool used within the project and one we have used before. Since we required an HA deployment, we selected the HA configuration in Fuel, with 3 controller nodes providing HA. The deployment did not terminate cleanly, failing some Astute-based post-deployment tests; however, these issues were minor and the system behaved in a sane manner.

The deployment seemed to work fine initially, but we made some changes and after some time it entered an unexpected state and networking functions were unstable: at this point, we reviewed the system configuration and were a little surprised at what we saw.

But before explaining the configuration, it is first necessary to understand that Neutron (Kilo) provides three types of routers:

  • Legacy routers: these routers operate on a single neutron node and are exactly those that are used in non-HA deployments. In an HA deployment, resilience is provided by a heartbeat mechanism which determines whether an L3 agent is down; if it is, the routers it was hosting are rescheduled onto alternative network nodes.
  • HA routers: these routers operate across multiple neutron nodes (having a qrouter on each node) in an active/passive mode, with one of the neutron nodes hosting the active router. VRRP, implemented with keepalived, is used to pass control to another router instance in case of failure.
  • Distributed Virtual Routers (DVRs): these are more complex routers which also operate across multiple neutron nodes; in this case, however, the load is distributed across all router instances during operation rather than being carried by a single active router.

The view within the project is that DVR is still a little too experimental for production scenarios and consequently the options are legacy or HA routers.

To our surprise, the HA Fuel deployment employed legacy routers and not HA routers – in neutron.conf, the following configurations were set:

l3_ha = false
allow_automatic_l3agent_failover = true

We experimented with the use of HA routers instead of legacy routers, as these should result in a more rapid failover since the router config is already present on all nodes. Setting the l3_ha parameter to true in neutron.conf makes newly created routers HA routers by default. (Note that the neutron.conf parameter is not absolute: it is also possible to have neutron create legacy routers by default and to specify at router creation time that a particular router should be an HA router, using neutron router-create --ha=True.)
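As an illustration of how these two parameters interact, here is a minimal Python sketch that reads neutron.conf-style text and reports which router type new routers get by default. It is not part of our deployment tooling: the function name describe_router_defaults and the SAMPLE text are hypothetical, the sketch assumes both options live in the [DEFAULT] section, and it assumes an unset option behaves as false.

```python
import configparser

# Hypothetical sample mirroring the two settings quoted above,
# placed under [DEFAULT] (an assumption about where they live).
SAMPLE = """\
[DEFAULT]
l3_ha = false
allow_automatic_l3agent_failover = true
"""


def describe_router_defaults(conf_text):
    """Summarise the router-related defaults found in neutron.conf text."""
    # interpolation=None: treat '%' literally; strict=False: tolerate
    # options that appear more than once in the file.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)

    # Assumption: an option that is absent is treated as false.
    l3_ha = parser.getboolean("DEFAULT", "l3_ha", fallback=False)
    failover = parser.getboolean(
        "DEFAULT", "allow_automatic_l3agent_failover", fallback=False
    )
    return {
        # Type given to routers created without an explicit --ha flag.
        "default_router_type": "ha" if l3_ha else "legacy",
        # Whether legacy routers are rescheduled when an L3 agent dies.
        "legacy_rescheduling": failover,
    }


if __name__ == "__main__":
    print(describe_router_defaults(SAMPLE))
```

Run against the settings Fuel gave us, this reports legacy routers with rescheduling enabled, which matches the behaviour described above.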

We did have significant problems with the HA routers, however. We found that they were not generating gratuitous ARPs (G-ARPs) and sending them to our gateway when a new floating IP address was assigned. We used a mix of the standard tcpdump and ip netns tools on both the Linux bridges and the OVS bridges to poke around inside neutron and see what was going on. The essential test was to check what was going out on br-ex: we saw no G-ARPs going out on this interface when a floating IP was assigned. We also observed issues inside other bridges (br-int specifically), in which there were mismatches between assigned IP addresses and MAC addresses. In a test/pre-deploy environment in which floating IPs were changing quite dynamically, this resulted in our gateway quickly losing track of the mapping from floating IP to the appropriate MAC address, and hence in lost connectivity.
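For readers unfamiliar with what we were looking for on br-ex: a gratuitous ARP is an ARP packet in which the sender and target IP addresses are the same, so that neighbours refresh their IP-to-MAC mapping. The following Python sketch shows that check on a raw Ethernet frame. It is an illustration only and not the tooling we used (we used tcpdump); the function names and the example addresses are hypothetical, and it assumes untagged Ethernet frames carrying IPv4 ARP.

```python
import socket
import struct

ETHERTYPE_ARP = 0x0806
BROADCAST_MAC = b"\xff" * 6


def build_arp(sender_mac, sender_ip, target_ip, oper=1):
    """Build a broadcast Ethernet frame carrying an IPv4 ARP packet."""
    eth = BROADCAST_MAC + sender_mac + struct.pack("!H", ETHERTYPE_ARP)
    # htype=1 (Ethernet), ptype=0x0800 (IPv4), hlen=6, plen=4, operation
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, oper)
    arp += sender_mac + socket.inet_aton(sender_ip)
    arp += b"\x00" * 6 + socket.inet_aton(target_ip)
    return eth + arp


def is_gratuitous_arp(frame):
    """Return True if an untagged Ethernet frame is an IPv4 gratuitous ARP."""
    if len(frame) < 42:  # 14-byte Ethernet header + 28-byte ARP payload
        return False
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETHERTYPE_ARP:
        return False
    htype, ptype, hlen, plen, oper = struct.unpack("!HHBBH", frame[14:22])
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return False
    sender_ip = frame[28:32]
    target_ip = frame[38:42]
    # Gratuitous ARP: request (1) or reply (2) announcing the sender's own IP.
    return oper in (1, 2) and sender_ip == target_ip


if __name__ == "__main__":
    mac = bytes.fromhex("fa163e000001")  # hypothetical router port MAC
    announce = build_arp(mac, "192.0.2.10", "192.0.2.10")
    lookup = build_arp(mac, "192.0.2.10", "192.0.2.1")
    print(is_gratuitous_arp(announce), is_gratuitous_arp(lookup))
```

In our case, the point was that no frame of the first kind appeared on br-ex after a floating IP was assigned, so the gateway kept a stale MAC address for that IP.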

Ultimately, we decided to stop working with HA routers and reverted to legacy routers. For our modest context, the legacy neutron router failed over in about a minute when we shut down one of our neutron nodes – this is good enough for us for now. In future, we will explore the alternative options as they mature.
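A failover time like the one quoted above can be estimated by probing an address behind the router at regular intervals and looking for the longest run of failed probes. The Python sketch below shows only the bookkeeping; the function name longest_outage and the sample data are hypothetical and are not measurements from our node, and how the probes are made (ping, TCP connect, and so on) is left open.

```python
def longest_outage(samples):
    """Return the longest unreachable period, in seconds.

    samples: iterable of (timestamp_seconds, reachable) pairs in time order.
    An outage runs from the first failed probe to the next successful one;
    an outage still in progress at the end of the data is not counted.
    """
    longest = 0.0
    down_since = None
    for timestamp, reachable in samples:
        if not reachable and down_since is None:
            down_since = timestamp  # outage begins
        elif reachable and down_since is not None:
            longest = max(longest, timestamp - down_since)  # outage ends
            down_since = None
    return longest


if __name__ == "__main__":
    # Synthetic probe results: reachable, then down from t=10 until t=70.
    probes = [(0, True), (5, True), (10, False), (40, False), (70, True)]
    print(longest_outage(probes))
```

The resolution of the estimate is bounded by the probe interval, so probes should be spaced much more closely than the failover time being measured.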