We decided to use Kolla-Ansible for Openstack management approximately a year ago, and we have just gone through the process of upgrading from Ocata to Pike to Queens. Here we provide a few notes on the experience.
By way of some context: ours is a moderate-sized system with 3 storage nodes, 7 compute nodes and 3 controllers configured in HA. Our systems were running CentOS 7.5 with Docker engine 17.05.0-ce and we were using the centos-binary Kolla containers. As we are an academic institution, usage of our system peaks during term time – performing the upgrade during the summer meant that system utilization was modest. As we are lucky enough to have tolerant users, we were not excessively concerned with ensuring minimal system downtime.
We had done some homework on test systems in different configurations and had gained some confidence with the Kolla-Ansible Ocata-Pike-Queens upgrade – we even managed to ‘upgrade’ from a set of centos containers to ubuntu containers without problems. We had also upgraded a smaller, newer system that is in active use, and that went smoothly. However, we still had a little apprehension about performing the upgrade on the larger system.
In general, we found Kolla-Ansible worked well and we were able to perform the upgrade without too much difficulty. However, it is not an entirely hands-off operation: it did require some intervention, for which good knowledge of both Openstack and Kolla was necessary.
Our workflow was straightforward, comprising the following three stages:
- generate the three configuration files `passwords.yml`, `globals.yml` and `multinode.ha`
- pull down all containers to the nodes using `kolla-ansible pull`
- perform the upgrade using `kolla-ansible upgrade`
We generated the `globals.yml` and `passwords.yml` config files by copying the empty config files from the appropriate kolla-ansible git branch to our `/etc/kolla` directory, comparing them with the files used in the previous deploy and copying changes from the previous versions into the new config files. We used the approach described here to generate the correct `passwords.yml` file.
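The linked approach boils down to generating values for any passwords that are new in the target release and then merging the old values back in. A minimal sketch, assuming the `kolla-genpwd` and `kolla-mergepwd` tools that ship with kolla-ansible and the paths we use elsewhere in this post (the `.pike` backup filename is purely illustrative):

```
# Start from the empty passwords file of the target (Queens) branch
cp /opt/kolla-ansible/etc/kolla/passwords.yml /etc/kolla/passwords.yml

# Generate values for any passwords that are new in this release
kolla-genpwd -p /etc/kolla/passwords.yml

# Merge the passwords from the previous deploy back in, so live
# credentials (database, rabbitmq, keystone, ...) are preserved
kolla-mergepwd --old /etc/kolla/passwords.yml.pike \
               --new /etc/kolla/passwords.yml \
               --final /etc/kolla/passwords.yml
```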
Pulling appropriate containers to all nodes was straightforward:
```
/opt/kolla-ansible/tools/kolla-ansible \
    -i /etc/kolla/multinode.ha pull
```
This can take a bit of time, but it is worth doing: it has no impact on the running system and it reduces the amount of downtime when upgrading.
We were then ready to perform the upgrade. Rather than run the system through the entire upgrade process in one go, we chose a more conservative approach in which we upgraded a single service at a time: this was to maintain a little more control over the process and to enable us to check that each service was operating correctly after its upgrade. We performed this using commands such as:
```
/opt/kolla-ansible/tools/kolla-ansible \
    -i /etc/kolla/multinode.ha --tags "haproxy" upgrade
```
We stepped through the services in the same order as they are listed in the main Kolla-Ansible playbook, upgrading them one by one.
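Since we did this one service at a time, a small wrapper script made it easier to pause and check each service before moving on. A rough sketch of what we mean – the service list here is purely illustrative and should be taken, in order, from the main kolla-ansible playbook (ansible/site.yml) of the branch you are upgrading to:

```
#!/bin/bash
# Illustrative subset of services, in playbook order; adjust to match
# the site.yml of your kolla-ansible checkout.
SERVICES="haproxy memcached keystone glance cinder nova neutron heat horizon"

for svc in $SERVICES; do
    echo "Upgrading ${svc}..."
    /opt/kolla-ansible/tools/kolla-ansible \
        -i /etc/kolla/multinode.ha --tags "${svc}" upgrade || exit 1
    read -p "Check that ${svc} is healthy, then press Enter to continue "
done
```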
The two services that we were most concerned about were those pertaining to data storage, naturally: mariadb and ceph. We were quite confident that the other processes should not cause significant problems as they do not retain much important state.
Before we started…
We had some initial problems with the docker python library installed on our nodes: the variant available via the standard CentOS repos is too old, so we had to resort to pip to install a newer docker python library that works with recent versions of Kolla-Ansible.
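Concretely, the workaround looked something like the following – the RPM name is the one used on our CentOS 7 nodes and may differ elsewhere:

```
# Remove the outdated library from the standard CentOS repos
yum remove -y python-docker-py

# Install a recent Docker SDK for Python, which newer Kolla-Ansible expects
pip install -U docker
```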
Ocata-Pike Upgrade
Deploying all the services for the Ocata-Pike upgrade was straightforward: we just ran through each of the services in turn and there were no specific issues. When performing some final testing, however, the compute nodes were unable to schedule new VMs as neutron was unable to attach a VIF to the OVS bridge. We had seen this issue before and we knew that putting the compute nodes through a boot cycle solves it – not a very clean approach, but it worked.
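For what it is worth, the boot cycle itself can be driven from the deploy host with an ad-hoc ansible command – a sketch, assuming ansible 2.7+ (for the reboot module), the compute group from the multinode.ha inventory, and that running instances have been migrated off or can tolerate the outage:

```
ansible -i /etc/kolla/multinode.ha compute \
    -b -m reboot -a "reboot_timeout=600"
```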
Pike-Queens Upgrade
The Pike-Queens upgrade was more complex and we encountered issues that we had not specifically seen documented anywhere. The issues were the following:
- the mariadb upgrade failed – when the slave instances were restarted, they did not join the mariadb cluster and we ended up with a cluster with 0 nodes in the ‘JOINED’ state. The master node also ended up in an inoperable state.
  - We solved this using the well-documented approach to bootstrapping a mariadb cluster – we have our own variant of it for the kolla mariadb containers, which is essentially a replica of the `mariadb_recovery` functionality provided by kolla (see the sketch after this list)
  - This did involve replicating all data from the bootstrap node to each of the slave nodes; in our case, this took about 10 minutes
- when the mariadb database had synced and reached quorum, we noticed many errors associated with record field types in the logs – for this upgrade, it was necessary to perform a `mysql_upgrade`, which we had not seen documented anywhere
- the ceph upgrade process was remarkably painless, especially given that it involved a transition from Ceph Jewel to Ceph Luminous. We did have the following small issues to deal with:
  - We had to modify the configuration of the ceph cluster using `ceph osd require-osd-release luminous`
  - We had one small issue where the cluster was in `HEALTH_WARN` status because one pool did not have an appropriate application tag – this was easily fixed using `ceph osd pool application enable {pool-name} {application-name}`
  - for reasons that are not clear to us, Luminous considered the data placement in the cluster to be somewhat suboptimal and moved over 50% of the objects in the cluster; Jewel had given no indication that a large amount of the cluster data needed to be moved
- Upgrading the object store rendered it unusable: in this upgrade, the user which authenticates against keystone with the privilege to manage user data for the object store changed from `admin` to `ceph_rgw`. However, this user had not been added to keystone, so all requests to the object store failed. Adding this user to keystone and giving it appropriate access to the service project fixed the issue (see the sketch after this list).
  - This was due to a change introduced in the Ocata release after we had performed our deployment, and it only became visible to us after we performed the upgrade.
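For reference, a rough sketch of the commands involved in the mariadb and object store fixes above – the container name, user names and role are the kolla defaults in our deployment, so treat these as assumptions to be checked against your own setup:

```
# Recover the galera cluster using kolla-ansible's built-in recovery play
# (this is the functionality our own procedure replicates)
/opt/kolla-ansible/tools/kolla-ansible \
    -i /etc/kolla/multinode.ha mariadb_recovery

# Once the cluster has synced, bring the table definitions up to date;
# this cleared the record field type errors we saw in the logs
# (run on a controller; prompts for the database root password)
docker exec -it mariadb mysql_upgrade -u root -p

# Add the ceph_rgw user to keystone and give it access to the service
# project so that object store requests validate again
openstack user create --project service --password-prompt ceph_rgw
openstack role add --project service --user ceph_rgw admin
```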
Apart from those issues, everything worked fine; we did note that the nova database upgrade/migration in the Pike-Queens cycle did take quite a long time (about 10 minutes) for our small cluster – for a very large configuration, it may be necessary to monitor this more closely.
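We did not need anything sophisticated to monitor it, but on a larger deployment one crude way to see whether the migration is still making progress is to watch what the database is doing while the nova upgrade task runs – a sketch, assuming the kolla default container name of mariadb and substituting your own database root password:

```
# Long-running ALTER TABLE / data migration statements show up here
watch -n 10 \
    "docker exec mariadb mysql -u root -p'<database_password>' \
     -e 'SHOW FULL PROCESSLIST'"
```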
Final remarks…
The Kolla-Ansible upgrade process worked well for our modest deployment, and we are happy to recommend it as an Openstack management tool for environments of this scale with fairly standard configurations. Even with an advanced tool such as Kolla-Ansible, however, it is essential to have a good understanding of both Openstack and Kolla before depending on it in a production system.