A Tool for Understanding OpenStack Cloud Performance using Stacktach and the OpenStack Notification System

In one of our projects, FICORE, the continuation of FIWARE, we need to offer an OpenStack-based service. One aspect of operating such a service is understanding the performance of the system, and in particular how long basic operations take; it is also interesting to see how this evolves over time as, for example, the system becomes more heavily loaded. To address this, we first looked at an approach based on log files, but it was not workable as the information regarding a single operation is spread across multiple hosts and services. An alternative approach is to use the OpenStack notification system, through which many of the key events occurring within the system are published – a single point of access for all the information we need. We used Stacktach to consume, filter and store this data and built a web application on top of it. In this blog post we give a brief overview of the OpenStack notification system, the Stacktach filtering tool and the basic web tool we developed.

OpenStack Notification System

Notifications in OpenStack are fine-grained, JSON-formatted messages sent to the message queue – usually RabbitMQ – containing data relating to a particular operation. The notification system is quite chatty and offers a rich source of structured data which can be used to understand many aspects of OpenStack operations. An example of the content of a notification is shown below. Each notification has an event_type, usually describing the operation performed – here is a list of event_types relating to nova; similar lists exist for other OpenStack services.

{"event_type": "compute.instance.resize.confirm.start",
 "timestamp": "2012-03-12 17:01:29.899834",
 "message_id": "1234653e-ce46-4a82-979f-a9286cac5258",
 "priority": "INFO",
 "publisher_id": "compute.compute-1",
 "payload": {"state_description": "",
             "display_name": "testserver",
             "memory_mb": 512,
             "disk_gb": 20,
             "tenant_id": "12345",
             "created_at": "2012-03-12 16:55:17",
             "instance_type_id": 2,
             "instance_id": "abcbd165-fd41-4fd7-96ac-d70639a042c1",
             "instance_type": "512MB instance",
             "state": "active",
             "user_id": "67890",
             "launched_at": "2012-03-12 16:57:29",
             "image_ref_url": "http://127.0.0.1:9292/images/a213faf83as0t"}}
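Since each notification is plain JSON, it is easy to work with programmatically. As a minimal sketch, the helper below (our own, for illustration only, applied to a trimmed-down copy of the sample payload) extracts the fields one would typically care about:

```python
import json

# A trimmed-down copy of the sample notification above, as it would
# arrive on the message bus.
raw = """{"event_type": "compute.instance.resize.confirm.start",
          "timestamp": "2012-03-12 17:01:29.899834",
          "publisher_id": "compute.compute-1",
          "payload": {"instance_id": "abcbd165-fd41-4fd7-96ac-d70639a042c1",
                      "tenant_id": "12345"}}"""

def summarize(notification_json):
    """Return a one-line summary of an OpenStack notification."""
    n = json.loads(notification_json)
    # publisher_id is "<service>.<host>", e.g. "compute.compute-1"
    host = n["publisher_id"].split(".", 1)[-1]
    return "%s on %s at %s" % (n["event_type"], host, n["timestamp"])

print(summarize(raw))
```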

To enable notifications for a given service, add the following lines to its configuration file; the notify_on_state_change option applies to nova.conf only.

notification_driver=messaging
notification_topics=notifications
# In nova.conf only
notify_on_state_change=vm_and_task_state

After applying the changes above (and restarting the service, of course) a new queue with the name given in notification_topics will be created on the host if it does not already exist. Once the services are set up in this way, they will produce events on the message bus, and we can focus on using Stacktach to do something useful with these notifications.

Stacktach

Stacktach comprises several services focused on consuming, filtering, processing and storing messages for the purposes of analysis or notification: in principle it can be used for any application, but it is really designed for OpenStack. More information on Stacktach is available on the official website (we strongly recommend reading the very helpful documentation).

Setting up Stacktach is quite straightforward; here we use the stacktach-sandbox within a virtual machine, but it can be installed anywhere from which the message queue service of your controller node is reachable. On a vanilla Ubuntu Server VM, install MySQL, RabbitMQ and the other dependencies as follows:

sudo apt-get install -y python-dev ipython mysql-server mysql-client libmysqlclient-dev git vim rabbitmq-server python-pip librabbitmq1

To install and run Stacktach, the following commands can be run:

    git clone https://github.com/openstack/stacktach-sandbox.git
    cd stacktach-sandbox
    ./build.sh

This will create a screen session containing windows for all of the Stacktach services.

It is not necessary to have a full OpenStack installation to run basic tests on Stacktach; it ships with a notification generator (notigen) which can be used to test the Stacktach deployment, and this runs by default when build.sh is executed. Hence, it is possible to check that the installation was performed properly by looking for events directly in the database or by using the klugman tool in one of the screen windows:

    klugman http://127.0.0.1:8000 streams

Configuring Stacktach to consume Openstack notifications

Before configuring Stacktach, it can be helpful to verify that OpenStack is indeed producing notifications: this script prints out every message consumed from a given queue. It is useful for debugging Stacktach and for understanding the workflow of a notification.
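A minimal consumer along those lines can be sketched with the kombu library; note that the broker URL, exchange name and queue name below are assumptions for illustration and would need to match your deployment:

```python
import json

def unwrap(body):
    """Return the notification dict, unwrapping the oslo.message
    envelope that oslo.messaging sometimes adds around it."""
    payload = body if isinstance(body, dict) else json.loads(body)
    if "oslo.message" in payload:
        payload = json.loads(payload["oslo.message"])
    return payload

def main():
    # kombu is a third-party dependency (pip install kombu); imported
    # here so unwrap() above can be used without a broker available.
    from kombu import Connection, Exchange, Queue

    broker_url = "amqp://guest:guest@controller:5672//"  # assumed host/credentials
    queue = Queue("notifications.info",                  # assumed queue name
                  Exchange("nova", type="topic"),        # assumed exchange
                  routing_key="notifications.info")

    def on_message(body, message):
        n = unwrap(body)
        print(n.get("event_type"), n.get("timestamp"))
        message.ack()

    with Connection(broker_url) as conn:
        with conn.Consumer(queue, callbacks=[on_message]):
            while True:
                conn.drain_events()  # print every notification as it arrives

if __name__ == "__main__":
    main()
```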

Configuring Stacktach to consume OpenStack notifications is fairly straightforward and only a few modifications are needed. More specifically, the rabbit_broker section in yagi.conf.common must point to the message queue service of your controller node, as must the database url in winchester.yaml; any other modifications will depend on the specific configuration of your OpenStack deployment, e.g. the notification queue name. For more information on Stacktach configuration please follow the documentation, but be aware of the following caveats.

As the basic deployment of Stacktach consumes notigen messages, the configuration must be modified to disable this so that only real OpenStack notifications are processed. This is done by removing or commenting out the following lines in screenrc.winchester:

    # screen -t gen bash
    # stuff "cd git/stacktach-notigen/bin; python pump_from_stv2.py\r"
    # stuff "cd git/stacktach-notigen/bin; python event_pump.py ../templates 2 0\r"

It is also worth noting that build.sh regenerates yagi.conf every time it is run, taking the configuration from yagi.conf.common, yagi.conf.winchester and winchester.yaml within the stacktach-sandbox directory.

Filtering notifications

Stacktach uses a set of rules (triggers) to filter the notifications you are interested in and create streams out of them. Streams can be quite arbitrary sequences of events which typically have an initial event and a terminal event. Streams are specific to one or more event types, meaning that a given stream comprises events matching those event types. Streams are subdivided using the distinguished_by parameter: this enables us to differentiate between events relating to, for example, specific VMs or hosts. A basic example is shown below which matches events relating to VM creation (compute.instance.create.* events), distinguished by request id. This creates short streams of events relating to VM creation generated by a specific request – for example, a stream comprising a compute.instance.create.start event followed by a compute.instance.create.end event, both relating to that request.

In the example below, the stream is initiated by any event matching compute.instance.create.* which has a unique request_id; any subsequent events with this event_type and the same request_id are put into the stream, and the stream is terminated when the fire_criteria are met – that is, when either a compute.instance.create.end or a compute.instance.create.error message is received for that request_id.

- name: create_instance_trigger
  debug_level: 2
  distinguished_by:
    - request_id
  expiration: "$last + 1h"
  fire_pipeline: "instance_create_start"
  expire_pipeline: "instance_create_end"
  match_criteria:
    - event_type:
        - compute.instance.create.*
  fire_criteria:
    - event_type:
        - compute.instance.create.end
        - compute.instance.create.error
Streams are captured via triggers as shown above; triggers are defined in winchester/triggers.yaml within the stacktach-sandbox directory.
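When a trigger like the one above fires, the resulting stream can be reduced to a duration for the operation. A small sketch of that reduction (the event dicts here are illustrative, mirroring the notification fields rather than Stacktach's internal pipeline API):

```python
from datetime import datetime

TS_FMT = "%Y-%m-%d %H:%M:%S.%f"

def stream_duration(events):
    """Given the events of a fired stream (dicts with event_type and
    timestamp), return the seconds elapsed from first to last event."""
    by_time = sorted(events, key=lambda e: e["timestamp"])
    start = datetime.strptime(by_time[0]["timestamp"], TS_FMT)
    end = datetime.strptime(by_time[-1]["timestamp"], TS_FMT)
    return (end - start).total_seconds()

# Example: a stream for one create request, as matched by the trigger above.
stream = [
    {"event_type": "compute.instance.create.start",
     "timestamp": "2012-03-12 16:55:17.000000"},
    {"event_type": "compute.instance.create.end",
     "timestamp": "2012-03-12 16:57:29.500000"},
]
print(stream_duration(stream))  # seconds taken to create the instance
```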

Web Application

Now that we have a solution which tracks activity on our OpenStack cluster and generates appropriate streams, we can build a small application which gives some insight into the operation of our resources. We built a small web-based application which tracks the time taken to perform standard API calls (e.g. launch VM, delete VM, snapshot VM) on the cluster. Stacktach was configured to generate streams relating to the operations we were interested in, and the web application queries the stream database generated by Stacktach. The web tool allows us to see how these operations vary over time (daily, weekly, monthly), averaged over the entire cluster and broken down by flavor and node. The short video below shows the tool in action.
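As a rough illustration of the kind of aggregation the web tool performs, the sketch below averages operation durations by flavor; the record layout is hypothetical and does not reflect Stacktach's actual database schema:

```python
from collections import defaultdict

# Hypothetical per-operation records, as might be derived from fired streams.
records = [
    {"operation": "create", "flavor": "m1.small", "duration": 42.0},
    {"operation": "create", "flavor": "m1.small", "duration": 58.0},
    {"operation": "create", "flavor": "m1.large", "duration": 95.0},
]

def average_by_flavor(records, operation):
    """Average duration of a given operation, broken down by flavor."""
    durations = defaultdict(list)
    for r in records:
        if r["operation"] == operation:
            durations[r["flavor"]].append(r["duration"])
    return {flavor: sum(d) / len(d) for flavor, d in durations.items()}

print(average_by_flavor(records, "create"))
```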

2 Comments

  1. Great to see, thank you for sharing this. Is there a mechanism for identifying a hierarchy of events across OpenStack projects? For example, to break down instance creation into a set of connected events from Nova, Neutron, Cinder, Glance, and see where the time went in each? That is what I would really want…

    • gaea

      27. January 2016 at 12:40

The Stacktach project is very powerful: it is able to consume event notifications from different OpenStack projects and process them into streams according to the match_criteria and distinguished_by defined in the triggers. More specifically, multiple match_criteria may be associated with a trigger, enabling event notifications from different projects to be inserted into the same stream – just take care to distinguish them correctly, otherwise you will create a stream containing multiple operations. One issue we can foresee is that it may be difficult to link all the operations to a single (external) request, but multiple match_criteria may perhaps help solve this problem.
      Further investigation would be necessary to give a more complete answer.
      It is worth noting that, in our experience, most of the time consumed by these operations relates to image management – specifically copying images – so it is unlikely that very useful insight can be obtained from understanding how the remaining time is broken down.
