Month: May 2014 (page 2 of 2)

Manage instance startup order in OpenStack Heat Templates

In many applications, virtual resources must be created in a specific order. As an orchestration engine, Heat supports this requirement, but expressing it in a template can be tricky. I recently had to write such a Heat template, which at first seemed easy because there are a number of examples in the OpenStack/heat-templates repository on GitHub. My requirements, combined with the relative lack of explanation of how those templates are written, made the task harder than expected. After piecing together information scattered over several websites I solved my issues, and this post is a summary of my findings.

My application consisted of three servers that had to be started and configured in a specific order: each server must be ready before the next one starts, because a new server automatically connects to the servers started before it. This ordering was the main concern of the application. In the following examples I use the names service1, service2 and service3, with the startup order service1 > service2 > service3. I had three requirements:

  1. I wanted to follow the Heat Orchestration Template (HOT) format, the newer template format that is meant to replace the CloudFormation-compatible format (CFN) as the native format supported by Heat over time, so that my template remains usable in future Heat versions.
  2. To enforce my startup order I needed WaitConditions, which come directly from the CFN format. HOT still supports the use of CFN resources, so they can be used in the new format.
  3. My image did not have the cfn tools installed, so I could not call them from inside the machine during the post-boot phase. This is an issue because all the templates on GitHub that use WaitConditions rely on these tools.

The idea of a WaitCondition is that it is declared and linked to a resource; when this resource is configured and ready, it sends a signal back to Heat. Another resource that depends on this signal can then be started. The template that met my requirements can be found on GitHub; I will explain the relevant parts here:

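    # Keys lost from this listing are restored from context; the "$wait_handle$" placeholder name and the shebang line are assumptions.
    service1:
      type: "OS::Nova::Server"
      properties:
        flavor: m1.medium
        image: ubuntu_cloud
        key_name:
          get_param: key_name
        user_data:
          str_replace:
            template: |
              #!/bin/bash
              curl -X PUT -H 'Content-Type:application/json' \
                   -d '{"Status" : "SUCCESS","Reason" : "Configuration OK","UniqueId" : "SERVICE1","Data" : "Service1 Configured."}' \
                   "$wait_handle$"
            params:
              $wait_handle$:
                get_resource: service1_wait_handle

    service1_wait:
      type: "AWS::CloudFormation::WaitCondition"
      depends_on: service1
      properties:
        Handle:
          get_resource: service1_wait_handle
        Timeout: 1000

    service1_wait_handle:
      type: "AWS::CloudFormation::WaitConditionHandle"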

A first resource, service1, is declared, along with a WaitCondition and a WaitConditionHandle as separate resources; the WaitCondition is linked to its handle and depends on service1. The interesting part is the post-boot script of service1, passed as user data. Here you can see a curl call that sends a specific JSON blob (details are on CloudFormation’s website) through a PUT to an address retrieved from the WaitConditionHandle designated as service1_wait_handle. This is what signals success to the wait condition. Now, how can we specify that the next virtual instance has to wait for this success signal before being started?

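    # Keys lost from this listing are restored from context; the "$...$" placeholder names and the shebang line are assumptions.
    service2:
      type: "OS::Nova::Server"
      depends_on: service1_wait
      properties:
        flavor:
          get_param: instance_type
        image: ubuntu_cloud
        key_name:
          get_param: key_name
        user_data:
          str_replace:
            template: |
              #!/bin/bash
              curl -X PUT -H 'Content-Type:application/json' \
                -d '{"Status" : "SUCCESS","Reason" : "Configuration OK","UniqueId" : "SERVICE2","Data" : "Service2 Configured."}' \
                "$wait_handle$"
            params:
              $service1_data$:
                get_attr:
                  - service1_wait
                  - Data
              $wait_handle$:
                get_resource: service2_wait_handle
    service2_wait:
      type: "AWS::CloudFormation::WaitCondition"
      depends_on: service2
      properties:
        Handle:
          get_resource: service2_wait_handle
        Timeout: 1000

    service2_wait_handle:
      type: "AWS::CloudFormation::WaitConditionHandle"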

Here you can see a structure similar to the one in the previous snippet, with a new WaitCondition and handle. This is because this server will in turn need to be configured before the final server can be started. The service2 resource differs in two ways:

depends_on: service1_wait

This specifies that the resource depends on the completion of the service1_wait WaitCondition. Intuitively this should be enough, since one would expect the condition to complete only when the success signal described above is sent. Unfortunately it is not sufficient: at least in the Havana release, where this template was tested, the resource did not wait at all and was started as soon as the template was created. A workaround for this problem is implemented in this snippet:

        get_attr:
          - service1_wait
          - Data

This explicitly tells Heat that service2 needs to retrieve the data (in our case, a string) sent through the curl call in the service1 post-boot script. This requirement is what actually makes service2 wait for service1 to be ready, even though the post-boot script of service2 never references this data: it is sufficient to retrieve it in the params section of str_replace, without using it in the script itself.

With this template, you can now start and configure your instances in whatever order fits your application’s requirements, and even chain wait conditions so that instance C waits for instance B, which in turn waits for instance A. It is also possible to actually use the data sent through the success signal in other templates, if that makes sense for your application’s configuration scheme.
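For images where curl is not available but Python is (an assumption, my own image only needed curl), the same success signal can be sketched with the standard library. The function names below are hypothetical, and handle_url stands for the address that Heat substitutes for the WaitConditionHandle; the JSON fields are the ones used in the curl call above.

```python
import json
import urllib.request


def build_signal(unique_id, data, status="SUCCESS", reason="Configuration OK"):
    """Return the JSON body sent to a WaitConditionHandle, as bytes."""
    payload = {
        "Status": status,
        "Reason": reason,
        "UniqueId": unique_id,
        "Data": data,
    }
    return json.dumps(payload).encode("utf-8")


def build_request(handle_url, body):
    """Prepare (but do not send) the PUT request that signals Heat."""
    return urllib.request.Request(
        handle_url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def send_signal(handle_url, unique_id, data):
    """Send the signal; handle_url is the address retrieved from the handle."""
    request = build_request(handle_url, build_signal(unique_id, data))
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status
```

Inside a post-boot script this would be called as send_signal(handle_url, "SERVICE1", "Service1 Configured."), once the service is actually configured.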

Deploy Ceph and start using it: end to end tutorial – simple librados client (part 3/3)

(Part 1/3 – Installation – Part 2/3 – troubleshooting)

This part of the tutorial describes how to set up a simple Ceph client using librados (for C++).

The only information that the client requires for cephx authentication is:

  • Endpoint of the monitor node
  • Keyring containing the pre-shared secret (we will use the admin keyring)

Install librados APIs

On Ubuntu, the library is available in the repositories:

$ sudo apt-get install librados-dev

Create a client configuration file

This is the file from which librados will read the client configuration.

The content of the file is structured according to this template:

mon host = <IP address of one of the monitors>
keyring = <path/to/client.admin.keyring>

For example:

mon host =
keyring = ./ceph.client.admin.keyring

The public endpoint of the monitor node can be retrieved with

$ ceph mon stat

The keyring file can be copied from the admin node. No change is needed to this file. The same information that is contained in the file can be retrieved with this command that will also list the client capabilities:

$ ceph auth get client.admin
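If you deploy several clients, the two-line configuration file can be generated rather than edited by hand. This is only a sketch: the function names are hypothetical and the monitor address used in the usage note is a documentation placeholder, not an address from my cluster.

```python
def render_client_conf(mon_host, keyring_path):
    """Return the two-line client configuration described above."""
    if not mon_host or not keyring_path:
        raise ValueError("both the monitor address and the keyring path are required")
    return "mon host = {}\nkeyring = {}\n".format(mon_host, keyring_path)


def write_client_conf(path, mon_host, keyring_path):
    """Write the configuration to the file that librados will read."""
    with open(path, "w") as conf_file:
        conf_file.write(render_client_conf(mon_host, keyring_path))
```

For instance, write_client_conf("ceph.conf", "192.0.2.10", "./ceph.client.admin.keyring") creates the file in the local directory, where the monitor address is the one reported by ceph mon stat.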

Connect to the cluster

The simple client below performs the following operations:

  • Read the configuration file (ceph.conf) from the local directory
  • Get a handle to the cluster and an IO context on the “data” pool
  • Create a new object
  • Set an xattr
  • Read the object and xattr back
  • Print the list of pools
  • Print the list of objects in the “data” pool
  • Cleanup

    #include <rados/librados.hpp>
    #include <iostream>
    #include <list>
    #include <string>

    int main(int argc, const char **argv)
    {
      int ret = 0;

      /*
       * Errors are not checked to avoid pollution.
       * After each Ceph operation:
       * if (ret < 0) error_condition
       * else success
       */

      // Get cluster handle and connect to cluster
      std::string cluster_name("ceph");
      std::string user_name("client.admin");
      librados::Rados cluster;
      cluster.init2(user_name.c_str(), cluster_name.c_str(), 0);
      cluster.conf_read_file("ceph.conf");
      cluster.connect();

      // IO context
      librados::IoCtx io_ctx;
      std::string pool_name("data");
      cluster.ioctx_create(pool_name.c_str(), io_ctx);

      // Write an object synchronously
      librados::bufferlist bl;
      std::string objectId("hw");
      std::string objectContent("Hello World!");
      bl.append(objectContent);
      io_ctx.write(objectId, bl, objectContent.size(), 0);

      // Add an xattr to the object
      librados::bufferlist lang_bl;
      lang_bl.append("en_US");
      io_ctx.setxattr(objectId, "lang", lang_bl);

      // Read the object back asynchronously
      librados::bufferlist read_buf;
      int read_len = 4194304;

      // Create I/O completion
      librados::AioCompletion *read_completion =
          librados::Rados::aio_create_completion();

      // Send read request
      io_ctx.aio_read(objectId, read_completion, &read_buf, read_len, 0);

      // Wait for the request to complete, and print content
      read_completion->wait_for_complete();
      read_completion->get_return_value();
      std::cout << "Object name: " << objectId << '\n'
                << "Content: " << read_buf.c_str() << std::endl;

      // Free the completion now that the read has finished
      read_completion->release();

      // Read the xattr
      librados::bufferlist lang_res;
      io_ctx.getxattr(objectId, "lang", lang_res);
      std::cout << "Object xattr: " << lang_res.c_str() << std::endl;

      // Print the list of pools
      std::list<std::string> pools;
      cluster.pool_list(pools);
      std::cout << "List of pools from this cluster handle" << std::endl;
      for (auto pool_id : pools) {
        std::cout << '\t' << pool_id << std::endl;
      }

      // Print the list of objects
      librados::ObjectIterator oit = io_ctx.objects_begin();
      librados::ObjectIterator oet = io_ctx.objects_end();
      std::cout << "List of objects from this pool" << std::endl;
      for (; oit != oet; oit++) {
        std::cout << '\t' << oit->first << std::endl;
      }

      // Remove the xattr
      io_ctx.rmxattr(objectId, "lang");

      // Remove the object
      io_ctx.remove(objectId);

      // Cleanup
      io_ctx.close();
      cluster.shutdown();
      return 0;
    }

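This example can be compiled and executed with the commands below (the -std=c++11 flag is needed on older compilers for the range-based for loop):

$ g++ -std=c++11 client.cpp -lrados -o cephclient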
$ ./cephclient

Operate with cluster data from the command line

To quickly verify if an object was written or to remove it, use the following commands (e.g., from the monitor node).

  • List objects in pool data

    $ rados -p data ls
  • Check the location of an object in pool data

    $ ceph osd map data <object name>
  • Remove object from pool data

    $ rados rm <object name> --pool=data

Deploy Ceph and start using it: end to end tutorial – Troubleshooting (part 2/3)

(Part 1/3 – Installation – Part 3/3 – librados client)

It is quite common for the Ceph cluster to report health warnings after the initial installation. Before using the cluster for storage (e.g., allowing clients to access it), a HEALTH_OK state should be reached:

cluster-admin@ceph-mon0:~/ceph-cluster$ ceph health

This part of the tutorial provides some troubleshooting hints that I collected during the setup of my deployments. Other helpful resources are the Ceph IRC channel and mailing lists.

Useful diagnostic commands

A collection of diagnostic commands to check the status of the cluster is listed here. Their output is how we can tell whether the Ceph cluster is properly configured.

  1. Ceph status
    $ ceph status

    In this example, the disk for one OSD had been physically removed, so 2 out of 3 OSDs were in and up.

    cluster-admin@ceph-mon0:~/ceph-cluster$ ceph status
        cluster 28f9315e-6c5b-4cdc-9b2e-362e9ecf3509
         health HEALTH_OK
         monmap e1: 1 mons at {ceph-mon0=}, election epoch 1, quorum 0 ceph-mon0
         osdmap e122: 3 osds: 2 up, 2 in
          pgmap v4699: 192 pgs, 3 pools, 0 bytes data, 0 objects
                87692 kB used, 1862 GB / 1862 GB avail
                     192 active+clean
  2. Ceph health
    $ ceph health
    $ ceph health detail
  3. Pools and OSDs configuration and status
    $ ceph osd dump
    $ ceph osd dump --format=json-pretty

    The second version provides much more information, listing all the pools and OSDs and their configuration parameters.

  4. Tree of OSDs reflecting the CRUSH map
    $ ceph osd tree

    This is very useful to understand how the cluster is physically organized (e.g., which OSDs are running on which host).

  5. Listing the pools in the cluster
    $ ceph osd lspools

    This is particularly useful to check client operations (e.g., whether new pools were created).

  6. Check the CRUSH rules
    $ ceph osd crush dump --format=json-pretty
  7. List the disks of one node from the admin node
    $ ceph-deploy disk list osd0
  8. Check the logs.
    Log files in /var/log/ceph/ will provide a lot of information for troubleshooting. Each node of the cluster contains logs about the Ceph components that it runs, so you may need to SSH into different hosts to get a complete diagnosis.
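When these checks are repeated often, the osdmap line of the ceph status output can be parsed automatically. The sketch below assumes the plain-text format shown in the examples of this post (other Ceph versions may print it differently); the function names are hypothetical.

```python
import re

# Matches lines such as: "osdmap e122: 3 osds: 2 up, 2 in"
OSDMAP_RE = re.compile(r"osdmap e(\d+): (\d+) osds: (\d+) up, (\d+) in")


def parse_osdmap(status_text):
    """Extract OSD counts from `ceph status` output, or None if absent."""
    match = OSDMAP_RE.search(status_text)
    if match is None:
        return None
    epoch, total, up, in_count = (int(group) for group in match.groups())
    return {"epoch": epoch, "total": total, "up": up, "in": in_count}


def osds_missing(status_text):
    """Return True when some OSDs are not up or not in."""
    counts = parse_osdmap(status_text)
    if counts is None:
        raise ValueError("no osdmap line found in the status output")
    return counts["up"] < counts["total"] or counts["in"] < counts["total"]
```

Fed with the output of the example above, osds_missing reports that one OSD is absent even though the health line still says HEALTH_OK.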

Check your firewall and network configuration

Every node of the Ceph cluster must be able to successfully run

$ ceph status

If this operation times out without giving any results, it is likely that the firewall (or network configuration) is not allowing the nodes to communicate.

Another symptom of this problem is that OSDs cannot be activated, i.e., the ceph-deploy osd activate <args> command will time out.

The default port of the Ceph monitor is 6789. Ceph OSDs and MDS daemons try to bind to the first available ports starting at 6800.

A typical Ceph cluster might need the following ports:

Mon:  6789
Mds:  6800
Osd1: 6801
Osd2: 6802
Osd3: 6803

Depending on your security requirements, you may want to simply allow any traffic to and from the Ceph cluster nodes.
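To verify that a firewall is not blocking these ports, a quick TCP reachability check can be run from each node against the others. This is a minimal sketch with hypothetical function names; it only tests that a TCP connection can be opened, not that the Ceph daemon behind the port is healthy.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_ports(host, ports, timeout=3.0):
    """Map each port to its reachability on the given host."""
    return {port: port_open(host, port, timeout) for port in ports}
```

For the typical cluster above, check_ports("ceph-mon0", [6789]) would be run against the monitor node and check_ports(<osd host>, [6800, 6801, 6802, 6803]) against the OSD hosts; any False value points at the firewall or network configuration.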


Try restarting first

Without going into detailed troubleshooting and log analysis, I’ve noticed that sometimes (especially right after the first installation) a simple restart of the Ceph components helps the cluster move from a HEALTH_WARN to a HEALTH_OK state.

If some of the OSDs are not in or not up, as in the case below

    cluster 07d28faa-48ae-4356-a8e3-19d5b81e159e
     health HEALTH_WARN 192 pgs incomplete; 192 pgs stuck inactive; 192 pgs stuck unclean; 1/2 in osds are down; clock skew detected on mon.1, mon.2
     monmap e3: 3 mons at {0=,1=,2=}, election epoch 36, quorum 0,1,2 0,1,2
     osdmap e27: 6 osds: 1 up, 2 in
      pgmap v57: 192 pgs, 3 pools, 0 bytes data, 0 objects
            84456 kB used, 7865 MB / 7948 MB avail
                 192 incomplete

try to start the OSD daemons with

# on osd0
$ sudo /etc/init.d/ceph -a start osd0

If the OSDs are in but the PGs are in unexpected states, as in the example below

    cluster 07d28faa-48ae-4356-a8e3-19d5b81e159e
     health HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean; clock skew detected on mon.1, mon.2
     monmap e3: 3 mons at {0=,1=,2=}, election epoch 36, quorum 0,1,2 0,1,2
     osdmap e34: 6 osds: 6 up, 6 in
      pgmap v71: 192 pgs, 3 pools, 0 bytes data, 0 objects
            235 MB used, 23608 MB / 23844 MB avail
                 128 active+degraded
                  64 active+replay+degraded

try to restart the monitor(s) with

# on mon0
$ sudo /etc/init.d/ceph -a restart mon0

Unfortunately, a simple restart solves the problem only in rare cases. More troubleshooting is required in most situations.

Unable to find keyring

During the deployment of the monitor nodes (the ceph-deploy <mon> [<mon>] create-initial step), Ceph may complain about missing keyrings:

[ceph_deploy.gatherkeys][WARNIN] Unable to find
/etc/ceph/ceph.client.admin.keyring on ['ceph-server']

If this warning is reported (even though the message is not an error), the Ceph cluster will probably not reach a healthy state.

The solution to this problem is to use exactly the same names for the hostnames (i.e., the output of hostname -s) and the Ceph node names.

This means that the files

  • /etc/hosts
  • /etc/hostname
  • .ssh/config (only for the admin node)

and the output of the command hostname -s should all use the same name for a given node.
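The consistency check can be sketched as a small function that receives the output of hostname -s and the contents of /etc/hostname and /etc/hosts as strings. All names below are hypothetical, and the sample address in the usage note is a documentation placeholder.

```python
def short_name(name):
    """Strip whitespace and any domain part, like `hostname -s` does."""
    return name.strip().split(".")[0]


def hosts_file_names(etc_hosts_text):
    """Collect the short names declared in an /etc/hosts file."""
    names = set()
    for line in etc_hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        names.update(short_name(field) for field in fields[1:])
    return names


def node_name_consistent(node_name, hostname_s_output, etc_hostname_text, etc_hosts_text):
    """Return True when all three sources agree with the Ceph node name."""
    return (
        short_name(hostname_s_output) == node_name
        and short_name(etc_hostname_text) == node_name
        and node_name in hosts_file_names(etc_hosts_text)
    )
```

With an /etc/hosts line such as "192.0.2.11 ceph-mon0", the check passes for the node name ceph-mon0 and fails for ceph-server, which is the kind of mismatch behind the warning above.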

Check that replication requirements can be met

I’ve found that most of my problems with Ceph health were related to wrong (i.e., unfeasible) replication policies.

This is particularly likely to happen in test deployments, where one doesn’t care about setting up many OSDs or separating them across different hosts.

Some common pitfalls here may be:

  1. The number of required replicas is higher than the number of OSDs (!!)
  2. CRUSH is instructed to separate replicas across hosts but multiple OSDs are on the same host and there are not enough OSD hosts to satisfy this condition

The visible effect when running diagnostic commands is that PGs remain in states other than active+clean.

CASE 1: the replication level cannot be satisfied by the current cluster (e.g., a replica size of 3 with 2 OSDs).

Check the replicated size of pools with

$ ceph osd dump

Adjust the replicated size and min_size, if required, by running

$ ceph osd pool set <pool_name> size <value>
$ ceph osd pool set <pool_name> min_size <value>
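The comparison behind CASE 1 can be automated on the JSON output of ceph osd dump --format=json. In this sketch the field names "size", "osds" and "in" are assumptions about that output (only "pool", "pool_name" and "crush_ruleset" appear in the excerpts of this post), and the sample document is made up for illustration.

```python
import json

# Hypothetical, heavily trimmed sample of `ceph osd dump --format=json`
SAMPLE_DUMP = """
{"pools": [{"pool": 0, "pool_name": "data", "size": 3},
           {"pool": 1, "pool_name": "rbd", "size": 2}],
 "osds": [{"osd": 0, "up": 1, "in": 1},
          {"osd": 1, "up": 1, "in": 1},
          {"osd": 2, "up": 0, "in": 0}]}
"""


def infeasible_pools(osd_dump):
    """List (pool name, replica size, OSDs in) for pools that cannot be satisfied."""
    osds_in = sum(1 for osd in osd_dump["osds"] if osd.get("in") == 1)
    return [
        (pool["pool_name"], pool["size"], osds_in)
        for pool in osd_dump["pools"]
        if pool["size"] > osds_in
    ]
```

Calling infeasible_pools(json.loads(SAMPLE_DUMP)) flags the data pool, which asks for 3 replicas while only 2 OSDs are in.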

CASE 2: the replication policy requires replicas to sit on separate hosts, but the OSDs are running on the same host.

Check what crush_ruleset applies to a certain pool with

$ ceph osd dump --format=json-pretty

In the example below, the pool with id 0 (“data”) is using the crush_ruleset with id 0

"pools": [
        { "pool": 0,
          "pool_name": "data",
          "crush_ruleset": 0,  <----
          "object_hash": 2,

then check with

$ ceph osd crush dump --format=json-pretty

what crush_ruleset 0 is about.

In the example below, we can observe that this rule places each replica on a leaf (an OSD) under a distinct bucket of type host, so every replica must land on a different host.

"rules": [
        { "rule_id": 0,
          "rule_name": "replicated_ruleset",
          "ruleset": 0,
          "type": 1,
          "min_size": 1,
          "max_size": 10,
          "steps": [
                { "op": "take",
                  "item": -1,
                  "item_name": "default"},
                { "op": "chooseleaf_firstn",     <-----------
                  "num": 0,
                  "type": "host"},               <-----------
                { "op": "emit"}]}],

If not enough hosts are available, then the application of this rule will fail.
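The reasoning of CASE 2 can be expressed as a check on the rule structure shown above. This is a sketch with hypothetical function names: it reads the failure domain from the first choose/chooseleaf step and compares the replica size against a mapping of OSD id to host name, which I assume has been filled in by hand from ceph osd tree.

```python
def failure_domain(rule):
    """Return the bucket type that the rule separates replicas across."""
    for step in rule["steps"]:
        if step["op"].startswith("choose"):
            return step["type"]
    return None


def rule_satisfiable(rule, replica_size, osd_hosts):
    """Check whether enough hosts (or OSDs) exist for the given replica size.

    osd_hosts maps an OSD id to the name of the host it runs on.
    """
    domain = failure_domain(rule)
    if domain == "host":
        return len(set(osd_hosts.values())) >= replica_size
    if domain == "osd":
        return len(osd_hosts) >= replica_size
    raise ValueError("unsupported failure domain: {!r}".format(domain))
```

With two OSDs on the same host and a replica size of 2, the host-based rule above is not satisfiable, while a rule whose step has type osd is.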

To allow replicas to be created on different OSDs but possibly on the same host, we need to create a new ruleset:

$ ceph osd crush rule create-simple replicate_within_hosts default osd

After the rule has been created, it should be listed in the output of

$ ceph osd crush dump

from where we can note its id.

The next step is to apply this rule to the pools as required:

$ ceph osd pool set data crush_ruleset <rulesetId>
$ ceph osd pool set metadata crush_ruleset <rulesetId>
$ ceph osd pool set rbd crush_ruleset <rulesetId>

An overview of Load Balancing

With the advent of large-scale architectures came a need to improve the distribution of requests, in order to optimize the throughput of the system while keeping response times to a minimum. This is especially true for large web services. Load balancing is the ability to make many servers participate in the same service and do the same tasks.

The goal of this post is to explain the different approaches to traditional load balancing and to list existing software. The last section covers the integration of these approaches in a cloud environment, since the large-scale architectures described above may nowadays be entirely cloud-based. This post is not meant to be an exhaustive study of load balancing, which is a mature topic with a lot of research and available products. Rather, it is an introduction for someone who might need load balancing in their project and would like to know the basic types of load balancers and the most well-known products. To investigate further, a list of useful links is provided at the end of the post.

Load balancing is often confused with high availability. As the number of servers grows, the risk of a failure somewhere increases and must be addressed; maintaining unaffected service during such failures, by redirecting requests to working resources, is also part of a load balancer’s job.

The focus of this post will be on Load-Balancing HTTP applications, which is one of the most classic applications of load balancing.

Load balancing approaches


DNS load balancing is probably the easiest technique to implement. When a service is accessed through a hostname, a DNS server is tasked with translating that name into an IP address. During this name resolution, the DNS server can select any node from the cluster it manages, based on its scheduling policy. It also provides a validity period (Time-To-Live, TTL), used to cache the translation. After the TTL expires, the next request goes to the DNS server again. Round-robin is the simplest policy to implement: the server returns the addresses in rotating order.

Example of DNS load-balancing

    $ host -t a <domain>
    <domain> has address <IP address 1>
    <domain> has address <IP address 2>
    <domain> has address <IP address 3>

Using a round-robin algorithm, each request is routed to one of these IP addresses in turn.
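The rotation can be illustrated with a toy resolver. This is only a sketch of the round-robin policy, not of a real DNS server; the class name, the hostname and the addresses (taken from the documentation range) are all hypothetical.

```python
from collections import deque


class RoundRobinResolver:
    """Toy resolver that rotates the address list after every answer."""

    def __init__(self, records, ttl=60):
        self._records = {name: deque(addrs) for name, addrs in records.items()}
        self.ttl = ttl  # validity period handed to the client for caching

    def resolve(self, name):
        """Return (addresses, ttl); clients typically use the first address."""
        addresses = self._records[name]
        answer = list(addresses)
        addresses.rotate(-1)  # the next answer starts with the next server
        return answer, self.ttl
```

Successive calls to resolve for the same name return the same set of addresses in a different order, so clients that pick the first entry are spread across the servers. A client that caches the answer keeps hitting the same server until the TTL expires.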


In this approach, the load-balancing architecture consists of hardware or software installed on a dedicated front-end server that works at the network packet level. This type of LB is also called a Layer 3/4 LB; it distributes requests based on data found in network and transport layer protocols such as TCP or UDP. It acts on routing, using one of the following methods: Direct Routing (the LB routes the same service address through different local, physical servers on the same network segment), Tunneling (tunnels are established between the LB and the servers, so they can be located on remote networks) or NAT (the user connects to a virtual destination address, which the load balancer translates to one of the servers’ addresses).


Application-level LBs, also called Layer 7 LBs, act as reverse proxies and distribute requests based on data found in application layer protocols such as HTTP. They provide a first level of security by only forwarding what they understand. They can also be combined with the previous type of load balancer to achieve fine-grained request distribution.

Figure (LB Architecture): example of an architecture using both Layer 4 and Layer 7 LBs.
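The difference between the two levels comes down to what information the balancer looks at when it picks a server. The sketch below contrasts a Layer 4 decision, based only on connection data, with a Layer 7 decision, based on the HTTP request path. Function names, pool names and the hashing scheme are assumptions chosen for illustration, not the behaviour of any specific product.

```python
import zlib


def pick_layer4(backends, src_ip, src_port, dst_port):
    """Choose a backend from transport-level data only (no HTTP knowledge)."""
    key = "{}:{}:{}".format(src_ip, src_port, dst_port).encode("ascii")
    return backends[zlib.crc32(key) % len(backends)]


def pick_layer7(routes, default_pool, path):
    """Choose a pool from the HTTP request path, longest prefix first."""
    for prefix in sorted(routes, key=len, reverse=True):
        if path.startswith(prefix):
            return routes[prefix]
    return default_pool
```

In a combined architecture, a Layer 4 balancer could first spread connections across several Layer 7 proxies with pick_layer4, and each proxy would then use something like pick_layer7 to send /static/ requests to a cache pool and /api/ requests to application servers.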

Current offering

Historically, most offerings in the load balancing sector came from major network hardware vendors such as F5 (with its BIG-IP product line) and Juniper, but software load balancers are increasingly used, especially in cloud environments where the network may be virtual. As the number of existing load balancers is huge, we chose to focus on a handful of them, especially those released under an open source license.

Layer-4 capable software LB

IP Virtual Server
IPVS is built into the Linux kernel and thus does not suffer from context switching between user space and kernel space, which introduces delays, especially under heavy traffic with many short-lived connections.

HAProxy is a hybrid load balancer capable of both Layer 4 (TCP) and Layer 7 (HTTP) load balancing. It implements an event-driven, single-process model, which enables support for a very high number of simultaneous connections. The idea behind this choice, which dates back to the early versions of the tool, is that memory limits, system scheduler limits and lock contention prevent multi-process and multi-threaded models from coping with thousands of simultaneous connections. Since version 1.5 it supports SSL connections.

Layer-7 capable software LB

Primarily built as a lightweight HTTP server, nginx also serves quite well as an HTTP(S) load balancer. Of the listed options, nginx provides the most features, including many options for caching and file serving.

Through the mod_proxy_balancer module, available since Apache 2.1, Apache can be used as an HTTP load balancer that retrieves requested pages from two or more backend web servers and delivers them to users while keeping track of sessions, so that a given user always deals with the same backend web server.
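The session tracking idea (often called sticky sessions) can be sketched in a few lines. This illustrates the concept only and is not how mod_proxy_balancer is implemented; the class name, backend names and session identifiers are hypothetical.

```python
import itertools


class StickyBalancer:
    """Assign new sessions round-robin, then keep each session on its backend."""

    def __init__(self, backends):
        self._next_backend = itertools.cycle(backends)
        self._sessions = {}  # session id -> backend chosen on first request

    def route(self, session_id):
        """Return the backend for this session, choosing one on first sight."""
        if session_id not in self._sessions:
            self._sessions[session_id] = next(self._next_backend)
        return self._sessions[session_id]
```

A user whose first request went to one backend keeps being routed there on later requests, while new users continue to be distributed across all backends.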

Between nginx and HAProxy, Pound is a lightweight HTTP-only load balancer. It offers many of the load balancing features of nginx without any of the web server capabilities and can thus be used behind any web server. This keeps Pound small and efficient.

Although primarily used as a reverse proxy cache, Varnish also includes functionality to act as a load balancer. It does not offer a great deal of configuration, but, if already using Varnish for caching, it is possible to also make use of its load balancing abilities to simplify an architecture and avoid using too many different components.

Load-Balancing in the Cloud

Many Infrastructure-as-a-Service management suites provide their own component dedicated to load balancing, among them Apache CloudStack and OpenStack. This component is in fact a connector between the virtual instances and a real load balancer such as the ones described in the previous section. For instance, OpenStack Neutron LoadBalancing works together with HAProxy. Cloud providers such as Amazon also offer their own LB services. What all these LBs have in common is that they work “as a service”: a tenant can dynamically add an LB to a set of virtual servers to optimize request routing.

Useful links
