Set Up a Kubernetes Cluster on OpenStack with Heat

In this post we take a look at Kubernetes and help you set up a Kubernetes cluster on your existing OpenStack cloud using its orchestration service, Heat. This Kubernetes cluster should only be used as a proof of concept.

Technology involved:
Kubernetes: https://github.com/GoogleCloudPlatform/kubernetes
CoreOS: https://coreos.com/
etcd: https://github.com/coreos/etcd
fleet: https://github.com/coreos/fleet
flannel: https://github.com/coreos/flannel
kube-register: https://github.com/kelseyhightower/kube-register

The Heat Template used in this Post is available on Github.

What is Kubernetes?

Kubernetes allows the management of docker containers at scale. Its core concepts are covered in this presentation, held at the recent OpenStack&Docker Usergroup meetups.

A complete overview of Kubernetes is found on the Kubernetes Repo.

Architecture

The provisioned Cluster consists of 5 VMs. The first one, discovery, is a dedicated etcd host. This allows easy etcd discovery thanks to a static IP-Address.

A Kubernetes Master host is set up with the Kubernetes components apiserver, scheduler, kube-register, controller-manager as well as proxy. This machine also gets a floating IP assigned and acts as an access point to your Kubernetes cluster.

Three Kubernetes Minion hosts are set up with the Kubernetes components kubelet and proxy.

HowTo

Follow the instructions on the Github repo to get your Kubernetes cluster up and running:

https://github.com/icclab/kubernetes-on-openstack-demo 

Examples

Two examples are provided in the repo:

A Web Application to Monitor and Understand Energy Consumption in an Openstack Cloud

In one of our projects we need to understand the energy consumption of our servers. Our initial work in this direction involved collecting energy consumption data using Kwapi and storing it in Ceilometer for further study. The data stored in Ceilometer is valuable; however, it is insufficient to really understand energy consumption in detail. Consequently, we are developing a web application which gives a much greater insight into energy consumption in our cloud resources. This is very much a work in progress, so this post just highlights a few points relating to the application as well as a video which shows the current version of the application.

The tool was developed to be totally integrated with Openstack. Users log in with their Openstack credentials (using Keystone authentication) and are redirected to the overview page, where they can see the total energy consumed by the VMs in their projects for the previous month as well as some general information regarding virtual machines; a line chart displays how the energy consumed varies over time.

Profiling the Ceilometer API to Identify Performance Bottlenecks

We are using Ceilometer to collect energy data from our servers. As noted previously, we were having some performance issues and needed to investigate further. In this blog post we cover our approach to profiling the Ceilometer API to determine where the problems arose.

Of course, the first step was to take a look at the log files (in /var/log/ceilometer-all.log); as there was nothing unusual in there, we decided to perform profiling of the code.

Ceph: OSD “down” and “out” of the cluster – An obvious case

When setting up a cluster with ceph-deploy, just after the ceph-deploy osd activate phase and the distribution of keys, the OSDs should be both “up” and “in” the cluster.

One thing that is not mentioned in the quick-install documentation with ceph-deploy or the OSDs monitoring or troubleshooting page (or at least I didn’t find it), is that, upon (re-)boot, mounting the storage volumes to the mount points that ceph-deploy prepares is up to the administrator (check this discussion on the Ceph mailing list).

So, after a reboot of my storage nodes, the Ceph cluster couldn’t reach a healthy state showing the following OSD tree:

$ ceph osd tree
# id weight type name up/down reweight
-1 3.64 root default
    -2 1.82 host ceph-osd0
        0 0.91 osd.0 down 0
        1 0.91 osd.1 down 0
    -3 1.82 host ceph-osd1
        2 0.91 osd.2 down 0
        3 0.91 osd.3 up 1

I wasn’t thinking about mounting the drives, as this process was hidden from me during the initial installation, but a simple mount command would have immediately unveiled the mystery :D.

So, the simple solution was to mount the devices:

sudo mount /dev/sd<XY> /var/lib/ceph/osd/ceph-<K>/

and then to start the OSD daemons:

sudo start ceph-osd id=<K>
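
To spot this situation quickly after a reboot, one can check which OSD data directories are not backed by a mounted volume. The following is a minimal Python sketch, assuming the /var/lib/ceph/osd/ceph-<K> layout used above; the function name is hypothetical and not part of any Ceph tool.

```python
import os


def unmounted_osd_dirs(base="/var/lib/ceph/osd"):
    """Return OSD data directories under `base` that are not mount points."""
    if not os.path.isdir(base):
        return []
    missing = []
    for name in sorted(os.listdir(base)):
        path = os.path.join(base, name)
        # Only consider directories named like ceph-0, ceph-1, ...
        if name.startswith("ceph-") and os.path.isdir(path):
            if not os.path.ismount(path):
                missing.append(path)
    return missing
```

Calling unmounted_osd_dirs() on a storage node returns the directories that still need a mount before the corresponding OSD daemon is started.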

For some other troubleshooting hints for Ceph, you may look at this page.

Ceilometer Performance issues

Update: This does not apply to Icehouse. The flag activated an experimental feature, and this option no longer exists in Icehouse (it does exist in Havana, however).

There have been some criticisms of the implementation of Ceilometer (or Telemetry as of Icehouse) – however, it’s still the main show in town for understanding what’s going on inside your Openstack.

We’ve been doing a bit of work with it in multiple projects. In one of our efforts – pulling in energy info via kwapi – we noticed that Ceilometer really crawls to a halt with the API giving a response in 20s when trying to enter just a single energy consumption data point. (Yes, it might make more sense to batch these up…). For our simple scenario, this performance was completely unworkable.

Our Ceilometer installation just used the basic Mirantis Fuel v4.0 which installed a variant of Havana. The db backend was mysql (chosen by Fuel) and we just went with the default configuration parameters.

There are known performance issues with Ceilometer (issue, presentation mentioning it, mailing list discussion) and it seems that Icehouse has made some significant strides in improving performance of Ceilometer/Telemetry; however, we have not managed to perform the upgrade as yet – maybe some of these issues have already been fixed.

For our work, we were able to significantly improve the performance of the Ceilometer API by activating (experimental!) thread pooling on the db: this had the effect of making entering single energy consumption data points take less than one second (down from 20s) and a larger query of the list of available meters took 5s compared to a previous 34s. It just involved setting

use_tpool=true

in /etc/ceilometer/ceilometer.conf and bingo – significant uptick in performance (for our small, experimental system).

Not sure how widely applicable this is, and not sure if it’s realistic for production environments – for our experimental system, it turned an unworkable system into something which is usable (but certainly not speedy!)

 

Benchmarking OpenStack by using Rally – part 1

As system administrators, we find it difficult to gather performance data before going into production. Benchmarking tools offer a comfortable way to gather performance data by simulating usage of a production system. In the OpenStack world we can employ the Mirantis Rally tool to benchmark VM performance of our cloud environment.

Rally comes with some predefined benchmarking tasks such as booting new VMs, starting VMs and running shell scripts on them, concurrently building new VMs and many more. The drawing below shows the performance of booting VMs in an OpenStack instance in a Shewhart Control Chart (often called “X-Chart” or “X-Bar-Chart”). As you can see, it takes almost 7.2 seconds on average to start a VM, and sometimes the startup time falls outside the usual six sigma range. For a system administrator this could be quite useful data.

An X-Chart of VM boot performance in OpenStack.

The data above was collected employing the Rally benchmark software. The Python-based Rally tool is free, open-source and extremely easy to deploy. First you have to download Rally from this Github link.

Rally comes with an install script: just clone the GitHub repository into a folder of your choice, cd into that folder and run:

$ ./rally/install_rally.sh

Then deploy Rally by filling your OpenStack credentials in a JSON-file:

And then type:

$ rally deployment create --filename=existing.json --name=existing
+----------+----------------------------+----------+-----------------+
|   uuid   |         created_at         |   name   |      status     |
+----------+----------------------------+----------+-----------------+
|   UUID   | 2014-04-15 11:00:28.279941 | existing | deploy-finished |
+----------+----------------------------+----------+-----------------+
Using deployment : UUID 

Remember to use the UUID you got after running the previous command.
Then type:

$ rally use deployment --deploy-id=UUID
Using deployment : UUID

Then you are ready to use Rally. Rally comes with some pre-configured test scenarios in its doc folder. Just copy a file such as rally/doc/samples/tasks/nova/boot-and-delete.json to your favourite location, e.g. /etc/rally/mytask.json:


$ cp rally/doc/samples/tasks/nova/boot-and-delete.json /etc/rally/mytask.json

Before you can run a Rally task, you have to configure the tasks. This can be done either via JSON- or via YAML-files. The Rally API can deal with both file format types.
If you edit the JSON-file mytask.json, you see something like the following:


{
    "NovaServers.boot_and_delete_server": [
        {
            "args": {
                "flavor_id": 1,
                "image_id": "Glance UUID"
            },
            "runner": {
                "type": "constant",
                "times": 10,
                "concurrency": 2
            },
            "context": {
                "users": {
                    "tenants": 3,
                    "users_per_tenant": 2
                }
            }
        }
    ]
}

You have to add the correct UUID of a Glance image in order to configure the test run properly. The UUID can be retrieved by typing:


$ rally show images
+--------------------------------------+--------+----------+
|                 UUID                 |  Name  | Size (B) |
+--------------------------------------+--------+----------+
| d3db863b-ebff-4156-a139-5005ec34cfb7 | Cirros | 13147648 |
| d94f522f-008a-481c-9330-1baafe4933be | TestVM | 14811136 |
+--------------------------------------+--------+----------+

Update the mytask.json file with the UUID of the Glance image.
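
If you prefer to script this edit, the following minimal Python sketch sets the image UUID in a task file with the structure shown above. The function names are hypothetical helpers, not part of Rally, and the sketch assumes every scenario entry should use the same image.

```python
import json


def set_image_id(task, image_uuid):
    """Set args.image_id for every scenario entry in a Rally task dict."""
    for runs in task.values():
        for run in runs:
            run.setdefault("args", {})["image_id"] = image_uuid
    return task


def update_task_file(path, image_uuid):
    """Rewrite the task file at `path` in place with the given image UUID."""
    with open(path) as f:
        task = json.load(f)
    set_image_id(task, image_uuid)
    with open(path, "w") as f:
        json.dump(task, f, indent=4)
```

For example, update_task_file("/etc/rally/mytask.json", "d3db863b-ebff-4156-a139-5005ec34cfb7") would select the Cirros image from the listing above.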

To run the task, simply type (the “-v” flag is for “verbose” output):


$ rally -v task start /etc/rally/mytask.json

=================================================================
Task  ... is started
------------------------------------------------------------------
2014-05-12 11:54:07.060 . INFO rally.benchmark.engine [-] Task ... 
2014-05-12 11:54:07.864 . INFO rally.benchmark.engine [-] Task ... 
2014-05-12 11:54:07.864 . INFO rally.benchmark.engine [-] Task ... 
...
+--------------------+-------+---------------+---------------+---------------+---------------+---------------+
|       action       | count |   max (sec)   |   avg (sec)   |   min (sec)   | 90 percentile | 95 percentile |
+--------------------+-------+---------------+---------------+---------------+---------------+---------------+
|  nova.boot_server  |   10  | 8.28417992592 | 5.87529754639 | 4.68817186356 | 7.33927609921 | 7.81172801256 |
| nova.delete_server |   10  | 6.39436888695 | 4.54159021378 | 4.31421685219 | 4.61614284515 | 5.50525586605 |
+--------------------+-------+---------------+---------------+---------------+---------------+---------------+

+---------------+---------------+---------------+---------------+
|   max (sec)   |   avg (sec)   |   min (sec)   | 90 percentile |
+---------------+---------------+---------------+---------------+
| 13.6288781166 | 10.4170130491 | 9.01177096367 | 12.7189923525 |
+---------------+---------------+---------------+---------------+...
...

The statistical output is now of major interest: it shows how long it takes to boot a VM instance in OpenStack and gives some useful information about the performance of your current OpenStack deployment. Rally performs 10 test runs and measures the average runtime across them. This technique is called statistical sampling. So each Rally run can be viewed as a sample which is represented as one data point in a Shewhart control chart.

But how did we get our data into a Shewhart Control chart? This will be explained further in part 2.
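
As a rough illustration of the idea only (this is not the procedure from part 2), the Python sketch below computes a center line and three-sigma control limits from a baseline of per-run averages and flags later runs that fall outside them. The function names and the sample values in the usage note are hypothetical, and estimating sigma from the standard deviation of the baseline averages is a simplification of the usual X-bar chart procedure.

```python
from statistics import mean, stdev


def control_limits(baseline_means):
    """Return (lower limit, center line, upper limit) from baseline averages."""
    center = mean(baseline_means)
    # Simplification: sigma is estimated from the averages themselves.
    sigma = stdev(baseline_means)
    return center - 3 * sigma, center, center + 3 * sigma


def out_of_control(run_means, limits):
    """Return the indices of runs whose average lies outside the limits."""
    lower, _, upper = limits
    return [i for i, value in enumerate(run_means) if value < lower or value > upper]
```

With a baseline of [5.0, 6.0, 7.0] seconds the limits are 3.0 and 9.0 seconds around a center line of 6.0, so a later run averaging 9.5 seconds would be flagged.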

Manage instance startup order in OpenStack Heat Templates

In many applications it is necessary to create virtual resources in a certain order. As an orchestration engine, Heat is able to support such a requirement, but how it is actually done in a template can be tricky. Recently I had to write such a Heat template, which seemed pretty easy as there are a number of examples in the OpenStack/heat-templates GitHub repository. My requirements and the relative lack of explanation on how the templates are written made this a bit more difficult than expected, but after finding information dispersed over several websites I solved my issues. This post is a summary of my findings.

My application was made of three servers which had to be started and configured in a specific order: each server needs to be ready before the next one can be started, as it automatically connects to the previously started servers. This was really the main concern of the application. In the following examples I will use the names service1, service2 and service3, with the startup order being service1 > service2 > service3. I had three requirements:

  1. I wanted to follow the Heat Orchestration Template (HOT) format, the latest template format, which is meant to replace the Heat CloudFormation-compatible format (CFN) as the native format supported by Heat over time, so that my template remains usable in future Heat versions.
  2. To support my startup order I needed to use WaitConditions, which come directly from the CFN format; HOT still supports the use of CFN resources in the new format.
  3. My image did not have the cfn tools installed, so I could not use cfn calls directly from inside the machine during the post-boot phase. This is an issue because the templates that can be found on GitHub all use these tools when WaitConditions are used.

The idea of WaitConditions is that they have to be declared and linked to one resource, and when this resource is configured and ready it sends a signal back to Heat. Another resource depending on this signal can then be started. The template which met my requirements can be found on GitHub; I will explain the relevant parts here:


  service1: 
    type: "OS::Nova::Server"
    properties: 
      flavor: m1.medium
      image: ubuntu_cloud
      key_name: 
        get_param: key_name
      user_data: 
        str_replace: 
          template: |
              #!/bin/bash
              curl -X PUT -H 'Content-Type:application/json' \
                   -d '{"Status" : "SUCCESS","Reason" : "Configuration OK","UniqueId" : "SERVICE1","Data" : "Service1 Configured."}' \
                   "$wait_handle$"
          params: 
            $wait_handle$: 
              get_resource: service1_wait_handle

  service1_wait: 
    type: "AWS::CloudFormation::WaitCondition"
    depends_on: service1
    properties: 
      Handle: 
        get_resource: service1_wait_handle
      Timeout: 1000

  service1_wait_handle: 
    type: "AWS::CloudFormation::WaitConditionHandle"

A first resource “service1” is declared, along with a WaitCondition and a WaitConditionHandle declared as separate resources; the two are linked together, and the WaitCondition additionally depends on service1. The interesting part is in the post-boot script of service1, under user_data. Here you can see a curl call that sends a specific JSON data blob (details on CloudFormation’s website) through a PUT to an address retrieved from the WaitConditionHandle designated as service1_wait_handle. This is what signals success to the wait condition. Now how is it possible to specify that the next virtual instance has to wait for this success signal before being started?


  service2: 
    type: "OS::Nova::Server"
    depends_on: service1_wait
    properties: 
      flavor: 
        get_param: instance_type
      image: ubuntu_cloud
      key_name: 
        get_param: key_name
      user_data: 
        str_replace: 
          template: |
              #!/bin/bash
              curl -X PUT -H 'Content-Type:application/json' \
                -d '{"Status" : "SUCCESS","Reason" : "Configuration OK","UniqueId" : "SERVICE2","Data" : "Service2 Configured."}' \
                "$wait_handle$"
          params: 
            $data$: 
              get_attr: 
                - service1_wait
                - Data
            $wait_handle$: 
              get_resource: service2_wait_handle

  service2_wait:
    type: "AWS::CloudFormation::WaitCondition"
    depends_on: service2
    properties: 
      Handle: 
        get_resource: service2_wait_handle
      Timeout: 1000

  service2_wait_handle: 
    type: "AWS::CloudFormation::WaitConditionHandle"

Here you can see a structure similar to the one shown on the previous code snippet, with a new WaitCondition and Handle. This is because this server will in turn need to be configured before the final server can be started. The service2 resource differs on two points:

depends_on: service1_wait

This specifies that this resource depends on the completion of the service1_wait WaitCondition. Intuitively this should be enough, as one might think that this will only happen once the success signal described previously has been sent. Unfortunately it is not sufficient: at least in the Havana release, where this template was tested, the resource did not wait at all and was started as soon as the template was created. A workaround for this problem is implemented in this code snippet:


  params: 
    $data$: 
      get_attr: 
        - service1_wait
        - Data

This specifically tells Heat that service2 needs to retrieve the data (in our case, a string) sent through the curl call in the service1 post-boot script. This requirement is what actually makes service2 wait for service1 to be ready, even though the actual post-boot script of service2 contains no reference to this data at all: it is sufficient to retrieve it in the params section of str_replace without using it in the script. With this template, you can now start and configure your instances in whatever order fits your application’s requirements, and even combine wait conditions so that instance C waits for instance B, which in turn waits for instance A. It is also possible to actually use the data sent through the success signal in other templates if this makes sense in your application configuration scheme.
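
The success signal does not have to come from curl. As a minimal sketch, assuming Python 3 is available in the image, the same PUT request can be built with the standard library. The function names are hypothetical; the JSON fields and the Content-Type header mirror the curl calls in the templates above, and the handle URL is whatever Heat substitutes for $wait_handle$.

```python
import json
import urllib.request


def build_signal_request(handle_url, unique_id, data, reason="Configuration OK"):
    """Build the PUT request that reports SUCCESS to a wait condition handle."""
    body = json.dumps({
        "Status": "SUCCESS",
        "Reason": reason,
        "UniqueId": unique_id,
        "Data": data,
    }).encode("utf-8")
    return urllib.request.Request(
        handle_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def send_signal(handle_url, unique_id, data):
    """Send the signal and return the HTTP status code of the response."""
    request = build_signal_request(handle_url, unique_id, data)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status
```

Splitting the request construction from the network call keeps the payload easy to inspect before anything is sent.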

Deploy Ceph and start using it: end to end tutorial – simple librados client (part 3/3)

(Part 1/3 – Installation – Part 2/3 – troubleshooting)

This part of the tutorial describes how to set up a simple Ceph client using librados (for C++).

The only information that the client requires for the cephx authentication is

  • Endpoint of the monitor node
  • Keyring containing the pre-shared secret (we will use the admin keyring)

Install librados APIs

On Ubuntu, the library is available in the repositories

$ sudo apt-get install librados-dev

Create a client configuration file

This is the file from which librados will read the client configuration.

The content of the file is structured according to this template:

[global]
mon host = <IP address of one of the monitors>
keyring = <path/to/client.admin.keyring>

for example:

[global]
mon host = 192.168.252.10:6789
keyring = ./ceph.client.admin.keyring

The public endpoint of the monitor node can be retrieved with

$ ceph mon stat

The keyring file can be copied from the admin node. No change is needed to this file. The same information that is contained in the file can be retrieved with this command that will also list the client capabilities:

$ ceph auth get client.admin

Connect to the cluster

The following simple client will perform the following operations:

  • Read the configuration file (ceph.conf) from the local directory
  • Get a handle to the cluster and IO context on the “data” pool
  • Create a new object
  • Set an xattr
  • Read the object and xattr back
  • Print the list of pools
  • Print the list of objects in the “data” pool
  • Cleanup

#include <rados/librados.hpp>
#include <iostream>
#include <list>
#include <string>

int main(int argc, const char **argv)
{
  int ret = 0;
  /*
   * Errors are not checked to avoid pollution.
   * After each Ceph operation:
   * if (ret < 0) error_condition
   * else success
   */

  // Get cluster handle and connect to cluster
  std::string cluster_name("ceph");
  std::string user_name("client.admin");
  librados::Rados cluster;
  cluster.init2(user_name.c_str(), cluster_name.c_str(), 0);
  cluster.conf_read_file("ceph.conf");
  cluster.connect();

  // IO context
  librados::IoCtx io_ctx;
  std::string pool_name("data");
  cluster.ioctx_create(pool_name.c_str(), io_ctx);

  // Write an object synchronously
  librados::bufferlist bl;
  std::string objectId("hw");
  std::string objectContent("Hello World!");
  bl.append(objectContent);
  io_ctx.write(objectId, bl, objectContent.size(), 0);

  // Add an xattr to the object.
  librados::bufferlist lang_bl;
  lang_bl.append("en_US");
  io_ctx.setxattr(objectId, "lang", lang_bl);

  // Read the object back asynchronously
  librados::bufferlist read_buf;
  int read_len = 4194304;
  // Create I/O Completion.
  librados::AioCompletion *read_completion =
      librados::Rados::aio_create_completion();
  // Send read request.
  io_ctx.aio_read(objectId, read_completion, &read_buf, read_len, 0);
  // Wait for the request to complete, and print content
  read_completion->wait_for_complete();
  read_completion->get_return_value();
  // The buffer is not guaranteed to be NUL-terminated, so pass its length.
  std::cout << "Object name: " << objectId << "\n"
            << "Content: " << std::string(read_buf.c_str(), read_buf.length())
            << std::endl;
  read_completion->release();

  // Read the xattr.
  librados::bufferlist lang_res;
  io_ctx.getxattr(objectId, "lang", lang_res);
  std::cout << "Object xattr: "
            << std::string(lang_res.c_str(), lang_res.length()) << std::endl;

  // Print the list of pools
  std::list<std::string> pools;
  cluster.pool_list(pools);
  std::cout << "List of pools from this cluster handle" << std::endl;
  for (auto pool_id : pools) {
    std::cout << "\t" << pool_id << std::endl;
  }

  // Print the list of objects
  librados::ObjectIterator oit = io_ctx.objects_begin();
  librados::ObjectIterator oet = io_ctx.objects_end();
  std::cout << "List of objects from this pool" << std::endl;
  for (; oit != oet; oit++) {
    std::cout << "\t" << oit->first << std::endl;
  }

  // Remove the xattr
  io_ctx.rmxattr(objectId, "lang");
  // Remove the object.
  io_ctx.remove(objectId);

  // Cleanup
  io_ctx.close();
  cluster.shutdown();
  return 0;
}

Find the pastebin here.

This example can be compiled and executed with

$ g++ -std=c++11 client.cpp -lrados -o cephclient
$ ./cephclient

Operate with cluster data from the command line

To quickly verify if an object was written or to remove it, use the following commands (e.g., from the monitor node).

  • List objects in pool data

    $ rados -p data ls
  • Check the location of an object in pool data

    $ ceph osd map data <object name>
  • Remove object from pool data

    $ rados rm <object name> --pool=data

Deploy Ceph and start using it: end to end tutorial – Troubleshooting (part 2/3)

(Part 1/3 – Installation – Part 3/3 – librados client)

It is quite common that after the initial installation, the Ceph cluster reports health warnings. Before using the cluster for storage (e.g., allow clients to access it), a HEALTH_OK state should be reached:

cluster-admin@ceph-mon0:~/ceph-cluster$ ceph health
HEALTH_OK

This part of the tutorial provides some troubleshooting hints that I collected during the setup of my deployments. Other helpful resources are the Ceph IRC channel and mailing lists.

Useful diagnostic commands

A collection of diagnostic commands to check the status of the cluster is listed here. Running these commands is how we can understand that the Ceph cluster is not properly configured.

  1. Ceph status
    $ ceph status

    In this example, the disk for one OSD had been physically removed, so 2 out of 3 OSDs were in and up.

    cluster-admin@ceph-mon0:~/ceph-cluster$ ceph status
        cluster 28f9315e-6c5b-4cdc-9b2e-362e9ecf3509
         health HEALTH_OK
         monmap e1: 1 mons at {ceph-mon0=192.168.0.1:6789/0}, election epoch 1, quorum 0 ceph-mon0
         osdmap e122: 3 osds: 2 up, 2 in
          pgmap v4699: 192 pgs, 3 pools, 0 bytes data, 0 objects
                87692 kB used, 1862 GB / 1862 GB avail
                     192 active+clean
  2. Ceph health
    $ ceph health
    $ ceph health detail
  3. Pools and OSDs configuration and status
    $ ceph osd dump
    $ ceph osd dump --format=json-pretty

    The second version provides much more information, listing all the pools and OSDs and their configuration parameters.

  4. Tree of OSDs reflecting the CRUSH map
    $ ceph osd tree

    This is very useful to understand how the cluster is physically organized (e.g., which OSDs are running on which host).

  5. Listing the pools in the cluster
    $ ceph osd lspools

    This is particularly useful to check client operations (e.g., if new pools were created).

  6. Check the CRUSH rules
    $ ceph osd crush dump --format=json-pretty
  7. List the disks of one node from the admin node
    $ ceph-deploy disk list osd0
  8. Check the logs.
    Log files in /var/log/ceph/ will provide a lot of information for troubleshooting. Each node of the cluster will contain logs about the Ceph components that it runs, so you may need to SSH into different hosts to get a complete diagnosis.

Check your firewall and network configuration

Every node of the Ceph cluster must be able to successfully run

$ ceph status

If this operation times out without giving any results, it is likely that the firewall (or network configuration) is not allowing the nodes to communicate.

Another symptom of this problem is that OSDs cannot be activated, i.e., the ceph-deploy osd activate <args> command will time out.

The Ceph monitor default port is 6789. Ceph OSDs and MDS try to get the first available ports starting at 6800.

A typical Ceph cluster might need the following ports:

Mon:  6789
Mds:  6800
Osd1: 6801
Osd2: 6802
Osd3: 6803

Depending on your security requirements, you may want to simply allow any traffic to and from the Ceph cluster nodes.
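
To check whether a firewall is blocking these ports, a plain TCP connection test from another node is often enough. The following is a minimal Python sketch; the function names are hypothetical, and a successful connection only shows that the port is reachable, not that the Ceph daemon behind it is healthy.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False


def check_ports(host, ports, timeout=3.0):
    """Map each port to True (reachable) or False (unreachable)."""
    return {port: port_open(host, port, timeout) for port in ports}
```

For instance, check_ports("192.168.252.10", [6789]) tests the monitor endpoint used in the examples of this tutorial.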

References: http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/2231

Try restarting first

Without going into detailed troubleshooting and log analysis, I’ve noticed that sometimes (especially after the first installation) a simple restart of the Ceph components helps the transition from a HEALTH_WARN to a HEALTH_OK state.

If some of the OSDs are not in or not up, like in the case below

    cluster 07d28faa-48ae-4356-a8e3-19d5b81e159e
     health HEALTH_WARN 192 pgs incomplete; 192 pgs stuck inactive; 192 pgs stuck unclean; 1/2 in osds are down; clock skew detected on mon.1, mon.2
     monmap e3: 3 mons at {0=192.168.252.10:6789/0,1=192.168.252.11:6789/0,2=192.168.252.12:6789/0}, election epoch 36, quorum 0,1,2 0,1,2
     osdmap e27: 6 osds: 1 up, 2 in
      pgmap v57: 192 pgs, 3 pools, 0 bytes data, 0 objects
            84456 kB used, 7865 MB / 7948 MB avail
                 192 incomplete

try to start the OSD daemons with

# on osd0
$ sudo /etc/init.d/ceph -a start osd0

If the OSDs are in, but PGs are in weird states, like in the example below

    cluster 07d28faa-48ae-4356-a8e3-19d5b81e159e
     health HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean; clock skew detected on mon.1, mon.2
     monmap e3: 3 mons at {0=192.168.252.10:6789/0,1=192.168.252.11:6789/0,2=192.168.252.12:6789/0}, election epoch 36, quorum 0,1,2 0,1,2
     osdmap e34: 6 osds: 6 up, 6 in
      pgmap v71: 192 pgs, 3 pools, 0 bytes data, 0 objects
            235 MB used, 23608 MB / 23844 MB avail
                 128 active+degraded
                  64 active+replay+degraded

try to restart the monitor(s) with

# on mon0
$ sudo /etc/init.d/ceph -a restart mon0

Unfortunately, a simple restart will be the solution in just a few rare cases. More troubleshooting will be required in the majority of the situations.

Unable to find keyring

During the deployment of the monitor nodes (the ceph-deploy <mon> [<mon>] create-initial step), Ceph may complain about missing keyrings:

[ceph_deploy.gatherkeys][WARNIN] Unable to find
/etc/ceph/ceph.client.admin.keyring on ['ceph-server']

If this warning is reported (even if the message is not an error), the Ceph cluster will probably not reach a healthy state.

The solution to this problem is to use exactly the same names for the hostnames (i.e., the output of hostname -s) and the Ceph node names.

This means that the files

  • /etc/hosts
  • /etc/hostname
  • .ssh/config (only for the admin node)

and the result of the command hostname -s, all should have the same names for a certain node.
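
A quick way to compare the names on each node is sketched below in Python. It is only an illustration: the function names are hypothetical, and taking the part before the first dot is an assumption about how the short hostname is derived, mirroring hostname -s.

```python
import socket


def short_hostname(name=None):
    """Return the host name up to the first dot (defaults to this machine)."""
    if name is None:
        name = socket.gethostname()
    return name.split(".", 1)[0]


def names_consistent(ceph_node_name, hostname=None):
    """Return True if the Ceph node name equals the short hostname."""
    return short_hostname(hostname) == ceph_node_name
```

Running names_consistent("ceph-mon0") on the monitor node should return True if the node was registered in Ceph under that name.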

See also:

Check that replication requirements can be met

I’ve found that most of my problems with Ceph health were related to wrong (i.e., unfeasible) replication policies.

This is particularly likely to happen in test deployments where one doesn’t care about setting up many OSDs or separating them across different hosts.

Some common pitfalls here may be:

  1. The number of required replicas is higher than the number of OSDs (!!)
  2. CRUSH is instructed to separate replicas across hosts but multiple OSDs are on the same host and there are not enough OSD hosts to satisfy this condition

The visible effect when running diagnostic commands is that PGs will be in wrong statuses.

CASE 1: the replication level is such that it cannot be accomplished with the current cluster (e.g., a replica size of 3 with 2 OSDs).

Check the replicated size of pools with

$ ceph osd dump

Adjust the replicated size and min_size, if required, by running

$ ceph osd pool set <pool_name> size <value>
$ ceph osd pool set <pool_name> min_size <value>

CASE 2: the replication policy would require replicas to sit on separate hosts, but OSDs are running within the same hosts

Check what crush_ruleset applies to a certain pool with

$ ceph osd dump --format=json-pretty

In the example below, the pool with id 0 (“data”) is using the crush_ruleset with id 0

"pools": [
        { "pool": 0,
          "pool_name": "data",
          [...]
          "crush_ruleset": 0,  <----
          "object_hash": 2,
          [...]

then check with

$ ceph osd crush dump --format=json-pretty

what crush_ruleset 0 is about.

In the example below, we can observe that this rule says to replicate data by choosing the first available leaf in the CRUSH map, which is of type host.

"rules": [
        { "rule_id": 0,
          "rule_name": "replicated_ruleset",
          "ruleset": 0,
          "type": 1,
          "min_size": 1,
          "max_size": 10,
          "steps": [
                { "op": "take",
                  "item": -1,
                  "item_name": "default"},
                { "op": "chooseleaf_firstn",     <-----------
                  "num": 0,
                  "type": "host"},               <-----------
                { "op": "emit"}]}],

If not enough hosts are available, then the application of this rule will fail.

To allow replicas to be created on different OSDs but possibly on the same host, we need to create a new ruleset:

$ ceph osd crush rule create-simple replicate_within_hosts default osd

After the rule has been created, it should be listed in the output of

$ ceph osd crush dump

from where we can note its id.

The next step is to apply this rule to the pools as required:

$ ceph osd pool set data crush_ruleset <rulesetId>
$ ceph osd pool set metadata crush_ruleset <rulesetId>
$ ceph osd pool set rbd crush_ruleset <rulesetId>
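
When several pools have to be switched, the commands above can be generated rather than typed one by one. The helper below is a minimal sketch with a made-up name (ruleset_cmds); it only prints the commands, so that they can be reviewed before anything is changed on the cluster.

```shell
# Hypothetical helper (not part of Ceph).
# Usage: ruleset_cmds RULESET_ID POOL [POOL...]
# Prints, without running them, one "ceph osd pool set" command per pool.
ruleset_cmds() {
    ruleset_id=$1
    shift
    for pool in "$@"; do
        printf 'ceph osd pool set %s crush_ruleset %s\n' "$pool" "$ruleset_id"
    done
}
```

Running ruleset_cmds 1 data metadata rbd prints the three commands for ruleset id 1; once they look right, the output can be piped to sh.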

Deploy Ceph and start using it: end to end tutorial – Installation (part 1/3)

Ceph is one of the most interesting distributed storage systems available, with very active development and a complete set of features that make it a valuable candidate for cloud storage services. This tutorial goes through the steps (and some related troubleshooting) required to set up a Ceph cluster and access it with a simple client using librados. Please refer to the Ceph documentation for detailed insights on Ceph components.

(Part 2/3 – Troubleshooting – Part 3/3 – librados client)

Assumptions

  • Ceph version: 0.79
  • Installation with ceph-deploy
  • Operating system for the Ceph nodes: Ubuntu 14.04

Cluster architecture

In a minimal Ceph deployment, a Ceph cluster includes one Ceph monitor (MON) and a number of Object Storage Devices (OSD).

Administrative and control operations are issued from an admin node, which does not necessarily have to be separate from the Ceph cluster (e.g., the monitor node can also act as the admin node). Metadata server nodes (MDS) are required only for the Ceph Filesystem (Ceph Block Devices and Ceph Object Storage do not use MDS).

Preparing the storage

WARNING: preparing the storage for Ceph means deleting a disk’s partition table and losing all its data. Proceed only if you know exactly what you are doing!

Ceph will need some physical storage to be used as Object Storage Devices (OSD) and Journal. As the project documentation recommends, for better performance, the Journal should be on a different drive from the OSD. Ceph supports ext4, btrfs and xfs. I tried setting up clusters with both btrfs and xfs; however, I could achieve stable results only with xfs, so the rest of this tutorial uses xfs.

  1. Prepare a GPT partition table (I have observed stability issues when using a dos partition)
    $ sudo parted /dev/sd<x>
    (parted) mklabel gpt
    (parted) mkpart primary xfs 0 100%
    (parted) quit

    If parted complains about alignment issues (“Warning: The resulting partition is not properly aligned for best performance”), check these two links to find a solution: 1 and 2.

  2. Format the disk with xfs (you might need to install xfs tools with sudo apt-get install xfsprogs)
    $ sudo mkfs.xfs /dev/sd<x>1
  3. Create a Journal partition (raw/unformatted)
    $ sudo parted /dev/sd<y>
    (parted) mklabel gpt
    (parted) mkpart primary 0 100%

Install ceph-deploy

The ceph-deploy tool needs to be installed only on the admin node. Access to the other nodes for configuration purposes will be handled by ceph-deploy over SSH (with keys).

  1. Add the Ceph repository to your apt configuration, replacing {ceph-stable-release} with the name of the Ceph release that you want to install (e.g., emperor, firefly, …)
    $ echo deb http://ceph.com/debian-{ceph-stable-release}/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list
  2. Install the trusted key with
    $ wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc' | sudo apt-key add -
  3. If there is no repository for your Ubuntu version, you can try to select the newest one available by manually editing the file /etc/apt/sources.list.d/ceph.list and changing the Ubuntu codename (e.g., trusty -> raring), so that the file contains a line such as
    deb http://ceph.com/debian-emperor raring main
  4. Install ceph-deploy
    $ sudo apt-get update
    $ sudo apt-get install ceph-deploy
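
The repository line used in steps 1 and 3 can also be produced by a small function, which makes it easy to try a different release or codename. This is only a sketch: ceph_apt_line is a made-up name, and the URL layout is simply the one shown in step 1.

```shell
# Hypothetical helper (not part of Ceph).
# Usage: ceph_apt_line RELEASE [CODENAME]
# Prints the apt source line for the given Ceph release. When CODENAME is
# omitted, the codename of the running system (lsb_release -sc) is used.
ceph_apt_line() {
    release=$1
    codename=${2:-$(lsb_release -sc)}
    printf 'deb http://ceph.com/debian-%s/ %s main\n' "$release" "$codename"
}
```

For instance, ceph_apt_line firefly | sudo tee /etc/apt/sources.list.d/ceph.list writes the line for the current Ubuntu version, while ceph_apt_line emperor raring forces another codename as described in step 3.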

Setup the admin node

Each Ceph node will be set up with a user having passwordless sudo permissions, and each node will store the public key of the admin node to allow for passwordless SSH access. With this configuration, ceph-deploy will be able to install and configure every node of the cluster.

NOTE: the hostnames (i.e., the output of hostname -s) must match the Ceph node names!

  1. [optional] Create a dedicated user for cluster administration (this is particularly useful if the admin node is part of the Ceph cluster)
    $ sudo useradd -d /home/cluster-admin -m cluster-admin -s /bin/bash

    then set a password and switch to the new user

    $ sudo passwd cluster-admin
    $ su cluster-admin
  2. Install SSH server on all the cluster nodes (even if a cluster node is also an admin node)
    $ sudo apt-get install openssh-server
  3. Add a ceph user on each Ceph cluster node (even if a cluster node is also an admin node) and give it passwordless sudo permissions
    $ sudo useradd -d /home/ceph -m ceph -s /bin/bash
    $ sudo passwd ceph
    <Enter password>
    $ echo "ceph ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/ceph
    $ sudo chmod 0440 /etc/sudoers.d/ceph
  4. Edit the /etc/hosts file to add mappings to the cluster nodes. Example:
    $ cat /etc/hosts
    127.0.0.1       localhost
    192.168.58.2    mon0
    192.168.58.3    osd0
    192.168.58.4    osd1

    To let DNS lookups (such as the host command used in step 7) resolve the names listed in the hosts file, install dnsmasq

    $ sudo apt-get install dnsmasq
  5. Generate an SSH key pair for the admin user and install the public key on every Ceph node
    $ ssh-keygen
    $ ssh-copy-id ceph@mon0
    $ ssh-copy-id ceph@osd0
    $ ssh-copy-id ceph@osd1
  6. Set up an SSH access configuration by editing the .ssh/config file. Example:
    Host osd0
       Hostname osd0
       User ceph
    Host osd1
       Hostname osd1
       User ceph
    Host mon0
       Hostname mon0
       User ceph
  7. Before proceeding, check that ping and host commands work for each node
    $ ping mon0
    $ ping osd0
    ...
    $ host osd0
    $ host osd1
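
With more than a handful of nodes, the .ssh/config entries from step 6 become repetitive. As a sketch, the function below (ssh_config_stanzas is a made-up name) prints one stanza per node in the same layout as the example above, assuming every node uses the same login user.

```shell
# Hypothetical helper, for illustration only.
# Usage: ssh_config_stanzas USER NODE [NODE...]
# Prints one Host/Hostname/User stanza per node, in the layout used in step 6.
ssh_config_stanzas() {
    user=$1
    shift
    for node in "$@"; do
        printf 'Host %s\n   Hostname %s\n   User %s\n' "$node" "$node" "$user"
    done
}
```

Calling ssh_config_stanzas ceph mon0 osd0 osd1 >> ~/.ssh/config appends the three stanzas shown in step 6.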

Setup the cluster

Administration of the cluster is done entirely from the admin node.

  1. Move to a dedicated directory to collect the files that ceph-deploy will generate. This will be the working directory for any further use of ceph-deploy
    $ mkdir ceph-cluster
    $ cd ceph-cluster
  2. Create the new cluster configuration, declaring the initial monitor node(s) (replace mon0 with the list of hostnames of the initial monitor nodes)
    $ ceph-deploy new mon0
    [ceph_deploy.cli][INFO  ] Invoked (1.4.0): /usr/bin/ceph-deploy new mon0
    [ceph_deploy.new][DEBUG ] Creating new cluster named ceph
    [ceph_deploy.new][DEBUG ] Resolving host mon0
    [ceph_deploy.new][DEBUG ] Monitor mon0 at 192.168.58.2
    [ceph_deploy.new][INFO  ] making sure passwordless SSH succeeds
    [ceph_deploy.new][DEBUG ] Monitor initial members are ['mon0']
    [ceph_deploy.new][DEBUG ] Monitor addrs are ['192.168.58.2']
    [ceph_deploy.new][DEBUG ] Creating a random mon key...
    [ceph_deploy.new][DEBUG ] Writing initial config to ceph.conf...
    [ceph_deploy.new][DEBUG ] Writing monitor keyring to ceph.mon.keyring...
  3. Add a public network entry in the ceph.conf file if you have separate public and cluster networks (check the network configuration reference)
    public network = {ip-address}/{netmask}
  4. Install Ceph on all the nodes of the cluster. Use the --no-adjust-repos option if you are using different apt configurations for Ceph. NOTE: you may need to confirm the authenticity of the hosts if you are accessing them over SSH for the first time!
    Example (replace mon0 osd0 osd1 with your node names):

    $ ceph-deploy install --no-adjust-repos mon0 osd0 osd1
  5. Create monitor and gather keys
    $ ceph-deploy mon create-initial
  6. The content of the working directory after this step should look like
    cadm@mon0:~/my-cluster$ ls
    ceph.bootstrap-mds.keyring  ceph.bootstrap-osd.keyring  ceph.client.admin.keyring  ceph.conf  ceph.log  ceph.mon.keyring  release.asc
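
A quick way to verify step 6 is to check the working directory for the configuration and keyring files listed above. The helper below is a minimal sketch with a made-up name (missing_cluster_files); the file list is taken from the ls output shown in this tutorial and may differ with other ceph-deploy versions.

```shell
# Hypothetical helper, for illustration only.
# Usage: missing_cluster_files [DIR]
# Prints the name of every expected file that is NOT present in DIR
# (default: current directory). No output means nothing is missing.
missing_cluster_files() {
    dir=${1:-.}
    for f in ceph.conf ceph.mon.keyring ceph.client.admin.keyring \
             ceph.bootstrap-osd.keyring ceph.bootstrap-mds.keyring; do
        [ -e "$dir/$f" ] || echo "$f"
    done
}
```

If missing_cluster_files prints any keyring name after ceph-deploy mon create-initial, the key gathering did not complete and it is worth looking at ceph.log before moving on.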

Prepare OSDs and OSD Daemons

When deploying OSDs, consider that a single node can run multiple OSD Daemons and that, for better performance, the journal partition should be on a different drive from the OSD.

  1. List disks on a node (replace osd0 with the name of your storage node(s))
    $ ceph-deploy disk list osd0

    This command is also useful for diagnostics: when an OSD is correctly mounted on Ceph, you should see entries similar to this one in the output:

    [ceph-osd1][DEBUG ] /dev/sdb :
    [ceph-osd1][DEBUG ] /dev/sdb1 other, xfs, mounted on /var/lib/ceph/osd/ceph-0
  2. If you haven’t already prepared your storage, or if you want to reformat a partition, use the zap command (WARNING: this will erase the partition)
    $ ceph-deploy disk zap --fs-type xfs osd0:/dev/sd<x>1
  3. Prepare and activate the disks (ceph-deploy also has a create command that should combine these two operations, but for some reason it was not working for me). In this example, we are using /dev/sd<x>1 as OSD and /dev/sd<y>2 as journal on two different nodes, osd0 and osd1
    $ ceph-deploy osd prepare osd0:/dev/sd<x>1:/dev/sd<y>2 osd1:/dev/sd<x>1:/dev/sd<y>2
    $ ceph-deploy osd activate osd0:/dev/sd<x>1:/dev/sd<y>2 osd1:/dev/sd<x>1:/dev/sd<y>2
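
The node:data:journal arguments passed to prepare and activate in step 3 have to be identical in both commands, so it can help to build them once. The function below is a sketch with a made-up name (osd_specs) and assumes, as in the example above, that the data and journal partitions have the same device names on every node.

```shell
# Hypothetical helper, for illustration only.
# Usage: osd_specs DATA_PARTITION JOURNAL_PARTITION NODE [NODE...]
# Prints one "node:data:journal" argument per node, assuming that all nodes
# use the same device names.
osd_specs() {
    data=$1
    journal=$2
    shift 2
    for node in "$@"; do
        printf '%s:%s:%s\n' "$node" "$data" "$journal"
    done
}
```

With example device names /dev/sdb1 and /dev/sdc2, ceph-deploy osd prepare $(osd_specs /dev/sdb1 /dev/sdc2 osd0 osd1) expands to the same form as the prepare command in step 3, and the same substitution can be reused for activate.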

Final steps

Now we need to copy the cluster configuration to all nodes and check the operational status of our Ceph deployment.

  1. Copy keys and configuration files (replace mon0 osd0 osd1 with the names of your Ceph nodes)
    $ ceph-deploy admin mon0 osd0 osd1
  2. Ensure proper permissions for admin keyring
    $ sudo chmod +r /etc/ceph/ceph.client.admin.keyring
  3. Check the Ceph status and health
    $ ceph health
    $ ceph status

    If, at this point, the reported health of your cluster is HEALTH_OK, then most of the work is done. Otherwise, try to check the troubleshooting part of this tutorial.
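
Right after the OSDs are activated the cluster may need some time to settle, so a single ceph health call can be misleading. As a sketch, the function below (wait_for_health_ok is a made-up name, and the defaults of 10 attempts and 5 seconds are arbitrary) polls ceph health until it reports HEALTH_OK or the attempts run out.

```shell
# Hypothetical helper, for illustration only.
# Usage: wait_for_health_ok [ATTEMPTS] [DELAY_SECONDS]
# Polls "ceph health" and returns 0 as soon as the output starts with
# HEALTH_OK; returns 1 if that never happens within ATTEMPTS tries.
wait_for_health_ok() {
    attempts=${1:-10}
    delay=${2:-5}
    status=""
    i=0
    while [ "$i" -lt "$attempts" ]; do
        status=$(ceph health 2>/dev/null)
        case $status in
            HEALTH_OK*) echo "$status"; return 0 ;;
        esac
        i=$((i + 1))
        sleep "$delay"
    done
    # report the last status seen on stderr so the caller can inspect it
    echo "last status: $status" >&2
    return 1
}
```

For example, wait_for_health_ok 30 10 || ceph status waits up to five minutes and prints the detailed status if the cluster is still not healthy.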

Revert installation

There are useful commands to purge the Ceph installation and configuration from every node so that one can start over again from a clean state.

This will remove Ceph configuration and keys

$ ceph-deploy purgedata {ceph-node} [{ceph-node}]
$ ceph-deploy forgetkeys

This will also remove Ceph packages

$ ceph-deploy purge {ceph-node} [{ceph-node}]

Before getting a healthy Ceph cluster I had to purge and reinstall many times, cycling between the “Setup the cluster”, “Prepare OSDs and OSD Daemons” and “Final steps” parts while removing every warning that ceph-deploy was reporting.