What is active storage about?
In most distributed storage systems, data nodes are decoupled from compute nodes. Disaggregating storage from the compute servers is motivated by more efficient storage utilization and by the ability to scale computation and storage independently.
While this consideration is valid, there are several situations where moving computation close to the data brings important benefits. In particular, whenever the stored data is to be processed for analytics purposes, all the data needs to be moved from the storage to the compute cluster (consuming network bandwidth), and in most cases the analytics results need to go back to the storage afterwards. Another important observation is that large amounts of resources (CPU and memory) are available in the storage infrastructure and usually remain underutilized. Active storage is a research area that studies the effects of moving computation close to data and analyzes the fields of application where data locality actually brings benefits. In short, active storage makes it possible to run computation tasks where the data is, leveraging the storage nodes' underutilized resources and reducing data movement between storage and compute clusters.
There are many active storage frameworks in the research community. One example of active storage is the OpenStack Storlets framework, developed by IBM and integrated within OpenStack Swift deployments. IOStack is a European-funded project that builds around this concept for object storage. Another example is ZeroVM, which allows developers to push their application to their data instead of having to pull their data to their application.
So, what about Ceph?
Ceph is a widespread unified, distributed storage system that offers high performance, reliability, and scalability, and it plays a very important role in the open-source storage world. Nonetheless, we had to dig a bit deeper to find out that Ceph, too, has a feature that can be used to implement active storage: object classes, probably not a widely known and adopted Ceph feature. Specifically, object classes allow extending Ceph by loading custom code directly into the OSDs, where it can then be executed by a librados application. The created object classes can define methods that are able to call the native methods of the Ceph object store, or other class methods incorporated via libraries (or created yourself). As a further effect, this makes it possible to exploit the distributed scale of Ceph for computational tasks as well, since parallel computing over the OSDs can be achieved. The resulting available compute power is much higher than what a single client could provide!
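To give a first flavour of what such an extension looks like, below is a minimal, hypothetical skeleton of an object class (the class and method names are purely illustrative; a complete, working example is developed later in this post). It registers a class called hello with a single read-only method that simply returns a string from within the OSD:

#include "objclass/objclass.h"

CLS_VER(1,0)
CLS_NAME(hello)

cls_handle_t h_class;
cls_method_handle_t h_say_hello;

// Runs inside the OSD, next to the data; here it just fills the output buffer
static int say_hello(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
{
  out->append("hello from the OSD");
  return 0;
}

void __cls_init()
{
  cls_register("hello", &h_class);
  cls_register_cxx_method(h_class, "say_hello", CLS_METHOD_RD,
                          say_hello, &h_say_hello);
}

A librados client would then invoke this method with io_ctx.exec(object_name, "hello", "say_hello", in, out), as we will see for the md5 class below.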
Although the official Ceph documentation is not really exhaustive in describing the use of Ceph object classes, a very useful set of examples can be found in the book Mastering Ceph by Nick Fisk. Chapters 5 and 6 are a very good guideline for understanding the basics of building applications that directly interact with a Ceph cluster through librados and of building your own object classes. Based on the examples presented in the book, in the remainder of this post we report on our experiments with Ceph object classes. Several non-obvious configuration steps are needed to finally deploy object classes in a Ceph cluster.
Deploying an Object Class on a Ceph cluster
At this stage we assume our test Ceph cluster is up and running with one monitor and three OSDs. For more details on the Ceph cluster deployment steps, please refer to the dedicated sections at the bottom of this blog post.
The example we report on here models the case where we want to calculate an MD5 hash of every object in a RADOS pool and store the resulting hash as an attribute of each object. A first solution is that the client fetches the object, performs the computation remotely and then pushes the attribute back to the storage cluster. The second option is to create an object class that reads the object directly on the OSD, calculates the MD5 hash and stores it as an attribute of the object. With this second option, the client only has to send a command to the OSD to execute the object class.
We will show the implementation of both options described above. The goal is not only to show that the final result is the same; we will also compare the performance in terms of the time needed to reach that result. For a better comparison, the code repeats the MD5 hash calculation 1000 times. Comparing the two solutions shows the benefit of using object classes that exploit data locality for computation. Actually, we cannot wait to reveal the result! Using the Ceph object class on the OSDs, i.e. adopting the active storage concept, the computation takes only 0.126s, instead of 7.735s when the computation is performed remotely on the client. This is a 98.4% time saving, which is a very important result! In the next sections we report on the steps performed to obtain these results on our own test Ceph cluster.
Cloning the Ceph git repository
To be able to create object classes in a running Ceph cluster, we first need to clone the Ceph git repository. Note that this should be done on a monitor node of the Ceph cluster. It is very important to clone the git branch corresponding to the Ceph version deployed in the cluster! In our Ceph cluster, the monitor node is mon1 and the installed Ceph version is mimic:
ceph-admin@mon1:~$ git clone --branch mimic https://github.com/ceph/ceph.git
The next step towards building Ceph object classes is to install some required additional packages. To do this, the install-deps.sh script in the Ceph source tree should be run and the build-essential package installed:
ceph-admin@mon1:~/ceph$ ./install-deps.sh
ceph-admin@mon1:~/ceph$ sudo apt-get install build-essential
Writing an object class
We can now write the object class that reads an object, calculates the MD5 hash and writes it as an attribute of the object itself. Note that this class will perform these operations without any client involvement and will iterate 1000 times (for a better performance comparison). To this aim we create a C++ source file called cls_md5.cc in a new directory md5 under the ceph/src/cls folder of the cloned source tree. The source code is reported below:
#include <openssl/md5.h>
#include "objclass/objclass.h"

CLS_VER(1,0)
CLS_NAME(md5)

cls_handle_t h_class;
cls_method_handle_t h_calc_md5;

static int calc_md5(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
{
  char md5string[33];
  for(int i = 0; i < 1000; ++i) {
    // Stat the object to learn its size
    size_t size;
    int ret = cls_cxx_stat(hctx, &size, NULL);
    if (ret < 0)
      return ret;

    // Read the whole object directly on the OSD
    bufferlist data;
    ret = cls_cxx_read(hctx, 0, size, &data);
    if (ret < 0)
      return ret;

    // Compute the MD5 hash and format it as a hex string
    unsigned char md5out[16];
    MD5((unsigned char*)data.c_str(), data.length(), md5out);
    for(int i = 0; i < 16; ++i)
      sprintf(&md5string[i*2], "%02x", (unsigned int)md5out[i]);
    CLS_LOG(0, "Loop:%d - %s", i, md5string);

    // Store the hash as the MD5 extended attribute of the object
    bufferlist attrbl;
    attrbl.append(md5string);
    ret = cls_cxx_setxattr(hctx, "MD5", &attrbl);
    if (ret < 0) {
      CLS_LOG(0, "Error setting attribute");
      return ret;
    }
  }
  // Return the last computed hash to the caller
  out->append((const char*)md5string, sizeof(md5string));
  return 0;
}

void __cls_init()
{
  CLS_LOG(0, "loading cls_md5");
  cls_register("md5", &h_class);
  cls_register_cxx_method(h_class, "calc_md5", CLS_METHOD_RD | CLS_METHOD_WR,
                          calc_md5, &h_calc_md5);
}
We can now proceed with building the new object class. It is not necessary to build the whole Ceph git repository; we can limit ourselves to building the cls_md5 class. Before doing this, we need to add a section for the new class to the CMakeLists.txt file (under the ceph/src/cls folder). For the cls_md5 class the section to add is the following:
# cls_md5
set(cls_md5_srcs md5/cls_md5.cc)
add_library(cls_md5 SHARED ${cls_md5_srcs})
set_target_properties(cls_md5 PROPERTIES
  VERSION "1.0.0"
  SOVERSION "1"
  INSTALL_RPATH "")
install(TARGETS cls_md5 DESTINATION ${cls_dir})
target_link_libraries(cls_md5 crypto)
list(APPEND cls_embedded_srcs ${cls_md5_srcs})
Once the file is updated we can use cmake to create the build environment. In our setup, cmake was not installed by default, so we had to install it first. Running the do_cmake.sh script creates a build directory in the source tree. Inside this directory we can use make to build our new object class cls_md5:
ceph-admin@mon1:~/ceph$ sudo apt install cmake
ceph-admin@mon1:~/ceph$ ./do_cmake.sh
ceph-admin@mon1:~/ceph/build$ make cls_md5
Once the class compiles correctly, we have to copy it to each of the OSD nodes in the Ceph cluster, under the /usr/lib/rados-classes directory, and then restart the OSDs so that the new class gets loaded. By default, the OSDs are not allowed to load any new class, so we need to whitelist the new object classes on the OSDs. To do this, the ceph.conf configuration file (under the /etc/ceph directory) on each OSD node should be updated to include the following lines:
[osd]
osd class load list = *
osd class default list = *
Now we are ready to copy the compiled classes into the OSDs (in our case nodes osd1, osd2 and osd3) and restart the daemons:
ceph-admin@osd1:/usr/lib/rados-classes$ sudo scp ceph-admin@mon1:/home/ceph-admin/ceph/build/lib/libcls_md5.so* .
ceph-admin@osd1:/usr/lib/rados-classes$ sudo systemctl stop ceph-osd.target
ceph-admin@osd1:/usr/lib/rados-classes$ sudo systemctl start ceph-osd.target
To make sure the new classes are loaded correctly, we can have a look at the log file to see whether any error occurred. For instance, on our osd3 node we will see:
ceph-admin@osd3:~$ sudo cat /var/log/ceph/ceph-osd.3.log | grep cls
2019-12-17 14:16:39.394 7fc6752b5c00  0 /home/ceph-admin/ceph/src/cls/md5/cls_md5.cc:43: loading cls_md5
Writing the librados client applications
We are now ready to write our two librados client applications that either calculate the MD5 hash remotely on the client or call the newly created object class. As expected, the result from the two solutions will be the same, but the computation time is different. Note that both librados applications need to be run on the monitor node in the Ceph cluster.
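One practical note: both applications assume that the object they act on (LowerObject in the rbd pool, introduced just below) already exists. If it does not, it can be created beforehand, for instance with the rados CLI (rados -p rbd put LowerObject <some_file>) or with a few lines of librados code such as the following sketch, where the helper name and payload are arbitrary and only serve as test data:

#include <rados/librados.hpp>
#include <string>

// Hedged sketch: create/overwrite the test object used by both applications.
// Assumes the rbd pool exists and /etc/ceph/ceph.conf plus the admin keyring are readable.
int create_test_object()
{
  librados::Rados rados;
  librados::IoCtx io_ctx;

  if (rados.init("admin") < 0) return -1;
  if (rados.conf_read_file("/etc/ceph/ceph.conf") < 0) return -1;
  if (rados.connect() < 0) return -1;
  if (rados.ioctx_create("rbd", io_ctx) < 0) return -1;

  librados::bufferlist bl;
  bl.append("some example payload for the MD5 test"); // arbitrary content
  int ret = io_ctx.write_full("LowerObject", bl);      // create or overwrite the object

  rados.shutdown();
  return ret;
}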
Specifically, both client applications act on a pool called rbd (which should be created first using the command: ceph osd pool create rbd 128) and on an object called LowerObject, to which an attribute called MD5 will be added to hold the MD5 hash. The client application that computes the MD5 hash remotely on the client is saved in a file called rados_md5.cc:
#include <cctype>
#include <rados/librados.hpp>
#include <iostream>
#include <string>
#include <openssl/md5.h>

void exit_func(int ret);

librados::Rados rados;

int main(int argc, const char **argv)
{
  int ret = 0;

  // Define variables
  const char *pool_name = "rbd";
  std::string object_name("LowerObject");
  librados::IoCtx io_ctx;

  // Create the Rados object and initialize it
  {
    ret = rados.init("admin"); // Use the default client.admin keyring
    if (ret < 0) {
      std::cerr << "Failed to initialize rados! error " << ret << std::endl;
      ret = EXIT_FAILURE;
    }
  }

  // Read the ceph config file in its default location
  ret = rados.conf_read_file("/etc/ceph/ceph.conf");
  if (ret < 0) {
    std::cerr << "Failed to parse config file "
              << "! Error" << ret << std::endl;
    ret = EXIT_FAILURE;
  }

  // Connect to the Ceph cluster
  ret = rados.connect();
  if (ret < 0) {
    std::cerr << "Failed to connect to cluster! Error " << ret << std::endl;
    ret = EXIT_FAILURE;
  } else {
    std::cout << "Connected to the Ceph cluster" << std::endl;
  }

  // Create connection to the Rados pool
  ret = rados.ioctx_create(pool_name, io_ctx);
  if (ret < 0) {
    std::cerr << "Failed to connect to pool! Error: " << ret << std::endl;
    ret = EXIT_FAILURE;
  } else {
    std::cout << "Connected to pool: " << pool_name << std::endl;
  }

  for(int i = 0; i < 1000; ++i) {
    size_t size;
    int ret = io_ctx.stat(object_name, &size, NULL);
    if (ret < 0)
      return ret;

    librados::bufferlist data;
    ret = io_ctx.read(object_name, data, size, 0);
    if (ret < 0)
      return ret;

    unsigned char md5out[16];
    MD5((unsigned char*)data.c_str(), data.length(), md5out);

    char md5string[33];
    for(int i = 0; i < 16; ++i)
      sprintf(&md5string[i*2], "%02x", (unsigned int)md5out[i]);

    librados::bufferlist attrbl;
    attrbl.append(md5string);
    ret = io_ctx.setxattr(object_name, "MD5", attrbl);
    if (ret < 0) {
      exit_func(1);
    }
  }
  exit_func(0);
}

void exit_func(int ret)
{
  // Clean up and exit
  rados.shutdown();
  exit(ret);
}
What this application does is create the Rados object, read the configuration file of the Ceph cluster, connect to the Ceph cluster, connect to the rbd Rados pool, read the LowerObject object from the OSD, calculate the MD5 hash of the object on the client, and write it back as an attribute called MD5 of the object.
The second client application, which instead computes the MD5 hash on the OSDs using the newly created object class, is saved in a file called rados_class_md5.cc:
#include <cctype>
#include <rados/librados.hpp>
#include <iostream>
#include <string>

void exit_func(int ret);

librados::Rados rados;

int main(int argc, const char **argv)
{
  int ret = 0;

  // Define variables
  const char *pool_name = "rbd";
  std::string object_name("LowerObject");
  librados::IoCtx io_ctx;

  // Create the Rados object and initialize it
  {
    ret = rados.init("admin"); // Use the default client.admin keyring
    if (ret < 0) {
      std::cerr << "Failed to initialize rados! error " << ret << std::endl;
      ret = EXIT_FAILURE;
    }
  }

  // Read the ceph config file in its default location
  ret = rados.conf_read_file("/etc/ceph/ceph.conf");
  if (ret < 0) {
    std::cerr << "Failed to parse config file "
              << "! Error" << ret << std::endl;
    ret = EXIT_FAILURE;
  }

  // Connect to the Ceph cluster
  ret = rados.connect();
  if (ret < 0) {
    std::cerr << "Failed to connect to cluster! Error " << ret << std::endl;
    ret = EXIT_FAILURE;
  } else {
    std::cout << "Connected to the Ceph cluster" << std::endl;
  }

  // Create connection to the Rados pool
  ret = rados.ioctx_create(pool_name, io_ctx);
  if (ret < 0) {
    std::cerr << "Failed to connect to pool! Error: " << ret << std::endl;
    ret = EXIT_FAILURE;
  } else {
    std::cout << "Connected to pool: " << pool_name << std::endl;
  }

  // Call the calc_md5 method of the md5 object class on the OSD
  librados::bufferlist in, out;
  io_ctx.exec(object_name, "md5", "calc_md5", in, out);

  exit_func(0);
}

void exit_func(int ret)
{
  // Clean up and exit
  rados.shutdown();
  exit(ret);
}
This application also creates the Rados object, initializes it by reading the Ceph cluster configuration file, connects to the Ceph cluster and creates a connection to the rbd Rados pool. It then calls the exec function, which triggers the calc_md5 method of the md5 class, passing the name of the object (LowerObject) and two buffers for input and output. It is then the task of the called object class to calculate the MD5 hash and write it to the MD5 attribute of the object (repeating this 1000 times).
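As an optional extra (not part of the application shown above), the output buffer filled by the object class could also be inspected on the client, since our calc_md5 method appends the final hash string to it. A hedged sketch of such a check could look like this:

librados::bufferlist in, out;
int r = io_ctx.exec(object_name, "md5", "calc_md5", in, out);
if (r < 0) {
  std::cerr << "exec of cls_md5 failed: " << r << std::endl;
} else {
  // out contains the 32-character hash (plus a trailing NUL appended by the class)
  std::cout << "MD5 returned by the OSD: "
            << std::string(out.c_str(), out.length()) << std::endl;
}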
The two client applications can be compiled using the g++ compiler:
ceph-admin@mon1:~/test_app$ g++ rados_md5.cc -o rados_md5 -lrados -std=c++11
ceph-admin@mon1:~/test_app$ g++ rados_class_md5.cc -o rados_class_md5 -lrados -std=c++11
If the applications compile successfully (i.e. no output given), we are ready to test the applications and compare their performance.
Comparing the performance of the client applications
To compare the performance of the two client applications for the MD5 hashing computation, we can use the Linux time utility to measure the time taken.
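As a side note, if one prefers to measure the elapsed time from inside the applications rather than with the external time utility, a small std::chrono helper could be used. The following is only an illustrative sketch (the helper name and the snippet in the usage comment are our own, not part of the applications above):

#include <chrono>
#include <iostream>

// Illustrative helper: run any callable and print its wall-clock duration
template <typename F>
void timed(const char *label, F &&f)
{
  auto start = std::chrono::steady_clock::now();
  f();
  auto stop = std::chrono::steady_clock::now();
  std::chrono::duration<double> elapsed = stop - start;
  std::cout << label << " took " << elapsed.count() << " s" << std::endl;
}

// Possible usage inside main(), after connecting to the pool:
//   timed("cls_md5 exec", [&] {
//     librados::bufferlist in, out;
//     io_ctx.exec(object_name, "md5", "calc_md5", in, out);
//   });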
The first application we test is the one performing the computation remotely on the client, namely the rados_md5 application. Besides checking whether the MD5 hash has been computed and inserted as an attribute to the given object, we are interested in taking note of the computation time:
ceph-admin@mon1:~/test_app$ time sudo ./rados_md5
Connected to the Ceph cluster
Connected to pool: rbd

real 0m7.735s
user 0m0.274s
sys  0m0.211s

ceph-admin@mon1:~/test_app$ sudo rados -p rbd getxattr LowerObject MD5
9d40bae4ff2032c9eff59806298a95bd
The second application we test is the one performing the computation directly on the OSDs using the object class we loaded on the Ceph nodes, namely the rados_class_md5 application. Note that we first need to delete the attribute from the object to make sure it is now computed by the object class (both applications act on the same pool, object and attribute).
ceph-admin@mon1:~/test_app$ sudo rados -p rbd rmxattr LowerObject MD5
ceph-admin@mon1:~/test_app$ sudo rados -p rbd getxattr LowerObject MD5
error getting xattr rbd/LowerObject/MD5: (61) No data available
Also here, besides checking whether the MD5 hash has been computed and inserted as an attribute to the given object, we are interested in taking note of the computation time:
ceph-admin@mon1:~/test_app$ time sudo ./rados_class_md5
Connected to the Ceph cluster
Connected to pool: rbd

real 0m0.126s
user 0m0.042s
sys  0m0.009s

ceph-admin@mon1:~/test_app$ sudo rados -p rbd getxattr LowerObject MD5
9d40bae4ff2032c9eff59806298a95bd
Comparing the outputs, we notice that the MD5 hash is identical in the two cases. The most interesting result, however, is that when adopting the active storage concept and using the object class on the OSDs, the computation takes only 0.126s, instead of 7.735s when the computation is performed remotely on the client. This is a 98.4% time saving, which is a very important result.
VMs preparation for our test Ceph cluster deployment
Although writing a tutorial on installing a Ceph cluster is not the main scope of this blog post, to give a complete overview of our study we summarize here the steps taken and the characteristics of the machines used. Other tutorials with more detailed steps are available on the Internet.
For the scope of our tests, we deployed a Ceph cluster on our OpenStack framework. We created six VM instances of flavor m1.large (1 vCPU, 2 GB of RAM, 20 GB disk), so that our Ceph cluster has one monitor, one ceph-admin node, one rgw node and three OSD nodes (osd1, osd2 and osd3). The OSD nodes have additional volumes attached: a 10 GiB disk on osd1, osd2 and osd3, plus a 15 GiB disk on osd3.
For the purpose of building a Ceph cluster, it is important to define security groups with rules that open the ports needed by the Ceph nodes. In particular, the ceph-admin node requires ports 22, 80, 2003 and 4505-4506 to be open, the monitor node requires ports 22 and 6789, and the OSD nodes require port 22 and the port range 6800-7300. Note that additional ports might need to be opened depending on the specific configuration of the Ceph cluster.
All our VMs have Ubuntu Bionic (18.04.3 LTS) installed and have a floating IP associated. The resulting association of hostname and IP address for our study case is as follows:
hostname      IP address
ceph-admin    10.20.3.13
mon1          10.20.1.144
osd1          10.20.3.216
osd2          10.20.3.138
osd3          10.20.1.21
rgw           10.20.3.95
On each node we created a user named 'ceph-admin' and configured it for passwordless sudo privileges. Further, on all machines we installed the python and python-pip packages and updated the hosts configuration file with the list of hostnames and their corresponding IP addresses.
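For reference, with the addresses listed above, the entries added to the hosts file on each node would look roughly as follows (assuming the standard /etc/hosts location on Ubuntu):

10.20.3.13   ceph-admin
10.20.1.144  mon1
10.20.3.216  osd1
10.20.3.138  osd2
10.20.1.21   osd3
10.20.3.95   rgw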
We used the ceph-admin node for configuring the Ceph cluster. To this aim, this node needs passwordless SSH access as user 'ceph-admin' to all other nodes. We therefore generated the SSH keys for the 'ceph-admin' user on the ceph-admin node with the ssh-keygen command (leaving the passphrase blank/empty). We then edited the SSH configuration file (~/.ssh/config) as follows:
Host osd1
   Hostname osd1
   User ceph-admin
Host osd2
   Hostname osd2
   User ceph-admin
Host osd3
   Hostname osd3
   User ceph-admin
Host mon1
   Hostname mon1
   User ceph-admin
Host mon2
   Hostname mon2
   User ceph-admin
Host rgw
   Hostname rgw
   User ceph-admin
Further steps to finalize the configuration are:
- run chmod 644 ~/.ssh/config on the ceph-admin node
- run ssh-keyscan osd1 osd2 osd3 mon1 rgw >> ~/.ssh/known_hosts
- Log in to each VM as root over SSH and edit the SSH daemon configuration file: sudo nano /etc/ssh/sshd_config
- Change PasswordAuthentication to yes, and restart the daemon: sudo systemctl restart sshd
- On the ceph-admin node, copy the key to each node (typing the ceph-admin password when requested): ssh-copy-id osd1, ssh-copy-id osd2, ssh-copy-id osd3, ssh-copy-id mon1, ssh-copy-id rgw
The next step is to configure a firewall on the Ubuntu servers to protect the system, leaving only specific ports open: 80, 2003 and 4505-4506 on the ceph-admin node, 22, 80 and 6789 on the mon1 node, and 22 and 6800-7300 on the osd1, osd2 and osd3 nodes. For instance, for the osd1 node the commands to launch are:
ceph-admin@ceph-admin:~$ ssh osd1
ceph-admin@osd1:~$ sudo apt-get install -y ufw
ceph-admin@osd1:~$ sudo ufw allow 22/tcp
ceph-admin@osd1:~$ sudo ufw allow 6800:7300/tcp
ceph-admin@osd1:~$ sudo ufw enable
To configure the additional volumes available on the OSD nodes, we log in to the OSD nodes and format the partitions with an XFS filesystem:
ceph-admin@osd3:~$ sudo parted -s /dev/vdb mklabel gpt mkpart primary xfs 0% 100%
ceph-admin@osd3:~$ sudo mkfs.xfs -f /dev/vdb
Deploying the Ceph cluster
Once all the machines are configured, we are ready to deploy our Ceph cluster using ceph-deploy. On the ceph-admin node we install ceph-deploy with the following commands:
ceph-admin@ceph-admin:~$ wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
ceph-admin@ceph-admin:~$ echo deb https://download.ceph.com/debian-luminous/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list
ceph-admin@ceph-admin:~$ sudo apt update
ceph-admin@ceph-admin:~$ sudo apt install ceph-deploy
In a given directory (e.g., we created a directory ceph-deploy) on the ceph-admin node we will run the command to define the cluster nodes:
ceph-admin@ceph-admin:~$ mkdir ceph-deploy
ceph-admin@ceph-admin:~$ cd ceph-deploy/
ceph-admin@ceph-admin:~/ceph-deploy$ ceph-deploy new mon1
This command generates the Ceph cluster configuration file ‘ceph.conf‘ in the current directory. The ceph.conf file can be edited to add the public network details under the [global] block. The resulting ceph.conf file looks like this:
[global]
fsid = 44d61b90-a1de-459f-97c6-6d9642eb5e0f
mon_initial_members = mon1
mon_host = 10.20.1.144
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public network = 10.20.0.0/16
The next steps to take are: i) installing Ceph on all nodes from the ceph-admin node, ii) deploying the monitor on node mon1, iii) deploying the management key to all associated nodes, and iv) deploying a manager daemon on the monitor node:
ceph-admin@ceph-admin:~/ceph-deploy$ ceph-deploy install ceph-admin mon1 osd1 osd2 osd3 rgw
ceph-admin@ceph-admin:~/ceph-deploy$ ceph-deploy mon create-initial
ceph-admin@ceph-admin:~/ceph-deploy$ ceph-deploy admin ceph-admin mon1 osd1 osd2 osd3 rgw
ceph-admin@ceph-admin:~/ceph-deploy$ ceph-deploy mgr create mon1
To be able to use the ceph CLI on all nodes without having to specify the monitor address and the admin key, we should also change the permissions of the key file on all nodes:
sudo chmod 644 /etc/ceph/ceph.client.admin.keyring
Finally, we can add the OSD daemons on the nodes:
ceph-admin@ceph-admin:~/ceph-deploy$ ceph-deploy osd create --data /dev/vdb osd1
ceph-admin@ceph-admin:~/ceph-deploy$ ceph-deploy osd create --data /dev/vdb osd2
ceph-admin@ceph-admin:~/ceph-deploy$ ceph-deploy osd create --data /dev/vdb osd3
ceph-admin@ceph-admin:~/ceph-deploy$ ceph-deploy osd create --data /dev/vdc osd3
These steps gave us a Ceph cluster we could use for testing object classes for active storage. More detailed descriptions of adding an rgw to the cluster, enabling the Ceph dashboard, storing/retrieving object data, resetting the Ceph cluster, adding/removing nodes, monitoring the cluster status and so on are out of the scope of this blog post. The interested reader can refer to the official Ceph documentation or to good reference tutorials: i) Deploy Ceph and start using it: end to end tutorial – Installation, ii) How To Install Ceph Storage Cluster on Ubuntu 18.04 LTS, iii) How to install a Ceph Storage Cluster on Ubuntu 16.04. Note that some information in the links above might be outdated!