In a previous blog post, we showed how “bringing the code to the data” can highly improve computation performance through the active storage (also known as computational storage) concept. In our journey in investigating how to best make computation and storage ecosystems interact, in this blog post we analyze a somehow opposite approach of “bringing the data close to the code“. What the two approaches have in common is the possibility to exploit data locality moving away in both cases from the complete disaggregation of computation and storage.
The approach in focus for this blog post, is at the basis of the Alluxio project, which in short is a memory speed distributed storage system. Alluxio enables data analytics workloads to access various storage systems and accelerate data-intensive applications. It manages data in-memory and optionally on secondary storage tiers, such as cheaper SSDs and HDDs, for additional capacity. It achieves high read and write throughput unifying data access to multiple underlying storage systems reducing data duplication among computation workloads. Alluxio lies between computation frameworks or jobs, such as Apache Spark, Apache MapReduce, or Apache Flink, and various kinds of storage systems, such as Amazon S3, OpenStack Swift, GlusterFS, HDFS or Ceph. Data is available locally for repeated accesses to all users of the compute cluster regardless of the compute engine used avoiding redundant copies of data to be present in memory and driving down capacity requirements and thereby costs.
For more details on the components, the architecture and other features please visit the Alluxio homepage. In the rest of the blog post we will present our experience in integrating Alluxio on our Ceph cluster and use a Spark application to demonstrate the obtained performance improvement (the reference analysis and testing we aimed to reproduce can be found here).
In most of the distributed storage systems, the data nodes are decoupled from compute nodes. Disaggregation of storage from the compute servers is motivated by an improved efficiency of storage utilization and a better and mutually independent scalability of computation and storage.
While the above consideration is indisputable, several situations exist where moving computation close to the data brings important benefits. In particular, whenever the stored data is to be processed for analytics purposes, all the data needs to be moved from the storage to the compute cluster (consuming network bandwidth). After some analytics on the data, in most cases the results need to go back to the storage. Another important observation is that large amounts of resources (CPU and memory) are available in the storage infrastructure which usually remain underutilized. Active storage is a research area that studies the effects of moving computation close to data and analyzes the fields of application where data locality actually introduces benefits. In short, active storage allows to run computation tasks where the data is, leveraging storage nodes’ underutilized resources, reducing data movement between storage and compute clusters.
There are many active storage frameworks in the research community. One example of active storage is is theOpenStack Storlets framework, developed by IBM and integrated within OpenStack Swift deployments. IOStack is European funded project, that builds around this concept for object storage. Another example is ZeroVM, which allows developers to push their application to their data instead of having to pull their data to their application.
In June we could participe to the 28th edition of EuCNC, an international conference sponsored by the IEEE Communications Society, the European Association for Signal Processing, and supported by the European Commission. EuCNC is one of the most prominent communications and networking conferences in Europe, which efficiently brings together cutting-edge research and world-renown industries and businesses.