In a previous blog post, we showed how “bringing the code to the data” can highly improve computation performance through the active storage (also known as computational storage) concept. In our journey in investigating how to best make computation and storage ecosystems interact, in this blog post we analyze a somehow opposite approach of “bringing the data close to the code“. What the two approaches have in common is the possibility to exploit data locality moving away in both cases from the complete disaggregation of computation and storage.
The approach in focus for this blog post, is at the basis of the Alluxio project, which in short is a memory speed distributed storage system. Alluxio enables data analytics workloads to access various storage systems and accelerate data-intensive applications. It manages data in-memory and optionally on secondary storage tiers, such as cheaper SSDs and HDDs, for additional capacity. It achieves high read and write throughput unifying data access to multiple underlying storage systems reducing data duplication among computation workloads. Alluxio lies between computation frameworks or jobs, such as Apache Spark, Apache MapReduce, or Apache Flink, and various kinds of storage systems, such as Amazon S3, OpenStack Swift, GlusterFS, HDFS or Ceph. Data is available locally for repeated accesses to all users of the compute cluster regardless of the compute engine used avoiding redundant copies of data to be present in memory and driving down capacity requirements and thereby costs.
For more details on the components, the architecture and other features please visit the Alluxio homepage. In the rest of the blog post we will present our experience in integrating Alluxio on our Ceph cluster and use a Spark application to demonstrate the obtained performance improvement (the reference analysis and testing we aimed to reproduce can be found here).