In a previous blog post, we showed how “bringing the code to the data” can greatly improve computation performance through the active storage (also known as computational storage) concept. Continuing our investigation into how computation and storage ecosystems can best interact, in this blog post we analyze a somewhat opposite approach: “bringing the data close to the code”. What the two approaches have in common is the ability to exploit data locality, moving away in both cases from the complete disaggregation of computation and storage.
The approach in focus for this blog post is the basis of the Alluxio project, which, in short, is a memory-speed distributed storage system. Alluxio enables data analytics workloads to access various storage systems and accelerates data-intensive applications. It manages data in memory and optionally on secondary storage tiers, such as cheaper SSDs and HDDs, for additional capacity. It achieves high read and write throughput while unifying data access to multiple underlying storage systems and reducing data duplication among computation workloads. Alluxio sits between computation frameworks or jobs, such as Apache Spark, Apache MapReduce, or Apache Flink, and various kinds of storage systems, such as Amazon S3, OpenStack Swift, GlusterFS, HDFS or Ceph. Data is available locally for repeated accesses to all users of the compute cluster, regardless of the compute engine used. This avoids keeping redundant copies of data in memory, driving down capacity requirements and thereby costs.
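To make the idea of a memory tier sitting between compute jobs and a slower storage system more concrete, here is a minimal sketch of a read-through cache with LRU eviction. It is not Alluxio's implementation or API: the `UnderStore` and `ReadThroughCache` classes, the key names and the capacity of 2 are all hypothetical and exist only for illustration.

```python
from collections import OrderedDict


class UnderStore:
    """Hypothetical stand-in for a slow backing store (e.g. an object store)."""

    def __init__(self, objects):
        self._objects = dict(objects)
        self.reads = 0  # counts how often the slow store is actually hit

    def read(self, key):
        self.reads += 1
        return self._objects[key]


class ReadThroughCache:
    """Hypothetical memory tier with LRU eviction in front of an under store."""

    def __init__(self, under_store, capacity):
        self._under = under_store
        self._capacity = capacity
        self._mem = OrderedDict()

    def read(self, key):
        if key in self._mem:
            self._mem.move_to_end(key)  # mark as most recently used
            return self._mem[key]
        value = self._under.read(key)  # cache miss: fetch from the slow store
        self._mem[key] = value
        if len(self._mem) > self._capacity:
            self._mem.popitem(last=False)  # evict the least recently used entry
        return value


store = UnderStore({"a": b"alpha", "b": b"beta", "c": b"gamma"})
cache = ReadThroughCache(store, capacity=2)
for key in ["a", "a", "b", "a", "c", "b"]:
    cache.read(key)
```

In this sketch, repeated reads of the same key are served from memory, and the under store is contacted only on a miss. A real system would add the secondary tiers, write paths and consistency handling that this example leaves out.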
For more details on the components, the architecture and other features, please visit the Alluxio homepage. In the rest of the blog post we present our experience in integrating Alluxio with our Ceph cluster and use a Spark application to demonstrate the performance improvement we obtained (the reference analysis and testing we aimed to reproduce can be found here).
Storage, together with computing and networking, is one of the fundamental parts of IaaS.
The research initiative on cloud storage at ICCLab, under the Infrastructure theme, focuses on the exploration of the limiting factors of the available storage systems, aiming at identifying new technologies and providing solutions that can be used to improve the efficiency of data management in cloud environments.
The need for advanced distributed architectures and software components that allow the deployment of secure, reliable, highly available and high-performing storage systems is underlined by the fast growth of user-generated data. This trend sets challenging requirements for service and infrastructure providers, who must find efficient solutions for permanent data storage in their data centers.
About Cloud Storage Systems
A cloud storage system is typically obtained by combining software resources (running in a distributed environment) with a set of physical machines (i.e., servers) to expose a logical layer of storage.
Cloud storage provides an abstract view of the multiple physical storage resources that it manages (these can be located across multiple servers, or even across different data centers) and it internally handles different layers of transparency that ensure reliability and performance.
The main concepts that are to be found in cloud storage systems are:
Data replication and reliability. Policies can be defined in such a way that copies of the same data are spread across different failure domains, to ensure availability and disaster recovery.
Data placement. A cloud storage system exposes a logical view of storage and internally handles how data is assigned to the available resources. This allows, for example, striping data to improve access performance through parallel accesses, or balancing the load properly across a set of nodes.
Availability. As a distributed system, cloud storage must not exhibit any single point of failure. This is usually achieved by introducing redundancy in hardware components and by implementing fail-over policies to recover from failures.
Performance. Concurrent accesses to data can improve data rates significantly, as different portions of the same file or object can be served by different disks or nodes.
Geo-replication. A cloud storage system can replicate data in such a way that it is closer to where it is consumed (e.g., across data centers in different regions) to improve access efficiency.
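The replication and data placement concepts above can be illustrated with a short sketch. It assumes a hypothetical cluster described as a mapping from failure domains (racks) to nodes, and it does not reproduce the placement algorithm of any particular storage system. The function names `place_replicas` and `stripe`, the rack, node and disk names, and the object identifier are all made up for the example.

```python
import hashlib


def place_replicas(object_id, nodes_by_domain, replicas):
    """Pick one node in each of `replicas` distinct failure domains."""
    domains = sorted(nodes_by_domain)
    if replicas > len(domains):
        raise ValueError("not enough failure domains for the requested replicas")
    # A stable hash keeps the placement deterministic across runs and clients.
    digest = int(hashlib.sha256(object_id.encode()).hexdigest(), 16)
    start = digest % len(domains)
    chosen = []
    for i in range(replicas):
        domain = domains[(start + i) % len(domains)]
        nodes = nodes_by_domain[domain]
        chosen.append((domain, nodes[digest % len(nodes)]))
    return chosen


def stripe(data, stripe_size, targets):
    """Split data into fixed-size chunks assigned round-robin to targets."""
    layout = {target: [] for target in targets}
    for index, offset in enumerate(range(0, len(data), stripe_size)):
        target = targets[index % len(targets)]
        layout[target].append(data[offset:offset + stripe_size])
    return layout


# Hypothetical cluster: three racks acting as failure domains.
cluster = {
    "rack1": ["node1", "node2"],
    "rack2": ["node3", "node4"],
    "rack3": ["node5"],
}
placement = place_replicas("volume-42/object-7", cluster, replicas=3)
layout = stripe(b"abcdefghij", stripe_size=3, targets=["disk0", "disk1"])
```

Because every replica lands in a different rack, losing one rack leaves two copies reachable. The striped layout lets a client read chunks from `disk0` and `disk1` in parallel.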
Implement research ideas into working prototypes that can attract industrial interest
Obtain funding by participating in financed research projects
Produce and distribute our open source implementations
Keep and increase the reputation of the ICCLab in international contexts
Define a strong field of expertise in Distributed File Systems and software solutions for storage
Explore and implement clustered storage architectures
From an applied research perspective, cloud computing and the growing demand for efficient data storage solutions offer a ground where many areas and directions can be explored and evaluated.
Here at the ICCLab, the following aspects are currently being developed in the cloud storage initiative: