Tag: Spark

Testing Alluxio for Memory Speed Computation on Ceph Objects

In a previous blog post, we showed how “bringing the code to the data” can greatly improve computation performance through the active storage (also known as computational storage) concept. Continuing our investigation of how computation and storage ecosystems can best interact, in this blog post we analyze the somewhat opposite approach of “bringing the data close to the code”. What the two approaches have in common is that both exploit data locality, moving away from the complete disaggregation of computation and storage.

The approach in focus for this blog post is the basis of the Alluxio project, which in short is a memory-speed distributed storage system. Alluxio enables data analytics workloads to access various storage systems and accelerates data-intensive applications. It manages data in memory and optionally on secondary storage tiers, such as cheaper SSDs and HDDs, for additional capacity. It achieves high read and write throughput by unifying data access to multiple underlying storage systems and reducing data duplication among computation workloads. Alluxio lies between computation frameworks or jobs, such as Apache Spark, Apache MapReduce, or Apache Flink, and various kinds of storage systems, such as Amazon S3, OpenStack Swift, GlusterFS, HDFS or Ceph. Data is available locally for repeated accesses by all users of the compute cluster, regardless of the compute engine used; this avoids keeping redundant copies of data in memory and drives down capacity requirements and thereby costs.
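To make the layering concrete, here is a minimal PySpark sketch of the kind of access pattern Alluxio accelerates: the same dataset is read once directly from the under store and once through Alluxio's namespace, where repeated reads can be served from the memory tier. The master hostname, port, bucket and file names are illustrative placeholders, and the Spark driver is assumed to have the Alluxio client (and, for the direct read, the S3A connector) on its classpath.

```python
import time
from pyspark.sql import SparkSession

# Illustrative sketch: hostnames, ports and paths are placeholders, not our actual set-up.
spark = SparkSession.builder.appName("alluxio-read-sketch").getOrCreate()

def timed_line_count(path):
    """Read a text dataset and count its lines, returning (count, elapsed seconds)."""
    start = time.time()
    count = spark.read.text(path).count()
    return count, time.time() - start

# Direct read from the under store, e.g. a Ceph RGW bucket exposed through the S3 API.
print("direct:", timed_line_count("s3a://test-bucket/dataset.txt"))

# Read through Alluxio (19998 is the default master RPC port); a second read of the
# same path can be served from Alluxio's in-memory tier instead of the under store.
print("via Alluxio, cold:", timed_line_count("alluxio://alluxio-master:19998/dataset.txt"))
print("via Alluxio, warm:", timed_line_count("alluxio://alluxio-master:19998/dataset.txt"))
```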

For more details on the components, the architecture and other features, please visit the Alluxio homepage. In the rest of the blog post we present our experience integrating Alluxio with our Ceph cluster and use a Spark application to demonstrate the resulting performance improvement (the reference analysis and testing we aimed to reproduce can be found here).

The framework used for testing

Fig. 1: Alluxio testing set-up.
Continue reading

Lightning Sparks all around: A comprehensive analysis of popular distributed computing frameworks (ABDA’15)

Distributed Computing Frameworks

Big Data processing has been a hot topic for the last ten or so years. In order to process Big Data, special software frameworks have been developed. Nowadays, these frameworks are usually based on distributed computing, because horizontal scaling is cheaper than vertical scaling. But horizontal scaling imposes a new set of problems when it comes to programming. A traditional programmer feels safer in a well-known environment that pretends to be a single computer rather than a whole cluster of computers. To deal with this problem, several programming and architectural patterns have been developed, most importantly MapReduce and the use of distributed file systems. There are several open-source frameworks that implement these patterns.
Continue reading
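As a minimal illustration of the MapReduce pattern mentioned in the excerpt above, the PySpark sketch below counts word occurrences in a text file: the map steps emit (word, 1) pairs and the reduce step sums them per word. The input path is an illustrative placeholder.

```python
from pyspark import SparkContext

# Minimal MapReduce-style word count; the input path is an illustrative placeholder.
sc = SparkContext(appName="wordcount-sketch")

counts = (
    sc.textFile("hdfs:///data/input.txt")      # read lines from a distributed file system
      .flatMap(lambda line: line.split())      # map: split each line into words
      .map(lambda word: (word, 1))             # map: emit a (word, 1) pair per occurrence
      .reduceByKey(lambda a, b: a + b)         # reduce: sum the counts per word
)

print(counts.take(10))
sc.stop()
```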