Distributed Computing Frameworks
Big Data processing has been a very current topic for the last ten or so years. In order to process Big Data, special software frameworks have been developed. Nowadays, these frameworks are usually based on distributed computing because horizontal scaling is cheaper than vertical scaling. But horizontal scaling imposes a new set of problems when it comes to programming. A traditional programmer feels safer in a well-known environment that pretends to be a single computer instead of a whole cluster of computers. In order to deal with this problem, several programming and architectural patterns have been developed, most importantly MapReduce and the use of distributed file systems. There are several OpenSource frameworks that implement these patterns.
There are tools for every kind of software job (sometimes even multiple of those) and the developer has to make a decision which one to choose for the problem at hand. This can be a cumbersome task especially as this regularly involves new software paradigms.
Comparison of well-known frameworks
We conducted an empirical study with certain frameworks, each destined for its field of work. The main objective was to show which frameworks excel in which fields. For this evaluation, we first had to identify the different fields that needed Big Data processing. Second, we had to find the appropriate tools that could deal with these problems. Third, in order to make the evaluation / comparison of these frameworks objective, we had to identify certain parameters in which we ranked them. In a final part, we chose one of these frameworks which looked most versatile and conducted a benchmark.
First things first, we had to identify different fields of Big Data processing. We came to the conclusion that there were 3 major fields, each with its own characteristics. These are batch processing, stream processing and real-time processing, even though the latter two could be merged into the same category. Nowadays, with social media, another type is emerging which is graph processing. As this latter shows characteristics of both batch and real-time processing, we chose not to delve into it as of now. Nevertheless, we included a framework in our analysis that is built for graph processing.
The main difference between the three fields is the reaction time. While in batch processing, this time can be several hours (as it takes as long to complete a job), in real-time processing, the results have to come almost instantaneously. The main advantage of batch processing is its high data throughput. Stream processing basically handles streams of short data entities such as integers or byte arrays (say from a set of sensors) which have to be processed at least as fast as they arrive – whether the result is needed in real-time is not always of importance. Nevertheless, stream and real-time processing usually result in the same frameworks of choice because of their tight coupling.
Frameworks of choice
This led us to identifying the relevant frameworks. We didn’t want to spend money on licensing so we were left with OpenSource frameworks, mainly from the Apache foundation. Following list shows the frameworks we chose for evaluation:
• Apache Hadoop MapReduce for batch processing
• HaLoop for loop-aware batch processing
• Apache Storm for real-time stream processing
• Apache Giraph for graph processing
• Apache Spark as a replacement for the Apache Hadoop suite
As the third part, we had to identify some relevant parameters we could rank the frameworks in. These came down to the following:
• scalability: is the framework easily & highly scalable?
• data throughput: how much data can it process in a certain time?
• fault tolerance: a regularly neglected property – can the system easily recover from a failure?
• real-time capability: can we use the system for real-time jobs?
• supported data size: Big Data usually handles huge files – the frameworks as well?
• iterative task support: is iteration a problem?
• data caching: it can considerably speed up a framework
• environment of execution: a known environment poses less learning overhead for the administrator
• supported programming languages: like the environment, a known programming language will help the developers
Scalability and data throughput are of major importance when it comes to distributed computing. But many administrators don’t realize how important a reliable fault handling is, especially as distributed systems are usually connected over an error-prone network. Real-time capability and processed data size are each specific for their data processing model so they just tell something about the framework’s individual performance within its own field. Yet the following two points have very specific meanings in distributed computing: while iteration in traditional programming means some sort of while/for loop, in distributed computing, it is about performing two consecutive, similar steps efficiently without much overhead – whether with a loop-aware scheduler or with the help of local caching. This leads us to the data caching capabilities of a framework. As distributed systems are always connected over a network, this network can easily become a bottleneck. Local data caching can optimize a system and retain network communication at a minimum. The last two points are more of a stylistic aspect of each framework, but could be of importance for administrators and developers.
With the help of their documentations and research papers, we managed to compile the following table:
|Property / Framework||Hadoop (MapReduce)||HaLoop||Storm||Giraph||Spark|
|Scalability||high||supposedly high||fairly high||supposedly high||high|
|Data Throughput||high (batch processing)||higher than Hadoop||high (in-memory)||high for few jobs||high (in-memory)|
|Fault Tolerance||high||officially not known||high; several methods||claimed to be high||high but expensive|
|Real-time Processing||none||none||yes, explicitly||none||high|
|Supported Data Size||huge||huge||stream of small messages||huge||depending on filesystem|
|Iterative Task Support||none||yes||none||yes||yes|
|Data Caching (In-Memory or on-disk)||none by default||on-disk||partly||in-memory||on-disk, in-memory|
|Environment of Execution||Hadoop||Hadoop||standalone, Hadoop (YARN)||Hadoop||standalone, Hadoop (YARN), Mesos|
|Supported Programming Languages||any language||Java (maybe more)||any language||Java (maybe more)||Scala, Java, Python|
The table clearly shows that Apache Spark is the most versatile framework that we took into account. It is not only highly scalable but also supports real-time processing, iteration, caching – both in-memory and on disk -, a great variety of environments to run in plus its fault tolerance is fairly high. The only drawback is the limited amount of programming languages it supports (Scala, Java and Python), but maybe that’s even better because this way, it is specifically tuned for a high performance in those few languages. For a more in-depth analysis, we would like to refer the reader to the paper “Lightning Sparks all around: A comprehensive analysis of popular distributed computing frameworks (link coming soon)” which was written for the 2nd International Conference on Advances in Big Data Analytics 2015 (ABDA’15).
Benchmarking of Spark
For these former reasons, we chose Spark as the framework to perform our benchmark with. Now we had to find certain use cases that we could measure. In the end, we settled for three benchmarking tests: we wanted to determine the curve of scalability, in especially whether Spark is linearly scalable. Then, we wanted to see how the size of input data is influencing processing speed. With a third experiment, we wanted to find out by how much Spark’s processing speed decreases when it has to cache data on the disk.
To sum up, the results have been very promising. Spark turned out to be highly linearly scalable. As claimed by the documentation, its initial setup time of about 10 seconds for MapReduce jobs doesn’t make it apt for real-time processing, but keep in mind that this wasn’t executed in Spark Streaming which is especially developed for that kind of jobs. The third test showed only a slight decrease of performance when memory was reduced. The results are as well available in the same paper (coming soon).
After all, some more testing will have to be done when it comes to further evaluating Spark’s advantages, but we are certain that the evaluation of former frameworks will help administrators when considering switching to Big Data processing.