Tag: Hadoop

Lightning Sparks all around: A comprehensive analysis of popular distributed computing frameworks (ABDA’15)

Distributed Computing Frameworks

Big Data processing has been a hot topic for the last ten or so years. To process Big Data, special software frameworks have been developed. Nowadays, these frameworks are usually based on distributed computing, because horizontal scaling is cheaper than vertical scaling. But horizontal scaling introduces a new set of programming problems. A traditional programmer feels safer in a familiar environment that presents itself as a single computer rather than a whole cluster of computers. To deal with this problem, several programming and architectural patterns have been developed, most importantly MapReduce and the use of distributed file systems. Several open-source frameworks implement these patterns.

ICCLab Research Group Activity

Big data is a general term for datasets that may be petabytes (10^15 bytes), exabytes (10^18 bytes) or zettabytes (10^21 bytes) in size and consist of billions to trillions or even quadrillions of records.

Big data can be described as:

  • The large volume of data that a specific company produces,
  • Data that requires too much time and cost to analyze,
  • Data that takes too much time to load into a relational database,
  • Data that exceeds the processing capacity of a specific database system, and so on.

Because data volumes grow so rapidly, working with big data can make it difficult to store, create, manipulate and manage your data. In business analytics, big data is generally a problem because of the storage volume, processing time and cost involved.

Goal of ICCLab research group

Most of the time, big data is associated with cloud computing, because the cloud provides the storage, management and analysis capacity that big data needs. A big dataset requires a framework like MapReduce to distribute the work among different computers.
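To make the idea of distributing work more concrete, here is a minimal, single-machine sketch of the MapReduce pattern, using word counting as the example. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are hypothetical and chosen only for illustration. In a real cluster the framework would run the map and reduce steps on different nodes and perform the grouping step between them.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input; each line could live on a different node.
    sample = ["big data needs big clusters", "data moves to clusters"]
    print(word_count(sample))
```

Because each call to map_phase depends only on its own line, and each call to reduce_phase depends only on its own key, both steps can be spread across many machines without coordination beyond the grouping step.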

Our aim is to solve the challenges of storing, accessing and analyzing big data using cloud computing infrastructure with Hadoop and analytics packages such as SAS and R. Our infrastructure is not meant only for storage; big data analytics is a challenge in its own right that needs attention!

The benefits of having big data: although big data is difficult to work with, it helps to extract more information, which can support further research. Having big data allows research groups to pursue a variety of research areas and enables them to analyze the data from different aspects or dimensions. Furthermore, big data can provide more detailed results for better decision making.

Why Hadoop

Nowadays people have started to put their data into Hadoop because it

  • Is an open-source storage platform,
  • Is inexpensive, and
  • Helps to store more data than before.

Hadoop supports clusters of around 4,000 nodes with 4 TB of hard disk capacity per node, which adds up to a large storage volume, and servers can easily be added to or removed from a Hadoop cluster. Beyond that, it can be used without proprietary licensing fees.
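The figures above can be turned into a rough capacity estimate. The sketch below multiplies the node count by the per-node disk size to get the raw capacity, then divides by a replication factor to estimate usable space. The replication factor of 3 is an assumption (it is the usual HDFS default, but it is not stated in this post), and the function names are hypothetical.

```python
NODES = 4000            # node count quoted in the text
DISK_TB_PER_NODE = 4    # disk capacity per node quoted in the text
REPLICATION = 3         # assumption: common HDFS default, not from the post


def raw_capacity_tb(nodes, tb_per_node):
    """Total disk space across the cluster, in terabytes."""
    return nodes * tb_per_node


def usable_capacity_tb(raw_tb, replication):
    """Approximate space left for unique data when every block is replicated."""
    return raw_tb / replication


if __name__ == "__main__":
    raw = raw_capacity_tb(NODES, DISK_TB_PER_NODE)
    usable = usable_capacity_tb(raw, REPLICATION)
    print(f"raw capacity:    {raw} TB ({raw / 1000} PB)")
    print(f"usable capacity: {usable:.0f} TB at replication {REPLICATION}")
```

Under these assumptions the raw capacity is 16,000 TB (16 PB), of which roughly a third remains for unique data once replication is accounted for. Real clusters also reserve space for the operating system and temporary files, which this estimate ignores.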

It is also possible to integrate high-performance parallel data processing using MapReduce.

Analytics Lab

Since we are a research group, we don’t just want to have big data stored in an organized way; we also need to analyze it.

Three important reasons why we chose SAS:

  • With the new version of SAS DI Studio, it is easy to access files stored in Hadoop without too many extra steps; we can use the INFILE statement of the SAS language to read files from and write files to Hadoop.
  • It is possible to work with Hadoop Hive tables as if they were SAS datasets, so that any job in SAS DI Studio can use Hive tables.
  • With Base SAS, it is also possible to use Hadoop functionality such as MapReduce programming, HDFS command execution and Pig.

Moreover, there is an upcoming plan: instead of pulling the data out of Hadoop for processing in SAS, it is possible to take advantage of the cluster by pushing the work down to the Hadoop cluster, where the data already resides. Interesting!