Real-time data processing
Many use cases across various domains require real-time data processing for faster decision making: credit card fraud analytics, network fault prediction from sensor data, security threat prediction, and so on. Data stream processing, or stream computing, is an emerging computing paradigm for processing streaming data in real time without first storing it in secondary storage. Twitter’s Storm project is a distributed real-time computation system designed to be scalable, fault-tolerant and programming-language agnostic. Although it is still at an early stage (currently in incubation at The Apache Software Foundation), Storm is used by well-known companies with significant volumes of streaming data, such as The Weather Channel, Spotify, Twitter, and Rocket Fuel.
Storm topologies
The basic primitives Storm provides for stream transformations are spouts and bolts. Spouts produce data in the form of tuples by reading from HTTP streams, databases, files, message queues, etc. For example, a spout may connect to the Twitter API and emit a stream of tweets. A bolt is a component that performs stream transformations: it consumes one or more input streams and creates new streams based on them.
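To make the spout/bolt distinction concrete, here is a minimal bolt sketched against Storm’s Java API (package names follow the pre-Apache backtype.storm namespace; newer releases use org.apache.storm). The SplitWordsBolt name and the "tweet" field it reads are illustrative assumptions, not part of Storm itself:

```java
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Hypothetical bolt: consumes tuples carrying a "tweet" field and
// emits one tuple per word, declaring a single-field output stream.
public class SplitWordsBolt extends BaseBasicBolt {

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // Assumes the upstream spout declares a field named "tweet".
        String text = input.getStringByField("tweet");
        for (String word : text.split("\\s+")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```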
Topologies are graphs of stream transformations where each node is a spout or bolt. Edges in the graph indicate the data flow between spouts and bolts. Topologies are submitted by users and run on Storm clusters until they are explicitly killed.
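As a sketch of how such a graph is expressed in code (again using the backtype.storm packages), the following wires a hypothetical TweetSpout into the word-splitting bolt above and a hypothetical WordCountBolt, then submits the topology to an in-process test cluster; the component names and parallelism hints are illustrative assumptions:

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class TweetTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Nodes of the graph: one spout and two bolts
        // (TweetSpout and WordCountBolt are hypothetical classes, not shown here).
        builder.setSpout("tweets", new TweetSpout(), 2);
        builder.setBolt("split", new SplitWordsBolt(), 4)
               .shuffleGrouping("tweets");                   // edge: tweets -> split
        builder.setBolt("count", new WordCountBolt(), 4)
               .fieldsGrouping("split", new Fields("word")); // edge: split -> count, partitioned by word

        // Run the topology in-process; on a real cluster it would be
        // submitted through StormSubmitter instead.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("tweet-word-count", new Config(), builder.createTopology());
        Thread.sleep(60000); // let the topology run for a minute
        cluster.shutdown();
    }
}
```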
Storm architecture
There are two kinds of nodes on a Storm cluster: the master node and the worker nodes. The master node runs a daemon called Nimbus, responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.
On worker nodes, there are several processes running: a single Supervisor and one or more Workers. The Supervisor starts and stops worker processes as necessary based on what Nimbus has assigned to it. Each worker process executes a subset of a topology; a running topology consists of many worker processes spread across many machines.
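As a hedged illustration of how a topology claims worker processes (continuing the hypothetical TweetTopology sketch above, whose builder is reused here), the topology’s configuration carries a worker-count hint that Nimbus uses when scheduling:

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;

// Request four worker JVMs for this topology. Nimbus spreads them across
// the Supervisor nodes, and each worker executes a slice of the topology.
Config conf = new Config();
conf.setNumWorkers(4);
StormSubmitter.submitTopology("tweet-word-count", conf, builder.createTopology());
```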
Storm clusters
Apart from its own components, Storm relies on ZooKeeper for coordination between the Nimbus and the Supervisors. In small clusters, a single ZooKeeper server is usually sufficient to handle the coordination between Storm’s components.
Setting up a Storm cluster is quite straightforward, as it mainly consists of three steps:
- installing the Storm packages on each node of the cluster
- configuring the Storm nodes to discover the ZooKeeper servers by setting a property in the conf/storm.yaml file (a sample configuration is sketched below)
- launching the Storm processes on the Nimbus node and the Supervisor nodes by running the bin/storm script with nimbus and supervisor as parameters
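A minimal conf/storm.yaml for such a setup might look like the following; the host names are placeholders, and the exact set of required properties depends on the Storm version (nimbus.host, for instance, was later replaced by nimbus.seeds):

```yaml
# conf/storm.yaml -- host names below are placeholders
storm.zookeeper.servers:
  - "zk1.example.com"
nimbus.host: "nimbus.example.com"
storm.local.dir: "/var/storm"
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
```

With that file in place, bin/storm nimbus is run on the master node and bin/storm supervisor on each worker node, ideally under a process-supervision tool so the daemons are restarted if they die.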
By using ZooKeeper, the Storm master node (Nimbus) will automatically take care of discovering and integrating Supervisor nodes into the cluster.
In summary
Quoting from the project site:
Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
Understanding Storm’s programming model and designing topologies that can run on Storm do require some effort, as Storm is a relatively low-level platform that lacks higher-level tools and built-in streaming operators.
Looking at the bigger picture of real-time data processing, Storm plays a key role, but several other components are necessary to build a system that ingests data, performs computations and exposes the results in real time. A typical processing architecture consists of three parts: aggregating streams of data (for instance, by using Kafka), stream processing (with Storm) and querying the results with a query engine.
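As a sketch of the ingestion step, the storm-kafka integration module of that era provides a KafkaSpout that reads messages from a Kafka topic and feeds them into a topology; the ZooKeeper address, topic name, and identifiers below are illustrative assumptions:

```java
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.TopologyBuilder;
import storm.kafka.BrokerHosts;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class KafkaIngestSketch {
    public static void main(String[] args) {
        // Hypothetical wiring: consume the "tweets" Kafka topic and hand each
        // message to the processing bolts of the topology.
        BrokerHosts hosts = new ZkHosts("zk1.example.com:2181");
        SpoutConfig spoutConfig = new SpoutConfig(hosts, "tweets", "/kafka-spout", "tweet-reader");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-tweets", new KafkaSpout(spoutConfig), 2);
        // ...bolts performing the stream processing would be attached here...
    }
}
```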
Analysing data as fast as they are acquired is certainly a critical requirement in many practical applications. By means of real-time data processing frameworks, organizations can gain insights from data in real time, or very close to it. An obvious use case is monitoring across data feeds (logs, system usage, etc.) and detecting anomalies so that decisions can be made in real time.