Skip to content

Application to demonstrate Storm distributed framework by counting the words [from random sentences] in real-time. This project does not need internet access.

License

Notifications You must be signed in to change notification settings

dlbydz/StormWordCount

 
 

Repository files navigation

StormWordCount


Introduction

Skeleton Storm project for The Fifth Elephant, 2013 workshop on Big Data, Real-time Processing and Storm on 11th July, 2013 at NIMHANS Convention Centre, Bangalore.
This will be a live-coding session where you will learn how to use Storm. This workshop is for software developers with some background in Java programming who are interested in distributed data processing.

This repository contains an application for demonstrating Storm distributed framework by counting the words present in random sentences fed by the code [in the Spout] in real-time.
This project does not need internet access while executing the topology i.e. once configured and Maven downloads all the required dependencies. Please check my other repo, StormTweetsWordCount for counting words in tweets which also needs internet access for getting data from Twitter.

Storm is a free and open source distributed real-time computation system, developed at BackType by Nathan Marz and team. It has been open sourced by Twitter [post BackType acquisition] in August, 2011.
This application has been developed and tested with Storm v0.8.2 on Windows 7 in local mode. Application may or may not work with earlier or later versions than Storm v0.8.2.

This application has been tested in:

  • Local mode on a CentOS machine and even on Microsoft Windows 7 machine.
  • Cluster mode on a private cluster and also on Amazon EC2 environment of 4 machines and 5 machines respectively; with all the machines in private cluster running Ubuntu while EC2 environment machines were powered by CentOS.

This application uses and complements Nathan Marz's storm starter Project.

Features

  • Application receives random sentences from Spout.
  • It splits each sentence with space as the delimiter and counts frequency of the words present in sentences.
  • Every 5 seconds, during processing, the application logs the word and its count to the console and also to a log file.
  • In local mode, topology runs for 2 minutes and then shuts down. Topology time duration can be updated by modifying this value.
  • Also this project has been made compatible with both Eclipse IDE and IntelliJ IDEA. Import the project in your favorite IDE [which has Maven plugin installed] and you can quickly follow the code.
  • As of today, this codebase has almost no or very less comments.

Dependencies

  • Storm v0.8.2
  • Google Guava v14.0.1
  • SLF4J v1.7.5
  • Logback v1.0.13

Also, please check pom.xml for more information on the various dependencies of the project.

Requirements

This project uses Maven to build and run the topology.
You need the following on your machine:

  • Oracle JDK >= 1.7.x
  • Apache Maven >= 3.0.5
  • Clone this repo and import as an existing Maven project to either Eclipse IDE or IntelliJ IDEA.
  • This application uses Google Guava for making life simple while using Collections.
  • Requires ZooKeeper, JZMQ, ZeroMQ installed and configured in case of executing this project in distributed mode i.e. Storm Cluster.
    • Follow the steps mentioned here for more details on setting up a Storm Cluster.

Rest of the required frameworks and libraries are downloaded by Maven as required in the build process, the first time the Maven build is invoked.

Usage

To build and run this topology, you must use Java 1.7.

Local Mode:

Local mode can also be run on Windows environment without installing any specific software or framework as such. Note: Please be sure to clean your temp folder as it adds lot of temporary files in every run.
In local mode, this application can be run from command line by invoking:

mvn clean compile exec:java -Dexec.classpathScope=compile -Dexec.mainClass=org.p7h.storm.offline.wordcount.topology.WordCountTopology

or

mvn clean compile package && java -jar target/storm-wordcount-1.0-SNAPSHOT.jar

Distributed [or Cluster / Production] Mode:

Distributed mode requires a complete and proper Storm Cluster setup. Please refer this wiki for setting up a Storm Cluster.
In distributed mode, after starting Nimbus and Supervisors on individual machines, this application can be executed on the master [or Nimbus] machine by invoking the following on the command line:

storm jar target/storm-wordcount-1.0-SNAPSHOT-jar-with-dependencies.jar org.p7h.storm.offline.wordcount.topology.WordCountTopology WordCount

Problems

If you find any issues, please report them either raising an issue here on Github or alert me on my Twitter handle @P7h. Or even better, please send a pull request. Appreciate your help. Thanks!

License

Copyright © 2013 Prashanth Babu.
Licensed under the Apache License, Version 2.0.

About

Application to demonstrate Storm distributed framework by counting the words [from random sentences] in real-time. This project does not need internet access.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published