Andrew Pavlo
Brown University
Data Management Group
Providence, Rhode Island USA

Hadoop @ Brown

This page provides information on executing Hadoop Map/Reduce jobs at the Brown Department of Computer Science. Please refer to the Hadoop Map/Reduce Tutorial for more information about how to execute different kinds of jobs. You can also view the slides from my September 2008 talk about using Hadoop at Brown. Although I am not the administrator for the department's Hadoop cluster, I am more than happy to answer your questions and help you get started.

Update 10-22-2009: The slides from the updated Fall 2009 Hadoop@Brown presentation have been posted.

Update 05-01-2009: The source code for the Map/Reduce tasks and data generators, as well as the cluster configuration information, from our SIGMOD paper on benchmarking Hadoop against parallel databases can be found here.

Getting Started

You will need to do two things to gain access to the department Hadoop cluster and run your jobs.

Cluster Information

Our current Hadoop cluster runs in the Brown iLab on 16 machines. Each data node has 100GB of storage space available in /ltmp. You can view the status of your jobs and browse the filesystem using the following URLs:

Note that you will only be able to access these URLs from inside the department firewall.

Storing Data

Before you can begin executing Map/Reduce jobs, you must first import your data into Hadoop's distributed filesystem. There are three ways to do this:
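
As an illustration of importing data, one widely used approach (not necessarily one of the options this page refers to) is the `hadoop fs -put` command, which copies a local file into HDFS. The sketch below is a minimal example under stated assumptions: the function name `hdfs_import`, the `HADOOP_CMD` variable, and the paths in the usage comment are all hypothetical placeholders.

```shell
#!/bin/sh
# Minimal sketch: copy one local file into HDFS, then list it to confirm.
# HADOOP_CMD is a hypothetical override so the client binary can be swapped;
# it defaults to the `hadoop` command found on the PATH.
HADOOP_CMD="${HADOOP_CMD:-hadoop}"

# hdfs_import LOCAL_FILE HDFS_PATH
#   LOCAL_FILE: file on the local filesystem
#   HDFS_PATH:  destination path inside HDFS
hdfs_import() {
    # Only list the destination if the copy succeeded.
    $HADOOP_CMD fs -put "$1" "$2" && $HADOOP_CMD fs -ls "$2"
}

# Example call (placeholder paths, adjust to your own data):
#   hdfs_import orders.txt /path/to/hdfs/input/orders.txt
```

Setting `HADOOP_CMD="echo hadoop"` before calling the function prints the commands instead of running them, which is a convenient way to check the paths first.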

Submitting Jobs

After you have loaded your data into HDFS, you are ready to submit your Map/Reduce job. You can follow the tutorial in the Hadoop documentation to learn about the different options available to you in your Map/Reduce program. The example below is based on the OrderSum demo from my slides:

   $ hadoop fs -rmr /path/to/hdfs/output
   $ hadoop jar demo.jar /path/to/hdfs/input /path/to/hdfs/output
Note that we have to delete the output directory (if it already exists) before we execute our job; otherwise, Hadoop will throw an error. Once your job is running, you can view the progress of the individual tasks on the JobTracker's status page.
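
The delete-then-submit sequence above can be wrapped in a small script so that the output directory is removed only when it actually exists. This is a sketch, not part of the OrderSum demo: the function name `run_job` and the `HADOOP_CMD` variable are hypothetical, and it assumes a Hadoop release in which `hadoop fs -test -e` reports whether a path exists through its exit status.

```shell
#!/bin/sh
# Minimal sketch: clear a stale output directory, then submit the job.
# HADOOP_CMD is a hypothetical override; it defaults to `hadoop` on the PATH.
HADOOP_CMD="${HADOOP_CMD:-hadoop}"

# run_job JAR INPUT_PATH OUTPUT_PATH
run_job() {
    jar="$1"
    input="$2"
    output="$3"

    # `hadoop fs -test -e` exits with status 0 when the path exists.
    if $HADOOP_CMD fs -test -e "$output"; then
        # -rmr matches the command used above; newer Hadoop releases
        # deprecate it in favor of `-rm -r`.
        $HADOOP_CMD fs -rmr "$output" || return 1
    fi

    # Submit the job; its exit status becomes the function's status.
    $HADOOP_CMD jar "$jar" "$input" "$output"
}

# Example call, using the placeholder paths from the commands above:
#   run_job demo.jar /path/to/hdfs/input /path/to/hdfs/output
```

Stopping when the delete fails avoids submitting a job that Hadoop would reject because the output directory is still present.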

Example Files

The following links are for code that we are currently using in our research: