Hadoop @ Brown
The following page provides information on executing Hadoop Map/Reduce jobs at the Brown Department of Computer Science. Please refer to the Hadoop Map/Reduce Tutorial for more information about how execute different kinds of jobs. You can also view the slides from my talk in September 2008 about using Hadoop at Brown. Although I am not the adminstrator for the department's Hadoop cluster, I am more than happy to answer your questions and help get you started.
Update 10-22-2009: The slides from the updated presentation from Fall 2009 for Hadoop@Brown have been posted.
Update 05-01-2009: The source code for the Map Reduce tasks and data generators, as well as the cluster configuration information from our SIGMOD paper on benchmarking Hadoop against parallel databases can be found here.
Getting Started
There are two things that you will need to do in order to gain access to the department Hadoop cluster and be able to run your jobs.
- Get added to the hadoop group
You first need to contact tstaff and request to be added to the hadoop group. Without this, you will be unable to write any data into HDFS nor be able to execute jobs. Make sure that they also create you an HDFS directory for your account /users. Once this is done, you will need to log out of your account first before the changes will take effect to your session. - Add Hadoop environment variables
Next, add the following variables to your department shell source file ($HOME/.environment):## ------------------------------------------- ## HADOOP ENVIRONMENT ## ------------------------------------------- setenvifnot JAVA_HOME "/usr" setenvvar HADOOP_HOME "/pro/hadoop/home" setenvvar HADOOP_CONF_DIR "$HADOOP_HOME/conf" pathprepend PATH "$HADOOP_HOME/bin"
Cluster Information
Our current Hadoop cluster runs in the Brown iLab on 16 machines. Each data node has 100GB of storage space available in /ltmp. You can view the status of your jobs and browse the filesystem using the following urls:
- HDFS Master: http://babbage:50070/
- JobTracker: http://bell:50030/
Storing Data
Before you can begin executing Map/Reduce jobs, you must first import your data into Hadoop's distributed filesystem. There are three ways to do this:
- Copy files directly using Hadoop's command-line tool
This will create files that can be read in using KeyValueTextInputFormat or TextInputFormat file readers.$ hadoop fs -mkdir /data $ hadoop fs -put /path/to/local/file /path/to/hdfs/file
- Write a Custom Loader for SequenceFileInputFormat data
Alternatively, you can write a data parser to read in a data file and generate SequenceFileInputFormat key/vale pairs in HDFS (see edu.brown.cs.mapreduce.DataLoader). SequenceFileInputFormat is the internal, non-human-readable data format that Hadoop uses for the intermediate output of map jobs. The data is stored as serialized objects in HDFS blocks and deserialized at runtime by the execution node. The advantage of this approach versus using regular text files is that SequenceFileInputFormat files allows you to predefine the key/value data types. It is currently also the only data input type that can be used record or block level compression in Hadoop. In my experience, however, I found that sometimes the gains from using compression is offset by the overhead of Java's deserialization procedures.You will need to use a SequenceFile.Writer to open up a new SequenceFileInputFormat file and start appending new key/value pairs to it. Newer versions of Hadoop should allow you to open existing files and add new values to it (assuming the data types match).
SequenceFile.Writer writer = SequenceFile.createWriter(fs, // FileSystem conf, // Configuration new Path(output_file), // Path to new file in HDFS IntWritable.class, // Key Data Type Text.class, // Value Data Type SequenceFile.CompressionType.BLOCK)); writer.append(new IntWritable(new_key), new Text(new_value)); - Write a Data Generator
You can also write a custom Map/Reduce job that generates data in HDFS from directly inside of the map task. See my modified version of Yahoo!'s TeraGen data generator used for the TeraByte benchmark.
Submitting Jobs
After you have loaded your data into HDFS, you are now ready to submit your Map/Reduce job. You can follow the tutorial from the Hadoop documentation about the different options available to you inside of your Map/Reduce program. The example below is based on the OrderSum demo from my slides:
$ hadoop fs -rmr /path/to/hdfs/output $ hadoop jar demo.jar /path/to/hfds/input /path/to/hfds/outputNote that we have to delete the output directory (if it already exists) before we execute our job, otherwise Hadoop will throw an error. Once your job is running, you can view the progress of the individual tasks on the JobTracker's status page.
Example Files
The following links are for code that we currently using in our research: