Using Hadoop on the Patas cluster
Hadoop is a processing framework that allows for scalable, distributed processing. It includes a distributed filesystem (HDFS) and a distributed processing framework (MapReduce). Unlike Condor, which can schedule any type of job, Hadoop only runs jobs written specifically for the MapReduce framework. However, for jobs that are well suited to it, Hadoop automates some of the work you'd otherwise have to do with your own code in a Condor job.
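For example, submitting a MapReduce job typically looks like the sketch below; the jar, class, and path names are placeholders, not files that already exist on the cluster:

    hadoop jar wordcount.jar WordCount input/ output/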
Local installation details
Hadoop is installed under /opt/hadoop/bin. This directory is on the system path, so you can run Hadoop commands without giving the full path. You will, however, need to add /opt/hadoop to your Java CLASSPATH when building Java code to run on Hadoop.
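As a minimal sketch, building and packaging a job class might look like the following; WordCount.java is a placeholder name, and the exact layout under /opt/hadoop may differ:

    export CLASSPATH=/opt/hadoop:$CLASSPATH   # per the note above
    javac WordCount.java                      # javac picks up CLASSPATH
    jar cf wordcount.jar WordCount*.class     # package for "hadoop jar"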
HDFS directories are laid out somewhat differently than on our local filesystems. Instead of /home2, Hadoop user directories are under /user; e.g., if your NetID is "jdoe", you have a Hadoop user directory under /user/jdoe.
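Some common HDFS operations, using the hypothetical NetID "jdoe" and placeholder file names:

    hadoop fs -ls /user/jdoe               # list your HDFS user directory
    hadoop fs -put mydata.txt /user/jdoe/  # copy a local file into HDFS
    hadoop fs -get /user/jdoe/output .     # copy results back to local disk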
To see the current job tracker status, visit the Job Tracker Web GUI.
- Official Hadoop documentation -- somewhat terse, but a good starting point.
- HadoopWordCountExample -- a simple example of how to run a parallel job on our cluster, including copying the data to HDFS and extracting the results.
- The "hadoop" command, if run by itself, will give simple usage instructions. This also applies to submodules; e.g., "hadoop fs" will list all the commands accepted by the HDFS module.
- Seeing the Bars of the Hadoop Cage -- advice on how to write Hadoop jobs without locking yourself into the Hadoop model.
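For example:

    hadoop           # prints top-level usage instructions
    hadoop fs        # lists the commands accepted by the HDFS module
    hadoop fs -help  # prints detailed help for the filesystem commands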