Hadoop Example using WordCount
In this example, we'll run the WordCount example
that comes with Hadoop on our local copy of the Brown Corpus.
1. Make a directory in the Hadoop Distributed Filesystem (HDFS) to hold the project:
$ hadoop fs -mkdir brown
Note that, by default, Hadoop assumes all paths that don't start with a / are relative to /user/username, where "username" is your patas username.
2. Copy the corpus data into HDFS. Note that "hadoop fs -put" will automatically create the destination directory, so we don't have to make it ahead of time.
$ hadoop fs -put /corpora/ICAME/texts/brown1 brown/input
You can skip this step and run jobs against non-HDFS paths by prefixing the full path with "file://", for example "file:///corpora/ICAME/texts/brown1". However, this loses the speed advantage of distributing the data among the compute nodes.
3. Launch the WordCount map-reduce job. Note that the output directory will be created automatically, and in fact it's an error if it already exists.
$ hadoop jar /opt/hadoop/hadoop-examples-*.jar wordcount brown/input brown/output
12/03/23 12:41:38 INFO input.FileInputFormat: Total input paths to process : 15
12/03/23 12:41:39 INFO mapred.JobClient: Running job: job_201203211437_0002
12/03/23 12:41:40 INFO mapred.JobClient: map 0% reduce 0%
12/03/23 12:41:54 INFO mapred.JobClient: map 6% reduce 0%
(...rest of the output snipped for brevity)
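Conceptually, the WordCount job maps each input line to (word, 1) pairs and then reduces by summing the counts for each distinct word. A minimal Python sketch of that logic (an illustration only, not the actual Java implementation shipped in hadoop-examples):

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every whitespace-separated token.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reducer: sum the counts for each distinct word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Tiny made-up corpus standing in for the Brown Corpus input files.
corpus = ["the quick brown fox", "the lazy dog"]
result = reduce_phase(map_phase(corpus))
```

In the real job, the map phase runs in parallel across the input files on the compute nodes, and Hadoop shuffles all pairs with the same word to the same reducer before the reduce phase runs.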
4. We can now find the results in our output directory:
$ hadoop fs -ls brown/output
Found 3 items
-rw-r--r-- 3 brodbd supergroup 0 2012-03-23 12:42 /user/brodbd/brown/output/_SUCCESS
drwxr-xr-x - brodbd supergroup 0 2012-03-23 12:41 /user/brodbd/brown/output/_logs
-rw-r--r-- 3 brodbd supergroup 1123352 2012-03-23 12:42 /user/brodbd/brown/output/part-r-00000
From here, we can view the output file directly with
$ hadoop fs -cat brown/output/part-r-00000
or transfer it back to our local filesystem with something like
$ hadoop fs -get brown/output/part-r-00000 brown-results.txt
Keep in mind HDFS is not backed up, so it's best to retrieve any data you want to keep to the local filesystem.
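Each line of the output file holds a word and its count separated by a tab, so the retrieved copy is easy to post-process with ordinary tools. A small Python sketch (the sample lines below are made up for illustration; a real file would come from the -get command above):

```python
# WordCount output format: one "word<TAB>count" entry per line.
# Made-up sample standing in for a retrieved part-r-00000 file.
sample = "and\t1\nof\t2\nthe\t3\n"

counts = {}
for line in sample.splitlines():
    word, count = line.split("\t")
    counts[word] = int(count)

# For example, find the most frequent word.
top = max(counts, key=counts.get)
```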
Cleanup can be done with the "hadoop fs -rm" and "hadoop fs -rmr" commands, which are equivalent to the shell commands "rm" and "rm -r". For example, to remove our entire project, we could do:
$ hadoop fs -rmr brown