
| Line: 1 to 1 | ||||||||
|---|---|---|---|---|---|---|---|---|
Hadoop Example using WordCount | ||||||||
| Changed: | ||||||||
| < < | In this example, we'll run the WordCount example that comes with Hadoop on the Brown Corpus. | |||||||
| > > | In this example, we'll run the WordCount example that comes with Hadoop on our local copy of the Brown Corpus. | |||||||
| Changed: | ||||||||
| < < |
| |||||||
| > > | 1. Make a directory in the Hadoop Distributed Filesystem (HDFS) to hold the project:
$ hadoop fs -mkdir brown | |||||||
| Changed: | ||||||||
| < < | Keep in mind HDFS is not backed up, so it's best to retrieve any data you want to keep to the local filesystem. | |||||||
| > > | Note that, by default, Hadoop assumes all paths that don't start with a / are relative to /user/username, where "username" is your patas username.
2. Copy the corpus data into HDFS. Note that -put will automatically create the destination directory, so we don't have to make it ahead of time.
$ hadoop fs -put /corpora/ICAME/texts/brown1 brown/input
You can skip this step and run jobs against non-HDFS paths by prefixing the complete path with "file://", for example "file:///corpora/ICAME/texts/brown1". However, this loses the speed advantage of distributing the data among the compute nodes. | |||||||
| Changed: | ||||||||
| < < | When you're done, you can clean up anything you no longer need with the -rm (equivalent to the "rm") or -rmr (equivalent to "rm -r") commands: | |||||||
| > > | 3. Launch the WordCount map-reduce job. Note that the output directory will be created automatically, and in fact it's an error if it already exists.
$ hadoop jar /opt/hadoop/hadoop-examples-*.jar wordcount brown/input brown/output
12/03/23 12:41:38 INFO input.FileInputFormat: Total input paths to process : 15
12/03/23 12:41:39 INFO mapred.JobClient: Running job: job_201203211437_0002
12/03/23 12:41:40 INFO mapred.JobClient: map 0% reduce 0%
12/03/23 12:41:54 INFO mapred.JobClient: map 6% reduce 0%
(...rest of the output snipped for brevity) | |||||||
| Changed: | ||||||||
| < < | hadoop fs -rmr brown/input | |||||||
| > > | 4. We can now find the results in our output directory:
$ hadoop fs -ls brown/output
Found 3 items
-rw-r--r-- 3 brodbd supergroup 0 2012-03-23 12:42 /user/brodbd/brown/output/_SUCCESS
drwxr-xr-x - brodbd supergroup 0 2012-03-23 12:41 /user/brodbd/brown/output/_logs
-rw-r--r-- 3 brodbd supergroup 1123352 2012-03-23 12:42 /user/brodbd/brown/output/part-r-00000
From here, we can view the output file directly with hadoop fs -cat brown/output/part-r-00000, or transfer it back to our local filesystem with something like hadoop fs -get brown/output/part-r-00000 brown-results.txt.
Keep in mind HDFS is not backed up, so it's best to retrieve any data you want to keep to the local filesystem. | |||||||
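As a rough local cross-check of the retrieved results, the same kind of counting can be sketched with standard shell tools. This is only an approximation under stated assumptions: it assumes the WordCount example splits its input on whitespace and writes one "word, tab, count" line per distinct token, and the function name local_wordcount is a hypothetical helper, not something shipped with Hadoop.

```shell
# Hypothetical helper (not part of Hadoop): reads text on stdin and prints
# "token<TAB>count" lines, one per distinct whitespace-separated token.
# Pipeline stages, in order:
#   tr      puts each token on its own line
#   grep    drops the empty line produced by leading whitespace
#   sort    groups identical tokens (byte order, via LC_ALL=C)
#   uniq -c prefixes each distinct token with its count
#   awk     swaps the columns into "token<TAB>count"
local_wordcount() {
    tr -s '[:space:]' '\n' |
    grep -v '^$' |
    LC_ALL=C sort |
    uniq -c |
    awk '{ printf "%s\t%s\n", $2, $1 }'
}
```

Running a single small file from the corpus directory through local_wordcount and comparing a few lines against brown-results.txt should give a quick sanity check. Counts over the full corpus will only match if every input file is fed in, and the ordering may differ from Hadoop's.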
| Changed: | ||||||||
| < < | -- Main.brodbd - 2012-03-21 | |||||||
| > > | Cleanup can be done with the -rm or -rmr commands, which are equivalent to the shell commands "rm" and "rm -r", respectively. For example, to remove our entire project, we could run:
$ hadoop fs -rmr brown | |||||||
| Line: 1 to 1 | ||||||||
|---|---|---|---|---|---|---|---|---|
| Added: | ||||||||
| > > | Hadoop Example using WordCount
In this example, we'll run the WordCount example that comes with Hadoop on the Brown Corpus.
When you're done, you can clean up anything you no longer need with the -rm (equivalent to "rm") or -rmr (equivalent to "rm -r") commands:
hadoop fs -rmr brown/input
-- Main.brodbd - 2012-03-21 | |||||||