Hadoop Example using WordCount

In this example, we'll run the WordCount example that comes with Hadoop on our local copy of the Brown Corpus.
 
1. Make a directory in the Hadoop Distributed Filesystem (HDFS) to hold the project:

$ hadoop fs -mkdir brown
 
Note that, by default, Hadoop assumes all paths that don't start with a / are relative to /user/username, where "username" is your patas username.
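
As a quick illustration (assuming the same username, "brodbd", that appears in the listings below), these two commands list the same HDFS directory:

$ hadoop fs -ls brown
$ hadoop fs -ls /user/brodbd/brown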

2. Copy the corpus data into HDFS. Note that -put will automatically create the destination directory, so we don't have to make it ahead of time.

$ hadoop fs -put /corpora/ICAME/texts/brown1 brown/input

You can skip this step and run jobs against non-HDFS paths by prefixing the complete path with "file://", for example "file:///corpora/ICAME/texts/brown1". However, this loses the speed advantages of distributing the data among the compute nodes.
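
For instance, to run the job from step 3 directly against the local copy of the corpus instead of the HDFS copy, something like this should work (the output path "brown/output-local" is just a placeholder, not part of the original walkthrough):

$ hadoop jar /opt/hadoop/hadoop-examples-*.jar wordcount file:///corpora/ICAME/texts/brown1 brown/output-local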

 
3. Launch the WordCount map-reduce job. Note that the output directory will be created automatically, and in fact it's an error if it already exists.

$ hadoop jar /opt/hadoop/hadoop-examples-*.jar wordcount brown/input brown/output
12/03/23 12:41:38 INFO input.FileInputFormat: Total input paths to process : 15
12/03/23 12:41:39 INFO mapred.JobClient: Running job: job_201203211437_0002
12/03/23 12:41:40 INFO mapred.JobClient:  map 0% reduce 0%
12/03/23 12:41:54 INFO mapred.JobClient:  map 6% reduce 0%
(...rest of the output snipped for brevity)
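
If you want to check on a long-running job, or stop one, the hadoop job command can be used from another shell; this is just a sketch, using the job ID printed by JobClient above:

$ hadoop job -list
$ hadoop job -kill job_201203211437_0002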
 
4. We can now find the results in our output directory:

$ hadoop fs -ls brown/output
Found 3 items
-rw-r--r--   3 brodbd supergroup          0 2012-03-23 12:42 /user/brodbd/brown/output/_SUCCESS
drwxr-xr-x   - brodbd supergroup          0 2012-03-23 12:41 /user/brodbd/brown/output/_logs
-rw-r--r--   3 brodbd supergroup    1123352 2012-03-23 12:42 /user/brodbd/brown/output/part-r-00000

From here, we can view the output file directly with hadoop fs -cat brown/output/part-r-00000, or transfer it back to our local filesystem with something like hadoop fs -get brown/output/part-r-00000 brown-results.txt.
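
Each line of the output is a token and its count separated by a tab, so ordinary shell tools work on it as well. For example, a rough way to peek at the most frequent tokens without leaving HDFS (a sketch, not part of the original walkthrough):

$ hadoop fs -cat brown/output/part-r-00000 | sort -k2,2nr | head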

Keep in mind HDFS is not backed up, so it's best to retrieve any data you want to keep to the local filesystem.
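
For example, either of the following (with placeholder local names) will pull the results down; -get copies the whole output directory, while -getmerge concatenates the part files into a single local file:

$ hadoop fs -get brown/output brown-output
$ hadoop fs -getmerge brown/output brown-output.txt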

 
Cleanup can be done with the -rm or -rmr commands, which are equivalent to the shell commands "rm" and "rm -r" respectively. For example, to remove our entire project, we could do:

$ hadoop fs -rmr brown
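
If you'd rather keep the results around and just free the space used by the input copy, you can remove only that subdirectory:

$ hadoop fs -rmr brown/input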

-- Main.brodbd - 2012-03-21

 