PatasMahout
General Info
Mahout is a big-data machine learning toolkit that can run on top of Hadoop. At the time of writing it is rather unstable, so here are some tips on tweaking the samples so you can get started.
Local Installation and Setup
Mahout is installed under /NLP_TOOLS/ml_tools/mahout/latest/ . However, some of the samples on the Mahout website require write access to the mahout installation directory (sigh), so you'll want to pull down the install into your home dir:
cp -r /NLP_TOOLS/ml_tools/mahout/mahout-distribution-0.6/ ~/tools/
Once you have a copy, build the examples like so:
cd ~/tools/mahout-distribution-0.6/examples
mvn compile
Set these environment variables.
export HADOOP_HOME=/opt/hadoop
export MAHOUT_HOME=~/tools/mahout-distribution-0.6
Mahout tries to run on Hadoop by default. To disable that and run locally, set this variable to anything:
export MAHOUT_LOCAL=blah
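To double-check which mode your current shell would select, here is a minimal sketch that mirrors only the rule stated above (any non-empty MAHOUT_LOCAL means local). The function name mahout_mode is made up for this page, and treating an empty value the same as unset is an assumption.

```shell
# Hypothetical helper: report the mode implied by the rule on this page.
# Assumption: an empty MAHOUT_LOCAL counts the same as an unset one.
mahout_mode() {
  if [ -n "${MAHOUT_LOCAL:-}" ]; then
    echo "local"
  else
    echo "hadoop"
  fi
}
```

Run mahout_mode after the export above and it prints local; after unset MAHOUT_LOCAL it prints hadoop.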
Running Examples on 0.6
The package structure and samples are changing with each release, so here are two that have been tweaked to get them working on 0.6.
Random Forests
The original sample is broken on 0.6. Here is the tweaked version:
$ curl http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data -o glass.data
$ hadoop fs -put glass.data rdftest/glass.data
$ hadoop jar ~/tools/mahout-distribution-0.6/core/target/mahout-core-0.6-job.jar org.apache.mahout.classifier.df.tools.Describe -p rdftest/glass.data -f rdftest/glass.info -d I 9 N L
$ hadoop jar ~/tools/mahout-distribution-0.6/examples/target/mahout-examples-0.6-job.jar org.apache.mahout.classifier.df.BreimanExample -d rdftest/glass.data -ds rdftest/glass.info -i 10 -t 100
There's no documentation for the input format, except that it says it conforms to the UCI format (which is not described anywhere on their website, thanks guys). So my disclaimer is that the following things were figured out by messing around.
Rules on the input vectors are thus:
- one line per document
- attributes must be in the correct order
- missing attribute placeholder is a question mark
- attributes cannot contain spaces or commas
- delimiter can be either comma or space
- numeric values can contain periods
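A quick sanity check for a data file, based only on the rules above: every line should split into the same number of attributes on commas or spaces. This is a sketch, not anything Mahout ships. The function name check_fields is made up, and treating a run of consecutive delimiters as a single delimiter is an assumption.

```shell
# Hypothetical helper: check_fields FILE N
# Prints the line number of every line in FILE that does not split into
# exactly N attributes. Assumption: runs of commas/spaces are one delimiter.
check_fields() {
  awk -F'[ ,]+' -v n="$2" 'NF != n { print NR }' "$1"
}
```

For the glass data used above, which has 11 columns (id, nine numeric attributes, label), check_fields glass.data 11 should print nothing if every line is well formed.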
Mahout uses the Describe class to learn how to read the input file so it can convert it to vectors. You pass in a nigh-unintelligible string of repeat counts mixed with these characters:
- N : numerical attribute
- C : categorical (nominal) attribute
- L : label (nominal) attribute
- I : ignored attribute
For example, I 2 C 3 N C C L means: ignore the first item, then read two alphanumeric values, then three numeric values, then two more alphanumeric values, then the label. A number repeats the character that follows it.
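If the descriptor string is hard to read, it can help to expand it into one character per attribute and compare that against a line of your data. The sketch below does that expansion. The function name expand_descriptor is made up for this page, and it reflects the descriptor syntax as inferred from the example here, not an official grammar.

```shell
# Hypothetical helper: expand a descriptor such as "I 2 C 3 N C C L" into one
# type character per attribute. A number repeats the character that follows it.
expand_descriptor() {
  local count=1 token out=""
  for token in $1; do
    case "$token" in
      N|C|L|I)
        # Emit the type character count times, then reset the repeat count.
        while [ "$count" -gt 0 ]; do
          out="$out$token "
          count=$((count - 1))
        done
        count=1 ;;
      ""|*[!0-9]*)
        echo "unknown token: $token" >&2
        return 1 ;;
      *)
        # Pure digits: remember as the repeat count for the next character.
        count=$token ;;
    esac
  done
  echo "${out% }"
}
```

expand_descriptor "I 2 C 3 N C C L" prints I C C N N N C C L, and expand_descriptor "I 9 N L" (the glass descriptor) prints I N N N N N N N N N L, which is 11 entries for 11 columns.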
It will parse the data file, make sure it conforms to the descriptor, and then write the descriptor out to a file. Then when you call the BreimanExample class it takes the data file, the descriptor, the number of iterations, and the number of trees. I discovered that it will give you NaN error values if you try too many trees with not enough data points.
I'm planning to write a class to invoke the DecisionForest code. The BreimanExample is an okay start, but it doesn't write the model to a file for future classification, plus I need a class to invoke TestForest with the model and give back accuracy info. I'll post it here at the end of the quarter.
K-Means
Here is the sample on the Mahout website, and this shell script is the actual script that has been updated. This one is nicer because it will build the vector files for you from a directory of text files.
As of 0.6, this will not run on Hadoop. You will need to set MAHOUT_LOCAL, as described in the setup section.
On this line, add -nv and -ow:
./bin/mahout seq2sparse -i ./examples/bin/work/reuters-out-seqdir/ -o ./examples/bin/work/reuters-out-seqdir-sparse -nv -ow
Here, add -ow and -cl so that it will give you the cluster output file:
./bin/mahout kmeans -i ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/ -c ./examples/bin/work/clusters -o ./examples/bin/work/reuters-kmeans -x 100 -k 20 -ow -cl
Now change the input cluster filename to use clusters-*-final, because the number in the directory name varies from run to run.
./bin/mahout clusterdump -s ./examples/bin/work/reuters-kmeans/clusters-*-final -d ./examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 -dt sequencefile -b 800 -n 20 -o ./examples/bin/work/cluster_output.txt
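The wildcard only does what you want if exactly one directory matches. As a hedged sketch, the helper below resolves the pattern on the local filesystem and fails loudly when it matches nothing or more than one thing (for example, leftovers from an earlier run). The name final_clusters_dir is made up for this page.

```shell
# Hypothetical helper: print the single clusters-*-final directory under the
# given k-means output directory, or fail if there is not exactly one.
final_clusters_dir() {
  # Re-use the positional parameters to hold the glob matches.
  set -- "$1"/clusters-*-final
  if [ "$#" -ne 1 ] || [ ! -d "$1" ]; then
    echo "expected exactly one clusters-*-final directory" >&2
    return 1
  fi
  echo "$1"
}
```

You could then pass "$(final_clusters_dir ./examples/bin/work/reuters-kmeans)" as the -s argument instead of the raw wildcard. This only applies when running locally, since it looks at the local filesystem, not HDFS.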
However, the script stops there. It doesn't actually give you a way to see which original docs went into which clusters, so here's a class that will do that. I pulled this Java class down from the web; it's apparently similar to the one in the Mahout in Action book, but since that's for version 0.4, it doesn't work any more. Please note that you will have to keep adding jar after jar to your classpath for it to run. Most of what you need is somewhere in the /opt/hadoop/lib directory, including Apache Commons stuff.
ClusterOutput.java is the class all fixed up, and here is how to invoke it:
java ClusterOutput ./examples/bin/work/reuters-kmeans/clusteredPoints ./examples/bin/work/cluster_vectors.txt ./examples/bin/work/cluster_ids.txt
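Rather than adding jars to the classpath one at a time, you can collect every jar in a few directories in one go. This is only a sketch: the function name build_classpath is made up, and which directories actually hold the jars ClusterOutput needs is an assumption you will have to adjust.

```shell
# Hypothetical helper: build a classpath string from the current directory
# plus every .jar file found directly inside the given directories.
build_classpath() {
  local cp="." dir jar
  for dir in "$@"; do
    for jar in "$dir"/*.jar; do
      # Skip the unexpanded pattern when a directory contains no jars.
      if [ -e "$jar" ]; then
        cp="$cp:$jar"
      fi
    done
  done
  echo "$cp"
}
```

For example, java -cp "$(build_classpath /opt/hadoop/lib "$MAHOUT_HOME/core/target")" ClusterOutput followed by the three arguments above. The second directory is a guess based on where the mahout-core job jar lives in the Random Forests commands.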