Natural Language Toolkit
The Natural Language Toolkit is a suite of Python libraries and programs for symbolic and statistical natural language processing. It is installed on all the computers in the Treehouse.
This page contains assorted tips and errata. Refer to the NLTK web page for a full discussion.
Installing
Installation is fairly easy and well documented.
The NLTK depends on the Numerical Python package. The documentation page for this package can be confusing because it discusses two packages: an older one called Numeric and a newer one called numarray. The NLTK uses the older one, Numeric.
On OS X I had trouble installing Numeric from source and had to download the ATLAS math library before I could get it to build. Any error messages you see relating to "atlas" or "lapack" most likely point to this same problem. Your mileage may vary.
The best way to install the Numeric package on OS X is probably to use Fink. Some time back I installed version 23.1 (precompiled) and it seems to work fine with the NLTK.
For the cool graphics that Bill showed us, I had to download Tcl/Tk and a version of Python that was compatible with it. I used Fink to download them, and both were available precompiled. (Note: If you download Python with Fink you will have two versions of Python on your system. In my case, the version that comes with Mac OS is available with the terminal command "python" and the new Python (version 2.3.3) is available with the terminal command "python2.3".)
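With two interpreters installed, it is easy to lose track of which one a given command starts. The following is a minimal sketch for checking. It uses only the standard sys module, and the function name describe_interpreter is made up for this example. Run it once with each command ("python" and "python2.3" in the setup described above) and compare the output.

```python
import sys

def describe_interpreter():
    """Return a one-line description of the interpreter running this script."""
    # sys.version_info starts with (major, minor, micro).
    version = "%d.%d.%d" % tuple(sys.version_info[:3])
    # sys.executable is the path of the interpreter binary, when it is known.
    return "Python %s at %s" % (version, sys.executable)

print(describe_interpreter())
```

Whichever interpreter you plan to use with the NLTK is the one that needs Numeric and Tcl/Tk support installed.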
If you have a Mac and you want to play around with this stuff, I'd be happy to help you get it started!
--
MichaelTepper - 30 Jan 2005
Functionality
The NLTK has an assortment of useful programming utilities and machine learning techniques. Among them are the following:
- Corpus Handling
- Parsers for major corpora (Brown, Switchboard, Newsgroup)
- Corpora samples
- Tokenization framework
- Machine Learning Algorithms
- Classification ... Maximum entropy, Naive Bayes
- Clustering ... EM, K-means, GAAC
- Hidden Markov Model
- Parsers ... shift-reduce, recursive descent, chart
- Stemmer ... regular expression, Porter stemmer
- Tagger ... n-gram, Brill
- Graphics widgets ... trees, AVMs, graphs, plots
- Utility classes ... probability
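To give a feel for one of the simpler items on this list, here is a rough sketch of what a regular expression stemmer does. It is written in plain Python with the standard re module and is not the NLTK's own implementation. The suffix pattern, the function name regexp_stem and the min_stem parameter are all assumptions made for illustration.

```python
import re

# Hypothetical suffix pattern, chosen only for illustration.
SUFFIXES = re.compile(r"(ing|ed|es|s)$")

def regexp_stem(word, min_stem=3):
    """Strip one matching suffix, unless the remaining stem would be too short."""
    stem = SUFFIXES.sub("", word)
    if len(stem) >= min_stem:
        return stem
    # Leave short words such as "is" or "bed" untouched.
    return word

print([regexp_stem(w) for w in ["parsing", "tagged", "corpora", "is", "taggers"]])
```

A stemmer like this is crude ("tagged" becomes "tagg"), which is one reason the toolkit also offers the Porter stemmer.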
Using the NLTK
The fastest way to get started with the NLTK is to run it from the Python command line. At a command prompt, type python. You'll see something like:
Python 2.4 (#2, Jan 21 2005, 01:46:10)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
You can browse the API documentation for functions that you're interested in. Many packages have a demo() function that shows off their basic functionality. For example:
Python 2.4 (#2, Jan 21 2005, 01:46:10)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk.tagger
>>> nltk.tagger.demo()
===========================================================================
Reading training data.............
Read in 30480 words for training
Training taggers.
Training unigram tagger...
Training bigram tagger...
Training trigram tagger...
Reading testing data.......
Read in 15979 words for testing
===========================================================================
Running the taggers on test data...
Default (nn) tagger: Accuracy = 15.5%
Unigram tagger: Accuracy = 78.4%
Bigram tagger: Accuracy = 79.1%
Trigram tagger: Accuracy = 79.2%
Usage statistics for the trigram tagger:
Subtagger | Words Tagged
---------------------|-----------------
<2nd Order Tagger> | 39.1%
<1st Order Tagger> | 19.4%
<Unigram Tagger> | 19.7%
<DefaultTagger: nn> | 21.8%
===========================================================================
>>>
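The usage statistics in the demo output suggest a backoff chain: each tagger handles the words it can and hands the rest to a simpler one, ending with a default tagger that labels everything nn. The sketch below shows that idea with just two levels, a unigram tagger backed by a default tag. It is plain Python written for a current Python 3 interpreter, not the nltk.tagger code. The function names and the toy data are invented for illustration.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_words):
    """Map each word to the tag it carried most often in training."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_with_backoff(words, model, default_tag="nn"):
    """Tag each word with the model, backing off to default_tag for unseen words."""
    return [(w, model.get(w, default_tag)) for w in words]

def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = sum(1 for p, g in zip(predicted, gold) if p[1] == g[1])
    return correct / float(len(gold))

# Toy data, invented for illustration only.
train = [("the", "at"), ("dog", "nn"), ("barks", "vbz"),
         ("the", "at"), ("cat", "nn"), ("sleeps", "vbz")]
test = [("the", "at"), ("bird", "nn"), ("sings", "vbz")]

model = train_unigram(train)
predicted = tag_with_backoff([w for w, _ in test], model)
print(predicted)
print("Accuracy = %.1f%%" % (100 * accuracy(predicted, test)))
```

Here "bird" and "sings" never appear in training, so both fall through to the default tag; the first guess happens to be right and the second is wrong. The bigram and trigram taggers in the demo add further levels on top that take the preceding tags into account.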
--
BillMcNeill - 14 Jan 2005