Natural Language Toolkit

The Natural Language Toolkit is a suite of Python libraries and programs for symbolic and statistical natural language processing. It is installed on all the computers in the Treehouse.

This page contains assorted tips and errata. Refer to the NLTK web page for a full discussion.


Installation

Installation is fairly easy and well documented.

NLTK depends on the Numerical Python package. The documentation for that package can be confusing because it discusses two packages: an older one called Numeric and a newer one called numarray. NLTK uses the older one, Numeric.

On OS X I had trouble installing Numeric from source and had to download the ATLAS math library before I could get it to build. If you see error messages mentioning "atlas" or "lapack", this is the likely cause. Your mileage may vary.

The best way to install the Numeric package on OS X is probably to use Fink. Some time back I installed version 23.1 (precompiled) and it seems to work fine with NLTK. For the cool graphics that Bill showed us, I had to download Tcl/Tk and a version of Python that was compatible with it; I used Fink to download them, and both were available precompiled. (Note: if you install Python with Fink you will have two versions of Python on your system. In my case, the version that comes with Mac OS is available with the terminal command python, and the new Python (version 2.3.3) is available with the terminal command python2.3.) If you have a Mac and you want to play around with this stuff, I'd be happy to help you get it started!

-- MichaelTepper - 30 Jan 2005


The NLTK has an assortment of useful programming utilities and machine learning techniques. Among them are the following:

  • Corpus Handling
    • Parsers for major corpora (Brown, Switchboard, Newsgroup)
    • Corpora samples
    • Tokenization framework

  • Machine Learning Algorithms
    • Classification ... Maximum entropy, Naive Bayes
    • Clustering ... EM, K-means, GAAC
    • Hidden Markov Model
    • Parsers ... shift-reduce, recursive descent, chart
    • Stemmer ... regular expression, Porter stemmer
    • Tagger ... n-gram, Brill

  • Graphics widgets ... trees, AVMs, graphs, plots

  • Utility classes ... probability
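To give a flavor of the tokenization framework, here is a minimal sketch in plain Python (using only the standard re module, not NLTK itself) of the kind of regular-expression tokenization NLTK provides. The pattern is my own illustration, not NLTK's actual default:

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens with a regular
    expression, in the spirit of a regular-expression tokenizer."""
    # \w+ matches runs of word characters; [^\w\s] matches single
    # punctuation marks; whitespace is discarded entirely.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("The NLTK is fun, isn't it?"))
# ['The', 'NLTK', 'is', 'fun', ',', 'isn', "'", 't', 'it', '?']
```

NLTK's own tokenizers let you swap in different patterns (or use pre-built ones), but the core idea is the same: a regular expression defines what counts as a token.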

Using the NLTK

The fastest way to get started with the NLTK is to run it from the Python command line. At a command prompt, type python. You'll see something like:

Python 2.4 (#2, Jan 21 2005, 01:46:10) 
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

You can browse the API documentation for the functions you're interested in. Many packages have a demo() function that shows off their basic functionality. For example:

Python 2.4 (#2, Jan 21 2005, 01:46:10) 
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk.tagger
>>> nltk.tagger.demo()
Reading training data.............
  Read in 30480 words for training
Training taggers.
  Training unigram tagger...
  Training bigram tagger...
  Training trigram tagger...
Reading testing data.......
  Read in 15979 words for testing
Running the taggers on test data...
  Default (nn) tagger:  Accuracy = 15.5%
  Unigram tagger:       Accuracy = 78.4%
  Bigram tagger:        Accuracy = 79.1%
  Trigram tagger:       Accuracy = 79.2%

Usage statistics for the trigram tagger:

             Subtagger | Words Tagged
    <2nd Order Tagger> |    39.1%
    <1st Order Tagger> |    19.4%
      <Unigram Tagger> |    19.7%
   <DefaultTagger: nn> |    21.8%
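The backoff behavior in the table above — a higher-order tagger handing unseen contexts down to simpler taggers, ending at the default nn tag — can be sketched in plain Python. This is not NLTK's implementation, just an illustration of a unigram tagger with a default-tag fallback:

```python
from collections import Counter, defaultdict

class SimpleUnigramTagger:
    """Tag each word with its most frequent tag in the training data,
    falling back to a default tag for unseen words."""

    def __init__(self, default_tag="nn"):
        self.default_tag = default_tag
        self.table = {}

    def train(self, tagged_words):
        # Count how often each tag occurs with each word.
        counts = defaultdict(Counter)
        for word, tag in tagged_words:
            counts[word][tag] += 1
        # Keep only the most common tag per word.
        self.table = {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag(self, words):
        # Unseen words back off to the default tag.
        return [(w, self.table.get(w, self.default_tag)) for w in words]

# Toy training data in (word, tag) form.
train = [("the", "at"), ("dog", "nn"), ("barks", "vbz"), ("the", "at")]
tagger = SimpleUnigramTagger()
tagger.train(train)
print(tagger.tag(["the", "dog", "runs"]))
# [('the', 'at'), ('dog', 'nn'), ('runs', 'nn')]
```

The bigram and trigram taggers in the demo work the same way, except that they condition on the previous one or two tags as well as the word, and back off to the next-simpler tagger when a context was never seen in training.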

-- BillMcNeill - 14 Jan 2005

Topic revision: r4 - 2005-01-30 - 23:42:51 - MichaelTepper
