One of the challenges of working in natural language processing is the large amount of data that must be processed to get meaningful results. After you've done everything you can to make your program run quickly (writing efficient algorithms, buying powerful hardware), it may still take hours or even days to get a result. This is where parallelization comes in.
Parallelization is a technique in which you break your problem down into smaller pieces that can be run simultaneously on multiple machines. It usually consists of three parts:
- Dividing the task into independent parallel subtasks
- Running the subtasks simultaneously
- Collating the results
The first and last steps require work from the programmer to make sure that the problem is properly modularized and the parallelization code (if any) is properly written. The second step can be done manually by walking from machine to machine and kicking off processes, but it is much easier when parallel processing software, like the kind we have installed on pongo, automates it.
Parallelization is a challenging programming technique in its own right, and the specific approach will vary from task to task. The most important thing to keep in mind is the logical dependencies between different parts of your program. For example, say you have a slow parser, so that running it on a test set of 10,000 sentences takes about a day. Since parsers work on sentences independently of each other, you could break the test set up into ten 1,000-sentence inputs and run them all in parallel. Done correctly, this could give you up to a tenfold speedup, so that your task would finish in about two and a half hours (24 hours / 10 = 2.4 hours).
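As a rough sketch of the dividing step in the parser example, the standard split utility can cut a one-sentence-per-line file into fixed-size pieces. The file name testset.txt, the part- prefix, and the generated stand-in sentences are all hypothetical; substitute your own data.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d) && cd "$workdir"

# Build a stand-in test set: 10,000 lines, one "sentence" per line.
# (testset.txt is a hypothetical name for your real test set.)
awk 'BEGIN { for (i = 1; i <= 10000; i++) print "sentence " i }' > testset.txt

# Cut it into pieces of 1,000 lines each. With the default suffixes
# the pieces are named part-aa, part-ab, ..., part-aj.
split -l 1000 testset.txt part-

# List the pieces with their line counts.
wc -l part-*
```

Because split names the pieces in alphabetical order, concatenating part-* later restores the original sentence order.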
Pongo, maintained by the University of Washington Linguistics department, manages a cluster of parallel compute nodes using the openMosix architecture. Parallel processing on pongo is completely transparent and requires no interaction from the user. After a program has run for a certain amount of time on pongo, it will automatically be migrated to one of the compute nodes. To run multiple tasks in parallel, you need only start multiple jobs and the cluster takes care of the rest.
UNIX Process Management
To use the pongo parallelization system effectively, you need to know how to manage UNIX processes. A full tutorial on using UNIX is beyond the scope of this Wiki, but a concrete example will point you to some of the basic ideas.
Say you have a program called nlp-program that takes a filename as input and writes output to STDOUT. A typical command line might look like this:
$ nlp-program input > output
This will run nlp-program, writing output to a file called output.
By default the command prompt will only return after the program has exited. Often, you'd like to keep working while the program does its thing. Appending an ampersand (&) to the command line runs the program in the background and returns to the command prompt immediately.
$ nlp-program input > output &
See the UNIX manpages for job-control commands such as jobs, fg, and bg for details about how to manage jobs running in the background.
If nlp-program takes a long time to run, you might decide to speed things up by breaking the input up into several parts and running them in parallel.
$ nlp-program input1 > output1 &
$ nlp-program input2 > output2 &
$ nlp-program input3 > output3 &
This starts three nlp-program jobs. If they take a long time to execute, pongo will automatically migrate them to different nodes so that they can be run in parallel. If you run a process listing command such as ps or top, you'll see three separate nlp-program lines. Once all three jobs have completed, you can reassemble the output, possibly by doing something like this:
$ cat output1 output2 output3 > output
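If you'd rather not watch for the jobs to finish by hand, the whole run-and-collate sequence can be scripted. The sketch below is one possible way to do it, not something specific to pongo: it uses the shell's built-in wait command, which blocks until all background jobs started by that shell have exited. The nlp_program function is a hypothetical stand-in (it just uppercases its input) so that the example runs anywhere.

```shell
# Hypothetical stand-in for the real tool: uppercases the named file.
# Replace calls to it with your own nlp-program.
nlp_program() { tr 'a-z' 'A-Z' < "$1"; }

# Work in a scratch directory with three small sample inputs.
workdir=$(mktemp -d) && cd "$workdir"
printf 'first part\n'  > input1
printf 'second part\n' > input2
printf 'third part\n'  > input3

# Start one background job per input file.
for i in 1 2 3; do
    nlp_program "input$i" > "output$i" &
done

# Block until every background job started above has exited.
wait

# Reassemble the pieces in their original order.
cat output1 output2 output3 > output
cat output
```

Waiting before the final cat matters: concatenating while a job is still writing would silently give you a truncated result.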
By default, UNIX kills any processes you have running when you exit the shell from which they were created. This is a problem if you have programs that run for days. You can use the nohup (short for "no hang-up") command to tell UNIX to keep a process running in the background even after its parent shell exits. So the full command for this example would probably look like this:
$ nohup nlp-program input1 > output1 &
$ nohup nlp-program input2 > output2 &
$ nohup nlp-program input3 > output3 &
If you do this, you can log out of pongo and log back in hours (or days) later and use a process listing command such as ps or top to check on the status of your jobs.
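One way to make that later check easier is to save each job's process ID when you start it. The following is a minimal sketch under a few assumptions: sleep stands in for nlp-program, the file names job1.pid and log1 are made up for illustration, and the check uses kill -0, which tests whether a process you own still exists without sending it a signal. Whether this fits your workflow on pongo is for you to verify.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d) && cd "$workdir"

# Start a long-running stand-in job (sleep plays the role of
# nlp-program) and save its process ID, which the shell stores in $!.
nohup sleep 30 > output1 2> log1 &
echo $! > job1.pid

# Later, possibly from a new login session: signal 0 only tests
# whether the process still exists.
if kill -0 "$(cat job1.pid)" 2>/dev/null; then
    status=running
else
    status=finished
fi
echo "job1 is $status"

# Clean up the demonstration job.
kill "$(cat job1.pid)"
```

Redirecting the error stream to its own file (log1 here) keeps any error messages from the job out of its regular output.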
- 08 Nov 2006