Parallel Processing

One of the challenges of working in natural language processing is the large amounts of data that must be processed to get meaningful results. After you've done everything you can to make your program run quickly--written efficient algorithms, bought powerful hardware--it may still take you hours or even days to get a result. This is where parallelization comes in.

Parallelization is a technique in which you break your problem down into smaller pieces that can be run simultaneously on multiple machines. It usually consists of three parts:

  1. Dividing the task into independent parallel subtasks
  2. Running the subtasks simultaneously
  3. Collating the results

The first and last steps entail work for the programmer to make sure that the problem is properly modularized and the parallelization code (if any) is properly written. The second step can be accomplished manually by walking from machine to machine and kicking off processes, but is greatly facilitated by having that task automated by parallel processing software, like the kind we have installed on pongo.

Parallelization is a challenging programming technique in its own right, and the specific approach will vary from task to task. The most important thing to keep in mind is the logical dependencies between different parts of your program: only pieces that do not depend on each other's results can run at the same time. For example, say you have a slow parser, so that running it on a test set of 10,000 sentences takes about a day. Since parsers work on sentences independently of each other, you could break the test set up into ten 1,000-sentence inputs and run them all in parallel. Done correctly, this could give you up to a tenfold speedup, so that your task would finish in a little under three hours.
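The splitting step in the parser example can be sketched with the standard split utility. This is only an illustration: the file name testset.txt, the chunk_ prefix, and the generated stand-in sentences are assumptions, not part of the pongo setup.

```shell
# Work in a scratch directory so no existing files are touched.
cd "$(mktemp -d)"

# Build a stand-in test set: 10,000 one-line "sentences".
# (A real test set would already exist; this file is hypothetical.)
seq 1 10000 | sed 's/^/sentence /' > testset.txt

# Cut it into pieces of 1,000 lines each: chunk_aa, chunk_ab, ...
split -l 1000 testset.txt chunk_

# Count the pieces by loading the matching file names into $1, $2, ...
set -- chunk_*
num_chunks=$#
echo "pieces: $num_chunks"
```

Because split names the pieces in alphabetical order, running cat chunk_* later reproduces the original sentence order.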

Pongo, maintained by the University of Washington Linguistics department, manages a cluster of parallel compute nodes using the openMosix architecture. Parallel processing on pongo is completely transparent and requires no interaction from the user: after a program has run for a certain amount of time on pongo, it is automatically migrated to one of the compute nodes. To run multiple tasks in parallel, you need only start multiple jobs, and the cluster takes care of the rest.

UNIX Process Management

To use the pongo parallelization system effectively, you need to know how to manage UNIX processes. A full tutorial on using UNIX is beyond the scope of this Wiki, but a concrete example will point you to some of the basic ideas.

Say you have a program called nlp-program that takes a filename as input and writes output to STDOUT. A typical command line might look like this:

$ nlp-program input > output

This will run nlp-program, writing output to a file called output.

By default, the command prompt returns only after the program has exited. Often, you'd like to keep working while the program does its thing. Appending an ampersand (&) to the command line runs the program in the background and returns to the command prompt immediately.

$ nlp-program input > output &

See the UNIX manpages for jobs, kill, bg, fg, and ps for details about how to manage jobs running in the background.
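As a rough sketch of what managing a background job looks like in a script, the following starts a job, checks on it, and stops it early. Here sleep 60 is only a stand-in for a long-running program, and the variable names are illustrative.

```shell
# Start a long-running stand-in job in the background.
sleep 60 &
pid=$!                     # $! holds the process ID of the last background job

# kill -0 sends no signal; it only tests whether the process exists.
if kill -0 "$pid" 2>/dev/null; then
    echo "job $pid is running"
fi

# Ask the job to terminate (SIGTERM), then collect its exit status.
kill "$pid"
status=0
wait "$pid" 2>/dev/null || status=$?
echo "job $pid exited with status $status"
```

A status above 128 means the job was ended by a signal; the signal number is the status minus 128, so a job stopped by SIGTERM (signal 15) reports 143.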

If nlp-program takes a long time to run, you might decide to speed things up by breaking the input up into several parts and running them in parallel.

$ nlp-program input1 > output1 &
$ nlp-program input2 > output2 &
$ nlp-program input3 > output3 &

This starts three nlp-program jobs. If they take a long time to execute, pongo will automatically migrate them to different nodes so that they can be run in parallel. If you run mtop you'll see three separate nlp-program lines. Once all three jobs have completed, you can reassemble the output, possibly by doing something like this:

$ cat output1 output2 output3 > output
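The same sequence can be put in a small script, so that the outputs are reassembled only after every job has finished. This is a minimal sketch: nlp_program below is a stand-in shell function (it just upper-cases its input) so that the example can run anywhere, and the three one-line input files are invented for illustration.

```shell
# Work in a scratch directory so no existing files are touched.
cd "$(mktemp -d)"

# Stand-in for the real program: upper-cases each line of the named file.
nlp_program() { tr 'a-z' 'A-Z' < "$1"; }

# Hypothetical inputs, one piece per file.
printf 'one\n'   > input1
printf 'two\n'   > input2
printf 'three\n' > input3

# Launch one background job per piece.
for i in 1 2 3; do
    nlp_program "input$i" > "output$i" &
done

# Block until every background job started by this shell has finished.
wait

# Reassemble the pieces in their original order.
cat output1 output2 output3 > output
```

Without the wait, cat could run while some jobs are still writing, and the combined file would be silently incomplete.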

By default, UNIX kills any processes you have running when you exit the shell from which they were created. This is a problem if you have programs that run for days. You can use the nohup (short for "no hangup") command to tell UNIX to keep a process running in the background even after its parent shell exits. So the full command for this example would probably look like this:

$ nohup nlp-program input1 > output1 &
$ nohup nlp-program input2 > output2 &
$ nohup nlp-program input3 > output3 &

If you do this, you can log out of pongo and log back in hours (or days) later and use mtop to check on the status of your jobs.
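Outside of mtop, one portable way to check on a job after logging back in is to record its process ID when you start it. The sketch below assumes file names job.log and job.pid, which are made up for this example, and again uses sleep as a stand-in for the real program.

```shell
# Work in a scratch directory so no existing files are touched.
cd "$(mktemp -d)"

# Start the job immune to hangups; send both STDOUT and STDERR to a log file.
nohup sleep 30 > job.log 2>&1 &

# Save the process ID so a later login session can find the job again.
echo $! > job.pid

# Later (possibly from a new shell): is the job still alive?
if kill -0 "$(cat job.pid)" 2>/dev/null; then
    state=running
else
    state=finished
fi
echo "job $(cat job.pid) is $state"
```

Redirecting STDERR as well keeps error messages next to the output instead of losing them when the terminal goes away.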

-- BillMcNeill - 08 Nov 2006

Topic revision: r1 - 2006-11-08 - 23:36:53 - BillMcNeill
