Gibbs Sampler Help

Gibbs sampling finds conserved fixed-length motifs within a set of sequences. Far more configurable Gibbs sampler programs are available, but this one provides much of the basic functionality for your convenience. The implementation may improve in future releases. WARNING - the sampler currently works only with protein sequences and only with the standard 20 amino acids.

Run Normal: runs a Gibbs sampler using the current settings (use "Show Settings" to change them).

Run Graphing: for demonstration only. Shows a dynamic graphic of a single sampler run. Useful for teaching or understanding how the sampler works.

Mask and Repeat: after running the sampler once or more, you can mask the first set of motifs found and run the sampler again to find additional motifs.
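How this program masks motifs internally is not described here. As a rough sketch of one plausible approach, the hypothetical helper below (allowed_starts is an illustrative name, not part of the program) lists the window start positions that do not overlap any previously found motif, so that a repeat run can only pick new sites.

```python
def allowed_starts(seq_len, width, masked_starts):
    """Return window starts whose span avoids every previously found motif.

    masked_starts holds the start positions of motifs found in earlier runs;
    each is assumed to cover `width` residues (an assumption of this sketch).
    """
    allowed = []
    for start in range(seq_len - width + 1):
        # Two windows of equal width overlap when their starts differ by < width.
        overlaps = any(abs(start - m) < width for m in masked_starts)
        if not overlaps:
            allowed.append(start)
    return allowed
```

For a 10-residue sequence, width 3, and a motif already found at position 4, only starts 0, 1, and 7 remain available.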

Show Settings: to change the settings used when you click "Run Normal".


Settings

Fixed motif width: sets how long the motif is expected to be. It usually suffices to set this to a value larger than expected; a shorter motif will then be found within that length.

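Motif selection (probabilistic or exact): a minor detail under most circumstances. Probabilistic is the "true" sampler method, in which the match is drawn at random, weighted by its score. Exact always takes the best-scoring match; it runs faster and usually produces the same result.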
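The difference between the two selection modes can be sketched in a few lines. This is a minimal illustration, not the program's actual code: select_position and its arguments are hypothetical names, and the weights are assumed to be non-negative match scores, one per candidate start position.

```python
import random

def select_position(weights, mode="probabilistic", rng=None):
    """Pick a motif start position given one non-negative weight per start."""
    if mode == "exact":
        # Exact: always take the single best-scoring position.
        return max(range(len(weights)), key=lambda i: weights[i])
    # Probabilistic: draw a position with probability proportional to its weight.
    rng = rng or random.Random()
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]
```

When one position dominates the weights, both modes almost always agree, which is consistent with exact selection usually giving the same result.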

Sequence progression (stochastic or sequential): whether each new sequence to be sampled will be selected at random (true sampler) or sequentially (sampling goes through the sequence list one by one in order). Sequential runs faster but may not be as accurate when the motifs are weak.
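As an illustration of the two progression modes, the sketch below generates the order in which sequences would be dropped and rescanned. The function name and parameters are assumptions made for this example only.

```python
import random

def sequence_order(n_sequences, n_steps, mode="stochastic", rng=None):
    """Return the index of the sequence to resample at each step."""
    if mode == "sequential":
        # Walk through the sequence list in order, wrapping around at the end.
        return [step % n_sequences for step in range(n_steps)]
    # Stochastic: pick each sequence independently and uniformly at random.
    rng = rng or random.Random()
    return [rng.randrange(n_sequences) for _ in range(n_steps)]
```

In sequential mode every sequence is visited equally often; in stochastic mode some sequences may be revisited before others are sampled at all.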


How the Gibbs Sampler Works

This is a very minimal description. Essentially, a fixed-length segment is chosen at random from each sequence, and these segments are "aligned" without gaps. One sequence is dropped from this set, and a position-specific score matrix is derived from all the remaining motifs. This score matrix is used to scan the dropped sequence for the best match to the current motif alignment (either exact or probabilistic). This match is added to the "aligned" motif set, and the process of dropping a sequence, deriving a position-specific score matrix, and scanning the dropped sequence is repeated many times.

After each round, the quality of the motif alignment is assessed, and the process ends when this quality fails to improve in subsequent iterations. This whole process constitutes one sampler run. The results are stored and new sampler runs are made until either the same set of motifs is found again or a fixed number of runs is completed. At the end, the user is shown the best-scoring motif alignment (or the one that repeated).
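The procedure above can be sketched in code. This is a simplified illustration under stated assumptions, not the program's implementation: all function names, the pseudocount of 1, the uniform 1/20 background, the log-odds quality score, and the "patience" stopping rule are choices made for this example.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the standard 20 residues only

def build_profile(segments, width, pseudocount=1.0):
    """Per-column residue frequencies; the pseudocount (an assumption) avoids zeros."""
    profile = []
    for col in range(width):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seg in segments:
            counts[seg[col]] += 1.0
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()})
    return profile

def window_weights(seq, profile, width):
    """Odds of each window under the profile versus a uniform 1/20 background."""
    weights = []
    for start in range(len(seq) - width + 1):
        odds = 1.0
        for col in range(width):
            odds *= profile[col][seq[start + col]] * 20.0
        weights.append(odds)
    return weights

def alignment_score(segments, width):
    """Hypothetical quality measure: summed log-odds of the aligned residues."""
    profile = build_profile(segments, width)
    return sum(math.log(profile[col][seg[col]] * 20.0)
               for seg in segments for col in range(width))

def sampler_run(seqs, width, rng, exact=False, patience=50, max_rounds=5000):
    """One run: random start, then drop / rebuild / rescan until no improvement."""
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    best_score, best_starts, stale = -math.inf, list(starts), 0
    for _ in range(max_rounds):
        dropped = rng.randrange(len(seqs))
        others = [seqs[j][starts[j]:starts[j] + width]
                  for j in range(len(seqs)) if j != dropped]
        weights = window_weights(seqs[dropped], build_profile(others, width), width)
        if exact:
            starts[dropped] = max(range(len(weights)), key=weights.__getitem__)
        else:
            starts[dropped] = rng.choices(range(len(weights)), weights=weights)[0]
        segments = [s[p:p + width] for s, p in zip(seqs, starts)]
        score = alignment_score(segments, width)
        if score > best_score:
            best_score, best_starts, stale = score, list(starts), 0
        else:
            stale += 1
            if stale >= patience:  # quality has stopped improving
                break
    return best_score, best_starts

def find_motif(seqs, width, n_runs=10, seed=0):
    """Repeat runs until a result recurs or n_runs is reached."""
    rng = random.Random(seed)
    seen, best = set(), None
    for _ in range(n_runs):
        score, starts = sampler_run(seqs, width, rng)
        if best is None or score > best[0]:
            best = (score, starts)
        if tuple(starts) in seen:  # same motif set found again
            return score, starts
        seen.add(tuple(starts))
    return best
```

Calling find_motif(seqs, width) on a list of protein strings returns a score and one start position per sequence. Sequences containing anything other than the 20 standard residues will raise a KeyError in this sketch.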

Here is what really happens behind the scenes. Because the initial motif selection is completely random, the initial alignment of those motifs is arbitrary and meaningless. This meaningless motif profile will thus select a meaningless match from the dropped sequence. This stochastic phase of the sampler lasts for an indeterminate number of sampler rounds. Eventually, BY CHANCE ALONE, two or more real aligned segments are put into the "aligned" motif set. Once this happens, this correctly aligned pair contributes a bias to the resulting position-specific score matrix, and subsequent sequence scans tend to find a real new alignment that adds to the first two. The bias is small (how small depends on the number of sequences being sampled and the quality of the aligned match) and may be stochastically discarded before it snowballs. However, some of the time this bias results in correct selection of a third aligned motif during the scan phase. This in turn increases the bias, and the process quickly converges on a correct solution for all the sequences.
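To get a feel for the size of this bias, the toy calculation below estimates how strongly a true motif window is favored over background when some of the segments in the profile are already correct. It is only a sketch: the function name, the pseudocount of 1, the uniform 1/20 background, and the assumption that the other segments never share the motif residue in any column are all simplifications made for this example.

```python
def perfect_match_odds(n_true, n_segments, width, pseudocount=1.0):
    """Odds (profile vs. uniform background) for a window that matches the motif.

    n_true of the n_segments used to build the profile are correctly aligned
    copies of the motif; the rest are assumed never to share its residues.
    """
    # Frequency of the motif residue in each profile column.
    freq = (pseudocount + n_true) / (20 * pseudocount + n_segments)
    # Ratio to the 1/20 background, multiplied across all columns.
    return (freq * 20) ** width
```

With 4 segments in the profile and a width of 4, going from zero to two correct segments raises the odds from about 0.48 to about 39. With 49 segments the same two correct segments give odds below 1, which illustrates why the bias shrinks as the number of sequences grows.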

Late in each sampler run, the motif found is slid left and right several residues to find the local best position (this reduces local alignment trapping).
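The sliding step can be sketched as follows. This is an assumed, simplified version: conservation is a crude stand-in for the real alignment quality measure, and best_shift and max_shift are hypothetical names chosen for this example.

```python
from collections import Counter

def conservation(segments):
    """Crude quality score: per column, count the most common residue, then sum."""
    return sum(Counter(col).most_common(1)[0][1] for col in zip(*segments))

def best_shift(seqs, starts, width, max_shift=3):
    """Slide all motif starts together and keep the best-scoring offset."""
    def score_at(offset):
        return conservation([s[p + offset:p + offset + width]
                             for s, p in zip(seqs, starts)])
    best_offset, best_score = 0, score_at(0)
    for offset in range(-max_shift, max_shift + 1):
        shifted = [p + offset for p in starts]
        # Skip offsets that would push a window off either end of a sequence.
        if any(p < 0 or p + width > len(s) for s, p in zip(seqs, shifted)):
            continue
        score = score_at(offset)
        if score > best_score:
            best_offset, best_score = offset, score
    return [p + best_offset for p in starts], best_score
```

If the sampler has locked onto an alignment that is off by one residue in every sequence, a shift of one position recovers the fully conserved window.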

There are, of course, many more specifics to all of this. The method used is essentially identical to that described in the original paper on sequence alignment by Gibbs sampling, which I highly recommend as an excellent crossover computational genetics paper (one that biologists can understand). See Lawrence et al. 1993.


James H. Thomas, Department of Genome Sciences, University of Washington, 11/18/2002