Tree Bootstrap in Detail

The bootstrap is a method first suggested by Felsenstein (ref) for assessing the statistical significance of the positions of branches in a phylogenetic tree. Here it is applied to a tree inferred by a pairwise method (Bonsai uses Neighbor-Joining). A bootstrap starts with a set of sequences in which every possible sequence pair has been aligned. For each aligned pair, it samples scores from random positions in the alignment (with replacement, so a position may be drawn more than once) and adds them up. Sampling continues until the pseudo-alignment is the same length as the real alignment. When all the pairs have been sampled, it converts the scores to distances and computes a tree. This whole process is repeated many times, and the frequency with which a particular tree feature is observed is taken as a measure of the probability that the feature is correct.
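The resampling step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Bonsai's actual code: the function names and the dictionary layout (a pair of sequence names mapped to a list of per-position match scores) are hypothetical, and the conversion of scores to distances and the Neighbor-Joining step are left out.

```python
import random

def resample_pair_score(position_scores, rng):
    """Draw len(position_scores) positions at random, with replacement,
    and return the sum of their scores (one pseudo-alignment score)."""
    n = len(position_scores)
    return sum(position_scores[rng.randrange(n)] for _ in range(n))

def bootstrap_replicate(pair_scores, rng):
    """Build one bootstrap replicate: a resampled total score for every pair.

    pair_scores is a hypothetical layout: {(name_a, name_b): [score, ...]}.
    A real program would next turn these totals into distances and build a tree.
    """
    return {pair: resample_pair_score(scores, rng)
            for pair, scores in pair_scores.items()}

if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so the run is repeatable
    toy = {
        ("seqA", "seqB"): [5, 5, -1, 4, 5, -2, 5, 4],
        ("seqA", "seqC"): [2, -1, 0, 3, -2, 1, 2, -1],
        ("seqB", "seqC"): [1, 0, -1, 2, -2, 0, 3, -1],
    }
    for i in range(3):
        print(i, bootstrap_replicate(toy, rng))
```

Each call to bootstrap_replicate corresponds to one pass of "sample every pair"; repeating it many times and building a tree from each replicate gives the frequencies discussed above.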

What features of a set of alignments cause a reproducible bootstrap tree feature? Obviously, if the branch lengths above and below a tree node are long, this will tend to make the tree robust at that node, but other factors play a major role as well. Since the bootstrap samples scores randomly across the length of an alignment, the longer the alignment, regardless of its quality, the less the sampling result will vary relative to its size (think intro statistics: the more values you sum, the smaller the sampling error is in proportion to the total). For similar reasons, if an alignment is consistent across its length, the random sampling will produce more consistent results. In the limit, if an alignment has the same match score at every position, all bootstrap samples will give the same result as the original alignment. Thus, if an alignment is long and consistent, it will tend to produce bootstrap scores that are similar to the original alignment score. If these properties are true of all of the sequence alignments in a set, the result is a tree that is relatively robust regardless of the branch lengths in the tree.
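The effect of alignment length and consistency can be checked with a small simulation. The sketch below is an illustration only: the score values are made up, the function name is hypothetical, and "spread" is measured as the standard deviation of the bootstrap totals divided by their mean, which is one reasonable choice among several.

```python
import random
import statistics

def relative_spread(position_scores, replicates, rng):
    """Bootstrap one alignment's scores `replicates` times and return
    stdev(totals) / mean(totals): how much the resampled total varies
    in proportion to its size."""
    n = len(position_scores)
    totals = [sum(rng.choices(position_scores, k=n)) for _ in range(replicates)]
    return statistics.pstdev(totals) / statistics.mean(totals)

if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    short_mixed = [5, -1] * 10    # 20 positions, scores vary by position
    long_mixed = [5, -1] * 250    # 500 positions, same mix of scores
    uniform = [2] * 20            # 20 positions, identical score everywhere

    print("short, mixed :", relative_spread(short_mixed, 200, rng))
    print("long, mixed  :", relative_spread(long_mixed, 200, rng))
    print("short, uniform:", relative_spread(uniform, 200, rng))
```

The long alignment should show a much smaller relative spread than the short one even though both have the same mix of scores, and the uniform alignment gives a spread of zero because every resampled total equals the original.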

Give an example or two.

[in development]


James H. Thomas, Department of Genome Sciences, University of Washington
5/18/2002