Multiple Alignment Methods

Introduction: The Bonsai multiple alignment methods build up a multiple alignment by first aligning subsets of the sequences to generate interim alignments, sometimes called "profiles". These interim alignments are then aligned with each other stepwise to incrementally build the full multiple sequence alignment. Critical elements in such progressive alignments include the order in which to add sequences to create interim alignments, the method for generating scores for aligned columns, the weights assigned to various sequences for the scoring, and the way gap positions and lengths are scored. Progressive alignment algorithms are the only ones that are currently computationally efficient enough to work with large numbers of sequences. If you align more than 20 or so sequences of moderate length you will see that even progressive alignments are time consuming. For example, alignment of the 45 sequences, ~340 amino acids each, provided with Bonsai as "Sample Sequences 3", required 30 seconds to align on a current fast desktop machine (Pentium 4, 1.4 GHz, Feb., 2002). Alignment of 150 sequences of similar length required 250 seconds. These times increase as nearly the square of sequence number x sequence length. Various short-cuts are available (Bonsai provides one called Turbo Words) but for any final alignment these should not be used.

Sum of Pairs Scoring: When aligning a sequence to a profile or two profiles to each other, Bonsai determines the score for one column aligned to a residue or another column by the method called Sum of Pairs. In essence, this method compares every residue in one column with every residue in the other column and sums the scores for each residue pair according to a BLOSUM score matrix. The score matrix used is chosen dynamically according to the degree of similarity of the sequences in the larger profile. An example of a set of aligned columns is shown below. For the column indicated by the arrow, the L residue will scored twice and the V residue scored once against each residue found in a column to which it aligns. An important twist on this method is sequence weighting as described below. [ in development. ]

The Position-Specific Score Matrix (PSSM): When the Sum of Pairs scores for every alignment column has been computed, the resulting series of score values is called a PSSM. In an alignment this is used in a fashion very similar to the way a simple score matrix (e.g. BLOSUM62) is used for pair alignments. I find it conceptually simplest to think of the PSSM as being the score matrix for a "collapsed" alignment; that is, one in which all the residues in each column have been compressed into a single weighted composite residue with associated rules for scoring it against other residues during an alignment. You can view all of this information for a profile by selecting columns and using "View : Column Alignment Values".

Guide Tree Construction: Bonsai begins a multiple alignment by constructing a phylogenetic "guide tree" that will be used to select the order of incremental sequence alignment and the score weights for the sequences. The guide tree is constructed by measuring the evolutionary distances between every sequence pair and using the Neighbor-Joining method to calculate a provisional phylogenetic tree. This is important for a good multiple alignment for two reasons. First, the best overall alignment is reached most efficiently by choosing the correct order in which to build interim alignments, and the guide tree is used to make these choices. Second, the sequences in most multiple alignments should not have equal weight in contributing to column match scores. The guide tree is used to calculate appropriate sequence weights. These issues are detailed below.

Gap placement: When comparing a set of related protein sequences, the positions of indels (gaps)... [ in development ]

Multiple Alignment Order: As a multiple alignment progresses, more information becomes available to guide each new step of the alignment. Because generating an accurate alignment is easiest with closely-related sequences, such alignments are more likely to be exactly correct than those among more distantly-related sequences. More distant relationships are easier to infer correctly when one or both of the groups being aligned contains multiple closer-related members. This is because such groups contain information that typifies the group as a whole, not merely a single member of the group. For example, one sequence might contain an atypical W residue in a position where its relatives have L or V residues (L to V is a favorable substitution). When aligning the atypical sequence to these close relatives, sequence context around the W residue will cause the W to be correctly aligned to the L and V column of the other sequences. In subsequent rounds of multiple alignment, the W residue will now form part of the PSSM score in a column that is predominated by L and V residues. Now, if a distantly related sequence is encountered that happens to have the less-typical W residue it will align more reliably because of the composite picture of this column. If the sequence with the W residue were aligned to a distantly related sequence first, the W might get "sucked" off to align to a nearby W that happens to be nearby in the other sequence. (I chose W as the atypical residue here because it is usually the most conserved residue and a W - W match thus scores very high. This high score will make it most prone to being sucked away, even if the alternative alignment requires a gap.) [ this should have figures to clarify ]

Another way of thinking about the same issue is evolutionarily. If we have a set of current-day sequences, some of these sequences will have diverged from each other relatively recently and thus will be more closely related. If we align the most closely-related pair of sequences first, we can derive an alignment that, in effect, represents the last shared ancestor of these two sequences. Once these ancestors are represented by interim alignments, they can be aligned to each other better. Here's a simple example. Think of a nearly symmetric tree relating 4 sequences as shown in the figure below (this example was generated with the simulation model in Bonsai). The two intermediate ancestral sequences (A and B) are just as similar to each other as each pair of current-day sequences (e.g. 0-2 and 0-1). In contrast, a current-day sequence from one side of the tree is twice as distant from a current-day sequence from the other side of the tree (e.g. 0-2 and 1-2). If we align each cluster of current-day sequences first, then align the resulting representations of their immediate ancestors (A and B), at each step we will have aligned the most similar available pair of sequences. It should be intuitively apparent that aligning more closely-related sequences is easier than more distantly-related ones. [note - the "representation of an ancestral sequence" is a bit loosely interpreted here, but is essentially correct]

Bonsai selects alignment order in the same way as the Neighbor-Joining algorithm builds trees, from the leaves back to the root. The tree below shows the order in which six example sequences will be aligned to each other, with an alignment order number at each node on the tree (i.e. alignment 3 will align sequence E to the aligned profile of C/D). What matters most in this case is that each of the clusters (A, B, F and C, D, E) will be aligned within themselves first and only at the end aligned to each other. This last stage of the alignment is the most challenging because the distance separating the two clusters is large. It's accuracy will be maximized by using the sum of information for each cluster when comparing the two clusters. Because the branches are close together and the distances are fairly similar, the exact order of alignment within each of these clusters doesn't matter much in this case.

Simple specific case example: The simplest case of a multiple alignment involves just three sequences. Most of the time, two of these sequences will be more closely related to each other than to the third. Intuitively it is obvious that we should align these two first, because their alignment will be most robust. This is more than an intuition: if we started by aligning two distantly related sequences first, features of this less-certain alignment will be "locked" into place and these incorrect inferences will then influence the way in which the third sequence is aligned to the first two, effectively inflating the importance of inaccuracies in the first alignment.

Sequence Example: [ in development! ]

TRGALLWSAPEYF

TRGALLWSAPEYF

TRGALLWSAPEYF

Multiple Alignment Sequence Weights: A multiple-alignment guide tree serves a second important purpose: to assign each sequence an alignment weight. This function is especially important if the user does not carefully choose sequences to align, since current sequence databases are heavily biased toward specific groups of organisms (most notably mammals and bacteria). A sequence's weight determines how heavily its residue is counted in the construction of the PSSM containing it. When some sequences are closely related to each other, each should be weighted less than more distance sequences. To understand why, consider an extreme example illustrated by the tree below. Suppose we have already aligned the three lower sequences to create a interim alignment and we are now ready to align one of the upper sequences to the profile. If we take no account of sequence-relatedness in the profile (in constructing its PSSM), the residues represented together by Seq 0-1-2 and 0-1-1 will be counted twice as heavily as those in Seq 0-2. This is clearly undesirable, since 0-1-2 and 0-1-1 are nearly identical because they diverged very recently. In effect, 0-1-2 and 0-1-1 are close to representing a single evolutionary sequence instance rather than two. This problem could be minimized by carefully removing sequences prior to the alignment so that only a representative set is used, but this throws away information and is unenforceable (and no doubt would usually be violated by the typical user). Fortunately Altschul et al. described an effective solution to this problem. Before progressive multiple sequence alignment, a guide tree is constructed (the same guide tree that is used to select alignment order) and the guide tree is used to assign sequence weights. In the example below, Seq 0-1-2 and 0-1-1 would be assigned just over half the weight of Seq 0-2 (if 0-1-2 and 0-1-1 were identical they would get exactly half the weight). When a PSSM of these three sequences is constructed, these weights are used to determine their relative contributions to residue scores. The result is that PSSM scores automatically account correctly for sequence-relatedness. As the tree underlying a PSSM gets increasingly complex, the assignment of sequence weights becomes complex. Altschul et al. described a simple tree-interpreting algorithm that does this correctly regardless of tree complexity. Obviously, the construction of an accurate guide tree is essential for the weighting process to be accurate.


James H. Thomas, Department of Genome Sciences, University of Washington
8/7/2002