Introduction: The Bonsai multiple alignment methods build up a multiple alignment by first aligning subsets of the sequences to generate interim alignments called "profiles". Profiles are then aligned with each other stepwise to incrementally build the full multiple sequence alignment. Critical elements in such progressive alignments include the order in which to align sequences and interim profiles, the method for generating scores for aligned columns, the weights assigned to various sequences for the scoring, and the way gap positions and lengths are scored. Progressive alignment algorithms are the only ones that are computationally efficient enough to work with large numbers of sequences. If you align more than 20 or so sequences of moderate length you will see that even progressive alignments are demanding. For example, alignment of the 45 sequences, ~340 amino acids each, provided with Bonsai as "Sample Sequences 3", required 30 seconds to align on a current fast desktop machine (Pentium 4, 1.4 GHz, Feb., 2002). Alignment of 150 sequences of similar length required 250 seconds. These times increase as nearly the square of sequence number x sequence length. Various short-cuts are available (Bonsai provides one called Turbo Words) but for any final alignment these should not be used.
Sum of Pairs Scoring: When aligning a sequence to a profile or two profiles to each other, Bonsai determines the score for one column aligned to a residue or another column by the method called Sum of Pairs. In essence, this method compares every residue in one column with every residue in the other column and sums the scores for each residue pair according to a BLOSUM score matrix. The score matrix used is chosen dynamically according to the degree of similarity of the sequences in the larger profile. An example of a set of profile columns is shown below. For the column indicated by the arrow, the L residue will scored twice and the V residue scored once against each residue found in a column to which it aligns. An important twist on this method is sequence weighting as described below. (for the initiated, Bonsai uses standard Sum of Pairs scoring, with the improvements of sequence weighting and adaptive score matrix assignment)
The Position-Specific Score Matrix (PSSM): When the Sum of Pairs scores for every profile column has been computed, the resulting series of score values is called a PSSM. In an alignment this is used in a fashion very similar to the way a simple score matrix (e.g. BLOSUM62) is used for pair alignments. I find it conceptually simplest to think of the PSSM as being the score matrix for a "collapsed" profile, that is one in which all the residues in each column have been compressed into a single weighted composite residue with associated rules for scoring it against other residues during an alignment. You can view all of this information for a profile by selecting columns and using "View : Column Alignment Values".
Guide Tree Construction: Bonsai begins a multiple alignment by constructing a phylogenetic "guide tree" that will be used to select the order of incremental sequence alignment and the score weights for the sequences. The guide tree is constructed by measuring the evolutionary distances between every sequence pair and using the Neighbor-Joining method to calculate a provisional phylogenetic tree. This is important for a good multiple alignment for two reasons. First, the best overall alignment is reached most efficiently by choosing the correct order in which to build interim alignments, and the guide tree is used to make these choices. Second, the sequences in most multiple alignments should not have equal weight in contributing to column match scores. The guide tree is used to calculate appropriate sequence weights. These issues are detailed below.
Gap placement: When comparing a set of related protein sequences, the positions of indels (gaps) determine which residues will align. Bonsai positions gaps using methods very similar to those introduced first by ClustalW (see Thompson et al. 1994). Gaps are penalized by an affine penalty (gap open different from gap extend) and these penalties are substantially altered during alignment by sequence context. In particular, the residues directly flanking a new gap are considered using the findings of Pascarella (Pascarella and Argos, 1992), and stretches of hydrophylic residues cause reduced gap open penalty. [note - these two adjustments are probably related and might result in over-compensation for sequence context.]
Multiple Alignment Order: As a multiple alignment progresses, more information becomes available to guide each new step of the alignment. Because generating an accurate alignment is easiest with closely-related sequences, such alignments are more likely to be exactly correct than those among more distantly-related sequences. More distant relationships are easier to infer correctly when one or both of the groups being aligned contains multiple closer-related members. This is because such groups contain alignment information that typifies the group as a whole, not merely a single member of the group (this is the same principle that applies to using profiles to align new sequences to a family). For example, one sequence might contain an atypical W residue in a position where its relatives have L or V residues (L to V is a favorable substitution). When aligning the atypical sequence to these close relatives, sequence context around the W residue will cause it to be correctly aligned to the L and V column of the other sequences. In subsequent rounds of multiple alignment, the W residue will be scored in a column that is predominated by L and V residues, and is likely to be correctly aligned. If the sequence with the W residue were aligned to a distantly related sequence first, the W might get "sucked" off to align to a nearby W that happens to be close by in the other sequence. I chose W as the atypical residue here because it is usually the most conserved residue and a W - W match thus scores very high. This high score will make it most prone to being sucked away, even if this alignment requires a gap.
Another way of thinking about the same issue is evolutionarily. If we have a set of current-day sequences, some of these sequences will have diverged from each other relatively recently and will thus be more closely related. If we align the most closely-related pair of sequences first, we can derive an alignment that, in effect, represents the last shared ancestor of these two sequences. Once these ancestors are represented by interim profiles, they can be aligned to each other better. Here's a simple example. Think of a nearly symmetric tree relating 4 sequences as shown in the figure below (this example was generated very easily with the simulation model in Bonsai). The two intermediate ancestral sequences (A and B) were just as similar to each other as the 2 current-day sequences on one side of the tree are to each other (e.g. 0-2 and 0-1). In contrast, a current-day sequence from one side of the tree is twice as distant from a current-day sequence from the other side of the tree (e.g. 0-2 and 1-2). If we align each pair of current-day similar sequences first, then align the resulting representations of their immediate ancestors (A and B), at each step we will have aligned the most similar available pair of sequences. [note - the "representation of an ancestral sequence" is a bit loosely interpreted here, but is essentially correct]
Sequence Example: [ in development ]
TRGALLWSAPEYF
TRGALLWSAPEYF
TRGALLWSAPEYF
Multiple Alignment Sequence Weights: A multiple-alignment guide tree serves a second, equally important purpose: to assign each sequence an alignment weight. This function is especially important if the user does not carefully choose sequences to align, since current sequence databases are heavily biased toward specific groups of organisms (most notably mammals and bacteria). A sequence's weight determines how heavily its residue is counted in the construction of the PSSM containing it. When some sequences are closely related to each other, each should be weighted less than more distance sequences. To understand why, consider an extreme example illustrated by the tree below. Suppose we have already aligned the three lower sequences to create a profile and we are now ready to align one of the upper sequences to the profile. If we take no account of sequence-relatedness in the profile (in constructing its PSSM), the residues represented together by Seq 0-1-2 and 0-1-1 will be counted twice as heavily as those in Seq 0-2. This is clearly undesirable, since 0-1-2 and 0-1-1 are nearly identical because they diverged very recently. In effect, 0-1-2 and 0-1-1 together are close to representing a single evolutionary sequence instance. This problem could be reduced by carefully removing sequences from profile so that only a representative set is used, but this throws away information and is unenforceable (and no doubt would usually be violated by the user). Fortunately Altschul et al. (1989) described an effective solution to this problem. Before progressive multiple sequence alignment, a guide tree is constructed (the same guide tree that is used to select alignment order) and the guide tree is used to assign sequence weights. In the example below, Seq 0-1-2 and 0-1-1 would be assigned just over half the weight of Seq 0-2 (if 0-1-2 and 0-1-1 were identical they would get exactly half the weight). When a PSSM of these three sequences is constructed, these weights are used to determine their relative contributions to residue scores. The result is that PSSM scores automatically account correctly for sequence-relatedness. As the tree underlying a profile gets increasingly complex, the assignment of sequence weights becomes complex. Altschul et al. (1989) describe a simple tree-interpreting algorithm that does this correctly regardless of tree complexity. Obviously, the construction of an accurate guide tree is essential for the weighting process to be accurate.