Alignment Type: Sum of Pairs [or Minimum Entropy - not yet implemented] Generally, the default Sum of Pairs will be preferable because it incorporates probabilities for all possible amino acid changes, derived from confirmed alignments. Minimum Entropy simply attempts to create the alignment columns that have the lowest amount of residue variability, without accounting for the specific residues involved or how similar they are to each other. It may be preferable if you don't want to include residue-pair information, for example if you suspect that the average amino acid changes observed across proteins are inappropriate for these sequences. See the Multiple Alignment Methods help for more details. [NOTE - Minimum Entropy is now undergoing an overhaul and may not be available]
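The difference between the two column scores can be illustrated with a minimal Python sketch. This is not Bonsai's code: the function names and the toy scoring function are assumptions made up for illustration, and gaps are ignored. It shows that a column of similar residues and a column of dissimilar residues can have the same entropy while receiving very different sum-of-pairs scores.

```python
import math
from collections import Counter
from itertools import combinations


def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def column_sum_of_pairs(column, pair_score):
    """Add up a pairwise residue score over every pair of rows in the column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def toy_score(a, b):
    """Hypothetical scores, NOT a real matrix: identity 2, I/L/M/V swaps 1, else -1."""
    if a == b:
        return 2
    if a in "ILMV" and b in "ILMV":
        return 1
    return -1


conservative = "IIVV"  # two similar residues
radical = "IIDD"       # two dissimilar residues

# Entropy sees only the residue frequencies, so both columns look alike.
print(column_entropy(conservative), column_entropy(radical))
# Sum of pairs uses residue similarity, so the conservative column scores higher.
print(column_sum_of_pairs(conservative, toy_score),
      column_sum_of_pairs(radical, toy_score))
```

With these toy scores, both columns have an entropy of 1 bit, but the conservative column scores 8 and the radical column scores 0.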
Score Matrix: This setting is relevant only for the Sum of Pairs method. For proteins, the Bonsai default is to use an adaptive score matrix, which means that the matrix most suitable for each step in the multiple alignment is chosen automatically. If you want to override this to use a specific matrix throughout, select the matrix here. Adaptive will almost always be the right choice for proteins. For DNA, the Bonsai default is to use a score matrix based on the Kimura model of DNA evolution. See the Score Matrix section of general Bonsai help for more details.
Guide Tree Type: Set the type of algorithm used to determine pairwise distances, which will be used to guide the order and weights for the Multiple Alignment. Full Pair Align is always preferable (see Pair Align Help for specifics) but Turbo Words is a great deal faster. This may be important when aligning very large numbers of long sequences, because the number of pairs to compare increases in approximate proportion to N^2 / 2, where N is the number of sequences. See below for a description of the way Turbo Words works. My limited testing thus far shows that it isn't too awful as long as there are no long repeats in the sequences. If you experience memory problems, you may also want to turn off saving pairwise guide alignments, but doing so will disable some aspects of the guide tree viewing and calculations.
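To make the growth in the number of pairwise comparisons concrete, here is a small Python sketch. The exact count of distinct pairs among N sequences is N * (N - 1) / 2, which approaches N^2 / 2 as N grows. The function name is made up for illustration.

```python
def pair_count(n):
    """Number of distinct unordered pairs among n sequences: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Print the number of pairwise comparisons for a few set sizes.
for n in (10, 45, 100, 1000):
    print(n, pair_count(n))
```

For example, 45 sequences require 990 pairwise comparisons, while 1000 sequences require 499500.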
Gap penalty and deferral checkboxes: Gap penalties in multiple alignments cannot be set explicitly as they can for pair alignments, but you can influence some parameters that affect their use. Check "Use Gap Clustering" to cause gaps (somewhat arbitrarily) to cluster together more strongly. Uncheck "Use Residue-based Gap Bias" to turn off the default behavior, in which gaps tend to open more at some residues than at others.
Deferred Alignment Score Cutoff: Bonsai will follow the guide tree to determine alignment order with the exception of sequences that score poorly with all others (outliers). This number determines how stringent the deferral cutoff is; larger values will cause fewer deferrals.
Display checkboxes: Check the display types that you would like this Multiple Alignment to produce. Any combination can be displayed. Currently you must set the correct displays BEFORE doing the alignment.
Turbo Words: This is a method for very rapid assessment of sequence relatedness (to the best of my knowledge, original to Bonsai). It is very sloppy and very fast, but surprisingly accurate given a few conditions that are typically true for protein coding sequences (but less so for DNA). The approach is to make a list of all short amino acid "words" found in each of the two proteins and count how many are shared. The default word size is 5. If N is the number of words of size 5 (approximately the sequence length), listing the words for each protein runs in time proportional to 5 * N. The words are sorted prior to comparison, which runs in time proportional to N * ln N. Because they are pre-sorted, the word comparison itself runs in time approximately proportional to N. For typical pair alignments, this represents an improvement of several orders of magnitude over full dynamic programming alignment. In practice, Turbo Words is so fast that the time to compute and display the tree predominates over the pairwise comparison. Moderate inaccuracy is typical for Turbo Words (though I suspect some heuristic tuning can minimize this), and serious inaccuracy can arise if one of the proteins contains a repeat of a domain in the other protein or if the sequences are short. [note - I may implement a correction for the repeat problem based on auto-matching]. The guide tree may be used only to weight and order sequences for the subsequent multiple alignment, and my limited testing shows that the Turbo Words tree performs quite well in that role.
Words in Turbo Words are not simple strings of amino acids. Small groups of amino acids are considered identical for word generation. There are two types of groups, tight and loose. In the default loose grouping, broader sets of similar amino acids are considered identical. The exact sets are:
Loose: "AGP", "CST", "DENQ", "FY", "H", "ILMV", "KR", "ST", "W"
Tight: "A", "C", "DE", "FY", "G", "H", "ILMV", "KR", "NQ", "P", "ST", "W"
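The word-counting idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Bonsai's implementation: the function names are hypothetical, a residue listed in more than one group (S and T in the loose list above) is assigned to the first group that contains it, repeated words are counted once per matched occurrence, and the conversion of a shared-word count into a tree distance is not shown because it is not described here.

```python
LOOSE = ["AGP", "CST", "DENQ", "FY", "H", "ILMV", "KR", "ST", "W"]
TIGHT = ["A", "C", "DE", "FY", "G", "H", "ILMV", "KR", "NQ", "P", "ST", "W"]


def group_map(groups):
    """Map each residue to the first letter of its group.

    Assumption: if a residue appears in more than one group,
    the first group listed wins.
    """
    mapping = {}
    for group in groups:
        for residue in group:
            mapping.setdefault(residue, group[0])
    return mapping


def sorted_words(seq, mapping, size=5):
    """Reduce the sequence to group letters, then list and sort all words."""
    # Residues outside every group are left unchanged (an assumption).
    reduced = "".join(mapping.get(r, r) for r in seq.upper())
    return sorted(reduced[i:i + size] for i in range(len(reduced) - size + 1))


def shared_word_count(words_a, words_b):
    """Count words common to two sorted lists in a single linear pass."""
    i = j = shared = 0
    while i < len(words_a) and j < len(words_b):
        if words_a[i] == words_b[j]:
            shared += 1
            i += 1
            j += 1
        elif words_a[i] < words_b[j]:
            i += 1
        else:
            j += 1
    return shared


tight = group_map(TIGHT)
a = sorted_words("MKDLSTA", tight)
b = sorted_words("LRELTSA", tight)
print(shared_word_count(a, b))
```

The two toy sequences differ at five of seven positions, yet every substitution stays within a tight group, so both reduce to the same string and share all three of their 5-letter words. Note that word positions play no part in the count.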
If this method seems useful, I plan to investigate its properties more systematically. Some obvious parameters to test include varying the definition and tightness of letter grouping, permitting some letters to appear in more than one group (A is a good candidate for this), and using auto-matching to correct for internal duplications.
The Turbo Words algorithm shares some properties with BLAST but is much sloppier and faster. The major difference is that Turbo Words takes no account of the positions of words in the proteins. Other differences include a much cruder form of substitution scoring and the lack of any match extension mechanism. I'm not sure how much faster it is than BLAST, but the difference is likely to be substantial where construction of the BLAST indexes is part of the computation time, as would be the case for Bonsai.
It is instructive to construct a pairwise tree using the sample sequence set 3, which includes 45 sequences of average length ~350. Turbo Words computes all pairwise distances almost instantaneously on my Pentium 3 laptop, while a full dynamic programming tree takes about 45 seconds.