Bonsai glossary, with notes and examples.

alignment model - the set of assumptions made to permit computing an "optimal" alignment for two or more sequences. For example, nearly all such models assume that each residue in a protein evolves independently of all the others (clearly a simplification).

ancestral - the state characterizing the ancestor of the sequences that are under consideration. Specifically, this would be the sequence at the last time a set of sequences were literally identical (the same physical DNA molecule, or proteins they encode). For example, if two current-day sequences encode "RFWALPH" and "RFWQLPH" it is plausible to suppose that the ancestral sequence encoded "RFWXLPH" where "X" could be either A or Q. Note however that, in the absence of additional information, the ancestral "X" could be a third amino acid that changed in both current sequences. Generalizing, any of these residues could have been any amino acid that changed one or more times in both sequences to give rise to the current sequences. Intuitively (and correctly) it is more likely the R, F, W, L, P, and H residues were found in the ancestor and didn't change, rather than happening to change to the same residue in both from some different ancestral residue, but this is not necessarily true.

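bit - the unit of a log-odds score taken in base 2. An increase of one bit corresponds to a 2-fold increase in the probability ratio that the score represents.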

BLOSUM matrices - a set of all possible amino acid pair scores, derived by Henikoff and Henikoff, that are probably the best general score matrices to date. There are several such score matrices that reflect different degrees of similarity in the proteins they are derived from.

bootstrap - a method for assessing the statistical significance of the positions of branches in a phylogenetic tree that is inferred by a pairwise method (e.g. Neighbor-Joining or UPGMA). A bootstrap starts with a set of sequences in which every possible sequence pair has been aligned. For each aligned pair, it samples scores from random positions in the alignment, adding the scores. This is repeated until the pseudo-alignment is the same length as the real alignment. When all the pairs have been sampled, it converts the scores to distances and computes a tree. This entire process is repeated many times and the frequency with which particular tree features are observed is taken as a measure of the probability that the feature is correct. See Bootstrap in Detail for more.
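
The resampling step described above, for a single aligned pair, might look like the following minimal sketch. It assumes the per-position scores of the real alignment are already available as a list of numbers, and that positions are drawn with replacement (usual for a bootstrap, but not stated in this glossary). The function names are hypothetical and are not Bonsai's own; converting scores to distances and computing the tree are left out.

```python
import random

def bootstrap_pair_score(position_scores, rng):
    """Sum scores drawn from random positions (with replacement) until the
    pseudo-alignment is the same length as the real alignment."""
    n = len(position_scores)
    return sum(position_scores[rng.randrange(n)] for _ in range(n))

def bootstrap_replicates(position_scores, replicates, seed=0):
    """Return one resampled total score per bootstrap replicate."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    return [bootstrap_pair_score(position_scores, rng) for _ in range(replicates)]
```

In a full bootstrap, each replicate would resample every aligned pair in this way before the scores are converted to distances and a tree is computed.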

dynamic programming - a method for progressively building a set of scores or probabilities. In Bonsai, one or another variant of dynamic programming is the basis for all alignments.
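
As an illustration of progressively building a set of scores, here is a minimal sketch of one textbook variant, a global pair alignment score with a single per-residue indel penalty. It is not taken from Bonsai, whose own variants are not described here; the function name and the match, mismatch, and gap values are arbitrary assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score for strings a and b (illustrative values)."""
    # prev[j] holds the best score for aligning the processed prefix of a
    # with the first j residues of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first i residues of a aligned against nothing
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,   # align the two residues
                            prev[j] + gap,        # indel opposite a[i-1]
                            curr[j - 1] + gap))   # indel opposite b[j-1]
        prev = curr
    return prev[-1]
```

Each cell is computed only from cells already filled in, which is what makes the method progressive.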

gap extend (gap extend penalty) - the score given for adding a single residue to an already existing indel.

gap open (gap open penalty) - the score given for the appearance of an indel at a particular position in an otherwise ungapped alignment. Also see gap extend penalty and gap penalty.

gap penalty - the score given for incorporating a gap residue (indel) in a sequence alignment. Most programs, including Bonsai, differentiate between the first gap residue (gap open) and additional gap residues (gap extend), a scheme called affine gap penalties.
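
Following the gap open and gap extend definitions above, the total score for one indel can be sketched as below. The function name and the default penalty values are hypothetical, chosen only for illustration; they are not Bonsai's defaults.

```python
def indel_score(length, gap_open=-10, gap_extend=-1):
    """Affine score for one indel of the given length (illustrative values)."""
    if length < 0:
        raise ValueError("indel length cannot be negative")
    if length == 0:
        return 0
    # The first gap residue costs gap_open; each further residue costs gap_extend.
    return gap_open + (length - 1) * gap_extend
```

With these values an indel of length 3 scores -12, whereas three separate indels of length 1 would score -30, so a single longer indel is preferred over several short ones.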

global score - the score for an entire pair alignment or multiple alignment. Under some pair alignment conditions, this represents the quality of the best alignment of the ENTIRE sequences and under other pair alignment conditions, this represents the best regional alignment (which may be as long as the entire sequences or a much shorter segment depending on how global the sequence similarity is).

guide tree - a provisional phylogenetic tree used to guide a multiple alignment. The guide tree serves two purposes: it determines the order of progressive alignment, and it determines the weighting of the sequences for alignment scoring, which prevents over-counting of closely related sequences.

indels - insertion/deletion points in a sequence. In sequence alignments these are often referred to as "gaps", but this term is potentially misleading and Bonsai uses "indels" instead. The term "gap" suggests that one sequence is deficient with respect to another, requiring representation of this difference by gap symbols in the shorter sequence. This way of thinking about the situation is misleading because, without further information, it is impossible to know whether the two sequences came to differ from their shared ancestral sequence by the deletion of residues at this site in one sequence or insertion of residues in the other sequence.

local score - the score for an alignment that represents only a small local region. Often, this is the score at one specific residue, but it can also represent the sum or mean of scores in a small window around a specific residue.

minimum entropy - a method that seeks to find a minimum entropy state in comparing multiple alignment columns. [add more]

orthologue - a sequence in another species that shares a direct common ancestor with the current sequence. For some time after a speciation event this relationship is easily inferred and cleanly defined since the two genes differ only modestly. As evolutionary time passes, the orthology relationship becomes less obvious and eventually becomes ill-defined because of duplication and divergence. In the evolutionary limit, all sequences are probably derived from a single common ancestor (thought to be RNA). See also paralogue.

pair alignment - an alignment between exactly two sequences. The alignment may include gaps in one sequence or the other or both, at the ends or internally, such that residues in the two sequences are paired with each other appropriately.

PAM matrices - historically important amino acid score matrices, not provided with Bonsai because the BLOSUM matrices are more accurate. The PAM matrices were important because they were the first empirically derived score matrices and they have been very successful in guiding alignments.

paralogue - a sequence in the same species that shares a direct common ancestor with the current sequence. As with orthologues, this relationship is clear soon after a gene duplication within a species, because the two sequences differ only modestly. See also orthologue.

phylogenetic tree - a diagrammatic representation of the evolutionary relationships among a set of entities, which in Bonsai are invariably sequences. Phylogenetic trees are also commonly used to represent relationships among species of organisms.

profile - a set of aligned sequences and associated position-specific score information, possibly with associated information such as a tree. The alignment itself is a set of lines of characters, one line for each aligned sequence. In Bonsai, the residues in the sequence are represented by standard 1-letter codes and gaps are represented by '-'.

residue codes - for display and file exports, sequence residues are represented by standard 1-letter codes as detailed below. For various reasons, most notably computational speed, Bonsai represents residues internally (behind the scenes) as integers (A is 0, C is 1, etc., in alphabetical order). This representation is invisible to the normal user, but if you look at source code or serialized Bonsai files this will help you understand them.

DNA code:
a - adenine
c - cytosine
g - guanine
t - thymine

RNA code:
a - adenine
c - cytosine
g - guanine
u - uracil

Protein code:
A - alanine (ala)
C - cysteine (cys)
D - aspartate (asp)
E - glutamate (glu)
F - phenylalanine (phe)
G - glycine (gly)
H - histidine (his)
I - isoleucine (ile)
K - lysine (lys)
L - leucine (leu)
M - methionine (met)
N - asparagine (asn)
P - proline (pro)
Q - glutamine (gln)
R - arginine (arg)
S - serine (ser)
T - threonine (thr)
V - valine (val)
W - tryptophan (trp)
Y - tyrosine (tyr)
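
The integer representation of protein residues mentioned above (A is 0, C is 1, and so on in alphabetical order) can be sketched as follows. This is an assumption-laden illustration: the names are hypothetical, and how Bonsai itself encodes indels or any other symbols is not described here.

```python
# The 20 protein codes listed above, in alphabetical order.
PROTEIN_CODES = "ACDEFGHIKLMNPQRSTVWY"
CODE_TO_INT = {code: index for index, code in enumerate(PROTEIN_CODES)}

def encode(sequence):
    """Convert 1-letter protein codes to integers (A is 0, C is 1, ...)."""
    return [CODE_TO_INT[residue] for residue in sequence]

def decode(integers):
    """Convert integers back to 1-letter protein codes."""
    return "".join(PROTEIN_CODES[i] for i in integers)
```

Integer codes can be used directly as row and column indices into a score matrix, which is one reason such a representation is fast.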

score matrix - the table of values that is used to determine the score assigned to any pair of aligned residues. For example, an amino acid substitution score matrix would be a 20 x 20 table of values representing every possible amino acid change. For computational speed, these are typically integers and represent log odds.
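
A single integer log-odds entry can be sketched as below. The function name, the input frequencies, and the choice of half-bit units (two score units per bit) are assumptions for illustration only; this is not the procedure used to derive any particular matrix.

```python
import math

def log_odds_score(observed, expected, units_per_bit=2):
    """Integer log-odds score for one residue pair.

    observed: frequency with which the pair is seen in trusted alignments
    expected: frequency with which the pair would occur by chance
    """
    return round(units_per_bit * math.log2(observed / expected))
```

A pair seen four times more often than chance gets a positive score (4 in half-bit units), and a pair seen four times less often than chance gets the corresponding negative score.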

sum of pairs - a method for evaluating scores in multiple alignment columns. It evaluates all amino acid pairs in the two columns being compared and assigns a total score derived from a score matrix of the same type used in pair alignments (e.g. a BLOSUM matrix).
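
A minimal sketch of a sum of pairs comparison between two alignment columns follows. The toy scoring function stands in for a real score matrix such as a BLOSUM matrix, and skipping pairs that involve an indel ('-') is an assumption, as is the absence of sequence weighting; none of the names come from Bonsai.

```python
def toy_score(x, y):
    """Stand-in for a score matrix lookup: 2 for identity, -1 otherwise."""
    return 2 if x == y else -1

def sum_of_pairs(column_a, column_b, score=toy_score):
    """Total score over every residue pair drawn from the two columns."""
    total = 0
    for x in column_a:
        for y in column_b:
            if x == "-" or y == "-":
                continue  # assumption: pairs involving an indel are skipped
            total += score(x, y)
    return total
```

For example, comparing a column "AA" with a column "AQ" evaluates the pairs A/A, A/Q, A/A, and A/Q.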


James H. Thomas, Department of Genome Sciences, University of Washington
8/1/2002