Protein Score Matrices
In standard pair alignments, a score is assigned for each possible amino acid pair that may be aligned. The score is obtained from a look-up table usually called a score matrix. To view any of the score matrices used by Bonsai, open Pair Alignment Settings, select a score matrix, and click on the "View Score Matrix" button. To find a score value in the look-up table, simply find the row for one amino acid and the column for the other amino acid and read their intersection. The score matrix can be presented as a full table, but the values are symmetric so it doesn't matter which amino acid of the aligned pair you use for the row or column (note A).
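The look-up described above can be sketched in a few lines of Python. This is only an illustration, not Bonsai's implementation: the names `HALF_MATRIX`, `pair_score`, and `ungapped_score` are hypothetical, and the table holds just a handful of BLOSUM62 entries. Because the values are symmetric, each unordered pair is stored once and the two residues are sorted before the look-up.

```python
# A small, illustrative subset of BLOSUM62 scores (not the full table).
# Each unordered pair is stored once, with its residues in alphabetical order.
HALF_MATRIX = {
    ("A", "A"): 4,
    ("A", "R"): -1,
    ("I", "L"): 2,
    ("L", "L"): 4,
    ("R", "R"): 5,
    ("W", "W"): 11,
}


def pair_score(a, b, matrix=HALF_MATRIX):
    """Return the score for an aligned pair; residue order does not matter."""
    key = tuple(sorted((a, b)))
    return matrix[key]


def ungapped_score(seq1, seq2, matrix=HALF_MATRIX):
    """Sum the pair scores over the columns of a gap-free alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))


# Swapping the two sequences gives the same total.
print(ungapped_score("ALW", "RIW"), ungapped_score("RIW", "ALW"))
```

Storing only half of the table is a design choice that makes the symmetry explicit; a full square table would give the same answers.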
How are the scores obtained? A good biochemist could make up a plausible set of matrix scores based on their knowledge of amino acid structures and functions, but we can do much better by letting evolution tell us the scores. This approach also has the advantage of allowing alignments to be assigned true probabilities (see below). A number of attempts have been made to construct such score matrices, but I will describe only the most successful of these, the BLOSUM matrices (because they are consistently the best-performing general matrices, they are the only general-purpose matrices Bonsai provides). [NOTE - user-specified matrices and special-use matrices should be added; thus far the PHAT transmembrane matrix is included] The general approach to defining a score is intuitive and simple. Find sets of sequences whose alignment is thought to be correct, measure the frequency of each amino acid pair in the columns of the alignment, and divide by the frequency expected if residues from the same set were paired at random. For computational reasons, the score numbers that appear in the matrices are derived from this ratio by taking the base 2 logarithm, usually multiplying by a small constant scale factor, and rounding to the nearest integer (see B for more details). The beauty of this approach is that it takes into account all of the imaginable influences on amino acid substitution frequencies, including selective forces and ease of mutation from one codon to another. Assuming that your source sequence sets (often called the training set) are representative and correctly aligned, the method is guaranteed to give the "correct" average substitution scores.
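The observed-over-expected calculation can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to build the published matrices: the function name `log_odds_scores` and the toy columns are hypothetical, the expected frequencies are taken from the residue frequencies within the counted pairs, and the sequence clustering and weighting used for the real BLOSUM matrices are omitted. The `scale` parameter is also an assumption; it is included because published matrices are generally multiplied by a constant before rounding (BLOSUM62, for example, is in half-bit units).

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_scores(columns, scale=1.0):
    """Integer log-odds scores from alignment columns.

    `columns` is a list of strings, one per alignment column, each holding
    the residues found in that column of a trusted alignment.
    """
    # Count every unordered residue pair within each column.
    pair_counts = Counter()
    for col in columns:
        for a, b in combinations(col, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Frequency of each residue among the counted pairs.
    residue_freq = Counter()
    for (a, b), f in observed.items():
        residue_freq[a] += f / 2
        residue_freq[b] += f / 2

    scores = {}
    for (a, b), f in observed.items():
        # A mismatched pair can arise in two orders, hence the factor of 2.
        expected = residue_freq[a] * residue_freq[b]
        if a != b:
            expected *= 2
        scores[(a, b)] = round(scale * math.log2(f / expected))
    return scores


# Toy data: four columns from a hypothetical alignment of four sequences.
toy_columns = ["AAAA", "AAAA", "SSSS", "AASS"]
print(log_odds_scores(toy_columns, scale=2.0))
```

Pairs seen more often than chance predicts get positive scores and pairs seen less often get negative scores. Pairs that never occur in the training columns are simply absent from the result here; a real procedure has to handle them explicitly.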
Why are there 12 different BLOSUM matrices? The different BLOSUM matrices directly measure the amino acid pair frequencies from sets of aligned proteins that differ in their degree of similarity. Each matrix is assigned a number that indicates the percent identity threshold used to build it. For BLOSUM90, training sequences that are more than 90% identical to one another are clustered and counted as a single sequence, so the measured pair frequencies reflect sequences that are no more than about 90% identical; BLOSUM85 uses an 85% threshold, and so on. Lower-numbered matrices therefore describe more diverged sequences. This approach produces matrices that are known to perform better than earlier matrices (e.g. the PAM matrices), especially for more diverged sequences. The earlier matrices were made in a similar manner to the BLOSUM matrices, but used only closely related sequences and then assumed that more divergent sequences would follow an identical relative pattern of substitutions (though with higher frequencies of change, of course). This is now known to be incorrect, though it was a reasonable first approximation. The BLOSUM matrices also have the advantage of a much larger set of training sequences than was available at the time the PAM matrices were calculated.
In general, the BLOSUM matrix used for a particular alignment should approximately match the percentage of identical residues in the aligned sequences (e.g. BLOSUM50 for sequences that are 50% identical).
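This rule of thumb can be sketched as a short helper. Everything here is an assumption made for illustration: the function names are hypothetical, percent identity is computed over columns where neither sequence has a gap (other definitions exist), and the default `available` tuple is a made-up list, not the set of matrices that Bonsai actually ships.

```python
def percent_identity(aligned1, aligned2, gap="-"):
    """Percent of identical residues over columns without a gap."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned1, aligned2):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * identical / compared


def suggest_blosum(identity, available=(45, 50, 62, 80, 90)):
    """Pick the matrix number closest to the observed percent identity.

    `available` is a hypothetical list used only for this example.
    """
    return min(available, key=lambda n: abs(n - identity))


identity = percent_identity("ACDE-F", "ACDQWF")
print(identity, suggest_blosum(identity))
```

Since the percent identity is only known once an alignment exists, one practical approach is to align with a mid-range matrix first, measure the identity, and realign if a different matrix is suggested.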
[@@ add semi-technical section on BLOSUM scores.]
A) The symmetry of the score matrix follows intuitively from simple considerations. When only two sequences are compared, it is impossible to know at a mismatch position which residue is ancestral and which has undergone a substitution (in fact, neither residue may be ancestral). Another way to look at this is that, when aligning two sequences, you are not finding the best alignment for "that" sequence with "my" sequence, you are comparing the two sequences - exactly the same alignment would be found if you found the best alignment for "my" sequence with "that" sequence.
B) Explain sum of pairs, expected value, bits, and probabilistic interpretation. [in development]