Bonsai Reference List [in development]

Needleman and Wunsch (1970). A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. J. Mol. Biol. 48, 443-453.
This is the breakthrough paper describing an algorithm that can find optimal sequence alignments given a scoring scheme. It applies a "dynamic programming" algorithm to the problem of global alignment. After the appearance of this paper, various modifications of the method were introduced to give optimal regional alignments, overlap matches, repeat matches, use more realistic score functions, and to increase the computational speed or memory use by the algorithm.

Smith and Waterman (1981). Identification of Common Molecular Subsequences. J. Mol. Biol. 147, 195-197.
An adaptation of the Needleman-Wunsch algorithm that permits best local scoring alignments to be found.

Pascarella and Argos (1992). Analysis of Insertions/Deletions in Protein Structures. J. Mol. Biol. 224, 461-471.
This paper uses a large dataset of structurally superimposable proteins to analyze the length, compositions, and positions of indels. Results from the paper are one factor used to guide Bonsai multiple alignment indel assignments.

Henikoff and Henikoff (1992). Amino acid substitution matrices from protein blocks. PNAS 89, 10915-10919.
This paper reported the best and current standard set of amino acid score matrices, using an improved protein dataset and a sounder method for generating score matrices for distant relatives.

Lawrence, Altshcul, Boguski, Liu, Neuwald, and Wootton (1993). Detecting subtle sequence signals: a Gibbs Sampling strategy for multiple alignment. Science 262, 208-214.
This paper develops the best method for finding short gapless motifs that are distantly related to each other, using a method called Gibbs Sampling.

Thompson, Higgins, and Gibson (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nuc. Acids Res. 22, 4673-4680.
This paper reports on the algorithms used by the most popular multiple alignment program CLUSTAL W (later given a glossier user interface in CLUSTAL X). Bonsai is generally modeled on CLUSTAL algorithms and approaches.

Altschul, Carroll, and Lipman (1989). Weights for Data Related by a Tree. J. Mol. Biol. 207, 647-653.
For progressive multiple alignments to work well with typical sets of sequences, the sequences or pair-comparisons must be weighted appropriately. If this isn't done then closely-related sequences are over-counted in the alignments because they appear 2 or more times. In the colorful words of the authors: "Imagine buying several [identical] copies of the morning paper to be assured that what it says is true..... Three different editions of the morning paper may be useful for learning about an event, but [even] they generally shouldn't outweigh two other, independent sources. Given an idea of the inter-relationships in a set of data, it should be possible to weight the individual observations so as to discount redundant information." This classic paper derives two different approaches to weighting observations. The most commonly used method (and the less exact one) relates the sequences by a tree and assigns weights to each sequence that can then be used to assign score weights in building position-specific score matrices during multiple alignment. This is the method that Bonsai currently uses (as well as Clustal). Their second method (which I hope to implement soon) is exact and involves giving a weight, not to
individual sequences, but to each pair of sequences (a total of N * (N - 1) / 2 pairs) when they are compared to build a sum-of-pairs score. The commonly used sequence-weighting method captures only some of the sequence relationships, whereas the latter captures all of the available information. However, it is trickier to program and is somewhat slower to run.