Understanding Phylogenetic Trees

Phylogenetic trees of sequences serve two related purposes: they encapsulate the evolutionary history of a set of sequences, and they provide an intuitive graphical summary of the often complex relationships among those sequences. In principle, these two purposes need not be congruent; the fact that they are is perhaps the strongest proof of the theory of evolution. I will discuss phylogenetic trees from the point of view of sequence history, but nearly everything I say applies just as well to an accurate graphical summary of sequence relationships.

The two fundamental features of a tree are the time line and the gene duplication. In Bonsai, the time line runs from left to right, in part because this is the general convention in English for showing time lines, and in part because this orientation makes it easiest to label the tree (in the scientific literature every possible tree orientation is seen). Gene duplications are vertical lines that specify the time at which a particular duplication took place. Before the duplication there was a single evolving sequence, and after the duplication there were two. In the sample tree shown below, there were 5 gene duplications that produced 6 genes from the original single gene.

Such duplications have two main sources: speciation and genetic rearrangement. In the case of speciation, the gene replicates as usual, but copies that end up in different individual organisms become genetically isolated from each other, for example by geographic separation. From that point on, the two copies of the gene evolve separately and, if both lineages survive to the current day, we may sample each lineage by sequencing. [Note: strictly speaking, each species carries a population of genes rather than a single gene, but this complication can be ignored for this discussion.] In the tree below, each subcluster of three sequences arose by two speciation events from a single parent sequence.

In the case of genetic rearrangement, a transposition, tandem duplication, genome duplication, or some similar event gives rise to two copies of a gene that was previously present in a single copy. Unlike speciation, these two copies are present in the same individual organism. If this new genotype spreads and becomes fixed in the population (and both copies of the gene persist with time), the two copies of the gene will from then on evolve separately. We may sample both copies in the genome sequence of any species that originated from this lineage after the time of the genetic rearrangement. In the tree below, the most ancient duplication was of this type: an ancestral Eag/Erg-like gene duplicated, and the two copies of the gene persisted for a long time before the two speciation events (which of course affected both ancestral genes, producing two clusters of three sequences with similar divergence patterns). We can clearly infer that this was a gene duplication because many divergent organisms (mammals, flies, worms) have clearly recognizable copies of each gene type.

Digression on tree orientation: Unfortunately, there is no standard for tree orientation. Probably the most common orientation of rooted trees in the scientific literature is with the root at the top and the sequences at the bottom. This direction for time is not intuitive to most people and, worse, sequence names in English are very difficult to fit on the twigs of the tree (this orientation might be perfect for Chinese labels). My vote for the most distasteful tree orientation (and the most common in the popular literature) is the tree that branches upward the way real trees do, with the "highly evolved" species at the top (where "highly evolved" actually means most similar to humans). In reality, of course, all extant sequences and organisms have evolved for exactly the same amount of time. Even DNA isolated from a mastodon or an ancient human is the merest eyelash away from the present. Such upright trees also have the same labeling problem as inverted trees.

Tree roots: So far, all trees in Bonsai are "rooted", meaning that at the oldest time (toward the left) all the sequences join at one ancestral sequence. In the scientific literature, you will also sometimes see unrooted trees, which look more like damaged spider webs than trees, spreading out from an ill-defined center. If the theory of evolution is correct, all real phylogenetic trees are rooted somewhere in the distant past, but the true historical root can be very difficult to infer when it is sufficiently ancient or when data are inadequate. An unrooted tree is a concession to this uncertainty: one can be certain that the true historical root lies on one of the branches of the unrooted tree, but which one is unspecified. Because Bonsai insists on rooting all trees, you should take the validity of the root position with a grain of salt. In most cases, as a biologist you will have little trouble placing the root correctly based on your knowledge of the organisms and sequence families you are analyzing. Bonsai tries to guess the "best" root by making the tree leaves as even as possible, which is usually correct. In the example tree shown above, consideration of the known phylogenies of the organisms and the robustness of the sequence clusters makes it clear that the root is correctly placed.

Topological complexity: Even a large phylogenetic tree looks so clean and simple that it is nearly impossible to intuit just how many alternative tree topologies are possible. For example, the relatively innocuous-looking tree below (14 sequences) has about 8 trillion possible topologies. These are true topological variations: no number of node rotations can interconvert them. (Conversely, any two trees that can be interconverted by node rotations are considered identical trees, though such rotations can be helpful in making the display intuitive to humans.)
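
As a rough check on the figure quoted above for 14 sequences, the number of distinct rooted, strictly bifurcating trees with n labeled leaves is given by the standard double-factorial formula (2n-3)!! = 1 x 3 x 5 x ... x (2n-3). The sketch below assumes that this is the count the text refers to (rooted trees, every duplication producing exactly two descendants). The function name `rooted_topologies` is hypothetical and is not part of Bonsai.

```python
def rooted_topologies(n_leaves: int) -> int:
    """Count rooted, strictly bifurcating tree topologies with labeled leaves.

    Uses the standard formula (2n - 3)!! = 1 * 3 * 5 * ... * (2n - 3).
    Assumption: trees that differ only by node rotations count as one tree.
    """
    if n_leaves < 1:
        raise ValueError("a tree needs at least one leaf")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3 (empty product for n <= 2).
    for k in range(3, 2 * n_leaves - 2, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (3, 4, 6, 10, 14):
        print(n, rooted_topologies(n))
```

For 14 leaves this gives 7,905,853,580,625, which agrees with the figure of about 8 trillion.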

 

 


Distance Correction: As two sequences diverge by mutation, the rate at which the percent identity or alignment score drops is not linear with time. Since we are most interested in divergence time, this means that tree distances computed from raw pairwise similarity measures must be corrected to represent true distances.

To understand the non-linearity of sequence similarity with divergence time, consider two sequences that have just arisen by duplication of a single sequence (and are thus identical). Over time, mutations will accumulate in each sequence and the two sequences will become less similar. Compare the effect of the first amino acid change with that of one that occurs much later in time. Because the sequences are initially identical, the first amino acid change will invariably reduce the percent identity (and alignment score) for the two sequences. Later in time, the two sequences will already differ from each other substantially, and the next amino acid change in one of the two sequences is less likely to change the percent identity. Specifically, as time passes it is more and more likely to affect an amino acid that is already different between the two proteins. In that case, the change will either leave the percent identity unchanged or actually change the amino acid back to match the other protein (increasing their identity). The effect is gradual: the more different the two sequences are, the less each new amino acid change contributes to further divergence.

To make this slightly quantitative, consider two identical proteins of 100 amino acids each. The very first amino acid change will cause a drop of exactly 1% in their aligned sequence identity. Now consider the same two proteins a long time later, when they share only 50% aligned sequence identity. In this case, 50% of the time a new amino acid change will cause the same 1% drop in sequence identity (when it affects one of the conserved amino acids), but the other 50% of the time the change will not affect their identity or will actually increase it. The effect is analogous to the law of diminishing returns. [see note A for issues of score-based distances]
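
The diminishing effect of later changes can be seen in a small simulation. This is a minimal sketch under assumptions that are not in the text above: a uniform 20-letter amino acid alphabet, no insertions or deletions, and each change hitting a random position in a randomly chosen one of the two sequences and always replacing the residue with a different one. The names `simulate_identity`, `length`, `n_changes` and `seed` are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def simulate_identity(length=100, n_changes=400, seed=0):
    """Return the fraction of identical positions after each successive change.

    Two copies start out identical; every step substitutes one residue in one
    of the two copies (simplified, uniform substitution model).
    """
    rng = random.Random(seed)
    a = [rng.choice(AMINO_ACIDS) for _ in range(length)]
    b = list(a)  # the duplicate starts as an exact copy
    identities = [1.0]
    for _ in range(n_changes):
        seq = rng.choice((a, b))      # which copy mutates
        pos = rng.randrange(length)   # which position is hit
        # Replace the residue with a different one, chosen uniformly.
        seq[pos] = rng.choice([x for x in AMINO_ACIDS if x != seq[pos]])
        matches = sum(x == y for x, y in zip(a, b))
        identities.append(matches / length)
    return identities


if __name__ == "__main__":
    ids = simulate_identity()
    for step in (0, 1, 50, 100, 200, 400):
        print(f"after {step:3d} changes: identity = {ids[step]:.2f}")
```

In this model the first change always lowers identity by exactly 1%, while late changes mostly hit positions that already differ, so the identity curve flattens even though changes keep arriving at a constant pace.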

A practical simulation also serves to demonstrate this point. The two trees shown below are exactly the same except for the application of a correction for distance. The underlying set of eight divergent sequences was generated in silico by three rounds of duplication, each followed by identical densities of amino acid changes (an average of 60% change at each step), i.e. the real historical tree was perfectly symmetrical with equal branch lengths at each step. It is obvious that the distance-corrected tree to the left is a much better approximation to reality. (Note: even perfect alignment and tree building will not produce perfect symmetry, since the evolutionary model is appropriately stochastic.)

Based on these considerations, it is clear we must correct for this non-linearity if we hope to approximate a tree that represents the passage of time. Ideally such a correction might be analytically determined, but in practice this is too complex and an empirical approach must be taken. However, a few general observations can be made to guide the distance correction function. We can draw a curve that describes the relationship between an uncorrected and a corrected distance measure, with uncorrected values (x) on the X-axis and corrected values (y) on the Y-axis. If uncorrected distances are defined to lie between 0 and 1, it is clear that the curve should be close to linear (y = x) near the origin and that as x increases, y should increase more and more rapidly, producing a curve that bends upward.
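
One simple function with the shape just described is y = -ln(1 - x), the form used in the standard Poisson correction for multiple substitutions. It is used here only to illustrate the shape of such a curve: it is an assumption for illustration, not the empirical correction that Bonsai applies. The function name `corrected_distance` is hypothetical.

```python
import math


def corrected_distance(x: float) -> float:
    """Illustrative correction curve y = -ln(1 - x) for 0 <= x < 1.

    Assumption: chosen only for its shape (y ~ x near 0, steep as x -> 1);
    this is not Bonsai's empirical correction function.
    """
    if not 0.0 <= x < 1.0:
        raise ValueError("uncorrected distance must lie in [0, 1)")
    return -math.log1p(-x)  # log1p(-x) = ln(1 - x), accurate for small x


if __name__ == "__main__":
    for x in (0.01, 0.1, 0.3, 0.5, 0.7, 0.9):
        print(f"uncorrected {x:.2f} -> corrected {corrected_distance(x):.3f}")
```

Small uncorrected distances are left almost unchanged, while large ones are stretched strongly, which is the qualitative behavior any such correction needs.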

 


Note A) In most places, I discuss divergence in terms of "percent identity", partly because this is in common use and partly because it is more intuitive to grasp. However, "percent identity" is not nearly as simple as it might first appear, and it is not the best measure of sequence similarity. The Bonsai default is to use alignment scores rather than percent identity.


James H. Thomas, Department of Genome Sciences, University of Washington
7/31/2002