Lecture notes for February 26: Phylogenetic Trees


A phylogenetic tree or phylogeny expresses inferred relationships among species. It assumes that species form by splitting apart; otherwise it wouldn't be tree-shaped. We are only barely beginning to deal with the fact that some species form by hybridization or other non-treelike mechanisms and so don't fit on trees.

Inferring a phylogeny is a statistical process. There is always a possibility of error; all we can do is extract as much information from available data as possible.

A major obstacle to figuring out the right phylogeny is the sheer number of possible phylogenies. There are 945 distinct unrooted trees (10,395 rooted) for just seven species, and the count grows faster than exponentially: for 25 species it is already around 10^28, and by about 55 species it exceeds the number of atoms in the observable universe. Until computers became available, many phylogeny problems were completely hopeless; now they're just very hard.
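As a quick illustration (a sketch, not part of the original notes), the number of distinct unrooted bifurcating trees for n labeled species is (2n-5)!! = 3 x 5 x 7 x ... x (2n-5), which is easy to compute but grows explosively:

    # Number of distinct unrooted bifurcating trees for n labeled species:
    # (2n-5)!! = 3 * 5 * 7 * ... * (2n-5).  (Rooted trees: (2n-3)!!.)
    def num_unrooted_trees(n):
        count = 1
        for k in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n-5
            count *= k
        return count

    for n in (7, 10, 25, 50):
        print(n, num_unrooted_trees(n))    # 945; ~2 million; ~2.5e28; ~2.8e74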

Parsimony methods. The first phylogenies were made using visible traits of organisms. For example, Class Mammalia groups together organisms sharing traits such as milk and hair, and lacking traits such as feathers.

When different traits disagree, there are two possible solutions. The first is to argue that one trait is reliable and the other is not, usually based on how well each trait agrees with the other traits of the same organisms. For example, scales might be proposed as a defining trait, but the pangolin (the so-called scaly anteater) has many traits in common with mammals and very few in common with reptiles or fishes, so we conclude that scales are not a reliable trait.

The problem with this is that it tends to reinforce whatever beliefs we currently hold (for example, that the pangolin is not a reptile). It's unsatisfyingly subjective. For example, chimps and gorillas share some traits (knuckle-walking) to the exclusion of humans, while chimps and humans share others (meat-eating) to the exclusion of gorillas. Which trait is right? Historically this was settled by prejudice: of course chimps and gorillas are more closely related to each other than either is to humans!

The rule of parsimony was developed to overcome this subjectivity. Parsimony involves collecting many traits and preferring the tree which demands the fewest changes (evolutionary events) to explain the traits. (Subjectivity in choosing the traits can still be a problem.)
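To make "fewest changes" concrete, here is a small sketch (toy data invented for illustration, not from the lecture) of the Fitch algorithm, which counts the minimum number of changes one trait requires on a given tree; the parsimony score of a tree is this count summed over all traits, and the preferred tree is the one with the lowest total:

    # Fitch small-parsimony count for one character on a rooted binary tree.
    # A tree is a nested tuple; leaves are observed states (e.g. nucleotides).
    def fitch(tree):
        """Return (possible ancestral states, minimum changes) for this subtree."""
        if not isinstance(tree, tuple):               # leaf: observed state, 0 changes
            return {tree}, 0
        left_set, left_changes = fitch(tree[0])
        right_set, right_changes = fitch(tree[1])
        changes = left_changes + right_changes
        common = left_set & right_set
        if common:                                    # children agree: keep intersection
            return common, changes
        return left_set | right_set, changes + 1      # disagreement costs one change

    # Toy example: ((human, chimp), (gorilla, orangutan)) with states at one site.
    tree = (("A", "A"), ("G", "A"))
    print(fitch(tree))    # ({'A'}, 1): one change is enough on this tree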

A limitation of parsimony is that it implicitly assumes that all traits contribute the same amount of information and all evolve at the same rate. When this is not true, parsimony can give actively misleading answers. The most common symptom is long-branch attraction: if two branches in the tree are very long, they tend to be joined together erroneously even when they don't belong together. (The region of tree space where this happens is often called the Felsenstein Zone, after Dr. Felsenstein here at UW, who first demonstrated the problem mathematically.)

Distance methods. Instead of analyzing each trait one at a time, we could boil down all the information into some form of distance between two species. For example, we could count how many sites vary in the hemoglobin gene between each pair of species. The distances can then be used (in over 100 slightly different ways) to construct a tree.
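For example, here is a sketch of the simplest such distance, the proportion of differing sites between each pair of aligned sequences (the sequences are invented for illustration):

    # Proportion of differing sites (p-distance) between each pair of sequences.
    seqs = {                       # toy aligned sequences, invented for illustration
        "human":   "ACGTACGTAC",
        "chimp":   "ACGTACGTCC",
        "gorilla": "ACGAACGTCC",
    }

    def p_distance(a, b):
        diffs = sum(1 for x, y in zip(a, b) if x != y)
        return diffs / len(a)

    names = sorted(seqs)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            print(a, b, p_distance(seqs[a], seqs[b]))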

This loses some information compared to analyzing individual traits, but it allows us to overcome some of the limits of parsimony. In particular, we can use corrected distances which take into account the possibility of multiple changes. In general, the success of distance methods depends more on their correction scheme than on the actual method used.
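One widely used correction is the Jukes-Cantor formula (shown here as an illustration; the notes do not say which correction scheme was meant), which converts an observed proportion of differing sites p into an estimated number of substitutions per site, allowing for multiple changes at the same site:

    from math import log

    def jukes_cantor(p):
        """Jukes-Cantor corrected distance from an observed p-distance.
        Undefined (infinite) when p >= 0.75, the saturation point."""
        if p >= 0.75:
            return float("inf")
        return -0.75 * log(1 - 4 * p / 3)

    print(jukes_cantor(0.10))   # ~0.107: a few multiple hits assumed
    print(jukes_cantor(0.50))   # ~0.824: heavy correction

Note that the correction blows up as p approaches 0.75, the point at which two sequences look essentially random with respect to each other.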

Maximum Likelihood methods. A third possibility is to analyze traits one at a time, but using an explicit model of how they evolve. In practice this is not possible with morphological traits (they are too complicated) but is worth trying with DNA or protein sequence data. These methods are called Maximum Likelihood methods because they attempt to find the tree on which the data are most likely, given the model. These methods require huge amounts of computer power but can give very good results if the model is accurate enough.
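As a rough sketch of what "the tree on which the data are most likely" means, here is the likelihood of a single site for two species under the simple Jukes-Cantor model (the tree and branch lengths are invented for illustration; real programs repeat this for every site and every candidate tree and multiply the per-site likelihoods together):

    from math import exp

    def jc_prob(i, j, t):
        """Jukes-Cantor probability that base i becomes base j along a branch
        of length t (expected substitutions per site)."""
        same = 0.25 + 0.75 * exp(-4 * t / 3)
        diff = 0.25 - 0.25 * exp(-4 * t / 3)
        return same if i == j else diff

    BASES = "ACGT"

    def site_likelihood(x, y, t1, t2):
        """Likelihood of observing bases x and y in two species whose lineages
        split t1 and t2 ago, summing over the unknown ancestral base."""
        return sum(0.25 * jc_prob(a, x, t1) * jc_prob(a, y, t2) for a in BASES)

    # Identical bases are more likely on short branches than differing ones.
    print(site_likelihood("A", "A", 0.05, 0.05))
    print(site_likelihood("A", "G", 0.05, 0.05))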

An example model for DNA would be: Sites evolve at two different rates, with fast sites making up 10% of all sites and evolving three times as fast as the rest. Transitions are five times more frequent than transversions; otherwise, all changes are equally likely. The overall frequencies of A, C, G and T are 0.25, 0.27, 0.23 and 0.25. Evolution of adjacent sites is independent.

This is quite complex, but we can see that it's still a simplification; in particular, adjacent sites aren't really independent in a coding sequence. We have to hope the model is good enough to get accurate answers even though it is not perfect.
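The model above is essentially an HKY-style model with two rate categories. A sketch of its substitution-rate matrix, using the parameter values given in the notes and taking "five times more frequent" as a rate ratio of 5 for transitions (the code itself is illustrative, not from the lecture):

    # Substitution-rate matrix for the model described above, treated as HKY-style.
    FREQS = {"A": 0.25, "C": 0.27, "G": 0.23, "T": 0.25}   # base frequencies from the notes
    KAPPA = 5.0                                            # transition/transversion rate ratio
    TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

    def rate(i, j):
        """Instantaneous (unnormalized) rate of change from base i to base j."""
        if i == j:                         # diagonal: each row of a rate matrix sums to zero
            return -sum(rate(i, k) for k in "ACGT" if k != i)
        factor = KAPPA if (i, j) in TRANSITIONS else 1.0
        return factor * FREQS[j]

    for i in "ACGT":
        print([round(rate(i, j), 3) for j in "ACGT"])

    # The two rate classes would be handled by scaling this matrix: slow sites
    # (90% of sites) use it as-is, fast sites (10%) use it multiplied by 3.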

Testing trees. How good is an inferred tree? There are some guidelines:

When these things don't happen, the tree has to be taken as possible or probable but not certain, and interpreted cautiously. For example, Vigilant et al. published an important paper about the tree of human mtDNA in which they showed a tree that had only Africans near the root, and suggested that this meant that humans evolved in Africa. However, Swofford et al. showed that the same data were almost equally well explained by a large number of other trees, including some which put the root unexpectedly in Papua New Guinea.

The bootstrap is a means of testing whether the tree is broadly supported by the data, or only by a few details of it. New data sets are made by resampling with replacement from the original. Trees are inferred from the new data sets, and a consensus of those trees is created. If the new trees agree on a relationship, it is considered well supported (for example, we could agree to treat as solid any relationship that shows up in 95% of the new trees).
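A minimal sketch of the resampling step (the alignment here is invented; each resampled data set would then be handed to whatever tree-building method is being tested, called build_tree below only as a hypothetical placeholder):

    import random

    def bootstrap_columns(alignment, n_replicates, seed=0):
        """Resample alignment columns with replacement.
        `alignment` maps species names to equal-length sequences; yields new alignments."""
        rng = random.Random(seed)
        length = len(next(iter(alignment.values())))
        for _ in range(n_replicates):
            cols = [rng.randrange(length) for _ in range(length)]
            yield {name: "".join(seq[c] for c in cols)
                   for name, seq in alignment.items()}

    # Each replicate would be fed to a hypothetical build_tree(), and relationships
    # appearing in, say, 95% of the replicate trees would be treated as well supported.
    alignment = {"human": "ACGTACGTAC", "chimp": "ACGTACGTCC", "gorilla": "ACGAACGTCC"}
    for replicate in bootstrap_columns(alignment, 3):
        print(replicate)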

Things to think about.

Gary Olsen said in his recent talk here at UW:

When trees disagree, there are four possible responses.

When is each of these appropriate?