Lecture Summary March 5: Coalescence (Part 2)

The probability of a coalescence in one generation among k lineages is k(k-1)/(4N), and the expected length of the time interval one needs to wait for that coalescence is E(u) = 4N/(k(k-1)) generations [the distribution of u can be approximated by an exponential distribution]. All these formulas are based on the assumption that N is large and that N >> k (there is a typo in the printed lecture note of March 2!).
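As a quick numerical check of these formulas, here is a minimal Python sketch. The function names and the example values of k and N are illustrative assumptions, not part of the lecture; the waiting time is drawn from the exponential approximation mentioned above.

```python
import random

def coalescence_probability(k, N):
    """Per-generation probability that two of k lineages coalesce (assumes N >> k)."""
    return k * (k - 1) / (4 * N)

def expected_waiting_time(k, N):
    """Expected number of generations until the next coalescence, 4N/(k(k-1))."""
    return 4 * N / (k * (k - 1))

def draw_waiting_time(k, N, rng):
    """Draw one waiting time u from the exponential approximation."""
    return rng.expovariate(coalescence_probability(k, N))

rng = random.Random(1)   # fixed seed, chosen arbitrarily
k, N = 5, 1000           # hypothetical sample size and population size
draws = [draw_waiting_time(k, N, rng) for _ in range(20000)]
print(expected_waiting_time(k, N))   # 200.0
print(sum(draws) / len(draws))       # simulated mean, should be close to 200
```

Note how quickly the waiting time shrinks with k: with many lineages the first coalescences happen fast, and most of the tree depth comes from the last few lineages.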

Insert: How to infer population parameters

We can use genetic data from contemporary populations, for example DNA sequences or microsatellite data. With these we could construct a genealogy (using phylogenetic methods) and then naively use the depth of this genealogy [whose expectation is 4N(1-1/k)] to infer the population size. This estimate will not be very good because of the large variance of the coalescent: the same population size N produces many different tree shapes and depths. A better approach is to integrate over all possible genealogies, each weighted by its probability. A general way (not the only one) to do this is based on maximum likelihood: we calculate the probability of our data for a specific parameter value and then maximize this probability by changing the parameter. For inferences based on the coalescent we use
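To see why the naive depth-based estimate is unreliable, the following sketch simulates many tree depths for one fixed N by summing exponential waiting times while the number of lineages drops from k to 1. The values k = 10 and N = 1000, the seed, and the function name are assumptions made only for this illustration.

```python
import random
import statistics

def simulate_tree_depth(k, N, rng):
    """Sum the waiting times while the number of lineages drops from k to 1."""
    depth = 0.0
    for j in range(k, 1, -1):
        rate = j * (j - 1) / (4 * N)   # coalescence rate with j lineages
        depth += rng.expovariate(rate)
    return depth

rng = random.Random(42)   # fixed seed, chosen arbitrarily
k, N = 10, 1000           # hypothetical sample size and population size
depths = [simulate_tree_depth(k, N, rng) for _ in range(20000)]
mean_depth = statistics.mean(depths)
sd_depth = statistics.stdev(depths)

# Naive estimate of N from each single tree: depth / (4 (1 - 1/k))
naive_N = sorted(d / (4 * (1 - 1 / k)) for d in depths)
print(mean_depth, sd_depth)                       # mean near 4N(1-1/k), about 3600
print(naive_N[len(naive_N) // 20], naive_N[-len(naive_N) // 20])  # 5% and 95% quantiles
```

The standard deviation is of the same order as the mean, so a single genealogy can suggest a population size far from the true N.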

Likelihood(Parameter P) = Probability(Data given Parameter) = L(P) = Prob(D|P)
L(P) = Sum over all genealogies G of (Prob(G|P) Prob(D|G))

where G is a specific genealogy. Prob(G|P) is the probability of the genealogy given the coalescent with the parameters P, for example N and growth or migration rates. Prob(D|G) is the probability of the (sequence) data for a specific tree G; this is the same quantity one would calculate in a phylogenetic maximum likelihood method. Summing over all genealogies exactly is not practical for datasets larger than 2-3 sequences. For datasets of reasonable size one needs to rely on an approximation to this likelihood using Markov chain Monte Carlo sampling. This is computationally challenging and was not really feasible before the 1990s, although the technique [Metropolis-Hastings algorithm] goes back to Metropolis, Teller and Rosenbluth (1953).
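The idea of averaging over genealogies can be sketched for the smallest possible case, two sequences, where a genealogy is just one coalescence time t. The sketch below is plain Monte Carlo averaging over genealogies drawn from the coalescent, not the Metropolis-Hastings sampler used in practice. It assumes a known mutation rate mu per site, a Poisson number of differences on the two branches, and hypothetical data (3 differences in 10000 sites); all function names are made up for the example. For two sequences a closed form exists, which is used here only as a check.

```python
import math
import random

def prob_data_given_genealogy(s, t, mu, n_sites):
    """Prob(D|G): Poisson probability of s differences when the two
    sequences coalesce t generations ago (two branches of length t)."""
    lam = 2 * mu * n_sites * t
    return math.exp(-lam) * lam ** s / math.factorial(s)

def mc_likelihood(N, s, mu, n_sites, n_samples, rng):
    """Average Prob(D|G) over genealogies G drawn from the coalescent."""
    total = 0.0
    for _ in range(n_samples):
        t = rng.expovariate(1 / (2 * N))   # k = 2: rate k(k-1)/(4N) = 1/(2N)
        total += prob_data_given_genealogy(s, t, mu, n_sites)
    return total / n_samples

def exact_likelihood(N, s, mu, n_sites):
    """Closed form for two sequences, used only to check the approximation."""
    theta = 4 * N * mu * n_sites
    return (theta / (1 + theta)) ** s / (1 + theta)

mu, n_sites, s = 1e-8, 10000, 3   # hypothetical mutation rate, length, data
rng = random.Random(7)            # fixed seed, chosen arbitrarily
for N in (2500, 7500, 20000):     # candidate parameter values
    approx = mc_likelihood(N, s, mu, n_sites, 50000, rng)
    print(N, approx, exact_likelihood(N, s, mu, n_sites))
```

Among the three candidate values, N = 7500 gives the highest likelihood for these made-up data; a real analysis would search over N (and other parameters) instead of comparing a handful of values.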

The mutation rate and the population size are confounded in such data: half the mutation rate with double the population size gives similar trees (measured in expected mutations) to double the mutation rate with half the population size. Without outside information only their product can be estimated.
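A small simulation makes this confounding concrete. The sketch below averages the expected number of differences between two sequences over simulated coalescence times, once for (N, mu) and once for (2N, mu/2). The parameter values, the seed, and the function name are assumptions for illustration only.

```python
import random

def mean_pairwise_differences(N, mu, n_sites, n_reps, seed):
    """Average expected number of differences between two sequences,
    2 * mu * n_sites * t, over simulated coalescence times t."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        t = rng.expovariate(1 / (2 * N))   # coalescence time for k = 2
        total += 2 * mu * n_sites * t
    return total / n_reps

# Same seed in both runs, so only the parameter values differ.
a = mean_pairwise_differences(N=10000, mu=1e-8, n_sites=10000, n_reps=10000, seed=3)
b = mean_pairwise_differences(N=20000, mu=0.5e-8, n_sites=10000, n_reps=10000, seed=3)
print(a, b)   # the two settings give the same value up to rounding
```

Both settings have the same product 4 N mu, so the data cannot distinguish between them.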

Extensions of the basic coalescent framework

The coalescent framework was at first used only for inference of a single population parameter, the population size. Hudson (1990; Gene genealogies and the coalescent process, Oxford Surveys in Evolutionary Biology 7:1-44) and others showed how it could be extended to include other population genetic forces, such as growth, migration, recombination, selection, speciation etc.

Migration

We need to augment our genealogies with migration events and incorporate migration rates, so that the probability of a time interval now depends on the different population sizes and the migration rates. This procedure allows us to estimate population parameters for rather complicated population models with many parameters. If 4Nm << 1, the lineages within each population will most likely coalesce separately, before any migration event.
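The role of 4Nm can be illustrated by comparing the two kinds of events that can end a time interval. The sketch below assumes k lineages sitting in one population of size N, a per-lineage migration rate m (looking backward in time), and independent competing events; the function names and parameter values are hypothetical.

```python
def event_rates(k, N, m):
    """Per-generation rates for k lineages in one population of size N.
    m is the assumed per-lineage migration rate (backward in time)."""
    coalescence = k * (k - 1) / (4 * N)
    migration = k * m
    return coalescence, migration

def prob_coalescence_first(k, N, m):
    """Probability that the next event is a coalescence, not a migration."""
    coalescence, migration = event_rates(k, N, m)
    return coalescence / (coalescence + migration)

N = 1000   # hypothetical population size
for m in (0.000001, 0.00025, 0.01):
    # first column is 4Nm, second the chance that two lineages coalesce first
    print(4 * N * m, prob_coalescence_first(2, N, m))
```

For two lineages this probability simplifies to 1/(1 + 4Nm), so it approaches 1 when 4Nm << 1 and drops toward 0 when migration is frequent.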

Recombination

So far we assumed that a tip on a genealogy is a non-recombining gene. Long nuclear sequences have most likely undergone recombination in the past. Sites to the left of a recombination event do not need to come from the same genealogy as the sites to the right. We expect a different genealogy once we move far enough along a nuclear sequence that the condition 4N_e r < 1, i.e. r < 1/(4N_e), no longer holds, where r is the recombination rate across that distance. In humans we can expect a recombination around every 1000 base pairs, so there are many hundreds of thousands of different gene trees for our genome.
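The condition r < 1/(4N_e) can be turned into a rough sequence length. The sketch below assumes a uniform recombination rate per base pair; the values of N_e and of the per-base-pair rate are hypothetical round numbers chosen only to show the order of magnitude, and the function names are made up.

```python
def max_recombination_rate(Ne):
    """Largest r (per generation, over the whole stretch) with 4*Ne*r < 1."""
    return 1 / (4 * Ne)

def stretch_length_bp(Ne, r_per_bp):
    """Length in base pairs over which r stays below 1/(4*Ne),
    assuming a uniform recombination rate r_per_bp per base pair."""
    return max_recombination_rate(Ne) / r_per_bp

Ne = 10000        # hypothetical effective population size
r_per_bp = 1e-8   # hypothetical recombination rate per bp per generation
print(stretch_length_bp(Ne, r_per_bp))   # a few thousand base pairs
```

With these assumed values the stretch comes out at a few thousand base pairs, the same order of magnitude as the figure quoted above.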

Speciation and the coalescent

"Lineage sorting"

Individual gene phylogenies and species phylogenies do not necessarily show the same pattern. The differences can be a result of

Genetic drift: the ancestral population was very large
Balancing selection

Time of speciation and ancestral population size

Several studies attempt to estimate the time of species divergence based on a phylogenetic tree. This does not take into account the size of the ancestral population. Lineages of a gene tree sampled from different species first have to be in the same (ancestral) species before they can coalesce; therefore the gene divergence always predates the species divergence. If the ancestral population is very large and the speciation event is rather recent, the difference can be rather large. (An overview: Edwards and Beerli (2000) Perspective: Gene divergence, population divergence, and the variance in the coalescence time in phylogeographic studies. Evolution 54(6): 1839-1854.)
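The size of this effect follows from the waiting-time formula E(u) = 4N/(k(k-1)): two lineages in the ancestral population need on average 2N generations to coalesce, so the expected gene divergence is the species divergence time plus 2N. A minimal sketch, with hypothetical divergence time and ancestral population sizes, and assuming one lineage sampled per species and a constant ancestral population size:

```python
def expected_gene_divergence(species_divergence, N_ancestral):
    """Species divergence time plus the expected coalescence time of two
    lineages in the ancestral population, 4N/(k(k-1)) with k = 2, i.e. 2N."""
    return species_divergence + 2 * N_ancestral

T = 100000   # hypothetical number of generations since speciation
for N_anc in (1000, 50000, 500000):
    # ratio of expected gene divergence to species divergence
    print(N_anc, expected_gene_divergence(T, N_anc) / T)
```

With a small ancestral population the gene tree barely overestimates the species divergence, while with a large one the expected gene divergence is many times older than the speciation event.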