Lecture Summary March 5: Coalescence (Part 2)
With k lineages, the probability of a coalescence in one generation is k(k-1)/(4N), and
the expected length of the time interval one needs to wait for that coalescence is
E(u) = 4N/(k(k-1)) [the distribution can be approximated by an exponential
distribution]; all these formulas are based on the assumption that N is
large and that N >> k (there is a typo in the printed lecture note
of March 2!).
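The two formulas above can be evaluated directly. The following is a minimal
sketch; the function names and the value of N are illustrative choices, not
part of the lecture.

```python
def coalescence_probability(k, N):
    """Per-generation probability that some pair of the k lineages coalesces."""
    return k * (k - 1) / (4 * N)


def expected_wait(k, N):
    """Expected number of generations until the next coalescence, E(u)."""
    return 4 * N / (k * (k - 1))


N = 10_000  # illustrative population size, chosen so that N >> k
for k in (2, 5, 10):
    print(k, coalescence_probability(k, N), expected_wait(k, N))
```

The expected wait is the reciprocal of the per-generation probability, which is
why the waiting time grows quickly as the number of lineages k drops.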
Insert: How to infer population parameters
We can use genetic data from contemporary populations, for example DNA
sequences or microsatellite data.
With these we could construct a genealogy (using
phylogenetic methods) and then naively use the depth of
this genealogy [whose expectation is 4N(1-1/k)] to infer the population size.
This estimate will not be
very good because of the large variance of the coalescent: the
same size N produces many different tree shapes and depths.
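A small simulation can illustrate how variable the depth of a genealogy is for
a fixed N. This is a sketch under the exponential approximation of the waiting
times; the function names, the seed and the values of k and N are assumptions
made for the example.

```python
import random
import statistics


def simulate_depth(k, N, rng):
    """Draw one tree depth: sum of exponential waits while going from k to 1 lineages."""
    depth = 0.0
    for j in range(k, 1, -1):
        rate = j * (j - 1) / (4 * N)  # coalescence rate with j lineages
        depth += rng.expovariate(rate)
    return depth


def depth_summary(k, N, replicates=20000, seed=1):
    """Mean and standard deviation of the depth over many simulated genealogies."""
    rng = random.Random(seed)
    depths = [simulate_depth(k, N, rng) for _ in range(replicates)]
    return statistics.mean(depths), statistics.stdev(depths)


mean_depth, sd_depth = depth_summary(k=10, N=1000)
print("expected depth 4N(1-1/k):", 4 * 1000 * (1 - 1 / 10))
print("simulated mean and sd:   ", mean_depth, sd_depth)
```

The standard deviation comes out at a sizable fraction of the mean, so a single
genealogy says little about N.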
A better approach to find an estimate is based on integrating over
all possible genealogies that are weighted by their probability.
A general way (not the only one) to do this is based on
maximum likelihood. We calculate
the probability of our data for a specific parameter and then try to
maximize this probability by changing the parameter.
For inferences based on the coalescent we use
Likelihood(Parameter P) = Probability(Data given Parameter) = L(P) = Prob(D|P)
L(P) = Sum over all genealogies G of Prob(G|P) Prob(D|G)
where G is a specific genealogy. Prob(G|P) is the probability
of a genealogy given the coalescent with the parameters, for example N and
growth or migration rates. Prob(D|G) is the probability of the (sequence) data
for a specific tree G; this is the same as one would calculate in a
phylogenetic maximum likelihood method.
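For two sequences the sum over all genealogies reduces to an integral over a
single coalescence time, which can be done numerically. The sketch below makes
several assumptions that are not in the lecture: the data are summarized as the
number s of differences between the two sequences, mutations follow a Poisson
process with rate mu per sequence per generation, and the mutation rate is
treated as known. All names and numbers are hypothetical.

```python
import math


def prob_genealogy(t, N):
    """Density of the coalescence time t (generations) of two lineages, rate 1/(2N)."""
    rate = 1.0 / (2 * N)
    return rate * math.exp(-rate * t)


def prob_data(s, t, mu):
    """Poisson probability of s differences on two branches of length t each."""
    lam = 2 * mu * t
    return math.exp(-lam) * lam ** s / math.factorial(s)


def likelihood(s, N, mu, steps=20000):
    """Trapezoid-rule approximation of the integral of Prob(G|N) Prob(D|G) over t."""
    t_max = 40 * N  # 20 times the mean coalescence time; the tail beyond is negligible
    h = t_max / steps
    total = 0.0
    for i in range(steps + 1):
        t = i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * prob_genealogy(t, N) * prob_data(s, t, mu)
    return total * h


s_observed, mu = 4, 1e-4  # assumed data and assumed known mutation rate
candidates = [2500, 5000, 10000, 20000, 40000]
best_N = max(candidates, key=lambda N: likelihood(s_observed, N, mu))
print("maximum likelihood N on this grid:", best_N)
```

Maximizing over a grid of candidate values is the crudest possible optimizer;
it is used here only to keep the example short.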
Summing over all genealogies is not practical for datasets larger
than 2-3 sequences. For reasonable datasets one needs to rely on an approximation to this
likelihood using Markov chain Monte Carlo sampling.
This is computationally challenging and was not really
feasible before the 1990s, although the technique
[Metropolis-Hastings algorithm] goes back to
Metropolis, Rosenbluth, Rosenbluth, Teller and Teller (1953).
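The idea of sampling genealogies in proportion to Prob(G|P) Prob(D|G) can be
sketched for the simplest case of two sequences, where the genealogy is a single
coalescence time t. The example uses a symmetric random-walk proposal (the
Metropolis special case of Metropolis-Hastings) and assumes Poisson mutations
and s observed differences; these choices, the names and the numbers are
assumptions for illustration and not the method of any particular program.

```python
import math
import random


def log_target(t, s, N, mu):
    """log of Prob(G|P) * Prob(D|G), up to a constant, for coalescence time t."""
    if t <= 0:
        return -math.inf  # negative times are impossible
    return -t / (2 * N) - 2 * mu * t + s * math.log(2 * mu * t)


def sample_coalescence_times(s, N, mu, step_sd, steps=50000, burn_in=5000, seed=1):
    """Random-walk Metropolis sampler for the coalescence time of two sequences."""
    rng = random.Random(seed)
    t = 2.0 * N  # start at the prior mean
    current = log_target(t, s, N, mu)
    samples = []
    for i in range(steps):
        proposal = t + rng.gauss(0.0, step_sd)  # symmetric proposal
        candidate = log_target(proposal, s, N, mu)
        accept_prob = math.exp(min(0.0, candidate - current))
        if rng.random() < accept_prob:
            t, current = proposal, candidate
        if i >= burn_in:
            samples.append(t)
    return samples


samples = sample_coalescence_times(s=4, N=1000, mu=1e-3, step_sd=1000.0)
posterior_mean = sum(samples) / len(samples)
print("mean sampled coalescence time:", posterior_mean)
```

In this toy case the target is a gamma distribution with mean
(s+1)/(1/(2N) + 2 mu) = 2000 generations, so the sampler can be checked against
a known answer; for real datasets no such closed form exists.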
The mutation rate of the data and the population parameters are confounded:
with half the mutation rate and double the population size we will get
similar trees compared to double the mutation rate and half the population
size, so the data only inform us about the product of the two.
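The trade-off between mutation rate and population size can be checked on the
two-sequence case. The sketch assumes Poisson mutations, under which the
probability of s differences between two sequences is geometric in
Theta = 4 N mu; the function name and the numbers are hypothetical.

```python
import math


def pair_likelihood(s, N, mu):
    """Probability of s differences between two sequences; depends only on 4*N*mu."""
    theta = 4 * N * mu
    return (1 / (1 + theta)) * (theta / (1 + theta)) ** s


# Half the mutation rate with double the population size gives the same likelihood.
a = pair_likelihood(s=3, N=10_000, mu=1e-4)
b = pair_likelihood(s=3, N=20_000, mu=5e-5)
print(a, b, math.isclose(a, b))
```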
Extensions of the basic coalescent framework
The coalescent framework was at first used only for
inference of a single population parameter, the population size.
Hudson (1990; Gene genealogies and the coalescent process,
Oxford Surveys in Evolutionary Biology 7:1-44)
and others showed how it could be extended
to include other population genetic forces, such as growth,
migration, recombination, selection, speciation etc.
Migration
We need to augment our genealogies with migration events and incorporate
migration rates, so that the probability of a time interval now depends
on the different population sizes and the migration rates.
This procedure allows us to estimate population parameters for rather
complicated population models with many parameters.
If 4Nm << 1 the lineages within each population will most likely coalesce
separately, before any migration event occurs.
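The role of 4Nm can be made concrete for two lineages sampled from the same
population. The sketch assumes that the two lineages coalesce at rate 1/(2N)
per generation and that each lineage migrates (looking backward in time) at
rate m per generation; the function name and the parameter values are
illustrative.

```python
def prob_coalesce_before_migration(N, m):
    """Probability that two lineages in one population coalesce before either migrates."""
    coalescence_rate = 1 / (2 * N)  # k(k-1)/(4N) with k = 2
    migration_rate = 2 * m          # either of the two lineages may migrate
    return coalescence_rate / (coalescence_rate + migration_rate)


for four_Nm in (0.01, 1.0, 10.0):
    N = 1000
    m = four_Nm / (4 * N)
    print(four_Nm, prob_coalesce_before_migration(N, m))
```

Under these assumptions the probability simplifies to 1/(1 + 4Nm), which is
close to 1 when 4Nm << 1.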
Recombination
So far we assumed that a tip on a genealogy is a non-recombining gene.
Long nuclear sequences most likely have undergone recombination in the
past. Sites on the left side of a recombination event do not need
to come from the same genealogy as the sites on the right side.
A stretch of nuclear sequence keeps the same genealogy only over a distance
for which 4Ner < 1, or
r < 1/(4Ne); beyond that distance there will be a different genealogy.
In humans we can expect a recombination around every
1000 base pairs, so that there are many hundreds of thousands of different gene trees
for our genome.
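The condition r < 1/(4Ne) can be turned into a rough segment length if r is
taken to grow linearly with distance. The sketch below assumes an effective
size of 10,000 and a recombination rate of 1e-8 per base pair per generation;
both are commonly quoted round numbers used here only for illustration, and the
function name is hypothetical.

```python
def max_shared_length_bp(Ne, rec_per_bp):
    """Length in base pairs at which 4*Ne*r reaches 1, with r = rec_per_bp * length."""
    return 1 / (4 * Ne * rec_per_bp)


segment = max_shared_length_bp(Ne=10_000, rec_per_bp=1e-8)  # assumed values
print("approximate length sharing one genealogy (bp):", segment)
```

With these assumed values the result is of the same order of magnitude as the
figure of about 1000 base pairs given above.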
Speciation and the coalescent
"Lineage sorting"
Individual gene phylogenies and species phylogenies do not necessarily
need to show the same pattern. The differences can be a result
of
- Genetic drift: the ancestral population was very large
- Balancing selection
Time of speciation and ancestral population size
Several studies attempt to estimate the time of species divergence based
on a phylogenetic tree. This does not take into account the population size
of the ancestral population. Lineages of a gene tree from the different species
first have to be in the same (ancestral) species before they can coalesce; therefore the
gene divergence always predates the species divergence. If the ancestral
population is very large and the speciation event is rather recent,
the difference might be rather large.
(An overview: Edwards and Beerli (2000) Perspective: Gene divergence,
population divergence, and the variance in coalescence time in
phylogeographic studies. Evolution 54(6): 1839-1854.)
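The gap between gene divergence and species divergence can be sketched with the
waiting-time formula E(u) = 4N/(k(k-1)) for k = 2, which gives 2N generations
in the ancestral population. The example assumes one lineage per species and a
constant ancestral population size; the names and numbers are hypothetical.

```python
import random


def expected_gene_divergence(t_species, N_ancestral):
    """Species split time plus the expected coalescence time of two lineages, 2N."""
    return t_species + 2 * N_ancestral


def simulate_gene_divergence(t_species, N_ancestral, replicates=20000, seed=1):
    """Average simulated gene divergence time, using exponential waiting times."""
    rng = random.Random(seed)
    rate = 1 / (2 * N_ancestral)
    times = [t_species + rng.expovariate(rate) for _ in range(replicates)]
    return sum(times) / replicates


t_species, N_ancestral = 10_000, 50_000  # recent split, large ancestral population
print("expected: ", expected_gene_divergence(t_species, N_ancestral))
print("simulated:", simulate_gene_divergence(t_species, N_ancestral))
```

With a recent split and a large ancestral population, most of the gene
divergence time is spent in the ancestral population rather than after the
speciation event.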