Purpose: This exercise illustrates the coalescent process and several of its extensions, in both discrete and continuous form.
Time permitting, we shall also simulate a data set and look at a simple example data set.
This manual can also be found at www.coalescent.dk.
First, we will use two small Java applets that can “animate” the coalescent process: the Wright-Fisher animator (discrete generations) and the Hudson-animator (continuous time).
Second, we will simulate some sequences under the coalescent process with and without recombination and perform some simple analysis on these sequences.
Third, we will have a look at a mitochondrial data set.
These tools can both be accessed through
http://www.coalescent.dk. They have
been developed by different student programmers at the
It is important that you take time to think about the different questions while doing each exercise. Try to build your intuition about the coalescent process. Write down the answers.
By following the link Wright-Fisher animator at www.coalescent.dk, you can use the Wright-Fisher animator to follow the reproduction process forward in time and the coalescent process backward in time, one generation at a time. After following the reproduction process forward for a number of generations, you can “untangle” the genealogy. You can then see how many descendants each of the original genes leaves over the generations (click on the upper row) and trace the ancestors of the sequences in the bottom row (click on the circles in the bottom row).
A new simulation is started by setting the parameters and pressing the new button. The simulation can then be controlled with the buttons in the bottom (right) part of the window, e.g. one generation at a time. One button enables you to untangle the resulting genealogy (i.e. rearrange individuals so that lines do not cross).
· How does the time to coalescence scale with N?
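The question above can also be explored outside the applet with a short simulation. The following is a minimal Python sketch, not the animator's own code. It assumes a haploid Wright-Fisher population with a constant number of genes (called n_genes here), and the function names are made up for this illustration. A sample is traced backward one generation at a time until a single common ancestor remains.

```python
import random


def generations_to_mrca(n_genes, sample_size, rng):
    """Trace a sample backward until all lineages share one ancestor.

    Assumes a haploid Wright-Fisher population of n_genes genes and
    sample_size <= n_genes.
    """
    lineages = set(range(sample_size))  # distinct ancestors of the sample
    generations = 0
    while len(lineages) > 1:
        # Each lineage picks its parent uniformly among the n_genes genes;
        # lineages picking the same parent coalesce (the set merges them).
        lineages = {rng.randrange(n_genes) for _ in lineages}
        generations += 1
    return generations


def mean_generations_to_mrca(n_genes, sample_size, replicates=2000, seed=1):
    """Average the time to the most recent common ancestor over replicates."""
    rng = random.Random(seed)
    total = sum(generations_to_mrca(n_genes, sample_size, rng)
                for _ in range(replicates))
    return total / replicates


if __name__ == "__main__":
    for n_genes in (10, 20, 40):
        print(n_genes, mean_generations_to_mrca(n_genes, sample_size=5))
```

Comparing the averages for different values of n_genes gives a rough empirical impression of how the time to coalescence depends on population size.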
The Hudson-animator is reached through www.coalescent.dk, choosing
It was developed by student programmer Anders M. Mikkelsen as a tool for visualizing the following continuous-time processes:
· The basic coalescent and the coalescent with recombination
· Coalescent with exponential growth
· Coalescent with migration
Please consult the manual before doing the exercises; it can be found under help at the start page. It briefly describes how to control the applet.
Exercises using the Hudson-animator
Coalescent with recombination
The recombination rate is determined by rho=2Nc. In the animation, recombination
events are marked as blue nodes (in contrast to the green nodes of coalescent
events).
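To get a feeling for how many blue and green nodes to expect, the process can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the applet's algorithm: with k lineages, coalescence is assumed to occur at rate k(k-1)/2 and recombination at rate k*rho/2, and every lineage is allowed to recombine regardless of how much ancestral material it carries (the animator may treat this differently). The function name count_events is made up for this example.

```python
import random


def count_events(sample_size, rho, rng):
    """Return (recombinations, coalescences) for one simulated history.

    Assumption of this sketch: each of the k current lineages recombines
    at rate rho / 2, and each pair coalesces at rate 1.
    """
    k = sample_size
    recombinations = 0
    coalescences = 0
    while k > 1:
        coal_rate = k * (k - 1) / 2
        rec_rate = k * rho / 2
        # Only the order of events matters for the counts, so we draw
        # the type of the next event and skip the waiting times.
        if rng.random() < rec_rate / (coal_rate + rec_rate):
            k += 1  # a recombination splits one lineage into two
            recombinations += 1
        else:
            k -= 1  # a coalescence merges two lineages into one
            coalescences += 1
    return recombinations, coalescences


if __name__ == "__main__":
    rng = random.Random(1)
    for rho in (0.0, 1.0, 2.0):
        runs = [count_events(5, rho, rng) for _ in range(1000)]
        mean_rec = sum(r for r, _ in runs) / len(runs)
        print(rho, mean_rec)
```

In every run the number of coalescences equals the number of recombinations plus the sample size minus one, because each recombination adds a lineage that must later be removed by a coalescence.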
Look at a couple of simulations in more detail. Study where in the sequence recombination events occur, and classify each event as one of the following:
i. Recombination changing the topology of the tree
ii. Recombination changing the branch lengths but not the topology of the tree
iii. Recombination that does not change the tree
What is the total number of each of i, ii, and iii during the 5 replicate runs? How does that match the theory (see book chapter)?
Coalescent with exponential growth
Now choose coalescent with exponential growth. This is controlled by the parameter exp, which is equal to Nb. This parameter measures how many times larger the present population is than the population 2N generations ago (N = size of the present-day population). In human mitochondrial studies (no recombination), all estimates suggest that exp>100.
Hint: If you push trees, the same tree will appear without any crossing branches.
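The effect of growth on the coalescence times can be sketched in Python by rescaling the waiting times of the constant-size coalescent. The sketch assumes that the population size t time units ago (time in units of 2N generations) was N*exp(-beta*t). The symbol beta and the function name are assumptions of this example, and how beta maps onto the applet's exp parameter is not claimed here.

```python
import math
import random


def coalescence_times_with_growth(sample_size, beta, rng):
    """Return the times of the successive coalescence events.

    Time runs backward from the present in units of 2N generations,
    with N the present-day size. beta = 0 gives the constant-size case.
    """
    times = []
    t = 0.0
    for k in range(sample_size, 1, -1):
        # Waiting time with k lineages in a constant-size population.
        x = rng.expovariate(k * (k - 1) / 2)
        if beta == 0:
            t += x
        else:
            # Solve integral_t^s exp(beta * u) du = x for the new time s:
            # coalescence speeds up as the population shrinks backward.
            t = math.log(math.exp(beta * t) + beta * x) / beta
        times.append(t)
    return times


if __name__ == "__main__":
    for beta in (0.0, 10.0, 100.0):
        print(beta, coalescence_times_with_growth(5, beta, random.Random(1)))
```

Using the same random seed for different values of beta makes it easy to see how growth compresses the deep part of the tree relative to the external branches.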
Coalescent with migration
Now choose coalescent with migration. It is only possible to simulate two populations, but the number of individuals and the migration rates between the populations can be chosen freely. M1 is the migration rate from population 1 to population 2. The separation between populations is marked with a dotted line, and migration events are shown as blue nodes.
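The two-population process can be mimicked with a small Python sketch. It rests on several assumptions that are not taken from the applet: pairs of lineages coalesce at rate 1 only when they are in the same population, a lineage in population 1 moves to population 2 (looking backward in time) at rate m1/2, and a lineage in population 2 moves to population 1 at rate m2/2. All names in the code are made up for this illustration.

```python
import random


def structured_coalescent(n1, n2, m1, m2, rng):
    """Return (time to the common ancestor, number of migration events)."""
    if m1 <= 0 or m2 <= 0:
        raise ValueError("both migration rates must be positive in this sketch")
    k = [n1, n2]  # number of lineages currently in each population
    time = 0.0
    migrations = 0
    while k[0] + k[1] > 1:
        rates = {
            "coal1": k[0] * (k[0] - 1) / 2,  # coalescence in population 1
            "coal2": k[1] * (k[1] - 1) / 2,  # coalescence in population 2
            "mig12": k[0] * m1 / 2,          # lineage moves from 1 to 2
            "mig21": k[1] * m2 / 2,          # lineage moves from 2 to 1
        }
        time += rng.expovariate(sum(rates.values()))
        event = rng.choices(list(rates), weights=list(rates.values()))[0]
        if event == "coal1":
            k[0] -= 1
        elif event == "coal2":
            k[1] -= 1
        elif event == "mig12":
            k[0] -= 1
            k[1] += 1
            migrations += 1
        else:
            k[1] -= 1
            k[0] += 1
            migrations += 1
    return time, migrations


if __name__ == "__main__":
    rng = random.Random(1)
    for m in (0.1, 1.0, 10.0):
        runs = [structured_coalescent(3, 3, m, m, rng) for _ in range(1000)]
        print(m, sum(t for t, _ in runs) / len(runs))
```

Running it with small and large migration rates shows how isolation between the populations lengthens the time to the common ancestor.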
At www.coalescent.dk, choose coalescent, recombination and gene conversion – finite sites.
This program can simulate sequence data sets under the coalescent process with recombination and/or gene conversion, using a couple of different substitution models (Jukes-Cantor and Kimura's 2-parameter model).
A window appears where the following choices can be made:
Number of simulations, number of sequences and length of sequences are self-explanatory.
Model of substitution: Jukes-Cantor, or Kimura's 2-parameter model with different rates for transitions and transversions.
The rate is the mutation rate measured in 2N generations and is approximately the proportion of nucleotides at which two random sequences are expected to differ, so a rate of m=0.01-0.2 provides realistic data sets.
Rate heterogeneity denotes how much the substitution rate varies over the sequence, with a being the shape parameter of a gamma distribution, i.e. a large value of a means a small rate heterogeneity and vice versa.
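The role of a can be illustrated with a few lines of Python. The sketch assumes, as is conventional but not confirmed for this program, that the gamma distribution is scaled to have mean 1, so that each site receives a relative rate multiplier. A gamma distribution with shape a and scale 1/a has mean 1 and variance 1/a. The function name is made up for this illustration.

```python
import random
import statistics


def site_rate_multipliers(n_sites, a, rng):
    """Draw one relative rate per site from a gamma with mean 1.

    Shape a and scale 1/a give mean 1 and variance 1/a, so a large a
    means that all sites evolve at nearly the same rate.
    """
    return [rng.gammavariate(a, 1.0 / a) for _ in range(n_sites)]


if __name__ == "__main__":
    rng = random.Random(1)
    for a in (0.5, 2.0, 50.0):
        rates = site_rate_multipliers(10000, a, rng)
        print(a, statistics.mean(rates), statistics.pvariance(rates))
```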
Rate of recombination is the recombination rate measured in 2N generations, as in the Hudson-animator.
Rate of gene conversion is defined similarly to the recombination rate (i.e. measured in 2N generations). For gene conversion, the average length of the converted fragment should also be specified; 1/q is the fraction of the sequence expected to be converted.
If outgroup is chosen, an extra sequence termed “outgroup” is simulated. It does not recombine with the other sequences, but has the chosen distance from them. It can be used in later analysis for rooting trees, but we will not pursue this here, so simulate without an outgroup.
Format of sequences: Phylip format or FASTA format.
Look at the data sets in the DataAnalyzer application and the associated R2 program (the simulated data sets can be considered non-coding).
The program can estimate various simple quantities, among these Gm and Hm.
The R2 program (by Anders Mikkelsen and Thomas Christensen) tests for a correlation between linkage disequilibrium and distance. The measures of linkage disequilibrium used are the standard D’ measure, the squared correlation measure r2, and whether two sites are compatible or not (the CM measure).
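For a single pair of sites, the three measures can be sketched in Python as follows. This is an illustration under stated assumptions, not the R2 program's code: both sites are assumed to carry exactly two alleles, |D’| is reported rather than a signed value, and compatibility is taken to mean that fewer than four distinct two-site haplotypes are observed (the four-gamete test under the infinite sites model). The function name is made up for this example.

```python
def ld_measures(site_a, site_b):
    """Return (|D'|, r2, compatible) for two biallelic sites.

    site_a and site_b hold the allele of each sequence at the two sites,
    in the same sequence order.
    """
    n = len(site_a)
    allele_a = sorted(set(site_a))[0]  # reference allele at site A
    allele_b = sorted(set(site_b))[0]  # reference allele at site B
    p_a = sum(x == allele_a for x in site_a) / n
    p_b = sum(y == allele_b for y in site_b) / n
    p_ab = sum(x == allele_a and y == allele_b
               for x, y in zip(site_a, site_b)) / n
    d = p_ab - p_a * p_b
    # D' scales D by its largest possible magnitude given the frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    # Four-gamete test: all four haplotypes present means incompatible.
    compatible = len(set(zip(site_a, site_b))) < 4
    return d_prime, r2, compatible


if __name__ == "__main__":
    print(ld_measures("AAGG", "CCTT"))
    print(ld_measures("AAGG", "CTCT"))
```

Plotting such values against the distance between the two sites, for all pairs of segregating sites, gives the kind of relationship the R2 program tests.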
Data sets available
The following three data sets can be downloaded by following the link at the www.coalescent.dk web page. Just save the ones you want to use on the desktop. They can be downloaded in each of three different alignment formats; just use FASTA.
waterbuck: Mitochondrial D-loop sequences from 36 waterbucks from four African subpopulations (Peter Arctander, personal communication).
sat1: 32 sequences from the foot-and-mouth disease virus from South African buffaloes (Bastos et al. 2000). It is coding, with the data in the second reading frame (-AA when using DataAnalyzer).
adh: Original Adh data from D. melanogaster (11 sequences).
Kreitman, M., 1983 Nucleotide polymorphism at the alcohol dehydrogenase locus of Drosophila melanogaster. Nature 304: 412-417.
Suggestions for exercises
1. What is the number of segregating sites, and Watterson’s estimate of the scaled mutation rate?
2. What is the average pairwise difference?
3. What is Tajima’s D and what does it tell you?
4. What is Fu and Li’s D?
7. Estimate the minimum number of recombination events needed to explain the data under the infinite sites model (DataAnalyzer can do this). Do you get what you expected? If not, why not?
8. Calculate LD and look at it as a function of distance (using the R2 application).
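The quantities in exercises 1 to 3 can be checked by hand on a small alignment with the following Python sketch. It is not DataAnalyzer's code. It assumes equal-length aligned sequences, treats any column with more than one character state as segregating (so gaps would count as a state), reports values per sequence rather than per site, and uses the constants from Tajima's 1989 definition of D. The function name is made up for this illustration.

```python
import math
from itertools import combinations


def summary_statistics(sequences):
    """Return (S, Watterson's estimate, average pairwise difference, Tajima's D)."""
    n = len(sequences)
    columns = list(zip(*sequences))
    # A site is segregating if more than one state occurs in its column.
    s = sum(1 for column in columns if len(set(column)) > 1)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    theta_w = s / a1
    # Average number of differences over all pairs of sequences.
    pairs = list(combinations(sequences, 2))
    pi = sum(sum(x != y for x, y in zip(p, q)) for p, q in pairs) / len(pairs)
    if s == 0 or n < 4:
        return s, theta_w, pi, None  # D is undefined in these cases
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    tajima_d = (pi - theta_w) / math.sqrt(e1 * s + e2 * s * (s - 1))
    return s, theta_w, pi, tajima_d


if __name__ == "__main__":
    toy_alignment = ["ACGT", "ACGA", "ACTA", "TCTA"]  # made-up example data
    print(summary_statistics(toy_alignment))
```

For the toy alignment there are three segregating sites among four sequences, so Watterson's estimate is 3 divided by (1 + 1/2 + 1/3), and the average pairwise difference is 10/6.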