Summer course in Coalescent Theory, Uppsala 2007

Manual, Computer exercise 1

 

Purpose: The purpose of this exercise is to illustrate the Coalescent process with various extensions in the discrete as well as the continuous form.

 

Time permitting, we shall also simulate a data set and look at a simple example data set

 

This manual can also be found at www.coalescent.dk

 

First, we will use two small java-applets that can “animate” the coalescent process, the Wright-Fisher animator (discrete generations) and the Hudson-animator (continuous time).

 

Second, we will simulate some sequences under the coalescent process with and without recombination and perform some simple analysis on these sequences.

 

Third we will have a look at a mitochondrial data set

 

These tools can both be accessed through http://www.coalescent.dk. They have been developed by different student programmers at the Bioinformatics Research Center, University of Aarhus.

 

It is important that you take time to think about the different questions while doing each exercise – try to build your intuition about the coalescent process. Write down the answers

\

A. The Wright-Fisher animator

 

By following the link Wright-Fisher animator at www.coalescent.dk it is possible to use the Wright-Fisher animator to follow the reproduction process forward in time and the coalescent process backwards in time one generation at a time. After you have followed the reproduction process a number of generations forwards in time it is possible to “untangle” the genealogy, and then to follow both how many descendants each of the original genes leave over generations (click on upper row), and to follow the ancestors to the sequence in the bottom row (by pressing the circles in the bottom row).

 

A new simulation is done by setting the parameters and pressing the new bottom. The simulation can then be controlled by the bottoms in the bottom (right) part of the window, e.g. one generation at a time. One bottom enables you nto untangle the resulting genealogy (i.e. rearranging individuals so that lines do not cross).

  1. Try to set N=number of genes=10, and G=number of generations=15.
    • Forward in time: How many generations does each of the initial genes persist in the population?
      1. What is the mean and the variance in the number of descendants of a given gene?
    • Backwards in time (coalescence). Do all the genes coalesce within this time span?
      1. If so, what is the number of generations to coalescence of the sample?
      2. If not, how many ancestors are there and how large part of the sample are they ancestors to
      3. What would you have expected from theory?
    • Try to repeat this exercise 5 times writing down the number of ancestors and the time to coalescence. This should give you an idea about the variance of the process.
      1. How is the variance in time going from 3 to 2 ancestors compared to going from 2 to 1 ancestor? What do you expect from theory?

 

  1. Try the first exercise with 3 different values of N and adjust G.if necessary

·        How does the time to coalescence scale with N?

B. Hudson animator

 

The Hudson-animator is reached through www.coalescent.dk, choosing Hudson animator.

It is developed by student programmer Anders M. Mikkelsen as a tool for the visualization of the following continuous time processes.

 

·                                The basic coalescent and the Coalescent with recombination

·                                Coalescent with exponential growth

·                                Coalescent with migration

 

Please consult the manual before doing the exercises, it can be found under help at the start page. It briefly describes how to control the applet.

 

Exercises using the Hudson-animator

 

Coalescent with recombination.

 

  1. The basic coalescent. Choose n=5 sequences and rho=0 (no recombination). Press recalc and the animation starts. The speed can be controlled with speed. Details regarding each node can be seen in the small window at the right when moving the mouse over the node.
    1. What is the time to the first coalescence event? Write it down.
    2. What is the time to the most recent common ancestor? Write it down.
    3. Repeat a and b 5 times (by pressing recalc). How does the time to first coalescence and time to most recent common ancestor vary?
  2. Try with 10 and 20 sequences. What are the times to the first coalescence and the most recent common ancestor in these cases? Write it down for 5 different runs.

 

The recombination rate is determined by rho=2Nc. In the animation, recombination events are marked as blue nodes (in contrast to the green nodes of coalescent events).

 

  1. Set n=5, and rho=1. Press recalc and study the animation. Does any recombination events occur? If not, try again. How many recombination events do you expect to see? Try a total of 25 times and write down the number of recombination events in each case. What does the distribution look like?

Look at a couple of simulations in more detail. Study where in the sequence, recombination events occur

    1. Can you find examples that different part of the sequence have different most recent common ancestors (marked in green) at different time points?.
    2. Can you find any examples of “trapped” non-ancestral material (coloured white) between blocks of ancestral material?
    3. Try to press the trees window pane. Here it is possible to study each of the different trees over the sequence. How many different trees are there and how does this relate to the number of recombination events? Try to find examples of each of the following recombination types

                                                               i.      Recombination changing the topology of the tree

                                                             ii.      Recombination changing the branch length but not the topology of the tree

                                                            iii.      Recombination that does not change the tree

What is the total number of each of i, ii, and iii during the 5 replicate runs? How does that match with theory (see book Chapter)

  1. Try setting rho=2 and n=10
    1. How many recombinations occur now?
    2. At which time do different parts of the sequence find a most recent common ancestor?
    3. What is the time until the first part of the sequence finds a most recent common ancestor (calculate the average over 5 replicates). At which time have all the different parts of the sequence found a most recent common ancestor (average over 5 replicates)? Compare this time with the time to most recent common ancestor without recombination.
  2. Try setting rho=5 and n=20, how many recombination events occur now? How many different most recent common ancestors are there over the sequence now?

 

Coalescent with exponential growth

 

Now choose coalescent with exponential growth. This is controlled with the parameter exp which is equal to Nb. This parameter measure how many times the present population is larger than the population 2N (N=size of present day population) ago. In human  mitochondrial studies (no recombination) all estimates suggest that exp>100.

 

  1. Try to simulate n=10 sequences and different runs with exp=1, 10, 100, 1000
    1. How does the shape of the genealogical tree depend on the value of exp?
    2. How can that be?
    3. How would this altered shape be visible in a set of sequences evolving on the tree? Would there be fewer or more “singletons”? How would Tajima’s D be affected?

Hint: If you push trees the same tree will appear without any crossing branches

 

Coalescent with migration

 

Now choose coalescent with migration It is only possible to simulate two populations, but the number of individuals and the migration rates between the populations can be freely chosen. M1 is the migration rate from population 1 to population 2. The separation between populations is marked with a dotted line and migration events are shown as blue nodes.

 

  1. Try to set n=10, M1=0.5 and M2=0.5 and run the animation
    1. How many migration events occur in total? Note it down.
    2. How many of these happen from there are 10 until there are two sequences left?
    3. How many migration events occur during the time interval when there are only two ancestors left?
    4. Repeat the animation with the same parameters three times – how large is the variance?
  2. Try to push trees. Here you can observe the tree without migration events and without crossing branches, but you can see which population each individual came from.
    1. How many migration events can you infer must have happened?
    2. How does this correspond with the true number?
    3. Repeat this exercise a couple of times
  3. Try to investigate different migration rates and number of sequences.
    1. How is the shape of the tree affected (e.g. if M1=M2=0.1)?
    2. How would this change of tree shape be expected to be visible in a set of real sequences (would there be more or fewer singletons?).

 

 

2. Simulation of sequences

 

At www.coalescent.dk, choose coalescent, recombination and gene conversion – finite sites. This program can simulate sequence data sets under the coalescent process with recombination and/or gene conversion, using a couple of different substitution model (JC and Kimura 2 parameter).

A window appears where the following choices can be made:

Number of simulations, number of sequences and length of sequences are obvious.

Model of substitution: Jukes-Cantor, or Kimura's 2 parameter model, with different rates for transitions and transversions. The rate is the mutation rate measured in 2N generations and is approximately the percentage of nucleotides where two random sequences are expected to differ, i.e. a rate of m=0.01-0.2 provides realistic data sets.

Rate heterogeneity denotes how much the substitution rate varies over the sequence, with a being the parameter of a gamma distribution, i.e. a large value of a means a small rate heterogeneity and vice verse.

Rate of recombination is the recombination rate measured in 2N generations as in the hudson-animator,

Rate of gene conversion is similarly defined as the recombination rate (i.e. measured in 2N generations). For gene conversion, the average length of the converted fragment should also be specified, 1/q is the fraction of the sequence expected to be converted.

If outgroup is chosen, and extra sequence termed ”outgroup” is simulated. It does not recombine with the other sequence, but has the distance chosen from the other sequences. It can be used in later analysis for rooting trees, but we will not pursue this here so simulate without outgroup.

format of sequences: (Phylip format) or FASTA format)

 

Simulation with and without recombination

 

  1. Simulate 20 sequences of length 500 under the Jukes-Cantor model with m=0.02, no rate heterogeneity, no recombination, no outgroup. Fasta format. Run the program and save the sequences under a reasonable name (e.g. data1.fas).
  2. Next Simulate a data set of 20 sequences of length 500 and m=0.02 with rho=2. No outgroup, no rate heterogeneity, and Fasta format. Save as data2.fas

 

Look at the data sets in the DataAnalyzer application and the associated R2 program (the simulated data sets can be considered non-coding). The program can estimate various simple quantities, among these Gm and Hm.

The R2 program (by Anders Mikkelsen and Thomas Christensen) tests for a correlation between linkage disequilibrium and distance.. The measures of linkage disequilibrium used are the standard D’ measure, the squared correlation measure r2, and whether two sites are compatible or not (the CM measure).

 

 

  • Use the R2 program on the data set. Pick 100 permutations.
      1. Is there significant evidence for recombination for each of the three measures of linkage disequilibrium?
  • How much recombination occurred? Estimation of Gm (Hudsons estimator) and Hm (Myers estimator)

 

Analysis of real data sets

 

Data sets available

 

The following three data sets can be downloaded following the link at the www.coalescent.dk web page. Just save the ones you want to use on the desktop. They can be downloaded in each of three different alignment formats, just use FASTA.

 

waterbuck. mitochondrial D-loop sequences from 36 Waterbucks, from four African subpopulations (Peter Arctander, personal communication)

sat1: 32 sequences from the foot-and-mouth disease from South African buffaloes (Bastos et al. 2000). It is coding with data in the second reading frame (-AA when using DataAnalyzer)

adh: Original Adh data from D. melanogaster: (11 sequences)

Kreitman, M., 1983 Nucleotide polymorphism at the alcohol dehydrogenase locus of Drosophilamelanogaster. Nature 304: 412-417.

 

Suggestions for exercises

 

1. What is the number of segregating sites, and Watterson’s estimate of the scaled mutation rate?

2. What is the average pairwise difference?

3. What is Tajima’s D and what does it tell you?

4. What is Fu and Li’s D.

7. Estimate the minimum number of recombination events needed to explain the data under the infinite sites model. (DataAnalyzer can do this) Do you get what you expected? If not, why not?

8. Calculate LD and look at it as a function of distance (using the R2 application).