Population Genetics

The calculations in Mesquite's standard packages that concern population genetics include coalescence simulations and calculations involving gene trees. As yet, there are few of the traditional population genetics calculations (e.g., no Fst). Some of the relevant features and calculations are:

simulations of gene trees by coalescence, either within a single population or in a diverging population or species tree
simulations of sequence evolution, which can be used to evolve haplotypes on a gene tree
calculations of fit of a gene tree to a population tree or population subdivision (Slatkin & Maddison's s, Maddison's deep coalescences)
searching for population trees that optimize fit of gene trees
cluster analysis of populations using similarities of contained gene sequences
charts, scripting and production of batch analyses to yield statistical tests

In these analyses individual samples or haplotypes will generally be represented by taxa. Thus, each gene sequence will be a taxon, and the sequence itself will be a row in a DNA matrix.

Many of the features and calculations described below are illustrated in the example data files in the Mesquite_Folder/examples/Coalescence/ folder. The majority of these features are provided by the Coalescence package (mesquite.coalesce) and the taxa association package (mesquite.assoc); some are provided by the distance analysis package (mesquite.distance).

Importing and Exporting Data
Single populations: Coalescence simulations
- Example: Chart of coalescence depths
Multiple populations
Simulating sampled gene sequences
References

Importing and Exporting data

Mesquite can import and export gene sequences as text files in the following formats: Simple text table, NBRF/PIR, PHYLIP, and .ss (NONA, Hennig86, WinClada), as well as NEXUS files.

Single population: Coalescence simulations

Gene trees within a single population can be simulated under the assumption of neutrality, panmixia and constant population size. A sample of such simulated gene trees can help you generate null expectations in tests, for instance of population subdivision. Mesquite simulates these gene trees by a coalescent process, beginning with the set of defined genes (taxa) and coalescing back in time until a single common ancestor is reached. Simulated coalescent trees can be viewed or used in other contexts where a source of trees is used — in the Tree Window or in Trees charts, for instance.

To view simulated gene trees, first prepare a data file with taxa representing the sampled gene copies. Select New Tree Window from the Taxa&Trees menu, and indicate you want Simulated Trees as your tree source. Choose Coalescent Trees as your tree simulator. You will be asked to indicate an effective population size. The simulated tree that appears will probably not be shown with branch lengths indicated; for optimal viewing we suggest selecting "Branches Proportional to Lengths" in the Drawing menu and using the Curvogram tree form. You may also want to choose a narrower line width. The tree might look something like this:

The blue numbers represent time in generations. To examine other simulated gene trees, scroll from one to another using the blue arrows in the upper left corner of the Tree Window. You will notice that the depth of the trees varies, depending on when the last coalescence happened during the simulation. To fit each tree into the window nicely, the scale of generations changes. You may find it interesting to prevent this rescaling, so that all trees are shown to the same scale. You can do that using the "Fixed Scaling" menu item in the Drawing menu.

The parameters of the simulation may be changed using the Set Seed (Tree simulation) menu item, and the items in the Coalescence Simulations submenu. To change effective population size, select "Set Ne". These simulations treat the organisms as haploid. For reasonably large population sizes, an exponential approximation can be used in the simulations to avoid having to model all genes in the population explicitly. This exponential approximation is the default; you can turn it off using the "Exponential approximation" menu item.
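The difference between modelling every gene copy and using the exponential approximation can be sketched in Python. This is a minimal illustration under the assumptions stated above (haploid, neutral, panmictic, constant size), not Mesquite's own code; the function names and the Wright-Fisher parent sampling are assumptions made for the example.

```python
import random

def explicit_depth(n_genes, ne, rng):
    """Generations to final coalescence, tracking parents generation by generation.

    Assumed model: each lineage picks its parent uniformly among the ne
    haploid gene copies of the previous generation (Wright-Fisher sampling).
    """
    lineages = n_genes
    generations = 0
    while lineages > 1:
        generations += 1
        # Lineages that pick the same parent coalesce.
        parents = {rng.randrange(ne) for _ in range(lineages)}
        lineages = len(parents)
    return generations

def approximate_depth(n_genes, ne, rng):
    """Generations to final coalescence using exponential waiting times.

    While k lineages remain, the wait to the next coalescence is drawn from
    an exponential distribution with rate k(k-1)/(2*ne) per generation.
    """
    depth = 0.0
    for k in range(n_genes, 1, -1):
        depth += rng.expovariate(k * (k - 1) / (2.0 * ne))
    return depth

rng = random.Random(1)
explicit = [explicit_depth(8, 100, rng) for _ in range(2000)]
approximate = [approximate_depth(8, 100, rng) for _ in range(2000)]
print(sum(explicit) / len(explicit), sum(approximate) / len(approximate))
```

For these illustrative values both averages should fall roughly near 2*Ne*(1 - 1/n) = 175 generations. The approximate version does a fixed amount of work per tree, whereas the explicit loop takes longer as Ne grows.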

Simulated gene trees can be generated and saved in a trees block in your data file. To do this, select Taxa&Trees>Make New Trees Block From>Simulated Trees>Coalescent Trees. Simulated gene trees can also be used in charts, as in the following example.

Example: Chart of coalescence depths

Simulated gene trees can be examined in charts by using Simulated Trees as your tree source. For instance, the example file 03-coalescenceDepth.nex shows a chart of time to final coalescence in a sample of 100 gene trees. This was made by asking for the Histogram for trees, using Tree Depth as the value to calculate (this is a secondary choice), and Simulated Trees, Coalescent Trees as the source of trees. Here is the same chart modified to ask for a sample of 10,000 gene trees.

The effective population size is 1,000. Although the average time to final coalescence is about 1760 generations, the curve (as expected) has a long tail. By touching on the Text tab of the window you can see that there was one simulated tree with depth >11,100 generations.
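Under the exponential approximation the expected depth has a closed form, which helps explain the long tail. The sketch below assumes haploid Ne and the standard coalescent waiting times; n_genes is a hypothetical sample size, since the number of genes in the example file is not stated here.

```python
def expected_interval(k, ne):
    """Expected generations spent with k lineages: 2*ne / (k*(k-1))."""
    return 2.0 * ne / (k * (k - 1))

def expected_depth(n_genes, ne):
    """Expected generations to final coalescence; equals 2*ne*(1 - 1/n_genes)."""
    return sum(expected_interval(k, ne) for k in range(2, n_genes + 1))

print(expected_depth(10, 1000))    # total expected depth for 10 genes
print(expected_interval(2, 1000))  # part contributed by the last two lineages
```

The final interval, with only two lineages left, accounts for at least half of the expected depth and is itself exponentially distributed, so occasional very deep trees are to be expected.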

Multiple populations

Gene sequences from multiple populations can be analyzed and modeled in Mesquite. For instance the following shows a simple example of 6 genes within 3 extant populations, in which a coalescence simulation within a diverging population history (blue) generates a hypothetical gene tree (green).

To treat a set of gene sequences as being distributed across multiple populations, you incorporate them into a single matrix, then indicate to which population each sequence belongs. Three components must be established in your data file to do this:

A block of taxa representing the gene sequences. The sequences themselves do not need to be represented by a DNA matrix, although the matrix could of course be important for some analyses. For purposes of gene tree simulations, however, it is enough that each gene sequence be represented by a taxon in a taxa block.
A block of taxa representing the populations (or species). Each taxon in this block represents a different population.
A taxa association block, which is a special block of information that indicates how the taxa representing genes are associated with the taxa representing populations. It is this that indicates for each gene what population it belongs in.

Once these three components are established, you can ask to do calculations (e.g., gene tree simulations) using the genes in a way that pays attention to the populations in which each gene belongs. We will next describe how to set up these three components, then what calculations can be done.

Establishing the association between genes and populations

Currently there are two ways to set up a data file with genes in multiple populations. Let's suppose you want to set up a file with six genes distributed among three populations (genes a1 and a2 in population A, b1 and b2 in population B, and c1 and c2 in population C).

Step-by-step method

The first method does it step-by-step, as follows:

Make a block of taxa representing the genes. This can be done by importing a data file with gene sequences, by creating a new file in Mesquite, or by selecting New Block of Taxa from the Taxa&Trees menu. Give this block of taxa a recognizable name, like "Genes". Give the genes (taxa) names in the List of Taxa window that will appear. In our example, the names are a1, a2, b1, b2, c1, and c2.
Make a block of taxa representing the populations or species containing the genes, and give it a recognizable name, such as "Populations". Give the populations (taxa) names in the List of Taxa window that will appear, e.g. A, B and C.
Select New Association... from the Taxa&Trees menu. You will be asked which will be the first block of taxa in the association. Although you can choose either genes or populations to be the first block, it will be easier for you to edit the taxa association if you choose populations as the first block. Then, name the association and you will be presented with a small editor to specify the association:

This editor is crude; you edit directly in NEXUS file format. You can erase the instructional comment (make sure you delete everything from the "[" to the "]" inclusive). The key section to edit is the list of populations beside each of which is to be the list of its included genes. You can do this by Cut and Paste, cutting b1 and b2 and pasting them right after the "/" on the B line (in front of the ","), and cutting c1 and c2 and pasting them right after the "/" on the C line, to yield:

Click "OK" and your taxa association is ready to be used.
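The list edited in the step-by-step method pairs each population with its genes, using "/" after the population name and "," between entries. As a rough sketch, the same pairing can be written as a Python mapping and turned into lines of that shape. The exact whitespace, the absence of a comma on the last line, and the rule that each gene belongs to one population are assumptions of this example, and the function names are hypothetical.

```python
def format_associates(association):
    """Build 'population / gene gene ,' lines from a population -> genes mapping."""
    names = list(association)
    lines = []
    for index, population in enumerate(names):
        genes = " ".join(association[population])
        # Assumed layout: a comma ends every line except the last.
        ending = " ," if index < len(names) - 1 else ""
        lines.append(f"{population} / {genes}{ending}")
    return "\n".join(lines)

def gene_to_population(association):
    """Invert the mapping, refusing genes assigned to more than one population."""
    owner = {}
    for population, genes in association.items():
        for gene in genes:
            if gene in owner:
                raise ValueError(f"{gene} is listed under {owner[gene]} and {population}")
            owner[gene] = population
    return owner

association = {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1", "c2"]}
print(format_associates(association))
```

The inverted gene-to-population mapping is the form most calculations need, since they start from a gene and ask which population contains it.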

Automated method (using a macro)

A macro is available to help you set up the gene-population association. It is called "Set up Associated Taxa Block" and is available in the Macros submenu of the Window menu. It assumes your file has one (and only one) block of taxa already made, that representing the genes. If you select this macro menu item, the second block of taxa representing populations will be created, the List of Taxa window for the genes will be shown, and a column labelled "Associates" will be shown, as follows:

The column labelled "Associates" indicates which taxa in the second block (in this case populations) are associated with each gene. This macro automatically creates a single population called "associate 1" and indicates that the first gene belongs in it. (You will notice that the macro also shows a window listing taxa in the new block, containing only the single taxon "associate 1".) When the macro is done working, a dialog box will appear with some explanation.

You will probably want to rename "associate 1" to be the name of the population in which the first gene resides (in our case, "A"). You can do this in the List of Taxa window with the single row named "associate 1". Next, indicate what other genes belong with the first population by selecting their rows in the genes' List of Taxa window and touching on the column heading "Associates". Select "Set Associate" and choose the name of the first population.

You must now create the other populations and assign the genes to them. You can do that in several ways, but the quickest way is to select all of the genes that belong to the second population, and select Create New Associated Taxon in the menu that drops down when you touch the column heading "Associates":

In the dialog that appears indicate the name of this new population. Continue this until all genes are assigned to populations. You should end up with two List of Taxa windows that look something like this:

Editing already-created taxa associations

The instructions above indicate how to create two taxa blocks (genes, populations) and link them using a Taxa Association. After you have done this, you can modify the Taxa Association either by editing the TaxaAssociation block directly (select it from the "Edit Association" submenu of the Taxa&Trees menu) or via the Associates column of the List of Taxa window. To show the Associates column, make the List of Taxa window visible for the genes (by selecting it under the List of Taxa submenu of the Taxa&Trees menu), then choose Associated Taxa under the Show Column submenu of the List menu.

Simulating coalescence within a population tree

Simulating gene trees evolving within a population tree is similar to simulating gene trees in a single population as described under Single Population. To be able to do it, however, you need to have already established an association between the gene taxa and populations, as described above. You also need to have a Tree Window open and showing a population tree, because the calculations need to be able to find a current population tree in which to perform the simulation.

If your data file is ready to go, you can generate and use gene trees simulated within the population tree in many contexts, such as a Tree Window or chart. You merely need to specify the source of trees to be Simulated Trees, and choose the simulator to be "Coalescence Contained within Current Tree". This is a secondary choice, and hence will be available under "Other Choices" in a menu or by selecting "Show Secondary Choices" in a dialog. Once you've selected Coalescence Contained, you will be asked to choose Effective Population Size.

The simulation starts at each extant population. Within each, the ancestry of the gene copies contained (as specified by the Taxa Association) is simulated by coalescence, going backward in time until the simulation arrives at the previous population divergence. These within-branch simulations use the same calculations and assumptions as the Single Population simulations (neutrality, panmixia). There is no migration among populations. The length of time allowed within this branch is the length of the branch, which is treated as the number of generations. (Thus, branch lengths of the population tree will typically be large, e.g., 1,000 to 1,000,000 or more.) The population size is determined by the chosen Ne, and is constant throughout the simulation unless modified by branch widths. Branch widths, which can be controlled by the Adjust Lineage Widths tool (the horizontal ruler) in the Tree Window, are treated as multipliers of the basic Ne. Thus, if the lineage width is unspecified or is 1.0, then the indicated Ne is used directly. If the lineage width of a branch is 0.5, then the population size along that branch is 0.5*(indicated Ne). Population fluctuations such as bottlenecks can be introduced explicitly along a single branch by inserting extra nodes within the branch using the Insert Nodes tool, and then varying the widths of the different segments independently. This is shown in the example file 08-fluctuating.nex.

By the time the simulation reaches a branch point, i.e. a population divergence, coalescence may have resulted in a single remaining ancestor of the sampled gene copies, or there may remain more than one ancestor. Whatever gene ancestors remain, they are united with the gene ancestors remaining in the sister population into the ancestral (pre-divergence) population. Coalescence then proceeds from there, moving backwards along the ancestral branch, and so on, until the root of the population tree is reached. Then the simulation continues in the root until only a single gene ancestor remains. The branch lengths in the resulting gene tree reflect the generations in which each coalescence occurred.
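The procedure in the two paragraphs above can be sketched for the simplest case of two sister populations. This is an illustrative Python sketch using exponential waiting times, not Mesquite's implementation; it only counts lineages rather than building the gene tree, and all function and parameter names are hypothetical.

```python
import random

def lineages_leaving_branch(n_lineages, ne, width, length, rng):
    """Gene lineages remaining at the top (older end) of one population branch.

    width multiplies ne; length is the branch length in generations, or None
    for the root, where coalescence continues until one ancestor remains.
    """
    branch_ne = ne * width
    remaining = n_lineages
    elapsed = 0.0
    while remaining > 1:
        rate = remaining * (remaining - 1) / (2.0 * branch_ne)
        elapsed += rng.expovariate(rate)
        if length is not None and elapsed > length:
            break  # the divergence is reached before the next coalescence
        remaining -= 1
    return remaining

def simulate_two_populations(genes_a, genes_b, ne, length, width_a, width_b, rng):
    """Return (lineages from A, lineages from B, lineages left in the root)."""
    from_a = lineages_leaving_branch(genes_a, ne, width_a, length, rng)
    from_b = lineages_leaving_branch(genes_b, ne, width_b, length, rng)
    # Survivors from both sister populations are pooled in the ancestor.
    at_root = lineages_leaving_branch(from_a + from_b, ne, 1.0, None, rng)
    return from_a, from_b, at_root

rng = random.Random(7)
print(simulate_two_populations(10, 10, 10000, 5000, 1.0, 0.5, rng))
```

A branch with width 0.5 has half the population size, so on average fewer lineages survive to its top than on a sister branch of width 1.0 and the same length.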

Reconstructing gene history within population history

If we are given a gene tree and a population tree, how can we interpret how the gene tree fits within the population tree? In what ancestral population did each gene divergence occur? Mesquite is currently able to make this reconstruction under only one assumption: that the only process occurring is lineage sorting (there is no migration among populations). Thus, the reconstruction reconciles the gene tree into the population tree (Page and Charleston, 1997) so as to minimize the depths of gene tree divergences (i.e., minimizing the implied incompleteness of lineage sorting).

This reconstruction is performed by Mesquite when visualizing gene trees within population trees using the Contained Associates tree drawing mode and when counting deep coalescences, both of which are described below. Three parameters determine how the reconstruction is done:

Treat Contained As Unrooted: If enabled, then all possible rootings of the gene tree are tried to find that which minimizes incompleteness of lineage sorting. This is typically an appropriate setting for empirical gene trees (which are typically unrooted) but inappropriate for fitting simulated gene trees into the population history on which they were simulated, because the roots of these gene trees are known. (Default: disabled)
Contained Polytomies auto-resolve: If enabled, then polytomies in the gene tree will be automatically resolved into dichotomies so as to minimize incompleteness of lineage sorting. In the visualization of Contained Associates, such resolved areas are colored magenta. (Default: enabled)
Use Branch lengths of Contained tree: If enabled, then the branch lengths of the contained (gene) tree will be respected in fitting into the population tree. Thus, if the gene tree's branches are long and the population tree's branches short, the gene tree will be interpreted as extending deep in time past the root of the species tree, even if this implies lineage sorting is more incomplete than it might otherwise need to be. If disabled, then the branch lengths of the gene tree are ignored in minimizing incompleteness of lineage sorting. For fitting simulated gene trees into the population history on which they were simulated, it is usually best to enable this option, for then the fit will reflect the actual history. (Default: enabled)

Visualizing gene history in population history

The visualization of green gene trees embedded within blue population trees shown elsewhere on this page is done by the Contained Associates tree drawer. This tree drawer can be used in various contexts where trees are drawn (e.g., the Tree Window), but it requires that the data file is already prepared with two blocks of taxa and their association. To ask for this visualization, have open a Tree Window showing trees of Populations (Contained Associates draws population trees). Select Drawing>Tree Form>Other Choices and then choose Contained Associates from the dialog box. You will be asked what gene trees to draw within the population tree. The gene trees could be from any of the usual sources: stored in the file, or simulated. You can choose to show gene trees simulated within the population tree itself at that moment.

Once Contained Associates is showing the tree, many of its controls will be in the Contained menu, not in the Drawing menu as usual. If you want the contained gene tree to appear in a separate window in addition to embedded in the population tree, select Contained>Display Contained Tree.

Measuring fit between genes and populations

If the genes of a population do not form a clade (i.e., a monophyletic group) on the gene tree, then there is discordance between the gene tree and the population subdivision. To measure this discordance, Mesquite currently has two measures:

s of Slatkin & Maddison (1989). This measures discord between the gene tree and the subdivision into populations; it does not pay attention to a population tree. It treats the populations as a categorical character and counts the number of parsimony steps in this character on the gene tree; the more scattered on the gene tree are the genes from a population, the higher the s value. If the populations have been long separate and the only process causing this scatter is migration, then s can be interpreted as the minimum number of migration events between the populations.

Deep coalescences of W. Maddison (1997). This measures the discordance between a gene tree and a population tree. It assumes that all discord between the population and gene trees is due to incomplete lineage sorting, and counts the number of extra gene lineages (beyond the minimum of 1) on each branch of the population tree summed over all population branches. The number of extra lineages is counted on the best fit of the gene tree into the population tree using the reconstruction methods described in the previous section. Deep coalescence can be calculated in various ways:
- Deep Coalescences (gene tree): This calculates deep coalescence from the gene tree's point of view. That is, it can be applied to a gene tree. It seeks a current population tree (for instance, one in a Tree Window) and measures the fit of the gene tree within that current population tree.
- Deep Coalescences (species tree): This calculates deep coalescence from the population (or species) tree's point of view. That is, it can be applied to a species tree. It seeks a gene tree from available tree sources (e.g., gene trees stored in the data file) and measures how well it fits within the species tree.
- Deep Coalescence Multiple Loci: This also calculates deep coalescence from the population (or species) tree's point of view, but for multiple gene trees simultaneously. It seeks a set of gene trees from available tree sources (e.g., gene trees stored in the data file) and sums deep coalescences for all of them. For instance, if you choose Stored Trees as the source of gene trees, all of the gene trees in a single stored tree block will be used. By summing deep coalescences for these gene trees, each is treated as if its descent was independent from the others, that is, an independent locus. You might expect each locus to be represented by a different block of taxa, but the Coalescence package of Mesquite cannot yet sum deep coalescences across taxa blocks. What if you include loci with different sample sizes in the different populations (or species), for instance locus P has 7 sequences from species A, 3 from B, 4 from C, while locus Q has 5 sequences from A, 3 from B and 6 from C? You can create a taxa block with sufficient taxa to accommodate both (7 genes contained in A, 3 in B, 6 in C) and then, for each of the gene trees representing the different loci, exclude the extra genes as needed.

These measures assign a value to a tree, and thus are available as "Numbers for Trees". They can be calculated and displayed in various contexts, such as a histogram of values for trees, or in the Tree Legend, or as a column in the List of Trees window. They can also be used in tree searches.
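The deep coalescence count described above can be sketched for the simplest settings: a rooted, fully resolved gene tree whose branch lengths are ignored. The Python below is an illustrative sketch, not Mesquite's code. Trees are written as nested tuples, the names are hypothetical, and the root of the population tree is not counted as a branch, which is an assumption of this example.

```python
def node_sets(tree, relabel=None):
    """Return (label set of this node, list of (child_set, parent_set) edges).

    Leaves are names; relabel optionally maps a gene name to its population.
    """
    if not isinstance(tree, tuple):
        name = relabel[tree] if relabel else tree
        return frozenset([name]), []
    edges = []
    child_sets = []
    for child in tree:
        child_set, child_edges = node_sets(child, relabel)
        child_sets.append(child_set)
        edges.extend(child_edges)
    here = frozenset().union(*child_sets)
    edges.extend((child_set, here) for child_set in child_sets)
    return here, edges

def deep_coalescences(gene_tree, population_tree, gene_to_population):
    """Sum of extra gene lineages over all non-root population branches."""
    _, gene_edges = node_sets(gene_tree, gene_to_population)
    _, population_edges = node_sets(population_tree)
    extra = 0
    for branch, _parent in population_edges:
        # A gene lineage crosses the top of this branch when its lower node
        # fits inside the branch's populations but its upper node does not.
        lineages = sum(1 for child, parent in gene_edges
                       if child <= branch and not parent <= branch)
        extra += max(0, lineages - 1)
    return extra

owner = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C", "c2": "C"}
populations = (("A", "B"), "C")
concordant = ((("a1", "a2"), ("b1", "b2")), ("c1", "c2"))
discordant = ((("a1", "b1"), ("a2", "b2")), ("c1", "c2"))
print(deep_coalescences(concordant, populations, owner))
print(deep_coalescences(discordant, populations, owner))
```

In the discordant tree neither a1 and a2 nor b1 and b2 coalesce within their own population, so populations A and B each pass one extra lineage back to their common ancestor.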

Example: Effect of population divergence time on s

If you have reconstructed a gene tree from sampled sequences from two populations, you may want to use the degree of scrambling of the genes from the two populations as a measure of completeness of lineage sorting, and thus time since divergence. Although the s statistic was designed to measure gene flow, it might be used instead to measure time since divergence (assuming there is no ongoing gene flow). In the example below simulations are used to derive the expected s values under different divergence times. Ten genes are in each of two populations. Population trees with branch lengths of 5,000 and 10,000 are compared. These branch lengths are used as number of generations for the coalescence simulations, which here are using an effective population size of 10,000. The green-in-blue images of gene trees in population trees show the Contained Associates tree drawing mode of the Tree Window, with the gene trees simulated by Coalescence Contained within Current Tree. The charts are a separate calculation based on the same population trees; they are Histograms for Trees, in which the block of taxa for the trees is the genes, the value for the trees is the s statistic, and the gene trees are derived by simulations using Coalescence Contained within Current Tree. Note that with divergence 5,000 generations ago, s values of 3 are fairly common, whereas they are quite uncommon with divergence at 10,000 generations. By adjusting branch lengths, hypotheses can be tested and confidence limits derived.
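The comparison in this example can be imitated outside Mesquite with a small Monte Carlo sketch. The Python below is an illustration under the same stated assumptions (two isolated populations, ten genes each, haploid Ne of 10,000), not Mesquite's code. It uses exponential waiting times, counts s as the parsimony steps of the population label on the simulated gene tree, and all function names are hypothetical.

```python
import random

def coalesce(lineages, ne, time_limit, rng):
    """Join random pairs of lineages going back in time.

    Stops at time_limit generations, or at one lineage if time_limit is None.
    """
    lineages = list(lineages)
    elapsed = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        elapsed += rng.expovariate(k * (k - 1) / (2.0 * ne))
        if time_limit is not None and elapsed > time_limit:
            break
        i, j = rng.sample(range(k), 2)
        joined = (lineages[i], lineages[j])
        lineages = [x for index, x in enumerate(lineages) if index not in (i, j)]
        lineages.append(joined)
    return lineages

def parsimony_steps(tree):
    """Fitch parsimony: return (state set, steps) for a tree of population labels."""
    if not isinstance(tree, tuple):
        return {tree}, 0
    left_states, left_steps = parsimony_steps(tree[0])
    right_states, right_steps = parsimony_steps(tree[1])
    shared = left_states & right_states
    if shared:
        return shared, left_steps + right_steps
    return left_states | right_states, left_steps + right_steps + 1

def simulate_s(divergence, ne, genes_per_population, rng):
    """Simulate one gene tree in a two-population history and return its s."""
    from_a = coalesce(["A"] * genes_per_population, ne, divergence, rng)
    from_b = coalesce(["B"] * genes_per_population, ne, divergence, rng)
    root = coalesce(from_a + from_b, ne, None, rng)
    return parsimony_steps(root[0])[1]

rng = random.Random(3)
for divergence in (5000, 10000):
    values = [simulate_s(divergence, 10000, 10, rng) for _ in range(500)]
    print(divergence, sum(values) / len(values), values.count(3) / len(values))
```

Each printed line gives the divergence time, the mean s, and the fraction of simulated trees with s equal to 3. The fraction of simulated trees with s at or above an observed value is one way to read off a probability for a test.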

Inferring the population or species tree

Mesquite can infer relationships of populations using contained gene trees or gene sequences, but currently its algorithms are relatively crude. Tree search and Cluster analysis are two options.

Tree search

Population trees may be inferred via a tree search that finds those population trees in which observed gene trees fit best (Maddison, 1997; Page and Charleston, 1997). Mesquite's tree search facility can be used to seek population trees that minimize deep coalescences (Maddison, 1997). Select Taxa&Trees>Make New Trees Block From>Other Choices and indicate Tree Search in the dialog box. Select populations as the taxa for the new trees block (the search will produce population trees). When asked for the criterion for the tree search, check "Show Secondary Choices". You will see three choices for deep coalescences. Deep Coalescences (gene tree) is inappropriate because it assesses deep coalescence from the point of view of the gene tree, but your goal is to assess and choose species (population) trees. Thus, choose either Deep Coalescences (species tree) or Deep Coalescence Multiple Loci. Deep Coalescences (species tree) will ask you what gene tree to use as the basis for measuring deep coalescence within the candidate population trees. Deep Coalescence Multiple Loci will ask you what block of gene trees to use. More details on these criteria are given under measuring fit.

Mesquite's tree search does not yet infer branch lengths, and thus the fit between gene trees and species trees is measured so as to ignore branch lengths.

Cluster analysis

Cluster analysis can be used to infer population trees by similarity of contained genes. Select Taxa&Trees>Make New Trees Block From>Other Choices and indicate Cluster Analysis in the dialog box. Indicate you want the taxa of the new trees block to be populations (as we are building a populations tree). For the measure of distance, indicate Distance of Contained Taxa (this is a secondary choice), and then for the distance among contained taxa choose Uncorrected Distance or Patristic Distance (a secondary choice). "Uncorrected Distance" counts the simple number of differences in gene sequences, and thus choosing this will yield a population tree that depends on a data matrix of the contained genes, but not on a gene tree. "Patristic Distance" measures distance along the branches of the gene tree, and thus requires a gene tree but no gene sequence matrix. If for the method to count distances among contained taxa you choose "Closest", and then "Single Linkage" as the cluster method, you will have an inference method similar to that implied by Takahata (1989): the similarity between two populations is judged by their most similar pair of gene sequences (not their average pairwise sequence divergence).
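The combination of "Closest" distance and single linkage can be sketched as follows. This Python is an illustration of the idea only, not Mesquite's algorithm; the sequences are invented for the example, the function names are hypothetical, and the result is a nested tuple rather than a tree with branch lengths.

```python
from itertools import combinations

def uncorrected_distance(first, second):
    """Number of aligned sites at which two sequences differ."""
    return sum(1 for x, y in zip(first, second) if x != y)

def closest_distance(sequences_1, sequences_2):
    """Distance between two populations: their most similar pair of sequences."""
    return min(uncorrected_distance(s, t) for s in sequences_1 for t in sequences_2)

def cluster_populations(populations):
    """Single-linkage clustering; returns the population tree as nested tuples."""
    clusters = {name: list(sequences) for name, sequences in populations.items()}
    while len(clusters) > 1:
        first, second = min(
            combinations(clusters, 2),
            key=lambda pair: closest_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        # Pooling the sequences makes the merged cluster's distance to any
        # other cluster the minimum over its members (single linkage).
        merged = clusters.pop(first) + clusters.pop(second)
        clusters[(first, second)] = merged
    return next(iter(clusters))

populations = {
    "A": ["AAAAAA", "AAAAAT"],
    "B": ["AAAATT", "AAATTT"],
    "C": ["TTTTTT", "TTTTTA"],
}
print(cluster_populations(populations))
```

Here A and B join first because their closest pair of sequences differs at a single site, even though other pairs between them differ more.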

Simulating sampled gene sequences

Simulations can be used to generate gene sequences evolved under genetic drift and various models of mutation, either within a single population or within a history of diverging populations. To do this, the gene trees simulated as described above (in a single population or multiple populations) are used as the basis, and mutations are layered on top of the gene tree to yield a series of simulated sampled sequences. If done on a single gene tree, a gene sequence matrix results (each taxon a sampled gene copy; each character a site in the sequence). However, one can replicate this process automatically to produce many matrices, and thus obtain statistical distributions to test hypotheses. We introduce below some of the possible ways to generate sequences. For more details see the Character Simulations page, which explains how the Genesis package of Mesquite can be used to simulate nucleotide evolution.

To simulate sequence evolution you should first define a model of evolution as described here. A key issue in simulating is using the scaling factor of the model to compensate for the units by which branch lengths are measured. Gene trees simulated by coalescence have branch lengths measured in generations, which may be in the thousands or millions, whereas most standard stochastic models expect trees whose branch lengths are much less than 10 for typical sequence divergences. For gene trees with lengths measured in generations, small scaling factors (e.g., less than 0.0001) should be used. We do not yet have recommendations as to exactly what scaling factor to use. We suggest you simulate a few matrices to find the scaling factor that gives you sequence divergences in the range desired.
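A back-of-the-envelope calculation can suggest a starting value before trial simulations. The Python sketch below assumes that the scaling factor simply multiplies branch lengths (in generations) to give expected substitutions per site, and it uses the Jukes-Cantor model only as a convenient illustration; both are assumptions of this example rather than statements about Mesquite's models, and the function names are hypothetical.

```python
import math

def scaling_for_target(depth_generations, target_substitutions_per_site):
    """Scaling factor giving the target divergence between two sequences.

    Two sampled sequences are separated by twice the depth of their common
    ancestor, so the path between them is 2 * depth_generations long.
    """
    return target_substitutions_per_site / (2.0 * depth_generations)

def jukes_cantor_difference(substitutions_per_site):
    """Expected proportion of differing sites under the Jukes-Cantor model."""
    return 0.75 * (1.0 - math.exp(-4.0 * substitutions_per_site / 3.0))

factor = scaling_for_target(2000, 0.02)
print(factor)
print(jukes_cantor_difference(2.0 * 2000 * factor))
```

With a gene tree about 2,000 generations deep, a target of 0.02 substitutions per site between the most distant sequences corresponds to a factor of a few millionths, in line with the advice above to use small values.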

Generating a single matrix of sequences

Gene tree already available in Tree Window

If a gene tree (simulated or otherwise) is shown in a Tree Window, you can simulate sequences simply by selecting Characters>Make New Matrix From>Simulated Matrices on Current Tree. (By "shown in a Tree Window" we don't mean shown as a thin green tree within the blue population tree in the Contained Associates drawing mode. We mean in a Tree Window dedicated to showing trees for the block of taxa corresponding to genes.) If your file has multiple taxa blocks you'll be asked for which you want a new matrix; indicate the taxa corresponding to genes. Indicate that you want Evolve DNA Characters. You will be asked to choose a model of evolution and a number of characters (i.e., sequence length). After it's done, the simulated matrix will be shown to you in a Character Matrix Editor window. If the sequences appear highly saturated (many changes) it may mean the scaling factor was improperly set.

Gene tree not in Tree Window

Select Characters>Make New Matrix From>Other Choices..., then indicate Simulated Matrices on Trees as the choice. Using this, Mesquite will get the gene tree on which to simulate sequences not from a Tree Window, but from some other available source of trees, such as gene trees stored in a trees block, or simulated at that moment by coalescence. After being asked for details about the model of evolution and number of characters, you will be asked to specify "Source of trees on which to simulate character evolution for matrices". Here you choose the source of gene trees. One possibility is to choose Simulated Trees then Coalescent Trees or (under secondary choices) Coalescence Contained within Current Tree, in which case the matrix will be simulated on a gene tree simulated by coalescence at that moment.

Generating a series of matrices

Multiple replicate sequence matrices can be generated and stored to files. This can allow you to perform a statistical test, for instance generating 100 sequence matrices under some hypothetical scenario, then examining them to see if their properties match those of an observed matrix.

On a single gene tree

To generate multiple sequence matrices evolved on a single gene tree, display the gene tree in a Tree Window. Then choose Characters>Save Multiple Matrices>Simulated Matrices on Current Tree. You will be asked to set up the simulation as for a single matrix (see above), except that you will also be asked to supply a base name for the file, the number of matrices, and the file format. If you choose "test" as the base name, 4 matrices, and the NEXUS file format, then four matrices will be simulated and written to the files test0.nex, test1.nex, test2.nex and test3.nex. You will be asked where to save the files.

Each matrix on a different gene tree

Matrices alone —To simulate a series of sequence matrices, each one evolved on a separate gene tree, select Characters>Save Multiple Matrices>Other Choices..., then select Simulated Matrices on Trees. Your choices will be similar to the preceeding single tree case, except that instead of automatically choosing a gene tree in a tree window, it will use gene trees from the selected tree source. The first matrix will be simulated on the first gene tree from the tree source, the second from the second, and so on. Thus, if you choose Simulated Trees, Coalescent Trees as your tree source, then you will be generating a series of matrices, each simulated on a different gene tree simulated by coalescence within a population. These matrices represent a series of replicates of samples of gene sequences from a population evolving under drift with the specified model of mutation. You can similarly use Coalescence Contained within Current Tree to simulate genes evolving in a divergent population history.

Matrices plus batch files: The many sequence files generated by the preceding option can be analyzed by hand or by some other program. However, if you want Mesquite to help you analyze them automatically, you can ask it to generate batch files that script Mesquite or some other program to analyze the resulting files one after another. A batch file is simply a text file of step-by-step instructions to a program. Whether you can do your desired analysis this way depends on whether the program you want to use can be scripted using batch files, and whether someone has designed a batch file template for the analysis.

To generate a series of matrices and corresponding batch files to analyze them, select Analysis>Batch Architect>Export Matrices & Batch Files. You will be asked to specify how the matrices are to be generated, and then you will be presented with a dialog titled "Export Matrices & Batch Files". In this dialog you indicate the number of matrices to save, the base name for the files, and which batch file template to use. An example is given below, and details on the use of Export Matrices & Batch Files are given on the page on Character Simulations.

Example: Multiple simulations of sequence samples

Suppose you have sampled 20 gene copies from two populations, A and B, and reconstructed a gene tree. You notice that the copies from A and B do not form respective monophyletic groups, but rather are somewhat intermingled on the tree. You calculate the degree of intermingling using Slatkin & Maddison's s, and determine it to be 4. You want to know: what would be the probability of observing an s of 4 if the two populations had effective population sizes about the same as the number of generations since divergence, say 10,000? (Assume the populations have been completely isolated since divergence.) How can this probability be calculated?

First set up a data file with 20 genes associated with 2 populations, and display a tree window for populations showing a tree with divergence at 10,000 generations:

One way to get a quick answer is to select Analysis>New Histogram for>Trees and indicate that you want trees representing genes, and that you want Simulated Trees, with the tree simulator Coalescence Contained within Current Tree. Indicate as the value to calculate (under secondary choices) s of Slatkin and Maddison. This would simulate a series of gene trees within your proposed population history and plot their distribution of s values. However, these are the true simulated gene trees, and you don't know that your empirical gene tree is in fact correct. It would be better to simulate not just gene tree evolution but also gene tree reconstruction, so that you can compare simulated reconstructed gene trees with your empirical reconstructed gene tree. Thus, we will simulate the gene trees, simulate sequence evolution on them, and then attempt to reconstruct gene trees from those sequences.
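For readers unfamiliar with the statistic: Slatkin & Maddison's s is the minimum number of migration events needed to explain where the gene copies fall on the gene tree, obtained by treating population of origin as an unordered character and counting parsimony steps. The Python sketch below counts those steps with Fitch's algorithm on small binary trees written as nested tuples. The trees, gene names, and function names are invented for illustration, and this is not Mesquite's code.

```python
def fitch_steps(tree, population_of):
    """Fitch parsimony for the population character on a binary tree.

    tree: a leaf name (str) or a tuple of two subtrees.
    population_of: dict mapping each leaf name to its population label.
    Returns (set of most parsimonious states at this node, steps so far).
    """
    if isinstance(tree, str):
        return {population_of[tree]}, 0
    left_set, left_steps = fitch_steps(tree[0], population_of)
    right_set, right_steps = fitch_steps(tree[1], population_of)
    shared = left_set & right_set
    if shared:
        return shared, left_steps + right_steps
    # No state in common: one more change (migration event) is required.
    return left_set | right_set, left_steps + right_steps + 1


def slatkin_maddison_s(tree, population_of):
    """Minimum number of migration events implied by the gene tree."""
    return fitch_steps(tree, population_of)[1]


# Hypothetical sample: three gene copies from each of populations A and B.
population_of = {"a1": "A", "a2": "A", "a3": "A",
                 "b1": "B", "b2": "B", "b3": "B"}
monophyletic = ((("a1", "a2"), "a3"), (("b1", "b2"), "b3"))
intermingled = ((("a1", "b1"), "a2"), (("b2", "a3"), "b3"))

print(slatkin_maddison_s(monophyletic, population_of))  # 1
print(slatkin_maddison_s(intermingled, population_of))  # 3
```

With two populations, s is 1 when each population's copies form a monophyletic group and rises as the copies become more intermingled.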

The procedure is as follows. First, set up the file to have the two taxa blocks (genes, populations) and the association between them. Display a tree window for the populations and set it to match your hypothetical scenario of population history. Our population history has two populations diverging 10,000 generations ago, with effective population size of 10,000. Build a model of sequence evolution. For instance, we built one like this:

The scaling factor (here, 1.0E-6 which is 0.000001) was chosen to yield fairly low sequence divergences under the expectation that the branch lengths on the gene tree could be as long as 10,000 to 50,000 (generations). We tested it by simulating a few matrices to see that the divergences were as desired.
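The reasoning behind this choice can be checked with simple arithmetic. The sketch below assumes that the expected number of substitutions per site along a branch is roughly the scaling factor multiplied by the branch length in generations; that reading of the scaling factor is an assumption, as is the function name. The 1000 characters are the matrix length used in this example.

```python
def expected_substitutions(scaling_factor, branch_length, num_characters):
    """Return (expected changes per site, expected changes per sequence).

    Assumes expected change per site = scaling_factor * branch_length.
    """
    per_site = scaling_factor * branch_length
    return per_site, per_site * num_characters


for branch_length in (10_000, 50_000):
    per_site, per_sequence = expected_substitutions(1.0e-6, branch_length, 1000)
    print(f"{branch_length} generations: {per_site:.2f} per site, "
          f"about {per_sequence:.0f} changes over 1000 characters")
```

Under this assumption the branches in question would carry expected divergences of about 1% to 5%, which matches the "fairly low sequence divergences" intended.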

Then choose Analysis>Batch Architect>Export Matrices & Batch Files, indicating to save matrices for the genes. The matrices to be exported come from Simulated Matrices on Trees (a secondary choice). Choose Evolve DNA Characters, your model, and 1000 characters to indicate how each matrix is to be simulated. For the trees on which to simulate, indicate Simulated Trees, then Coalescence Contained within Current Tree (a secondary choice). Indicate an Ne of 10,000.

You will be shown the Matrices & Batch Files dialog:

Choose a template that will reconstruct the trees for you from the matrices. We will assume for the rest of this example that the template "Basic PAUP Tree Search" was chosen. Indicate how many matrices to make, and the base name to give to the files (here, "simForS"). Generating the matrices may take a while. Mesquite will produce a series of files containing the matrices (here, "simForS0.nex", "simForS1.nex", etc.), along with the files TreeFileList, [base name]BasisTrees.nex, and commands.nex. TreeFileList is a simple text file listing the names of the tree files that PAUP* will produce, in this case simForS0.trees, etc. [base name]BasisTrees.nex (in this example, simForSBasisTrees.nex) stores all of the gene trees that were simulated and used to generate the matrices; these are saved to document what was done in the simulation. The file commands.nex is the batch file that tells PAUP* to execute each of the data files and infer trees for each. Start PAUP* and ask it to execute commands.nex.
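To make the idea of a batch file concrete, the Python sketch below writes out the kind of text such a file might contain. The exact contents of Mesquite's "Basic PAUP Tree Search" template are not shown here, so the commands chosen (execute, hsearch, savetrees) and both function names are assumptions for illustration; the real template also does more, such as computing consensus trees.

```python
def build_batch_text(base_name, count):
    """Build a hypothetical PAUP* batch file: one tree search per matrix."""
    lines = ["#NEXUS", "begin paup;"]
    for i in range(count):
        lines.append(f"  execute {base_name}{i}.nex;")  # load matrix i
        lines.append("  hsearch;")                       # heuristic tree search
        lines.append(f"  savetrees file={base_name}{i}.trees replace=yes;")
    lines.append("end;")
    return "\n".join(lines) + "\n"


def build_tree_file_list(base_name, count):
    """Build the text of a simple list of the expected tree file names."""
    return "\n".join(f"{base_name}{i}.trees" for i in range(count)) + "\n"


print(build_batch_text("simForS", 2))
print(build_tree_file_list("simForS", 2))
```

The point of the design is that the same three steps are repeated for every matrix, so the batch file can be generated mechanically from the base name and the number of matrices.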

PAUP* should produce the following files, assuming the base name you chose is "basename":

basename0.trees, basename1.trees, etc.: trees reconstructed by PAUP* from each of the matrices
CBbasename: the consensus trees from each of the tree searches, stored as a series of distinct tree blocks in a single file
consensus.trees: the consensus trees fused into a single tree block
ConsCons.trees: the consensus of the consensus trees from all the matrices. (In this example, this file is not very useful and will likely contain a fully unresolved tree.)

The key file for us is consensus.trees, because it contains the reconstructed trees, one for each matrix (100 trees if you asked for 100 matrices). Back in Mesquite, ask to Link or Include this file. Then select Analysis>New Histogram for>Trees and indicate that you want trees representing genes, Stored Trees, and as the value to calculate (under secondary choices) s of Slatkin and Maddison. These Stored Trees will be those reconstructed from the simulated matrices. The chart may look like this:

As you can see, it appears unlikely that you would reconstruct a gene tree giving an s value of 4 under the scenario of population history.
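The histogram can be turned into an approximate probability by counting how many of the simulated, reconstructed trees give an s matching the observed value. The Python sketch below shows the counting; the list of s values is invented purely for illustration and is not a result from Mesquite, and the function name is hypothetical. Which tail is relevant depends on the question being asked, so several proportions are reported.

```python
def empirical_proportions(simulated_s, observed_s):
    """Proportions of simulated s values equal to, >=, and <= the observed s."""
    total = len(simulated_s)
    return {
        "equal": sum(1 for s in simulated_s if s == observed_s) / total,
        "at_least": sum(1 for s in simulated_s if s >= observed_s) / total,
        "at_most": sum(1 for s in simulated_s if s <= observed_s) / total,
    }


# Invented values standing in for s measured on 100 reconstructed trees.
simulated_s = [1] * 70 + [2] * 20 + [3] * 8 + [4] * 2

result = empirical_proportions(simulated_s, observed_s=4)
print(result)
```

With only 100 replicates such proportions are coarse estimates; more matrices give a finer estimate at the cost of longer simulation and tree search times.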

References

Maddison, W.P. 1997. Gene trees in species trees. Systematic Biology 46:523-536.

Slatkin, M. and W.P. Maddison. 1989. A cladistic measure of gene flow inferred from the phylogeny of alleles. Genetics 123:603-613.

Takahata, N. 1989. Gene genealogy in three related populations: Consistency probability between gene and population trees. Genetics 122:957-966.

Page, R.D.M. and M.A. Charleston. 1997. From gene to organismal phylogeny: Reconciled trees and the gene tree species tree problem. Molecular Phylogenetics and Evolution 7:231-240.
