
Advanced Demography Modeling Software Overview
Explore Fastsimcoal 2, a powerful demography modeling software, allowing efficient genetic diversity generation for various markers and complex evolutionary scenarios. Learn how to estimate demographic parameters using site frequency spectrum (SFS) analysis and visualize population genetics data.
Wu Dongyi & Fan Gong 2022-1-12
Fastsimcoal 2 (fsc2) is a very flexible demographic modeling program developed in 2016 by Laurent Excoffier's group at the University of Bern.
Features
While preserving the simulation flexibility of simcoal 2, fastsimcoal 2 is implemented under a faster continuous-time sequential Markovian coalescent approximation, allowing it to efficiently generate genetic diversity for different types of markers along large genomic regions, for both present and ancient samples.
It includes a parameter sampler, allowing its integration into Bayesian or likelihood parameter estimation procedures.
Features
fastsimcoal 2 can handle very complex evolutionary scenarios, including population bottlenecks and expansions, population resizing, and migration.
Fastsimcoal 2 now allows one to estimate demographic parameters from the site frequency spectrum (SFS), using simulations to compute the expected SFS and a robust method for the maximization of the composite likelihood.
What is the site frequency spectrum (SFS)?
The site frequency spectrum (SFS) is a term used in population genetics for the distribution of allele frequencies across loci in a population, i.e. the abundance of alleles in the gene pool. Site frequency spectra (SFSs) are widely used to summarize patterns of genome-wide variation at the single nucleotide polymorphisms (SNPs) that abound in virtually all organisms.
For one population, the SFS is called the 1-dimensional SFS and is very easy to visualise as a concept: it counts how many sites (y-axis) carry an allele that occurs a certain number of times (categories of the x-axis) within the population.
The SFS can be inferred from the derived allele frequencies in a VCF file.
Genotype code: 0 (homozygous for the reference allele), 1 (heterozygous), 2 (homozygous for the alternate allele).
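The counting step can be sketched in a few lines of Python. This is a minimal illustration, not part of fastsimcoal 2: the function name `sfs_1d` and the toy genotypes are hypothetical, and the sketch assumes diploid individuals, no missing data, and that the alternate allele is the derived one.

```python
from collections import Counter

def sfs_1d(genotypes_per_site, n_individuals):
    """Return the unfolded 1D SFS as a list of length 2 * n_individuals + 1.

    genotypes_per_site holds one list per site, with a 0/1/2 code per
    diploid individual (0 = homozygous reference, 1 = heterozygous,
    2 = homozygous alternate). Assumption: alternate == derived allele.
    """
    n_chrom = 2 * n_individuals
    # The sum of the genotype codes at a site is its derived allele count.
    counts = Counter(sum(site) for site in genotypes_per_site)
    return [counts.get(k, 0) for k in range(n_chrom + 1)]

# Hypothetical toy data: 5 sites, 3 diploid individuals
sites = [
    [0, 0, 1],  # derived allele seen once
    [1, 0, 1],  # derived allele seen twice
    [0, 0, 0],  # monomorphic site
    [2, 1, 0],  # derived allele seen three times
    [0, 1, 0],  # derived allele seen once
]
sfs = sfs_1d(sites, n_individuals=3)
print(sfs)  # [1, 2, 1, 1, 0, 0, 0]
```

Entry k of the list is the number of sites at which the derived allele was observed k times; entry 0 holds the monomorphic sites.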
Expanding the SFS to multiple populations: the 2D-SFS
Entry (i, j) of the joint site frequency spectrum contains the number of SNPs at which the allele count is i in Pop1 and j in Pop2.
The VCF file therefore corresponds to a matrix.
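The two-population case works the same way, except that each site increments one cell of a matrix. The sketch below is only illustrative: `sfs_2d` and the toy data are hypothetical names and values, and it again assumes diploid 0/1/2 genotype codes with no missing data.

```python
def sfs_2d(pop1_sites, pop2_sites, n1, n2):
    """Joint SFS: matrix[i][j] is the number of sites with derived allele
    count i in Pop1 and j in Pop2 (n1 and n2 diploid individuals)."""
    matrix = [[0] * (2 * n2 + 1) for _ in range(2 * n1 + 1)]
    # Sites must be listed in the same order for both populations.
    for g1, g2 in zip(pop1_sites, pop2_sites):
        matrix[sum(g1)][sum(g2)] += 1
    return matrix

# Hypothetical toy data: 4 sites, 1 diploid individual per population
pop1 = [[0], [1], [2], [1]]
pop2 = [[1], [1], [0], [1]]
joint = sfs_2d(pop1, pop2, n1=1, n2=1)
for row in joint:
    print(row)
# [0, 1, 0]
# [0, 2, 0]
# [1, 0, 0]
```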
Under arbitrarily complex evolutionary scenarios, parameters such as times, effective population sizes, and growth rates can be inferred.
Usage: preparing the input files
Observed (joint) SFS: ${PREFIX}_DAFpop0.obs
Template file: ${PREFIX}.tpl
Estimation file: ${PREFIX}.est
Preparing the input files: observed (joint) SFS, ${PREFIX}_DAFpop0.obs
The observed SFS can be a derived SFS (also known as DAF or unfolded SFS) if the ancestral state is known, or a minor allele frequency SFS (MAF or folded SFS) otherwise.
Three ways to obtain the SFS (*.obs) from a VCF file:
1. Convert the VCF file into an *.arp file with PGDSpider, then use the *.arp file as input for Arlequin to calculate the SFS (*.obs).
2. easySFS software: VCF to SFS (*.obs).
3. ANGSD software: VCF to SFS (*.obs), for data sets with a large amount of missing data and low coverage.
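The relationship between the derived (unfolded) and the MAF (folded) SFS can be illustrated with a short sketch. The helper `fold_sfs` and the example counts are hypothetical; the tools listed on this slide produce the *.obs files directly, so this only shows the idea of folding.

```python
def fold_sfs(unfolded):
    """Fold a derived (unfolded) 1D SFS into a minor allele frequency SFS.

    unfolded[k] is the number of sites with derived allele count k among
    n = len(unfolded) - 1 sampled chromosomes. When the ancestral state is
    unknown, counts k and n - k cannot be told apart, so they are pooled.
    """
    n = len(unfolded) - 1
    folded = []
    for k in range(n // 2 + 1):
        if k == n - k:
            # Middle class (n even): it pairs with itself, so no pooling.
            folded.append(unfolded[k])
        else:
            folded.append(unfolded[k] + unfolded[n - k])
    return folded

# Hypothetical unfolded SFS for n = 4 chromosomes
print(fold_sfs([10, 5, 3, 2, 1]))  # [11, 7, 3]
```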
Preparing the input files: template file, ${PREFIX}.tpl
1. Number of population samples
2. Deme sizes
3. Sample sizes and sampling times
4. Growth rates
5. Migration matrices
6. Historical events
7. Genetic information
Preparing the input files: template file, ${PREFIX}.tpl
1. Number of population samples: the number of samples (or demes) to simulate or to use for estimating parameters.
2. Deme sizes: the number of individuals for haploid species, or two times the number of individuals for diploid species.
3. Sample sizes and sampling times: the haploid size of the sample and, optionally, the sampling time.
Preparing the input files: template file, ${PREFIX}.tpl
4. Growth rates: the exponential growth parameter for each population sample.
5. Migration matrices: the number of migration matrices; 0 implies no migration between demes.
6. Historical events: each historical event is defined by 7 numbers.
Time: the number of generations t before present at which the historical event happened.
Source deme: the first deme is 0.
Sink deme: source and sink refer to the demes exchanging migrants.
Migrants: the probability for each lineage in the source deme to migrate to the sink deme.
New size: the relative new size of the sink deme.
Growth rate: the new growth rate of the sink deme.
Migr. matrix: the new migration matrix to be used further back in time.
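To make the seven fields concrete, here is a small Python sketch that splits one historical-event line into named fields. The class, the function, and the example line are hypothetical illustrations, not fastsimcoal 2 code. Time, migrants, new size and growth rate are kept as text because in a template file they may be name tags rather than numbers.

```python
from typing import NamedTuple

class HistoricalEvent(NamedTuple):
    time: str          # generations before present (number or name tag)
    source: int        # source deme index; the first deme is 0
    sink: int          # sink deme index
    migrants: str      # probability that a source lineage moves to the sink
    new_size: str      # relative new size of the sink deme
    growth_rate: str   # new growth rate of the sink deme
    migr_matrix: int   # migration matrix used further back in time

def parse_event(line):
    """Split one historical-event line into its 7 whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    time, source, sink, migrants, new_size, growth, matrix = fields
    return HistoricalEvent(time, int(source), int(sink), migrants,
                           new_size, growth, int(matrix))

# Hypothetical event line: at time TDIV, all lineages of deme 0 move to deme 1
event = parse_event("TDIV 0 1 1 RESIZE 0 1")
print(event)
```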
Preparing the input files: template file, ${PREFIX}.tpl
7. Genetic information: consists of 3 subsections.
Number of independent chromosome segments: the number of independent chromosome segments that need to be simulated.
Number of blocks: the number of blocks to be simulated per chromosome.
Block specificities: the properties of the genetic data to be simulated per block (4 subsections).
Data type: DNA, SNP, or MICROSAT.
Num loci: the number of markers of this data type to be simulated.
Recombination rate and mutation rate.
Optional parameters.
Preparing the input files: estimation file, ${PREFIX}.est
The estimation file is divided into two main sections.
1. The [PARAMETERS] section lists the prior distributions of simple parameters. Each parameter can be an integer or a float, as specified by a first indicator variable. Each parameter is either uniformly or log-uniformly distributed between a minimum and a maximum value, both of which need to be specified.
2. The [COMPLEX PARAMETERS] section lists parameters that are obtained as simple operations between any 2 simple or complex parameters, or between a parameter and a scalar. The possible operations are +, -, *, and /.
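The difference between a uniform and a log-uniform search range, and the idea of a complex parameter, can be sketched in Python. This is only an illustration of the concept under stated assumptions: `sample_parameter`, the tag names, and the ranges are hypothetical, and the sketch does not reproduce how fastsimcoal 2 draws its values internally.

```python
import math
import random

def sample_parameter(is_int, dist, lo, hi, rng):
    """Draw one starting value for a simple parameter.

    is_int : True for an integer parameter, False for a float
    dist   : "unif" or "logunif" (labels chosen for this sketch)
    lo, hi : minimum and maximum of the search range
    """
    if dist == "unif":
        value = rng.uniform(lo, hi)
    elif dist == "logunif":
        # Uniform on the log scale: each order of magnitude is equally likely.
        value = math.exp(rng.uniform(math.log(lo), math.log(hi)))
    else:
        raise ValueError(f"unknown distribution: {dist}")
    value = min(max(value, lo), hi)  # guard against floating-point overshoot
    return round(value) if is_int else value

rng = random.Random(42)
npop1 = sample_parameter(True, "logunif", 1_000, 1_000_000, rng)
tdiv = sample_parameter(True, "unif", 100, 100_000, rng)
resize = sample_parameter(False, "logunif", 1e-3, 10, rng)

# A complex parameter is a simple operation between two parameters,
# here a hypothetical ancestral size obtained as RESIZE * NPOP1.
anc_size = resize * npop1
print(npop1, tdiv, resize, anc_size)
```

A log-uniform range is the usual choice when a parameter may span several orders of magnitude, as population sizes often do.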
Example of an input file: the observed derived SFS (shown as a figure in the original slides).
Example of an input file: the template file
The number of populations to be simulated is defined as two.
The effective population sizes of the two populations need to be inferred, so they are represented by name tags.
Then we define the number of haplotypes we want to simulate.
The growth rate is 0.
There was gene flow between the populations, so we add two migration matrices.
Going back TDIV generations from the present, a proportion 1 of N1 (source) returns to N2 (sink); the effective population size of N2 (sink) becomes RESIZE times its original size, and its growth rate is 0. The migration matrix is the first matrix.
The simulation is run on 1 chromosome, and the 0 indicates no structural differences, so there is only 1 block.
Finally, we define the type, quantity (length), mutation rate, recombination rate and so on of the simulated data.
Example of an input file: the estimation file
All name tags introduced in the template file need to be defined in the estimation file. For each tag, the parameter distribution (uniform or log-uniform) and the search range (min and max) are given on a single line.
Command line
    fsc27 -t *.tpl -e *.est -d -0 -n 100000 -L 40 -s 0 -M -q -c 80
-n: Number of simulations to perform per parameter file or set of parameters.
-L: Number of loops (ECM cycles) to be performed when estimating parameters from the SFS. Default is 20.
-M: Perform parameter estimation by maximum composite likelihood from the SFS.
-d: Compute the site frequency spectrum (SFS) for the derived alleles for each population sample and for all pairs of samples (joint 2D SFS).
-0: Do not take into account monomorphic sites in the observed SFS for parameter inference.
-q: Output minimal messages to the console instead of detailed messages.
-C: Minimum observed SFS entry count taken into account for parameter estimation (default = 1).
-c: Number of OpenMP threads to be used for simulation of independent chromosomes and for parameter estimation.
Output files
*.bestlhoods: file with the maximum likelihood estimates for each parameter specified as output in the est file, together with the model likelihoods.
    NPOP1$ NPOP2$ TDIV$ 754838 751671 1.80295e-07 922419 RESIZE$ MaxEstLhood MaxObsLhood -95264.743 -92613.844
*_maxL.par: model specification file with the best parameter estimates. It is basically the tpl file with the keywords replaced by estimated values.
*.simparam: file with an example of the settings to run the simulations.
    # Run 100 independent estimations, each in its own folder
    for i in {1..100}
    do
      mkdir run$i
      cp ${PREFIX}.tpl ${PREFIX}.est ${PREFIX}_jointMAFpop1_0.obs run$i"/"
      cd run$i
      fastsimcoal2 -t ${PREFIX}.tpl -e ${PREFIX}.est -m -0 -C 10 -n 10000 -L 40 -s 0 -M -q
      cd ..
    done
fastsimcoal2 should not be run just once, because it might not find the global optimum of the parameter estimates right away. It is better to run it 100 times or more.
Joana wrote a script for you that automatically extracts the files of the best run and copies them into a new folder, which it calls bestrun. Just run it in the directory where all the run folders are located: fsc-selectbestrun.sh
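For readers who want to see the selection logic itself, the following Python sketch picks the run with the highest MaxEstLhood. It is not the fsc-selectbestrun.sh script; the function names, the file layout, and the demo values are assumptions made for illustration. It assumes each *.bestlhoods file has one header line followed by one line of values.

```python
import tempfile
from pathlib import Path

def read_bestlhoods(path):
    """Parse a *.bestlhoods file: one header line, then one line of values."""
    lines = Path(path).read_text().splitlines()
    names, values = lines[0].split(), lines[1].split()
    return dict(zip(names, (float(v) for v in values)))

def select_best_run(base_dir):
    """Return (path, estimates) for the run with the highest MaxEstLhood."""
    results = [(p, read_bestlhoods(p))
               for p in Path(base_dir).rglob("*.bestlhoods")]
    if not results:
        raise FileNotFoundError("no *.bestlhoods files found")
    return max(results, key=lambda item: item[1]["MaxEstLhood"])

# Hypothetical demo with two fake runs (all values are made up)
with tempfile.TemporaryDirectory() as tmp:
    for run, lhood in [("run1", -95300.0), ("run2", -95264.743)]:
        run_dir = Path(tmp) / run
        run_dir.mkdir()
        (run_dir / "model.bestlhoods").write_text(
            "NPOP1\tTDIV\tMaxEstLhood\tMaxObsLhood\n"
            f"1000\t500\t{lhood}\t-92613.844\n"
        )
    best_path, best = select_best_run(tmp)
    best_run = best_path.parent.name
    print(best_run, best["MaxEstLhood"])
```

The likelihoods are negative, so the best run is the one whose value is closest to zero.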
To find the best model, the likelihoods of the best run of each model should be compared. The comparison uses the maximum likelihood value and the number of parameters estimated by the model (calculateAIC.py, Fang Gong's script).
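The comparison is commonly made with the Akaike information criterion, AIC = 2k - 2 ln(L), where k is the number of estimated parameters. The sketch below is not the calculateAIC.py script mentioned on this slide. It assumes that the likelihoods reported in *.bestlhoods are in log10 units and therefore converts them to natural logarithms first; the function name and the second model's values are hypothetical.

```python
import math

def aic_from_log10_lhood(log10_lhood, n_params):
    """AIC = 2k - 2 ln(L), with the likelihood given as log10(L).

    Assumption: the input is a log10 likelihood, so it is multiplied by
    ln(10) to obtain the natural-log likelihood.
    """
    ln_lhood = log10_lhood * math.log(10)
    return 2 * n_params - 2 * ln_lhood

# Model A: MaxEstLhood from the slides, 4 estimated parameters
aic_a = aic_from_log10_lhood(-95264.743, n_params=4)
# Model B: hypothetical competing model with a made-up likelihood
aic_b = aic_from_log10_lhood(-95400.0, n_params=3)

# The model with the lowest AIC is preferred.
best_model = "A" if aic_a < aic_b else "B"
print(best_model)
```

Because the number of parameters enters the formula, a more complex model is preferred only when its likelihood gain outweighs the penalty for the extra parameters.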
Finally, determine the optimal parameters (merge_para.py and calculateCI.py, Fang Gong's scripts).
Best model:
    NPOP1$ NPOP2$ TDIV$ 754838 751671 1.80295e-07 922419 RESIZE$ MaxEstLhood MaxObsLhood -95264.743 -92613.844