Rapid Imputation Methods for Medium- or Low-Coverage Sequence Data in Livestock Genetics


Paul VanRaden and Chuanyu Sun presented fast imputation from medium- or low-coverage sequence data at the World Congress on Genetics Applied to Livestock Production. The slides compare the cost of chip and sequence data, describe the findhap version 4 imputation algorithm, assess the value of including the high-density (HD) chip in sequence data, and report accuracy and computing requirements from tests on simulated sequenced bulls and on human chromosome 22 data.

  • Livestock Genetics
  • Imputation Methods
  • Sequencing Data
  • Paul VanRaden
  • Computational Analysis


Presentation Transcript


  1. Fast Imputation Using Medium- or Low-Coverage Sequence Data
     Paul VanRaden and Chuanyu Sun
     Animal Genomics and Improvement Lab, USDA-ARS, Beltsville, MD, USA
     National Association of Animal Breeders, Columbia, MO, USA
     paul.vanraden@ars.usda.gov
     10th World Congress on Genetics Applied to Livestock Production, Vancouver, Canada, August 19, 2014

  2. Topics
     • Cost of chip vs. sequence data
         - Chips: Nonlinear increase with SNP density
         - Sequence: Linear increase with read depth
     • Imputation methods for sequence data
         - Few programs designed for low read depth
     • Value of including HD chip in sequence data

  3. Analysis of chip vs. sequence data

     Chip data                    Sequence data
     Genotypes are observed       Genotype probabilities
     AA, AB, BB (2, 1, 0)         Counts of A, counts of B
     Exact data, SNP subset       Approximate data, all SNP
     Impute only missing data     Impute all genotypes
     3K, 6K, 50K, 77K, 777K       30 million SNPs + CNVs
     Error rate < 0.05%           Error rate 0.5% to 10%
     Computation important        Computation is crucial

  4. Imputation algorithm (findhap v4)
     • Prior allele probabilities = population frequency
     • Compute Prob(nA, nB | genotypes, errate)
     • Test ancestor haplotype likelihoods first
     • Find most likely 2 haplotypes from library
     • Compute haplotype posteriors from priors
     • Test long, then medium, then short segments
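
The second and fifth bullets of slide 4 can be illustrated for a single SNP. This is a minimal sketch, not findhap's code: it assumes a binomial read model (each read shows allele A with probability 1 - e, 0.5, or e for genotypes AA, AB, and BB) and Hardy-Weinberg priors built from the population allele frequency. Neither assumption is stated on the slide, findhap itself scores haplotypes rather than single-SNP genotypes, and the function names are hypothetical.

```python
from math import comb

def read_likelihoods(n_a, n_b, err):
    """P(nA, nB | genotype, err) under an assumed binomial read model."""
    n = n_a + n_b
    # Probability that one read shows allele A, given the true genotype.
    p_a_read = {"AA": 1.0 - err, "AB": 0.5, "BB": err}
    return {g: comb(n, n_a) * p ** n_a * (1.0 - p) ** n_b
            for g, p in p_a_read.items()}

def genotype_posteriors(n_a, n_b, err, freq_a):
    """Combine assumed Hardy-Weinberg priors with the read likelihoods."""
    prior = {"AA": freq_a ** 2,
             "AB": 2.0 * freq_a * (1.0 - freq_a),
             "BB": (1.0 - freq_a) ** 2}
    like = read_likelihoods(n_a, n_b, err)
    joint = {g: prior[g] * like[g] for g in prior}
    total = sum(joint.values())
    return {g: v / total for g, v in joint.items()}

# Two reads, both showing A, 1% read error, allele frequency 0.5.
post = genotype_posteriors(n_a=2, n_b=0, err=0.01, freq_a=0.5)
```

With these inputs the posterior is roughly two-thirds AA and one-third AB, which shows why genotypes cannot simply be called from the reads at low depth.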

  5. Data sets and imputation tests

     Data category / parameter       Levels tested
     Simulated sequenced bulls       250, 500, 1,000, 10,000
     Read depths                     1, 2, 4, 8, 16
     Error rates                     0%, 1%, 4%, 16%
     Include HD chip in sequence     Yes or no
     SNPs in sequence and HD         30 million and 600,000
     Human chromosome 22             1,102 actual genomes
     SNPs in sequence and HD         394,724 and 39,440

  6. Computation required
     • Bulls: 250 sequenced + 250 HD, 1 chromosome
     • Time (10 processors): findhap 10 min, Beagle v4 3 days
     • Memory: findhap 5 Gbytes, Beagle <5 Gbytes
     • Input data: findhap 0.5 Gbytes, Beagle 5 Gbytes
         - findhap: 2 bytes / SNP [A, B counts stored as hexadecimal]
         - Beagle: 20 bytes / SNP [Prob(AA), Prob(AB), Prob(BB)]
     • Output data: findhap 1 byte vs. Beagle 20 bytes / SNP
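
The input sizes on slide 6 follow from the bytes per SNP. The sketch below shows one way two read counts could fit in 2 bytes (one hexadecimal character each) and reproduces the 0.5 and 5 Gbyte figures. Capping counts at 15, the value of 1 million SNPs per chromosome (30 million SNPs over roughly 30 chromosomes), and counting only the 250 sequenced bulls are assumptions made here, not details given on the slide; the function names are hypothetical.

```python
def encode_counts(n_a, n_b):
    """Pack A and B read counts into two hex characters (assumed cap of 15)."""
    return f"{min(n_a, 15):X}{min(n_b, 15):X}"

def decode_counts(code):
    """Recover the (possibly capped) A and B counts from the 2-character code."""
    return int(code[0], 16), int(code[1], 16)

def input_gbytes(animals, snps, bytes_per_snp):
    """Input size in Gbytes for one chromosome."""
    return animals * snps * bytes_per_snp / 1e9

SNPS_PER_CHROMOSOME = 1_000_000  # assumption, see the lead-in

findhap_gb = input_gbytes(250, SNPS_PER_CHROMOSOME, 2)   # counts as 2 hex characters
beagle_gb = input_gbytes(250, SNPS_PER_CHROMOSOME, 20)   # three genotype probabilities
```

Under these assumptions `findhap_gb` is 0.5 and `beagle_gb` is 5.0, matching the slide.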

  7. Accuracy of Findhap vs. Beagle

                         Sequence + HD               Impute from HD
     Program   Depth     Correct (%)   Correlation   Correct (%)   Correlation
     Findhap   8X        98.7          0.981         95.0          0.926
               4X        95.8          0.939         93.1          0.897
               2X        91.3          0.879         89.2          0.837
     Beagle    8X        99.0          0.984         97.1          0.956
               4X        95.0          0.918         78.2          0.582
               2X        79.5          0.602         63.5          0.100

     250 bulls had sequence + HD, 250 others were imputed from HD

  8. Accuracy from HD for bulls × depth

     Sequenced bulls   Depth   Total depth   Correct (%)   Correlation
     250               8X      2,000X        95.0          0.926
     500               4X      2,000X        96.7          0.954
     1,000             2X      2,000X        96.5          0.951
     10,000            1X      10,000X       95.8          0.939

     Sequences had 1% error, HD imputed using findhap
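
The total depth column on slide 8 is the number of sequenced bulls multiplied by the read depth per bull. A short check of that arithmetic, using only the values from the slide (the variable names are hypothetical):

```python
# (sequenced bulls, read depth per bull) for the four designs on slide 8
designs = [(250, 8), (500, 4), (1000, 2), (10000, 1)]

# Total depth = bulls x depth per bull
total_depth = {bulls: bulls * depth for bulls, depth in designs}
```

The first three designs all total 2,000X, so if sequencing cost is linear in read depth (slide 2) they cost about the same while differing in accuracy.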

  9. Accuracy including HD in sequence

                    Sequenced bulls         Bulls with HD only
                    HD in sequence?         HD in sequence?
     Read depth     No        Yes           No        Yes
     16X            .999      .999          .977      .977
     8X             .985      .988          .970      .974
     4X             .920      .958          .906      .954
     2X             .847      .919          .831      .917
     1X             .788      .878          .753      .853

     Correlations of estimated with true genotypes for 500 bulls sequenced with 1% error and 250 bulls with HD only

  10. Imputation from 10K, 60K, 1X, or 2X
      [Figure: imputation accuracy (vertical axis, 0 to 1) for 10k, 60k, 1x, and 2x; legend entries as extracted: Corr, nCount]
      Reference population is 500 bulls, 8X read depth, 1% error

  11. Sequenced human read depth × error

                   Correct genotypes %, by error rate    Genotype correlation, by error rate
      Read depth   0%       1%       4%       16%        0%       1%       4%       16%
      16X          1.000    .999     .998     .989       .999     .997     .989     .947
      8X           .996     .994     .990     .981       .982     .968     .952     .904
      4X           .986     .983     .979     .969       .929     .915     .896     .840
      2X           .970     .969     .964     .951       .853     .841     .817     .749
      1X           .951     .951     .945     .932       .754     .745     .718     .647

      884 humans sequenced for 394,724 SNPs on chromosome 22

  12. Software at http://aipl.arsusda.gov
      • Simulate genotypes (programs written 2007)
          - pedsim.f90, markersim.f90, genosim.f90
      • Simulate A and B counts, Poisson plus error
          - geno2seq.f90
      • Impute using haplotype likelihood ratios
          - findhap.f90 version 4
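
The "Poisson plus error" step on slide 12 can be sketched as follows. This is not the geno2seq.f90 code: it assumes the number of reads at a SNP is Poisson with mean equal to the read depth, that each read samples one of the animal's two alleles at random, and that a read is flipped to the other allele with probability equal to the error rate. Function names are hypothetical.

```python
import random
from math import exp

def poisson(rng, mean):
    """Draw a Poisson variate with Knuth's method (adequate for small means)."""
    limit = exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate_counts(genotype, depth, err, rng):
    """Simulate (nA, nB) read counts; genotype is the number of A alleles (2, 1, 0)."""
    n_reads = poisson(rng, depth)
    n_a = 0
    for _ in range(n_reads):
        is_a = rng.random() < genotype / 2.0  # sample one of the two alleles
        if rng.random() < err:                # sequencing error flips the read
            is_a = not is_a
        n_a += is_a
    return n_a, n_reads - n_a

rng = random.Random(2014)
# A heterozygous animal at 4X read depth with 1% read error.
n_a, n_b = simulate_counts(genotype=1, depth=4, err=0.01, rng=rng)
```

At low depth many simulated SNPs receive zero or one read, which is the situation the imputation step has to handle.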

  13. Actual HD genotype correlations²

  14. Simulated HD correlations²

  15. Conclusions
      • High read depth is expensive (linear cost)
      • Low read depth requires additional math
          - Haplotype probabilities | (A B counts, error)
      • Imputation improved with findhap version 4
          - Up to 400 times faster than Beagle
          - findhap more accurate for low coverage
      • Some gain from including HD in sequence

  16. Acknowledgments
      • Jeff O'Connell and Derek Bickhart provided helpful advice on sequence analysis and software design and testing
