Mutation Detection and Pair-HMM Analysis


This presentation covers mutation detection with Pair-HMM analysis, motivated by SNP discovery, HapMap resequencing, species identification, bisulfite sequencing, and RNA editing. It reviews error rates across NGS platforms, outlines the mathematics of the Pair-HMM calculations used for sequence alignment, and explores optimizations that reduce memory usage.

  • Mutation Detection
  • Pair-HMM Analysis
  • NGS Platforms
  • Error Rates
  • Sequence Alignment

Uploaded on Feb 17, 2025



Presentation Transcript


  1. GNUMAP-SNP: Parallel Pair-HMM SNP Detection. Nathan Clement, The University of Texas, Austin, TX, USA

  2. Outline: Motivation; NGS Issues and Requirements; Pair-HMM; Memory Optimizations; Results; Conclusion

  3. Motivation: Mutation detection (SNP discovery, HapMap and resequencing, species identification); bisulfite sequencing (epigenetic influences); RNA editing

  4. Error Rates*

     Instrument            Run time  Mb/run   Bases/read  Primary error type  Error rate (%)
     3730xl (Capillary)    2 h       0.06     650         Substitution        0.1-1
     454 FLX+              18-20 h   900      700         Indel               1
     Illumina HiSeq2000    10 days   600,000  100+100     Substitution        0.1
     Ion Torrent 318 chip  2 h       >1000    >100        Indel               ~1
     PacBio RS             0.5-2 h   5-10     860-1100    CG deletions        16

     * Data current as of May 2011: Glenn, Travis C., Field guide to next-generation DNA sequencers, Molecular Ecology Resources, vol. 11, pp. 759-769, 2011

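  5. Pair-HMM [Figure: a pairwise alignment and its equivalent hidden Markov state sequence. The model runs from Begin to End through a match state M, which emits aligned pairs with probabilities such as pAA, pTT, pGG, pCC, and pCA, and through gap states GX and GY, which emit with probability q. Transitions are labelled TMM, TMG, and TGM.]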

  6. Pair-HMM (Mathematics) Match Gap (in both directions)
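The equations on this slide did not survive extraction. As a point of reference only, the standard textbook forward recursions for a three-state pair-HMM can be written with the transition labels used on slide 5. This is a sketch, not the slide's own formulas: the gap-extension transition T_GG, the forward variables f^M, f^X, f^Y, and the sequence symbols x and y are assumptions introduced here.

```latex
% Assumed notation: x_i and y_j are the i-th and j-th bases of the two sequences,
% p_{ab} is the match emission probability, q_a is the gap emission probability,
% and T_{GG} (gap extension) does not appear on the slides.
\begin{align}
f^{M}(i,j) &= p_{x_i y_j}\,\bigl[\,T_{MM}\, f^{M}(i-1,j-1)
              + T_{GM}\,\bigl(f^{X}(i-1,j-1) + f^{Y}(i-1,j-1)\bigr)\bigr] \\
f^{X}(i,j) &= q_{x_i}\,\bigl[\,T_{MG}\, f^{M}(i-1,j) + T_{GG}\, f^{X}(i-1,j)\bigr] \\
f^{Y}(i,j) &= q_{y_j}\,\bigl[\,T_{MG}\, f^{M}(i,j-1) + T_{GG}\, f^{Y}(i,j-1)\bigr]
\end{align}
```

The first line is the match case and the last two are the gap case in each direction, which matches the slide's headings "Match" and "Gap (in both directions)".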

  7. Pair-HMM (M) [Figure: 8 x 8 match-state (M) probability matrix for the example alignment, with both axes labelled by sequence bases. Most entries are 0.00; the nonzero entries are 0.32, 0.68, and 1.00.]

  8. Pair-HMM (X) [Figure: 8 x 8 gap-state (X) probability matrix for the same alignment. All entries are 0.00 except a single 0.31.]

  9. Pair-HMM (Y) [Figure: 8 x 8 gap-state (Y) probability matrix for the same alignment. All entries are 0.00 except a single 0.31.]

  10. Pair-HMM

            A     C     G     T
      a   1.00  0.00  0.00  0.00
      g   0.00  0.00  0.68  0.31
      t   0.32  0.00  0.00  0.68
      a   0.99  0.00  0.00  0.00
      g   0.00  0.00  1.00  0.00
      a   1.00  0.00  0.00  0.00
      c   0.00  1.00  0.00  0.00
      c   0.00  1.00  0.00  0.00

  11. Expected Results

      CHR   POS      TOT    A     C      G      T      SNP?      PVAL
      chrX  1755234  17.00  0.00  0.00   17     0.00   N
      chrX  1755235  18.00  0.00  18.00  0.00   0.00   N
      chrX  1755236  19.00  9.99  0.00   9.00   0.01   Y:g->a/g  2.54e-08
      chrX  1755237  19.50  0.00  0.00   0.00   19.50  N
      chrX  1755238  19.50  0.00  0.00   19.50  0.00   N
      chrX  1755239  46.00  0.01  19.49  0.00   0.00   N

  12. Why Inline SNP Calling? Post-processing: needs disk space but less memory. Inline: requires more memory but less disk space, and can include specific probabilities for each read.

  13. Previous Optimizations: Two methods for speeding up mapping: (1) entire genome on one machine; (2) split memory among different machines. The second must normalize across all genome portions, which is done with an MPI reduction.
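To illustrate why the split-memory approach needs a reduction, here is a minimal Python sketch. It does not use a real MPI library: `allreduce_sum` is a hypothetical stand-in for a collective such as MPI_Allreduce, and the function names and scores are invented for the example. Each "rank" holds the unnormalized alignment likelihoods one read obtained on its own genome portion, and the normalizing constant is the sum over all portions.

```python
def allreduce_sum(values):
    """Stand-in for an MPI all-reduce: every rank receives the global sum."""
    total = sum(values)
    return [total for _ in values]


def normalize_across_partitions(partition_scores):
    """Normalize one read's alignment scores over all genome portions.

    partition_scores[r] lists the unnormalized likelihoods the read obtained
    at candidate locations on the genome portion owned by rank r.
    """
    local_sums = [sum(scores) for scores in partition_scores]
    global_sums = allreduce_sum(local_sums)  # same value on every rank
    normalized = []
    for scores, z in zip(partition_scores, global_sums):
        normalized.append([s / z if z > 0 else 0.0 for s in scores])
    return normalized


# Hypothetical scores for one read on three genome portions.
scores = [[0.02, 0.0], [0.06], [0.0, 0.02]]
weights = normalize_across_partitions(scores)
```

After normalization the weights over all portions sum to one, so a read that maps well in several places contributes a fraction of itself to each.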

  14. Previous Optimizations [Figure: Sequence Processing Rate for Memory Allocation. x-axis: # processes (1 to 256); y-axis: # sequences / second (10 to 2000); series: Expected, Single Machine, Spread Memory.]

  15. Memory Requirements: Human genome (3 Gb). HashMap: 12GB. Sequence at 4 bits/character: 1.5GB. 5 floating point values per base (A, C, G, T, plus N): sizeof(float) * 5 * 3G bases = 60GB. A per-base total, also stored for easy computation: sizeof(float) * 3G bases = 12GB. Total of 90GB per run.
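The arithmetic on this slide can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes a 4-byte float and decimal gigabytes (10^9 bytes), and the variable names are invented. The itemized figures sum to 85.5GB, which the slide appears to round up to 90GB.

```python
GENOME_BASES = 3_000_000_000  # human genome, about 3 Gb
SIZEOF_FLOAT = 4              # bytes; assumed single-precision float
GB = 1_000_000_000            # decimal gigabyte, assumed

hash_map_gb = 12.0                                   # figure taken from the slide
sequence_gb = GENOME_BASES * 4 / 8 / GB              # 4 bits per character
per_base_gb = SIZEOF_FLOAT * 5 * GENOME_BASES / GB   # A, C, G, T and N per base
totals_gb = SIZEOF_FLOAT * GENOME_BASES / GB         # one running total per base

itemized_gb = hash_map_gb + sequence_gb + per_base_gb + totals_gb
print(sequence_gb, per_base_gb, totals_gb, itemized_gb)
```

The per-base floating point values dominate the budget, which is why the discretization schemes on the following slides target them.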

  16. Three Memory Optimizations: Normal (no optimization); integer discretization; centroid discretization

  17. Integer Discretization: Only one floating point value (the total) and 1 byte per nucleotide are needed. Each nucleotide is stored as parts per 255 of the total. Biggest hit: converting into and out of integer space.

  18. Integer Discretization (example). Step 1: convert from integer space. Step 2: add from ri to the genome. Step 3: convert back to integer space.

              Genome (integer)  Genome (float)  Added from ri  Sum (float)  Genome (integer)
      Total   12.0              12.0            1.0            13.0         13.0
      A       231               10.9            0.68           11.6         228
      C       7                 0.33            0.31           0.64         13
      G       12                0.56            0.01           0.57         11
      T       3                 0.15            0.00           0.15         2
      N       3                 0.15            0.00           0.15         2
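The three steps can be sketched in Python as follows. This is an illustrative reading of the slide, not the GNUMAP-SNP implementation: the function names are invented, and rounding to the nearest integer is an assumption, so individual bytes may differ by one or two from the slide's figures, which were computed from rounded intermediate values.

```python
SCALE = 255  # "parts per 255": each nucleotide share fits in one byte


def to_float_space(total, shares):
    """Step 1: expand the byte shares back into floating point counts."""
    return [total * s / SCALE for s in shares]


def to_integer_space(total, counts):
    """Step 3: compress counts into byte shares of the given total."""
    if total == 0:
        return [0] * len(counts)
    return [min(SCALE, round(SCALE * c / total)) for c in counts]


def add_read(total, shares, contribution):
    """Add one read's per-nucleotide contribution to a genome position."""
    counts = to_float_space(total, shares)
    counts = [c + r for c, r in zip(counts, contribution)]  # Step 2
    new_total = total + sum(contribution)
    return new_total, to_integer_space(new_total, counts)


# Values from the slide, in the order A, C, G, T, N.
new_total, new_shares = add_read(12.0, [231, 7, 12, 3, 3],
                                 [0.68, 0.31, 0.01, 0.00, 0.00])
```

Every update pays for two conversions, which is the "biggest hit" mentioned on the previous slide, and each round trip can lose a little precision.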

  19. Centroid Discretization: Many states are not used, for example [255, 255, 255, 255, 255] and [0, 0, 0, 0, 0]. Many states are not biologically relevant: an SNP transition is common, while a transversion is not likely. MSA uses this compression to perform fast one-to-many alignment.
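One way to picture centroid discretization is as a nearest-neighbour lookup into a small codebook of plausible nucleotide distributions, so that each position stores a total plus a single codebook index. The sketch below is only an illustration under that assumption: the codebook, the squared-distance measure, and the function names are hypothetical and are not taken from the slides.

```python
# Hypothetical codebook: each centroid is a distribution over (A, C, G, T, N).
CENTROIDS = [
    (1.0, 0.0, 0.0, 0.0, 0.0),  # all A
    (0.0, 1.0, 0.0, 0.0, 0.0),  # all C
    (0.0, 0.0, 1.0, 0.0, 0.0),  # all G
    (0.0, 0.0, 0.0, 1.0, 0.0),  # all T
    (0.5, 0.0, 0.5, 0.0, 0.0),  # A/G mix (a transition)
    (0.0, 0.5, 0.0, 0.5, 0.0),  # C/T mix (a transition)
]


def encode(counts):
    """Return (total, index of the centroid nearest to the proportions)."""
    total = sum(counts)
    if total == 0:
        return 0.0, 0  # arbitrary index; an empty position carries no signal
    props = [c / total for c in counts]

    def squared_distance(k):
        return sum((p - q) ** 2 for p, q in zip(props, CENTROIDS[k]))

    return total, min(range(len(CENTROIDS)), key=squared_distance)


def decode(total, index):
    """Expand a stored (total, index) pair back into approximate counts."""
    return [total * p for p in CENTROIDS[index]]


# Position chrX:1755236 from the Expected Results slide, with N set to 0.
total, index = encode([9.99, 0.0, 9.0, 0.01, 0.0])
approx = decode(total, index)
```

Decoding returns the centroid's proportions rather than the original counts, which is where the rounding error of this scheme comes from.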

  20. Centroid Discretization (cont)

  21. Centroid Discretization (cont). Benefits: doesn't waste impossible or infrequently used space; much smaller memory footprint. Drawbacks: slight overhead in converting between centroid and floating point spaces; rounding error (how significant?).

  22. Speed Comparison

  23. Optimization Stats (chrX)

      Optimization  Memory  Mem %  Wallclock  TP    FP
      Normal        4.76GB  100%   04:25:55   1309  127
      CharDisc      2.58GB  54.2%  04:36:58   677   0
      CentDisc      2.01GB  42.2%  04:27:29   166   9058
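As a quick check on these figures, the following Python sketch recomputes the memory percentages relative to the Normal run and derives precision, TP / (TP + FP), for each optimization. Precision is not reported on the slide; it is added here only as one convenient way to compare the runs, and the dictionary layout is invented for the example.

```python
# Figures copied from the slide; memory in GB.
runs = {
    "Normal":   {"memory_gb": 4.76, "tp": 1309, "fp": 127},
    "CharDisc": {"memory_gb": 2.58, "tp": 677,  "fp": 0},
    "CentDisc": {"memory_gb": 2.01, "tp": 166,  "fp": 9058},
}


def precision(tp, fp):
    """Fraction of called SNPs that are true positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0


baseline = runs["Normal"]["memory_gb"]
summary = {
    name: {
        "mem_pct": round(100 * r["memory_gb"] / baseline, 1),
        "precision": precision(r["tp"], r["fp"]),
    }
    for name, r in runs.items()
}

for name, row in summary.items():
    print(f"{name:9s} {row['mem_pct']:6.1f}%  precision={row['precision']:.3f}")
```

Precision alone flatters CharDisc, which reports no false positives but also finds only about half as many true positives as the Normal run.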

  24. Conclusion: For high error rates, the HMM approach is ideal, but it requires more memory. Distributing the genome across processors doesn't scale linearly. Discretization methods provide good memory reductions (down to 42% of the unoptimized footprint). Centroid discretization performs poorly. Integer discretization can be used when available memory is low.

  25. Questions
