Mutation Detection and Pair-HMM Analysis

This study focuses on mutation detection using Pair-HMM analysis, with applications in SNP discovery, HapMap, species identification, bisulfite sequencing, and RNA editing. It explores NGS platform error rates and memory-usage optimizations, along with the mathematics of Pair-HMM calculations for sequence alignment.

Uploaded on Feb 17, 2025

Presentation Transcript

GNUMAP-SNP: Parallel Pair-HMM SNP Detection. Nathan Clement, The University of Texas, Austin, TX, USA

Outline
- Motivation
- NGS Issues and Requirements
- Pair-HMM
- Memory Optimizations
- Results
- Conclusion

Motivation
- Mutation detection: SNP discovery, HapMap and resequencing, species identification
- Bisulfite sequencing: epigenetic influences
- RNA editing

Error Rates*

| Instrument | Run Time | Mb/run | Bases/read | Primary Error Type | Error Rate (%) |
|---|---|---|---|---|---|
| 3730xl (Capillary) | 2 h | 0.06 | 650 | Substitution | 0.1-1 |
| 454 FLX+ | 18-20 h | 900 | 700 | Indel | 1 |
| Illumina HiSeq2000 | 10 days | 600,000 | 100+100 | Substitution | 0.1 |
| Ion Torrent 318 chip | 2 h | >1000 | >100 | Indel | ~1 |
| PacBio RS | 0.5-2 h | 5-10 | 860-1100 | CG Deletions | 16 |

* Data current as of May 2011: Glenn, Travis C., Field guide to next-generation DNA sequencers, Molecular Ecology Resources, vol. 11, pp. 759-769, 2011

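Pair-HMM
- Pair-wise alignment (figure): two sequences aligned column by column, showing matched bases (A|A, T|T, G|G, C|C), a mismatch (C against A), and gaps (--).
- Equivalent hidden Markov state sequence (figure): a path from Begin to End through match states M and gap states GX and GY. Match states emit aligned pairs with probabilities p_GG, p_CC, p_TT, p_CA, and p_AA; gap states emit with probability q; transitions are labeled T_MM, T_MG, and T_GM.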

Pair-HMM (Mathematics)
- Match
- Gap (in both directions)
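The slide's own equations are not reproduced in this transcript. As a hedged reference, the textbook forward recurrences for a three-state pair-HMM (one match state, two gap states) can be written with the symbols from the state diagram. The gap-extension transition T_{GG} is an assumed symbol that does not appear in the transcript, and the exact form used by GNUMAP-SNP may differ.

```latex
\begin{align*}
f^{M}(i,j) &= p_{x_i y_j}\,\bigl[\,T_{MM}\, f^{M}(i-1,j-1)
              + T_{GM}\,\bigl(f^{X}(i-1,j-1) + f^{Y}(i-1,j-1)\bigr)\bigr] \\
f^{X}(i,j) &= q\,\bigl[\,T_{MG}\, f^{M}(i-1,j) + T_{GG}\, f^{X}(i-1,j)\bigr] \\
f^{Y}(i,j) &= q\,\bigl[\,T_{MG}\, f^{M}(i,j-1) + T_{GG}\, f^{Y}(i,j-1)\bigr]
\end{align*}
```

Here x and y are the two sequences, f^{M}, f^{X}, and f^{Y} are the forward values for the match state and the two gap states, p_{x_i y_j} is the match emission probability, and q is the gap emission probability.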

Pair-HMM (M): 8 x 8 matrix of match-state values for the sequences a t a c g a c t and a g t a g a c c. All entries are 0.00 except ten: 1.00 (five entries), 0.68 (three entries), and 0.32 (two entries).

Pair-HMM (X): 8 x 8 matrix of gap-state X values for the same two sequences. All entries are 0.00 except a single 0.31.

Pair-HMM (Y): 8 x 8 matrix of gap-state Y values for the same two sequences. All entries are 0.00 except a single 0.31.

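Pair-HMM

|   | A | C | G | T |
|---|---|---|---|---|
| a | 1.00 | 0.00 | 0.00 | 0.00 |
| g | 0.00 | 0.00 | 0.68 | 0.31 |
| t | 0.32 | 0.00 | 0.00 | 0.68 |
| a | 0.99 | 0.00 | 0.00 | 0.00 |
| g | 0.00 | 0.00 | 1.00 | 0.00 |
| a | 1.00 | 0.00 | 0.00 | 0.00 |
| c | 0.00 | 1.00 | 0.00 | 0.00 |
| c | 0.00 | 1.00 | 0.00 | 0.00 |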

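Expected Results

| CHR | POS | TOT | A | C | G | T | SNP? | PVAL |
|---|---|---|---|---|---|---|---|---|
| chrX | 1755234 | 17.00 | 0.00 | 0.00 | 17 | 0.00 | N | |
| chrX | 1755235 | 18.00 | 0.00 | 18.00 | 0.00 | 0.00 | N | |
| chrX | 1755236 | 19.00 | 9.99 | 0.00 | 9.00 | 0.01 | Y:g->a/g | 2.54e-08 |
| chrX | 1755237 | 19.50 | 0.00 | 0.00 | 0.00 | 19.50 | N | |
| chrX | 1755238 | 19.50 | 0.00 | 0.00 | 19.50 | 0.00 | N | |
| chrX | 1755239 | 46.00 | 0.01 | 19.49 | 0.00 | 0.00 | N | |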

Why Inline SNP Calling?
- Post-processing: uses disk space, less memory
- Inline: requires more memory, less disk space; can include specific probabilities for each read

Previous Optimizations
Two methods for speeding up mapping:
1. Entire genome on one machine
2. Split memory among different machines
   - Must normalize across all genome portions
   - MPI reduction
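The normalization step for the split-memory method can be sketched in plain Python. This is a minimal illustration, not GNUMAP-SNP's code: the function name `normalize_across_portions` and the sample scores are hypothetical, and the plain `sum` over partial sums stands in for an MPI sum-reduction across processes.

```python
def normalize_across_portions(local_scores):
    """Normalize one read's alignment scores across all genome portions.

    local_scores holds one list per genome portion (one per process), each
    containing the unnormalized likelihoods of the read's candidate
    locations in that portion. All names here are illustrative.
    """
    # Each process sums the scores it found in its own portion.
    partial_sums = [sum(scores) for scores in local_scores]
    # Stand-in for an MPI sum-reduction: combine the partial sums.
    global_total = sum(partial_sums)
    if global_total == 0:
        return [[0.0 for _ in scores] for scores in local_scores]
    # Every process divides its local scores by the shared global total.
    return [[s / global_total for s in scores] for scores in local_scores]


# Hypothetical read with hits in two of three genome portions.
portions = [[0.2, 0.1], [], [0.7, 1.0]]
weights = normalize_across_portions(portions)
```

Without the shared total, each portion would normalize only against its own hits, and a read that maps well in two portions would be counted at full weight in both.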

Previous Optimizations (figure): Sequence Processing Rate for Memory Allocation. Plot of # sequences / second (axis from 10 to 2000) against # processes (1 to 256) for three series: Expected, Single Machine, and Spread Memory.

Memory Requirements
Human genome (3 Gb):
- HashMap: 12GB
- Genome sequence at 4 bits/character: 1.5GB
- 5 floating point values per base (A, C, G, T, plus N): sizeof(float) * 5 * 3GB = 60GB
- Also stores the total for easy computation: sizeof(float) * 3GB = 12GB
- Total of about 90GB per run (the items above sum to 85.5GB)
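The arithmetic behind these figures can be checked with a few lines of Python. This is a sketch under the slide's own assumptions (3 billion bases, 4-byte floats, 1GB taken as 10^9 bytes); the variable names are illustrative.

```python
GENOME_BASES = 3e9   # human genome size used on the slide
SIZEOF_FLOAT = 4     # bytes per float
GB = 1e9             # bytes per GB, as the slide's arithmetic implies

hash_map_gb = 12.0                                        # given on the slide
packed_genome_gb = GENOME_BASES * 4 / 8 / GB              # 4 bits per character
per_base_floats_gb = SIZEOF_FLOAT * 5 * GENOME_BASES / GB  # A, C, G, T, N
totals_gb = SIZEOF_FLOAT * GENOME_BASES / GB              # one running total per base

total_gb = hash_map_gb + packed_genome_gb + per_base_floats_gb + totals_gb
print(f"{total_gb:.1f} GB")
```

The per-base floating point arrays account for most of the footprint, which is why the optimizations that follow target them.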

Three Memory Optimizations
- Normal (no optimization)
- Integer discretization
- Centroid discretization

Integer Discretization
- Only need one floating point value (for the total) and 1 byte/nucleotide
- Values stored as parts per 255
- Biggest hit: going into and out of integer space

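Integer Discretization (example)
- Step 1: Convert from integer space
- Step 2: Add from r_i to the genome
- Step 3: Convert back to integer space

Added from r_i:

| Total | A | C | G | T | N |
|---|---|---|---|---|---|
| 1.0 | 0.00 | 0.68 | 0.31 | 0.01 | 0.00 |

Genome position before and after the update:

|   | Integer (before) | Float (before) | Float (after) | Integer (after) |
|---|---|---|---|---|
| Total | 12.0 | 12.0 | 13.0 | 13.0 |
| A | 3 | 0.15 | 0.15 | 2 |
| C | 231 | 10.9 | 11.6 | 228 |
| G | 7 | 0.33 | 0.64 | 13 |
| T | 12 | 0.56 | 0.57 | 11 |
| N | 3 | 0.15 | 0.15 | 2 |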
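The three steps can be sketched as a single update function. This is a minimal sketch, not the GNUMAP-SNP implementation: the name `add_read_integer` is hypothetical, and rounding to the nearest integer is an assumption, since the slide does not state its rounding rule.

```python
def add_read_integer(total, counts, read_total, read_probs, scale=255):
    """Add one read's contribution to a genome position stored in integer space.

    total      -- float total for the position
    counts     -- per-nucleotide values (A, C, G, T, N) in parts per `scale`
    read_total -- total contributed by the read
    read_probs -- per-nucleotide contribution of the read
    """
    # Step 1: convert from integer space to floating point values.
    floats = [c / scale * total for c in counts]
    # Step 2: add the read's contribution.
    new_total = total + read_total
    floats = [f + r for f, r in zip(floats, read_probs)]
    # Step 3: convert back to integer space (nearest integer, capped at scale).
    new_counts = [min(scale, round(f / new_total * scale)) for f in floats]
    return new_total, new_counts


# Values taken from the slide's example (order: A, C, G, T, N).
new_total, new_counts = add_read_integer(
    12.0, [3, 231, 7, 12, 3], 1.0, [0.00, 0.68, 0.31, 0.01, 0.00]
)
```

With nearest-integer rounding the results land within a unit or two of the slide's values, which illustrates the rounding loss incurred on every trip into and out of integer space.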

Centroid Discretization
- Many states are not used: [255, 255, 255, 255, 255], [0, 0, 0, 0, 0]
- Many states are not biologically relevant: SNP transition (common) vs. transversion (not likely)
- MSA uses this compression to perform fast one-to-many alignment

Centroid Discretization (cont)

Centroid Discretization (cont)
Benefits:
- Doesn't waste space on impossible or infrequently used states
- Much smaller memory footprint
Drawbacks:
- Slight overhead in converting between centroid and floating point spaces
- Rounding error (how significant?)
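As a rough illustration of the idea, each genome position can store the index of its nearest centroid instead of five separate values. The centroid table below is entirely hypothetical (the transcript does not list the centroids that were used), as are the function names and the choice of squared Euclidean distance.

```python
# Hypothetical centroids over (A, C, G, T, N) fractions: four homozygous
# states plus the two transition heterozygotes (A/G and C/T).
CENTROIDS = [
    (1.0, 0.0, 0.0, 0.0, 0.0),  # A
    (0.0, 1.0, 0.0, 0.0, 0.0),  # C
    (0.0, 0.0, 1.0, 0.0, 0.0),  # G
    (0.0, 0.0, 0.0, 1.0, 0.0),  # T
    (0.5, 0.0, 0.5, 0.0, 0.0),  # A/G transition
    (0.0, 0.5, 0.0, 0.5, 0.0),  # C/T transition
]


def to_centroid(fractions):
    """Return the index of the centroid closest to the given fractions."""
    def squared_distance(k):
        return sum((f - c) ** 2 for f, c in zip(fractions, CENTROIDS[k]))
    return min(range(len(CENTROIDS)), key=squared_distance)


def from_centroid(index, total):
    """Expand a stored centroid index back to floating point values."""
    return [c * total for c in CENTROIDS[index]]


index = to_centroid((0.52, 0.0, 0.47, 0.01, 0.0))  # close to an A/G heterozygote
values = from_centroid(index, 19.0)
```

Any position whose true fractions fall between centroids is snapped to the nearest one, which is the rounding error the slide asks about.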

Speed Comparison

Optimization Stats (chrX)

| Optimization | Memory | Mem % | Wallclock | TP | FP |
|---|---|---|---|---|---|
| Normal | 4.76GB | 100% | 04:25:55 | 1309 | 127 |
| CharDisc | 2.58GB | 54.2% | 04:36:58 | 677 | 0 |
| CentDisc | 2.01GB | 42.2% | 04:27:29 | 166 | 9058 |
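The memory percentages can be recomputed from the reported footprints, and the TP/FP counts can be condensed into a single figure. The precision column below, TP / (TP + FP), is a derived metric added here for illustration; it does not appear on the slide.

```python
# (memory in GB, true positives, false positives) as reported for chrX.
rows = {
    "Normal": (4.76, 1309, 127),
    "CharDisc": (2.58, 677, 0),
    "CentDisc": (2.01, 166, 9058),
}

baseline_gb = rows["Normal"][0]
summary = {}
for name, (mem_gb, tp, fp) in rows.items():
    summary[name] = {
        "mem_pct": 100 * mem_gb / baseline_gb,  # footprint relative to Normal
        "precision": tp / (tp + fp),            # derived, not on the slide
    }

for name, stats in summary.items():
    print(f"{name}: {stats['mem_pct']:.1f}% memory, precision {stats['precision']:.3f}")
```

Read this way, integer discretization (CharDisc) trades roughly half of the true positives for zero false positives, while centroid discretization saves the most memory but almost all of its calls are false positives.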

Conclusion
- For high error rates, the HMM approach is ideal, but it requires more memory
- Distributing the genome across processors doesn't scale linearly
- Discretization methods provide good memory reductions (down to 42% of the unoptimized footprint)
- Centroid discretization performs poorly
- Integer discretization can be used when available memory is low

Questions
