Advanced Statistical Methods for Variant Calling in Massively Parallel Sequencing Data

Explore the complex viral dynamics and variant calling techniques in massively parallel sequencing data, focusing on viral populations, mutation rates, and sequencing technologies. Understand the challenges in accurately identifying and quantifying viral quasispecies for improved variant calling in HIV and HCV studies. Delve into the intricacies of Sanger sequencing limitations and the advancements of massively parallel sequencing for more accurate frequency estimates of viral variants. Follow the process of fragmentation, amplification, and sequencing by synthesis in massively parallel sequencing to analyze viral population DNA fragments. Gain insights into the evolving field of viral dynamics and the identification of drug-sensitive and drug-resistant viral variants, essential for effective treatment strategies.

Uploaded by kasp_5 on Mar 21, 2025

Presentation Transcript

Statistical methods for improved variant calling of massively parallel sequencing data
VirVarSeq vs ViVaMBC
Bie Verbist | NCS Brugge | 10-10-2014
(Image: the structure of HIV.)

Outline
Viral dynamics
Massively parallel sequencing
Variant calling: VirVarSeq, ViVaMBC
Results: HCV plasmids, HCV clinical sample

Viral dynamics
A virus is a small infectious agent that replicates only inside the living cells of other organisms.
High replication rate (10^11 replications a day for HIV).
High mutation rate.
A viral population consists of closely related subgroups, the viral quasispecies, which we want to identify and quantify.

Viral dynamics
[Figure: number of viruses in the population over time, before treatment and on treatment. Labels: heterogeneous viral population, drug-sensitive variants, drug-resistant variant, undetectable.]

Sequencing
Sanger sequencing: detection limit 20-30%, no accurate estimate of frequency.
Massively parallel sequencing: detection limit << 20%, more accurate estimate of frequency.
Example reads:
ACGGTTTCCGTCTGGG
ACGGTTTCTGTCTGGG
ACGGTTTCCGTCTGGG
ACGGTTTCTGTCTGGG
ACGGTTTCTGTCTGGG
ACGGTTTCTGTCTGGG
ACGGTTTCTGTCTGGG
ACGATTTCTGTCTGGG
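As a small illustration of why read-level data gives frequency estimates, the eight example reads above can be tallied column by column. This is a minimal sketch that assumes the reads are already aligned and of equal length; the function names are made up for this example.

```python
from collections import Counter

# The eight example reads shown on the slide (assumed aligned, equal length).
reads = [
    "ACGGTTTCCGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGGTTTCCGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGATTTCTGTCTGGG",
]


def base_frequencies(reads):
    """Return, per alignment column, a dict mapping base -> relative frequency."""
    n = len(reads)
    return [
        {base: count / n for base, count in Counter(column).items()}
        for column in zip(*reads)
    ]


def variable_positions(freqs):
    """Return the 0-based indices of columns where more than one base occurs."""
    return [i for i, f in enumerate(freqs) if len(f) > 1]


freqs = base_frequencies(reads)
for i in variable_positions(freqs):
    print(i, freqs[i])
```

Two columns vary in these reads: index 8 (C in 2 of 8 reads, T in 6 of 8) and index 3 (A in 1 of 8 reads). A raw tally like this cannot tell whether the single A is a rare variant or a sequencing error, which is the problem the rest of the talk addresses.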

Massively parallel sequencing
Steps: fragmentation, amplification, sequencing by synthesis.
[Figure labels: viral population, DNA, fragments.]
Example, one fragment: A A G T A C G G T T T C T G C A C

Massively parallel sequencing
Viral population
@HWUSI-EAS1524:17:FC:1:120:19254:21417 1:N:0:GATCAG
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTGAAAAAAA
+
G@GG@GG@GGHHHBH>GEGDGGBGEGG?GGHHHH>GEGBG@?BEF?DBB<GDGGGGFGG3GGEBA>EC:;
@HWUSI-EAS1524:17:FC:1:120:9430:21420 1:N:0:GATCAG
ATCGGAAGAGCACACGNCTGAACTCCAGTCACGATCAGATCCCGTATGCCGTCTTCTGCTTGAAAAAAAA
+
DDDDDDDDDD2DDDDD#DDDDDDDDDDDDDDDDDDDDDDD2:8:7;<@>;DDDDDDDDDDD:DDDDD###
@HWUSI-EAS1524:17:FC:1:120:12760:21420 1:N:0:GATCAG
ATCATACTGTCTTACTNTGATAAAACCTCCAATTCCCCCTANCATTNTTGGTTNCCATCTTCCTTGCAAA
+
HHHHHHHHHHHHHHHG#GGGFFFF@HHHHHHGHHHHHHHHF#FFEB#BBBA>B#BFFFFFHHHHHHHHHG
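The records above are in FASTQ format: a header line, the base calls, a separator, and one quality character per base. The sketch below parses the second record and converts its quality string to Phred scores. It assumes the Phred+33 encoding (an assumption based on the header style, not something the slide states) and uses the standard Phred relation p = 10^(-Q/10). The helper names are hypothetical.

```python
# Second FASTQ record from the slide, copied as shown there.
RECORD = """@HWUSI-EAS1524:17:FC:1:120:9430:21420 1:N:0:GATCAG
ATCGGAAGAGCACACGNCTGAACTCCAGTCACGATCAGATCCCGTATGCCGTCTTCTGCTTGAAAAAAAA
+
DDDDDDDDDD2DDDDD#DDDDDDDDDDDDDDDDDDDDDDD2:8:7;<@>;DDDDDDDDDDD:DDDDD###
"""


def parse_fastq(text):
    """Yield (header, sequence, quality_string) from 4-line FASTQ records."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Decode a quality string; offset=33 is an assumption (Phred+33)."""
    return [ord(ch) - offset for ch in qual]


def error_probability(q):
    """Standard Phred relation: probability that the base call is wrong."""
    return 10 ** (-q / 10)


header, seq, qual = next(parse_fastq(RECORD))
scores = phred_scores(qual)

# Positions whose quality is at the lowest level seen in this record (Q <= 2).
low = [i for i, (base, q) in enumerate(zip(seq, scores)) if q <= 2]
for i in low:
    print(i, seq[i], scores[i], round(error_probability(scores[i]), 2))
```

Under this encoding the uncalled base `N` at index 16 carries the quality character `#`, which decodes to Q = 2, while most other bases in the read decode to Q = 35.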

Variant calling
Distinguish low-frequency variants from sequencing error.
VirVarSeq: adaptive filtering approach based on quality scores. Verbist et al. 2014, Bioinformatics. doi: 10.1093/bioinformatics/btu587.
ViVaMBC: model-based clustering approach which models the error probabilities based on quality scores. Verbist et al. 2014, BMC Bioinformatics, under revision.

VirVarSeq
Extract reads that cover the codon of interest.
Filter based on the quality scores.
Build a codon table.
Example, reads at position x: CGA CGA CCA CGA CGA CGA CCA CCA CGT CGA CGA GGA
Codon table after filtering:
Codon  Freq
CGA    0.62
CCA    0.25
GGA    0.13
* codon = nucleotide triplet which specifies a single amino acid
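The three steps above can be sketched in a few lines. Everything beyond the twelve codons from the slide is an assumption made for illustration: the per-base quality scores are invented, the threshold of 20 is a placeholder for the Q-threshold defined on the next slide, and the rule "drop a codon if any of its three bases falls below the threshold" is one plausible reading of "filter based on the quality scores", not a confirmed description of the VirVarSeq implementation.

```python
from collections import Counter

# Codons observed at position x (from the slide), each paired with
# HYPOTHETICAL per-base Phred scores invented for this example.
observations = [
    ("CGA", (38, 37, 39)),
    ("CGA", (35, 12, 36)),
    ("CCA", (39, 38, 38)),
    ("CGA", (37, 39, 35)),
    ("CGA", (36, 36, 2)),
    ("CGA", (39, 39, 39)),
    ("CCA", (34, 8, 37)),
    ("CCA", (38, 36, 35)),
    ("CGT", (37, 38, 2)),
    ("CGA", (35, 35, 36)),
    ("CGA", (39, 38, 37)),
    ("GGA", (36, 37, 38)),
]


def codon_table(observations, q_threshold):
    """Keep codons whose three bases all reach q_threshold, then tabulate."""
    kept = [codon for codon, quals in observations if min(quals) >= q_threshold]
    counts = Counter(kept)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders the table from most to least frequent codon.
    return {codon: n / total for codon, n in counts.most_common()}


table = codon_table(observations, q_threshold=20)  # 20 is a placeholder value
for codon, freq in table.items():
    print(codon, freq)
```

The invented scores were chosen so that four low-quality codons are dropped, including the single CGT. The remaining eight codons give 5/8, 2/8 and 1/8, in line with the rounded frequencies in the slide's codon table.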

VirVarSeq
Definition of the Q-threshold (QIT):
Fit a mixture distribution on the Q-scores with 3 components: a point probability around Q = 2, an error distribution, and a reliable call distribution.
The intersection point is the threshold.
[Figure: mixture distribution fitted on the Q-scores, with QIT marked.]
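One way to write the three-component mixture down is given below. The symbols are introduced here for illustration and do not come from the slide. The slide does not name the distribution families used for the error and reliable-call components, so they are left generic, and reading "intersection point" as the crossing of the two weighted densities is an assumption.

```latex
% Mixture over quality scores q; \pi_0, \pi_1, \pi_2 are mixing weights.
f(q) = \pi_0 \, \mathbf{1}[q = 2]
     + \pi_1 \, f_{\mathrm{err}}(q)
     + \pi_2 \, f_{\mathrm{rel}}(q),
\qquad \pi_0 + \pi_1 + \pi_2 = 1 .

% Threshold: the score at which the weighted error and reliable-call
% densities cross.
\pi_1 \, f_{\mathrm{err}}(\mathrm{QIT}) = \pi_2 \, f_{\mathrm{rel}}(\mathrm{QIT}) .
```

Because the mixture is refitted on each data set, the threshold adapts to the observed quality distribution, which is what makes the filtering approach adaptive.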

ViVaMBC
Extract reads that cover the codon of interest.
Perform model-based clustering: model the error probability; the clusters are unknown, so the EM algorithm is used.
Example, reads at position x: CGA CGA CCA CGA CGA CGA CCA CCA CGT CGA CGA GGA
Codon table after clustering:
Codon  Freq
CGA    0.62
CCA    0.25
GGA    0.13
Cluster medoid = variant. Size of cluster = frequency. Number of clusters = number of variants.
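The idea of clustering reads with a quality-based error model and EM can be illustrated with a deliberately simplified sketch. It is not the ViVaMBC implementation. It keeps the cluster centers fixed at the distinct observed codons instead of updating medoids, ignores second-best base calls, treats the three bases as independent, spreads the error probability evenly over the three other nucleotides, and uses invented quality scores. Only the twelve codons are taken from the slide.

```python
from math import prod

# Codons from the slide with HYPOTHETICAL per-base Phred scores.
observations = [
    ("CGA", (38, 37, 39)), ("CGA", (35, 12, 36)), ("CCA", (39, 38, 38)),
    ("CGA", (37, 39, 35)), ("CGA", (36, 36, 2)), ("CGA", (39, 39, 39)),
    ("CCA", (34, 8, 37)), ("CCA", (38, 36, 35)), ("CGT", (37, 38, 2)),
    ("CGA", (35, 35, 36)), ("CGA", (39, 38, 37)), ("GGA", (36, 37, 38)),
]


def phred_to_error(q):
    """Standard Phred relation between quality score and error probability."""
    return 10 ** (-q / 10)


def read_likelihood(codon, quals, center):
    """P(observed codon | true codon = center) under the simplified model."""
    return prod(
        (1 - phred_to_error(q)) if base == true else phred_to_error(q) / 3
        for base, q, true in zip(codon, quals, center)
    )


def em_codon_frequencies(observations, n_iter=100):
    """Estimate mixing weights over fixed candidate centers with EM."""
    centers = sorted({codon for codon, _ in observations})
    weights = {c: 1 / len(centers) for c in centers}
    # Centers are fixed, so per-read likelihoods can be computed once.
    lik = [
        {c: read_likelihood(codon, quals, c) for c in centers}
        for codon, quals in observations
    ]
    for _ in range(n_iter):
        # E-step: responsibility of each center for each read.
        resp = []
        for row in lik:
            joint = {c: weights[c] * row[c] for c in centers}
            total = sum(joint.values())
            resp.append({c: joint[c] / total for c in centers})
        # M-step: new weight = average responsibility over all reads.
        weights = {c: sum(r[c] for r in resp) / len(resp) for c in centers}
    return weights


weights = em_codon_frequencies(observations)
for codon, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(codon, round(w, 3))
```

With these invented scores the lone CGT read has a very low-quality third base, so its weight shrinks toward zero and the read is absorbed by the CGA cluster. Unlike the filtering approach, no read is discarded: low-quality reads still contribute, weighted by how well each candidate explains them.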

Results - HCV plasmids
Two plasmids covering amino acids 1 to 181 of the NS3 region.
They differ at two codon positions (36 and 155).
Mixed in 4 different proportions.

Results - HCV plasmids
Other variants (11481 max) are false positives.
VirVarSeq reports more false positives, with frequencies going up to 0.5%.

Results - HCV clinical sample
VirVarSeq reports more variants.
Above 1% the methods are in agreement, even above 0.5%.
A few false positives in the GC region for ViVaMBC?
[Figure with panels labeled VirVarSeq and ViVaMBC.]

VirVarSeq vs ViVaMBC
When applying reporting limits of 1% or 0.5%, the methods are in agreement.
Below this limit there is a trade-off between sensitivity and specificity, with VirVarSeq less specific.
VirVarSeq: adaptive approach, easy development, runs fast.
ViVaMBC: more elegant, longer development time, longer run time.

Acknowledgements
Promoters: Prof. Dr. L. Bijnens (2), Prof. Dr. O. Thas (1), Prof. Dr. L. Clement (1), and Yves Wetzels, Tobias Verbeke, Joris Meys (1) for IT support
Scientists within discovery sciences (2)
Non-clinical statistics team (2)

Thank you
bverbis2@its.jnj.com | 10-10-2014

Back-up

ViVaMBC
Notation:
r_i: best base calls of read i (i = 1, ..., n)
s_i: second best base calls of read i (i = 1, ..., n)
z_ij: z_ij = 1 when read i belongs to haplotype j (j = 1, ..., k)
[symbol missing]_j: probability to belong to haplotype j
Complete data likelihood:

ViVaMBC
Complete data likelihood:
The likelihood depends on the cluster membership z_ij, which is unknown, hence the EM algorithm.
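The equation on these two back-up slides did not survive in the transcript. For orientation only, the generic complete-data likelihood of a finite mixture model has the form below, written with the notation of the previous slide. The symbol \pi_j for the haplotype probability is an assumption, since the original symbol is lost, and the read-level term P(r_i, s_i | j) is left unspecified because the slides do not give its form.

```latex
% Generic mixture-model complete-data likelihood (not recovered from the slide).
L_c = \prod_{i=1}^{n} \prod_{j=1}^{k}
      \left[ \pi_j \, P(r_i, s_i \mid j) \right]^{z_{ij}},
\qquad
\log L_c = \sum_{i=1}^{n} \sum_{j=1}^{k}
      z_{ij} \left[ \log \pi_j + \log P(r_i, s_i \mid j) \right].

% E-step: replace the unobserved z_{ij} by its conditional expectation.
\hat{z}_{ij} = \frac{\pi_j \, P(r_i, s_i \mid j)}
                    {\sum_{l=1}^{k} \pi_l \, P(r_i, s_i \mid l)}

% M-step for the mixing weights.
\hat{\pi}_j = \frac{1}{n} \sum_{i=1}^{n} \hat{z}_{ij}
```

The memberships z_ij are not observed, so EM alternates between the two steps until the estimates stabilize.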

Library preparation
Sequencing by synthesis
