Advanced Statistical Methods for Variant Calling in Massively Parallel Sequencing Data

Explore the complex viral dynamics and variant calling techniques in massively parallel sequencing data, focusing on viral populations, mutation rates, and sequencing technologies. Understand the challenges in accurately identifying and quantifying viral quasispecies for improved variant calling in HIV and HCV studies. Delve into the intricacies of Sanger sequencing limitations and the advancements of massively parallel sequencing for more accurate frequency estimates of viral variants. Follow the process of fragmentation, amplification, and sequencing by synthesis in massively parallel sequencing to analyze viral population DNA fragments. Gain insights into the evolving field of viral dynamics and the identification of drug-sensitive and drug-resistant viral variants, essential for effective treatment strategies.

  • Statistical Methods
  • Variant Calling
  • Massively Parallel Sequencing
  • Viral Dynamics
  • HIV


Presentation Transcript


  1. Statistical methods for improved variant calling of massively parallel sequencing data: VirVarSeq vs ViVaMBC. Pictured above: the structure of HIV. Bie Verbist | NCS Brugge | 10-10-2014

  2. Outline: Viral dynamics; Massively parallel sequencing; Variant calling (VirVarSeq, ViVaMBC); Results (HCV plasmids, HCV clinical sample).

  3. Viral dynamics. A virus is a small infectious agent that replicates only inside the living cells of other organisms. High replication rate (10^11 replications a day for HIV). High mutation rate. A viral population consists of closely related subgroups, the viral quasispecies, which we want to identify and quantify.

  4. Viral dynamics. [Figure: number of viruses in a heterogeneous viral population (drug-sensitive variants and a drug-resistant variant) plotted against time, before treatment and on treatment, with an "Undetectable" level marked.]

  5. Sequencing. Sanger sequencing: detection limit 20-30%, no accurate estimate of frequency. Massively parallel sequencing: detection limit << 20%, more accurate estimate of frequency. Example reads:
     ACGGTTTCCGTCTGGG
     ACGGTTTCTGTCTGGG
     ACGGTTTCCGTCTGGG
     ACGGTTTCTGTCTGGG
     ACGGTTTCTGTCTGGG
     ACGGTTTCTGTCTGGG
     ACGGTTTCTGTCTGGG
     ACGATTTCTGTCTGGG
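
The frequency estimate on this slide amounts to counting bases per column across the reads. A minimal sketch in Python, using the eight example reads from the slide; the function names are illustrative and not taken from the presentation, and the reads are assumed to be already aligned and of equal length.

```python
from collections import Counter

# The eight example reads shown on the slide (assumed aligned, equal length).
reads = [
    "ACGGTTTCCGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGGTTTCCGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGATTTCTGTCTGGG",
]


def base_frequencies(reads, position):
    """Return {base: relative frequency} at a 0-based column of the read stack."""
    counts = Counter(read[position] for read in reads)
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


def variable_positions(reads):
    """List the 0-based columns where more than one base is observed."""
    length = len(reads[0])
    return [p for p in range(length) if len({r[p] for r in reads}) > 1]


for pos in variable_positions(reads):
    print(pos, base_frequencies(reads, pos))
```

With only eight reads the smallest observable frequency is 1/8; real runs cover each position far more deeply, which is what pushes the detection limit well below that of Sanger sequencing.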

  6. Massively parallel sequencing. Steps: fragmentation, amplification, sequencing by synthesis. Diagram labels: viral population, DNA, fragments. Example, one fragment: A A G T A C G G T T T C T G C A C

  7. Massively parallel sequencing. Viral population reads (FASTQ records):
     @HWUSI-EAS1524:17:FC:1:120:19254:21417 1:N:0:GATCAG
     GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTGAAAAAAA
     +
     G@GG@GG@GGHHHBH>GEGDGGBGEGG?GGHHHH>GEGBG@?BEF?DBB<GDGGGGFGG3GGEBA>EC:;
     @HWUSI-EAS1524:17:FC:1:120:9430:21420 1:N:0:GATCAG
     ATCGGAAGAGCACACGNCTGAACTCCAGTCACGATCAGATCCCGTATGCCGTCTTCTGCTTGAAAAAAAA
     +
     DDDDDDDDDD2DDDDD#DDDDDDDDDDDDDDDDDDDDDDD2:8:7;<@>;DDDDDDDDDDD:DDDDD###
     @HWUSI-EAS1524:17:FC:1:120:12760:21420 1:N:0:GATCAG
     ATCATACTGTCTTACTNTGATAAAACCTCCAATTCCCCCTANCATTNTTGGTTNCCATCTTCCTTGCAAA
     +
     HHHHHHHHHHHHHHHG#GGGFFFF@HHHHHHGHHHHHHHHF#FFEB#BBBA>B#BFFFFFHHHHHHHHHG
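
Each read on this slide is a FASTQ record: a header, the base calls, a "+" separator, and one quality character per base. The sketch below parses the third record and converts the quality characters to Phred scores. It assumes the Phred+33 encoding (an assumption, suggested by characters such as "#" and "2" in the quality strings); the function names are illustrative.

```python
# Third record from the slide, restored to the usual 4-line FASTQ layout.
FASTQ_TEXT = """\
@HWUSI-EAS1524:17:FC:1:120:12760:21420 1:N:0:GATCAG
ATCATACTGTCTTACTNTGATAAAACCTCCAATTCCCCCTANCATTNTTGGTTNCCATCTTCCTTGCAAA
+
HHHHHHHHHHHHHHHG#GGGFFFF@HHHHHHGHHHHHHHHF#FFEB#BBBA>B#BFFFFFHHHHHHHHHG
"""


def parse_fastq(text):
    """Yield (name, sequence, quality_string) for each 4-line FASTQ record."""
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ text must contain 4 lines per record")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = (ln.strip() for ln in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Decode quality characters to integer Phred scores (offset is an assumption)."""
    return [ord(c) - offset for c in qual]


def error_probability(q):
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


for name, seq, qual in parse_fastq(FASTQ_TEXT):
    scores = phred_scores(qual)
    q_at_n = {q for base, q in zip(seq, scores) if base == "N"}
    print(name, "min Q:", min(scores), "max Q:", max(scores), "Q at N calls:", q_at_n)
```

Under this encoding the undetermined base calls (N) all carry the quality character "#", which decodes to Q 2.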

  8. Variant calling: distinguish low-frequency variants from sequencing error. VirVarSeq: adaptive filtering approach based on quality scores (Verbist et al. 2014, Bioinformatics, doi: 10.1093/bioinformatics/btu587). ViVaMBC: model-based clustering approach, which models the error probabilities based on quality scores (Verbist et al. 2014, BMC Bioinformatics, under revision).

  9. VirVarSeq. Steps: extract reads that cover the codon of interest; filter based on the quality scores; build a codon table. Example at position x of the reference, codons observed in the reads: CGA CGA CCA CGA CGA CGA CCA CCA CGT CGA CGA GGA. Codon table after filtering: CGA 0.62, CCA 0.25, GGA 0.13. * A codon is a nucleotide triplet that specifies a single amino acid.
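
The filter-then-count step can be sketched as follows. This is an illustration, not the VirVarSeq implementation: the codons are those on the slide, but the per-base quality scores and the threshold of 20 are hypothetical, chosen so that the surviving codons reproduce the slide's table. The rule that a codon is kept only when all three of its bases reach the threshold is also an assumption.

```python
from collections import Counter

# Codons from the slide, each paired with HYPOTHETICAL per-base quality scores.
observations = [
    ("CGA", (38, 37, 39)),
    ("CGA", (2, 30, 35)),    # low-quality first base
    ("CCA", (36, 38, 37)),
    ("CGA", (39, 39, 38)),
    ("CGA", (37, 36, 38)),
    ("CGA", (33, 12, 30)),   # low-quality second base
    ("CCA", (38, 38, 38)),
    ("CCA", (35, 8, 36)),    # low-quality second base
    ("CGT", (37, 36, 5)),    # low-quality third base
    ("CGA", (38, 39, 39)),
    ("CGA", (36, 37, 38)),
    ("GGA", (38, 38, 37)),
]


def codon_table(observations, q_threshold):
    """Keep codons whose three base qualities all reach q_threshold,
    then return {codon: relative frequency} among the kept codons."""
    kept = [codon for codon, quals in observations if min(quals) >= q_threshold]
    if not kept:
        return {}
    counts = Counter(kept)
    return {codon: n / len(kept) for codon, n in counts.most_common()}


for codon, freq in codon_table(observations, q_threshold=20).items():
    print(codon, freq)
```

In this toy data the filter removes four of the twelve codons, including the single CGT observation, so the table is built from eight codons.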

  10. VirVarSeq: definition of the Q-threshold (QIT). Fit a mixture distribution with 3 components to the Q-scores: a point probability around Q 2, an error distribution, and a reliable call distribution. The intersection point is the threshold. [Figure not reproduced.]
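
Once a mixture has been fitted, taking an intersection point as the threshold is a one-dimensional root-finding problem. The sketch below makes several assumptions that the slide does not state: that the intersection meant is between the error component and the reliable call component, that both are normal densities, and that the weights, means and standard deviations shown are purely hypothetical fitted values. It does not fit the mixture itself.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution (assumed component family)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def intersection(w_err, mu_err, sd_err, w_ok, mu_ok, sd_ok, tol=1e-9):
    """Bisection for the Q between the two component means at which the
    weighted error density equals the weighted reliable-call density."""
    def diff(q):
        return (w_err * normal_pdf(q, mu_err, sd_err)
                - w_ok * normal_pdf(q, mu_ok, sd_ok))

    lo, hi = mu_err, mu_ok
    if diff(lo) <= 0 or diff(hi) >= 0:
        raise ValueError("weighted densities do not cross between the means")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if diff(mid) > 0:
            lo = mid   # error component still dominates: move right
        else:
            hi = mid   # reliable component dominates: move left
    return (lo + hi) / 2


# HYPOTHETICAL fitted parameters (weight, mean, sd) for the two components.
q_threshold = intersection(0.10, 15.0, 5.0, 0.85, 36.0, 3.0)
print("threshold:", round(q_threshold, 2))
```

Because the threshold comes from the fitted distributions rather than a fixed cut-off, it adapts to the quality profile of each data set.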

  11. ViVaMBC. Steps: extract reads that cover the codon of interest; perform model-based clustering (model the error probability; clusters unknown, EM algorithm). Example at position x of the reference, codons observed in the reads: CGA CGA CCA CGA CGA CGA CCA CCA CGT CGA CGA GGA. Codon table after clustering: CGA 0.62, CCA 0.25, GGA 0.13. Cluster medoids = variants; size of cluster = frequency; N clusters = N variants.
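
The clustering idea can be illustrated with a heavily simplified EM loop. This is a sketch under stated assumptions, not the ViVaMBC model: it uses only the best base calls, takes the distinct observed codons as a fixed set of candidate variants, converts each quality score to an error probability with the Phred formula, and spreads an error evenly over the three other bases. The quality scores are hypothetical.

```python
def read_likelihood(read, quals, haplotype):
    """P(read | haplotype): a base matches with probability 1 - e,
    and is one specific wrong base with probability e / 3."""
    p = 1.0
    for base, q, true_base in zip(read, quals, haplotype):
        e = 10 ** (-q / 10)
        p *= (1 - e) if base == true_base else e / 3
    return p


def em_frequencies(reads, quals, haplotypes, n_iter=200):
    """Estimate mixing proportions of fixed candidate haplotypes by EM."""
    k = len(haplotypes)
    tau = [1.0 / k] * k
    lik = [[read_likelihood(r, q, h) for h in haplotypes]
           for r, q in zip(reads, quals)]
    for _ in range(n_iter):
        # E-step: posterior probability that read i belongs to haplotype j.
        resp = []
        for row in lik:
            weighted = [t * l for t, l in zip(tau, row)]
            total = sum(weighted)
            resp.append([w / total for w in weighted])
        # M-step: each proportion is the mean responsibility over reads.
        tau = [sum(r[j] for r in resp) / len(reads) for j in range(k)]
    return tau


# Codons from the slide; quality scores are HYPOTHETICAL.
reads = ["CGA", "CGA", "CCA", "CGA", "CGA", "CGA",
         "CCA", "CCA", "CGT", "CGA", "CGA", "GGA"]
quals = [(38, 38, 38)] * len(reads)
quals[8] = (38, 38, 5)  # the lone CGT read has a low-quality third base
haplotypes = ["CGA", "CCA", "CGT", "GGA"]

for hap, freq in zip(haplotypes, em_frequencies(reads, quals, haplotypes)):
    print(hap, round(freq, 3))
```

In this toy run the low-quality CGT read is absorbed into the CGA cluster and the CGT proportion shrinks towards zero, while the high-quality GGA read keeps its own cluster. That is the behaviour the model-based approach aims for: error-like observations are explained by their quality scores rather than reported as variants.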

  12. Results: HCV plasmids. Two plasmids covering amino acids 1 to 181 of the NS3 region, which differ at two codon positions (36 and 155), mixed in 4 different proportions.

  13. Results: HCV plasmids. Other variants (11481 max) are false positives. VirVarSeq reports more false positives, with frequencies going up to 0.5%.

  14. Results: HCV clinical sample. VirVarSeq reports more variants. Above 1% the methods are in agreement, even above 0.5%. Few false positives in GC region for ViVaMBC? [Figures for VirVarSeq and ViVaMBC not reproduced.]

  15. VirVarSeq vs ViVaMBC. When applying reporting limits of 1% or 0.5%, the methods are in agreement. Below this limit there is a trade-off between sensitivity and specificity, with VirVarSeq less specific. VirVarSeq: adaptive approach, easy development, runs fast. ViVaMBC: more elegant, longer development time, longer run time.

  16. Acknowledgements. Promoters: Prof. Dr. L. Bijnens (2), Prof. Dr. O. Thas (1), Prof. Dr. L. Clement (1). Yves Wetzels, Tobias Verbeke and Joris Meys (1) for IT support. Scientists within discovery sciences (2). Non-clinical statistics team (2).

  17. Thank you bverbis2@its.jnj.com 10-10-2014

  18. Back-up

  19. ViVaMBC. Notation: r_i: best base calls of read i (i = 1, ..., n); s_i: second-best base calls of read i (i = 1, ..., n); z_ij: z_ij = 1 when read i belongs to haplotype j (j = 1, ..., k); [...]_j: probability of belonging to haplotype j. Complete data likelihood: [formula not reproduced.]

  20. ViVaMBC. Complete data likelihood: [formula not reproduced.] The likelihood depends on the cluster membership z_ij, which is unknown, hence the EM algorithm.
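
The slide's own formula was an image and is not recoverable from the transcript. For orientation only, the textbook complete-data likelihood of a finite mixture, written with the slide's r_i, s_i and z_ij, has the form below. The symbol \tau_j for the probability of belonging to haplotype j and the generic component density f with parameters \theta_j are assumptions, not the presentation's notation.

```latex
% Generic finite-mixture form (assumed; not the slide's exact formula).
\[
  L_c(\tau, \theta)
  = \prod_{i=1}^{n} \prod_{j=1}^{k}
    \bigl[\, \tau_j \, f(r_i, s_i \mid \theta_j) \,\bigr]^{z_{ij}}
\]

% E-step: replace the unknown memberships z_{ij} by their expectations.
\[
  \hat{z}_{ij}
  = \frac{\tau_j \, f(r_i, s_i \mid \theta_j)}
         {\sum_{l=1}^{k} \tau_l \, f(r_i, s_i \mid \theta_l)}
\]

% M-step for the mixing proportions.
\[
  \hat{\tau}_j = \frac{1}{n} \sum_{i=1}^{n} \hat{z}_{ij}
\]
```

Because z_ij is not observed, the E-step substitutes its conditional expectation and the M-step maximises the resulting expected log-likelihood; the two steps alternate until the estimates stabilise.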

  21. Library preparation. Sequencing by synthesis.
