Genomic Mutation Analysis and Null Model Development for Cancer Research

1 / 63

Embed Share

Explore the development of the LARVA method focusing on whole-genome somatic mutations with computational efficiency improvements and variant data annotation. The null model creation is discussed to distinguish mutations in cancer vs. background mutation processes. Discover the motivation behind the null model, including factors from the MutSig tool, to recreate expected somatic variations due to background mutation processes. The starting point for the null model is elucidated based on gene expression, DNA replication timing, and HiC chromatin structure. Dive into the methods, results, and implications for cancer research.

reb_gut Follow

Uploaded on May 08, 2025 | 1 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

LARVA: A Bugs Life LL s Group Meeting August 13, 2014 1

Outline LARVA Methods Development Whole genome somatic mutation null model Computational Efficiency Improvements p-value stability tests LARVA Results Variant Data Annotation Data Results 2

Outline LARVA Methods Development Whole genome somatic mutation null model Computational Efficiency Improvements p-value stability tests LARVA Results Variant Data Annotation Data Results 3

Null Model: Motivation Due to breakdown in DNA repair processes in cancer progression, many single nucleotide variants (hereafter called variants ) and other types of mutation accumulate due to background mutation processes Need to distinguish mutations arising from background vs. cancer For a dataset with n variants and m samples, Simulate the creation of nrand random datasets Each of these random datasets also has n variants and m samples, but variants positions are randomized Using null model: Probability distribution over whole genome that represents the different susceptibilities of each region to background mutation Determined if mutation observed from actual data matches the pattern expected under the null model 4

Null model intended to recreate the expected somatic variation due to background mutation processes Based on factors used in MutSig Lawrence, M. S. et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature499, 214 218 (2013). 6

Null Model: Starting Point Broad Institute s MutSig tool uses a null model based on three factors Gene expression: Genes with higher expression are subject to more transcription-coupled repair fewer variants1 DNA replication timing: Depletion of free nucleotides later in replication phase leads to higher error rates2 HiC chromatin structure: Research indicates that cancer SNV density anti-correlates with open chromatin3 1. Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature483, 603 307 (2012). 2. Chen, C. L. et al. Impact of replication timing on non-CpG and CpG substitution rates in mammalian genomes. Genome Research20, 447 457 (2010). 3. Lieberman-Aiden, E. et al. Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome. Science326, 289 293 (2009). 7

Null Model: MutSig Version Factors from the previous slide go into the gene- specific background mutation rate MutSig also incorporates a patient-specific background mutation rate Based on observed variants pooled from every patient in a cohort For genes where data is too sparse, the mutation rate is determined by pooling data from genes with similar properties Significance levels (p-values) determined from whether observed variant counts significantly exceed expected variant counts False-discovery rate (q-values) then derived from p- values Genes with q 0.1 are reported as significantly mutated 8

Null Model: Larval Form for LARVA LARVA is about exploring recurrent variation in noncoding annotations Relatively unexplored in previous work Hence, LARVA s null model innovations include: Applicability to whole genome Outperform previous methods in false discovery rate Started with collecting whole genome data with literature support for the existence of a correlation between these data and cancer SNV density 9

Null Model: Data Involved Whole genome signal tracks 1KG SNV density: Derived from phase 1 data GC percent: GC 5 base track from UCSC Genome Browser H3K4me1 marks: Average of 12 cell lines from UCSC Genome Browser H3K4me3 marks: Average of 10 cell lines from UCSC Genome Browser H3K9me3 marks: Average of 6 cell lines from UCSC Genome Browser 10

Null Model: Data Involved H3K27ac marks: Average of 17 cell lines from UCSC Genome Browser HiC chromatin state: Derived from Lieberman-Aiden et al1 data (also used in MutSig) DNA recombination rate: deCODE recombination rate from UCSC Genome Browser DNA replication timing: Derived from Chen et al2 data (also used in MutSig) RNA-seq: Average of cell lines in CCLE (Cancer Cell Line Encyclopedia) Target data: Baca et al3 WG prostate cancer variants: 57 samples, ~350,000 variants 1. Lieberman-Aiden, E. et al. Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome. Science326, 289 293 (2009). 2. Chen, C. L. et al. Impact of replication timing on non-CpG and CpG substitution rates in mammalian genomes. Genome Research20, 447 457 (2010). 3. Baca, S. C. et al. Punctuated Evolution of Prostate Cancer Genomes. Cell153, 666 677 (2013). 11

Null Model: Larval Form for LARVA Initial thoughts about using this data Divide the whole genome into 100,000-bp bins Derive a rank for each bin in each data track based on the cumulative distribution function (CDF) of each track e.g. Bins with a higher rank in the DNA replication timing track would receive a higher value in the resulting predicted somatic variant density track Initially assumed all weights = 1 predicted_somatic_variant_density r ( )=w1log CDF reptiming r ( ) w3log 1-CDF H3K4me3 r ( ) ( ( )+w2log 1-CDF H3K4me1 r ( ) ) ( )+ ) ( ) ( ) ( )+w4log 1-CDF expression r ( ) )+w5log CDF SNV_density r ( ) ( ) ( ) ( ( 12

Null Model: Validation Does this model actually make sense? In other words, where s the validation? Back up a few steps How correlated are these tracks? 13

Null Model: Pairwise Correlations Matrix of pairwise correlations between all tracks and Level 2 somatic mutations from TCGA prostate cancer data Most significant correlations are between 1KG SNV density and other tracks Correlation matrix legend Blue: Correlation with significant p-value Red: Correlation > 0.30 with significant p-value P-value matrix legend Red: P-value < 0.05 Correlation matrix 1KG SNV density H3K4me1 marks H3K4me3 marks RNA-seq 1 -0.0291 -0.1595 0.8062 0.1313 0.0085 -0.421 -0.0652 0.6253 -0.3238 0.3904 -0.118 0.2629 0.0079 DNA replication timing GC bias DNA recombination rate PRAD somatic variant density 1KG SNV density H3K4me1 marks H3K4me3 marks RNA-seq DNA replication timing GC bias DNA recombination rate PRAD somatic variant density 1 1 -0.0139 0.0447 -0.4055 -0.156 -0.0205 1 -0.1075 0.0742 0.028 0.0708 1 -0.2226 -0.0997 -0.1804 1 0.4188 0.1819 1 0.0368 1 P-value matrix 1KG SNV density H3K4me1 marks H3K4me3 marks RNA-seq 1 0 0 0 0.1512 0 0 0 0 0.182 DNA replication timing GC bias DNA recombination rate PRAD somatic variant density 1KG SNV density H3K4me1 marks H3K4me3 marks RNA-seq DNA replication timing GC bias DNA recombination rate PRAD somatic variant density 1 0 1 0.0183 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0.0005 1 14

Null Model: Pairwise Correlations Later used a wider collection of data against the Baca somatic variant density Also 1kb bins (down from 100kb) Correlations > 0.5 highlighted in red Mostly between histone tracks Baca somatic track s column indicates that all other tracks are uncorrelated Correlation matrix DNA recombination rate DNA replication timing Baca somatic GC percent H3K27ac 1 0.0645 -0.0249 -0.0262 -0.0299 -0.0255 -0.0251 -0.0266 H3K4me1 -0.0262 -0.2955 0.9935 H3K4me3 -0.0299 -0.3409 0.7988 0.8213 H3K9me3 -0.0255 -0.3254 0.9868 0.9869 0.7907 HiC 1KG Baca somatic GC percent H3K27ac H3K4me1 H3K4me3 H3K9me3 HiC 1KG DNA recombination rate DNA replication timing 0.0645 -0.0249 -0.2833 -0.0251 0.1858 0.0313 0.0302 0.0561 -0.0023 -0.0266 0.3121 -0.044 -0.0475 -0.0486 -0.0692 0.1295 0.0049 0.396 -0.1146 -0.1149 -0.1478 -0.1277 0.1157 0.1709 0.0393 -0.189 0.0245 0.0356 0.0023 0.079 -0.3642 -0.1695 1 -0.2833 -0.2955 -0.3409 -0.3254 0.1858 0.3121 1 0.9935 0.7988 0.9868 0.0313 -0.044 1 0.8213 0.9869 0.0302 -0.0475 1 0.7907 0.0561 -0.0486 1 -0.0023 -0.0692 1 0.1295 1 0.0049 0.0393 0.396 -0.189 -0.1146 0.0245 -0.1149 0.0356 -0.1478 0.0023 -0.1277 0.079 0.1157 -0.3642 0.1709 -0.1695 1 -0.0856 -0.0856 1 15

Null Model: Trial and (Mostly) Error Generalized linear model fitting Representative example of the results Support vector machines Tried virtually every function family in the documentation Cross-validation error never dropped below 0.85 target predicted 16

Null Model: The Auction July 16, 2014: The day the LARVA null model went on the auction block Way more entertaining than the World Cup! (eyewitness account) The Bids: Embryo Model: Use 1KG variant density as a proxy for all background mutation processes Simplistic given the current state of the art 17

Null Model: The Auction Fetal model: Take all regions, use a factor to do a categorization into low, medium, and high regions Look at how the somatic variant density varies between regions Is there a significant difference between these categories? Baby model: In a cohort, model the distribution of X in each patient Where X is a Poisson distributed whole genome signal track e.g. histones, expression, chromatin, DNA reptiming, etc. Sum the lambda s from each patient to create a single model to use as the null model 18

Null Model: Working with the Fetal Model Created a hierarchical clustering of the bins with the 9 covariates currently being worked with 1KG SNV density, GC percent, HiC, H3K4me1, H3K4me3, H3K9me3, H3k27ac, DNA recombination rate, DNA replication timing 1kb bins means ~ 3 million bins total Due to memory constraints, had to downsample a hundredfold So the actual matrix used was every 100th row of the original matrix Produced a hierarchical clustering with R (plot on next slide) 19

Null Model: Hierarchical Clustering of Covariates There appear to be five major clusters When you look at this height 20

Null Model: Hierarchical Clustering of Covariates Split the tree into 5 clusters using cutree in R Retrieve the somatic variant bins for each cluster Sizes of each cluster Cluster 1: 28353 Cluster 2: 211 Cluster 3: 149 Cluster 4: 64 Cluster 5: 33 Most of the entries are zero 21

Null Model: Boxplot of Clusters 15 10 5 0 1 2 3 4 5 22

Null Model: (New) Trial and (Hopefully Less) Error Model the null mutation rate for each class of annotations we re studying Determine expected mutation rate relative to other [exons, pseudogenes, introns, ] Within each class, take the probability pi that a single point mutation falls into each bin as: pi= ki ki Where ki = the number of intersecting 1KG variants in bin i ki is the sum of 1KG variants over all class bins 23

Null Model: (New) Trial and (Hopefully Less) Error Intersect somatic variant track with annotation bins Let Si be the number of intersecting somatic variants in bin i Let S = total intersecting somatic variants across all class bins Si , the # of expected mutation in bin i, is: Si '=S pi From this we derive the of the Poisson distribution that represents the expected mutation rate: li=Si Li ' Length of bin i 24

Null Model: (New) Trial and (Hopefully Less) Error The p-value significance that a bin has more mutations than expected is: p-value=1- Pois x,li ( x si ) Adjust for FDR (R s p.adjust) The bins with adjusted p-value 0.05 are considered significant IF (percent of bins with significant p-values in [coding exons|introns|UTRs]) < (percent of bins with significant p-values in whole genome) THEN We can say that ranking bins by a class-based mutation rate will produce fewer false positives that a whole genome-based mutation rate ELSE start_over(); 25

Null Model: (New) Trial and (Hopefully Less) Error Divided the human genome into coding exons, UTRs, introns, and intergenic regions Split into 1kb bins CDS and UTR have much lower percent of significant bins compared to whole genome Introns and intergenic regions less so This might be workable with some additional work 1kb bins CDS Num bins Num adjusted p-values <= 0.05 % bins 4216 505 0.119781784 UTR Num bins Num adjusted p-values <= 0.05 % bins 19318 1548 0.080132519 Introns Num bins Num adjusted p-values <= 0.05 % bins 1093694 401057 0.366699461 Intergenic Num bins Num adjusted p-values <= 0.05 % bins 1890738 1050318 0.555506897 hg19 (whole genome) Num bins Num adjusted p-values <= 0.05 % bins 3095665 1356025 0.438039969 26

Outline LARVA Methods Development Whole genome somatic mutation null model Computational Efficiency Improvements p-value stability tests LARVA Results Variant Data Annotation Data Results 27

LARVA-SAM computational efficiency: Random variant position picking LARVA-SAM used to not scale above a few hundred nrand datasets Past this, timescale stretches from hours into days or weeks Code profiling revealed the most compute-intensive step was picking the position of each variant Old algorithm: Treat the entire exome/genome as a number line from 1 to weight_sum Use a pseudorandom number generator over [1:weight_sum] Find the annotation/region that contains the coordinate produced by the rand generator 28

LARVA-SAM computational efficiency: Random variant position picking Remedy Use the position picking code as a precomputation Generate a variant pool that has the distribution implied by the exome/wg null models Then create random variant datasets with variants randomly drawn from this pool ~1 million variant pool 29

LARVA-SAM Computational Efficiency: The Rewrite Discovered that LARVA-SAM s efficiency was held back by the use of files to communicate results between processes CPAN Parallel::ForkManager module has the same problem Decided to migrate to true interprocess communication with OpenMPI Interface between Perl and MPI was kludgy Decided to rewrite the entire software suite in C++, since the Perl MPI module was translating all the MPI commands into C anyway 30

Before 1400 1200 19 hr 4 min 18 hr 52 min 1000 Running time (min) 800 600 7 hr 45 min 7 hr 40 min 400 5 hr 13 min 4 hr 50 min 200 2 hr 4 min 1 hr 55 min 0 all prostate vs. KEGG all prostate vs. exons Gr 4 glioma vs. KEGG Gr 4 glioma vs. exons all prostate vs. KEGG all prostate vs. exons Gr 4 glioma vs. KEGG Gr 4 glioma vs. exons 31 nrand=300 nrand=120 LARVA-SAM run

After Same scalability as before Up to 40x speedup over Perl version! Running Time of LARVA-SAM (C++/OpenMPI version) Under Various Parameter Settings 18 16 16 14 14 12 Running Time (min) 10 10 10 8 8 7 6 4 3 3 2 0 all prostate vs. KEGG all prostate vs. exons Gr 4 glioma vs. KEGG Gr 4 glioma vs. exons all prostate vs. KEGG all prostate vs. exons Gr 4 glioma vs. KEGG Gr 4 glioma vs. exons 32 nrand=300 nrand=120

Before LARVA-SAM Running Time w.r.t. # CPU cores 140 120 100 Running Time (min) 80 60 40 20 0 15 30 60 # CPU Cores 33

After Running Time vs. CPU core count - Query=all prostate vs. KEGG, nrand=180 30 25 20 Running Time (min) 15 10 5 0 1 2 4 8 15 30 # CPU cores Diminishing returns as # CPU cores increase, but what is the optimal number of cores? 34

Determining the Optimal Number of CPU Cores Decided to determine this based on the performance gain relative to the number of CPU cores added For two timing tests t1 and t2, where t1.ncpu < t2.ncpu, calculate performance gain as follows: t2.running_time-t1.running_time / t1.running_time t2.ncpu-t1.ncpu ( ) Example: t1: 2 cores, 23 min t2: 4 cores, 13 min ( ) Perf _gain=13-23 / 23 0.217 4-2 35

Determining the Optimal Number of CPU Cores Percent performance gain/cores added Percent performance gain/cores added nrand=120 nrand=120 0.7 0.6 Percent performance improvement per Percent performance improvement per 0.6 0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 core core 0.1 0.1 0 0 2 4 8 15 30 2 4 8 15 30 # CPU cores # CPU cores Percent performance gain/cores added Percent performance gain/cores added nrand=300 nrand=180 0.5 Percent performance improvement per 0.6 Percent performance improvement per 0.4 0.5 0.3 0.4 0.3 0.2 0.2 core 0.1 core 0.1 0 0 2 4 8 15 30 2 4 8 15 30 36 # CPU cores # CPU cores

Outline LARVA Methods Development Whole genome somatic mutation null model Computational Efficiency Improvements p-value stability tests LARVA Results Variant Data Annotation Data Results 37

p-value stability: Overview LARVA-SAM identifies significantly higher and lower patterns of recurrent variation relative to the pattern seen in random variant datasets Creation of random variant datasets is computationally expensive What is the lowest number of random variant datasets (nrand) to generate that is sufficient to achieve stable p-values in the LARVA-SAM significance test? 38

p-value stability: Overview Current version of LARVA-SAM produces p-values for: nsamp The number of samples with variants that overlap the given pathway nannot The number of annotations (exons) mutated in at least 2 samples (nsamp >= 2) ngene The number of genes mutated in at least 2 samples (nsamp >= 2) nvar The number of positions where variants from multiple samples overlap exactly (recurrent variants) Pathway analysis used here looked at nsamp, nannot, and nvar (ngenewasn t implemented yet) 39

Given the expected multisample variant frequency distribution for each annotation (e.g. gene) or annotation set (e.g. pathway), how does the observed variant frequency compare? Assume the expected distribution is Gaussian How outlier is the observed variant frequency in this distribution? 41

p-value stability: Methods Run the same query through LARVA-SAM at different nrand settings LARVA-SAM(all prostate, KEGG) nrand range: 500 to 10,000 Between consecutive runs r1 and r2, where r1.nrand < r2.nrand, look at how many significance threshold crossings there are 42

p-value stability: Methods with Example Consider the following example data: Observed nsamp Observed nannot Observed nvar nsamp expected avg 1.15E+02 9.60E+01 1.40E+02 1.16E+02 9.27E+01 1.42E+02 nsamp p- value nannot expected avg 1.70E+01 6.38E+00 nannot p- value 1.78E-04 1.30E-09 0.0573091 6.67E-04 2.25E-11 nvar expected avg 1.25E-01 nrand Pathway 500kegg_pathways_in_cancer.txt kegg_focal_adhesion.txt kegg_neuroactive_ligand_receptor_interaction.txt 1000kegg_pathways_in_cancer.txt kegg_focal_adhesion.txt kegg_neuroactive_ligand_receptor_interaction.txt nvar p-value 1.26E-125 0.25 0.04163226 1.25E-01 2.50E-01 1.88E-01 0.01868649 6.25E-02 154 131 122 154 131 122 30 20 31 30 20 31 8 1 3 8 1 3 2.20E-08 6.02E-31 1.88E-03 6.60E-10 3.09E-08 1.21E-03 40.75 1.76E-18 5.26E-44 1.69E+01 5.94E+00 41.8125 0.01921396 3.43E-34 At p=0.05, the kegg_neuroactive_ligand_ receptor_interaction pathway s nannot p-value changes from 0.0573091 to 0.01921396 when nrand is increased from 500 to 1000 In this data, there is one significance threshold crossing at p=0.05 Do this analysis for all 185 KEGG pathways 3 p-values produced for each pathway: nsamp, nannot, nvar (555 p-values total) nrand values covered: 500, 1000, 1500, 2000, 4000, 6000, 8000, 10,000 43

Significance Threshold Crossings at p=0.05 12 11 10 10 10 10 10 Number of Significance Threshold Crossings 8 8 7 6 6 nsamp 6 nannot nvar 4 2 0 0 0 0 0 0 0 0 0 0 0 0 0 500 --> 1000 1000 --> 1500 1500 --> 2000 2000 --> 4000 4000 --> 6000 6000 --> 8000 8000 --> 10000 nrand change nrand = 2000 is the sweet spot 44

Significance Threshold Crossings at p=0.01 16 15 14 13 12 12 Number of Significance Threshold Crossings 10 10 10 10 9 nsamp 8 nannot nvar 6 5 4 4 2 0 0 0 0 0 0 0 0 0 0 0 0 0 500 --> 1000 1000 --> 1500 1500 --> 2000 2000 --> 4000 4000 --> 6000 6000 --> 8000 8000 --> 10000 nrand change nrand = 2000 is the sweet spot 45

Outline LARVA Methods Development Whole genome somatic mutation null model Computational Efficiency Improvements p-value stability tests LARVA Results Variant Data Annotation Data Results 46

Variant Data Collection Obtaining cancer genome data means collecting from cancer patients Access to any data that is personally identifiable, or could be linked to personally identifiable data, is protected Applications declaring our research use for such data With annual renewal Guarantee the physical and network security of the data Allow access only to specifically named individuals 47

Variant Data Collection TCGA The Cancer Genome Atlas NCI and NHGRI initiative Processed data is public Exome variants for colon, ovarian, and rectal cancer More raw forms of the data is protected accessed later upon Data Access Committee approval BAM files representing WGS, WXS, and RNA-Seq sequencing runs, among others Focused now on obtaining prostate cancer data, but the door is open for analyzing many other types of cancer Accessible via CGHub (cgquery + GeneTorrent) MG needs to generate a new keyfile for downloading every year 48

Variant Data Collection ICGC International Cancer Genome Consortium Studying 21 cancers in 18 countries Other Prostate Cancer sources Berger et al. 7 WGS samples Baca et al. 57 WGS samples Weischenfeldt et al. 11 exome samples Grasso et al. 61 exome samples Barbieri et al. 112 exome samples Broad Institute 20 deep sequenced (~30x) WGS tumor/normal pairs Recently acquired BAMs and produced our own variant calls with GATK pipeline Other Cancers Glial tumor, breast cancer, medulloblastoma, melanoma (exome), lung cancer (exome) 49

Outline LARVA Methods Development Whole genome somatic mutation null model Computational Efficiency Improvements p-value stability tests LARVA Results Variant Data Annotation Data Results 50

Genomic Mutation Analysis and Null Model Development for Cancer Research

Download Presentation

Presentation Transcript

Related

More Related Content