Computational Techniques for Regulatory DNA Motif Identification


This presentation by Cankun Wang covers the development of computational techniques for identifying regulatory DNA motifs. It introduces WTSA, a novel DNA motif identification program for ChIP-exo data, and its applications to large biological datasets. Topics include transcription regulation and the role of transcription factor binding sites (TFBSs), DNA motif representation, and why motif identification is difficult given the vast number of sequence combinations.

  • Computational Techniques
  • DNA Motif Identification
  • Regulatory DNA
  • Transcription Regulation
  • Cankun Wang


Presentation Transcript


  1. Development of Computational Techniques for Identification of Regulatory DNA Motif. Cankun Wang, Department of Agronomy, Horticulture & Plant Science, 4/4/2025

  2. Outline: Introduction; WTSA, a novel DNA motif identification program for ChIP-exo data; Applications of DNA motif identification on big biological data

  3. Outline: 1. Introduction; WTSA, a novel DNA motif identification program for ChIP-exo data; Applications of DNA motif identification on big biological data

  4. Transcription regulation. Transcription factor binding sites (TFBSs) are key to the mediation of transcriptional regulation. Information on experimentally validated functional TFBSs is limited.

  5. DNA Binding Site Motif (DNA Motif). DNA motifs are short, recurring patterns that are presumed to have a regulatory function. They are usually 8 to 20 base pairs (bp) long.

  6. Representation of DNA Motif. Example: Regulation by OXygen 1 (ROX1) binding sites and sequence motif. Binding sites of a given transcription factor do not have to have the exact same sequence in every case. Representations shown: consensus sequences (with a nucleotide codes table); 1. counts of nucleotides at each position; 2. information content, adjusted to background; PWM (position weight matrix); visual representation and motif logo. Ref: D'haeseleer, What are DNA sequence motifs? Nature Biotechnology, 2006
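The representations listed on this slide can be illustrated with a minimal Python sketch. It is not taken from the presentation: the function names, the four example sites, and the uniform background of 0.25 per base are assumptions made for illustration only.

```python
import math
from collections import Counter

BASES = "ACGT"

def count_matrix(sites):
    """Count each nucleotide at every position of equal-length binding sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(site[i] for site in sites) for i in range(length)]

def consensus(counts):
    """Most frequent base per position (ties resolved in A, C, G, T order)."""
    return "".join(max(BASES, key=lambda b: col[b]) for col in counts)

def information_content(counts, background=None):
    """Per-position relative entropy in bits against a background distribution."""
    background = background or {b: 0.25 for b in BASES}  # assumed uniform
    bits_per_position = []
    for col in counts:
        total = sum(col[b] for b in BASES)
        bits = 0.0
        for b in BASES:
            freq = col[b] / total
            if freq > 0:
                bits += freq * math.log2(freq / background[b])
        bits_per_position.append(bits)
    return bits_per_position

# Hypothetical binding sites, not the ROX1 sites from the slide.
sites = ["ACGT", "ACGA", "ACGT", "TCGT"]
counts = count_matrix(sites)
print(consensus(counts))
print(information_content(counts))
```

A fully conserved position reaches 2 bits under a uniform background, which is the maximum height of a column in a motif logo.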

  7. Identification of DNA Motifs is difficult. Real DNA promoter sequence: TGTGAAAGACTGTTTTTTTGATCGTTTTGACAAAAATGGAAGTCCACA AAGTCCACATTGATTATTTGCACGGCGTCACACTTTGCTATCCCATAG TGATGTACTGCATGTATGCAAAGGACGTCAGATTACCGTGCAGTACAG TAAACGATTCCACTAATTTATTCCATGTCACTCTTTTCGCATCTTTGT ACATTACCGCCAATTCTGTAACAGAGATCACACAAAGCGACGGTGGGG ACTTTTTTTTCATATGCCTGACGGAGTTGACACTTGTAAGTTTTCAAC. If the sequence length = 200 and the number of sequences = 30, there are ~10^65 combinations. At 10,000 combinations/sec, checking them all would take > 10^44 centuries. Too many combinations to consider!
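The growth behind this argument can be sketched in a few lines of Python. The counting model is an assumption, not the slide's: one motif start position is chosen per sequence and the motif width is fixed at 10. The exact count depends on those choices, so the sketch illustrates the order-of-magnitude blow-up rather than reproducing the slide's figures.

```python
SECONDS_PER_CENTURY = 100 * 365.25 * 24 * 3600

def alignment_count(seq_length, n_sequences, motif_width):
    """Number of ways to pick one motif start position in each sequence."""
    positions = seq_length - motif_width + 1
    return positions ** n_sequences

def centuries_needed(combinations, rate_per_second):
    """Time to enumerate all combinations at a fixed checking rate."""
    return combinations / rate_per_second / SECONDS_PER_CENTURY

# Motif width 10 is an assumed value for illustration.
total = alignment_count(seq_length=200, n_sequences=30, motif_width=10)
print(f"{total:.2e} candidate alignments")
print(f"{centuries_needed(total, 10_000):.2e} centuries at 10,000 per second")
```

Whatever the precise model, the count grows exponentially with the number of sequences, which is why exhaustive enumeration is not an option.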

  8. How have we identified DNA motifs in the past? Timeline, 2001 to 2019 (years marked: 2001, 2004, 2007, 2009, 2010, 2011, 2014, 2017, 2019). Tools: BioProspector, MEME, BoBro 1.0, BoBro 2.0, DMINDA, DMINDA 2, HOMER, rGADEM, ChIPMunk, DESSO, WTSA. Protocols: ChIP-chip, ChIP-seq, ChIP-exo. In recent years, chromatin immunoprecipitation (ChIP) technologies have provided an unprecedented opportunity to discover DNA motifs.

  9. In-vitro motif locating from ChIP-Seq. ChIP-sequencing (ChIP-seq) is one of the most popular methods used to analyze protein interactions with DNA. DNA-bound protein fragments are mapped to the genome to generate peaks. These peak regions serve as potential binding sites for motif identification.

  10. ChIP-exo: a modification of ChIP-Seq. The ChIP-exo method adds a unique exonuclease (exo) digestion step.

  11. The ChIP-exo method has the best resolution. It obtains near base-pair resolution.

  12. Objective: Using ChIP-exo data to improve DNA motif prediction accuracy. Figure: Distribution of weight scores for each nucleotide (panels: ChIP-seq peak and ChIP-exo peak; y-axis: score; x-axis: nucleotide position, 1 to 96; example sequences: CTGTTTTTTTGATTCGTCGACAAAAATGGAA and ATTGATTATTTGATCGTTCGTCATACTTTGCT). ChIP-exo is more closely related to the actual DNA motif length. The ChIP-exo peak score can be used to enhance the DNA nucleotide signal. The method is called the Weighted Two-Stage Alignment tool (WTSA).

  13. Outline: Introduction; 2. WTSA, a novel DNA motif identification program for ChIP-exo data; Applications of DNA motif identification on big biological data

  14. WTSA workflow (figure labels): matrix approximation & graph construction; data pre-processing; weighted two-stage alignment; optimization; evaluation; expansion

  15. [Figure slide, no text]

  16. WTSA format. Similar to the FASTA format, the wtsa format begins with a single-line description, followed by a line of sequence data; the third line holds the weighted scores extracted with bedtools. Example:
      >NC_000913.0:3073606-3074021
      CCGCCCGGCGTCCGGATTCATACAAAGCACGAACCACATTAC
      0,0,0,0,0,0,20.0,16.0,24.0,8.0,20.0,0,0,0,8.0,41.0,0,20.0,0,0,0
      >NC_000913.0:2565089-2565533
      GGCAACAACGCAGGGTTACAGCAGAAGATCACTGTGTTGGATA
      4.0,8.0,4.0,0,0,0,0,0,0,0,0,0,0,0,8.0,0,49.0,0,0,0,16.0,8.0,0,0
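A minimal Python sketch of a reader for this three-line layout follows. It is an illustration based only on the description on this slide, not the parser shipped with WTSA; the function name `parse_wtsa` and the record keys are hypothetical. It does not check that the number of weights equals the sequence length, because the example on the slide does not show one weight per base.

```python
def parse_wtsa(text):
    """Parse records made of a '>' header, a sequence line and a weights line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("expected three lines per record")
    records = []
    for i in range(0, len(lines), 3):
        header, sequence, weights = lines[i:i + 3]
        if not header.startswith(">"):
            raise ValueError(f"record header must start with '>': {header!r}")
        records.append({
            "id": header[1:],
            "sequence": sequence.upper(),
            "weights": [float(w) for w in weights.split(",")],
        })
    return records

# First record from the slide.
example = """>NC_000913.0:3073606-3074021
CCGCCCGGCGTCCGGATTCATACAAAGCACGAACCACATTAC
0,0,0,0,0,0,20.0,16.0,24.0,8.0,20.0,0,0,0,8.0,41.0,0,20.0,0,0,0
"""
for record in parse_wtsa(example):
    print(record["id"], len(record["sequence"]), max(record["weights"]))
```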

  17. WTSA format. Calculate the match score based on a binomial distribution model. For two length-L segments s_i and s_j with k identical positions, the match score of (s_i, s_j) is computed as lg(·) of a binomial term B(·) in (k, L, p), where B(·) is the binomial distribution and p = 0.25. The per-nucleotide weight scores from the ChIP-exo data are then integrated into this score.
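One common way to turn a binomial model into a match score is sketched below in Python. The specific form is an assumption: the score is taken as the negative base-10 logarithm of the probability of seeing at least k identical positions by chance, with p = 0.25 per position. The slide's exact expression and its ChIP-exo weight term are not reproduced here, and the function names are hypothetical.

```python
import math

def binomial_tail(k, length, p=0.25):
    """Probability of at least k matches among `length` independent positions."""
    return sum(
        math.comb(length, i) * p**i * (1 - p) ** (length - i)
        for i in range(k, length + 1)
    )

def match_score(seg_a, seg_b, p=0.25):
    """Assumed score: -log10 of the chance of k or more identical positions."""
    if len(seg_a) != len(seg_b):
        raise ValueError("segments must have the same length L")
    k = sum(a == b for a, b in zip(seg_a, seg_b))
    return -math.log10(binomial_tail(k, len(seg_a), p))

# Hypothetical segments: more identical positions give a higher score.
print(match_score("ACGTACGTAC", "ACGTACGTAC"))
print(match_score("ACGTACGTAC", "ACGTTTTTAC"))
```

Under this assumption, a pair of segments that agree at many positions is unlikely to arise by chance and therefore receives a large score.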

  18. Build approximation matrix. Construct a weighted graph G and find cliques. Mark both elements among the top t scores. Vertices represent nucleotide start positions across all the L-segment sequence alignments. Edges connect every pair of motif start positions between two promoters with the largest scores. The edge weight is the score based on the similarity of the two sequences and the ChIP-exo weight scores.

  19. Motif expansion. The P-value is very close to a Poisson distribution. Approximate the P-value by simply summing up p(x) in its motif closures.

  20. Motif length adjustment. From the previous motif expansion, we have a temporary result with the default motif length = 10 (figure: Motif 1 and Motif 2 with their instances, Instance 1 to Instance 7). Iterate over all motif instances, extend the motif, and calculate the overlap frequency for each length. Length: frequency = 10: 1, 13: 4, 14: 2. We set 13 as the optimized motif length and use it to perform motif expansion again.
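The selection step on this slide (pick the most frequent extended length) can be sketched in Python. This covers only the final choice, not how WTSA extends instances; the function name is hypothetical, and breaking ties toward the shorter length is an assumption.

```python
from collections import Counter

def optimized_motif_length(extended_lengths, default_length=10):
    """Return the most frequent extended length, or the default if none."""
    if not extended_lengths:
        return default_length
    frequency = Counter(extended_lengths)
    best = max(frequency.values())
    # Assumed tie rule: prefer the shortest of the most frequent lengths.
    return min(length for length, n in frequency.items() if n == best)

# Frequencies from the slide: length 10 once, 13 four times, 14 twice.
lengths = [10] + [13] * 4 + [14] * 2
print(optimized_motif_length(lengths))
```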

  21. Motif optimization. An optimized motif is obtained; overlapping motif instances are removed.

  22. Experiment datasets. Three datasets from publications in Escherichia coli using the ChIP-exo method:
      Transcription factor | BioProject ID | Publish date | # of bases | # of identified sites
      Fur                  | PRJNA238003   | 2014.9       | 193.4M     | 556
      Cra                  | PRJNA274571   | 2018.4       | 88.6M      | 387
      ArgR                 | PRJNA258521   | 2015.3       | 768.7M     | 462
      We downloaded data for the Fur, Cra, ArgR, GadE, GadW, GadY, OxyR, UvrY, SoxR and SoxS TFs (10 datasets). Three were used for motif evaluation; the others had limited annotation and thus could not be used for evaluation.

  23. WTSA performance evaluation against published tools. List of tools: BioProspector, 2001, cited by 959; BoBro, 2011, cited by 24; MEME-ChIP, 2009, cited by 3363; HOMER, 2010, cited by 3150; ChIPMunk, 2014, cited by 23; rGADEM, 2011, cited by 42; WTSA (with ChIP-exo weight data). An evaluation using ENCODE ChIP-seq data showed that rGADEM was the best-performing tool (Evaluating tools for transcription factor binding site prediction, BMC Bioinformatics, 2016). Comparison of tools using evaluation method xx, xy, yy

  24. Evaluation method 1: profile level. Submit the motif result to the TOMTOM suite, which compares DNA motif results against a database of known motifs. Save the E-value and Q-value for comparison. Figure: TOMTOM query result example. Ref: Discovering Gap Patterns from Protein Sequences through Pattern-Directed Aligned Pattern Clustering, IEEE, 2018

  25. Performance comparison of results on the TOMTOM profile level. WTSA provides stable prediction results on the -log2(E-value) and -log2(Q-value) metrics. WTSA outperforms all other methods on the Fur and ArgR TFs; MEME-ChIP performed slightly better than WTSA on the Cra TF.

  26. Evaluation method 2: TFBS level. A predicted site is matched to a true site when they overlap by >25%. Figure labels: True, Predicted; True negative (TN), False positive (FP), True positive (TP). Sensitivity (Sn) = TP / (TP + FN). Positive prediction value (PPV) = TP / (TP + FP). Average site performance (ASP) = (Sn + PPV) / 2. F-score = 2 * Sn * PPV / (Sn + PPV). Ref: M Tompa, et al. Assessing computational tools for the discovery of transcription factor binding sites, Nature Biotech, 2005
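The site-level statistics can be sketched in Python as follows. The counting convention is an assumption in the spirit of the cited Tompa et al. assessment: a known site is a true positive if some prediction covers more than 25% of its length, unmatched known sites are false negatives, and predictions matching no known site are false positives. Function names and the example intervals are hypothetical.

```python
def overlap_fraction(pred, true):
    """Fraction of the true site (start, end; end exclusive) covered by pred."""
    shared = min(pred[1], true[1]) - max(pred[0], true[0])
    return max(shared, 0) / (true[1] - true[0])

def _ratio(numerator, denominator):
    return numerator / denominator if denominator else 0.0

def site_level_metrics(predicted, known, min_overlap=0.25):
    """Sn, PPV, ASP and F-score from site-level TP/FP/FN counts."""
    hit_known = set()
    fp = 0
    for pred in predicted:
        hits = {i for i, true in enumerate(known)
                if overlap_fraction(pred, true) > min_overlap}
        if hits:
            hit_known |= hits
        else:
            fp += 1  # prediction overlaps no known site
    tp = len(hit_known)
    fn = len(known) - tp
    sn = _ratio(tp, tp + fn)
    ppv = _ratio(tp, tp + fp)
    return {"TP": tp, "FP": fp, "FN": fn, "Sn": sn, "PPV": ppv,
            "ASP": (sn + ppv) / 2, "F": _ratio(2 * sn * ppv, sn + ppv)}

# Hypothetical coordinates: one hit, one missed site, one spurious prediction.
known_sites = [(10, 20), (50, 60)]
predicted_sites = [(12, 22), (100, 110)]
print(site_level_metrics(predicted_sites, known_sites))
```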

  27. Performance comparison of results on the TFBS level. WTSA achieved stable, high motif prediction performance in the TFBS-level F-score comparisons. The rGADEM program has the best TFBS-level F-score on the Cra TF data, while WTSA has the best positive prediction value on the Cra TF data.

  28. Evaluation method 3: Nucleotide level. Figure labels: True, Predicted; False positive (FP), True positive (TP), False negative (FN). Sensitivity (Sn) = TP / (TP + FN). Positive prediction value (PPV) = TP / (TP + FP). Performance coefficient (PC) = TP / (TP + FN + FP). F-score = 2 * Sn * PPV / (Sn + PPV). Ref: M Tompa, et al. Assessing computational tools for the discovery of transcription factor binding sites, Nature Biotech, 2005
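At the nucleotide level the same statistics reduce to set operations on genome positions, as in this minimal Python sketch. Representing sites as sets of integer positions is an assumption made for illustration, and the function name is hypothetical.

```python
def nucleotide_level_metrics(predicted_positions, true_positions):
    """Sn, PPV, PC and F-score from per-nucleotide TP/FP/FN counts."""
    predicted, true = set(predicted_positions), set(true_positions)
    tp = len(predicted & true)   # nucleotides in both
    fp = len(predicted - true)   # predicted only
    fn = len(true - predicted)   # annotated only

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    sn = ratio(tp, tp + fn)
    ppv = ratio(tp, tp + fp)
    return {"Sn": sn, "PPV": ppv,
            "PC": ratio(tp, tp + fn + fp),
            "F": ratio(2 * sn * ppv, sn + ppv)}

# Hypothetical 10 bp true site and a prediction shifted by 5 bp.
print(nucleotide_level_metrics(range(15, 25), range(10, 20)))
```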

  29. Performance comparison of results on the nucleotide level. WTSA achieved the highest F-scores, sensitivity and positive prediction value on all three datasets.

  30. Unique features of WTSA. A workflow to perform DNA motif analysis directly from ChIP-exo data. A binomial distribution model and nucleotide-level weight scores from ChIP-exo for optimizing motif instances. Automatic optimization of the resulting DNA motif length. Improved performance in comparison with popular existing tools.

  31. Outline: Introduction; WTSA, a novel DNA motif identification program for ChIP-exo data; 3. Applications of DNA motif identification on big biological data

  32. IRIS3: Integrated Cell-type-specific Regulon Inference Server from Single-cell RNA-Seq. Understanding how gene expression programs are controlled requires identifying regulatory relationships between transcription factors (TFs) and target genes. A cell-type-specific regulon (CTS-R) is a group of genes co-regulated by the same transcription regulator in a specific cell type. To identify CTS-Rs, we developed the Integrated Cell-type-specific Regulon Inference Server from Single-cell RNA-Seq (IRIS3).

  33. IRIS3 overall pipeline, which integrates WTSA

  34. IRIS3 web server. Screenshot of an example cell-type-specific regulon: gene list including marker genes; DNA motif details; a list of toolkit buttons for further analysis

  35. IRIS3 web server functions. Screenshot of the heatmap and t-SNE plot from the previous regulon example

  36. DESSO: Prediction of Regulatory Motifs from Human ChIP-Sequencing Data using a Deep Learning Framework. Established web server interface and test data

  37. Tools developed from BMBL. BoBro: command-line toolkit. DMINDA: web server for motif identification, motif scanning, motif comparison, analyzing co-occurring motifs, and regulon prediction; prokaryotic & eukaryotic database. WTSA: DNA motif identification on ChIP-exo data. DESSO: motif prediction using deep learning.

  38. Publication
      • Xia, Ye, Seth DeBolt, Qin Ma, Adam McDermaid, Cankun Wang, Nicole Shapiro, Tanja Woyke, and Nikos C. Kyrpides. Improved Draft Genome Sequence of Bacillus Sp. Strain YF23, Which Has Plant Growth-Promoting Activity. Edited by David Rasko. Microbiology Resource Announcements 8, no. 15 (April 11, 2019). https://doi.org/10.1128/MRA.00099-19.
      • Xia, Ye, Seth DeBolt, Qin Ma, Adam McDermaid, Cankun Wang, Nicole Shapiro, Tanja Woyke, and Nikos C. Kyrpides. Improved Draft Genome Sequence of Pseudomonas PoaeA2-S9, a Strain with Plant Growth-Promoting Activity. Edited by Irene L. G. Newton. Microbiology Resource Announcements 8, no. 15 (April 11, 2019). https://doi.org/10.1128/MRA.00275-19.
      • Monier, Brandon, Adam McDermaid, Cankun Wang, Jing Zhao, Allison Miller, Anne Fennell, and Qin Ma. IRIS-EDA: An Integrated RNA-Seq Interpretation System for Gene Expression Data Analysis. PLOS Computational Biology 15, no. 2 (February 14, 2019): e1006792. https://doi.org/10.1371/journal.pcbi.1006792.
      • Wang, Yan, Sen Yang, Jing Zhao, Wei Du, Yanchun Liang, Cankun Wang, Fengfeng Zhou, Yuan Tian, and Qin Ma. Using Machine Learning to Measure Relatedness Between Genes: A Multi-Features Model. Scientific Reports 9, no. 1 (December 2018). https://doi.org/10.1038/s41598-019-40780-7.
      • Han, Siyu, Yanchun Liang, Qin Ma, Yangyi Xu, Yu Zhang, Wei Du, Cankun Wang, and Ying Li. LncFinder: An Integrated Platform for Long Non-Coding RNA Identification Utilizing Sequence Intrinsic Composition, Structural Information and Physicochemical Property. Briefings in Bioinformatics. Accessed November 24, 2018. https://doi.org/10.1093/bib/bby065.
      • McDermaid, Adam, Xin Chen, Yiran Zhang, Cankun Wang, Shaopeng Gu, Juan Xie, and Qin Ma. A New Machine Learning-Based Framework for Mapping Uncertainty Analysis in RNA-Seq Read Alignment and Gene Expression Estimation. Frontiers in Genetics 9 (2018). https://doi.org/10.3389/fgene.2018.00313.

  39. Submission plan
      • WTSA: a weighted two-stage alignment tool for DNA motif identification on ChIP-exo data. (Bioinformatics or Nucleic Acids Research) Cankun Wang (first author)
      • IRIS3: Integrated cell-type-specific regulon inference server from single-cell RNA-Seq. (Genome Biology) Anjun Ma*, Cankun Wang* (co-first authors), Adam McDermaid, Bingqiang Liu, and Qin Ma
      • Prediction of Regulatory Motifs from Human ChIP-Sequencing Data using a Deep Learning Framework. (Nucleic Acids Research) Jinyu Yang, Anjun Ma, Adam D. Hoppe, Cankun Wang, Yang Li, Yan Wang, Bingqiang Liu, and Qin Ma

  40. Acknowledgment. Committee: Dr. Anne Fennell, Dr. Qin Ma, Dr. Trevor Roiger. BMBL members: Anjun Ma, Juan Xie, Yuzhou Chang, Adam McDermaid, Jinyu Yang, Shaopeng Gu, Zhaoqian Liu, Jing Jiang, Junyi Chen, Weiliang Liu, Zichun Zhang, Minxuan Sun, Jennifer Xu

  41. Thanks. Cankun Wang, 4/4/2025
