Machine Learning Approach for Hierarchical Classification of Transposable Elements


This study presents a machine learning approach for the hierarchical classification of transposable elements (TEs) based on pre-annotated DNA sequences. The research includes data collection, feature extraction using k-mers, and classification approaches. Proper categorization of TEs is crucial for understanding their impact on genetic evolution. Various machine learning methods are applied to predict hierarchical categories of TEs, offering insights into their functional roles in genomes.

  • Machine Learning
  • Transposable Elements
  • DNA Sequences
  • Classification
  • Genetic Evolution

Uploaded on Dec 15, 2024 | 0 Views



Presentation Transcript


  1. Prediction of Hierarchical Classification of Transposable Elements using Machine Learning Approach. Avdesh Mishra, Manisha Panta, Md Tamjidul Hoque, Joel Atallah. Computer Science and Biological Sciences Department, University of New Orleans.

  2. Presentation Overview: Introduction; Data Collection; Feature Extraction; Hierarchical Classification Approaches; Machine Learning Methods for the Prediction of Hierarchical Categories; Results; Conclusion. (12/15/2024)

  3. Transposable Elements
     • Transposable elements (TEs), or jumping genes, are DNA sequences with an intrinsic capability to move within a host genome from one genomic location to another.
     • The new location can be on the same chromosome or on a different one.
     • TEs were first discovered by Barbara McClintock (a.k.a. the maize scientist) in 1948.
     • TEs play an important role in modifying the functionality of genes; e.g. insertion of L1-type TEs into tumor suppressor genes could lead to cancer.
     • Hence, proper classification of the TEs identified in a genome is important for understanding their particular roles in germline and somatic evolution.

  4. Illustration of TE Taxonomy Proposed by Wicker et al.
     • Class I (Retrotransposons)
         LTR: Copia, Gypsy, Bel-Pao, Retrovirus, ERV
         DIRS: DIRS, Ngaro, VIPER
         PLE: Penelope
         LINE: R2, RTE, Jockey, L1, I
         SINE: tRNA, 7SL, 5S
     • Class II (DNA Transposons)
         Subclass 1
           TIR: Tc1-Mariner, hAT, Mutator, Merlin, Transib, P, PiggyBac, PIF-Harbinger, CACTA
           Crypton: Crypton
         Subclass 2
           Helitron: Helitron
           Maverick: Maverick-Polinton
     • Node codes shown on the slide: 1, 1.1, 1.1.2.

  5. Data Collection
     • For our study, we collected pre-annotated DNA sequences of TEs; the hierarchical annotations follow Wicker's taxonomy.
     • The repetitive DNA sequences were obtained from two public repositories: Repbase and PGSB.
     • Repbase contains TEs from different eukaryotic species.
     • PGSB is a compilation of plant repetitive sequences from different databases: TREP, TIGR repeats, PlantSat and GenBank.
     • Number of FASTA sequences: PGSB 18680; Repbase 34561.

  6. Feature Extraction
     • Each TE in a dataset is represented by a set of k-mer features, obtained by counting the frequency of each substring of length k.
     • E.g. for k=2, the counts of all 16 dinucleotides (AA, AT, AG, AC, ..., CC) in the sequence are extracted.
     • Example sequence: CCGCAAAAGTTGTC
         For k=2: AA = 2, CC = 2, TT = 2
         For k=3: CCG = 1, CAA = 1, AAG = 1
         For k=4: CCGC = 1, AAAA = 1, GTTG = 1
     • For each TE, k-mers of sizes k = 2, 3 and 4 were used as features.
     • Feature values were standardized to mean = 0 and standard deviation = 1.
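The feature extraction described on this slide could be sketched roughly as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, k-mers are counted with an overlapping sliding window (an assumption), and k-mers containing letters other than A, C, G and T are simply ignored.

```python
from collections import Counter
from itertools import product
from statistics import mean, pstdev

ALPHABET = "ACGT"


def kmer_counts(sequence, k):
    """Count every overlapping substring of length k in the sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_features(sequence, ks=(2, 3, 4)):
    """Build a fixed-length count vector over all possible k-mers for each k."""
    features = []
    for k in ks:
        counts = kmer_counts(sequence, k)
        # Counter returns 0 for k-mers that never occur in the sequence.
        features.extend(counts["".join(p)] for p in product(ALPHABET, repeat=k))
    return features


def standardize(rows):
    """Column-wise z-score: each feature gets mean 0 and standard deviation 1."""
    scaled_columns = []
    for column in zip(*rows):
        mu, sigma = mean(column), pstdev(column)
        # Assumption: a constant column is mapped to zeros to avoid dividing by 0.
        scaled_columns.append([(v - mu) / sigma if sigma else 0.0 for v in column])
    return [list(row) for row in zip(*scaled_columns)]


example = "CCGCAAAAGTTGTC"
vector = kmer_features(example)  # 16 + 64 + 256 = 336 features
```

Under the overlapping-window assumption the sketch reproduces the k=3 and k=4 counts shown on the slide (for instance CCG = 1 and AAAA = 1); counts for k=2 depend on whether overlapping occurrences are counted.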

  7. Hierarchical Classification Approaches
     • Classification of TEs can be treated as a hierarchical classification problem.
     • The class hierarchy can be represented by a directed acyclic graph or a tree.
     • Hierarchical classification of TEs is performed with top-down strategies.
     • Two recent top-down strategies for the hierarchical classification of TEs are:
         non-Leaf Local Classifier per Parent Node (nLLCPN)
         Local Classifier per Parent Node and Branch (LCPNB)

  8. non-Leaf Local Classifier per Parent Node Approach
     • In nLLCPN, a multi-class classifier is implemented at each non-leaf node of the graph.
     • Example hierarchy: Root → 1, 2; 1 → 1.1, 1.4, 1.5; 1.1 → 1.1.1, 1.1.2; 2 → 2.1; 2.1 → 2.1.1; 2.1.1 → 2.1.1.1, 2.1.1.2, 2.1.1.5, 2.1.1.8.
     • Example for the sequence CCGCAAAAGTTGTC:
         At Root, it is classified as either 1 or 2.
         At node 2, it is classified as either itself (2) or 2.1.
         At node 2.1, it is classified as either itself (2.1) or 2.1.1.
         At node 2.1.1, it is classified as 2.1.1.2.
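The top-down walk in the nLLCPN example can be sketched as below. This is an illustrative sketch under stated assumptions: `classify_top_down`, `CHILDREN` and `STUB_CLASSIFIERS` are hypothetical names, the parent/child links are inferred from the dotted node codes, and the per-node classifiers are stubs that return the choices from the slide's example rather than trained models.

```python
def classify_top_down(sequence, children, classifiers, root="Root"):
    """Walk down from the root; at each non-leaf node the local classifier
    picks one child or returns the node itself, which stops the walk."""
    node = root
    path = []
    while children.get(node):  # a node without children is a leaf
        choice = classifiers[node](sequence)
        if choice == node:  # the classifier kept the instance at this node
            break
        if choice not in children[node]:
            raise ValueError(f"{choice!r} is not a child of {node!r}")
        path.append(choice)
        node = choice
    return path


# Hierarchy inferred from the node codes on the slide (assumption).
CHILDREN = {
    "Root": ["1", "2"],
    "1": ["1.1", "1.4", "1.5"],
    "1.1": ["1.1.1", "1.1.2"],
    "2": ["2.1"],
    "2.1": ["2.1.1"],
    "2.1.1": ["2.1.1.1", "2.1.1.2", "2.1.1.5", "2.1.1.8"],
}

# Stub classifiers standing in for trained multi-class models (hypothetical).
STUB_CLASSIFIERS = {
    "Root": lambda seq: "2",
    "2": lambda seq: "2.1",
    "2.1": lambda seq: "2.1.1",
    "2.1.1": lambda seq: "2.1.1.2",
}

predicted_path = classify_top_down("CCGCAAAAGTTGTC", CHILDREN, STUB_CLASSIFIERS)
```

In a real setting each entry of the classifier mapping would wrap a model trained only on the instances that reach that node.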

  9. Local Classifier per Parent Node and Branch Approach
     • In LCPNB, a multi-class classifier is implemented at each non-leaf node of the graph, and prediction probabilities are obtained for all the classes.
     • The diagram annotates each branch of the example hierarchy with its predicted probability.
     • The path leading to the final classification: 2 (0.6) → 2.1 (1) → 2.1.1 (0.8) → 2.1.1.1 (0.4).
     • Average = (0.6 + 1 + 0.8 + 0.4) / 4 = 0.7
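The averaging step in the LCPNB example can be illustrated with a short sketch. Several things here are assumptions rather than details from the slides: the final class is taken to be the root-to-leaf path with the highest mean probability, the function names are hypothetical, and only the probabilities along the path 2, 2.1, 2.1.1, 2.1.1.1 come from the slide (the remaining values are made up for illustration).

```python
def leaf_paths(children, node="Root", prefix=()):
    """Yield every path from below the root down to a leaf, as a tuple of nodes."""
    kids = children.get(node, [])
    if not kids:
        yield prefix
        return
    for kid in kids:
        yield from leaf_paths(children, kid, prefix + (kid,))


def best_path_by_average(children, probability, root="Root"):
    """Return (path, average) for the root-to-leaf path with the highest mean."""
    scored = [
        (path, sum(probability[node] for node in path) / len(path))
        for path in leaf_paths(children, root)
        if path
    ]
    return max(scored, key=lambda item: item[1])


CHILDREN = {
    "Root": ["1", "2"],
    "1": ["1.1", "1.4", "1.5"],
    "1.1": ["1.1.1", "1.1.2"],
    "2": ["2.1"],
    "2.1": ["2.1.1"],
    "2.1.1": ["2.1.1.1", "2.1.1.2", "2.1.1.5", "2.1.1.8"],
}

# Probability predicted for each node by its parent's classifier.
PROBABILITY = {
    "2": 0.6, "2.1": 1.0, "2.1.1": 0.8, "2.1.1.1": 0.4,   # path from the slide
    "1": 0.4, "1.1": 0.6, "1.4": 0.2, "1.5": 0.2,          # hypothetical values
    "1.1.1": 0.8, "1.1.2": 0.2,                             # hypothetical values
    "2.1.1.2": 0.2, "2.1.1.5": 0.2, "2.1.1.8": 0.2,         # hypothetical values
}

best_path, best_average = best_path_by_average(CHILDREN, PROBABILITY)
```

With these numbers the selected path is 2, 2.1, 2.1.1, 2.1.1.1, and its average matches the 0.7 worked out on the slide up to floating-point rounding.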

  10. Machine Learning Methods for the Prediction of Hierarchical Categories
     • We applied several machine learning methods at each non-leaf node of the directed acyclic graph:
         Artificial Neural Network (ANN)
         ExtraTree Classifier (ET)
         Gradient Boosting Classifier (GBC)
         Logistic Regression (LogReg)
         Random Forest (RF)
         Support Vector Machines (SVM)

  11. Machine Learning Methods for the Prediction of Hierarchical Categories
     • The state-of-the-art method implements an ANN with a single hidden layer of 200 nodes as a multi-class classifier.
     • In this study, we instead propose an SVM-based multi-class classifier.
     • We implemented the SVM with an RBF kernel and optimized the cost and gamma parameters using a grid search.
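A grid search over the cost and gamma parameters of an RBF-kernel SVM might look like the following sketch with scikit-learn. The slides do not name a library or give the grid, so the use of scikit-learn, the candidate values, the macro-F1 scoring and the synthetic data standing in for one node's k-mer features are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the k-mer feature matrix at one non-leaf node.
X, y = make_classification(
    n_samples=150, n_features=20, n_informative=6, n_classes=3, random_state=0
)

# Standardize inside the pipeline so each CV fold is scaled on its own training part.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Hypothetical grid; the values searched in the study are not given on the slides.
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

# 3-fold cross-validation, matching the evaluation strategy named in the slides.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1_macro")
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
```

If class probabilities are needed, as in LCPNB, `SVC(probability=True)` enables `predict_proba` at the cost of extra training time.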

  12. Performance Measures
     • Hierarchical Precision: $hP = \frac{\sum_i |C_i \cap Z_i|}{\sum_i |Z_i|}$
     • Hierarchical Recall: $hR = \frac{\sum_i |C_i \cap Z_i|}{\sum_i |C_i|}$
     • Hierarchical F-measure: $hF = \frac{2 \cdot hP \cdot hR}{hP + hR}$
     • Here, $C_i$ and $Z_i$ represent the sets of true and predicted classes for an instance $i$, respectively.
     • The performance of each classifier is evaluated using 3-fold cross-validation.
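The hierarchical measures can be computed directly from per-instance class sets, as in this small sketch. The function names are hypothetical, and expanding each dotted label into itself plus all of its ancestors is an assumption about how the sets of true and predicted classes are built.

```python
def with_ancestors(label):
    """Expand a dotted code such as '2.1.1' into {'2', '2.1', '2.1.1'}."""
    parts = label.split(".")
    return {".".join(parts[:i]) for i in range(1, len(parts) + 1)}


def hierarchical_scores(true_sets, predicted_sets):
    """Return (hP, hR, hF) from per-instance true sets C_i and predicted sets Z_i."""
    overlap = sum(len(c & z) for c, z in zip(true_sets, predicted_sets))
    predicted_total = sum(len(z) for z in predicted_sets)
    true_total = sum(len(c) for c in true_sets)
    hp = overlap / predicted_total if predicted_total else 0.0
    hr = overlap / true_total if true_total else 0.0
    hf = 2 * hp * hr / (hp + hr) if hp + hr else 0.0
    return hp, hr, hf


# Two hypothetical instances: (true label, predicted label).
true_sets = [with_ancestors("2.1.1.2"), with_ancestors("1.1")]
predicted_sets = [with_ancestors("2.1.1.1"), with_ancestors("1.1.2")]

hp, hr, hf = hierarchical_scores(true_sets, predicted_sets)
```

In this example the first prediction shares three of four classes with the truth, so a wrong leaf under the correct parent is penalized less than a wrong top-level class would be.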

  13. Results
     Table I. Comparative results of different machine learning approaches on the PGSB hierarchical dataset. nLLCPN is non-Leaf Local Classifier per Parent Node and LCPNB is Local Classifier per Parent Node and Branch.
                      MIPS - nLLCPN                     MIPS - LCPNB
     Method           hP       hR       hF              hP       hR       hF
     GBC              86.75%   86.25%   0.864972486     86.11%   86.45%   0.862758219
     SVM              88.21%   86.51%   0.873518029     87.34%   86.10%   0.867151847
     ANN              82.13%   85.51%   0.837699065     82.93%   83.44%   0.831846433
     ExtraTree        76.03%   78.94%   0.774524643     84.50%   85%      0.847494297
     Random Forest    76.98%   79.55%   0.782458818     84.12%   84.69%   0.844037783
     LogReg           76%      78.89%   0.774172489     83.55%   84.21%   0.838769007

  14. Results
     Table II. Comparative results of different machine learning approaches on the Repbase hierarchical dataset. nLLCPN is non-Leaf Local Classifier per Parent Node and LCPNB is Local Classifier per Parent Node and Branch.
                      Repbase - nLLCPN                  Repbase - LCPNB
     Method           hP       hR       hF              hP       hR       hF
     GBC              81.98%   84.04%   0.830022352     81.94%   84.59%   0.832277949
     SVM              85.44%   86.64%   0.860347824     85.75%   87.05%   0.863959027
     ANN              80.27%   83.32%   0.817704912     80.57%   83.26%   0.818944098
     ExtraTree        76.02%   78.93%   0.774524643     76.95%   79.99%   0.78444174
     Random Forest    76.98%   79.55%   0.782458818     77.67%   80.27%   0.789473439
     LogReg           75.99%   78.89%   0.774172489     76.12%   79.16%   0.776128202

  15. Results
     Fig. 1. Hierarchical F-measure (hF) comparison between different machine learning methods for the nLLCPN and LCPNB hierarchical classification approaches on the PGSB dataset. (Bar chart; x-axis: Machine Learning Methods, y-axis: Hierarchical f-measure.)
     Method           nLLCPN        LCPNB
     LogReg           0.840740388   0.838769007
     Random Forest    0.855858795   0.844037783
     ExtraTree        0.856005417   0.847494297
     ANN              0.837699065   0.831846433
     GBC              0.864972486   0.862758219
     SVM              0.873518029   0.867151847

  16. Results
     Fig. 2. Hierarchical F-measure (hF) comparison between different machine learning methods for the nLLCPN and LCPNB hierarchical classification approaches on the Repbase dataset. (Bar chart; x-axis: Machine Learning Methods, y-axis: Hierarchical f-measure.)
     Method           nLLCPN        LCPNB
     LogReg           0.774172489   0.776128202
     Random Forest    0.782458818   0.789473439
     ExtraTree        0.774524643   0.78444174
     ANN              0.817704912   0.818944098
     GBC              0.830022352   0.832277949
     SVM              0.860347824   0.863959027

  17. Conclusion and Future Work
     • An advanced machine learning approach improves the prediction accuracy of hierarchical classification of TEs.
     • Optimizing the cost and gamma parameters of a support vector machine (SVM) with a radial basis function (RBF) kernel leads to better hierarchical classification of transposable elements.
     • We plan to improve the classification accuracy through the following approaches:
         Adding biochemistry-related features
         Implementing advanced machine learning techniques
         Implementing novel hierarchical classification approaches
