
Gene Conservation and Mutation Analysis
Explore the conservation and mutation processes in gene, pseudogene, and noncoding RNA families in humans, focusing on unique scoring methods and family-level evolution studies. The research delves into adjusting scoring pipelines, mutation processes within families, and future work on improving conservation scores and detecting conserved regions in noncoding RNA families.
Presentation Transcript
Analysis of conservation and mutation processes in gene, pseudogene and noncoding RNA families (in human) Yan Zhang 6/9/2014
Adjusting pipeline and scores: adjusting the pipeline and scores, with application to gene-pseudogene families. Two novelties in the study: (1) we assign pseudogenes to unique gene families, unlike previous studies, which assigned pseudogenes to unique parent genes; (2) we propose two types of scores for measuring conservation, a point-wise score and an alignment-wise score. Known conservation scores (PhastCons, GERP) calculate cross-species conservation, given a cross-species phylogenetic tree. Our scores calculate conservation within gene-pseudogene families (or, in the future, RNA families) in a single species. No within-family phylogenetic tree is given; the mutation processes and evolution within a family are what we will infer.
Work in progress: gene-pseudogene family study. Family expansion/contraction. Conservation and selection: compare the point-wise conservation between genes and pseudogenes in the same family; calculate the alignment-wise conservation for genes and pseudogenes respectively. Mutation processes within a family: relate the point-wise and alignment-wise scores to allele frequency.
Future work. Improving conservation scores: contiguous conserved region detection; significance level; score comparison. Noncoding RNA family study: detect conserved sites and conserved regions in noncoding regions. Functional element overlap analysis: compare the detected conserved regions with ENCODE MCS. Family evolution in genes, pseudogenes, and noncoding RNAs: family expansion/contraction; learning a stable constrained minimum spanning tree from data.
Point-wise scores: 1) Multinomial pmf, adjusted by sum.
$$\text{P-score} = \frac{(n-1)!}{X_1!\,X_2!\,X_3!\,X_4!}\; p_1^{X_1}\, p_2^{X_2}\, p_3^{X_3}\, p_4^{X_4}$$
Local DNA parameters in a superfamily: $p_1, \dots, p_4$, corresponding to A, C, T, and G. $X_1, \dots, X_4$ are the observed counts of A, C, T, and G at this point; their sum is $n$. Pros: takes into account the prior frequencies of A, C, T, and G in the superfamily. Cons: sensitive to the number of aligned sequences. (Adjustment by sum is one solution, but not a strong one.)
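A minimal Python sketch of the P-score formula above may help make it concrete. The function name `p_score`, the log-space evaluation via `lgamma`, and the uniform priors in the usage example are illustrative assumptions, not part of the presentation.

```python
from math import exp, lgamma, log


def p_score(counts, priors):
    """Multinomial pmf with (n-1)! in place of n! (hypothetical helper).

    counts: observed counts X1..X4 of A, C, T, G at one aligned position.
    priors: local frequencies p1..p4 for the same bases.
    """
    n = sum(counts)
    if n == 0:
        raise ValueError("column has no observed bases")
    # log[(n-1)! / (X1! X2! X3! X4!)], using lgamma(k + 1) == log(k!)
    log_score = lgamma(n) - sum(lgamma(x + 1) for x in counts)
    for x, p in zip(counts, priors):
        if x > 0:
            if p <= 0:
                return 0.0  # an observed base with zero prior has probability 0
            log_score += x * log(p)
    return exp(log_score)


uniform = [0.25, 0.25, 0.25, 0.25]  # assumed priors, for illustration only
small = p_score([4, 0, 0, 0], uniform)    # fully conserved column, 4 sequences
large = p_score([40, 0, 0, 0], uniform)   # fully conserved column, 40 sequences
```

Under these assumed priors, `large` is far smaller than `small` even though both columns are fully conserved, which illustrates the sensitivity to the number of aligned sequences noted under "Cons".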
Point-wise scores: 1) multinomial pmf; 2) empirical entropy; 3) entropy with a Dirichlet prior. Each score is judged on two criteria: whether it takes the prior into account, and whether it is insensitive to the number of sequences. The lower the score, the more conserved the point is, i.e. the point has undergone purifying selection or deletion.
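The slide does not spell out the empirical entropy formula. The sketch below assumes the usual Shannon entropy of the observed base frequencies at one position, in bits; the function name `empirical_entropy` is hypothetical.

```python
from math import log2


def empirical_entropy(counts):
    """Shannon entropy (bits) of observed base frequencies at one position.

    Assumption: f_k = X_k / n and H = -sum_k f_k * log2(f_k); the
    presentation does not give the exact definition it uses.
    """
    n = sum(counts)
    if n == 0:
        raise ValueError("column has no observed bases")
    # Zero counts contribute nothing (the limit of f * log f as f -> 0 is 0).
    return sum(-(x / n) * log2(x / n) for x in counts if x > 0)


conserved = empirical_entropy([4, 0, 0, 0])   # one base only
mixed = empirical_entropy([1, 1, 1, 1])       # all four bases equally often
```

Because only the frequencies enter, scaling every count by the same factor leaves the value unchanged, and a fully conserved position gets the lowest possible score, in line with "the lower the score, the more conserved the point".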
Dirichlet prior: the Dirichlet-multinomial distribution. The Dirichlet distribution is the conjugate prior of the multinomial distribution. $p_k$, $k = 1, \dots, 4$, is the probability of drawing value $k$. $P_v = (p_1, p_2, p_3, p_4)$ is the parameter of the multinomial $P(X_1, X_2, X_3, X_4 \mid P_v)$. $P_v$ follows a conjugate prior distribution, i.e. $P_v \sim \text{Dirichlet}(a_1, a_2, a_3, a_4)$.
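One way to combine the Dirichlet prior with an entropy score is sketched below. It relies on the standard conjugacy result that, after observing counts X, the posterior is Dirichlet(a + X) with mean (X_k + a_k) / (n + sum of a). Taking the entropy of that posterior mean is an assumption about how the score is built; the presentation does not give the exact construction, and the name `dirichlet_entropy` is hypothetical.

```python
from math import log2


def dirichlet_entropy(counts, alphas):
    """Entropy (bits) of the posterior-mean base frequencies.

    counts: observed counts X1..X4 at one aligned position.
    alphas: Dirichlet parameters a1..a4 (e.g. scaled local prior frequencies).
    Assumption: the score is the entropy of the posterior mean
    (X_k + a_k) / (n + sum(alphas)).
    """
    total = sum(counts) + sum(alphas)
    if total <= 0:
        raise ValueError("need positive counts or positive Dirichlet parameters")
    posterior_mean = [(x + a) / total for x, a in zip(counts, alphas)]
    return sum(-p * log2(p) for p in posterior_mean if p > 0)


flat_prior = [1.0, 1.0, 1.0, 1.0]  # assumed pseudocounts, for illustration only
no_data = dirichlet_entropy([0, 0, 0, 0], flat_prior)    # falls back to the prior
conserved = dirichlet_entropy([8, 0, 0, 0], flat_prior)
mixed = dirichlet_entropy([2, 2, 2, 2], flat_prior)
```

With no observations the score reduces to the entropy of the prior itself, and unobserved bases never receive zero probability, so the logarithm is always defined.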
Alignment-wise scores: over all point-wise scores in the aligned region, A-score = -2 * log(mean(point-wise score)) 1/L, where L is the length of the aligned region. The mean is not sensitive to L, unlike the product. Inspired by AIC. The higher the score, the more conserved the aligned region is.
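The alignment-wise score can be sketched as follows. The role of the "1/L" term is not clear from the slide; this sketch assumes it is simply the normalization inside the mean, and it assumes the natural logarithm. The name `a_score` is hypothetical.

```python
from math import log


def a_score(point_scores):
    """-2 * log(mean of point-wise scores) over an aligned region.

    Assumptions: natural log, and 1/L is the mean's normalization
    (sum of the L point-wise scores divided by L).
    """
    length = len(point_scores)
    if length == 0:
        raise ValueError("aligned region is empty")
    mean = sum(point_scores) / length
    if mean <= 0:
        raise ValueError("mean point-wise score must be positive")
    return -2.0 * log(mean)


more_conserved = a_score([0.2, 0.4, 0.3])   # low point-wise scores
less_conserved = a_score([0.8, 1.0, 0.9])   # high point-wise scores
```

Since lower point-wise scores mean more conservation, the negated log of their mean rises as the region becomes more conserved, matching "the higher the score, the more conserved the aligned region". Repeating the same point-wise scores over a longer region leaves the value unchanged, which reflects the remark that the mean, unlike a product, is not sensitive to L.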
Preliminary result (using my sandbox2 dataset). Sandbox2 dataset: (1) the largest gene-paralog family, with 1335 genes; (2) 4 highly similar pseudogenes.
The biggest gene superfamily. [Figure: local prior frequencies (A, C, G, T bases) and point-wise entropy.Dirichlet conservation scores, plotted against location along the aligned sequences.] Penalized alignment score of entropy.Dirichlet: 0.187
The group of very similar pseudogenes. [Figure: local prior frequencies (A, C, G, T bases) and point-wise entropy.Dirichlet conservation scores, plotted against location along the aligned sequences.] Penalized alignment score of entropy.Dirichlet: 1.08
Think about: choosing an appropriate alignment similarity score; superfamilies that are too big might not be appropriate; validity of transition probability in paralogs; large indels; incorporating more information (structure) while analyzing RNA families.