
Processing Microbiome Data: Secondary Analysis with Phyloseq
Dive into secondary data analysis with Phyloseq for microbiome data processing, including taxonomic assignment and species level assignment using R. Follow along from the end of a primary data analysis to enhance your data insights.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
Lecture #11: Processing Microbiome Data: Secondary Data Analysis with Phyloseq (Part 1)
Alternative Secondary Analysis We may want to stay within the R universe and perform secondary data analysis with Phyloseq The process and types of analysis will be similar We can pick up from the end of lecture #7 At the end of the primary data analysis (lecture #7) We had a chimera filtered sequence table from DADA2 It was called seqtab.nochim We continue from that point here 2
Assign Taxonomy > taxa <- assignTaxonomy(seqtab.nochim, rdp_train_set_14.fa.gz ) This command performs taxonomic assignment on the sequence variants within the sequence table It uses the RDP training set 3
Assign Taxonomy (cont) > unname(head(taxa)) [,1] [,2] [,3] [,4] [1,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales" [2,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales" [3,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales" [4,] "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" [5,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales" [6,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales" [,5] [,6] [1,] "Porphyromonadaceae" "Barnesiella" [2,] "Prevotellaceae" "Prevotella" [3,] "Porphyromonadaceae" "Barnesiella" [4,] "Lachnospiraceae" NA [5,] "Porphyromonadaceae" "Barnesiella" [6,] "Porphyromonadaceae" "Barnesiella" These are the classifications for the 6 most abundant sequence variants Note that there is no species level Let s fix that 4
Species Level Assignment > taxa.spec <- addSpecies(taxa, rdp_species_assignment_14.fa.gz ) This call adds species assignment level to the taxa table 5
Phylogenetic Tree In order to assess the community composition of our samples with a notion of phylogenetic relatedness, we must build a tree for the sequence variants This tree will be used in the assessment of phylogeny aware distances between communities We begin by performing a multiple sequence alignment on the sequence variants 6
Multiple Sequence Alignment > seqs <- getSequences(seqtab.nochim) > names(seqs) <- seqs # Propagate the sequence variant names to the tips labels of the tree > alignment <- AlignSeqs(DNAStringSet(seqs), anchor=NA) 7
Build Phylogenetic Tree > phang.align <- phyDat(as(alignment, matrix ), type= DNA ) > dm <- dist.ml(phang.align) > treeNJ <- NJ(dm) > fit <- pml(treeNJ, data=phang.align) > fitGTR <- update(fit, k=4, inv=0.2) > fitGTR <- optim.pml(fitGTR, model= GTR , optInv=TRUE, optGamma=TRUE, rearrangement= stochastic , control=pml.control(trace=0)) # Intensive! 8
Mapping File Recall that we had created a mapping file for secondary analysis with QIIME It is a tab separated text file so we can read it into R > mapfile <- read.table( ../Secondary/mappingfile.txt , row.names=1) The row.names=1 tells it to use the first row of the table for sample names 9
Mapping File (cont) The column names do not import easily so we will set them manually since there are only three > names(mapfile)[1] <- Mouse > names(mapfile)[2] <- Receive > names(mapfile)[3] <- SampleType > names(mapfile) [1] Mouse Receive SampleType 10
Mapping File (cont) Phyloseq provides a function called sample_data to make this data frame into a phyloseq sample table > sdmap <- sample_data(mapfile) > sdmap Sample Data: [61 samples by 3 sample variables]: Mouse Receive BKCM01 01 CHOW BKCM02 02 CHOW BKCM03 03 CHOW SampleType Cecal Cecal Cecal 11
Moving to Phyloseq > library(phyloseq) > library(ggplot2) Load these libraries if not already done Now we create the phyloseq object: > pso <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows=FALSE), tax_table(taxa.spec), sample_data(mapfile), phy_tree(fitGTR$tree)) > pso phyloseq-class experiment-level object otu_table() OTU Table: [ 803 taxa and 61 samples ] sample_data() Sample Data: [ 61 samples by 3 sample variables ] tax_table() Taxonomy Table: [ 803 taxa by 7 taxonomic ranks ] phy_tree() Phylogenetic Tree:[ 803 tips and 801 internal nodes ] 12
Alpha Diversity Let s take a look at some Alpha Diversity metrics using our newly constructed phyloseq object called pso We will group samples by the Receive category, and color by SampleType (Fecal vs Cecal) We produce plots for Observed Species, Chao1, and Shannon metrics and sort by the Shannon metric > plot_richness(pso, x= Receive , measures=c( Observed , Chao1 , Shannon ), color= SampleType , sortby= Shannon ) 13
Taxonomic Filtering In many environments, the set of organisms present are well studied and RDP is fairly comprehensive The gut microbiota of a model organism is such a case In this case it is advisable to filter sequence variants for which a high-rank taxonomy wasn t assigned Such a variant is almost always a sequencing artifact This type of filtering would not be appropriate for poorly characterized environments or novel specimens Phylum is a good rank to filter 15
Exploring Phylum Level Create a table of features for each Phylum present: > table(tax_table(pso)[, Phylum ], exclude = NULL) Actinobacteria Bacteroidetes 16 91 Candidatus_Saccharibacteria Firmicutes 8 669 Proteobacteria Tenericutes 11 2 Verrucomicrobia <NA> 2 16 4
Remove Unassigned Phyla Remove any taxa that have an NA or unassigned phylum > ps0 <- subset_taxa(pso, !is.na(Phylum) & !Phylum %in% c( , uncharacterized ) > ps0 phyloseq-class experiment-level object otu_table() OTU Table: [ 799 taxa and 61 samples ] sample_data() Sample Data: [ 61 samples by 3 sample variables ] tax_table() Taxonomy Table: [ 799 taxa by 7 taxonomic ranks ] phy_tree() Phylogenetic Tree:[ 799 tips and 797 internal nodes ] Note that 4 taxa have been removed (803 -> 799) Also removed from phylogenetic tree associated with ps0 17
Feature Prevalence Prevalence The number of samples in our data set in which a given feature (e.g. variant, OTU) appears > prevalence < > prevalence <- - apply(X= apply(X=otu_table MARGIN= MARGIN=ifelse ifelse( (taxa_are_rows taxa_are_rows(ps0), yes=1, no=2), FUN=function(x){sum(x>0)}) no=2), FUN=function(x){sum(x>0)}) > prevalence < > prevalence <- - data.frame data.frame(Prevalence=prevalence, (Prevalence=prevalence, TotalAbundance TotalAbundance= =taxa_sums taxa_sums(ps0), tax_table tax_table(ps0)) (ps0)) Constructs a data frame to explore prevalence otu_table(ps0), (ps0), yes=1, (ps0), (ps0), 18
Feature Prevalence (cont) > prevalence0 < > prevalence0 <- - subset(prevalence, Phylum subset(prevalence, Phylum %in% %in% get_taxa_unique get_taxa_unique(ps0, Phylum )) (ps0, Phylum )) > > ggplot ggplot(prevalence0, (prevalence0, aes Prevalence/ Prevalence/nsamples nsamples(ps0), color=Phylum)) + (ps0), color=Phylum)) + geom_hline geom_hline( (yintercept yintercept=0.05, alpha=0.5, =0.05, alpha=0.5, linetype linetype=2) + =2) + geom_point geom_point(size=2, alpha=0.7) + scale_x_log10() + + scale_x_log10() + xlab xlab( Total Abundance ) + + ylab ylab( Prevalence [ ( Prevalence [Frac Frac. Samples] ) + facet_wrap facet_wrap(~Phylum) + (~Phylum) + theme( theme(legend.position legend.position= none ) = none ) Produces a plot of the prevalence of each Phylum aes( (TotalAbundance TotalAbundance, , (size=2, alpha=0.7) ( Total Abundance ) . Samples] ) + 19
The dashed line illustrates a potential cutoff of 5% prevalence. That would require a prevalence of a species to appear in more than 3 of 61 samples since 3 / 61 = 4.918% 20
Apply 5% Prevalence Filter We will filter out taxa with less than a 5% prevalence > > prevt prevt < <- - 0.05 * 0.05 * nsamples nsamples(ps0) > > prevt prevt [1] 3.05 [1] 3.05 > > keepTaxa keepTaxa < <- - rownames rownames(prevalence)[( (prevalence)[(prevalence$Prevalence >= >= prevt prevt)] )] > ps1 < > ps1 <- - prune_taxa prune_taxa( (keepTaxa keepTaxa, ps0) phyloseq-class experiment-level object otu_table() OTU Table: [ 460 taxa and 61 samples ] sample_data() Sample Data: [ 61 samples by 3 sample variables ] tax_table() Taxonomy Table: [ 460 taxa by 7 taxonomic ranks ] phy_tree() Phylogenetic Tree:[ 460 tips and 459 internal nodes ] (ps0) prevalence$Prevalence , ps0) 21
All of the taxa that were below the dashed line have been filtered out of the new phyloseq object ps1. Those were taxa that appeared in 3 or less of the 61 total samples. Remaining taxa have > 5% prevalence. 22