
16S Gene Sequencing Challenges & Solutions
Explore the challenges and solutions in 16S gene sequencing, including data reduction, error correction, identification of protein families, and more. Learn about the issues with chimeras, ambiguous assignments, and abundant sequences, and how to address them effectively for downstream analysis.
Presentation Transcript
Robert Edgar Independent scientist robert@drive5.com www.drive5.com
Data reduction: make data tractable for downstream analysis. Read dereplication & error-correction. Metagenomics: identify protein families de novo. Community sequencing: identify OTUs.
Challenges USEARCH solutions
16S gene. [Figure labels: environmental sample with bacteria; bacterial chromosome; primers; 16S segments; PCR; amplified segments; biological sequences; reads.] Chimeric artifacts are formed from 2 biological sequences during PCR.
Error correction. Chimeras: big problem with 16S / 18S / ITS; covered this morning (UCHIME). Other PCR errors. Sequencer error: bad base calls, indels, homopolymers. Cluster at 97% (3% radius): one cluster = one OTU = one species (maybe!).
Bigger dot = more reads. Radius 3% = species. Centroid, ideally should be most abundant = most likely to be biological. Differs from rep. seq. due to: sequencing error, biological variation.
Ambiguous assignments Which OTU?
Abundant sequences <3% different. Outliers create spurious OTU(s). Arbitrary choice of OTU rep. seq. [Figure label: 2%.]
Full-length 16S gene (~1500 nt). Next-gen reads of hypervariable region (~300 nt). Variation is greater in the short region, may be > 3%.
Variation between populations Diseased Healthy
Paralogs and segmental duplications. [Figure labels: bacterial chromosome; 16S gene; duplication > 3% diverged.] Two OTUs for one species.
Alignment variation and defining % identity. Two programs (Program A, Program B) can align the same pair of sequences differently:
  GATTACA--
  GAATTAACA    3 diffs or 5 diffs?

  GA-TTA-CA
  GAATTAACA    No diffs or 2 diffs?
Different programs produce different results from the same algorithm & same input data because alignments and %id definition vary. This can bias validation, e.g. Schloss & Westcott (2011) AEM.
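To make the ambiguity concrete, the Python sketch below counts differences for the two alignments on this slide under different gap conventions. The function `count_diffs` and its two switches are illustrative assumptions, not part of USEARCH or of any program discussed in the talk.

```python
def count_diffs(row_a, row_b, count_terminal_gaps=True, count_internal_gaps=True):
    """Count differing columns in a pairwise alignment (two rows of equal length)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    columns = list(zip(row_a, row_b))
    # Columns where both rows hold a letter; gaps outside this range are terminal.
    paired = [i for i, (x, y) in enumerate(columns) if x != "-" and y != "-"]
    first, last = paired[0], paired[-1]
    diffs = 0
    for i, (x, y) in enumerate(columns):
        if x == y:
            continue
        if x != "-" and y != "-":
            diffs += 1  # letter-vs-letter mismatch
        elif i < first or i > last:
            diffs += int(count_terminal_gaps)
        else:
            diffs += int(count_internal_gaps)
    return diffs


end_gapped = ("GATTACA--", "GAATTAACA")
internally_gapped = ("GA-TTA-CA", "GAATTAACA")

print(count_diffs(*end_gapped))                                    # mismatches plus terminal gaps
print(count_diffs(*end_gapped, count_terminal_gaps=False))         # mismatches only
print(count_diffs(*internally_gapped))                             # internal gaps counted
print(count_diffs(*internally_gapped, count_internal_gaps=False))  # gaps ignored
```

The same pair of sequences scores anywhere from 0 to 5 differences, depending on which alignment is chosen and on whether gaps are counted.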
Hard to define an OTU or an optimal set of OTUs. [Figure: phylogenetic tree of sequences A, B and C, with distances labelled 2.5%, 1.5% and 4%.]
Hard to define an OTU or an optimal set of OTUs. [Figure: trees of sequences A, B and C.] Optimal OTUs per Schloss & Westcott's MCC measure can be non-monophyletic.
OTUs are hacks. Do not exist in nature. Cannot be defined and validated robustly. But can still be useful!
One program, one binary. Suite of high-throughput algorithms: search, clustering, dereplication, chimera detection. Orders of magnitude faster than BLAST. Free for academic use (32-bit).
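One way to see the difficulty is to enumerate every way of grouping three sequences and test each grouping against a strict 3% rule. The Python sketch below is only an illustration: the assignment of the figure's distances to pairs (A-B 2.5%, B-C 1.5%, A-C 4%) is an assumption, since the transcript does not preserve which label belongs to which pair, and `violations` is a hypothetical helper.

```python
# Assumed pairwise distances (percent); the mapping of labels to pairs is a guess.
dist = {("A", "B"): 2.5, ("B", "C"): 1.5, ("A", "C"): 4.0}

# All five ways to partition {A, B, C} into OTUs.
partitions = [
    [{"A"}, {"B"}, {"C"}],
    [{"A", "B"}, {"C"}],
    [{"A"}, {"B", "C"}],
    [{"A", "C"}, {"B"}],
    [{"A", "B", "C"}],
]


def violations(partition, dist, radius=3.0):
    """Pairs that break the rule 'same OTU if and only if distance <= radius'."""
    otu_of = {name: i for i, otu in enumerate(partition) for name in otu}
    bad = []
    for (x, y), d in dist.items():
        together = otu_of[x] == otu_of[y]
        if together != (d <= radius):
            bad.append((x, y))
    return bad


for p in partitions:
    print([sorted(otu) for otu in p], "violations:", violations(p, dist))
```

Under these assumed distances every partition breaks the rule for at least one pair, because A and C are both within 3% of B but more than 3% from each other.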
Sort sequences Greedy list removal
Typical state: one database sequence per cluster (centroid), held in RAM for fast access. Cluster assignments are written sequentially to file, not stored in RAM. [Figure labels: Clusters, Input sequences, Database.]
Initial state: empty database = no clusters. Input sequences are processed in file order.
Next input sequence is searched against the database. USEARCH algorithm: very fast database search (>>BLAST).
Hit: input sequence is assigned to a cluster & discarded. Record written to output file(s). Optional: alignment, other info.
No hit: query is added to the database, becomes centroid of a new cluster.
Very fast. Input order matters: centroid is always first member found. How to sort?
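The hit / no-hit loop described on these slides can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not UCLUST's implementation: `identity` is a crude stand-in that only compares equal-length, pre-aligned strings (USEARCH uses a fast heuristic search plus alignment), and a read is assigned to the first centroid that passes the threshold.

```python
def identity(a, b):
    """Fraction of matching positions; assumes equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold=0.97):
    """Greedy centroid clustering: process sequences in input order."""
    centroids = []      # the "database": one sequence per cluster
    assignments = []    # (input index, centroid index) records
    for i, seq in enumerate(seqs):
        for c_idx, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                assignments.append((i, c_idx))   # hit: assign and move on
                break
        else:
            centroids.append(seq)                # no hit: new centroid
            assignments.append((i, len(centroids) - 1))
    return centroids, assignments


a, b, c = "AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT"
print(len(greedy_cluster([a, b, c], threshold=0.9)[0]))
print(len(greedy_cluster([b, a, c], threshold=0.9)[0]))
```

With a 90% threshold on these toy sequences, the order a, b, c yields two clusters while b, a, c yields one, because whichever sequence is seen first becomes the centroid.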
Longest sequences typically outliers, tend to split OTUs.
  Centroid: CENTROID-------
  Seq1:     CENTROIDINSERTED
  Seq2:     CENTROIDTERMINAL
If you don't sort by length, fragments can become centroids and member sequences may have many differences.
Most abundant sequence is likely to be biological & a good choice of centroid
If read errors are rare: abundance = size of dereplication cluster. If read errors are common: have a circular problem: abundances need clustering, but clustering needs abundances.
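In the rare-error case, abundance can be read off simply by counting identical reads. A minimal Python sketch, assuming reads are plain strings already trimmed to the same region; `dereplicate` is a hypothetical helper, not a USEARCH command.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads; return (sequence, abundance) sorted by abundance."""
    counts = Counter(reads)
    # most_common() sorts by count, highest first; ties keep first-seen order.
    return counts.most_common()


reads = ["ACGT", "ACGT", "ACGA", "ACGT", "ACGA", "TTTT"]
for seq, abundance in dereplicate(reads):
    print(seq, abundance)
```

When read errors are common, the same true sequence is scattered across many slightly different strings, so exact-match counts like these underestimate abundance.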
Calculate consensus sequence. UCLUST can do this for each cluster.
  GATGACGTCA-AGTCATAGG   Biological sequence
  GATTACGTCA-AGTCAAAGG   Read 1
  GATGACGACA-AGTCATAG-   Read 2
  GGTGACGTCAAAG-CATAGG   Read 3
  GATGACGTCA-AGTCATAGG   Consensus
Dereplicate: sort by length & run UCLUST. Longest sequences are centroids in first round; they tend to be outliers & split a natural OTU.
Find consensus sequences. Consensus sequences converge on the most abundant sequence in the cluster, which is most likely to be a correct amplicon sequence. It is common for two clusters to converge on the same consensus sequence: this merges an OTU that was split in the first round.
[Figure: clusters before taking consensus, and after.]
Consensus sequences serve as denoised amplicons; amplicon abundance is taken from cluster size. Circular problem solved. Filter chimeras: abundances are needed by de novo UCHIME as well.
Sort by abundance. Run UCLUST at 97%. Centroid is final OTU.
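A column-by-column majority vote reproduces the consensus on this slide. The Python sketch below is an illustration only: it assumes the reads are already aligned with `-` for gaps, and `consensus` is a hypothetical helper rather than UCLUST's own routine.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority symbol per alignment column, with gap columns removed at the end."""
    if len({len(r) for r in aligned_reads}) != 1:
        raise ValueError("aligned reads must all have the same length")
    calls = []
    for column in zip(*aligned_reads):
        # most_common(1) returns the most frequent symbol; ties go to the first seen.
        calls.append(Counter(column).most_common(1)[0][0])
    return "".join(symbol for symbol in calls if symbol != "-")


reads = [
    "GATTACGTCA-AGTCAAAGG",  # Read 1
    "GATGACGACA-AGTCATAG-",  # Read 2
    "GGTGACGTCAAAG-CATAGG",  # Read 3
]
print(consensus(reads))  # GATGACGTCAAGTCATAGG
```

Each read carries its own errors, but no two reads share an error in the same column, so the vote recovers the biological sequence.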
Assign reads to OTUs: USEARCH at 97%. Most reads match an OTU. Outliers need special treatment: can be assigned to closest OTU, or reclustered at 97%.
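The assignment step can be sketched as a best-match search with a threshold, keeping unmatched reads aside for the special treatment mentioned above. This is a toy Python illustration: `identity` assumes equal-length, pre-aligned strings and stands in for the USEARCH search, and `assign_reads` is a hypothetical name.

```python
def identity(a, b):
    """Fraction of matching positions; assumes equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def assign_reads(reads, otus, threshold=0.97):
    """Assign each read to its closest OTU if it passes the threshold."""
    assigned = {otu: [] for otu in otus}
    outliers = []
    for read in reads:
        best = max(otus, key=lambda otu: identity(read, otu))
        if identity(read, best) >= threshold:
            assigned[best].append(read)
        else:
            outliers.append(read)  # left for closest-OTU assignment or reclustering
    return assigned, outliers


otus = ["AAAAAAAAAA", "CCCCCCCCCC"]
reads = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCG", "ACACACACAC"]
assigned, outliers = assign_reads(reads, otus, threshold=0.9)
print(assigned)
print(outliers)
```

The 0.9 threshold is used only because the toy sequences are 10 letters long; the slide's value is 97%.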
Python script, runs multiple USEARCH steps. Very fast and highly scalable: 10^6 reads in minutes on a laptop. Ad hoc, but good biological results. Other algorithms are also ad hoc. Average linkage: standard but not justified by theory; does not address read error correction, other challenges.
Technical issues. Clustering threshold for error correction: 97% seems to work well so far, but can merge distinct amplicons, which degrades the abundance estimate. Higher threshold might be better if read errors are rare. Minimum cluster size threshold: clusters <4 reads discarded after error-correction step. Rare species / false-positive trade-off.
Not like QIIME or mothur. Not a complete suite of analysis tools. Not "packaged" specifically for 16S. Lower-level algorithms, typically used by "pipelines". Multiple steps: typical step is a USEARCH command or file conversion, implemented by scripts (bash, perl, Python...).
Tools compared by task (the per-tool marks in the table cells are not preserved in this transcript).
Tools (author): USEARCH (Edgar), QIIME (Knight), mothur (Schloss), Pyronoise (Quince), Perseus (Quince), ESPRIT (Sun).
Tasks: reads to OTUs; filtered reads to OTUs; phylotype; error correction; chimera filter (ref db); chimera filter (de novo); compare populations (UniFrac); diversity (α, β).
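The minimum cluster size rule is a simple filter, and varying it shows the trade-off directly. A short Python sketch with made-up cluster sizes; `filter_small_clusters` and the OTU names are hypothetical.

```python
def filter_small_clusters(cluster_sizes, min_size=4):
    """Keep clusters with at least min_size reads (clusters < min_size are discarded)."""
    return {name: n for name, n in cluster_sizes.items() if n >= min_size}


# Illustrative sizes only, not data from the talk.
sizes = {"otu1": 120, "otu2": 4, "otu3": 3, "otu4": 1}

print(filter_small_clusters(sizes))              # default: drop clusters under 4 reads
print(filter_small_clusters(sizes, min_size=2))  # looser: keeps more rare clusters
```

A lower cutoff retains more genuinely rare species but also more clusters built from errors; a higher cutoff does the reverse.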