
Performing Variant Calling Workshop for Genomics Analysis
Explore a step-by-step guide for variant calling analysis in genomics using the IGB Biocluster and Integrative Genomics Viewer. Learn how to access the lab setup, start the VM, and follow instructions for a successful workshop experience. Find detailed directions on accessing directories, setting up the lab, and copying necessary files for the analysis.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
Variant Calling Workshop Chris Fields PowerPoint by Casey Hanson Edited by Saba Ghaffari Variant Calling Workshop | Chris Fields | 2020 1
Introduction In this lab, we will do the following: 1. Perform variant calling analysis on the IGB biocluster. 2. Visualize our results on the desktop using the Integrative Genomics Viewer (IGV IGV) tool. Variant Calling Workshop | Chris Fields | 2020 2
Start the VM Follow instructions for starting VM. (This is the Remote Desktop software.) The instructions are different for UIUC and Mayo participants. Instructions for UIUC users are here: http://publish.illinois.edu/compgenomicscourse/files/2020/06/SetupV M_UIUC.pdf Instructions for Mayo users are here: http://publish.illinois.edu/compgenomicscourse/files/2020/06/VM_Set up_Mayo.pdf Variant Calling Workshop | Chris Fields | 2020 3
Where is this? Where is this? Step 0A: Accessing the IGB Biocluster 1. In the VM, Run Putty.exe Putty.exe 2. In the hostname hostname textbox type: biologin.igb.illinois.edu 3. Click Open Open 4. If popup appears (it may not), Click Yes Yes 5. Enter login credentials assigned to you; e.g., login as: class00 class00 password: (type it out here) Now you are logged Now you are logged on to biocluster. on to biocluster. Variant Calling Workshop | Chris Fields | 2020 4
Step 0B: Lab Setup The data and code needed for the lab, as well as the output files from the lab, are located in the following directory: DON DON T TYPE THIS! T TYPE THIS! /home/classroom/hpcbio/mayo_workshop/2019/Mayo-Variant-Calling/ You don t need to do anything yet. You and the TA may together consult it later if you unsure about your runs. We ll call the above the common directory for the lab, since it is visible to all students We ll call the above the common directory for the lab, since it is visible to all students Variant Calling Workshop | Chris Fields | 2020 5
Note: a directory is the Unix word for folder Note: your home directory is your landing folder when you log on. Note: for better organization, we don t put all files in the home directory. We create a folder (directory) inside it, and move into it, just as you would move inside a sub- folder in Windows/Mac by double-clicking it. Step 0C: Lab Setup The three commands in the gray box below will achieve the following: a) Create a directory called 03_Variant_Calling home directory. b) Change your current directory to this new directory c) Copy all code (files ending with .sh ) from the common directory (previous slide) to your current directory. DON DON T RUN ANYTHING YET 03_Variant_Calling in your T RUN ANYTHING YET $ mkdir ~/03_Variant_Calling # Make new directory inside your home directory. mkdir stands for make directory $ cd ~/03_Variant_Calling # Move into this directory. The cd stands for change directory $ cp /home/classroom/hpcbio/mayo_workshop/2019/Mayo-Variant-Calling/*.sh . Variant Calling Workshop | Chris Fields | 2020 6 # Copy all files whose names end with .sh to current directory. These are code files
Step 0C: Lab Setup Q: Why ~/03_Variant_Calling ? The three commands in the gray box below will achieve the following: A: This is Linux short hand for a directory called 03_Variant_Calling inside of the home directory ~ a) Create a directory called 03_Variant_Calling home directory. b) Change your current directory to this new directory c) Copy all code (files ending with .sh ) from the common directory (previous slide) to your working directory. RUN the first two commands now (shown in red font). (TYPE & HIT ENTER) RUN the first two commands now (shown in red font). (TYPE & HIT ENTER) 03_Variant_Calling in your $ mkdir ~/03_Variant_Calling # Make new directory inside your home directory. mkdir stands for make directory $ cd ~/03_Variant_Calling # Move into this directory. The cd stands for change directory $ cp /home/classroom/hpcbio/mayo_workshop/2019/Mayo-Variant-Calling/*.sh . Variant Calling Workshop | Chris Fields | 2020 7 # Copy all files whose names end with .sh to current directory. These are code files
Q: What is this . (period)? Step 0C: Lab Setup A: This is Linux short hand for current directory The three commands in the gray box below will achieve the following: Note: Space after cp and another space before the . a) Create a directory called 03_Variant_Calling home directory. b) Change your current directory to this new directory c) Copy all code (files ending with .sh ) from the common directory (previous slide) to your working directory. RUN the third command now (shown in red font) RUN the third command now (shown in red font) 03_Variant_Calling in your $ mkdir ~/03_Variant_Calling # Make new directory inside your home directory. mkdir stands for make directory $ cd ~/03_Variant_Calling # Move into this directory. The cd stands for change directory $ cp /home/classroom/hpcbio/mayo_workshop/2019/Mayo-Variant-Calling/*.sh . # Copy all files whose names end with .sh FROM the common directory TO current directory. These are code files. Variant Calling Workshop | Chris Fields | 2020 8
Copied Files: Copied Files: Step 0C: Lab Setup annotate_snpeff.sh The three commands in the gray box below will achieve the following: call_variants_ug.sh hard_filtering.sh a) Create a directory called 03_Variant_Calling home directory. b) Change your current directory to this new directory c) Copy all code (files ending with .sh ) from the common directory (previous slide) to your working directory. DON DON T RUN ANYTHING NOW. YOU RE DONE WITH THESE COMMANDS T RUN ANYTHING NOW. YOU RE DONE WITH THESE COMMANDS 03_Variant_Calling in your post_annotate.sh HOW TO CHECK IF THE HOW TO CHECK IF THE COPYING WORKED? COPYING WORKED? $ mkdir ~/03_Variant_Calling # Make new directory inside your home directory. mkdir stands for make directory $ cd ~/03_Variant_Calling # Move into this directory. The cd stands for change directory $ cp /home/classroom/hpcbio/mayo_workshop/2019/Mayo-Variant-Calling/*.sh . # Copy all files whose names end with .sh FROM the common directory TO current directory. These are code files Variant Calling Workshop | Chris Fields | 2020 9
CHECKPOINT REACHED. LAB PAUSE HERE. CHECKPOINT REACHED. LAB PAUSE HERE. Variant Calling Setup In this exercise, we will use data from the 1000 Genomes project (EXOME, 60x coverage) to call variants, in particular single nucleotide polymorphisms. The initial part of the GATK pipeline (alignment, local realignment, base quality score recalibration alignment, local realignment, base quality score recalibration) has been done, and the BAM file has been reduced for a portion of human chromosome 20. This is the data we will be working with in this exercise. Variant Calling Workshop | Chris Fields | 2020 10
Step 1A: Running a Variant Calling Job In this step, we will start a variant calling job using the sbatch sbatch command. Note: submitting the code for execution on the biocluster does not Note: submitting the code for execution on the biocluster does not mean it has run successfully. It may take a few minutes to run. mean it has run successfully. It may take a few minutes to run. RUN the command shown in red font. (TYPE IT, HIT ENTER) RUN the command shown in red font. (TYPE IT, HIT ENTER) $ sbatch call_variants_ug.sh # This will execute call_variants_ug.sh on the biocluster. # After you hit enter, it should say something like: # Submitted batch job 5143759 Variant Calling Workshop | Chris Fields | 2020 11
Step 1A: Running a Variant Calling Job In this step, we will start a variant calling job using the sbatch sbatch command. Additionally, we will gather statistics about our job using the squeue squeue command. RUN the command shown in red font. BUT REPLACE <userID> with your user RUN the command shown in red font. BUT REPLACE <userID> with your user id, e.g., class07 id, e.g., class07 $ sbatch call_variants_ug.sh # This will execute call_variants_ug.sh on the biocluster. $ squeue -u <userID> # Get statistics on your submitted job Variant Calling Workshop | Chris Fields | 2020 12
Step 1B: Output of Variant Calling Job As long as the code you submitted (through sbatch) is running, it will show something like this: Repeat the squeue has finished. When it finishes, it will only show the header row. When you don t see it listed anymore, it s done! Quick tip: You don t need to retype or repaste it every time. Hit the UP-ARROW key and see what happens! You should have 4 4 files when it has completed. squeue - -u <userID> u <userID> command periodically to see if your job Files Files raw_indels.vcf raw_indels.vcf.idx raw_snps.vcf HOW TO CHECK THIS? HOW TO CHECK THIS? Variant Calling Workshop | Chris Fields | 2020 13 raw_snps.vcf.idx
CHECKPOINT REACHED. LAB PAUSE HERE. CHECKPOINT REACHED. LAB PAUSE HERE. Discussion What did we just do? We ran the GATK GATK UnifiedGenotyper UnifiedGenotyper to call variants. Look at file structure. Which file(s)? How to look ? Variant Calling Workshop | Chris Fields | 2020 14
Step 1C: SNP and Indel Counting In this step, we will count the # of SNPS and Indels identified in the raw_snps.vcf raw_snps.vcf and raw_indels.vcf raw_indels.vcf files. We will use the program grep grep, which is a text matching program. RUN the command shown in red font. RUN the command shown in red font. Be SUPER CAREFUL with spaces and case. Be SUPER CAREFUL with spaces and case. $ grep -c -v '^#' raw_snps.vcf # Get the number of SNPs in file raw_snps.vcf # -v Tells grep to show all lines not beginning with # in raw_snps.vcf. # -c Tells grep to report the total number of lines that match the above criterion. Each such line is a SNP. # Output should be approx. 14400. Variant Calling Workshop | Chris Fields | 2020 15
Step 1C: SNP and Indel Counting In this step, we will count the # of SNPS and Indels identified in the raw_snps.vcf raw_snps.vcf and raw_indels.vcf raw_indels.vcf files. We will use the program grep grep, which is a text matching program. RUN the command shown in red font. RUN the command shown in red font. $ grep -c -v '^#' raw_snps.vcf # Get the number of SNPs in file raw_snps.vcf # -v Tells grep to show all lines not beginning with # in raw_snps.vcf. # -c Tells grep to report the total number of lines that match the above criterion. Each such line is a SNP. # Output should be approx. 14400. $ grep -c -v '^#' raw_indels.vcf # Get the number of indels. Variant Calling Workshop | Chris Fields | 2020 16 # Output should be approx. 1069.
Step 1D: SNP and Indel Counting in dbSNP In this step, we will count the number of SNPs and Indels in dbSNP dbSNP SNPs and Indels have the rs# rs1000 RUN the commands shown in red font. RUN the commands shown in red font. in dbSNP. rs# identifier where # is a number, e.g., $ grep -c 'rs[0-9]*' raw_snps.vcf # Get the number of dbSNP SNPs. # Report all lines in raw_snps.vcf containing rs followed by a number. # -c Tells grep to return the total number of returned lines. # Output should be approx. 13329 $ grep -c 'rs[0-9]*' raw_indels.vcf # Get the number of dbSNP indels. # Output should be approx. 983. Variant Calling Workshop | Chris Fields | 2020 17
Step 2A: Hard Filtering Variant Calls We need to filter these variant calls in some way. In general, we would filter on quality scores. However, since we have a very small set of variants, we will use hard filtering. This was explained in lecture. Please refer to slides for lecture. This was explained in lecture. Please refer to slides for lecture. RUN the commands shown in red font. (REPLACE <userID> appropriately.) RUN the commands shown in red font. (REPLACE <userID> appropriately.) Output Files Output Files $ sbatch hard_filtering.sh # Execute hard_filtering.sh on the biocluster. hard_filtered_snps.vcf hard_filtered_snps.vcf $ squeue -u <userID> hard_filtered_indels.vcf hard_filtered_indels.vcf Periodically, repeat the Periodically, repeat the squeue finished finished. When you don t see it listed anymore, it s done! Quick tip: You don t need to retype or repaste it every time. Hit the UP-ARROW key and see what happens! squeue command to see if your job has command to see if your job has Where are these? Where are these? Variant Calling Workshop | Chris Fields | 2020 18
Step 2B: Hard Filtering Variants Calls In this step, we will count the # of filtered SNPs and Indels. RUN the commands shown in red font. RUN the commands shown in red font. $ grep -c 'PASS' hard_filtered_snps.vcf # Count # of passes # Output approx. 8547. $ grep -c 'PASS' hard_filtered_indels.vcf # Count # of PASSES # Output approx. 1069 Variant Calling Workshop | Chris Fields | 2020 19
CHECKPOINT REACHED. LAB PAUSE HERE. CHECKPOINT REACHED. LAB PAUSE HERE. Step 2B: Discussion 1. Did we lose any variants? 2. How many PASSED the filter? 3. What is the difference in the filtered and raw input? 4. Why are these approximate (why do results slightly differ)? Variant Calling Workshop | Chris Fields | 2020 20
Step 2B: Discussion Some of the filters are as following: QD: QD: variant confidence/ quality by depth, QD < 2 MQ: MQ: RMS mapping quality, MQ < 40 FS: FS: Phred-scaled p-value, FS > 60.0 QD < 2 MQ < 40 FS > 60.0 What is the difference in the filtered and raw input? In the filtered input the FORMAT column has information on whether the SNP has passed all the filters ( PASS ) or has failed any of them. Variant Calling Workshop | Chris Fields | 2020 21
Step 3A: Annotating Variants With SnpEff With our filtered variants in hand, we now need to annotate them with SnpEff SnpEff. . SnpEff SnpEff adds information about where variants are in relation to specific genes. RUN the commands shown in red font. (REPLACE <userID> appropriately.) RUN the commands shown in red font. (REPLACE <userID> appropriately.) Output Files Output Files $ sbatch annotate_snpeff.sh hard_filtered_snps_annotated.vcf hard_filtered_snps_annotated.vcf # This will execute snpeff.sh on the biocluster. hard_filtered_indels_annotated.vcf hard_filtered_indels_annotated.vcf Where are these? Where are these? $ squeue -u <userID> Periodically (every 20 seconds or so), run the Periodically (every 20 seconds or so), run the squeue command to see if your job has finished command to see if your job has finished. squeue - -u <userID> u <userID> Variant Calling Workshop | Chris Fields | 2020 22
Step 3B: Summarizing Annotated Variants The IDs for the human assembly version we us are from Ensemble. The Ensemble format is ENSG ENSGXXXXXXXXXXX. XXXXXXXXXXX. Example: FOXA2 s Ensemble ID is ENSG00000125798. In this step, we would like to see if there are any variants of FOXA2. RUN the commands shown in red font. RUN the commands shown in red font. $ grep -c 'ENSG00000125798' hard_filtered_snps_annotated.vcf # Get the number of SNPS in FOXA2, ENSG00000125798. # Output should be 3. $ grep -c 'ENSG00000125798' hard_filtered_indels_annotated.vcf # Get the number of Indels in FOXA2, ENSG00000125798. # Output should be 0. Variant Calling Workshop | Chris Fields | 2020 23
Step 4 (Optional): GATK Variant Annotator Run this on your own (later), we ll skip this step in live session. SnpEff SnpEff adds a lot of information to the VCF file. GATK Variant Annotator helps remove a lot of the extraneous information. RUN the commands shown in red font. RUN the commands shown in red font. $ sbatch post_annotate.sh # This will execute post_annotate.sh on the biocluster. $ squeue -u <userID> What files are created? Check for yourself. (Or ask TA after session) What files are created? Check for yourself. (Or ask TA after session) Variant Calling Workshop | Chris Fields | 2020 24
CHECKPOINT REACHED. LAB PAUSE HERE. CHECKPOINT REACHED. LAB PAUSE HERE. We re done finding variants and annotating them. Files created are in the current directory. We want to visualize the variants now. But for that we ll go back to the VM (Remote Desktop). What happens to the files? We already copied over the result files to your VM, so you don t have to copy them now! Exit PuTTY by either closing the window or typing exit in the command prompt. Variant Calling Workshop | Chris Fields | 2020 25
Visualization of Results In this exercise, we will visualize the results of the previous exercise using the Integrated Genomics Viewer (IGV). Integrated Genomics Viewer (IGV). We are going to do visualization on VM We are going to do visualization on VM Variant Calling Workshop | Chris Fields | 2020 26
Local Files (for UIUC users) The files needed for this laboratory exercise are in a folder called VM on your Desktop. See if you can locate it. Double click on it to get inside. Once you are inside the folder, navigate to subfolder 03_Variant_Calling\results You should see something like this: Variant Calling Workshop | Chris Fields | 2020 27
Local Files (for Mayo Clinic users) The files needed for this laboratory exercise are in a folder called datafiles on your Desktop. See if you can locate it. Double click on it to get inside. Once you are inside the folder, navigate to subfolder 03_Variant_Calling\results You should see something like this: Variant Calling Workshop | Chris Fields | 2020 28
Run IGV IGV software should be on your VM Desktop or searchable through the search bar at the bottom. CHECKPOINT REACHED. LAB PAUSE HERE. CHECKPOINT REACHED. LAB PAUSE HERE. Variant Calling Workshop | Chris Fields | 2020 29
Step 5A: Visualization With IGV Note: If you cant find Human (b37) , click Note: If you cant find Human (b37) , click on More , type human in the Filter text on More , type human in the Filter text box; you should see this now. box; you should see this now. Switch the genome to Human (b37) Human (b37). Variant Calling Workshop | Chris Fields | 2020 30
Step 5B: Loading VCF Files On the menu bar, click File File Click Load from File Load from File Navigate to: UIUC VM: Desktop -> VM -> 03_Variant_Calling -> results Mayo VM: Desktop -> datafiles -> 03_Variant_Calling -> results Hold the Ctrl Ctrl key down. Click both vcf vcf files. Click Open. Open. Variant Calling Workshop | Chris Fields | 2020 31
Step 5C: Loading VCF Files You should see a window similar to below: Variant Calling Workshop | Chris Fields | 2020 32
Step 5D: Navigate to Chromosome 20 In the drop-down menu next to Human (b37) , select 20 . You should see a view similar to the (partial) screenshot on the right. Variant Calling Workshop | Chris Fields | 2020 33
Step 5E: Navigate to Chromosome 20 Click and drag from around the 20 (Don t have to be exact.) 20 mb mb mark to about the 27 27 mb mb mark. Variant Calling Workshop | Chris Fields | 2020 34
Step 5F: Navigate to Chromosome 20 The result should look similar to the screenshot below: Variant Calling Workshop | Chris Fields | 2020 35
Step 5G: Setting Feature Visibility Window Do this for each VCF track: Do this for each VCF track: Right Click Right Click and Select Set Feature Visibility Window Window Set Feature Visibility Enter 10000 10000 (which is 10 Mb). Click OK. OK. Variant Calling Workshop | Chris Fields | 2020 36
Step 5H: Viewing FOXA2 Polymorphisms In the search box, type FOXA2 FOXA2 and press Enter. Enter. You should see something like the window below: Variant Calling Workshop | Chris Fields | 2020 37
CHECKPOINT REACHED. LAB PAUSE HERE. CHECKPOINT REACHED. LAB PAUSE HERE. Checkpoint: FOXA2 Polymorphisms 1. How many SNPs are here? 2. How many Indels are here? 3. How many SNPs are heterozygotes? Variant Calling Workshop | Chris Fields | 2020 38
Step 6A: Loading a BAM File On the menu bar, click File File Click Load from File Load from File Navigate to: UIUC VM: Desktop -> VM -> 03_Variant_Calling -> results Mayo VM: Desktop -> datafiles -> 03_Variant_Calling -> results Click the .bam .bam file. Click Open. Open. Variant Calling Workshop | Chris Fields | 2020 39
Step 6B: Viewing a loaded BAM File You should see a new track in your window, similar to the one below: Variant Calling Workshop | Chris Fields | 2020 40
Step 6C: Show Coverage Track Note: If you don't see a summary track like below : Note: If you don't see a summary track like below : Right Click Click on the BAM BAM track. Click Show Coverage Track Show Coverage Track. Variant Calling Workshop | Chris Fields | 2020 41
Step 6D: Color Alignments by Read Right Click Click on the .bam .bam track. (Not the .bam.Coverage track.) Click Color Alignment by Color Alignment by and then Read Strand Read Strand Variant Calling Workshop | Chris Fields | 2020 42
Step 6E: FOXA2 Read GAP Question What is happening in the highlighted portion? (You wont see the green rectangle, we ve added that to highlight that region here.) Variant Calling Workshop | Chris Fields | 2020 43
Step 6F: Viewing SNP Calls Zoom In (double-click several times) on a SNP to see the base pair calls on each read. Note: the double-click may be tricky to do, keep trying! How to zoom out once this is done? Find the instructions here: http://software.broa dinstitute.org/softwa re/igv/?q=navigate#z oom Variant Calling Workshop | Chris Fields | 2020 44
Done with IGV! You may close the software by going to File (top left), then Exit Variant Calling Workshop | Chris Fields | 2020 45
Step 7: SnpEff Results (Do remaining slides on your own) SnpEff gives a nice summary HTML file. Navigate to the results directory for this lab: UIUC: Desktop->VM->03_Variant_Calling->results Mayo: Desktop->datafiles->03_Variant_Calling->results Open snpEff_summary.html in each of the following sub directories: 1. snpeff_snp_results 2. snpeff_indel_results Browse each of the HTML files and note the results of the following slides: Variant Calling Workshop | Chris Fields | 2020 46
Step 7B: SNPEff Summary of SNPS Variant Calling Workshop | Chris Fields | 2020 47
Step 7B: SNPEff Summary of Indel Lengths The summary of snpeff lengths: snpeff indels indels shows the following distribution of indel Variant Calling Workshop | Chris Fields | 2020 48