Skip to content

Commit eb9bda9

Browse files
committed
variant calling
1 parent adac4fb commit eb9bda9

2 files changed

Lines changed: 50 additions & 0 deletions

File tree

images/blast1.png

140 KB
Loading

variant_calling.md

Lines changed: 50 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -50,3 +50,53 @@ Navigate the mapping using the zoom and pan left/right etc. Under Colour schemes
5050
![tablet3](images/tablet3.png)
5151

5252
Does the mapping data confirm the presence of an intact pO157 plasmid in the St. Louis outbreak strain?
53+
54+
You will note that one region has extremely high coverage. This region corresponds to two adjacent CDSs
55+
56+
Check their identity, what do they have in common?
57+
58+
Extract the sequence of the biggest of these two by right-clicking on it and selecting copy reference subsequence to clipboard
59+
60+
Your read data was from the whole genome, but you are only mapping to a reference for the pO157 plasmid. Go to BLAST and use nucleotide BLAST to check if the problematic region is present on the chromosome of the reference strain Sakai (BA000007.2) using the Align two sequences option with your subsequence copied from Tablet and BA000007.2. You don’t have to enter the whole genome sequence, just the code.
61+
62+
![blast1](images/blast1.png)
63+
64+
How many matches do you find on the chromosome?
65+
66+
Why do you think the coverage is so high for this region in the mapping to the pO157 reference?
67+
68+
## Variant Calling
69+
70+
A frequent application for mapping reads is variant calling, i.e. finding positions where the reference and your sequences differ. Single nucleotide polymorphism (SNP)-based typing is particularly popular and used for a broad range of applications. For an EHEC O157 outbreak you could use it to identify the source, for instance. Indels are insertions and deletions in the mapped data compared to the reference.
71+
72+
You could find SNPs by just staring at the alignment in Tablet. A variant caller is a program that automates this process, finding variable positions anywhere in the sequence and weighing the evidence for a true variant (number of reads agreeing, quality of those reads etc.) We will be using SAMtools for variant calling.
73+
74+
### Sorting and indexing a bam file
75+
76+
First you need to sort the reads in the BAM-file so the variant caller can easily find the relevant ones for a given position in the reference sequence.
77+
78+
You may need to convert your sam file into a bam file first:
79+
80+
`samtools view -b -o $output.bam $input.sam`
81+
82+
then sort and index the bam file:
83+
84+
```
85+
samtools sort $output.bam > $output_sorted.bam
86+
samtools index $output_sorted.bam
87+
```
88+
89+
Now we can perform the SNP and Indel calling:
90+
91+
`samtools mpileup -o output.vcf -f pO157_Sakai.fasta test_sorted.bam`
92+
93+
You can read about the structure of vcf files here
94+
95+
How many SNPs did the variant caller find? Did you find any indels?
96+
97+
Each variant in the vcf comes with a lot of different quality metrics from the variant calling process. The “QUAL” column tells you how confident the variant caller is feeling about that variant.
98+
99+
Use the coordinates from the vcf-file and manually check a high quality SNP and a low quality SNP in your Tablet view of the sam file. (Hint: there is a “Jump to base” button that can make it easier for you.)
100+
101+
If you have time:
102+
Read the original paper about the St. Louis outbreak investigation, in particular check the supplemental material and methods to see how the authors performed read mapping and variant calling.

0 commit comments

Comments
 (0)