Commit 18cb11f

temp qc tuto
1 parent afed149 commit 18cb11f

1 file changed

Lines changed: 71 additions & 0 deletions

qc.md

@@ -0,0 +1,71 @@
# Quality control

In this practical you will learn to import, view and check the quality of NGS read data in FASTQ format.

You will be working with an Illumina MiSeq read dataset from a genome sequencing project. The sequenced organism is an enterohaemorrhagic E. coli (EHEC) of serotype O157, a potentially fatal gastrointestinal pathogen. The sequenced bacterium was part of an outbreak investigation in the St. Louis area, USA, in 2011. The sequencing was done as paired-end 2x150 bp.

## Downloading the data

The raw data were deposited at the European Nucleotide Archive (ENA) under the accession number SRR957824. Go to the ENA [website](http://www.ebi.ac.uk/ena) and search for the run with the accession SRR957824. Download the two FASTQ files associated with the run:

```
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR957/SRR957824/SRR957824_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR957/SRR957824/SRR957824_2.fastq.gz
```
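
Before running any QC tool, it can help to look at the raw FASTQ records and confirm that both files contain the same number of reads. The following is a minimal sketch: the helper names `fastq_head` and `count_reads` are made up for illustration, and the read count assumes the usual four-lines-per-record FASTQ layout (no line-wrapped sequences).

```shell
# Print the first record (4 lines) of a gzipped FASTQ file.
fastq_head() {
    gzip -dc "$1" | head -n 4
}

# Count reads, assuming every FASTQ record spans exactly four lines.
count_reads() {
    echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}

# Report the read count for each downloaded file, if it is present.
for f in SRR957824_1.fastq.gz SRR957824_2.fastq.gz; do
    if [ -f "$f" ]; then
        echo "$f: $(count_reads "$f") reads"
    fi
done
```

For a paired-end run the two counts should be identical; a mismatch usually points to an incomplete download.
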

## FastQC

To check the quality of the sequence data we will use a tool called FastQC. With it you can check things like read length distribution, quality distribution across the read length, sequencing artifacts and much more.

FastQC has a graphical interface and can be downloaded and run on a Windows or Linux computer without installation. It is available [here](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

However, FastQC is also available as a command line utility on the training server you are using. You can load the module and execute the program as follows, where `$read1` and `$read2` are the paths to your two FASTQ files:

`module load fastqc`

`fastqc $read1 $read2`

For each input file, this produces a .zip archive containing all the plots and an HTML report that you can open in your browser.

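
The commands above leave `$read1` and `$read2` undefined. A minimal sketch of a complete invocation follows. It assumes the two files downloaded earlier sit in the current directory, and `fastqc_raw` is a hypothetical output directory name. FastQC's `-o` option expects the directory to exist already, hence the `mkdir`.

```shell
# Assumed file names: the two files downloaded earlier in this practical.
read1=SRR957824_1.fastq.gz
read2=SRR957824_2.fastq.gz
outdir=fastqc_raw   # hypothetical output directory name

# Only run when FastQC is on the PATH and both inputs are present.
if command -v fastqc >/dev/null 2>&1 && [ -f "$read1" ] && [ -f "$read2" ]; then
    mkdir -p "$outdir"
    fastqc -o "$outdir" "$read1" "$read2"
    ls "$outdir"    # list the generated reports
fi
```

Keeping the reports of the raw reads in their own directory makes it easier to compare them later with reports generated after trimming.
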
Open the HTML files with your favourite web browser and try to interpret the reports.

Pay special attention to the per base sequence quality and sequence length distribution. Explanations for the various quality modules can be found [here](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/). Also, have a look at examples of a [good](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html) and a [bad](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html) Illumina read set for comparison.

You will note that the reads in your dataset have fairly poor quality (Phred score < 20) towards the end. There are also outlier reads that have very poor quality for most of the second half of the reads.

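
To put the quality threshold in perspective: a Phred score Q corresponds to an estimated base-calling error probability of 10^(-Q/10). The sketch below uses a made-up helper, `phred_to_prob`, to convert a score into that probability with `awk`.

```shell
# Convert a Phred quality score into the estimated error probability,
# using P = 10^(-Q/10).
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

phred_to_prob 20   # prints 0.0100, i.e. 1 expected error in 100 bases
phred_to_prob 25   # prints 0.0032
```

A score below 20 therefore means that more than one base call in a hundred is expected to be wrong.
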
There are overrepresented sequences in the data. Where do they come from?

## Scythe

Scythe uses a Naive Bayesian approach to classify contaminant substrings in sequence reads. It considers quality information, which can make it robust in picking out 3'-end adapters, which often include poor quality bases.

First, install Scythe:

```
git clone https://github.com/vsbuffalo/scythe.git
cd scythe
make all
```

Then copy or move the `scythe` binary to a directory in your `$PATH`.

Scythe can be run minimally with:

`scythe -a adapter_file.fasta -o trimmed_sequences.fastq sequences.fastq`

Trim the adapters in both your read files!

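
One way to apply the minimal command to both read files is a small loop. This is only a sketch: `adapter_file.fasta` stands in for whatever adapter FASTA file you use, and the `.scythe.fastq` output naming and the `trimmed_name` helper are made up for illustration. Scythe is built against zlib, so gzipped input should work; if your build complains, decompress the files first.

```shell
adapters=adapter_file.fasta   # assumed: FASTA file with the adapter sequences

# Hypothetical helper: derive the output name for a given input file.
trimmed_name() {
    echo "${1%.fastq.gz}.scythe.fastq"
}

# Run Scythe once per read file, skipping anything that is missing.
for reads in SRR957824_1.fastq.gz SRR957824_2.fastq.gz; do
    if command -v scythe >/dev/null 2>&1 && [ -f "$reads" ] && [ -f "$adapters" ]; then
        scythe -a "$adapters" -o "$(trimmed_name "$reads")" "$reads"
    fi
done
```
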

## Sickle

Sickle is available at https://github.com/najoshi/sickle.

We will trim each read individually down to the good quality part to keep the bad part from interfering with downstream applications.

Run Sickle on your reads and set the quality threshold to 25. This means the trimmer will work its way in from both ends of each read, cutting away any bases with a quality score < 25.

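
A possible paired-end invocation is sketched below. The input names are assumptions (adapter-trimmed output of the Scythe step) and the output names are made up. `-t sanger` assumes the Phred+33 encoding used by current Illumina pipelines, and `-s` collects reads whose mate was discarded during trimming.

```shell
# Assumed names of the adapter-trimmed files from the Scythe step.
in1=SRR957824_1.scythe.fastq
in2=SRR957824_2.scythe.fastq

# Only run when Sickle is on the PATH and both inputs are present.
if command -v sickle >/dev/null 2>&1 && [ -f "$in1" ] && [ -f "$in2" ]; then
    sickle pe -f "$in1" -r "$in2" -t sanger -q 25 \
        -o SRR957824_1.trimmed.fastq \
        -p SRR957824_2.trimmed.fastq \
        -s SRR957824_singles.fastq
fi
```

Running Sickle in `pe` mode rather than on each file separately keeps the two output files in sync, so that read N in the first file is still the mate of read N in the second.
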
What did the trimming do to the per-base sequence quality, the per-sequence quality scores and the sequence length distribution?

What is the sequence duplication levels graph about? Why should you care about a high level of duplication, and why is the level of duplication very low for this data?

Based on the FastQC report, there seems to be a population of shorter reads that are technical artefacts. We will ignore them for now as they will not interfere with our analysis.
