|
| 1 | +# Quality control |
| 2 | + |
| 3 | +In this practical you will learn to import, view and check the quality of NGS read data in FASTQ format. |
| 4 | + |
| 5 | +You will be working with an Illumina MiSeq read dataset from a genome sequence project. The sequenced organism is an enterohaemorrhagic E. coli (EHEC) of the serotype O157, a potentially fatal gastrointestinal pathogen. The sequenced bacterium was part of an outbreak investigation in the St. Louis area, USA in 2011. |
| 6 | +The sequencing was done as paired-end 2x150bp. |
| 7 | + |
| 8 | +## Downloading the data |
| 9 | + |
| 10 | +The raw data were deposited at the European Nucleotide Archive (ENA) under the accession number SRR957824. Go to the ENA [website](http://www.ebi.ac.uk/ena) and search for the run with the accession SRR957824. Download the two FASTQ files associated with the run: |
| 11 | + |
| 12 | +``` |
| 13 | +wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR957/SRR957824/SRR957824_1.fastq.gz |
| 14 | +wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR957/SRR957824/SRR957824_2.fastq.gz |
| 15 | +``` |
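Before going further, it can be worth checking that both files of the pair downloaded completely. Paired-end files should hold the same number of reads, so a mismatch usually points to a truncated download. The sketch below is only an illustration: the function names `count_reads` and `check_pair` are made up for this practical, and it assumes the usual four-line FASTQ record layout.

```shell
# count_reads FILE: print the number of reads in a gzip-compressed FASTQ file.
# Assumes four lines per record (no wrapped sequence lines).
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# check_pair R1 R2: report both read counts and succeed only if they match.
check_pair() {
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    echo "$1: $n1 reads"
    echo "$2: $n2 reads"
    [ "$n1" = "$n2" ]
}
```

For this dataset you would call it as `check_pair SRR957824_1.fastq.gz SRR957824_2.fastq.gz`.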
| 16 | + |
| 17 | +## FastQC |
| 18 | + |
| 19 | +To check the quality of the sequence data we will use a tool called FastQC. With this you can check things like read length distribution, quality distribution across the read length, sequencing artifacts and much more. |
| 20 | + |
| 21 | +FastQC has a graphical interface and can be downloaded and run on a Windows or Linux computer without installation. It is available [here](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). |
| 22 | + |
| 23 | +However, FastQC is also available as a command line utility on the training server you are using. You can load the module and execute the program as follows: |
| 24 | + |
| 25 | +`module load fastqc` |
| 26 | +`fastqc $read1 $read2` |
| 27 | + |
| 28 | +For each input file, this produces a .zip archive containing all the plots and an HTML report that you can view in your browser. |
| 29 | + |
| 30 | +Open the HTML files with your favourite web browser and try to interpret them. |
| 31 | + |
| 32 | +Pay special attention to the per base sequence quality and sequence length distribution. Explanations for the various quality modules can be found [here](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/). Also, have a look at examples of a [good](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html) and a [bad](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html) Illumina read set for comparison. |
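If you prefer a quick overview on the command line, the FastQC .zip archive also contains a tab-separated `summary.txt` with one PASS, WARN or FAIL status per module. The sketch below lists only the modules that did not pass. The function name `flagged_modules` is hypothetical, and the archive and folder names in the usage note assume FastQC's default output naming.

```shell
# flagged_modules SUMMARY: list FastQC modules whose status is not PASS.
# Expects the tab-separated summary.txt from a FastQC .zip archive
# (columns: status, module name, input file name).
flagged_modules() {
    awk -F '\t' '$1 != "PASS" { printf "%s\t%s\n", $1, $2 }' "$1"
}
```

You could extract the summary with `unzip -p SRR957824_1_fastqc.zip SRR957824_1_fastqc/summary.txt > summary_1.txt` and then run `flagged_modules summary_1.txt`. The HTML report remains the place to look at the actual plots.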
| 33 | + |
| 34 | +You will note that the reads in your dataset have fairly poor quality (<20) towards the end. There are also outlier reads that have very poor quality for most of the second half of the read. |
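To see where these numbers come from, you can decode the quality strings yourself. Each character on the fourth line of a FASTQ record encodes one Phred score. The sketch below assumes the Phred+33 encoding, where the score is the ASCII code of the character minus 33, and prints the mean score per read position. It is a simplified illustration of what the per base sequence quality plot summarises, not a reimplementation of FastQC, and the function name is hypothetical.

```shell
# mean_quality_by_position FILE: print "position<TAB>mean Phred score" for a
# gzip-compressed FASTQ file. Assumes Phred+33 encoding and four-line records.
mean_quality_by_position() {
    gzip -dc "$1" | awk '
        BEGIN {
            # Lookup table from printable ASCII character to its code.
            for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
        }
        NR % 4 == 0 {
            # Every fourth line holds the quality string of one read.
            n = length($0)
            for (p = 1; p <= n; p++) {
                sum[p] += ord[substr($0, p, 1)] - 33
                cnt[p]++
            }
            if (n > maxlen) maxlen = n
        }
        END {
            for (p = 1; p <= maxlen; p++)
                printf "%d\t%.1f\n", p, sum[p] / cnt[p]
        }'
}
```

Running `mean_quality_by_position SRR957824_1.fastq.gz | tail` shows the mean scores for the last read positions, where the drop in quality should be visible.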
| 35 | + |
| 36 | +There are overrepresented sequences in the data. Where do they come from? |
| 37 | + |
| 38 | +## Scythe |
| 39 | + |
| 40 | +Scythe uses a Naive Bayesian approach to classify contaminant substrings in sequence reads. It considers quality information, which can make it robust in picking out 3'-end adapters, which often include poor quality bases. |
| 41 | + |
| 42 | +First, install scythe: |
| 43 | + |
| 44 | +``` |
| 45 | +git clone https://github.com/vsbuffalo/scythe.git |
| 46 | +cd scythe |
| 47 | +make all |
| 48 | +``` |
| 49 | + |
| 50 | +Then, copy or move "scythe" to a directory in your $PATH. |
| 51 | + |
| 52 | +Scythe can be run minimally with: |
| 53 | + |
| 54 | +`scythe -a adapter_file.fasta -o trimmed_sequences.fastq sequences.fastq` |
| 55 | + |
| 56 | +Trim the adapters in both your read files! |
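One way to apply the minimal command to both read files is a small loop. This is only a sketch: the function name, the `adapter_trimmed_` output prefix and the adapter file name `adapters.fasta` in the usage note are placeholders, and you need to supply a FASTA file with the adapter sequences that match your library preparation.

```shell
# trim_adapters_pair ADAPTERS READS...: run scythe once per read file.
# Output naming is a hypothetical convention:
# "adapter_trimmed_" followed by the input file name without ".gz".
trim_adapters_pair() {
    adapters=$1
    shift
    for reads in "$@"; do
        out="adapter_trimmed_$(basename "$reads" .gz)"
        scythe -a "$adapters" -o "$out" "$reads"
    done
}
```

For example: `trim_adapters_pair adapters.fasta SRR957824_1.fastq.gz SRR957824_2.fastq.gz`. If your scythe build does not accept gzip-compressed input, decompress the files with `gunzip` first.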
| 57 | + |
| 58 | +## Sickle |
| 59 | + |
| 60 | +Sickle is available from https://github.com/najoshi/sickle. |
| 61 | + |
| 62 | +We will trim each read individually down to the good quality part to keep the bad part from interfering with downstream applications. |
| 63 | + |
| 64 | +Run Sickle on the adapter-trimmed reads and set the quality threshold to 25. This means the trimmer will work its way in from both ends of each read, cutting away any bases with a quality score < 25. |
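The exact Sickle command is not given here, so the following is a sketch of what a paired-end run with a threshold of 25 might look like. It assumes Phred+33 qualities (`-t sanger`), which is what recent Illumina pipelines produce. The function name and the output file names are hypothetical placeholders. In paired-end mode, reads whose mate was discarded end up in the singles file given with `-s`.

```shell
# quality_trim_pair R1 R2 [THRESHOLD]: paired-end quality trimming with sickle.
# -t sanger assumes Phred+33 qualities (Illumina CASAVA >= 1.8);
# the output file names are hypothetical placeholders.
quality_trim_pair() {
    threshold=${3:-25}
    sickle pe \
        -f "$1" -r "$2" \
        -t sanger \
        -q "$threshold" \
        -o trimmed_1.fastq \
        -p trimmed_2.fastq \
        -s trimmed_singles.fastq
}
```

Call it with your two adapter-trimmed files, for example `quality_trim_pair reads_1.fastq reads_2.fastq`, then rerun FastQC on the trimmed output to answer the questions below.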
| 65 | + |
| 66 | + |
| 67 | +What did the trimming do to the per-base sequence quality, the per sequence quality scores and the sequence length distribution? |
| 68 | + |
| 69 | +What is the sequence duplication levels graph about? Why should you care about a high level of duplication, and why is the level of duplication very low for this data? |
| 70 | + |
| 71 | +Based on the FastQC report, there seems to be a population of shorter reads that are technical artefacts. We will ignore them for now as they will not interfere with our analysis. |