qc.md: 14 additions & 8 deletions
@@ -7,7 +7,7 @@ The sequencing was done as paired-end 2x150bp.
## Downloading the data
-The Raw data were deposited at the European nucleotide archive, under the accession number SRR957824, go the the ENA [website](http://www.ebi.ac.uk/ena) and search for the run with the accession SRR957824. Download the two fastq files associated with the run:
+The raw data were deposited at the European Nucleotide Archive, under the accession number SRR957824. Go to the ENA [website](http://www.ebi.ac.uk/ena) and search for the run with the accession SRR957824. Download the two fastq files associated with the run:
To check the quality of the sequence data we will use a tool called FastQC. With this you can check things like read length distribution, quality distribution across the read length, sequencing artifacts and much more.
-FastQC has a graphical interface and can be downloaded and ran on a Windows or LINUX computer without installation. It is available [here](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
+FastQC has a graphical interface and can be downloaded and run on a Windows or Linux computer without installation. It is available [here](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
-However, FastQC is also available as a command line utility on the training server you are using. You can load the module and execute the program as follow:
+However, FastQC is also available as a command line utility on the training server you are using. You can load the module and execute the program as follows:
```
-module load fastqc
+module load FastQC
fastqc $read1 $read2
```
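The variables `$read1` and `$read2` are not defined in this excerpt. As a minimal sketch, assuming the two downloaded files are named `SRR957824_1.fastq.gz` and `SRR957824_2.fastq.gz` and that reports go to a hypothetical `fastqc_reports` directory (both are assumptions, adjust them to your setup), the call could look like this:

```shell
# Assumed file names; adjust to match the files you downloaded from ENA.
read1=SRR957824_1.fastq.gz
read2=SRR957824_2.fastq.gz

# Hypothetical output directory; FastQC expects it to exist already.
outdir=fastqc_reports
mkdir -p "$outdir"

# Name of the HTML report FastQC writes for a given FASTQ file:
# the .gz and .fastq extensions are dropped and _fastqc.html is appended.
report_name() {
    base=$(basename "$1")
    base=${base%.gz}
    base=${base%.fastq}
    echo "${base}_fastqc.html"
}

# Run FastQC only if it is available and both input files exist.
if command -v fastqc >/dev/null 2>&1 && [ -f "$read1" ] && [ -f "$read2" ]; then
    fastqc -o "$outdir" "$read1" "$read2"
fi

echo "Expected reports: $outdir/$(report_name "$read1") $outdir/$(report_name "$read2")"
```

Writing the reports to a separate directory with `-o` keeps them apart from the raw reads, which makes it easier to compare reports before and after trimming.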
which will produce both a .zip archive containing all the plots, and a html document for you to look at the result in your browser.
-Open the html file with your favourite web browser, and try to interpret them
+Open the html file with your favourite web browser, and try to interpret them.
Pay special attention to the per base sequence quality and sequence length distribution. Explanations for the various quality modules can be found [here](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/). Also, have a look at examples of a [good](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html) and a [bad](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html) illumina read set for comparison.
@@ -49,7 +49,9 @@ cd scythe
make all
```
-Then, copy or move "scythe" to a directory in your $PATH.
+Then, copy or move "scythe" to a directory in your $PATH, for example like this:
+
+`cp scythe $HOME/bin/`
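The copy only helps if `$HOME/bin` exists and is listed in `$PATH`, which this excerpt does not state for the training server. A minimal sketch, assuming a POSIX-compatible shell, that creates the directory and adds it to `$PATH` for the current session:

```shell
# Create a personal bin directory if it does not exist yet.
mkdir -p "$HOME/bin"

# Add it to PATH for the current session if it is not already listed.
case ":$PATH:" in
    *":$HOME/bin:"*) ;;
    *) PATH="$HOME/bin:$PATH"; export PATH ;;
esac

# Check whether the shell can now find scythe.
command -v scythe || echo "scythe not found in PATH yet"
```

To keep the change across logins, the same `PATH` line would need to go into your shell startup file; which file that is depends on the shell in use.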
Scythe can be run minimally with:
@@ -73,6 +75,10 @@ cd sickle
make
```
+Copy sickle to a directory in your $PATH:
+
+`cp sickle $HOME/bin/`
+
Sickle has two modes to work with both paired-end and single-end reads: sickle se and sickle pe.
Running sickle by itself will print the help:
@@ -95,8 +101,8 @@ What did the trimming do to the per-base sequence quality, the per sequence qual
What is the sequence duplication levels graph about? Why should you care about a high level of duplication, and why is the level of duplication very low for this data?
-Based on the FastQC report, there seems to be a population of shorter reads that are technical artefacts. We will ignore them for now as they will not interfere with our analysis.
+Based on the FastQC report, there seems to be a population of shorter reads that are technical artifacts. We will ignore them for now as they will not interfere with our analysis.
## Extra exercises
-Perform quality control on the extra datasets given by your instructors
+Perform quality control on the extra datasets given by your instructors.