Commit afbe068

added assembly
1 parent 8393705 commit afbe068

1 file changed

Lines changed: 61 additions & 7 deletions

assembly.md

@@ -8,7 +8,9 @@ Go to ENA, and search for the run ERR486840.
88

99
Download the 2 fastq files associated with the run.
1010

11-
## Quality control?
11+
## Quality control
12+
13+
Perform quality control of the data as you did in the [QC tutorial](qc.md).
1214

1315
How many reads are in the fastq file? What is the read length?
1416
Does the data need trimming or other filtering? If so, do it.
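The two questions above can be answered with a few lines of awk. This is a minimal sketch, assuming an uncompressed FASTQ file with exactly four lines per record; `fastq_stats` is a hypothetical helper name, not part of any tool used in this tutorial.

```shell
# Hypothetical helper: prints "<number of reads> <min length> <max length>"
# for an uncompressed FASTQ file with four lines per record.
# Usage: fastq_stats read_1.fastq
fastq_stats() {
    # The sequence is always the second line of each four-line record.
    awk 'NR % 4 == 2 {
             n++
             len = length($0)
             if (n == 1 || len < min) min = len
             if (len > max) max = len
         }
         END { print n + 0, min + 0, max + 0 }' "$1"
}
```

If the minimum and maximum differ, the reads have probably been trimmed already or come from a platform with variable read length.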
@@ -20,23 +22,75 @@ Based on the expected genome size, the read length and the number of reads – w
2022

2123
We will be using the SPAdes assembler to assemble our bacterium.
2224

25+
```
26+
module load spades
27+
spades.py -k 21,33,55,77,99 --careful --only-assembler -1 read_1.fastq -2 read_2.fastq -o output
28+
```
29+
2330
This will produce a series of outputs. The scaffolds will be in fasta format.
2431

2532
How well does the assembly total consensus size and coverage correspond to your earlier estimation?
2633
What is the N50 of the assembly? What does this mean?
2734
How many contigs in total did the assembly produce?
2835
How many contigs longer than 500bp? What is the N50 of those contigs only?
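To check your answers by hand, the N50 can be computed directly from the FASTA file. The sketch below is one possible approach: it sorts the contig lengths from longest to shortest and reports the length at which the running sum first reaches half of the total. `assembly_n50` is a hypothetical helper name, and the optional second argument is an assumed minimum contig length.

```shell
# Hypothetical helper: prints "<contig count> <total bases> <N50>" for the
# contigs of at least MIN_LEN bases (default 0) in a FASTA file.
# Pipeline: (1) one length per contig, sequences may span several lines;
#           (2) drop contigs shorter than MIN_LEN;
#           (3) sort longest first;
#           (4) walk down the list until half of the total is covered.
assembly_n50() {
    fasta=$1
    min_len=${2:-0}
    awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
         { len += length($0) }
         END { if (seen) print len }' "$fasta" |
    awk -v min="$min_len" '$1 >= min' |
    sort -rn |
    awk '{ lens[NR] = $1; total += $1 }
         END {
             for (i = 1; i <= NR; i++) {
                 running += lens[i]
                 if (running * 2 >= total) { print NR, total, lens[i]; exit }
             }
         }'
}
```

With the SPAdes command above, `assembly_n50 output/contigs.fasta` would report on all contigs and `assembly_n50 output/contigs.fasta 500` on those of at least 500 bp.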
2936

30-
Perform more assemblies with the following options:
37+
Perform another assembly with the following options:
38+
39+
Use the raw reads (no trimming, but with adapters removed), without the `--only-assembler` option.
40+
41+
If you have time, try out the following genome assemblers:
3142

32-
Raw reads (no trimming, but with adapters removed), let spades do the qc.
43+
44+
* MaSuRCA
45+
* Ray
3346

3447
## Comparing assemblies
3548

36-
quast
49+
QUAST is a tool that evaluates the quality of genome assemblies by computing various metrics, including:
50+
51+
* N50, length for which the collection of all contigs of that length or longer covers at least 50% of assembly length
52+
* NG50, the same measure computed against 50% of the reference genome length instead of the assembly length
53+
* NA50 and NGA50, where blocks aligned to the reference are used instead of contigs
54+
* misassemblies, and misassembled or unaligned contigs and contig bases
55+
* genes and operons covered
56+
57+
To run Quast:
58+
59+
```
60+
module load quast
61+
quast.py assembly1.fasta assembly2.fasta ... -R reference.fasta -G reference.gff
62+
```
63+
64+
Quast will produce a PDF report in the `quast_results` directory. Download it to your computer and take a look. Which assembly is better?
65+
66+
## Fixing misassemblies
67+
68+
Pilon is a software tool which can be used to automatically improve draft assemblies. It attempts to make improvements to the input genome, including:
69+
70+
* Single base differences
71+
* Small Indels
72+
* Larger Indels or block substitution events
73+
* Gap filling
74+
* Identification of local misassemblies, including optional opening of new gaps
75+
76+
Pilon then outputs a FASTA file containing an improved representation of the genome from the read data and an optional VCF file detailing variation seen between the read data and the input genome.
77+
78+
You can read how Pilon works in detail on its [Methods of Operation](https://github.com/broadinstitute/pilon/wiki/Methods-of-Operation) wiki page.
79+
80+
Before running Pilon itself, you have to map your reads back to the assembly!
81+
82+
```
83+
bowtie2-build $assembly $output/index
84+
(bowtie2 -x $output/index -1 $r1 -2 $r2 | samtools view -bS -o $mapping - ) 2> bowtie.err
85+
samtools sort -o $mapping.sorted.bam $mapping
86+
samtools index $mapping.sorted.bam
87+
```
3788

38-
### Comparing to the reference?
89+
Run Pilon with the following command:
3990

40-
Get the reference fasta here
91+
```
92+
module load pilon
93+
pilon --genome $assembly --frags $mapping.sorted.bam --output $output
94+
```
4195

42-
## Fixing misassemblies?
96+
Once Pilon is finished running, compare the new assembly with the old one using Quast!
