Commit 315c271

committed
added pan_genome tuto
1 parent d804fa2 commit 315c271

2 files changed

Lines changed: 66 additions & 0 deletions

data/pangenome.tar.gz

5.18 MB
Binary file not shown.

pan_genome.md

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
1+
# Pan-genome analysis
2+
3+
In this tutorial we will learn how to determine a pan-genome from a collection of isolate genomes.
4+
5+
This tutorial is inspired by [Genome annotation and Pangenome Analysis](https://github.com/microgenomics/tutorials/blob/master/pangenome.md) from the CBIB in Santiago, Chile.
6+
7+
## Getting the data
8+
9+
We'll use six *Listeria monocytogenes* genomes in this tutorial.
10+
11+
```bash
12+
wget pangenome.tar.gz
13+
tar xzf pangenome.tar.gz
14+
cd pangenome
15+
```
16+
17+
These genomes correspond to isolates of *L. monocytogenes* reported in:
18+
19+
> Xiangyu Deng, Adam M Phillippy, Zengxin Li, Steven L Salzberg and Wei Zhang. (2010) Probing the pan-genome of Listeria monocytogenes: new insights into intraspecific niche expansion and genomic diversification. doi:10.1186/1471-2164-11-500
20+
21+
The six genomes you downloaded were selected based on their level of completeness (finished, contigs, etc.) and their genotype (type I-III):
22+
23+
Genome Assembly | Genome Accession | Genotype | Sequenced by | Status
24+
--- | --- | --- | --- | ---
25+
GCA_000026705 | FM242711 | type I | Institut_Pasteur | Finished
26+
GCA_000008285 | AE017262 | type I | TIGR | Finished
27+
GCA_000168815 | AATL00000000 | type I | Broad Institute | 79 contigs
28+
GCA_000196035 | AL591824 | type II | European Consortium | Finished
29+
GCA_000168635 | AARW00000000 | type II | Broad Institute | 25 contigs
30+
GCA_000021185 | CP001175 | type III | MSU | Finished
31+
32+
## Annotation of the genomes
33+
34+
Annotating a genome means adding information about its genes: their location, strand, and other features and attributes. Now that you have the genomes, we will annotate them with Prokka.
35+
36+
```
37+
prokka --kingdom Bacteria --outdir prokka_GCA_000008285 --genus Listeria --locustag GCA_000008285 GCA_000008285.1_ASM828v1_genomic.fna
38+
```
39+
40+
Annotate all six genomes, replacing the `--outdir` and `--locustag` values and the FASTA file name accordingly.
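
Rather than editing the command six times by hand, the annotation can be run in a loop. The following is a minimal sketch, not part of the original tutorial: the helper name `assembly_id` is hypothetical, and it assumes every FASTA file in the current directory follows the same naming pattern as `GCA_000008285.1_ASM828v1_genomic.fna`, so that the text before the first dot can serve as both the locus tag and the output directory suffix.

```bash
# Hypothetical helper: derive the assembly ID from a FASTA file name.
# Assumes names look like <assembly>.<version>_<name>_genomic.fna,
# e.g. GCA_000008285.1_ASM828v1_genomic.fna -> GCA_000008285
assembly_id() {
    base=$(basename "$1")
    printf '%s\n' "${base%%.*}"   # drop everything from the first dot onwards
}

# Annotate every matching genome with the same Prokka options as above.
for fasta in *_genomic.fna; do
    [ -e "$fasta" ] || continue   # skip the unexpanded pattern when nothing matches
    id=$(assembly_id "$fasta")
    prokka --kingdom Bacteria --genus Listeria \
        --outdir "prokka_${id}" --locustag "$id" "$fasta"
done
```

If your files are named differently, adjust the glob pattern and the helper accordingly.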
41+
42+
## Pan-genome analysis
43+
44+
Put all the `.gff` files in the same folder (e.g., `./gff`) and run Roary:
45+
46+
`roary -f results -e -n -v gff/*.gff`
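
Before running Roary, the GFF files have to be gathered in one place. A sketch of one way to do it is shown here, under two assumptions that are not stated in the tutorial: the Prokka output directories are named `prokka_<assembly>` as in the annotation command, and each contains a single `.gff` file. Because Prokka names its output files with a generic prefix by default, the hypothetical helper `collect_gff` renames each copy after its assembly so the files do not overwrite one another.

```bash
# Hypothetical helper: copy the GFF file from each prokka_<assembly>
# directory under $1 into $2, renamed to <assembly>.gff.
collect_gff() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    for dir in "$src"/prokka_*/; do
        [ -d "$dir" ] || continue      # nothing matched the pattern
        id=${dir%/}                    # strip the trailing slash
        id=${id##*/prokka_}            # keep what follows "prokka_"
        for gff in "$dir"*.gff; do
            [ -e "$gff" ] || continue
            cp "$gff" "$dest/$id.gff"
        done
    done
}

# Collect from the current directory into ./gff
collect_gff . gff
```

Naming the files after the assemblies also keeps the isolate labels in the Roary output readable.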
47+
48+
Roary will extract all the coding sequences, translate them into protein sequences, and create pre-clusters. Then, using BLASTP and MCL, Roary will create clusters and check for paralogs. Finally, Roary will take every isolate and order them by presence/absence of orthologs. The summary output is written to the `summary_statistics.txt` file.
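
The summary file can be inspected directly from the shell. The sketch below rests on assumptions about its layout that you should check against your own output: that rows are tab-separated, that the gene count is in the last column, and that the rows are labelled `Core genes` and `Total genes`. The function names are hypothetical, and the path assumes the `results` output folder used in the Roary command.

```bash
# Print "<category><TAB><count>" for each row of the summary file.
# Assumes tab-separated rows with the gene count in the last column.
summarise_roary() {
    awk -F '\t' 'NF >= 2 { print $1 "\t" $NF }' "$1"
}

# Percentage of all gene clusters that are core genes.
# Assumes rows labelled exactly "Core genes" and "Total genes".
core_percentage() {
    awk -F '\t' '
        $1 == "Core genes"  { core = $NF }
        $1 == "Total genes" { total = $NF }
        END { if (total > 0) printf "%.1f\n", 100 * core / total }
    ' "$1"
}

if [ -f results/summary_statistics.txt ]; then
    summarise_roary results/summary_statistics.txt
    core_percentage results/summary_statistics.txt
fi
```

A high core percentage suggests the isolates share most of their gene content, while a low one points to a large accessory genome.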
49+
50+
Additionally, Roary produces a `gene_presence_absence.csv` file that can be opened in any spreadsheet software to manually explore the results. In this file, you will find information such as gene name and gene annotation, and, of course, whether a gene is present in a genome or not.
51+
52+
## Plotting the result
53+
54+
Roary comes with a Python script that allows you to generate a few plots to graphically assess your analysis output.
55+
56+
First, we need to build a tree file from the core gene alignment produced by Roary:
57+
58+
```
59+
raxmlHPC -m GTRGAMMA -p 12345 -s core_gene_alignment.aln -n core.tre
60+
```
61+
62+
Then we can plot the Roary results:
63+
64+
```
65+
roary_plots.py RAxML_bestTree.core.tre gene_presence_absence.csv
66+
```
