You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: file_formats.md
+100-8Lines changed: 100 additions & 8 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -73,19 +73,111 @@ The quality, also called phred scores is the probability that the corresponding
73
73
74
74
Phred scores use a logarithmic scale, and are represented by ASCII characters, mapping to a quality usually going from 0 to 40
75
75
76
-
Phred Quality Score | Probability of incorrect base call | Base call accuracy
77
-
- | - | -
78
-
10 | 1 in 10 | 90%
79
-
20 | 1 in 100 | 99%
80
-
30 | 1 in 1000 | 99.9%
81
-
40 | 1 in 10,000 | 99.99%
82
-
50 | 1 in 100,000 | 99.999%
83
-
60 | 1 in 1,000,000 | 99.9999%
76
+
| Phred Quality Score | Probability of incorrect base call | Base call accuracy
77
+
| --- | --- | ---
78
+
| 10 | 1 in 10 | 90%
79
+
| 20 | 1 in 100 | 99%
80
+
| 30 | 1 in 1000 | 99.9%
81
+
| 40 | 1 in 10,000 | 99.99%
82
+
| 50 | 1 in 100,000 | 99.999%
83
+
| 60 | 1 in 1,000,000 | 99.9999%
84
84
85
85
### the sam/bam format
86
86
87
+
From [Wikipedia](https://en.wikipedia.org/wiki/SAM_(file_format)):
87
88
89
+
SAM (file format) is a text-based format for storing biological sequences aligned to a reference sequence developed by Heng Li. The acronym SAM stands for Sequence Alignment/Map. It is widely used for storing data, such as nucleotide sequences, generated by Next generation sequencing technologies and usually mapped to a reference.
90
+
91
+
The SAM format consists of a header and an alignment section. The binary representation of a SAM file is a BAM file, which is a compressed SAM file.[1] SAM files can be analysed and edited with the software SAMtools.
92
+
93
+
The SAM format has a really extensive and complex specification that you can find [here](https://samtools.github.io/hts-specs/SAMv1.pdf)
94
+
95
+
In brief it consists in a header section and reads (with other information) in tab delimited format.
the vcf is also a text-based file format. VCF stands for Variant Call Format and is used to store gene sequence variations (SNVs, indels). The format hs been developped for genotyping projects, and is the standard to represent variations in the genome of a species.
115
+
116
+
A vcf is a tab-delimited file with 9 columns, described [here](https://samtools.github.io/hts-specs/VCFv4.2.pdf)
117
+
118
+
#### VCF Example
119
+
120
+
```
121
+
##fileformat=VCFv4.0
122
+
##fileDate=20110705
123
+
##reference=1000GenomesPilot-NCBI37
124
+
##phasing=partial
125
+
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
The general feature format (gff) is another text file format. used for describing genes and other features of DNA, RNA and protein sequences. It is the standrad for annotation of genomes.
167
+
168
+
A gff file should, as a vcf, conatins 9 columns described [here](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md)
169
+
170
+
#### Example gff
171
+
172
+
```
173
+
##description: evidence-based annotation of the human genome (GRCh38), version 25 (Ensembl 85)
0 commit comments