|
| 1 | +# File Formats |
| 2 | + |
| 3 | +This lecture introduces the most popular file formats used in bioinformatics. You are expected to have a basic working knowledge of Linux to follow the lesson. |
| 4 | + |
| 5 | +## Table of Contents |
| 6 | +* [The fasta format](#the-fasta-format) |
| 7 | +* [The fastq format](#the-fastq-format) |
| 8 | +* [The sam/bam format](#the-sambam-format) |
| 9 | +* [The vcf format](#the-vcf-format) |
| 10 | +* [The gff format](#the-gff-format) |
| 11 | + |
| 12 | +### The fasta format |
| 13 | + |
| 14 | +The fasta format was invented in 1988 and designed to represent nucleotide or peptide sequences. It originates from the [FASTA](https://en.wikipedia.org/wiki/FASTA) software package, but is now a standard in the world of bioinformatics. |
| 15 | + |
| 16 | +Each record in a fasta file starts with a header line: a ">" (greater-than) symbol followed by the identifier and an optional description of the sequence. The lines after the header contain the sequence itself in the standard one-letter code. |
| 17 | + |
| 18 | +A few sample sequences: |
| 19 | + |
| 20 | +``` |
| 21 | +>KX580312.1 Homo sapiens truncated breast cancer 1 (BRCA1) gene, exon 15 and partial cds |
| 22 | +GTCATCCCCTTCTAAATGCCCATCATTAGATGATAGGTGGTACATGCACAGTTGCTCTGGGAGTCTTCAG |
| 23 | +AATAGAAACTACCCATCTCAAGAGGAGCTCATTAAGGTTGTTGATGTGGAGGAGTAACAGCTGGAAGAGT |
| 24 | +CTGGGCCACACGATTTGACGGAAACATCTTACTTGCCAAGGCAAGATCTAG |
| 25 | +``` |
| 26 | + |
| 27 | +``` |
| 28 | +>KRN06561.1 heat shock [Lactobacillus sucicola DSM 21376 = JCM 15457] |
| 29 | +MSLVMANELTNRFNNWMKQDDFFGNLGRSFFDLDNSVNRALKTDVKETDKAYEVRIDVPGIDKKDITVDY |
| 30 | +HDGVLSVNAKRDSFNDESDSEGNVIASERSYGRFARQYSLPNVDESGIKAKCEDGVLKLTLPKLAEEKIN |
| 31 | +GNHIEIE |
| 32 | +``` |
| 33 | + |
| 34 | +A fasta file can contain multiple sequences. Each sequence is introduced by its own "header" line, which starts with ">". |
| 35 | + |
| 36 | +Example: |
| 37 | + |
| 38 | +``` |
| 39 | +>KRN06561.1 heat shock [Lactobacillus sucicola DSM 21376 = JCM 15457] |
| 40 | +MSLVMANELTNRFNNWMKQDDFFGNLGRSFFDLDNSVNRALKTDVKETDKAYEVRIDVPGIDKKDITVDY |
| 41 | +HDGVLSVNAKRDSFNDESDSEGNVIASERSYGRFARQYSLPNVDESGIKAKCEDGVLKLTLPKLAEEKIN |
| 42 | +GNHIEIE |
| 43 | +>3HHU_A Chain A, Human Heat-Shock Protein 90 (Hsp90) |
| 44 | +MPEETQTQDQPMEEEEVETFAFQAEIAQLMSLIINTFYSNKEIFLRELISNSSDALDKIRYESLTDPSKL |
| 45 | +DSGKELHINLIPNKQDRTLTIVDTGIGMTKADLINNLGTIAKSGTKAFMEALQAGADISMIGQFGVGFYS |
| 46 | +AYLVAEKVTVITKHNDDEQYAWESSAGGSFTVRTDTGEPMGRGTKVILHLKEDQTEYLEERRIKEIVKKH |
| 47 | +SQFIGYPITLFVEK |
| 48 | +``` |
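The header/sequence layout described above can be read with a few lines of Python. The following is only a minimal sketch: the function name `parse_fasta` and the toy records are made up for illustration, and it assumes the whole file fits in memory. For real work an established library is usually a better choice.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header line (without the leading '>') to its full sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header line starts a new record.
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            records[header] += line
    return records


# Hypothetical toy input with two records (not real sequences).
example = """\
>seq1 first toy record
ACGTAC
GTTA
>seq2 second toy record
MSLVMANE
"""

records = parse_fasta(example.splitlines())
for header, sequence in records.items():
    # Print the identifier (first word of the header) and the sequence length.
    print(header.split()[0], len(sequence))
```

To read a file instead of a string, the open file object can be passed directly, for example `parse_fasta(open("sequences.fasta"))`, where the file name is a placeholder.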
| 49 | + |
| 50 | +### The fastq format |
| 51 | + |
| 52 | +The fastq format is also a text-based format for representing nucleotide sequences, but it additionally stores the corresponding quality of each nucleotide. It is the standard for storing the output of high-throughput sequencing instruments such as Illumina machines. |
| 53 | + |
| 54 | +A fastq file uses four lines per sequence: |
| 55 | + |
| 56 | +* Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line). |
| 57 | +* Line 2 is the raw sequence letters. |
| 58 | +* Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again. |
| 59 | +* Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence. |
| 60 | + |
| 61 | +An example sequence in fastq format: |
| 62 | + |
| 63 | +``` |
| 64 | +@SEQ_ID |
| 65 | +GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT |
| 66 | ++ |
| 67 | +!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65 |
| 68 | +``` |
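The four-line structure can be checked and parsed as sketched below. This is an illustrative sketch rather than a complete parser: the name `parse_fastq` and the toy read are hypothetical, and it assumes that every record spans exactly four non-empty lines (sequences are not wrapped).

```python
from typing import List, Tuple


def parse_fastq(text: str) -> List[Tuple[str, str, str]]:
    """Return (identifier, sequence, quality) tuples from FASTQ text."""
    # Assumption: every record spans exactly four non-empty lines.
    lines = [line for line in text.splitlines() if line]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per record")

    records = []
    for i in range(0, len(lines), 4):
        title, sequence, separator, quality = lines[i:i + 4]
        if not title.startswith("@"):
            raise ValueError(f"line 1 must start with '@': {title!r}")
        if not separator.startswith("+"):
            raise ValueError(f"line 3 must start with '+': {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Drop the leading '@' from the title line.
        records.append((title[1:], sequence, quality))
    return records


# Hypothetical toy read, not taken from a real sequencing run.
example = """\
@read1 toy example
GATTACA
+
!''*((5
"""

reads = parse_fastq(example)
for identifier, sequence, quality in reads:
    print(identifier, sequence, quality)
```

The length check mirrors the rule for line 4: the quality string must contain exactly one symbol per letter of the sequence.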
| 69 | + |
| 70 | +#### Quality |
| 71 | + |
| 72 | +The quality, also called the Phred score, encodes the probability that the corresponding base call is incorrect. |
| 73 | + |
| 74 | +Phred scores use a logarithmic scale: a score Q corresponds to an error probability of 10^(-Q/10). They are represented by ASCII characters, mapping to a quality that usually ranges from 0 to 40. |
| 75 | + |
| 76 | +Phred Quality Score | Probability of incorrect base call | Base call accuracy |
| 77 | +- | - | - |
| 78 | +10 | 1 in 10 | 90% |
| 79 | +20 | 1 in 100 | 99% |
| 80 | +30 | 1 in 1000 | 99.9% |
| 81 | +40 | 1 in 10,000 | 99.99% |
| 82 | +50 | 1 in 100,000 | 99.999% |
| 83 | +60 | 1 in 1,000,000 | 99.9999% |
| 84 | + |
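The relation between the ASCII characters, the Phred scores, and the error probabilities in the table can be sketched in Python. The sketch assumes the common "Phred+33" encoding, where the score is the ASCII code of the character minus 33; this offset is an assumption not stated in this lecture, and some older files use a different one. The function names are made up for illustration.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integer Phred scores.

    Assumption: Phred+33 encoding (offset 33) unless told otherwise.
    """
    return [ord(char) - offset for char in quality]


def error_probability(score: int) -> float:
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


# Hypothetical quality string chosen to hit round scores.
scores = phred_scores("!+5?I")
print(scores)  # [0, 10, 20, 30, 40]

for score in scores:
    # Show each score next to its error probability and accuracy.
    p = error_probability(score)
    print(f"Q{score}: error probability {p:.4f}, accuracy {100 * (1 - p):.2f}%")
```

A score of 0 corresponds to an error probability of 1, which is why the lowest character `!` marks a base call that carries no confidence at all.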
| 85 | +### The sam/bam format |
| 86 | + |
| 87 | + |
| 88 | + |
| 89 | +### The vcf format |
| 90 | + |
| 91 | +### The gff format |