Skip to content

Commit 26927e1

Browse files
author
Jonas Ohlsson
committed
fix typos.
1 parent f22bf12 commit 26927e1

1 file changed

Lines changed: 12 additions & 12 deletions

File tree

file_formats.md

Lines changed: 12 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
# File Formats
22

3-
This lecture is aimed at making you discover the most popular file formats used in bioinforatics. You're expected to have basic working knowledge of linux to be able to follow the lesson.
3+
This lecture is aimed at making you discover the most popular file formats used in bioinformatics. You're expected to have basic working knowledge of Linux to be able to follow the lesson.
44

55
## Table of Contents
66
* [The fasta format](#the-fasta-format)
@@ -11,7 +11,7 @@ This lecture is aimed at making you discover the most popular file formats used
1111

1212
### The fasta format
1313

14-
The fasta format was invented in 1988 and designed to represent nucleotide or peptide sequences. It originates from the [FASTA](https://en.wikipedia.org/wiki/FASTA) software package, but is now a standard in the world of bioinformatics
14+
The fasta format was invented in 1988 and designed to represent nucleotide or peptide sequences. It originates from the [FASTA](https://en.wikipedia.org/wiki/FASTA) software package, but is now a standard in the world of bioinformatics.
1515

1616
The first line in a FASTA file starts with a ">" (greater-than) symbol followed by the description or identifier of the sequence. Following the initial line (used for a unique description of the sequence) is the actual sequence itself in standard one-letter code.
1717

@@ -31,7 +31,7 @@ HDGVLSVNAKRDSFNDESDSEGNVIASERSYGRFARQYSLPNVDESGIKAKCEDGVLKLTLPKLAEEKIN
3131
GNHIEIE
3232
```
3333

34-
A fasta file can contain multiple sequence. Each sequence will be separated by their "header" line, starting by ">"
34+
A fasta file can contain multiple sequence. Each sequence will be separated by their "header" line, starting by ">".
3535

3636
Example:
3737

@@ -49,7 +49,7 @@ SQFIGYPITLFVEK
4949

5050
### The fastq format
5151

52-
The fastq format is also a text based format to represent nucleotide sequences, but also contains the coresponding quality of each nucleotide. It is the standard for storing the output of high-throughput sequencing instruments such as the Illumina machines.
52+
The fastq format is also a text based format to represent nucleotide sequences, but also contains the corresponding quality of each nucleotide. It is the standard for storing the output of high-throughput sequencing instruments such as the Illumina machines.
5353

5454
A fastq file uses four lines per sequence:
5555

@@ -69,9 +69,9 @@ GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
6969

7070
#### Quality
7171

72-
The quality, also called phred scores is the probability that the corresponding basecall is incorrect.
72+
The quality, also called phred score, is the probability that the corresponding basecall is incorrect.
7373

74-
Phred scores use a logarithmic scale, and are represented by ASCII characters, mapping to a quality usually going from 0 to 40
74+
Phred scores use a logarithmic scale, and are represented by ASCII characters, mapping to a quality usually going from 0 to 40.
7575

7676
| Phred Quality Score | Probability of incorrect base call | Base call accuracy
7777
| --- | --- | ---
@@ -90,9 +90,9 @@ SAM (file format) is a text-based format for storing biological sequences aligne
9090

9191
The SAM format consists of a header and an alignment section. The binary representation of a SAM file is a BAM file, which is a compressed SAM file.[1] SAM files can be analysed and edited with the software SAMtools.
9292

93-
The SAM format has a really extensive and complex specification that you can find [here](https://samtools.github.io/hts-specs/SAMv1.pdf)
93+
The SAM format has a really extensive and complex specification that you can find [here](https://samtools.github.io/hts-specs/SAMv1.pdf).
9494

95-
In brief it consists in a header section and reads (with other information) in tab delimited format.
95+
In brief it consists of a header section and reads (with other information) in tab delimited format.
9696

9797
#### Example header section
9898

@@ -111,9 +111,9 @@ M01137:130:00-A:17009:1352/14 * 0 0 * * 0 0 AGCAAAATACAACGATCTGGATGGTAGCATTAGCGA
111111

112112
### the vcf format
113113

114-
the vcf is also a text-based file format. VCF stands for Variant Call Format and is used to store gene sequence variations (SNVs, indels). The format hs been developped for genotyping projects, and is the standard to represent variations in the genome of a species.
114+
The vcf format is also a text-based file format. VCF stands for Variant Call Format and is used to store gene sequence variations (SNVs, indels). The format has been developped for genotyping projects, and is the standard to represent variations in the genome of a species.
115115

116-
A vcf is a tab-delimited file with 9 columns, described [here](https://samtools.github.io/hts-specs/VCFv4.2.pdf)
116+
A vcf is a tab-delimited file, described [here](https://samtools.github.io/hts-specs/VCFv4.2.pdf).
117117

118118
#### VCF Example
119119

@@ -163,9 +163,9 @@ chr2 215634055 . C T
163163

164164
### the gff format
165165

166-
The general feature format (gff) is another text file format. used for describing genes and other features of DNA, RNA and protein sequences. It is the standrad for annotation of genomes.
166+
The general feature format (gff) is another text file format, used for describing genes and other features of DNA, RNA and protein sequences. It is the standard for annotation of genomes.
167167

168-
A gff file should, as a vcf, conatins 9 columns described [here](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md)
168+
A gff file should contain 9 columns, described [here](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md)
169169

170170
#### Example gff
171171

0 commit comments

Comments
 (0)