Commit 19e4923

expanded the documentation
1 parent cd624ce commit 19e4923

1 file changed

Lines changed: 39 additions & 4 deletions

README.md

@@ -1,13 +1,48 @@
11
# Codon optimizer
22

3-
## Purpose
4-
Generating a DNA sequence optimized for one or more organisms.
3+
<br>
4+
5+
## Overview
6+
This program performs codon optimization for multiple species at once, starting from either an amino acid sequence or a coding DNA sequence. It takes into account the species you select, the cut sites you want to avoid, and some other problematic characteristics, and outputs an optimized DNA sequence.
7+
8+
<br>
59

610
## Usage
11+
### Species weights
12+
Selecting the checkbox shows all of the species you can optimize for. A species with a weight of zero won't be considered, while every species with a larger weight is given proportional consideration. All of the weights are added together, and each species contributes its weight divided by the total.
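
As an illustration of the weighting described above, here is a minimal Python sketch. The function name `normalize_weights` and the species names are made up for the example; they are not taken from the program.

```python
def normalize_weights(weights):
    """Return each species' share of the total weight.

    Species with a weight of zero are dropped, mirroring the rule that
    they are not considered. (Hypothetical helper, not the program's code.)
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one species needs a positive weight")
    return {species: w / total for species, w in weights.items() if w > 0}


# Illustrative weights only.
shares = normalize_weights({"E. coli": 2, "S. cerevisiae": 1, "H. sapiens": 0})
print(shares)
```

With these weights, the first species contributes two thirds of the combined preference, the second one third, and the third is ignored.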
13+
### Restriction sites
14+
Selecting the checkbox shows all of the restriction enzyme cut sites you can choose to avoid. Selecting a large number of sites may result in too much rare codon usage or a failure to avoid a site, but both will be recognizable in the color coding of the generated sequence.
715
### "DNA here" box
816
Paste in the coding sequence you want to be optimized (preferably including the stop codon), and the program will convert it into amino acids and then back into an optimized DNA sequence for you to use.
917
### "Amino acids here" box
1018
Paste in the 1-letter amino acid sequence of your protein of interest (preferably with a stop codon represented as an asterisk), and the program will generate a DNA sequence optimized for your organism(s) of interest.
19+
### "Show codon preferences" button
20+
This gives you a preview of the target percentages of codon usage that will be used while generating the optimized DNA sequence. The first value after each codon can be compared to the species-specific preferences following it to help you decide how to adjust your weights.
21+
### "Optimize" button
22+
This begins the optimization process. You should expect it to take several seconds.
23+
24+
<br>
25+
26+
## Interpreting the output
27+
### Score
28+
The score shows the quality of the sequence, especially how well it avoids undesirable sequences. A perfect score is very difficult to attain, but a useful sequence is not.
29+
### GC content
30+
The GC content helps you double-check the sequence viability. Unsurprisingly, it tells you the percentage of the sequence which consists of guanine or cytosine.
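
For reference, GC content can be computed as below. This is a small Python sketch with a hypothetical function name, not the program's own code.

```python
def gc_content(sequence):
    """Return the percentage of bases in `sequence` that are g or c."""
    seq = sequence.lower()  # capitalization is ignored
    if not seq:
        raise ValueError("empty sequence")
    gc = sum(1 for base in seq if base in "gc")
    return 100.0 * gc / len(seq)


print(round(gc_content("ATGGCC"), 2))
```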
31+
### Color coding
32+
The output should be primarily green with occasional blue and a proportional gradient in between. These colors represent the average prevalence of the codons across your selected species. Abundant blue suggests a sequence that is likely to have translation difficulties. Orange or red shouldn't be present and is an indication that the program struggled to optimize your sequence. Orange corresponds to "problems" which are terminators or strong ribosome binding sites. Red corresponds to "bad problems" which contain an enzyme cut site. Palindromic cut sites are checked on both DNA strands.
33+
34+
<br>
1135

12-
## Interpreting the DNA output
13-
The score shows how well the sequence meets your criteria, and the GC content helps you double-check the sequence viability. The nucleotides are color-coded per codon based on the overall rarity of the codon: more green is more common whereas more blue is less common. Occasional blue is expected, but abundant blue may lead to decreased translation efficiency.
36+
## Algorithm
37+
### Overview
38+
80 attempts are made to generate a sequence from scratch, and if no sequence is perfect (the most likely outcome), the highest-scoring sequence is given 80 attempts to fix the problem areas. Each sequence starts with a score of 100, and every time any imperfection is found, a certain amount is subtracted. It's rather unlikely to get a perfect score, but the reason the score was decreased could be rather minor.
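
The two-phase search described above could be sketched roughly as follows in Python. The callables `generate`, `score`, and `repair` are placeholders for steps this README does not spell out, and keeping a repaired candidate only when it scores higher is an assumption.

```python
def optimize(generate, score, repair, attempts=80):
    """Pick the best of `attempts` fresh sequences, then try to repair it.

    generate() -> sequence, score(sequence) -> number (100 is perfect),
    repair(sequence) -> sequence. All three are hypothetical callables.
    """
    # Phase 1: generate sequences from scratch and keep the highest-scoring one.
    best = max((generate() for _ in range(attempts)), key=score)
    # Phase 2: give the best sequence up to `attempts` repair passes.
    for _ in range(attempts):
        if score(best) >= 100:
            break  # already perfect, nothing left to fix
        candidate = repair(best)
        if score(candidate) > score(best):
            best = candidate
    return best
```

A toy run with stand-in callables, where each "x" costs 10 points and each repair pass removes one "x":

```python
result = optimize(
    generate=lambda: "axx",
    score=lambda s: 100 - 10 * s.count("x"),
    repair=lambda s: s.replace("x", "a", 1),
)
```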
39+
### Codon preferences
40+
The preferences used indicate how much a given codon is used compared to the other codons which code for the same amino acid. When combining the preferences of multiple species, the end result is generally a weighted average calculated from the specified weights of the species; however, low preferences are calculated differently. When a preference is below 11%, the unweighted average for that codon is found, and the lowest of the following values is chosen: 1.5 times the lowest species preference, the unweighted average, or the threshold value (11%). Codons with greater preference have their percentages adjusted proportionally so that the total remains 100%. **When only a single species is specified,** all codon preferences are the same as that species' preferences, except that codons below 8% preference aren't used. The score of a generated sequence is decreased by the floored percent error in the usage of each codon, where that error exceeds the tolerance of 100%.
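
The multi-species combination rule can be sketched in Python for a single amino acid as shown below. This is one reading of the description above, not the program's code: it assumes the low-preference rule is triggered when the lowest species preference for a codon is below the threshold, and the function name, species names, and percentages are all made up. The single-species 8% rule and the scoring step are not covered.

```python
THRESHOLD = 11.0  # percent, from the description above


def combine_preferences(species_prefs, weights):
    """Combine per-species codon preferences (percent) for one amino acid.

    species_prefs: {species: {codon: percent}}, weights: {species: weight}.
    Hypothetical helper illustrating one interpretation of the rule.
    """
    species = list(species_prefs)
    total_weight = sum(weights[s] for s in species)
    combined, low = {}, set()
    for codon in species_prefs[species[0]]:
        values = [species_prefs[s][codon] for s in species]
        lowest = min(values)
        if lowest < THRESHOLD:  # assumed trigger for the low-preference rule
            unweighted = sum(values) / len(values)
            combined[codon] = min(1.5 * lowest, unweighted, THRESHOLD)
            low.add(codon)
        else:
            combined[codon] = sum(
                weights[s] * species_prefs[s][codon] for s in species
            ) / total_weight
    # Rescale the remaining codons so that everything sums to 100%.
    high_total = sum(v for c, v in combined.items() if c not in low)
    low_total = sum(combined[c] for c in low)
    if high_total > 0:
        scale = (100.0 - low_total) / high_total
        for codon in combined:
            if codon not in low:
                combined[codon] *= scale
    return combined


# Made-up isoleucine preferences for two hypothetical species.
prefs = {
    "species_a": {"att": 60.0, "atc": 35.0, "ata": 5.0},
    "species_b": {"att": 40.0, "atc": 40.0, "ata": 20.0},
}
combined = combine_preferences(prefs, {"species_a": 1, "species_b": 1})
print(combined)
```

In this example `ata` is capped at 1.5 times its lowest species preference (7.5%), and `att` and `atc` are scaled up to fill the remaining 92.5%.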
41+
### Avoided sequences
42+
Besides the specified enzyme cut sites, the program tries to avoid the following terminators and ribosome binding sites: aaaaa, ttttt, ggagg, and taaggag. Every appearance of one of those sites penalizes the score by 5 whereas each enzyme cut site penalizes by 10.
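
The penalties above can be illustrated with a short Python sketch. The helper names are hypothetical, counting overlapping occurrences is an assumption, and this version scans only the given strand.

```python
AVOIDED = ("aaaaa", "ttttt", "ggagg", "taaggag")  # listed in this section


def count_occurrences(seq, motif):
    """Count occurrences of `motif` in `seq`, including overlapping ones."""
    count, start = 0, seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count


def motif_penalty(sequence, cut_sites=()):
    """Return 5 points per avoided motif plus 10 points per cut site."""
    seq = sequence.lower()
    penalty = 5 * sum(count_occurrences(seq, m) for m in AVOIDED)
    penalty += 10 * sum(count_occurrences(seq, s.lower()) for s in cut_sites)
    return penalty


# One ggagg (5 points) and one EcoRI site, gaattc (10 points).
print(motif_penalty("atgGGAGGcatGAATTCtaa", cut_sites=("gaattc",)))
```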
43+
### Hairpin checking
44+
For the first 50 nucleotides of the generated sequence, segments of 18, 20, 22, and 24 bases are sampled at every possible location, and the remainder of the 50 nucleotides is checked for the reverse complement of the sample. If a sequence is found that's at least 75% identical (determined by Levenshtein distance), the melting temperature of the potential hairpin is estimated (using Wallace's Rule individually on both strands). If the melting temperature is greater than or equal to 60°C, a hairpin is confirmed. The score is decreased by 5 times the number of hairpins found.
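
The building blocks named above (reverse complement, Levenshtein-based identity, and Wallace's Rule) can be sketched in Python as follows. The function names are hypothetical, and defining identity as one minus the distance divided by the longer length is an assumption; how the program combines these pieces is not shown here.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Return the reverse complement of a lowercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def identity(a, b):
    """Fraction identical, assumed here to be 1 - distance / longer length."""
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def wallace_tm(seq):
    """Wallace's Rule: Tm in degrees C = 2 * (A + T) + 4 * (G + C)."""
    at = seq.count("a") + seq.count("t")
    gc = seq.count("g") + seq.count("c")
    return 2 * at + 4 * gc


stem = "gcgcgcgcatgcgcgcgc"  # 18 bases, made-up example
print(reverse_complement("atgc"), wallace_tm(stem))
```

Under Wallace's Rule, an 18-base stem needs a fairly high GC fraction to reach the 60°C cutoff: a stem made only of A and T would come out at 36°C.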
45+
### GC content
46+
GC content above 60% or below 30% penalizes the score by the square of the number of percentage points beyond the threshold. (e.g. GC content of 65% is penalized by 5<sup>2</sup> = 25.)
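
The GC penalty rule can be written as a small Python function. The name `gc_penalty` and its parameter names are illustrative only.

```python
def gc_penalty(gc_percent, low=30.0, high=60.0):
    """Return the squared distance (in percentage points) outside [low, high]."""
    if gc_percent > high:
        return (gc_percent - high) ** 2
    if gc_percent < low:
        return (low - gc_percent) ** 2
    return 0.0


print(gc_penalty(65.0))  # 5 points over the upper threshold
```

Because the penalty is quadratic, small excursions cost little (1 point at 61%), while large ones quickly dominate the score (100 points at 70%).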
47+
### Miscellaneous
48+
Capitalization of any sequence doesn't matter. (Everything is converted to lowercase internally.)
