Choose a start codon

Start codons

Genes in bacteria have several components. These include start and stop signals that define the beginning and end of the protein being made. These signals are specific codons, or three-letter combinations of nucleotide bases: the start codon specifies the amino acid MET, while a stop codon does not specify any amino acid. The other component of a gene you need to know about when assigning a start codon is the ribosome binding site (RBS; also known as the Shine-Dalgarno sequence). This is a short series of nucleotides located 7-13 bases upstream of the start codon. It is recognized by the ribosome (the complex of RNA and protein that translates mRNA sequence information into protein) and tells the ribosome where to start making the protein. Because the RBS sits only a short distance (7-13 nt) from the correct start codon, finding a good RBS helps you identify which start codon is the right one. The consensus sequence for ribosome binding sites that we will use in this project is from the bacterium E. coli and is:

AGGAGGA
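As a rough illustration of the spacing rule, here is a minimal Python sketch that slides a 7-letter window through the region 7-13 nt upstream of a start codon and reports the window that differs least from the consensus. The function name `best_rbs`, the mismatch-count scoring, and the reading of "7-13 bases upstream" as the gap between the end of the RBS and the start codon are all assumptions made for this example; this is not the method used by the annotation site. The test fragment is the short piece of Avi0017 quoted in step 1 further down this page.

```python
RBS_CONSENSUS = "AGGAGGA"  # consensus given in this tutorial


def best_rbs(seq, start_pos, min_gap=7, max_gap=13):
    """Find the 7-mer upstream of start_pos that best matches the consensus.

    start_pos is the 0-based index of the first base of the start codon.
    The gap is the number of bases between the end of the 7-mer and the
    start codon (an assumed reading of "7-13 bases upstream").
    Returns (mismatches, gap, site), or None if no window fits.
    """
    best = None
    k = len(RBS_CONSENSUS)
    for gap in range(min_gap, max_gap + 1):
        begin = start_pos - gap - k
        if begin < 0:
            break  # ran off the left end of the sequence
        site = seq[begin:begin + k]
        mismatches = sum(a != b for a, b in zip(site, RBS_CONSENSUS))
        if best is None or mismatches < best[0]:
            best = (mismatches, gap, site)
    return best


fragment = "GTAACGAGGATAATGGAATG"   # Avi0017 fragment from this page
start = fragment.rfind("ATG")       # the start codon at the end of the fragment
print(best_rbs(fragment, start))    # (1, 7, 'ACGAGGA')
```

A result with few mismatches is only a hint. A real RBS can differ from the consensus, so the number should support your judgment rather than replace it.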

Several codons can be used as start codons in bacteria, including ATG, TTG, GTG, and CTG. Note that they all look sort of like ATG, which is the most common one and is the only one that normally specifies MET; inside a gene the others specify different amino acids. However, when used as a start codon, ALL of these will put a MET amino acid in the first position of the protein (actually a modified MET linked to a formyl group, which distinguishes it as the start). MOST start codons are ATG (>90%), but the other options can also be used. Fortunately, our bioinformatics team has already done all the work of finding these for us.
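To make the point about alternative start codons concrete, here is a small Python sketch that translates a reading frame with the standard genetic code but always writes MET (M) for the first codon when it is one of the start codons listed above. The function name `translate` and the toy sequences are made up for this illustration, and the formyl group on the first MET is not represented.

```python
# Standard genetic code, built from the usual TCAG ordering of bases.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

START_CODONS = {"ATG", "TTG", "GTG", "CTG"}  # the codons named in this tutorial


def translate(orf):
    """Translate a DNA reading frame, stopping at the first stop codon.

    The first codon is written as M if it is a start codon, even when it
    would normally specify another amino acid.
    """
    protein = []
    for pos in range(0, len(orf) - 2, 3):
        codon = orf[pos:pos + 3]
        if pos == 0 and codon in START_CODONS:
            protein.append("M")
            continue
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon: no amino acid is added
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("GTGAAATAA"))  # GTG used as a start codon
print(translate("ATGGTGTAA"))  # GTG inside the gene
```

In the first call GTG is read as M because it is the start codon; in the second, the same codon inside the gene is read as V (valine).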

So, how do we do this? Recall that the example we are using is Avi0017. Once you have logged in, follow this link:

(http://agro.vbi.vt.edu/servlets-examples/servlet/GeneEdit?genename=avi0017&Search=Search&level=2)

This will bring you to the record we are reviewing. Scroll down a bit to the window showing the start codon selector. This shows the DNA sequence of gene Avi0017. Small blue boxes mark the locations of possible start codons that match the reading frame of this gene (i.e., if you pick one of these, the grouping of nucleotides into three-letter codons for the rest of the gene will not change). One of the blue boxes has a red box underneath it, close to the DNA sequence shown. This is the codon currently selected as the start codon. To change it, click on any other blue box and then click the 'submit' button. This will change the start codon to the one you selected. Note that this will also change the amino acid sequence of the protein shown two boxes below. This is because the start codon encodes the first amino acid, so you will either add amino acids to the protein (if the new start codon is to the left) or remove them (if it is to the right).
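The idea behind the blue boxes can be sketched in a few lines of Python. The function below steps through the sequence three nucleotides at a time from the current start, so every codon it sees is in the same reading frame, and records each start-like codon together with how many amino acids would be added (positive) or removed (negative) by choosing it. The function name, the toy sequence, and the choice to stop searching at the first in-frame stop codon in either direction are assumptions for this illustration; the selector on the site may work differently.

```python
START_CODONS = {"ATG", "TTG", "GTG", "CTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def in_frame_starts(seq, current_start):
    """List (position, codon, change_in_amino_acids) for in-frame starts.

    Positions are 0-based. The change is relative to current_start:
    positive when the candidate lies upstream (to the left), negative
    when it lies downstream (to the right).
    """
    found = []
    # Walk upstream until an in-frame stop codon or the sequence edge.
    pos = current_start - 3
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        if codon in START_CODONS:
            found.append((pos, codon, (current_start - pos) // 3))
        pos -= 3
    # Walk downstream, including the current start itself.
    pos = current_start
    while pos + 3 <= len(seq):
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        if codon in START_CODONS:
            found.append((pos, codon, (current_start - pos) // 3))
        pos += 3
    return sorted(found)


# Made-up sequence: TAA GTG AAA ATG CCC TTG GGG TAA
toy = "TAAGTGAAAATGCCCTTGGGGTAA"
for position, codon, change in in_frame_starts(toy, 9):
    print(position, codon, change)
```

For the toy sequence this lists GTG at position 3 (two amino acids added), the current ATG at position 9 (no change), and TTG at position 15 (two amino acids removed).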

In order to select a start codon you need to do two things:

1. Look at the various options for start codons. For each one, look upstream (to the left) about 7-13 nucleotides (nt) and see if you can find something that looks similar to the consensus RBS (AGGAGGA). Note that this is a consensus, which means that most ribosome binding sites look like this, but that there can be variations. In our gene Avi0017, if you look upstream of the selected start codon you will see the sequence:

GTAACGAGGATAATGGAATG

In this example the RBS is ACGAGGA (letters 4-10 of the sequence) and the start codon is the ATG at the very end, with 7 nt between them. You can see that the RBS is not an exact match to the consensus (the second letter is C and not G), but it is very close. Not all will be this close, but many are. As with everything else, if you are not sure, don't change anything.

2. Once you think you have found the correct start, you need to see how it compares with what other people have chosen for similar genes. You do this using the BLAST program to compare the sequence of the protein you have defined using this new start codon to other similar proteins in the database. Just above the start codon window is a small box that says:

On the fly processing BLAST SIGNALP/SMART COG PFAM TMHMM PSORT

Click on the link that says BLAST and a new window will pop up that contains your revised protein sequence. Simply click on the 'search' button to start the program looking for matches. In a few seconds the results will show up. Click on the top colored bar in the list to automatically be taken to the text results below, or simply scroll down past the list of matches. You should see something like this:

>ref|NP_384131.1| CONSERVED HYPOTHETICAL TRANSMEMBRANE PROTEIN [Sinorhizobium
				 meliloti 1021]
 emb|CAC41412.1| CONSERVED HYPOTHETICAL TRANSMEMBRANE PROTEIN [Sinorhizobium
 				meliloti]
 Length = 141

 Score = 169 bits (427), Expect = 2e-41
   Identities = 80/134 (59%), Positives = 100/134 (74%), Gaps = 3/134 (2%)

Query: 1  MNQSALLRPGWRPATIAMMVLGFVIFWPLGLAMLAYILWGDRFRTSKRNANEAMDAMFSK 60
          MNQSAL+RP W PATIA+MVLGF++FWPLGLAMLAYIL+GD+ R  K++ANE +D M
Sbjct: 1  MNQSALIRPDWTPATIALMVLGFIVFWPLGLAMLAYILFGDKLRAFKKDANEGVDRM--- 57

Query: 61  CCGXXXXXXXXXXXXXXGNLAFDEWRVTELERIEQERRKLEEMREEFEAYVLELQRAKDQ 120
           C G              GN+AFD+WR  EL R+++ERRKL+EMREEF+ YV EL+RAKDQ
Sbjct: 58  CAGFKRNRRGQWAHHRTGNVAFDDWRTAELARLDEERRKLDEMREEFDGYVRELRRAKDQ 117

Query: 121 DEFNRFMNQRNASR 134
           +EF+RFM +R   R
Sbjct: 118 EEFDRFMRERKNGR 131

This shows you how the two proteins are aligned. Your protein is the 'query' and its match in the database is the 'subject'. Note that following 'Query:' is the number 1. This means that the amino acid just following is amino acid #1 in your protein. The same is true in this case for the subject: its line also begins at amino acid #1. Since the alignment starts at 1 for both, the similar protein from Sinorhizobium meliloti begins at the same position as ours, which means the same start codon was chosen for both.

Here's an example from the same search in which they don't start at the same place:

>ref|ZP_00194050.2| hypothetical protein MBNC02002869 [Mesorhizobium sp. BNC1]
 Length = 155

 Score = 161 bits (408), Expect = 3e-39
   Identities = 77/131 (58%), Positives = 91/131 (69%), Gaps = 6/131 (4%)

Query: 1  MNQSALLRPGWRPATIAMMVLGFVIFWPLGLAMLAYILWGDRFRTSKRNANEAMDAMFSK 60
          M  SAL+RP W PATIA+MV+GF+ FWPLGLAMLAYILWGDR    KR  N   D +F+
Sbjct: 18 MTNSALIRPAWTPATIALMVIGFMAFWPLGLAMLAYILWGDRLHEFKRGINSKTDGLFAN 77

Query: 61  CCGXXXXXXXXXXXXXXGNLAFDEWRVTELERIEQERRKLEEMREEFEAYVLELQRAKDQ 120
           C                GN+AFDEWR  ELER+E+ERRKL+ MR EF+ YV EL+RAKDQ
Sbjct: 78  C------RRASRSYSMTGNIAFDEWRQKELERLEEERRKLDAMRSEFDEYVRELRRAKDQ 131

Query: 121 DEFNRFMNQRN 131
           +EF+RFM  RN
Sbjct: 132 EEFDRFMRDRN 142

In this example the matching protein from Mesorhizobium sp. BNC1 starts earlier than the protein we have selected: the alignment begins at amino acid 18 of the subject, so that protein has 17 extra amino acids at its beginning. In this case this is because this group did not take the time to select start codons, but simply used automated programs to find the start. Our automated program also found this same start before we changed it, but it wasn't an ATG (the most common) and had no RBS. If you quickly review the rest of the top matches you will see that most use the start codon we chose. This is what we are looking for. If most use a different one, consider making the change to see if it has a better RBS. The bottom line is that this second step is just confirmation. If you are very confident in your RBS then stick with it; don't change just because others have if the RBS doesn't support the change.
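If you want to check many hits at once, the comparison of start positions can be sketched in Python. The snippet below reads the first 'Query:' and 'Sbjct:' lines of one hit from the plain-text BLAST report and reports how many amino acids the subject has upstream of the aligned region compared with your protein. The function names and the regular expression are written for this illustration and assume the report layout shown on this page; they are not part of BLAST or of the annotation site.

```python
import re

# Matches lines such as "Query: 1  MNQSALL..." or "Sbjct: 18 MTNSALI...".
ALIGN_LINE = re.compile(r"^(Query|Sbjct):\s*(\d+)\s+\S+", re.MULTILINE)


def alignment_starts(hit_text):
    """Return (query_start, subject_start) from the first alignment rows."""
    starts = {}
    for label, position in ALIGN_LINE.findall(hit_text):
        starts.setdefault(label, int(position))  # keep only the first row
    return starts["Query"], starts["Sbjct"]


def extra_subject_residues(hit_text):
    """Amino acids the subject has before the alignment, beyond the query's.

    0 means both proteins begin at the same place. A positive number means
    the subject starts that many amino acids earlier than the query.
    """
    query_start, subject_start = alignment_starts(hit_text)
    return (subject_start - 1) - (query_start - 1)


# Shortened rows copied from the two examples on this page.
same_start = "Query: 1  MNQSALLRPGW 11\nSbjct: 1  MNQSALIRPDW 11\n"
earlier_start = "Query: 1  MNQSALLRPGW 11\nSbjct: 18 MTNSALIRPAW 28\n"

print(extra_subject_residues(same_start))
print(extra_subject_residues(earlier_start))
```

A result of 0 corresponds to the first example, where both proteins start at the same position. For the second example the result is 17, because a subject line that begins at amino acid 18 has 17 amino acids before it.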

That's all there is to identifying the ribosome binding site and choosing a start codon. Pretty easy, huh?

last update:

061753 Dec 04

Please refer questions or comments to agro@u.washington.edu

Site design and maintenance:

Derek Wood