Naming your gene

Gene names

Because so many genomes have been sequenced, it is no longer possible to give each gene a traditional name, as was done before the era of genomics. Most genes are instead given identification numbers, or 'unique identifiers', to keep track of them. These numbers are not useful to us in selecting a gene name, since every unnamed gene will automatically be assigned an identification number at the end of the project. So why are we looking for gene names, and what exactly are we looking for? We are looking for very common genes that were given distinct names before the genomic era. These names are valuable because many scientists still recognize and use them, and will often search for a gene of interest by its common name.

Gene names of the type we are looking for have a specific format. An example is recA: three lowercase letters followed by a capital letter. The entire gene name is written in italics. The first three letters typically represent something about the function of the gene; in this case, rec stands for recombinase. The final capital letter often distinguishes several genes that are found together or have related functions, for example recA, recB and recC. Note that when we refer to the protein encoded by this gene, we change the name to RecA. Protein names are not italicized and have the first letter capitalized. Now that we know what they are, how do we find them?
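The naming convention described above can be sketched in a few lines of Python. This is only an illustration, assuming the strict "three lowercase letters plus one capital letter" pattern. The function names are hypothetical and are not part of the annotation form.

```python
import re

# Assumed pattern: exactly three lowercase letters followed by one capital letter.
GENE_NAME_PATTERN = re.compile(r"[a-z]{3}[A-Z]")


def is_common_gene_name(name: str) -> bool:
    """Return True if the name looks like a traditional gene name such as recA."""
    return GENE_NAME_PATTERN.fullmatch(name) is not None


def protein_name(gene_name: str) -> str:
    """Convert a gene name (recA) to the matching protein name (RecA)."""
    if not is_common_gene_name(gene_name):
        raise ValueError(f"{gene_name!r} does not follow the gene name format")
    # Only the first letter changes; italics are a display matter, not text.
    return gene_name[0].upper() + gene_name[1:]


print(is_common_gene_name("recA"))      # a traditional gene name
print(is_common_gene_name("SMc02769"))  # a locus tag, not a gene name
print(protein_name("recA"))
```

A check like this separates traditional names from unique identifiers such as SMc02769, which do not fit the pattern.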

We have several options when selecting a gene name. Typically, genes are named based on the similarity of their protein products to the products of other genes. This requires that you have already selected a product by looking at the blastP (protein comparison) results for your gene's product. In the blast window of our annotation form you can click on any of the bars that indicate similar proteins in the database to find out more about that particular protein. In our example (Avi0017), the top bar represents a conserved hypothetical transmembrane protein from a bacterium called Sinorhizobium meliloti. If you click on this light blue bar you will be taken to a 'record' that describes this particular gene and its protein product.

This record is from the NCBI database, which contains one of the largest collections of gene information in the world. Researchers around the world send information on their genes to this database so that other scientists can use it in their research. When researchers submit a gene, they also send as much information as they have about the gene and its protein product. This information includes a list of publications that describe the gene, the function of the gene, the DNA sequence, the protein sequence and any domains found, among other details. There is a lot of information on these pages, so take a bit of time to look around and see what is available.

Gene names are usually found near the bottom of the page under the heading 'features'. In our example, you will not see a gene name but will see 'SMc02769' listed under the heading 'locus tag'. This is an example of the 'unique identifier' described above, so it will not be useful to us in naming our gene. As we review several of the records, we can see that there is no common gene name. This is what we expect for a conserved hypothetical gene, since no function is known. Let's take a quick look at another gene that has a name to see how this might work.

At the top of your annotation page there is a search box labeled 'gene id search'. Type 'avi2558' in this box and click on the 'search' button. You will be taken to a record for the recA gene in A. vitis S4. Scroll down the window to the blast results box and click on the second bar from the top that is labeled 'RecA protein [Agrobacterium tumefaciens C58]'. This will take you to the record for this gene. Note the first protein did not have an obvious name in its title so we skipped to the second. Also note that all the rest of the proteins appear to be named RecA so this is the obvious choice for the product of our gene. If you review the record you will see the following near the bottom:

CDS            1..363
               /gene="recA"
               /locus_tag="Atu1874"

This tells you that the name of this gene is recA. You should look at three or four of the top blast matches (hits) to ensure that they all use the same gene name. Once you have done that, you can name this gene recA.
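If you have copied the text of a record, the same lookup can be sketched in Python. This is a simplified illustration, not a full parser: it assumes each qualifier sits on a single line in the form /name="value", as in the excerpt above, and the function name parse_qualifiers is hypothetical.

```python
import re

# Matches one qualifier line such as:   /gene="recA"
QUALIFIER = re.compile(r'^\s*/(\w+)="([^"]*)"\s*$')


def parse_qualifiers(feature_text: str) -> dict:
    """Collect /name="value" qualifiers from the text of one feature."""
    qualifiers = {}
    for line in feature_text.splitlines():
        match = QUALIFIER.match(line)
        if match:
            qualifiers[match.group(1)] = match.group(2)
    return qualifiers


record_excerpt = '''CDS            1..363
               /gene="recA"
               /locus_tag="Atu1874"
'''

qualifiers = parse_qualifiers(record_excerpt)
# A missing "gene" key means the record only carries a unique identifier.
print(qualifiers.get("gene"))
print(qualifiers.get("locus_tag"))
```

For a record like SMc02769, which has a locus tag but no gene name, qualifiers.get("gene") would return None.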

What if not all of the top blast matches use the same name? In this case we choose among the names according to the organism each one comes from. The priority list is as follows:

Agrobacterium > Sinorhizobium > Brucella > E. coli

What that means is that you would first choose the name used for a similar gene in Agrobacterium. If there isn't one, use one from Sinorhizobium. If neither has one, use a gene name from Brucella and, failing that, from E. coli. If none of these organisms has a common name for the gene, just leave the gene name blank and we will automatically fill it in later.
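The priority rule can be written out as a short Python sketch. It assumes you have already noted, for each blast hit, the organism and the gene name from its record (or None when the record has only a locus tag). The function name and the example hits are invented for illustration and are not real annotation data.

```python
from typing import List, Optional, Tuple

# Organism prefixes in priority order, highest priority first.
# E. coli may be written out in full in a record title, so both forms are listed.
PRIORITY = [
    ("Agrobacterium",),
    ("Sinorhizobium",),
    ("Brucella",),
    ("Escherichia coli", "E. coli"),
]


def choose_gene_name(hits: List[Tuple[str, Optional[str]]]) -> Optional[str]:
    """Pick a gene name from (organism, gene_name) pairs using the priority list.

    Returns None when no listed organism has a common gene name, which
    corresponds to leaving the gene name field blank.
    """
    for prefixes in PRIORITY:
        for organism, gene_name in hits:
            if gene_name and organism.startswith(prefixes):
                return gene_name
    return None


# Invented hits: the Agrobacterium record has no gene name, so the
# Sinorhizobium name is chosen ahead of the Brucella one.
example_hits = [
    ("Brucella suis 1330", "abcE"),
    ("Sinorhizobium meliloti 1021", "abcD"),
    ("Agrobacterium tumefaciens C58", None),
]
print(choose_gene_name(example_hits))  # prints abcD
```

The order of the hits in the list does not matter here. Only the organism priority decides which name is returned.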

When you are done be sure to note why you chose the gene name in the 'private comments' box at the bottom of the annotation record and click on the UPDATE button.

Finally, as with any other field that you can fill in, if you are not sure, just leave it blank and someone else will review it. If you would like someone to review this or any other field send an email to your instructor with the gene identifier (e.g. Avi0017), the field you are uncertain of (e.g. gene name) and a short description of why you are having a problem.

That's it, you have just named your first gene!

last update:

281225 Nov 04

Please refer questions or comments to agro@u.washington.edu

Site design and maintenance:

Derek Wood