README-MammalHom

Created January 21, 2005 by John Newman (newmanj@u.washington.edu)

This document describes the creation and use of MammalHom, a package of
Perl programs for inter-converting human, mouse and rat gene symbols.
MammalHom requires a computer with Perl5 and a UNIX-like command shell.

MammalHom is released under the GNU General Public License (see the file
"LICENSE"), with the following notice:

--------------------

Copyright (C) 2005 John C. Newman

This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your
option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

--------------------

--------------------
Contents:
--------------------

I.   Introduction
II.  Download HomoloGene database and extract data
III. Remove duplicate entries
IV.  Using mammalhom.pl
V.   Using m2h.pl

--------------------
I. Introduction
--------------------

MammalHom is a suite of programs that assist in inter-converting
mammalian gene names. The two programs that do the actual conversion
work are mammalhom.pl and m2h.pl. Both of these require information
about gene homologies to work, in the form of a .hom file for each of
the three species. These files are derived from the NIH's HomoloGene
database. MammalHom includes a version of these files that is current as
of the writing of this document. If you wish to make your own, updated
.hom files, follow the instructions in sections II and III below.
If you simply want to get on with converting gene names, skip to
section IV.

--------------------
II. Download HomoloGene database and extract data
--------------------

Download the current release of the HomoloGene database from
ftp://ftp.ncbi.nlm.nih.gov/pub/HomoloGene/current/homologene.data.tar.
Unpack the archive into the MammalHom directory. The Perl application
hom_extract.pl will extract gene symbols and HomoloGeneIDs for human,
mouse and rat genes from the database file:

    ./hom_extract.pl

It will create three output files: human.hom, mouse.hom and rat.hom,
all in the format "HomoloGeneID GeneSymbol".

--------------------
III. Remove duplicate entries
--------------------

Typically, several of the gene symbols in each species will have
duplicate HomoloGeneIDs. These duplicates will confuse the conversion
programs, which match symbols from different species by finding their
common HomoloGeneID. First, identify the duplicates with
hom_finddups.pl:

    ./hom_finddups.pl

The program searches the three .hom files for duplicates and prints the
symbols and HomoloGeneIDs of any duplicates to the terminal. You then
need to open the .hom files in a text editor and delete one of the
duplicate entries. To decide which entry to delete for any given gene
symbol, you may want to look for that symbol in the .hom files of the
other species. If one of the duplicate IDs is the only ID for that
symbol in another species, you should probably keep it and delete the
other. Once all duplicates have been resolved, the .hom files are ready
for use.

--------------------
IV. Using mammalhom.pl
--------------------

There are two conversion programs that use the .hom files to match gene
symbols between species: mammalhom.pl and m2h.pl. mammalhom.pl is the
more flexible of the two. Make sure it is in the same directory as the
.hom files, then launch it from the command line:

    ./mammalhom.pl

It will present a series of five prompts. First, the species you wish to
convert from.
Make sure you enter precisely "h", "m", or "r"; or "human", "mouse" or
"rat". Second, the species you want to convert to. Third, the location
of the input file. This is a text file that contains the list of gene
symbols you wish to convert (one per line). It can be located anywhere.
Fourth, the location and name you wish to use for the output file.
Finally, the program asks if you want it to insert a blank line in the
output if it can't find a match (enter either "yes" or "y" if you do;
any other response will be treated as "no"). It will then run, and
report its progress on the terminal.

The option of adding an empty line for failed matches is intended to
help with copying and pasting the results into a spreadsheet. The order
of the gene symbols is preserved, so after pasting the list of genes
from the spreadsheet into the input file, the user can paste the list of
genes from the output file back into the spreadsheet next to the
original list, and see at a glance which genes could not be matched.
Genes with no match can be matched by hand in EntrezGene.

mammalhom.pl is case-insensitive. All names are converted to upper-case
before matching. This means the output is upper-case by default. HUGO
names are (almost) all upper-case, anyway, but the usual mouse format is
for only the first letter to be upper-case. Therefore, if the user is
converting TO mouse, mammalhom.pl will write the output name with only
its first letter in upper-case. This may result in some incorrect
casing, since the first-letter-uppercase rule is not absolute, but it
should generally produce correctly-formatted mouse gene names.

--------------------
V. Using m2h.pl
--------------------

The second conversion program is m2h.pl. It is a specialized tool for
rapidly converting many lists of mouse symbols to human symbols, or vice
versa. Rather than prompting the user for input, it accepts command-line
arguments.
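
As a rough illustration of the kind of lookup described in sections III
and IV (matching symbols through their common HomoloGeneID, upper-casing
before matching, and first-letter-uppercase mouse output), here is a
minimal Perl sketch. It is not the source of m2h.pl or mammalhom.pl. The
subroutine names load_hom and convert, the sample HomoloGeneIDs, and the
assumption that the two .hom columns are separated by whitespace are all
hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data in "HomoloGeneID GeneSymbol" format.
# The IDs are invented, and whitespace separation is an assumption.
my $mouse_hom = "1001 Trp53\n1002 Brca1\n";
my $human_hom = "1001 TP53\n1002 BRCA1\n";

# Parse .hom text into a hash reference.
# With $by_symbol true:  UPPER-CASE SYMBOL -> HomoloGeneID
# With $by_symbol false: HomoloGeneID -> UPPER-CASE SYMBOL
sub load_hom {
    my ($text, $by_symbol) = @_;
    my %map;
    for my $line (split /\n/, $text) {
        my ($id, $symbol) = split ' ', $line;
        next unless defined $symbol;    # skip blank or malformed lines
        if ($by_symbol) { $map{ uc $symbol } = $id }
        else            { $map{$id} = uc $symbol }
    }
    return \%map;
}

# Convert one symbol through its HomoloGeneID.
# Returns '' when there is no match (a blank line in batch output).
sub convert {
    my ($symbol, $from_by_symbol, $to_by_id, $to_mouse) = @_;
    my $id = $from_by_symbol->{ uc $symbol };
    return '' unless defined $id && exists $to_by_id->{$id};
    my $out = $to_by_id->{$id};
    $out = ucfirst lc $out if $to_mouse;    # only first letter upper-case
    return $out;
}

# Mouse to human
my $mouse_by_symbol = load_hom($mouse_hom, 1);
my $human_by_id     = load_hom($human_hom, 0);
print convert('trp53', $mouse_by_symbol, $human_by_id, 0), "\n";  # TP53

# Human to mouse
my $human_by_symbol = load_hom($human_hom, 1);
my $mouse_by_id     = load_hom($mouse_hom, 0);
print convert('TP53', $human_by_symbol, $mouse_by_id, 1), "\n";   # Trp53
```

Keying one hash by symbol and the other by HomoloGeneID keeps each
conversion to two hash lookups. It also shows why a symbol that appears
with more than one ID is a problem: the later entry silently overwrites
the earlier one.
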
m2h.pl has two "modes": a batch-conversion mode (the default), where the
two arguments are the input and output files:

    ./m2h.pl input output

...and a single-gene conversion mode, specified by the "-g" flag, where
the single argument is a gene name:

    ./m2h.pl -g GENE

By default, the input must be mouse symbols, and the output will be
human symbols. This behavior can be inverted with the "-i" flag. The
input will then be assumed to be human, and the output will be mouse.
"-i" can be used with either batch or single-gene mode:

    ./m2h.pl -i input output
    ./m2h.pl -ig GENE

Batch mode will always leave a blank line in the output if an input gene
cannot be matched (see section IV). Like mammalhom.pl, m2h.pl does
case-insensitive matching, and will produce all-capitals human gene
output and first-letter-capital mouse gene output.

--------------------
END OF README FILE
--------------------