README-L2L-GO
Updated January 6, 2006 by John Newman (newmanj@u.washington.edu)

This document describes, in detail, the steps required to create the
Gene Ontology sets of list files for L2L. Our intention is to permit
others to both confirm and replicate our efforts. The process requires
at least basic familiarity with a UNIX-like command shell and an
installation of Perl 5.

L2L-GO.pl is released under the GNU General Public License (see the
file "LICENSE"), with the following notice:

--------------------
Copyright (C) 2006 John C. Newman

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or (at
your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
USA.
--------------------


--------------------
Contents:
--------------------
I. Download Public Database Files
II. Create GO term list files
III. How does it work?


--------------------
I. Download Public Database Files
--------------------

Download the following two archive files from NCBI's public FTP
server:

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz

Also, download the latest Gene Ontology description file:

ftp://ftp.geneontology.org/pub/go/ontology/gene_ontology.obo

Extract both NCBI archives; the extracted files are very large (tens
to hundreds of MB). Put all three files into the same directory as the
GO-L2L.pl program.


--------------------
II. Create GO term list files
--------------------

Run the GO-L2L.pl program from the command line:

./GO-L2L.pl

It will report its progress, along with any serious errors (such as
not being able to find the database files). The program should take
only a few minutes to run, and will put all of the output L2L list
files into three new directories, one for each Gene Ontology
organizing principle (biological process, cellular component, and
molecular function).


--------------------
III. How does it work?
--------------------

NCBI Gene now publishes a list of Gene ID associations with Gene
Ontology terms, derived from the GOA consortium. There are two major
tasks. The surprisingly complicated task is to compile a list of all
Gene IDs that are associated with each GO term. The simpler task is to
translate those Gene IDs to HUGO symbols, and create an L2L list file
for each GO term.

The first task is surprisingly complicated because a gene is not
necessarily associated with every ancestor of the term it is directly
associated with. Therefore, simply reading down the list of
associations and assigning each gene to its direct association is not
sufficient. We instead need to start with a term, find all the direct
associations for that term, then look for direct associations to any
of its child terms, then look for direct associations to any of
*their* child terms, and so on, until we reach the end of every
relevant branch of the family tree of GO terms. Then we can write the
comprehensive gene list for that term to a file, and begin anew with
the next term.

To use an imaginary example, a gene might be specifically associated
with "voltage-gated sodium channels", but not with "sodium channels",
"cation channels", "membrane transporters" or "homeostasis". For any
given term (e.g. "cation channels"), this means that the list of genes
directly associated with that term does not necessarily include all
the genes associated with its descendant terms (e.g. "voltage-gated
sodium channels").

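The descent described above can be sketched in a few lines. This is an
illustration written in Python, not an excerpt from the Perl source of
GO-L2L.pl. The function names, the "children" and "direct_genes"
dictionaries, and the parent/child chain are assumptions made up for
this sketch; the term names come from the imaginary example, and
GENE1, GENE2 and GENE3 are placeholders.

```python
def collect_descendants(term, children, seen=None):
    """Return a set holding `term` and every term reachable below it."""
    if seen is None:
        seen = set()
    if term in seen:
        # GO is a graph in which a term may have several parents, so a
        # term can be reached by more than one branch; visit it once.
        return seen
    seen.add(term)
    for child in children.get(term, ()):
        collect_descendants(child, children, seen)
    return seen


def genes_for_term(term, children, direct_genes):
    """Union of the genes directly associated with `term` or any descendant."""
    genes = set()
    for t in collect_descendants(term, children):
        genes |= direct_genes.get(t, set())
    return genes


# Hypothetical parent -> children table built from the imaginary example.
children = {
    "homeostasis": ["membrane transporters"],
    "membrane transporters": ["cation channels"],
    "cation channels": ["sodium channels"],
    "sodium channels": ["voltage-gated sodium channels"],
}

# Hypothetical direct associations (term -> placeholder gene symbols).
direct_genes = {
    "voltage-gated sodium channels": {"GENE1"},
    "sodium channels": {"GENE2"},
    "membrane transporters": {"GENE3"},
}

print(sorted(genes_for_term("cation channels", children, direct_genes)))
# prints ['GENE1', 'GENE2']
```

Note that "cation channels" has no direct associations of its own in
this toy data, yet its list still contains the genes attached to its
descendants. Keeping the "seen" set is one simple way to avoid
revisiting a term that is reachable through more than one parent.
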
GO-L2L makes this process fast by using Perl hashes in a recursive
tree-walking algorithm. It iterates through each GO term, and finds
all of that term's descendants by walking down each branch of the GO
tree. It stops when it finds no new children on any branch. It then
finds all of the Gene IDs associated with the term or any of its
descendants, translates them all to HUGO symbols, collects the
annotation information for the term (what category it belongs to, its
English name, etc.), and writes everything to an appropriately named
and located output file.

When it finishes, you can check that the output files in the new
directories (gobiol, gocell, and gomole) look as you would expect:

#L2L listfile
#NAME [term name]
#REFERENCE GO:[term accession number]
#DESCRIPTION [term definition, if one exists]
#RELEASE [current L2L release]
GENE1
GENE2
GENE3
...


--------------------
END OF README FILE
--------------------