README-L2L-Utilities
Created January 24, 2005 by John Newman (newmanj@u.washington.edu)

This document describes L2L-Utilities, a collection of Perl programs for manipulating and mining information from L2L list files. L2L-Utilities requires a computer with Perl5 and a UNIX-like command shell.

L2L-Utilities is released under the GNU General Public License (see the file "LICENSE"), with the following notice:

--------------------
Copyright (C) 2005 John C. Newman

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
--------------------

--------------------
Contents:
--------------------
I. Included programs
II. Co-expression relationships
III. Counting
IV. Comparing

--------------------
I. Included programs
--------------------
L2L-Utilities includes seven Perl applications that perform a variety of functions. Most were written specifically to work with L2L list files (comparelines.pl is a general-purpose utility). They can be divided into three groups:

1. Co-expression relationships: findposcoexp.pl, findnegcoexp.pl
2. Counting: countgenes.pl, countrefs.pl
3. Comparing: findcommon.pl, comparelines.pl, removedups.pl

--------------------
II. Co-expression relationships
--------------------
findposcoexp.pl and findnegcoexp.pl find and record co-expression relationships within the list files of the L2L Microarray Database.
A positive co-expression relationship is any occasion where two genes appear on the same list. A negative co-expression relationship is any occasion where two genes appear on inverse lists of a particular condition (one on the "up" list, one on the "down" list). Inverse lists are defined as lists with identical filenames, except that one has the suffix "_dn" and the other "_up".

findposcoexp.pl searches for positive relationships, while findnegcoexp.pl searches for negative relationships. Otherwise, the programs are functionally identical. Both take a single command-line argument: the directory that contains the list files to be searched. Usually, this will be the lists/l2lmdb directory. All files in the directory will be processed, so there should be no extraneous files - only L2L list files.

Both programs write a summary of the results to the terminal. In addition, they write a list of all relationships that appear at least twice to the file "[pos|neg]coexpgenes.txt". This file has the format:

    #relationship GENE1 #gene1 GENE2 #gene2

...where #relationship is the number of times the relationship appears in the database, GENE1 and GENE2 are the two genes involved in the relationship, and #gene1 and #gene2 are the number of times GENE1 and GENE2 appear in the database, respectively. The name of this output file cannot be changed, except in the source code.

The programs can take several minutes to run if used to search the entire L2L MDB, so patience may be required.

Typical usage:
    ./findposcoexp.pl path/to/lists

--------------------
III. Counting
--------------------
countgenes.pl and countrefs.pl count and characterize the genes or references, respectively, present in a directory full of L2L list files. Functionally, they are very similar. Each takes as an argument the path to a directory, which should contain all of the list files to be processed and no other files.
Each program first extracts the relevant data from each list file in the directory and keeps a count of the occurrences of each unique reference or gene. Each prints a basic summary of the results to the terminal. Each also accepts an optional second argument, an output filename. If given this argument, the program writes a more detailed output to that file.

countgenes.pl prints to the terminal the total number of gene entries, the number of unique gene names, and the number of duplicate occurrences of gene names. If given the optional output filename argument, the program writes to the file the number of occurrences of each unique gene name, in the format:

    GENE1 #occurrences

Typical usage:
    ./countgenes.pl path/to/lists outputfile

countrefs.pl prints to the terminal the number of lists without references, the number of unique references, and the number of duplicate occurrences of references. If given the optional filename argument, the program writes to the file the number of occurrences of each reference (i.e. the number of list files derived from that reference), in the format:

    REFERENCE #occurrences

Typical usage:
    ./countrefs.pl path/to/lists outputfile

--------------------
IV. Comparing
--------------------
findcommon.pl finds the genes shared between several L2L list files. It takes two arguments. The first is the directory that contains the list files to be analyzed. All files in this directory will be processed, so it should contain only the list files you wish to compare, and no extraneous files. The second argument is the filename of the output file.

The program prints its progress and a summary of its output to the terminal, and writes a detailed output to the output file. It first writes the name and number of genes for each list file processed, then the number of genes that were found on n, n-1 and n-2 of the n lists processed. It also lists the names of all genes that were found on n, n-1 or n-2 of the list files.
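
As a rough illustration of the tally just described, the following Python sketch counts how many lists each gene appears on and groups the genes found on n, n-1 and n-2 of the n lists. It is not the Perl source of findcommon.pl. The function name common_genes, the in-memory dictionary of gene names, and the example list and gene names are all assumptions made for illustration. Parsing of L2L list files is deliberately left out, and counting a gene only once per list is also an assumption.

```python
from collections import Counter


def common_genes(lists):
    """Group genes by how many of the n lists they appear on (n, n-1, n-2).

    `lists` maps a list name to an iterable of gene names that have
    already been extracted from a list file (hypothetical input).
    """
    n = len(lists)
    counts = Counter()
    for genes in lists.values():
        # Assumption: a gene counts once per list, even if the list repeats it.
        counts.update(set(genes))
    result = {}
    for k in (n, n - 1, n - 2):
        if k >= 1:
            result[k] = sorted(g for g, c in counts.items() if c == k)
    return result


# Hypothetical example data, not taken from the L2L database.
example = {
    "list_a": ["TP53", "MYC", "EGFR"],
    "list_b": ["TP53", "MYC", "KRAS"],
    "list_c": ["TP53", "EGFR", "KRAS", "KRAS"],
}
shared = common_genes(example)
for k in sorted(shared, reverse=True):
    print(k, shared[k])
```

Like the program it illustrates, the sketch records only how many lists a gene was found on, not which ones.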

findcommon.pl does not keep track of which lists a given gene was found on, only the number of occurrences.

Typical usage:
    ./findcommon.pl path/to/lists outputfile

removedups.pl is a tool for removing duplicate gene entries from L2L list files. Often, when compiling a new list from published data using automated tools like MatchMiner, multiple sequences or probe IDs may match to the same gene name. These duplicates can confound the results of L2L, so they must be removed before the new list can be used.

removedups.pl takes as an argument the directory that contains all the list files to be processed. It will process all files in this directory, so take care that there are no extraneous files. It first creates a "temp" subdirectory in its own directory to store the processed list files. As it reads each file, it writes the contents to a new file in the temp directory. If it finds a duplicate gene entry, that entry is not written to the new file. The original file is not modified. The program reports its progress to the terminal. When it is finished, temp will contain a cleaned copy of all the files in the original directory, including those in which no duplicates were found.

Typical usage:
    ./removedups.pl path/to/lists

comparelines.pl is a general-purpose tool for comparing the content of two files. It takes two required arguments, filename1 and filename2, and an optional third argument, the output filename. It finds all the lines in file1 that are not present (verbatim) in file2, and prints the non-matching lines to the terminal and (if an output filename was specified) to the output file. The order of the lines does not matter, and they must match perfectly (even whitespace) to be considered identical. Since order is irrelevant, duplicated lines cannot be tested separately; the program will not notice if a certain line occurs three times in file1 but once in file2.
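
The comparison described above amounts to a set difference on whole lines. A minimal Python sketch of that idea follows. It is not the Perl source of comparelines.pl, and the function name lines_only_in_first is hypothetical. This README does not say whether the real program preserves file order or repeats a duplicated unmatched line, so the sketch assumes that each unmatched line is reported once, in the order first seen.

```python
import os
import tempfile


def lines_only_in_first(path1, path2):
    """Return the lines of path1 that never occur verbatim in path2."""
    with open(path2) as f2:
        seen_in_second = {line.rstrip("\n") for line in f2}
    unmatched = []
    reported = set()
    with open(path1) as f1:
        for line in f1:
            # Strip only the newline; any other whitespace still counts.
            text = line.rstrip("\n")
            if text not in seen_in_second and text not in reported:
                reported.add(text)
                unmatched.append(text)
    return unmatched


# Small demonstration with throwaway files (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    file1 = os.path.join(tmp, "file1")
    file2 = os.path.join(tmp, "file2")
    with open(file1, "w") as f:
        f.write("alpha\nbeta\nbeta\ngamma \n")
    with open(file2, "w") as f:
        f.write("gamma\nalpha\n")
    demo = lines_only_in_first(file1, file2)
    print(demo)
```

In the demonstration, "gamma " with a trailing space does not match "gamma", and the repeated "beta" line is reported only once.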

comparelines.pl essentially treats each line of each file as a large word, and looks for any "words" that appear in file1, but not file2. Also note that it works in this direction only, not vice versa. You must switch the order of the files and run the program again to find lines present in file2, but not file1.

Typical usage:
    ./comparelines.pl file1 file2 outputfile

--------------------
END OF README FILE
--------------------