README-L2L
Created January 28, 2005
Last modified August 15, 2006
by John Newman (newmanj@u.washington.edu)

This document describes L2L, a Perl application that, together with the L2L Microarray Database, can identify novel biological patterns in microarray gene expression data. L2L requires a computer with Perl5 and a UNIX-like command shell.

L2L is released under the GNU General Public License, with the following notice:

--------------------
Copyright (C) 2006 John C. Newman

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
--------------------

--------------------
Contents:
--------------------
I. Introduction
II. Installing the L2L application
III. Summary of switches and arguments
IV. Command-line interface
V. Batch ("execute") mode
VI. File Formats
VII. Setting up L2L on a web server
Appendix A. Batch mode switches and arguments
Appendix B. Sample batch mode commands
Appendix C. Batch mode scenarios

--------------------
I. Introduction
--------------------

L2L is a program for finding novel biological patterns in microarray gene expression data. It compares a user's list of changed genes to several hundred such published lists in the database, and reports which ones have significant overlap with the user's data. It can thus identify the fingerprints of a particular transcription factor or stress agent in the user's data.
It also includes sets of lists that are derived from Gene Ontology categories, and can determine if any particular categories are over-represented in the user's data.

L2L requires three types of files. First, the user's data, in the form of a list of probe IDs that went up or down in the experiment. Second, a translator library for the microarray platform that was used, which maps each probe on that microarray to a HUGO gene symbol. Third, a set of lists to which the user's data will be compared. These lists are all in the form of HUGO gene symbols (thus the need for a translator library).

The L2L application is, in essence, a customizable tool for comparing two lists. One list belongs to the user's data, and the other is taken from the database. L2L first converts the HUGO symbols on the database list to probes on the user's microarray, using the translator library. This list of probes is then compared to the user's list of probes. L2L records how many probes are found on both lists, and computes a p-value for how significant this overlap is. It does this for each list in the database, recording not only the degree of overlap and statistical significance for each list in the database, but also the names of all the probes in the user's data that matched to each database list, and (conversely) the names of all of the database lists that matched to each probe in the user's data. It outputs all of this information as a set of easy-to-browse HTML pages.

L2L is intended primarily to be run on a web server, and used from a web browser. Instructions for installing it on your own web server are included below. However, using the L2L application directly, instead of through the web interface, accords a much greater degree of flexibility and power. It also lets you easily customize almost any aspect of L2L; for example, creating your own translator library files, or new sets of lists.
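
The list comparison described above (translate the symbols on a database list into probes, then see which of those probes are also in the user's data) can be sketched in a few lines of shell. This is only an illustration of the idea, not L2L's own code. The function name "overlap_probes" is made up, and the sketch assumes the file layouts given in Section VI: a tab-delimited translator library (probe ID, then HUGO symbol), a list file whose annotation lines start with "#", and a data file with one probe ID per line. It does not compute the p-value.

```shell
#!/bin/sh
# Sketch only: find the probes shared by a user's data file and one
# database list. Hypothetical helper, not part of L2L.

# overlap_probes DATA LIBRARY LIST
# Prints (upper-cased, one per line) every probe ID that appears in DATA
# and that LIBRARY maps to a gene symbol found in LIST.
overlap_probes() {
    data=$1; library=$2; list=$3
    awk -F '\t' '
        # List file: remember gene symbols, skipping "#" annotation lines.
        FILENAME == ARGV[1] { if ($0 !~ /^#/ && $1 != "") genes[toupper($1)] = 1; next }
        # Translator library: keep probes whose symbol is on the list.
        FILENAME == ARGV[2] { if (toupper($2) in genes) probes[toupper($1)] = 1; next }
        # Data file: print probes that were kept above (skip "#" lines).
        !/^#/ && (toupper($1) in probes) { print toupper($1) }
    ' "$list" "$library" "$data" | sort -u
}
```

For example, "overlap_probes mydata library/u133set lists/l2lmdb/brca1_up | wc -l" would count the shared probes for one list (the paths are placeholders). Identifiers are upper-cased before comparison because probe identifiers are described as case-insensitive.
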
L2L can either be run directly, which will bring up a simple textual interface; or using "batch mode", which bypasses the interface and immediately performs an analysis using the specified files. A variety of command-line switches allow batch mode to automatically perform a large number of analyses on a wide variety of data files, against any or all of the included sets of lists, or any custom sets of lists - all with a single, simple command.

--------------------
II. Installing the L2L application
--------------------

The L2L application package ("L2L.tgz") should include the following files and folders:

    l2l          (the application)
    README-L2L   (this document)
    LICENSE      (a copy of the GNU General Public License, under which this package is distributed)
    styles.css   (for creating the results HTML files)
    img/         (also for the results)
    modules/     (contains a Darwin/PPC version of the Perl module Math::CDF)
    library/     (contains the translator libraries)
    lists/       (contains the four default sets of lists: l2lmdb, gobiol, gocell, and gomole)

The L2L application will run without modification on Macintosh OS X (10.3 or 10.2). Unpack the archive into any folder, and navigate to that folder in Terminal.app.

Generally, L2L should run on any platform with a UNIX-like command shell and Perl5. It also requires one non-standard Perl module, Math::CDF. Most BSD and GNU/Linux installations have all of the necessary components installed by default except for Math::CDF. This Perl module relies on a C-based math library that must be compiled for your specific platform. L2L is distributed with the Darwin/PPC version of Math::CDF installed in the "modules" directory. On other platforms, you must install the module yourself (using "install Math::CDF" within the cpan application, for instance). If you are installing L2L on your personal computer, simply install Math::CDF into the default location; L2L will ignore the Darwin/PPC version and find the platform-correct version that you installed.
L2L assumes that Perl is located at /usr/bin/perl; if this is not the case on your system, you can either edit the first line (#!) of the l2l file to reflect the correct location, or create a symbolic link at /usr/bin/perl that points to the correct location. If you are trying to install L2L on a server on which you do not have root access, see Section VII, Setting up L2L on a web server, for tips.

In addition to Mac OS X-PowerPC, L2L has been tested on IBM AIX 4.3-POWER with Perl 5.6 and GNU/Linux-x86 with Perl 5.8.

--------------------
III. Summary of switches and arguments
--------------------

    l2l [e] [(a|c)dflsw(b|n dataname)] data [library] [lists]

Arguments:
1. Path to data file (default) or directory containing data files (with -b) to be analyzed
2. Path to translator library (ignore if -f)
3. Path to directory with list files (default), or directory containing subdirectories of list files (with -c) (ignore if -a)

-e: bypass the command-line interface and execute with command-line switches and arguments. All other switches and arguments are ignored without -e.

Defining Input:
-a: Analyze against all four default sets of lists (l2lmdb, gobiol, gocell, and gomole).
-b: Batch mode, process all data files in specified directory.
-c: Analyze against all sets of lists in specified directory.
-d: Use single-file database instead of a directory of list files.
-f: Get translator library name from within data file ("#LIBplatform").

Customizing Output:
-l: Send output log to the terminal, instead of to a text file
-n [dataname]: Specifies the name to be used for all output files.
-s: Simple output only; no HTML files are created.
-w: Customized output for web interface.

--------------------
IV.
Command-line interface
--------------------

When L2L is run without any arguments or switches ('./l2l'), it will launch a simple command-line interface that prompts the user for the locations of the three files necessary to run an analysis (data, translator library, and set of lists). See the "File Formats" section for information on the formats of these file types.

The interface only allows fairly simple analyses: one or more data files, all using the same translator library, analyzed against a single set of lists. For more complicated or thorough batch analyses, the "execute mode" (see below) allows substantially greater flexibility.

The first prompt is for the location of the data file (or directory containing several data files). The data file contains a list of probe IDs that were called as significantly changed in the user's microarray experiment. By specifying a directory, multiple files can be analyzed simultaneously, but all must be derived from the same microarray platform (and so use the same translator library).

Second is the location of the translator library to be used. If this is one of the default libraries, the path will be "library/[somename]" (for example, "library/u133set"). If the user is using a custom translator library, this should be the path to that file.

Third is the location of the set of lists against which the data will be analyzed. There are four sets included with L2L: the L2L Microarray Database; and three sets derived from Gene Ontology categories, Biological Process, Cellular Component, and Molecular Function. These are located at "lists/l2lmdb", "lists/gobiol", "lists/gocell", and "lists/gomole", respectively. If the user wishes to use a custom set of lists, they can specify the path to that directory instead.
There are also four commands that can be entered at any prompt:
- "o" to set options
- "i" for credit and copyright information
- "h" to find help
- "q" to quit

From the "o"ptions screen, the user can choose whether to send the output log to the screen (s) or to a text file (f). This is the same choice given by the "-l" switch. Default is to write the log to a file. Note that the output log is not the results; it is a progress report from the program, from loading the translator library to evaluating each list against the data.

When the analysis begins, a new directory will be created wherever the data file is located. All output files will be created in this directory. To browse the output, open the file "mydata_output.html" (where "mydata" is the name of the original data file) in any web browser.

--------------------
V. Batch ("execute") mode
--------------------

L2L is most flexible and powerful when used in batch mode, when all input is passed through the command line and L2L immediately executes the specified analysis. Batch mode is invoked by using the "-e" switch when launching L2L: './l2l -e'. Of course, such a sparse command will only generate an error. You must supply L2L with all of the information it needs (data, translator library, and lists) in a single command. The simplest way to do this is to specify the locations of these files, just as you would in the interface, in the same order:

    ./l2l -e mydata library/u133set lists/l2lmdb

Calling L2L like this is functionally no different from using the interface (or the website, for that matter), although it is faster to type a single line than to navigate through the interface. But the real power of batch mode lies in the variety of switches that can customize both the input and output.
For example, the command:

    ./l2l -eabf mydatafolder

...will analyze all of the data files in mydatafolder, each of which may use a different translator library, against all of the four included sets of lists (l2lmdb, gobiol, gocell, and gomole). For even greater customization, the command:

    ./l2l -ebcf mydatafolder customlistfolder

...will analyze all of the data files (as above) against all of the sets of lists within the directory "customlistfolder".

Note that in the above command, only two arguments were passed, data and lists. But they were still in the same relative order ('data library lists'). If a switch renders an argument unnecessary, that argument should be left out of the command. But the remaining arguments should still be in the same relative order ('data library', or 'data lists').

L2L is designed to be easily customized. All of the files it uses are plain-text files, which can be created from any text editor (see File Formats below). It is easy to create a custom translator library, for example, and drop it into the "library" folder of L2L. Or to create a few new lists for the microarray database, perhaps relating to a specific topic that is of interest, and drop them into the "lists/l2lmdb" folder. Or to create an entirely new set of lists, maybe based on a protein-protein interaction database, and drop the folder containing all of these new lists into "lists". All of these new components are immediately available for use by L2L. We hope that L2L will be more useful to more investigators, thanks to its ease of customization; and we also hope that anyone who creates new components will consider contributing them back to the community, through the "Contribute" page on the L2L website.

A brief description of all of the possible switches and arguments is provided above (Section III). For a more detailed description, see Appendix A. Appendix B contains several sample batch mode commands with explanations.
And, finally, Appendix C contains several sample scenarios where batch mode might be useful, and an explanation of how to take advantage of it.

--------------------
VI. File Formats
--------------------

L2L uses simple, tab-delimited formats for all of its files. The three types of files it needs are data files, translator libraries, and database list files.

DATA FILES

Data files contain a user's own experimental data - the list of genes that were up- or down-regulated in a microarray experiment. Genes that were up-regulated and genes that were down-regulated should be put in separate files and analyzed separately. The file is simply a list of unique probe identifiers for the particular microarray system that was used, one identifier per line:

    probeID1
    probeID2
    probeID3

Support for several popular microarray systems is built into L2L. If your microarray system isn't among them, you can create your own translator library. A table listing the supported microarray systems and a few sample probe identifiers can be found at the L2L website, on the "format.html" page. In general, the format of the identifiers is exactly that used by the manufacturer. Note, however, that U133 Set identifiers include the chip ID (A or B). All identifiers are case-insensitive (e.g. 200007_at and 200007_AT are both fine).

A special translator library called "All HUGO Names" is intended to be used for gene annotation - you can put a few genes you want to annotate in your data file, use this translator library, and see which L2L Microarray Database lists your genes of interest are found on. "All HUGO Names" can also be used as a default "microarray system" if your microarray is not represented and you do not want to create a translator library for it. However, L2L's statistical analysis relies on knowing how many genes are actually on your microarray, and how many of those were changed in your experiment.
Therefore, you should not put much faith in any p-values or fold-enrichment numbers if you use "All HUGO Names". It is also very easy to create a new translator library (see below), so we highly recommend you do this if L2L doesn't include a translator library for your microarray system.

TRANSLATOR LIBRARIES

A translator library allows L2L to translate gene names to microarray probe identifiers and back. It is a tab-delimited file with a paired probe identifier and HUGO name on each line:

    probeID1    XYZ1
    probeID2    ABCD1
    probeID3    HUJA6
    ...

The probe identifiers can be anything, as long as they match the probe identifiers you use for your data. Try to avoid special characters, however. The web-interface of L2L will warn you if your uploaded translator library has improper characters in it (this is a security measure). Gene names must be official HUGO gene symbols in order for L2L's gene annotation functions to work (linking to EntrezGene, for example).

DATABASE LISTS

Each list in the database is a file with a few annotations at the top, followed by the HUGO gene symbols of all the genes on that list, one per line:

    #L2L    listfile
    #NAME    brca1_up
    #REFERENCE    12032322
    #DESCRIPTION    Upregulated by induction of BRCA1 in EcR-293 cells
    #KEYWORDS    cancer
    #PLATFORM    HuGeneFL
    #RELEASE    2006.2
    FSTL1
    GALNT3
    SEC10L1
    HTATIP
    ...

The annotations (e.g. brca1_up) must be separated from their identifiers (e.g. #NAME) by tabs. This allows spaces to appear in the description without confusing the program.

The first annotation line tells the L2L application that this is, indeed, a list. This line should be the same for all list files. The second line is a short, informative name for the list (usually the same as the file name). It should contain only alphanumeric characters and underscores. The third line is a reference to the source of the list. For L2L Microarray Database lists, this is the PubMed ID of the source publication. The fourth line is a description of the list.
It can be as long as necessary, and can include any character except tabs. The fifth line can contain one of a number of keywords for browsing the database and (in a future revision of L2L) restricting searches to particular topics. Current keywords. The sixth line describes the platform (microarray or otherwise) that was used to generate the data encompassed by the list. The seventh (optional) line contains a release version. Every other line in the file contains one of the genes on the list.

Combining all lists into a single file can speed batch processing considerably. A single-file database is simply a concatenation of all list files, with each list separated by a new line containing "##". Database files are included with the L2L application for all default sets of lists. This option can be invoked with the -d switch.

--------------------
VII. Setting up L2L on a web server
--------------------

General requirements for installing L2L on a web server are provided below, but the details for how to fulfill these requirements will vary from platform to platform. Most modern platforms come with many of the required components built-in.

L2L was developed on Mac OS X, and will run without modification on OS X 10.2 or later. However, it does require that you turn on the built-in webserver and make several changes in its configuration. Detailed instructions for this are provided below.

Note that in order to reduce the size of the website download, archived older versions of the website are not included in the download of the latest version. For example, the L2L website contains links to the archived 2005.1 and 2006.1 versions. If you download the current "Complete L2L Website", these archives are not included. You may, however, download them separately using the links on the "Revision History" page, and then drop them into the "downloads" folder.
GENERAL INSTRUCTIONS:
---------------------

Running L2L on a web server requires the following components and web server settings:

Components:
- Perl5 or later, located at /usr/bin/perl
- tar that supports -c and long pathnames (e.g. GNUtar)
- grep with -r (recursive) function (GNU grep 2.3 or later)
- Perl module CGI.pm (part of the default Perl5 install)
- Perl module Math::CDF (not part of the default Perl install, and must be compiled for your specific platform)

Web server settings:
- ability for Perl scripts to write to the L2L/cgi-bin/temp directory
- ability to execute CGI scripts from the L2L/cgi-bin directory
- ability to follow symbolic links within the L2L directory

Macintosh OS X, the various BSDs, and most distributions of GNU/Linux come with all of the necessary components installed by default, with the exception of Math::CDF. Other UNIX systems may use non-GNU versions of tar and grep that lack the necessary features. L2L has been tested on Mac OS X-PowerPC, GNU/Linux-x86 with Perl 5.8, and IBM AIX 4.3-POWER with Perl 5.6. The last required a custom installation of GNUtar and GNU grep. GNUtar and GNU grep can be downloaded from ftp.gnu.org/gnu.

Download the "L2L-Website.tgz" file from the L2L website. Unpack this archive into any directory on your web server. Many of the files in the archive have long pathnames, and you will need GNUtar or another tar program that can handle these pathnames properly. After unpacking, move the archive to the "L2L/downloads" folder. This will allow others to download the complete L2L from your website, just as you did from ours.

The Perl module Math::CDF relies on a C-based math library that must be compiled for your specific platform. L2L is distributed with the Darwin/PPC version of this module installed in the cgi-bin/module directory. On other platforms, you must install the module yourself (using "install Math::CDF" within the cpan application, for instance).
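
As a rough aid, the components listed above can be checked from the shell before going further. The sketch below is not part of L2L, and the function names ("have_cmd", "check", "check_l2l_components") are made up for illustration. It only tests that each tool or module can be found; it does not verify that tar handles long pathnames or that grep supports -r.

```shell
#!/bin/sh
# Sketch only: report which of the required components appear to be present.
# Hypothetical helpers, not part of L2L.

# have_cmd NAME: succeed if NAME is found on the current PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# check LABEL COMMAND [ARGS...]: run COMMAND quietly and print one status line.
check() {
    label=$1; shift
    if "$@" >/dev/null 2>&1; then
        echo "ok       $label"
    else
        echo "MISSING  $label"
    fi
}

check_l2l_components() {
    check "perl at /usr/bin/perl" test -x /usr/bin/perl
    check "tar on PATH" have_cmd tar
    check "grep on PATH" have_cmd grep
    # perl -MModule -e 1 exits with a non-zero status if the module cannot be loaded.
    check "Perl module CGI" /usr/bin/perl -MCGI -e 1
    check "Perl module Math::CDF" /usr/bin/perl -MMath::CDF -e 1
}
```

Running "check_l2l_components" after loading these functions prints one line per component. A "MISSING" line for Math::CDF is expected on a fresh system, since the module is not part of the default Perl install.
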
If you do not have root access to your web server, things are a bit more complicated. All of the required components (tar, grep, and Math::CDF) can be installed somewhere within the cgi-bin directory, but the various CGI scripts will have to be edited to look in the new location (for instance, replace "grep" with "newlocation/bin/grep" in any backticks or system calls). Or you could create an alias command to map "grep" to "newlocation/bin/grep", which will save you from having to edit the CGI scripts. If you install Math::CDF within the cgi-bin directory, you should edit the third (uncommented) line of the l2l script with the correct location.

If Perl is located anywhere other than /usr/bin/perl, you will have to change the first line (#! ...) of each script file to reflect the correct location.

Finally, you will need to make the necessary adjustments to the web server configuration. Many professional web-hosting services already permit execution from anywhere in your home directory, will follow symbolic links by default, and run your CGI scripts with the same permissions as your user. If this is the case, no further configuration is required. If not, see the instructions for Mac OS X, below, for an example of how to make these configuration changes in a typical Apache installation.

Mac OS X INSTRUCTIONS:
---------------------

These instructions are for MacOSX 10.4 or 10.3 (10.2 should also work, but has not been tested). They assume that L2L will be installed in your user directory ("Home folder"), that the built-in web server has not been modified, and that you have the "BSD Subsystem". The BSD Subsystem (a collection of command-line utilities) is installed by default on MacOSX; but if, in customizing your installation, you chose not to install it, you can install it now from the MacOSX install CD.

First, download the "L2L-Website.tgz" file from the L2L website. Place this file into your ~/Sites folder, and unpack it.
Once it's unpacked, move the original .tgz file into the "L2L/downloads" folder.

Now, the tricky part. MacOSX comes with the industry-standard Apache web server built-in. However, you need to customize it in order to be able to run L2L. This involves modifying two files: "/etc/httpd/httpd.conf" and "/etc/httpd/users/YOURUSERNAME.conf". Open these files in any text editor. If you use the Terminal to open them (e.g. with pico), make sure you use "sudo", or else you won't have permission to save your changes.

/etc/httpd/httpd.conf
In this file, you need to give Apache permission to run CGI scripts. Scroll down past the languages section, until you see the "AddHandler" section. Uncomment the line "AddHandler cgi-script .cgi" (remove the leading "#"). Save the change.

/etc/httpd/users/YOURUSERNAME.conf
Change the second line of this file ("Options") to read: "Options -Indexes MultiViews ExecCGI FollowSymLinks". Save the change.

Next, you need to give L2L permission to create and modify files in the L2L/cgi-bin/temp directory. Select the "temp" folder in the Finder, and choose "Get Info". Expand the "Details" section of "Ownership & Permissions", and set Others to "Read & Write" access.

Finally, start the built-in Apache web server. Open up the "Sharing" Pane of System Preferences, and click the button next to "Personal Web Sharing".

To test if everything worked, open up Safari, and enter the URL "http://localhost/~YOURUSERNAME/L2L/". The L2L home page should appear. Click on the "Browse Database" link to make sure that the CGI scripts are working. If they are, you will see a web page displaying all of the lists in the microarray database. If not, you will instead see a text file with the raw Perl code, or an error message. You can try quitting and relaunching Safari, and stopping and starting the web server (in Sharing). If it still doesn't work, you may need to consult more thorough documentation on running CGI scripts in MacOSX.
Try Googling for "Mac OS X CGI Perl", or reading over the MacDevCenter how-to: http://www.macdevcenter.com/pub/a/mac/2001/12/14/apache_two.html

If all is well, you can now access L2L from your computer using the "http://localhost/~YOURUSERNAME/L2L" URL; or you can access it from any computer from the URL listed in the "Sharing" pane. You should be aware that running a web server opens a potential door into your computer for attackers. Although such an attack is unlikely, you should be sure to keep your system updated (through Software Update), and leave "Personal Web Sharing" off whenever you are not actually using it.

--------------------
Appendix A. Batch mode switches and arguments
--------------------

Arguments:

L2L takes up to three command-line arguments, all of which specify file locations for the three necessary inputs. They must always be given in the correct order:
1. data
2. translator library
3. set of lists

However, depending on which switches are set, not all three may be necessary (described below). Only pass the necessary arguments, but keep them in the correct relative order. The program determines from the switches which arguments it should be looking for, and interprets them correctly.

All switches: abcdeflnsw

Bypassing the command-line interface: e

-e: Execute; this switch bypasses the interface and triggers direct execution of L2L with the input specified by the other switches and arguments passed. If -e is not specified, the interface will launch and any other arguments or switches will be ignored.

For defining input: abcdf

-a: All list sets; analyzes the data with all default sets of lists (as specified in "listsets.txt"). -c and -a cannot be used simultaneously. Renders the list argument unnecessary.
-b: Batch data; the data argument must be a directory, and the program will analyze all files in that directory. -b and -n cannot be used simultaneously.
-c: Custom list sets; the lists argument specifies a custom directory in which may be any number of subdirectories, each containing a set of lists. The program will analyze all data against all of the list sets in this directory. Using -c with "lists" in a default install of L2L is equivalent to using -a (without a list argument). -c and -a cannot be used simultaneously.
-d: The list set argument specifies a single-file database instead of a directory. This can speed large batch runs considerably.
-f: File has library; L2L will look for a line in each data file that specifies which translator library to use with that data. The line in the data file must be "#LIBplatform", where "platform" is the name of one of the files in the library/ directory. Renders the library argument unnecessary.

For setting output options: lnsw

-l: Log output; redirects output of the progress log from a text file (default) to the terminal.
-n [dataname]: Name; the only switch that must be followed by an argument, this being a short name for the analysis being run. All output files will be named using this name, rather than by the file name of the data (the default). The name should be short, without any spaces or strange characters. This switch (and the name that follows it) must be entered before any other arguments (but -n can be the last switch). -n and -b cannot be used simultaneously.
-s: Simple output; the program will skip all of the HTML output. The only output will be the log output, and the raw text table output. This includes all of the statistics for overlaps between the data and all lists, but lacks any information about which specific genes matched to which lists.
-w: Web interface; this switch is intended for use only by the L2L web interface (2l2l.cgi). It changes two behaviors of L2L; first, if no other switches (except -n) are used, L2L skips creation of an output file directory. This is because 2l2l.cgi already creates such a directory.
Second, when everything is finished, L2L will tar and gzip the entire output directory. This is only required (or desirable) in the context of a web server from which the results must be downloadable.

--------------------
Appendix B. Sample batch mode commands
--------------------

    ./l2l

Will bring up the command-line interface, where the user will be prompted to input the file paths to data, translator library, and lists. None of the batch-processing options (like -c, -b, -f) are available from within the interface.

    ./l2l -e data/mydata library/u133 lists/l2lmdb

The simplest -e command. "mydata" must be a text file with a list of probe IDs from the Affy U133 microarray; the data will be processed against the L2L MDB.

    ./l2l -e -n l2loutput data/mydata library/u133 lists/l2lmdb

Same as above, but the output files will be named "l2loutput" instead of "mydata".

    ./l2l -ed -n l2loutput data/mydata library/u133 lists/l2lmdb.db

Same as above, but the program will use the database file "l2lmdb.db" instead of a directory full of list files.

    ./l2l -abef data

The most powerful batch-processing command. The directory "data" can contain any number of data files (-b), all of which must have a line in them specifying the translator library to be used (-f). All of the data files will be analyzed against all four default L2L sets of lists (-a). Optionally, use -d to speed the run by using single-file databases instead of the default directories-of-files.

    ./l2l -bcef data lists

Same as above, but all of the data in "data" will be processed against all of the sets of lists in "lists". The directory "lists" must contain one or more subdirectories of L2L list files (-c). All of the subdirectory names should be short, without spaces or strange characters. This is intended for users who create a custom set of lists, and want to be able to analyze their data against their custom set as well as some or all of the default sets simultaneously.

--------------------
Appendix C.
Batch mode scenarios
--------------------

1. Your experiment produced an "up" list and a "down" list of changed genes, and you want to do a standard analysis on both.
   Prep: Create data files for the two lists, and place both files into an empty directory.
   Command: ./l2l -eab datadirectory library/platformname

2. You have performed several experiments using different microarray platforms, and want to do a standard analysis on all of the data simultaneously.
   Prep: Create a data file for each list of changed genes, and place at the top of each file a line that specifies the translator library to be used for that list ("#LIBplatformname").
   Command: ./l2l -eabf datadirectory

3. You want to use L2L to analyze your microarray data, but a translator library for your microarray platform is not included in L2L.
   Prep: Create a translator library file for your platform (see Section VI: File Formats). Give the file a short but meaningful name ("u999xplus") and drop it into the library folder.
   Command: ./l2l -ea datafile library/u999xplus

4. You want to create a new set of lists, and analyze your data only against these custom lists.
   Prep: Create the list files (see Section VI: File Formats), and place them all into an empty directory. Give the directory a short but meaningful name ("pdblists").
   Command: ./l2l -e datafile library/platformname path/to/pdblists

5. Like 4, but you want to do a batch analysis that includes lots of data files, created with several microarrays, against all of the default sets of lists plus the new custom one.
   Prep: Make sure all data files have the "#LIB" line in them. Drag the "pdblists" folder into the "lists" directory.
   Command: ./l2l -ebcf datadirectory lists

--------------------
END OF README FILE
--------------------