SeqHelp - A Tool for Assisting Molecular Sequence Analysis

Table of Contents

	Why SeqHelp
	What it does
	What it does not do
	Recommended software
	How it works
	Installation
	Using SeqHelp
	Some hints
	Some caveats
	Some Q&As
	An example of SeqHelp applications
	Select publications of work assisted by SeqHelp

Why SeqHelp

    Many research activities in molecular biology generate sequences in
    various quantities.  To gain understanding of the sequences, some
    analyses need to be performed on the sequence data.  Common analyses
    include database searches, multiple alignments, open reading frame
    analysis, genomic structure analysis, identification of variations among
    sequences, and, of course, management of the data.  Specialized programs
    have been designed for these functions, but most of the time each
    analysis needs to be performed separately (e.g. database search;
    alignment of a sequence against a group of others; prediction of genomic
    structure; identification of variation).  Whereas a top-level view of
    information is desirable in some cases, a refined view of the data at the
    base level is more desirable in many projects.  Genomic sequence
    annotation requires particularly detailed, low-level information.
    Computer hardware requirements are another limiting factor in the use of
    software.  In addition, use of the existing programs often requires much
    training, and summary information for sequences from the same project may
    not be available.  SeqHelp seeks to help with sequence analysis while
    requiring minimal effort: it lets the experimental biologist use a
    familiar interface, a hypertext browser, to perform several analyses
    simultaneously and to access data over the internet.  It provides an
    integrated approach to data analysis in the laboratory, or remotely over
    the internet, almost independent of hardware.


What it does

    Seqhelp organizes information pertinent to molecular sequence analysis
    to assist scientists using familiar web-page based analysis.  It collects
    relevant database search results and identifies certain information with
    respect to a sequence, and generates hypertext files which will result in
    web pages.  Through these web pages, a scientist can study the identified
    features, and possibly other relevant information at remote databases and
    libraries over the internet instantly, to decide on the next experimental
    steps.  The results organized by Seqhelp can be applied to gene
    identification, sequence annotation, multiple sequence alignment, mutation
    analysis, identification of individual sequences from a population, and
    other projects.


What it does not do

    Except for choosing the database results to include in the web pages,
    Seqhelp does not make many decisions.  However, it does help the researcher
    make decisions.  Many areas of genome research do not have ready answers.
    For example, in a gene identification project, a novel gene may only show
    weak similarity to some existing genes.  Some novel genes or regulatory
    units do not show much similarity to existing, known entities.  In such
    cases an automatic answer is not readily available, and any candidate
    should be examined together with other information on the sequence and
    with experimentation.
    Therefore, Seqhelp leaves the burden of decision to the scientist after
    careful analysis of the gathered information.

    It does not (yet) provide a top-level display of the sequence structure.


Recommended Software

    SeqHelp currently runs on the UNIX platform, but its output can be used on
    any platform.  Since it is written in C, it may be possible to compile on
    other platforms when proper interfaces are provided.  It works with these
    programs:

    1.  The blast suite of database search programs (blastall, blastcl3;
    Altschul et al., 1990), in the version appropriate for your operating
    system.  On a UNIX system these programs can establish local
    blast-searchable databases and search public databases over the internet,
    with HTML-format output.  You will need the standalone programs if you
    want to establish and search a local database (make sure you also get
    formatdb).  You will need the network version of the program if you want
    to search over the internet.  If the executables don't run on your
    system, you will probably need to get the source code for the blast
    programs and compile it on your system.

    The latest suite of blast programs incorporates gaps in the search.  In
    principle, these are better programs to use and the matches are more
    meaningful.  Version 1.0/1.1 of SeqHelp does not incorporate gaps.
    One main reason pertains to the translations in the sequence.  That is,
    when gaps are introduced, codons also will be modified and shifted, along
    with the amino acids displayed.  For gene identification studies, SeqHelp
    remains with the ungapped version, although a gapped version may be
    developed.  A new version (1.0p) incorporating gaps has branched out, and
    is more suitable in population study contexts (for example, comparing
    groups of genomic, cDNA, or RNA sequences over the same region, and
    identifying unique bases/sequences).

    In the current version of SeqHelp, hypertext links from the aligned
    sequences are provided only for individual entries in the public
    databases.  It should be fairly easy to modify the program to accommodate
    locally maintained databases, provided indexing information is available
    for the individual records.

    2.  GenScan for gene prediction on a molecular sequence (C. Burge)

    3.  RepeatMasker for identifying and masking repeat elements in a sequence
    (Smit & Green).

    4.  The phred/phrap/cross_match programs for sequence generation and
    assembly from electropherograms generated by an automatic sequencer (Green).

    Phred/phrap are not actually called by SeqHelp, but were used to
    generate sequences from chromatograms.  They can, however, be combined
    with SeqHelp by a simple script.  Quality scores from phred/phrap are
    being incorporated into an upcoming upgrade of SeqHelp v1.0p for sequence
    variation studies.

    5.  PolyPhred for identifying putative polymorphisms (only used with
    SeqHelp version 1.0p) (Nickerson et al.).

    6.  Perl 5 (required for RepeatMasker and any Perl scripts used).

    7.  Auxiliary programs to SeqHelp.
	Getseqs:  takes each sequence from a single file in fasta format and
	    creates a file for this sequence to be processed by SeqHelp.
	    Usage:  getseqs file_name
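
    The splitting that getseqs performs can be sketched as follows.  This is
    only an illustration in Python, not the getseqs source: the function
    names split_fasta and write_records are hypothetical, and naming each
    output file after the first word of its header line is an assumption
    made for the example.

```python
import os


def split_fasta(text):
    """Return a list of (name, record_text) pairs, one per fasta record."""
    records = []
    name, lines = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            # A new header closes the record collected so far.
            if name is not None:
                records.append((name, "\n".join(lines) + "\n"))
            words = line[1:].split()
            # Assumption: the first word of the header names the record.
            name = words[0] if words else "unnamed"
            lines = [line]
        elif name is not None and line.strip():
            lines.append(line)
    if name is not None:
        records.append((name, "\n".join(lines) + "\n"))
    return records


def write_records(records, directory="."):
    """Write each record to its own file; the file naming is an assumption."""
    paths = []
    for name, record_text in records:
        path = os.path.join(directory, name)
        with open(path, "w") as handle:
            handle.write(record_text)
        paths.append(path)
    return paths
```

    Each file written this way holds a single sequence in fasta format and
    can be passed to SeqHelp as file_name.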

    None of the above is required, but the functionality of Seqhelp may depend
    on those present.

    The above are not endorsements of any program.  The programs mentioned
    above were used in our work and SeqHelp was implemented to work with them.

    The following are required:

    8.  A C-compiler.
    9.  A web-browser.


How it works

    SeqHelp takes its sequence from a fasta-format file.  If the data come
    from an automatic sequencer and still need to be converted into sequence,
    use phred to call the bases from the chromatograms and phrap to assemble
    them into contigs if the sequences overlap, or leave them as singlets
    otherwise.  The sequences so generated are again in fasta format.  The
    user may of course use other software for base calling and assembly.
    Depending on the options specified, SeqHelp then calls RepeatMasker to
    identify and mask the repeat elements and GenScan to predict exons, and
    predicts regions of high CG content (CpG islands).
    Blast is used to search the local database, if one is available, and the
    non-redundant public databases, plus the EST, HTGS, GSS, and STS databases.
    SeqHelp then collects the results and organizes them into an HTML file,
    displaying the predicted exons and CpG islands, identified repeat elements,
    and relevant database search results, with hypertext links, in alignment
    with the query sequence.  A hypertext browser can then be used to analyze
    the results.


Installation

    For the recommended software, follow the installation instructions that
    come with each distribution.

    SeqHelp can be compiled by issuing
    		cc -o seqhelp seqhelp.c -lm
    The executable 'seqhelp' and its auxiliary programs should be placed in a
    directory in your search path.

    Address questions regarding installation to Ming Lee.


Using SeqHelp

    Important:
	Always back up your files from previous work before starting new
	analyses.
	Always check to make sure that you have sufficient free disk space
	available, since the files containing database search results and
	hypertext files can be quite large.

    It is highly recommended that you study sequences for individual projects
    in a separate directory to prevent accidental interference with other
    projects.

    If you plan on analyzing sequences with local data (i.e. sequences generated
    in your research), you should establish a local database for the relevant
    project before invoking SeqHelp.  This can be accomplished rather easily.
    You will need to have your sequences in a fasta-format file (say seqs).  All
    lines containing sequence data in the file need to be of the same length,
    except for the last line.  The character X is not accepted in this
    application, so change it to something else distinctive.  Then issue
    (assuming all nucleotides)
	formatdb -i seqs -p F
    (formatdb is a utility program from NCBI for building a blast searchable
    database).  Three files will be generated: seqs.nsq, seqs.nhr, seqs.nin.
    Move these three files to the directory where blast can search the database
    (specified in the .ncbirc file).
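
    Before running formatdb, the two requirements on the fasta file stated
    above (sequence lines of equal length except the last, and no X) can be
    checked with a short script.  The following Python sketch is an
    illustration only; the function names are hypothetical, and two
    assumptions are made: the line-length rule is applied to each sequence
    record separately, and a lower-case x is treated like X.

```python
def record_problems(header, seq_lines):
    """Return the problems found in one fasta record."""
    problems = []
    # Every line except the last must have the same length.
    if len({len(s) for s in seq_lines[:-1]}) > 1:
        problems.append(f"{header}: sequence lines differ in length")
    # Assumption: a lower-case x is rejected just like X.
    if any("X" in s.upper() for s in seq_lines):
        problems.append(f"{header}: contains the character X")
    return problems


def check_fasta(text):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    header, seq_lines = None, []
    for line in text.splitlines():
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                problems += record_problems(header, seq_lines)
            header, seq_lines = line[1:], []
        elif line and header is not None:
            seq_lines.append(line)
    if header is not None:
        problems += record_problems(header, seq_lines)
    return problems
```

    For example, check_fasta(open("seqs").read()) returns an empty list when
    neither problem is present in the file seqs.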

    Seqhelp 1.0/1.1 is invoked from the command mode by issuing

	seqhelp file_name project_name [-][bcehilnstuxyz] [-][d v] [-][P v p] [-][N v p] [-][L v p] [-][R s] [-G O]

    where file_name is the file containing the sequence data in fasta format,
    and project_name is the project from which the sequences are generated.
    The fasta format is chosen purely for convenience, as it seems to be the
    most widely used.  The project name must be the unique name of the local
    blast-searchable database in which the sequences related to the project
    are stored.  If no local database for the project is available, a dummy
    name must be used in its place, and the local database search must be
    suppressed with the h option.

    The optional parameters to the command are as follows:
    '-' by itself makes the program do nothing except explain
	the available options.
    '-' followed by one or more of 'b', 'c', 'e', 'g', 'h', 'i', 'l', 'n',
        'r', 's', 't', 'u', 'x', 'y', 'z' toggles (between the default value
	and its complement) the respective actions to be taken.
	b:  do not remove bacterial (E. coli) sequence.
	    Default is remove.
	    *** This option is important when studying sequences that ***
	    *** may contain unremoved E. coli sequences, even if only ***
	    *** a fraction of the sequence is E. coli.  Therefore if  ***
	    *** you are not certain that the sequence does not have   ***
	    *** E. coli sequence, set this option to no-removal with  ***
	    *** the -b option.                                        ***
	c:  split the sequence into smaller segments for analysis.  This
	    may sometimes make the database searches run faster.
	    Default is no splitting.
	    (A sequence longer than 6000 bp is automatically split for
	     analysis.)
	e:  no search for the EST database.
	    Default is search.
	    Database search results in file file_name.est.html.
	h:  no search for local sequences in the project.
	    Default is search.
	    Database search results in file file_name.l.html.
	i:  do not predict CpG islands.
	    Default is predict.
	l:  include local sequence names in summary report.
	    Default is do not include.
	n:  no search for nucleic acid database.
	    Default is search.
	    Database search results in file file_name.nr.html.
	s:  no direct reference to sequence data (no longer used).
	    Default is no reference.
	t:  do not translate sequence into amino acids.
	    Default is translate.
	u:  no search for the STS database.
	    Default is search.
	    Database search results in file file_name.sts.html.
	x:  no search for amino acid database.
	    Default is search.
	    Database search results in file file_name.p.html.
	y:  no search for the GSS database.
	    Default is search.
	    Database search results in file file_name.gss.html.
	z:  no search for the HTGS database.
	    Default is search.
	    Database search results in file file_name.htgs.html.
     The following parameter must be followed by an argument: [-][D v]
	where d is for the display length per line
	      v: integer for v bases to display per line (80)
     The following parameter must be followed by an argument: [-][R s]
	where s is for the species chosen from
	      r (rodent specific and mammalian wide repeats)
	      m (non-primate, non-rodent mammals)
	      a (Arabidopsis thaliana)
	      d (Drosophila)
     The following parameters must be in groups of three: [-][LNP] v p
	where L, N, P stand for Local, Nucleotide, and Protein, respectively.
	      v: integer for the percent similarity (70 for L, N; 50 for P)
	      p: real number for the probability of random match
		 (.01 for L, N; .5 for P)
     The following parameter must be followed by an argument: [-][G O]
	where G is for gene prediction.  The default is no prediction.
	      O: name of the organism from which the sequence was derived.
		 Choices include Human, Arabidopsis, and Maize.  If an organism is
		 misspecified, human is chosen.
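
    The toggle behaviour of the single-letter options can be illustrated
    with a small Python sketch.  It is not taken from the SeqHelp source:
    the names DEFAULTS and apply_toggles are hypothetical, only letters
    whose defaults are stated above are modelled, and it is assumed that a
    letter given more than once is flipped only once.

```python
# Defaults as stated in the option list (True = the action is performed).
DEFAULTS = {
    "b": True,   # remove E. coli sequence
    "c": False,  # split the sequence into smaller segments
    "e": True,   # search the EST database
    "h": True,   # search the local project database
    "i": True,   # predict CpG islands
    "l": False,  # include local sequence names in the summary report
    "n": True,   # search the nucleic acid database
    "t": True,   # translate the sequence into amino acids
    "u": True,   # search the STS database
    "x": True,   # search the amino acid database
    "y": True,   # search the GSS database
    "z": True,   # search the HTGS database
}


def apply_toggles(option_string):
    """Return the settings after flipping each listed letter once."""
    settings = dict(DEFAULTS)
    # Assumption: repeated letters count once; unmodelled letters are ignored.
    for letter in set(option_string.lstrip("-")):
        if letter in settings:
            settings[letter] = not settings[letter]
    return settings
```

    With the option string -xuyz, for instance, the searches that remain
    switched on in this sketch are e, h, and n: the EST, local, and nucleic
    acid databases.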

    The resulting annotated sequence information will be in an HTML file named
    file_name.anno.html.  You can then use a web browser to open this file,
    look at the results, and follow the links to analyze your sequence.  The
    summary file for all sequences analyzed will be in the file info.html.

    The data files and their associated hypertext files can be placed on a
    web server for analysis on a remote computer connected to the internet.
    You may also download the files to a personal computer where a hypertext
    browser and sufficient disk space are available.


Some hints

    Seqhelp can display DNA/cDNA/mRNA sequence against DNA/cDNA/mRNA and amino
    acid sequences in alignment.  Displaying an amino acid sequence against
    other amino acid sequences is not formally implemented, but can be done with
    some manipulation.  Displaying amino acid sequences against DNA/cDNA/mRNA
    sequences is not yet implemented.

    Depending on the purpose of your work, all of the database searches and
    predictions are optional.  Thus, you have a few different ways to lay out
    the display of your sequence in relation to other sequences.  (In subsequent
    text, 'data sequence' will refer to a sequence that you want to study).

    To prevent removal of sequences fractionally containing E. coli sequence,
    issue
	seqhelp file_name project_name -b
    Other options can also be included.

    If you only want to see whether your novel data sequence is similar to
    something in the public databases, you can issue
	seqhelp file_name project_name -h
    The local database is not searched.  If you don't want the translations of
    the six open reading frames, add t to the list of options.

    If you are concerned that searching the amino acid databases takes too much
    time, and only want a quick look at the available nucleotide and EST
    databases, then issue
	seqhelp file_name project_name -xuyz
    The search will be conducted against the nucleotide, EST, and (actually)
    the local database, because the h option was not used.

    Say you are running project X that produces a group of sequences.  Then
    you generate sequence Y and want to compare Y to its most similar sequences
    in X (for example, search for identical sequences of Y in X).  Prepare X
    as a local database, and issue
	seqhelp Y_file_name X_database_name -nextiguyz
    because you are not concerned about the public databases, the six reading
    frames, or any repeat elements there might be in Y.  Only the sequences in
    the local database are compared.

    If you are undertaking genomic sequencing for a gene (or a homolog of it)
    whose mRNA sequence is available, you can examine how well the genomic
    sequences you have generated cover the gene by issuing
	seqhelp G_file_name X_database_name -exuyzb
    where G_file_name is the file containing the mRNA sequence of the gene, and
    X_database_name is the local database containing your sequencing data.  It
    is assumed that you are not interested in comparing against the EST
    database.  You may also not want to compare this gene to the public data,
    or care about the repeat elements, in which case you can add more options
    to skip the searches that you don't want.


Some Caveats

    Since database searches can produce very large amounts of data and some
    HTML files can become very large, the demand for disk space can be
    considerable.  Therefore, it is always good policy to make sure that a
    large amount of free disk space is available, and to regularly remove
    outdated files or move them to archives.

    For a sequence longer than 11000 bp (as it is set for blast), SeqHelp
    breaks it down into segments up to (conceptually) 6000 (6500 in reality)
    bp, numbered from 0, so that segment i (i = 0, 1, ...) contains bases
    i X 6000 + 1 to (i+1) X 6000.  The segments for sequence S are then named
    as S.0, S.1, ..., and so on.  SeqHelp then takes each segment, performs the
    requested tasks, and assembles the results back into a complete annotated
    sequence.  So the process is almost transparent.  However, if you are
    interested in a piece of data beyond base 6000 and follow the link to the
    related database search results, you will see a different number for the
    bases in the query (data) sequence.  This happens because for each segment,
    blast treats the first base as base 1, and SeqHelp does not convert these
    base numbers.  So if you are looking at the database results
    in segment i (where from the location window in your web browser you will
    see the segment number), base j in segment i corresponds to j + i X 6000
    in the original sequence.  This also applies when you set the -c (split
    sequence) option.
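
    The base-number conversion described above is simple arithmetic and can
    be written out as a short Python sketch.  The function names are
    hypothetical, and the conceptual segment length of 6000 bp is used, as
    in the formulas above.

```python
SEGMENT_LENGTH = 6000  # conceptual segment length used in the text


def segment_range(i):
    """First and last original base (1-based, inclusive) of segment i."""
    return i * SEGMENT_LENGTH + 1, (i + 1) * SEGMENT_LENGTH


def to_original(i, j):
    """Original base number for base j of segment i."""
    return j + i * SEGMENT_LENGTH


def locate(position):
    """Segment number i and base j within it for an original base number."""
    i = (position - 1) // SEGMENT_LENGTH
    return i, position - i * SEGMENT_LENGTH


def segment_name(sequence_name, i):
    """Name of segment i of a sequence, e.g. S.0, S.1, ..."""
    return f"{sequence_name}.{i}"
```

    For example, base 150 reported in the search results for segment S.2
    corresponds to to_original(2, 150), that is, base 12150 of the original
    sequence S.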


Some Q&As

    Q: Why yet another sequence visualization tool while there are already
       quite a few out there?
    A: Although most tools may have been intended to be for general use and
       may be used in many different situations, each tool is mostly influenced
       by the applications and circumstances during its development.  We have
       examined a number of visual analysis tools for sequence analysis and
       sequence project management, and did not find one that completely matched
       our needs.  Hence the birth of SeqHelp.  It is intended to be easy to use
       by experimental biologists at any level of computer sophistication, and
       to be generally applicable to various sequence analysis projects.  In
       many respects, SeqHelp offers more thorough sequence annotation and
       analysis.

    Q: Why does SeqHelp produce text-only files?
    A: The program was designed mostly for biologists studying molecular
       sequences: comparing homologies or similarities of sequences at the
       nucleotide level.  Although determination of gene structure has been
       part of some projects, there has been no compelling demand for
       high-level displays.  For the most part, a high-level (i.e., abstract
       graphical) display of the sequence structure would have been needed in
       a very small percentage of the analyses.  On the other hand, base pair
       level matches/mismatches, intron/exon boundaries, etc., have been the
       elements that overwhelmingly require thorough study.  Hence the present
       state of the program, which displays sequences at the base level as
       text.  In addition, text files can be easily edited without specialized
       software.  There are, of course, many fine programs for displaying
       high-level structures graphically.

    Q: Why are database searches limited to the non-redundant nucleotide and
       amino acid databases, dbEST, GSS, HTGS, STS, and your own local database?
    A: SeqHelp was started as a tool for sequence analysis in positional cloning
       projects, with applications to other projects explored later.  Most
       of the needs in sequence analysis therefore were to: compare a novel
       sequence to the known nucleotide and amino acid sequences to see if it is
       similar or homologous to (or even is) an existing gene; decide if the
       novel sequence contains any ESTs as indicators for genes; and look at the
       constituent sequences from the local sequencing project to examine the
       progress of the project.  Other databases can, in principle, be
       incorporated with some additional lines of code.  In fact, the number of
       databases searched has been expanded from the original group.

    Q: Why does SeqHelp only incorporate other programs and not more generic
       analysis methods?
    A: The programs used in SeqHelp are among the best work in their respective
       areas of research, and serve the needs of our research and, we believe,
       that of many other investigators.  This is the fastest way to apply
       existing resources efficiently.  In any line of work, if some product
       already exists, there is little need to reinvent it.

    Q: Is there a web-based interface to SeqHelp?
    A: Cliff Olmsted designed a web interface to SeqHelp for the Bioinformatics
       Resources server at the University of Washington.


An example of SeqHelp applications

    BRCA1 mutations confer significantly elevated risks for developing human
    breast cancer.  Research in this gene, as well as other cancer-related genes,
    will provide insights into the prevention and treatment of breast cancer.
    The region on human chromosome 17q21 containing BRCA1 has been completely
    sequenced (GenBank L78833).  A recent annotation of human BRCA1
    by SeqHelp is given here as an example of how SeqHelp can help annotate and
    analyze genomic sequences.  This 117 kilobase sequence was actually
    analyzed by annotating smaller segments and concatenating the results
    together.  The results can be studied at a glance by looking at the summary
    of the annotation.


Select publications of work assisted by SeqHelp

The following is a partial list of work that has been assisted by SeqHelp.

Gasper JS, Shiina T, Inoko H, Edwards SV.  2001.  Songbird genomics: analysis
of 45 kb upstream of a polymorphic MHC class II gene in red-winged blackbirds
(Agelaius phoeniceus).  Genomics 75:26-34.

Lipovich L, Lynch ED, Lee MK, King MC.  2001.  A novel sodium bicarbonate
cotransporter-like gene in an ancient duplicated region: SLC4A9 at 5q31.
Genome Biol 2:RESEARCH0011.

Edwards SV, Gasper J, Garrigan D, Martindale D, Koop BF. 2000.  A 39-kb sequence
around a blackbird MHC class II gene: ghost of selection past and songbird genome
architecture. Mol Biol Evol 17:1384-95.

Hess CM, Gasper J, Hoekstra HE, Hill CE, Edwards SV.  2000.  MHC class II
pseudogene and genomic signature of a 32-kb cosmid in the house finch (Carpodacus
mexicanus).  Genome Res 10:613-23.

Lynch ED, Lee MK, Morrow JE, Welcsh PL, Leon PE, King MC. 1997.  Nonsyndromic
deafness DFNA1 associated with mutation of a human homolog of the Drosophila gene
diaphanous.  Science 278:1315-8.

Although care has been taken to make the analysis and presentation in the program SeqHelp and this documentation accurate, there may be unintentional errors in them. If you find errors, or have questions or suggestions, please contact Ming Lee. If you are interested in getting SeqHelp, please contact Software Transfer.