SeqHelp - A Tool for Assisting Molecular Sequence Analysis
Table of Contents
Why SeqHelp
What it does
What it does not do
Recommended software
How it works
Installation
Using SeqHelp
Some hints
Some caveats
Some Q&As
An example of SeqHelp applications
Select publications of work assisted by SeqHelp
Why SeqHelp
Many research activities in molecular biology are generating sequences in
various quantities. To gain understanding of the sequences, some analyses
need to be performed on these sequence data. Common analyses performed on
sequences include: database searches, multiple alignments, open reading
frame analysis, genomic structure analysis, identification of variations
among sequences, and of course, management of data, among other functions.
Specialized programs have been designed for these analysis functions, but
most of the time the analyses need to be performed separately (e.g. database
search; alignment of a sequence against a group of others; prediction of
genomic structure; identification of variation). Whereas a top-level view
of information is desirable in some cases, a refined view of data at the
base level is more desirable in many projects. Genomic sequence annotation
requires particularly detailed, low-level information. Computer hardware
requirements are another demanding factor in the use of software. In
addition, use of the existing programs often requires much training, and
summary information for sequences from the same project may not be
available.
SeqHelp seeks to make sequence analysis possible with minimal effort,
letting the experimental biologist use a familiar interface, a hypertext
browser, to perform several analyses simultaneously and to access data over
the internet. It provides an integrated approach to data analysis in the
laboratory, or remotely over the internet, almost independent of hardware.
What it does
SeqHelp organizes information pertinent to molecular sequence analysis
so that scientists can study it through familiar web pages. It collects
relevant database search results, identifies certain information with
respect to a sequence, and generates hypertext files that are viewed as
web pages. Through these web pages, a scientist can study the identified
features, and possibly other relevant information at remote databases and
libraries over the internet instantly, to decide on the next experimental
steps. The results organized by Seqhelp can be applied to gene
identification, sequence annotation, multiple sequence alignment, mutation
analysis, identification of individual sequences from a population, and
other projects.
What it does not do
Except for choosing the database results to include in the web pages,
Seqhelp does not make many decisions. However, it does help the researcher
make decisions. Many areas of genome research do not have ready answers.
For example, in a gene identification project, a novel gene may only show
weak similarity to some existing genes. Some novel genes or regulatory
units do not show much similarity to existing, known entities. An
automatic answer is not readily available in such cases, and any candidate
answer should be examined together with other information on the sequence
and with experimentation. Therefore, SeqHelp leaves the burden of decision
to the scientist after careful analysis of the gathered information.
It does not (yet) provide a top-level display of the sequence structure.
Recommended software
SeqHelp currently runs on the UNIX platform, but its output can be used on
any platform. Since it is written in C, it may be possible to compile on
other platforms when proper interfaces are provided. It works with these
programs:
1. The blast suite of database search programs (blastall, blastcl3;
Altschul et al., 1990), in the version appropriate for your operating
system. On a UNIX system these programs can establish local
blast-searchable databases and search public databases over the internet
with HTML-format output. You will need the standalone programs if you want
to establish and search a local database (make sure you also get formatdb).
You will need the network version of the program if you want to search
over the internet. If the executables don't run on your system, you will
probably need to get the source code for the blast programs and compile it
on your system.
The latest suite of blast programs incorporates gaps in the search. In
principle, these are better programs to use and the matches are more
meaningful. Version 1.0/1.1 of SeqHelp does not incorporate gaps.
One main reason pertains to the translations of the sequence: when gaps
are introduced, codons are modified and shifted, along with the amino
acids displayed. For gene identification studies, SeqHelp remains with the
ungapped version, although a gapped version may be developed. A new
version (1.0p) incorporating gaps has branched out, and is more suitable
in population study contexts (for example, comparing groups of genomic,
cDNA, or RNA sequences over the same region, and identifying unique
bases/sequences).
In the current version of SeqHelp, hypertext links from the aligned
sequences are provided only for individual entries in public databases.
SeqHelp could be modified rather easily to accommodate locally maintained
databases, provided indexing information is available for the individual
records.
2. GenScan for gene prediction on a molecular sequence (C. Burge)
3. RepeatMasker for identifying and masking repeat elements in a sequence
(Smit & Green).
4. The phred/phrap/cross_match programs for sequence generation and
assembly from electropherograms generated by an automatic sequencer (Green).
Phred/phrap are not actually called by SeqHelp, but were used to
generate sequences from chromatograms. They can, however, be combined with
SeqHelp by a simple script. Quality scores from phred/phrap are being
incorporated into an upcoming upgrade of SeqHelp v1.0p for sequence
variation studies.
5. PolyPhred for identifying putative polymorphisms (only used with
SeqHelp version 1.0p) (Nickerson et al.).
6. Perl 5 (required for RepeatMasker and any Perl scripts used).
7. Auxiliary programs to SeqHelp.
Getseqs: takes each sequence from a single file in fasta format and
creates a file for this sequence to be processed by SeqHelp.
Usage: getseqs file_name
None of the above is required, but the functionality of SeqHelp depends
on which of them are present.
The above are not endorsements of any program. The programs mentioned
above were used in our work and SeqHelp was implemented to work with them.
The following are required:
8. A C-compiler.
9. A web-browser.
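As an illustration of what the getseqs step described above amounts to, the following Python sketch splits a multi-sequence fasta file into one file per sequence. It is not the getseqs source. The function name split_fasta and the output naming (the first word of the header line is used as the file name) are assumptions made for this example; the real program may name its files differently.

```python
import os


def split_fasta(path, out_dir):
    """Write each record of a multi-sequence fasta file to its own file.

    Assumption: each output file is named after the first word of the
    record's header line. The real getseqs may use another scheme.
    """
    written = []
    out = None
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                # A header line starts a new record: close the previous file.
                if out is not None:
                    out.close()
                words = line[1:].split()
                name = words[0].replace("/", "_") if words else "unnamed"
                target = os.path.join(out_dir, name)
                out = open(target, "w")
                written.append(target)
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return written
```

Each file returned by split_fasta could then be passed to seqhelp as file_name.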
How it works
SeqHelp takes its sequence from a fasta-format file. If the sequence comes
from an automatic sequencer and the traces still need to be converted into
a sequence, use phred to call the bases from chromatograms and phrap to
assemble them into contigs if sequences overlap, or leave them as singlets
otherwise. The sequences so generated are again in fasta format. The user
may of course use other software for base calling and assembly. Depending
on the options specified, SeqHelp then calls RepeatMasker to identify and
mask the repeat elements and GenScan to predict exons, and predicts
high-CG content regions (CpG islands).
Blast is used to search the local database, if one is available, and the
non-redundant public databases, plus the EST, HTGS, GSS, and STS databases.
SeqHelp then collects the results and organizes them into an HTML file,
displaying the predicted exons and CpG islands, identified repeat elements,
and relevant database search results, with hypertext links, in alignment
with the query sequence. A hypertext browser can then be used to analyze
the results.
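The text above does not state which criteria SeqHelp uses to flag high-CG regions. As a hedged illustration only, the Python sketch below computes the two statistics used in the widely cited Gardiner-Garden and Frommer definition of a CpG island (GC content above 50% and an observed/expected CpG ratio above 0.6 over at least 200 bp). The function names and the choice of these criteria are assumptions, not SeqHelp's implementation.

```python
def cpg_stats(window):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = window.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n
    # Expected CpG count under independence is (C count * G count) / length.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(window, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the assumed thresholds to a single candidate window."""
    gc_fraction, obs_exp = cpg_stats(window)
    return len(window) >= min_len and gc_fraction > min_gc and obs_exp > min_ratio
```

A real predictor would slide such a window along the sequence and merge adjacent windows that pass; that part is omitted here.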
Installation
For the recommended software, follow the installation instructions that
come with each distribution.
SeqHelp can be compiled by issuing
cc -o seqhelp seqhelp.c -lm
The executable 'seqhelp' and its auxiliary programs should be placed in a
directory in your search path.
Address questions regarding installation to Ming Lee.
Using SeqHelp
Important:
Always back up your files from previous work before starting new
analyses.
Always check to make sure that you have sufficient free disk space
available, since the files containing database search results and
hypertext files can be quite large.
It is highly recommended that you study sequences for individual projects
in a separate directory to prevent accidental interference with other
projects.
If you plan on analyzing sequences with local data (i.e. sequences generated
in your research), you should establish a local database for the relevant
project before invoking SeqHelp. This can be accomplished rather easily.
You will need to have your sequences in a fasta-format file (say seqs). All
lines containing sequence data in the file need to be of the same length,
except for the last line. The character X is not accepted in this
application, so change it to something else distinctive. Then issue
(assuming all sequences are nucleotide)
formatdb -i seqs -p F
(formatdb is a utility program from NCBI for building a blast searchable
database). Three files will be generated: seqs.nsq, seqs.nhr, seqs.nin.
Move these three files to the directory where blast can search the database
(specified in the .ncbirc file).
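The formatting requirements above (sequence lines of equal length except for the last line, and no X characters) can be checked before running formatdb. The following Python sketch is one way to do so. check_fasta is a hypothetical helper, not part of SeqHelp or the NCBI tools, and it assumes that the equal-length rule applies within each record.

```python
def check_fasta(lines):
    """Return a list of problems found in fasta-format text lines."""
    problems = []
    records = []  # list of (header, [sequence lines])
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            records.append((line, []))
        elif line and records:
            records[-1][1].append(line)
    for header, seq_lines in records:
        body = seq_lines[:-1]  # the last line of a record may be shorter
        if len({len(s) for s in body}) > 1:
            problems.append(header + ": sequence lines differ in length")
        if any("X" in s.upper() for s in seq_lines):
            problems.append(header + ": contains X")
    return problems
```

An empty list means that no problem was detected by these two checks; it does not guarantee that formatdb will accept the file.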
SeqHelp 1.0/1.1 is invoked from the command line by issuing
seqhelp file_name project_name [-][bcehilnstuxyz] [-][d v] [-][P v p] [-][N v p] [-][L v p] [-][R s] [-G O]
where file_name is the sequence data in fasta format;
project_name is the project from which the sequences are generated
(and is the unique name of the local database suitable for blast searches
for the particular project). The fasta format is chosen purely for
convenience, but this format seems to be most widely used. The project
name is used to identify the local database where local sequences related
to the project are stored. If no local database for the project is
available, a dummy name must be used in its place, and the search local
database option (h) must be suppressed.
The optional parameters to the command are as follows:
'-' by itself will invoke the program to do nothing except to explain
the available options.
'-' followed by one or more of 'b', 'c', 'e', 'g', 'h', 'i', 'l', 'n',
'r', 's', 't', 'u', 'x', 'y', 'z' toggles (between the default value
and its complement) the respective actions to be taken.
b: remove bacterial (E. coli) sequence.
Default is remove.
*** This option is important when studying sequences that ***
*** may contain unremoved E. coli sequences, even if only ***
*** a fraction of the sequence is E. coli. Therefore if ***
*** you are not certain that the sequence does not have ***
*** E. coli sequence, set this option to no-removal with ***
*** the -b option. ***
c: split the sequence into smaller segments for analysis. This
may sometimes make the database searches run faster.
Default is no splitting.
(A sequence longer than 6000 bp is automatically split for
analysis.)
e: no search for the EST database.
Default is search.
Database search results in file file_name.est.html.
h: no search for local sequences in the project.
Default is search.
Database search results in file file_name.l.html.
i: do not predict CpG islands.
Default is predict.
l: include local sequence names in summary report.
Default is not to include.
n: no search for nucleic acid database.
Default is search.
Database search results in file file_name.nr.html.
s: no direct reference to sequence data (no longer used).
Default is no reference.
t: do not translate sequence into amino acids.
Default is translate.
u: no search for the STS database.
Default is search.
Database search results in file file_name.sts.html.
x: no search for amino acid database.
Default is search.
Database search results in file file_name.p.html.
y: no search for the GSS database.
Default is search.
Database search results in file file_name.gss.html.
z: no search for the HTGS database.
Default is search.
Database search results in file file_name.htgs.html.
The following parameter must be followed by an argument: [-][d v]
where d is for the display length per line
v: integer number of bases to display per line (80)
The following parameter must be followed by an argument: [-][R s]
where s is for the species chosen from
r (rodent specific and mammalian wide repeats)
m (non-primate, non-rodent mammals)
a (Arabidopsis thaliana)
d (Drosophila)
The following parameters must be in groups of three: [-][LNP] v p
where L, N, P stand for Local, Nucleotide, and Protein, respectively.
v: integer for the percent similarity (70 for L, N; 50 for P)
p: real number for the probability of random match
(.01 for L, N; .5 for P)
The following parameter must be followed by an argument: [-][G O]
where G is for gene prediction. The default is no prediction.
O: name of the organism from which the sequence was derived. Choices
include Human, Arabidopsis, and Maize. If an organism is
misspecified, human is chosen.
The resulting annotated sequence information will be in an HTML file as
file_name.anno.html. You can then use a web browser to open this file,
look at the results, and follow the links to analyze your sequence. The
summary file for all sequences analyzed will be in the file info.html.
The data files and their associated hypertext files can be placed on a
web server for analysis on a remote computer connected to the internet.
You may also download the files to a personal computer where a hypertext
browser and sufficient disk space are available.
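To keep track of which files a run leaves behind, the file names given in the option list above can be collected in a small table. The Python sketch below does this; expected_outputs is a hypothetical helper, and it assumes that every search is on by default and that a letter in the option string switches the corresponding search off. It ignores the per-segment files produced when a long sequence is split.

```python
# Search options and the suffix of the result file each search writes,
# as listed in the option descriptions above.
SEARCH_SUFFIX = {
    "e": "est",   # EST database
    "h": "l",     # local project database
    "n": "nr",    # nucleic acid database
    "u": "sts",   # STS database
    "x": "p",     # amino acid database
    "y": "gss",   # GSS database
    "z": "htgs",  # HTGS database
}


def expected_outputs(file_name, flags=""):
    """List the HTML files a run is expected to leave behind.

    Assumption: a letter present in `flags` switches that search off.
    """
    switched_off = set(flags.lstrip("-"))
    files = [
        "%s.%s.html" % (file_name, suffix)
        for letter, suffix in SEARCH_SUFFIX.items()
        if letter not in switched_off
    ]
    files.append(file_name + ".anno.html")  # annotated sequence
    files.append("info.html")               # summary for all sequences
    return files
```

For example, expected_outputs("query", "-xuyz") lists the EST, local, and nucleotide result files followed by query.anno.html and info.html.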
Some hints
Seqhelp can display DNA/cDNA/mRNA sequence against DNA/cDNA/mRNA and amino
acid sequences in alignment. Displaying an amino acid sequence against
other amino acid sequences is not formally implemented, but can be done with
some manipulation. Displaying amino acid sequences against DNA/cDNA/mRNA
sequences is not yet implemented.
Depending on the purpose of your work, all of the database searches and
predictions are optional. Thus, you have a few different ways to lay out
the display of your sequence in relation to other sequences. (In subsequent
text, 'data sequence' will refer to a sequence that you want to study).
To prevent removal of sequences fractionally containing E. coli sequence,
issue
seqhelp file_name project_name -b
Other options can also be included.
If you only want to see whether your novel data sequence is similar to
something in the public databases, you can issue
seqhelp file_name project_name -h
The local database is not searched. If you don't want the translations of
the six open reading frames, add t to the list of options.
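For readers unfamiliar with six-frame translation, the Python sketch below shows the idea: three frames on the given strand and three on its reverse complement, translated with the standard genetic code. It is an illustration, not SeqHelp's code. The function names are hypothetical, and showing codons that contain characters other than A, C, G, T as X is an assumption of this example.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered by first, second, third base in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the strand."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from the start of seq; '*' marks a stop."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return translations of frames 1-3 on each strand, forward strand first."""
    forward = seq.upper()
    reverse = reverse_complement(seq)
    return [
        translate(strand[offset:])
        for strand in (forward, reverse)
        for offset in range(3)
    ]
```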
If you are concerned that searching the amino acid databases takes too much
time, and only want a quick look at the available nucleotide and EST
databases, then issue
seqhelp file_name project_name -xuyz
The search will be conducted against the nucleotide, EST, and (actually)
the local database, because the h option was not used.
Say you are running project X that produces a group of sequences. Then
you generate sequence Y and want to compare Y to its most similar sequences
in X (for example, search for identical sequences of Y in X). Prepare X
as a local database, and issue
seqhelp Y_file_name X_database_name -nextiguyz
because you are not concerned about the public databases, the six reading
frames, or any repeat elements there might be in Y. Only the sequences in
the local database are compared.
If you are undertaking genomic sequencing for a gene (or a homolog of it)
whose mRNA sequence is available, you can examine the progress of the
coverage of the gene by the genomic sequences you have generated by issuing
seqhelp G_file_name X_database_name -exuyzb
where G_file_name is the file containing the mRNA sequence of the gene, and
X_database_name is the local database containing your sequencing data. It
is assumed that you are not interested in comparing the EST database. You
may also not want to compare this gene to the public data, nor care about
the repeat elements, so you may add some more options to skip the searches
that you don't want.
Some caveats
Since database searches can produce very large amounts of data and some
HTML files can become very large, disk space demands can be considerable.
Therefore, it is always good policy to make sure that a large amount of
free disk space is available, and to regularly remove outdated files or
move them to archives.
For a sequence longer than 11000 bp (as it is set for blast), SeqHelp
breaks it down into segments of up to (conceptually) 6000 bp (6500 in
reality), numbered from 0, so that segment i (i = 0, 1, ...) contains bases
i x 6000 + 1 to (i+1) x 6000. The segments for sequence S are then named
S.0, S.1, and so on. SeqHelp then takes each segment, performs the
requested tasks, and assembles the results back into a complete annotated
sequence, so the process is almost transparent. However, if you are
interested in a piece of data beyond base 6000 and follow the link to the
related database search results, you will see a different number for the
bases in the query (data) sequence. This happens because, for each segment,
blast treats the first base as base 1, and SeqHelp does not convert the
base numbers. So if you are looking at the database results for segment i
(the segment number is visible in the location window of your web browser),
base j in segment i corresponds to base j + i x 6000 in the original
sequence. This also applies when you set the -c (split sequence) option.
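The conversion described above is simple arithmetic, sketched here in Python. The function names are hypothetical; the nominal segment length of 6000 and the S.0, S.1, ... naming come from the text.

```python
SEGMENT_LENGTH = 6000  # nominal segment length stated in the text


def segment_range(i, segment_length=SEGMENT_LENGTH):
    """First and last base (1-based, inclusive) nominally covered by segment i."""
    return i * segment_length + 1, (i + 1) * segment_length


def to_original(i, j, segment_length=SEGMENT_LENGTH):
    """Map base j of segment i to its position in the unsplit sequence."""
    return j + i * segment_length


def segment_index(name):
    """Read i from a segment name such as 'S.2'."""
    return int(name.rsplit(".", 1)[1])
```

For example, base 500 reported by blast for segment S.1 is base to_original(segment_index("S.1"), 500), that is 6500, in the original sequence.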
Some Q&As
Q: Why yet another sequence visualization tool while there are already
quite a few out there?
A: Although most tools may have been intended to be for general use and
may be used in many different situations, each tool is mostly influenced
by the applications and circumstances during its development. We have
examined a number of visual analysis tools for sequence analysis and
sequence project management, and did not find one that completely matched
our needs. Hence the birth of SeqHelp. It is intended to be easy to use
by experimental biologists at any level of computer sophistication, and
to be generally applicable to various sequence analysis projects. In
many respects, SeqHelp offers superior features for thorough sequence
annotation and analysis.
Q: Why does SeqHelp produce text-only files?
A: The program was designed mostly for biologists studying molecular
sequences: comparing homologies or similarities of sequences at the
nucleotide level. Although determination of gene structure has been
part of some projects, high-level displays have not been in compelling
demand. For the most part, a high-level (i.e., abstract graphical)
display of the sequence structure would have been needed in a very small
percentage of the analyses. On the other hand, base-pair-level
matches/mismatches, intron/exon boundaries, etc., have been the
overwhelming elements that require thorough study. Hence the present
state of the program, which displays sequences at the base level as text.
In addition, text files can be easily edited without specialized
software. There are, of course, many fine software packages for displaying
high-level structures graphically.
Q: Why are database searches limited to the non-redundant nucleotide and
amino acid databases, dbEST, GSS, HTGS, STS, and your own local database?
A: SeqHelp was started as a tool for sequence analysis in positional cloning
projects, with applications to other projects explored later. Most
of the needs in sequence analysis were therefore to: compare a novel
sequence to the known nucleotide and amino acid sequences to see if it is
similar or homologous to (or even is) an existing gene; decide if the
novel sequence contains any ESTs as indicators for genes; and look at the
constituent sequences from the local sequencing project to examine the
progress of the project. Other databases can, in principle, be
incorporated with some additional lines of code. In fact, the number of
databases searched has been expanded from the original group.
Q: Why does SeqHelp only incorporate other programs and not more generic
analysis methods?
A: The programs used in SeqHelp are among the best work in their respective
areas of research, and serve the needs of our research and, we believe,
that of many other investigators. This is the fastest way to efficiently
apply existing resources. In any line of work, if some product already
exists, there is little need to reinvent it.
Q: Is there a web-based interface to SeqHelp?
A: Cliff Olmsted designed a web interface to SeqHelp for the Bioinformatics
Resources server at the University of Washington.
An example of SeqHelp applications
BRCA1 mutations confer significantly elevated risks for developing human
breast cancer. Research in this gene, as well as other cancer-related genes,
will provide insights into the prevention and treatment of breast cancer.
The region on human chromosome 17q21 containing BRCA1 has been completely
sequenced (GenBank L78833). A recent annotation of human BRCA1
by SeqHelp is given here as an example of how SeqHelp can help annotate and
analyze genomic sequences. This 117 kb sequence was actually
analyzed by annotating smaller segments and concatenating the results
together. The results can be studied at a glance by looking at the summary
of the annotation.
Select publications of work assisted by SeqHelp
The following is a partial list of work that has been assisted by SeqHelp.
Gasper JS, Shiina T, Inoko H, Edwards SV. 2001. Songbird genomics: analysis
of 45 kb upstream of a polymorphic MHC class II gene in red-winged blackbirds
(Agelaius phoeniceus). Genomics 75:26-34.
Lipovich L, Lynch ED, Lee MK, King MC. 2001. A novel sodium bicarbonate
cotransporter-like gene in an ancient duplicated region: SLC4A9 at 5q31.
Genome Biol 2:RESEARCH0011.
Edwards SV, Gasper J, Garrigan D, Martindale D, Koop BF. 2000. A 39-kb sequence
around a blackbird MHC class II gene: ghost of selection past and songbird genome
architecture. Mol Biol Evol 17:1384-95.
Hess CM, Gasper J, Hoekstra HE, Hill CE, Edwards SV. 2000. MHC class II
pseudogene and genomic signature of a 32-kb cosmid in the house finch (Carpodacus
mexicanus). Genome Res 10:613-23.
Lynch ED, Lee MK, Morrow JE, Welcsh PL, Leon PE, King MC. 1997. Nonsyndromic
deafness DFNA1 associated with mutation of a human homolog of the Drosophila gene
diaphanous. Science 278:1315-8.
Although care has been taken to ensure accurate analysis and presentation in the
program SeqHelp and this documentation, there might be unintentional errors in
them. If you find errors, please contact Ming Lee.
If you are interested in getting SeqHelp, please contact Software Transfer.
If you have questions or suggestions, please contact Ming Lee.