SLA Los Angeles
Genomics, Proteomics and Sequence/Structure Resources
Program presented June 12, 2002
Session report by Jim Martin

Program Abstract:

Secrets learned through the study of the genome and proteome are transforming biochemistry, medicine, and related scientific disciplines. Join us as a molecular biologist explores the science of sequences and sequence information. Two librarians will then introduce and discuss the major sequence information resources freely available over the web. Audiences: All attendees. Sponsor(s): American Chemical Society. Moderator(s): Nancy Simons, Georgia Tech. Speaker(s): Michael M. Miyamoto, Ph.D., Professor of Zoology, University of Florida; Monica Romiti, Senior Technical Information Specialist, KEVRIC Contract Manager; Michele R. Tennant, Ph.D., M.L.I.S., Bioinformatics Librarian, University of Florida.

The program began with Michael Miyamoto's excellent introduction to genomics and proteomics. After briefly explaining some basic genetic concepts, he emphasized how developments in genomics and proteomics are revolutionizing the study of life: scientists now have the tools to begin the study of the full complement of genes (the genome), mRNA (the transcriptome), and proteins (the proteome) in an organism. The ability to explore these interrelated aspects simultaneously should allow for a more complete understanding of the intricate networks that regulate cellular environments.

Dr. Miyamoto then highlighted some of the more intriguing findings to emerge from the Human Genome Project. For example, only 1.1% of the human genome is known to code for proteins. (Interestingly, endogenous retroviral sequences, which have found their way into the germline of human cells and are heritable, account for approximately 8% of our DNA.) It also turns out that humans have far fewer genes (35,000-40,000) than predicted. In fact, we have only about twice as many genes as the nematode, a small transparent worm whose genome has also been sequenced.

What accounts for the apparently non-linear relationship between the total number of genes in an organism's genome and the complexity of that organism? The key may lie in the number of proteins that the genes encode. Dr. Miyamoto described how alternative splicing of RNA can result in a single gene ultimately giving rise to more than one mRNA, each of which in turn serves as a template for protein synthesis. In humans, there is believed to be an average of three alternative splices per gene. Nematodes, on the other hand, are thought to have a correlation between genes and proteins that is closer to one-to-one.
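As a rough back-of-the-envelope illustration of this point, the sketch below multiplies the gene counts quoted in the talk by an average number of transcripts per gene. It assumes, for illustration only, that "three alternative splices per gene" means about three distinct mRNAs per human gene, that the nematode ratio is exactly one, and that the nematode has half as many genes as humans; the function name is hypothetical.

```python
def estimated_transcripts(gene_count, transcripts_per_gene):
    """Rough estimate of distinct mRNAs: genes times average transcripts per gene."""
    return gene_count * transcripts_per_gene


# Gene-count range quoted in the talk for humans (low, high).
human_genes = (35_000, 40_000)
# Assumption: the nematode has about half as many genes as humans.
nematode_genes = tuple(n // 2 for n in human_genes)

# Assumption: about 3 transcripts per human gene, 1 per nematode gene.
human_mrnas = tuple(estimated_transcripts(n, 3) for n in human_genes)
nematode_mrnas = tuple(estimated_transcripts(n, 1) for n in nematode_genes)

# Ratio of estimated transcript counts at the low end of the range.
ratio = human_mrnas[0] / nematode_mrnas[0]

print(human_mrnas)     # (105000, 120000)
print(nematode_mrnas)  # (17500, 20000)
print(ratio)           # 6.0
```

Under these assumptions a twofold difference in gene number becomes roughly a sixfold difference in distinct transcripts, which is the kind of gap the speaker was pointing to.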

In the study of the transcriptome, or all of an organism's mRNAs and their various splice forms, one of the more powerful tools is the cDNA microarray. cDNA microarrays can be used to study the expression of tens of thousands of genes simultaneously under experimental conditions and have many exciting applications. For example, researchers are using microarrays to compare gene expression in normal cells with gene expression in cancer cells and other diseased tissue. It is hoped that this will lead to the identification of genes and proteins that may play a role in certain pathologies and to diagnostic tests for disease. [See "The Magic of Microarrays", by Stephen H. Friend and Roland B. Stoughton, Scientific American, February 2002, pages 44-53, for an overview of DNA microarrays.]
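The normal-versus-diseased comparison described above is commonly summarized as a log ratio of the two signals for each gene. The sketch below is one minimal way to do that, not a method attributed to the speaker; the gene names, intensity values, and the twofold cutoff are all hypothetical.

```python
import math


def log2_fold_change(test_signal, reference_signal):
    """Log base 2 of the ratio between test and reference intensities."""
    return math.log2(test_signal / reference_signal)


# Hypothetical spot intensities in arbitrary units: (tumor, normal).
signals = {
    "gene_A": (800.0, 100.0),
    "gene_B": (50.0, 200.0),
    "gene_C": (120.0, 120.0),
}

# Assumption: call a gene changed when expression differs at least twofold.
THRESHOLD = 1.0


def classify(signals, threshold=THRESHOLD):
    """Label each gene as up, down, or unchanged in tumor relative to normal."""
    calls = {}
    for gene, (tumor, normal) in signals.items():
        change = log2_fold_change(tumor, normal)
        if change >= threshold:
            calls[gene] = "up"
        elif change <= -threshold:
            calls[gene] = "down"
        else:
            calls[gene] = "unchanged"
    return calls


print(classify(signals))
```

Real analyses also need normalization and replicate measurements before any gene is called differentially expressed; those steps are omitted here.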

The study of proteomics, Dr. Miyamoto explained, also has tremendous potential, but is a more complex undertaking than genomics. Whereas with few exceptions each and every cell in an organism contains the same DNA, protein expression varies widely depending on cell type and a multitude of other factors. Proteins also interact with other proteins and may undergo extensive post-translational modification, which affects their function. There are also many more proteins than genes, and relatively few of these proteins have been completely characterized or annotated. Some of the analytical techniques currently being used to study proteins include 2D PAGE gels, mass spectrometry, and x-ray crystallography.

Monica Romiti provided an overview of new and forthcoming features from NCBI, the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/) and noted that the organization has a new ftp address: ftp.ncbi.nih.gov. (For more on the resources available on the NCBI ftp server, see http://www.ncbi.nlm.nih.gov/Ftp/)

NCBI databases mentioned in this presentation included the following (descriptions in quotes are taken from the NCBI web pages):

dbSNP [Single Nucleotide Polymorphisms]

http://www.ncbi.nlm.nih.gov/SNP/index.html SNPs (pronounced "snips") are the single nucleotide changes that are believed to account for most of the genetic differences among humans. "NCBI plays a major role in facilitating the identification and cataloging of SNPs through its creation and maintenance of [dbSNP]. This powerful genetic tool may be accessed by the biomedical community worldwide and is intended to stimulate many areas of biological research, including the identification of the genetic components of disease."

See http://www.ncbi.nlm.nih.gov/About/primer/snps.html for an introduction to SNPs and their significance.

UniSTS [Sequence Tagged Site]

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unists "UniSTS integrates marker and mapping data from a variety of public resources. If two or more markers have different names but the same primer pair, a single sequence tagged site (STS) record is presented for the primer pair and all the marker names are shown."

ProbeSet

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=geo "ProbeSet is a "by experiment" view of NCBI's Gene Expression Omnibus gene expression and hybridization array repository. ProbeSet is intended to facilitate powerful searching on the GEO database, and link the search results to internal and external resources where possible."

3D Domains

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Domains Brings together resources for using the Molecular Modeling Database of over 10,000 structures, including the latest version of the visualization software Cn3D.

Books

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=books A searchable collection of free full text biomedical books. The books may be searched directly or viewed through links from abstracts in the PubMed database.

Genes & disease page (recently updated)

http://www.ncbi.nlm.nih.gov/disease/
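The merging rule quoted in the UniSTS description above (markers with different names but the same primer pair share one STS record) can be pictured as a grouping keyed on the primer pair. The sketch below illustrates that idea only and is not NCBI's implementation; the marker names and primer sequences are made up.

```python
from collections import defaultdict

# Hypothetical marker records: (marker name, forward primer, reverse primer).
markers = [
    ("MARKER_1", "ACGTACGTAC", "TTGACCGGTA"),
    ("ALIAS_X", "ACGTACGTAC", "TTGACCGGTA"),
    ("MARKER_2", "GGCATTCAGA", "CCATGAACTG"),
]


def merge_by_primer_pair(records):
    """Group marker names under the primer pair they share."""
    merged = defaultdict(list)
    for name, forward, reverse in records:
        merged[(forward, reverse)].append(name)
    return dict(merged)


sts_records = merge_by_primer_pair(markers)
for primer_pair, names in sts_records.items():
    print(primer_pair, names)
```

Here the three input markers collapse to two records, with MARKER_1 and ALIAS_X listed together under their common primer pair.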

Michele Tennant spoke about genomics and proteomics resources. The presentation focused on examples of online resources for annotation, pathways, and gene expression, which are listed below.

Gene Ontology Consortium (GOC)

http://www.geneontology.org/ The GOC is developing common, controlled vocabularies to describe the molecular function, biological processes, and cellular components of genes and gene products. These vocabularies are used by consortium members to annotate genome databases for a variety of model organisms. One of the many advantages of a controlled vocabulary is that it allows researchers to cross-search organism databases simultaneously for common terms. For example, one could search for proteins in different organisms that share common attributes as defined by the vocabulary.

For in-depth articles on the GOC, see http://www.geneontology.org/#pubs

KEGG - Kyoto Encyclopedia of Genes and Genomes

http://www.genome.ad.jp/kegg/ KEGG brings together graphical models of molecular and cellular processes and systems with genomic information. Computational methods are used to help researchers predict the functions of genes and gene products.

NCBI's GEO (Gene Expression Omnibus)

http://www.ncbi.nlm.nih.gov/geo/ GEO is a tertiary resource which brings together high-throughput expression and genomic hybridization array data. NCBI's Entrez ProbeSet allows for sophisticated searching of the database.

KEGG/EXPRESSION Database

http://www.genome.ad.jp/kegg/expression/ A database for browsing and analyzing microarray gene expression data. Integrated links with other databases allow the data to be mapped to pathways (via KEGG/PATHWAY) or chromosomal positions (via KEGG/GENES).
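The cross-organism searching that a controlled vocabulary such as the Gene Ontology makes possible can be sketched with a small lookup. Because every database annotates with the same terms, one query string works across all of them. The organisms, gene products, and annotations below are toy data invented for illustration, and the function is hypothetical; this is not the consortium's software.

```python
# Toy annotations: organism -> gene product -> set of shared vocabulary terms.
annotations = {
    "organism_1": {
        "protein_a": {"DNA binding", "nucleus"},
        "protein_b": {"kinase activity"},
    },
    "organism_2": {
        "protein_c": {"DNA binding"},
        "protein_d": {"nucleus", "kinase activity"},
    },
}


def search_by_term(annotations, term):
    """Return (organism, gene product) pairs annotated with the given term."""
    hits = []
    for organism, products in sorted(annotations.items()):
        for product, terms in sorted(products.items()):
            if term in terms:
                hits.append((organism, product))
    return hits


print(search_by_term(annotations, "DNA binding"))
# [('organism_1', 'protein_a'), ('organism_2', 'protein_c')]
```

Without a shared vocabulary, the same query would have to be rewritten for each database's local terminology.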

Dr. Tennant concluded her presentation with a list of recommended resources:

Resource Lists

Hightower, Christy. "Guide to Selected Bioinformatics Internet Resources." Issues in Science and Technology Librarianship, Winter 2002 http://www.istl.org/istl/02-winter/internet.html

The January issue of "Nucleic Acids Research" (http://nar.oupjournals.org/) is devoted to an annual review of factual biological databases.

To Learn More

Basic genetics & genomics glossaries

http://www.genomicglossaries.com/content/Basic_Genetic_Glossaries.asp

Genome glossary

http://www.ornl.gov/hgmis/glossary

NCBI's "Science Primer"

http://www.ncbi.nlm.nih.gov/About/primer/

Genomics Impact on Medicine & Society

http://www.ornl.gov/hgmis/publicat/primer2001/

"A Primer of Genome Science", Greg Gibson and Spencer V. Muse, Sinauer Associates, Inc. 2002 (ISBN: 0878932348)

NCBI's Advanced Course for Librarians and Bioinformatics Information Specialists, August 5-9, 2002

http://www.ncbi.nlm.nih.gov/Class/NAWBIS