Detection of protein coding sequences using a mixture model for local protein amino acid sequence

Title: Detection of protein coding sequences using a mixture model for local protein amino acid sequence
Publication Type: Journal Article
Year of Publication: 2000
Authors: Thayer, E. C., Bystroff, C., & Baker, D.
Journal: Journal of computational biology : a journal of computational molecular cell biology
Volume: 7
Issue: 1-2
Pagination: 317-27
Date Published: 2000 Feb-Apr
ISSN: 1066-5277
Keywords: Algorithms; Amino Acid Sequence; Biometry; DNA; DNA, Fungal; Fungal Proteins; Genome, Fungal; Humans; Models, Genetic; Primary Publication; Proteins; Saccharomyces cerevisiae; Sequence Analysis, Protein
Abstract

Locating protein coding regions in genomic DNA is a critical step in accessing the information generated by large-scale sequencing projects. Current methods for gene detection depend on statistical measures of content differences between coding and noncoding DNA in addition to the recognition of promoters, splice sites, and other regulatory sites. Here we explore the potential value of recurrent amino acid sequence patterns 3-19 amino acids in length as a content statistic for use in gene finding approaches. A finite mixture model incorporating these patterns can partially discriminate protein sequences which have no (detectable) known homologs from randomized versions of these sequences, and from short (≤ 50 amino acids) non-coding segments extracted from the S. cerevisiae genome. The mixture model derived scores for a collection of human exons were not correlated with the GENSCAN scores, suggesting that the addition of our protein pattern recognition module to current gene recognition programs may improve their performance.

Alternate Journal: J. Comput. Biol.
Attachment: thayer00A.pdf (244.63 KB)