- Fold representative selection and simulation
- Fold family oversampling
- SQL and OLAP Development
Dynameomics is our high-throughput simulation initiative. As the determination of new protein structures increases, the discovery of new protein topologies (folds) slows. Much effort has been spent studying the differences between these folds with respect to evolution and simple biophysical structural favorability. Our particular interest is to use a broad sample of in silico folding behavior to elucidate general rules of self-assembly that would be useful both for predicting protein structure and for treating misfolding diseases.
The initial phase of this project consisted of generating a consensus domain dictionary from three major public domain dictionaries (SCOP, CATH, and the Dali Domain Dictionary). Thirty initial targets were selected from the 30 most populated consensus folds ("metafolds") and made available. Following this, a preliminary set of 188 targets was selected and simulated, both in their native states and along their unfolding pathways (induced by high temperature). The data from our simulations were validated for these targets and made available on a limited basis.
The final phase of this project involved three distinct components: simulation of a complete set of small protein topologies, simulation of a large set of targets from a small number of metafolds, and simulation of biomedically relevant targets with disease-causing single nucleotide polymorphisms ("SNPs"). Our 2003 consensus set was updated with current domain dictionaries, and a single target was selected from each metafold. Where possible, these targets were simulated. Multiple targets were selected from the well-studied three-helix bundle, SH3 domain, and ubiquitin-like metafolds and simulated to evaluate sequence effects. Multiple targets with disease-causing SNPs were simulated to survey potential pathological destabilizing events.
Our Dynameomics database (available at www.dynameomics.org) is implemented in Microsoft SQL Server 2008. The database is split over several servers, each of which hosts a subset of the data; the servers are joined via a single unified directory. To simplify data access, the database makes extensive use of views. Although traditional SQL row-sets are a natural structure for some data types, we have also been experimenting with multidimensional OLAP cubes as a more efficient way to store high-dimensional data such as coordinates. OLAP, which is accessible via the MDX query language, shows promise as a means of simplifying complex multi-simulation queries. Throughout the database, a strict organization is enforced to make extension and access easy as we transition our data to the public domain.
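As a rough illustration of how a view can hide the way data are split across hosts, the sketch below uses Python's built-in sqlite3 module as a stand-in for SQL Server. It is not the Dynameomics schema: the table names, column names, and values (`coords_server_a`, `coords_server_b`, `directory`, `all_coords`, and the temperatures) are hypothetical, and two tables in one in-memory database stand in for subsets hosted on separate servers.

```python
import sqlite3


def build_demo_db():
    """Build an in-memory database with a view over two partitioned tables.

    All names and values are hypothetical; this is not the real schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Two tables stand in for coordinate subsets on separate servers.
        CREATE TABLE coords_server_a (
            sim_id INTEGER, step INTEGER, atom INTEGER,
            x REAL, y REAL, z REAL);
        CREATE TABLE coords_server_b (
            sim_id INTEGER, step INTEGER, atom INTEGER,
            x REAL, y REAL, z REAL);

        -- A small directory table records what each simulation is.
        CREATE TABLE directory (
            sim_id INTEGER PRIMARY KEY, target TEXT, temperature_k REAL);

        -- The view presents both subsets as one row-set, so queries
        -- need not know where a given simulation is stored.
        CREATE VIEW all_coords AS
            SELECT d.sim_id, d.target, d.temperature_k,
                   c.step, c.atom, c.x, c.y, c.z
            FROM directory AS d
            JOIN (SELECT * FROM coords_server_a
                  UNION ALL
                  SELECT * FROM coords_server_b) AS c
              ON c.sim_id = d.sim_id;
        """
    )
    # Illustrative rows only: two simulations of one made-up target.
    conn.executemany(
        "INSERT INTO directory VALUES (?, ?, ?)",
        [(1, "target_A", 298.0), (2, "target_A", 498.0)],
    )
    conn.executemany(
        "INSERT INTO coords_server_a VALUES (?, ?, ?, ?, ?, ?)",
        [(1, 0, 0, 0.0, 0.0, 0.0), (1, 0, 1, 1.0, 0.0, 0.0)],
    )
    conn.executemany(
        "INSERT INTO coords_server_b VALUES (?, ?, ?, ?, ?, ?)",
        [(2, 0, 0, 0.0, 0.0, 0.0), (2, 0, 1, 3.0, 0.0, 0.0)],
    )
    return conn


def summarize(conn):
    """Run one multi-simulation query against the unified view."""
    query = """
        SELECT sim_id, temperature_k, COUNT(*), MAX(x)
        FROM all_coords
        GROUP BY sim_id, temperature_k
        ORDER BY sim_id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in summarize(build_demo_db()):
        print(row)
```

The query in `summarize` touches only the view, so the same statement would keep working if a simulation were moved from one underlying table to the other. An OLAP cube queried through MDX would address the same data by dimension rather than by row; that is not shown here.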
Performing these simulations was made possible by generous support from the Department of Energy and Microsoft. We are in the process of making a large portion of these data public at www.dynameomics.org.
- Simms A.M., Toofanny R.D., Kehl C., Benson N.C., and Daggett V. Dynameomics: design of a computational lab workflow and scientific data repository for protein simulations. Protein Engineering Design & Selection, 21:369-377, 2008.
- Van der Kamp M.W., Schaeffer R.D., Jonsson A.L., Scouras A.D., Simms A.M., Toofanny R.D., Benson N.C., Anderson P.C., Merkley E.D., Rysavy S., Bromley D., Beck D.A.C., and Daggett V. Dynameomics: A Comprehensive Database of Protein Dynamics. Structure, 18:423-435, 2010.
- Schaeffer R.D. and Daggett V. Protein folds and protein folding. Protein Engineering Design & Selection, 24:11-19, 2010.
- Schaeffer R.D., Jonsson A.L., Simms A.M., and Daggett V. Generation of a Consensus Protein Domain Dictionary. Bioinformatics, 27:46-54, 2010.