Prediction and Design of Protein Structures and Protein-protein Interactions

From prediction of structure to design of function

        The primary goals of the research in the Baker group over the past several years have been to predict the structures of naturally occurring biomolecules and interactions and to design new molecules with new and useful functions. These prediction and design challenges have direct relevance for biomedicine and provide stringent and objective tests of our understanding of the fundamental underpinnings of molecular biology.

        To carry out the prediction and design calculations, we have been developing a computer program called Rosetta. At the core of Rosetta are potential functions for computing the energies of interactions within and between macromolecules, and methods for finding the lowest energy structure for a protein or RNA sequence (structure prediction) and for finding the lowest energy sequence for a given protein structure or function (design) (Das and Baker, 2008). Feedback from the prediction and design tests is used continually to improve the potential functions and the search algorithms. Development of one computer program to treat these diverse problems has considerable advantages: first, the different applications provide complementary tests of the underlying physical model (the fundamental physics/physical chemistry is, of course, the same in all cases); second, many problems of current interest, such as flexible backbone protein design and protein-protein docking with backbone flexibility, involve a combination of the different optimization methods.
Prediction of Protein Structures and Interactions

         We have continued to improve the Rosetta ab initio structure prediction methodology, and in blind tests for small proteins have in some cases achieved atomic level accuracy (Qian et al, 2008; Das et al, 2007; Raman et al, 2009).  We have extended the methodology to metal binding proteins (Wang et al, 2010), membrane proteins (Barth et al, 2007), symmetric protein complexes (Andre et al, 2007; Das et al, 2009), RNA molecules (Das et al, 2007; Das et al, in press), flexible backbone protein-protein docking (Wang et al, 2007), and protein-small molecule docking for drug design (Meiler and Baker, 2006; Davis and Baker, 2009; Davis et al, 2009). We have found a striking high level similarity between all of these problems.  In all cases, the native folded structure is lower in energy than any non-native structures we can generate even with large scale sampling.  Pronounced “native energy gaps” must exist because biopolymers can only fold to unique native states if the native state is very much lower in energy than all other possible conformations—our observation of significant energy gaps in all of the biological systems we have examined indicates that the magnitude of the actual gaps is considerably larger than the noise resulting from the remaining inaccuracies in the Rosetta forcefield.

         Because our model of interatomic interactions has reached sufficient accuracy that native structures are almost always significantly lower in energy than nonnative structures, the problem of predicting the structure of a protein, RNA, protein complex, etc. has become primarily a search problem: starting from an extended chain we must sample close enough to the native free-energy minimum for the energy to drop lower than for all the nonnative conformations generated. The challenge is that the energy drops only when the structure is sufficiently close (~2 Å RMSD) to realize the close complementary packing and precise hydrogen bonding of the native structure (Bradley et al, 2005; Kim et al, 2009).
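
The search problem described above can be illustrated with a minimal sketch. This is not Rosetta code: the one-dimensional landscape `toy_energy`, the well width, and all parameter values are assumptions chosen for illustration. It uses a Metropolis Monte Carlo walk on a rugged landscape whose deep "native" well is felt only at very short range, so the energy drops sharply only once sampling lands close to the minimum.

```python
import math
import random


def toy_energy(x):
    """Hypothetical 1-D landscape: rugged background plus a narrow, deep well at x = 0."""
    background = 0.5 * math.sin(5.0 * x)              # shallow non-native minima
    native_well = -5.0 * math.exp(-((x / 0.2) ** 2))  # only felt very close to x = 0
    return background + native_well


def metropolis_accept(delta_e, temperature, rng):
    """Always accept downhill moves; accept uphill moves with probability exp(-dE/T)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def monte_carlo_search(start, steps, step_size, temperature, seed=0):
    """Random-walk search that returns the lowest-energy point visited."""
    rng = random.Random(seed)
    x = start
    e = toy_energy(x)
    best_x, best_e = x, e
    for _ in range(steps):
        trial = x + rng.uniform(-step_size, step_size)
        trial_e = toy_energy(trial)
        if metropolis_accept(trial_e - e, temperature, rng):
            x, e = trial, trial_e
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e
```

Calling `monte_carlo_search(start=3.0, steps=2000, step_size=0.3, temperature=0.5)` returns the best point found; whether the walk reaches the narrow well depends on the seed and the step budget, which is the essence of the sampling problem.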

         Because of the large number of possible conformations for a protein or RNA chain, finding the lowest energy state is a formidable computational challenge. We have developed a distributed computing project, called Rosetta@home, to meet this challenge. There are now more than 200,000 participants worldwide whose computers run Rosetta structure prediction and design calculations when not otherwise being used. The project has sparked considerable interest in biomedical research. Inspired by this, we are working with high school teachers to develop a minicurriculum for students that will explain the science around Rosetta@home. We are also developing a multiplayer, interactive video game version of Rosetta@home called Foldit (Nature ref) that we believe will be an excellent vehicle for learning and, by allowing people to work with each other and with their computers, may allow solution of difficult scientific problems. Foldit players have already exhibited quite amazing prowess both in solving hard structure prediction problems and in developing new algorithms and strategies for solving these problems.

         Even with Rosetta@home, the sampling problem remains insurmountable for all but the simplest proteins.  We have however opened up a very exciting new area by cheating a bit—using experimental data or protein homology information to overcome the sampling problem by constraining the search to the vicinity of the native energy minimum.  To achieve adequate sampling starting from information from structures of remote homologues, we have developed a multiscale search algorithm that begins with a broad low-resolution search over a wide range of conformations (Qian et al, 2008). It then shifts to a search using a detailed all atom representation for tightly packed conformations in which buried polar groups form hydrogen bonds to compensate for the loss of interactions with water.
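
The two-stage idea in the paragraph above can be sketched on a toy problem. This is only an illustration under stated assumptions, not the Rosetta protocol: `coarse_energy` and `detailed_energy` are hypothetical one-dimensional stand-ins for the low-resolution and all-atom energy functions, and the grid scan and greedy descent stand in for the real search moves.

```python
import math


def detailed_energy(x):
    """Hypothetical 'all-atom' energy: rugged, with a narrow deep well near x = 1.0."""
    rugged = 0.3 * math.sin(20.0 * x)
    broad = 0.1 * (x - 1.0) ** 2
    well = -4.0 * math.exp(-(((x - 1.0) / 0.1) ** 2))
    return rugged + broad + well


def coarse_energy(x):
    """Hypothetical low-resolution energy: a smooth, broad basin around the same minimum."""
    return 0.1 * (x - 1.0) ** 2


def coarse_search(lo, hi, n_points, keep):
    """Stage 1: scan a wide range on a grid and keep the lowest-energy candidates."""
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]
    return sorted(grid, key=coarse_energy)[:keep]


def refine(x, step=0.05, min_step=1e-4):
    """Stage 2: greedy local descent on the detailed energy with a shrinking step."""
    e = detailed_energy(x)
    while step > min_step:
        moved = False
        for trial in (x - step, x + step):
            trial_e = detailed_energy(trial)
            if trial_e < e:
                x, e, moved = trial, trial_e, True
        if not moved:
            step /= 2.0
    return x, e


def multiscale_search(lo=-10.0, hi=10.0, n_points=41, keep=5):
    """Refine each coarse candidate and return the (x, energy) pair with the lowest energy."""
    candidates = coarse_search(lo, hi, n_points, keep)
    return min((refine(c) for c in candidates), key=lambda pair: pair[1])
```

The coarse stage is cheap and sees only the broad basin; the detailed stage is run from just a handful of candidates, and only those that start close to the narrow well end up with a very low energy.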

         The first indication of the power of the Rosetta structure prediction methodology in combination with experimental data was the finding that a prediction of the structure of a CASP7 (critical assessment of structural prediction) target had sufficiently high accuracy that the x-ray crystallographic phase problem could be solved with the model (Qian et al, 2008).  Subsequent work showed that high resolution refinement using the newly developed Rosetta rebuild and refine protocol in a number of cases significantly improved molecular replacement using comparative models and NMR structures as starting points (Qian et al, 2008; Das et al, 2009).  This was an important milestone for protein structure prediction since quite high accuracy is required for molecular replacement.

         To allow guidance of the search process using X-ray and cryo-electron microscopy data, we incorporated a “fit to density” term into Rosetta (DiMaio et al, 2009). This has allowed the building of atomically detailed models based on 4.5-7Å experimental electron density maps from cryo-electron microscopy. We have encouraging recent results suggesting that by refining models with density we can solve x-ray structures starting from molecular replacement matches too weak for conventional methods to converge.

         In the NMR area, we showed in collaboration with Ad Bax that using NMR chemical shifts to guide the selection of fragments used in the first low resolution part of the Rosetta search process dramatically increased sampling around the native structure (Shen et al, 2008).  We showed subsequently that incorporating unassigned sidechain NOESY data improved performance further (Raman et al, 2010a).  The biggest breakthrough came with incorporating residual dipolar coupling and backbone NOESY data—with this relatively easy to collect data, we are able to solve structures of proteins of up to 200 amino acids with a fraction of the time and effort required for traditional NMR structure determination of proteins of this size (Raman et al, 2010b).  We are currently working to push the approach into the 200-400 residue size range where current NMR methodology is largely unsuccessful.  The Rosetta-NMR methodology is increasingly being used to solve structures in the NMR community.

Design of New Protein Functions

         Several years ago, we developed a general computational strategy for designing new protein structures that incorporates full backbone flexibility into rotamer-based sequence optimization. This was accomplished by integrating ab initio protein structure prediction, atomic-level energy refinement, and sequence design in Rosetta. The procedure was used to design Top7, a 93-residue protein with a novel sequence and topology. Top7 was found to be folded and highly stable, and the x-ray crystal structure of Top7 is virtually identical  to the design model (Kuhlman et al, 2003).

         Since the validation of our protein design methodology provided by Top7, we have focused on designing proteins with new and useful functions. We are concentrating on four challenges: (1) the design of new protein-protein interactions, (2) the design of new enzymes catalyzing reactions not catalyzed by naturally occurring enzymes, (3) the design of novel endonucleases with any specified cleavage specificity, and (4) the design of a vaccine for HIV.
Design of Novel Enzymes

         We have been developing general methodology for designing enzyme catalysts for any arbitrary chemical reaction. Given a reaction for which a catalyst is desired, the first step of the approach is to compute the structures of the intermediates and transition states along the reaction pathway. The second step is to design, using a combination of quantum mechanical calculations and general chemical principles, a number of ideal disembodied active sites consisting of protein functional groups (hydrogen bond donors and acceptors, positive and negative charges, aromatic rings, etc.) positioned around the superimposed ensemble of reaction intermediates and transition states in a manner optimal for catalysis. Each such disembodied active site model is essentially a hypothesis about how to catalyze the reaction; for each reaction we need to experiment with multiple such hypotheses, as there is no guarantee that any one will be a good catalyst. The third step is to design proteins which fold to structures with pockets containing the disembodied active sites. Again, for each active site hypothesis, we generally produce multiple designs, as we cannot be sure that any one protein design will fold so as to perfectly realize the desired active site.

         We are taking two approaches to the third (protein design) step. The first approach is to design from scratch a protein which folds up to produce the desired active site. We have been developing the methodology for this approach by building up a library of “platonic ideal” versions of the most common protein folds; thus far we have designed and experimentally validated hyperstable ferredoxin and Rossmann folds with different numbers of secondary structural elements. We are now working to incorporate catalytic site architecture into this de novo design process.

         The second approach is to search through already existing protein scaffolds containing binding pockets for sets of backbone positions with geometries compatible with the ideal active site.   For this purpose we have developed a program called RosettaMatch which uses geometric hashing to identify compatible active site placements (Zanghellini et al, 2006); we generally search a set of 200 hyperthermophilic protein scaffolds from which the sidechains have been stripped.  Following the geometric matching step, we use standard Rosetta protein design to optimize the residues surrounding the binding pocket to maximize the binding affinity for the reaction transition state.  The results described in the following paragraphs were obtained with this approach.
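
The geometric matching step can be illustrated with a simplified sketch of geometric hashing. This is not the RosettaMatch implementation: the function names, the input format (a list of `(scaffold_position, ligand_xyz)` placements per catalytic residue), and the 1.0 Å bin size are all assumptions made for illustration. The idea is that each catalytic sidechain, built at each scaffold position, implies a location for the transition state; placements are hashed into grid cells, and a match is a cell that every catalytic residue can reach from a different scaffold position.

```python
from collections import defaultdict
from itertools import product


def voxel_key(point, bin_size):
    """Map a 3-D point to the integer index of the grid cell containing it."""
    return tuple(int(c // bin_size) for c in point)


def build_hash(placements, bin_size):
    """Hash one catalytic residue's placements: grid cell -> scaffold positions reaching it."""
    table = defaultdict(set)
    for position, xyz in placements:
        table[voxel_key(xyz, bin_size)].add(position)
    return table


def find_matches(placements_by_residue, bin_size=1.0):
    """Return (voxel, positions) pairs where every catalytic residue places the
    ligand in the same grid cell, each from a distinct scaffold position."""
    tables = [build_hash(p, bin_size) for p in placements_by_residue]
    shared = set(tables[0]).intersection(*tables[1:])
    matches = []
    for voxel in sorted(shared):
        for combo in product(*(sorted(t[voxel]) for t in tables)):
            if len(set(combo)) == len(combo):  # one catalytic residue per position
                matches.append((voxel, combo))
    return matches


# Hypothetical placements: (scaffold position, implied ligand coordinates in Å).
lys_placements = [(12, (1.2, 3.4, 5.6)), (40, (9.0, 9.0, 9.0))]
asp_placements = [(57, (1.7, 3.1, 5.9)), (12, (1.1, 3.3, 5.5))]
matches = find_matches([lys_placements, asp_placements])
```

Hashing makes the lookup cost roughly linear in the number of placements rather than quadratic. One known limitation of this simple version is that two placements falling just either side of a cell boundary are missed; a fuller treatment would also check neighboring cells and ligand orientation.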

         Given the uncertainties in each step in the design process, we compute and test many designed proteins for each reaction we are pursuing. Fortunately, advances in gene synthesis technology have sharply reduced both the cost and the turnaround time for production of synthetic genes in the last few years. As designed sequences emerge from the computer, we translate the amino acid sequences back to DNA sequences and order synthetic genes in E. coli T7 expression vectors; once the designed genes are shipped back we can easily express and purify the His-tagged designed proteins using affinity chromatography, and then assay enzymatic activity.
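
The back-translation step mentioned above can be sketched as follows. This is a minimal illustration, not the procedure actually used: the table picks one valid codon per amino acid from the standard genetic code, and the specific codon choices, the function names, and the default stop codon are assumptions. Real gene design would typically also consider codon usage, restriction sites, and vector-specific flanking sequences.

```python
# One valid codon per amino acid (standard genetic code). The specific choices are
# illustrative assumptions, not a codon-optimization scheme taken from the text.
CODON_FOR = {
    "A": "GCG", "R": "CGT", "N": "AAC", "D": "GAT", "C": "TGC",
    "Q": "CAG", "E": "GAA", "G": "GGC", "H": "CAT", "I": "ATT",
    "L": "CTG", "K": "AAA", "M": "ATG", "F": "TTT", "P": "CCG",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAT", "V": "GTG",
}
AMINO_ACID_FOR = {codon: aa for aa, codon in CODON_FOR.items()}


def back_translate(protein, stop_codon="TAA"):
    """Convert a one-letter amino acid sequence into a DNA coding sequence."""
    codons = []
    for residue in protein.upper():
        if residue not in CODON_FOR:
            raise ValueError(f"unknown amino acid: {residue!r}")
        codons.append(CODON_FOR[residue])
    return "".join(codons) + stop_codon


def translate(dna):
    """Inverse check: read codons from this table back into amino acids, up to the stop."""
    residues = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon not in AMINO_ACID_FOR:
            break  # stop codon, or a codon outside this one-per-residue table
        residues.append(AMINO_ACID_FOR[codon])
    return "".join(residues)
```

A round trip such as `translate(back_translate(seq)) == seq` is a cheap sanity check before a sequence is sent out for synthesis.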

         We have used the approach described above to design enzymes that catalyze four quite different chemical reactions using a diverse set of catalytic mechanisms (Figure 1). The designed Kemp elimination catalysts (Fig 1, top left) use classical general acid-base catalysis in which a proton is abstracted from a substrate carbon atom using either a histidine- or carboxylate-containing sidechain (Roethlisberger et al, 2008). The designed retroaldolase catalysts (Fig 1, top right) utilize a catalytic lysine residue that forms a Schiff base with a ketone group on the substrate, and then serves as an electron sink to promote bond breakage (Jiang et al, 2008). Two bound waters are positioned by hydrogen bonding groups in the design to promote proton shuttling. The bimolecular Diels-Alder catalysts (Fig 1, bottom right) have extended binding sites to bind both substrates, and hydrogen bond donors and acceptors to increase the overlap between the HOMO of one substrate and the LUMO of the other. Finally, the designed esterase catalyst (Fig 1, bottom left) uses a His-Cys dyad for nucleophilic attack on the substrate carbonyl group, and an oxyanion hole made up of one backbone atom and two sidechains to stabilize the negative charge in the transition state.

         We have designed active enzymes for each of the four reactions, an important step forward for both the protein design and enzymology fields. Catalytic residue knockouts eliminate the catalytic activity, suggesting that the observed activities are due to the designed sites. Crystal structures have validated the structure of the designed catalysts for all but the esterase design.

         This success in de novo catalyst design is very encouraging. However, there is still huge room for improvement. First and foremost, the activities of the designed enzymes are low by comparison with native enzymes. Fortunately, we have been able to collaborate with the research groups of Danny Tawfik (Weizmann) and Don Hilvert (ETH, Zurich), who are expert in directed evolution. In collaboration with their groups, we have been able to increase the activities of the initial computational designs considerably and, equally important, learn what amino acid changes the design calculations missed, a critical part of the process of improving the methodology (Roethlisberger et al, 2008; Khersonsky et al, in press). The most active of the evolved computationally designed catalysts now has a kcat/Km of 5 × 10^5 M^-1 s^-1 and a rate enhancement over the background reaction of over 10^8; these values approach those of naturally occurring enzymes.

Design of Endonucleases with New DNA-Cleavage Specificities

         In the past several years, we have extended the Rosetta protein design methodology to protein-RNA and protein-DNA interfaces and shown that new, highly specific endonucleases can be created by redesign of the extended DNA-binding interface in homing endonucleases (Havranek et al, 2004; Ashworth et al, 2006). We are continuing to improve this methodology and designing new endonucleases that cleave within therapeutically important sites. For gene therapy applications, for example, we are designing endonucleases that cleave near the sites of mutations that cause disease; our collaborators will then experiment with correcting mutations in these genes through homologous recombination by introducing the designed endonuclease and a wild-type copy of the gene into mutant cells.  Thus far, we have successfully designed endonucleases with a range of cleavage specificities, and are getting close to the goal of producing enzymes which specifically cleave physiological target sites to go into gene therapy trials (Ashworth et al, 2006; Thyme et al, 2009).
Tradeoff between experiment and computation

         There is an interesting contrast between the roles of experiment and computation in prediction and design calculations. Because of the large native energy gaps which must exist for biological polymers to adopt unique folded states, the structure calculation problem is primarily a sampling problem: if we can sample close enough to the native structure we can almost always recognize it as the correct solution based on its very low energy. However, the sampling problem is formidable, and can only be solved from sequence information alone for the smallest proteins. For the general case, we require at least limited information on the location of the global energy minimum, which can be obtained from experiment. Thus, the general approach to structure calculation we are developing focuses computation on regions indicated from experiments; experiments come first, then computations.

         In contrast, for protein design, there is no search problem, as we can in principle design any structure or activity we are interested in;  there is not an elusive native structure we are trying to find.  However, there is also no energy gap built in through biological evolution, so noise in the energy function becomes more problematic, and more importantly, we lack a complete understanding of the principles underlying catalysis and binding.  To make up for our lack of complete theoretical understanding and the inadequacies of our computational model, we turn to experiment:  starting from initial computational designs, we use experiments to probe their limitations and evolve them to higher activities.  Thus, for design problems,  computation comes first, then experiment.

Plans for the Future

         We will continue to work to improve the physical model and the sampling methodology underlying the prediction and design calculations in Rosetta. On the structure calculation side, we will strive for consistent near-atomic resolution ab initio structure prediction for small proteins, and work towards atomic level structure determination for proteins greater than 200 amino acids using limited experimental data such as backbone-only NMR data and 5-7Å electron density data. We will focus in particular on membrane proteins and other systems for which obtaining high resolution experimental data is difficult, since this is where our approach is likely to contribute the most. We will also extend data-guided structure determination to biological assemblies where SAXS, crosslinking and other types of data often can be collected. On the design side, we will extend our methodology to non-natural amino acids and cofactors to try to leapfrog over the limitations nature has faced with the limited set of twenty amino acids. We are aiming to design a complete pathway for fuel production from CO2 using solar generated reducing equivalents. We will also develop and test methods for designing high affinity binders/inhibitors for any specified surface patch on a protein of known structure. More generally, we hope to develop new biomolecules with new functions (inhibitors, enzymes, endonucleases, and vaccines) that can have a positive impact on the world.

Figure 1. Examples of design models for experimentally validated de novo designed enzymes.  The chemical reactions catalyzed by the designed enzymes are indicated below the structure schematics in the black panels.