Deep Mutational Scanning to Analyze Protein Function
Understanding the functional and biophysical characteristics of proteins is of paramount importance. We have developed a method, deep mutational scanning (Figure 1), that makes use of protein display technology in conjunction with high-throughput sequencing. Deep mutational scanning enables the investigation of protein function on an unprecedented scale, facilitating the simultaneous measurement of the fitness of hundreds of thousands of mutants of a protein.
Protein display technologies physically link proteins and the DNA sequences that encode them. Protein display allows selection, from a large library of protein variants, of those with a function of interest. The scope of protein display has been restricted by its reliance on back-end DNA sequencing, which has limited the number of selected protein variants that can be identified to a few hundred. Deep mutational scanning alleviates this bottleneck by using high-throughput sequencing to sequence tens of millions of individual library members in parallel (Figure 1). The primary benefit of this approach is that millions of protein variants can be simultaneously identified and counted. Comparison of the frequency of a given variant in a selected library and in the input library yields an enrichment ratio that is an estimate of function. The key ingredients (protein display, low-intensity selection, and highly accurate, high-throughput sequencing) are simple and becoming widely available. Deep mutational scanning data can be used to construct protein sequence-function maps, and systematic analysis of these data can reveal fundamental protein properties. We have applied deep mutational scanning to a number of proteins in a variety of functional assays.
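The enrichment ratio described above can be illustrated with a minimal sketch. It assumes variant counts have already been tallied from the input and selected libraries. The function name, the log2 scale and the `pseudocount` parameter are illustrative assumptions, not details taken from our published analysis.

```python
import math

def enrichment_ratios(input_counts, selected_counts, pseudocount=0.5):
    """Return log2(frequency after selection / frequency in input) per variant.

    input_counts, selected_counts: dicts mapping a variant label to a read count.
    pseudocount: hypothetical smoothing term so that variants missing from one
    library do not produce a division by zero or log of zero.
    """
    variants = set(input_counts) | set(selected_counts)
    n_in = sum(input_counts.values()) + pseudocount * len(variants)
    n_sel = sum(selected_counts.values()) + pseudocount * len(variants)
    ratios = {}
    for variant in variants:
        f_in = (input_counts.get(variant, 0) + pseudocount) / n_in
        f_sel = (selected_counts.get(variant, 0) + pseudocount) / n_sel
        ratios[variant] = math.log2(f_sel / f_in)
    return ratios
```

A positive value indicates a variant that became more frequent during selection, and a negative value indicates one that was depleted. Ratios for variants with very few input reads are noisy and are usually best filtered out before interpretation.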
In Vivo Deep Mutational Scanning of an RNA-Recognition Motif (RRM)
Throughout its life, an RNA molecule associates with diverse RNA-binding proteins that regulate its processing and function. A single RNA-binding protein typically recognizes a particular subset of RNA molecules and affects their collective fate by regulating one or more steps in RNA metabolism, from pre-mRNA splicing to mRNA localization, translation and decay. Since these functions underlie multiple fundamental cellular processes, genetic changes that disrupt RNA-binding protein function can lead to multifaceted human pathologies.
We are using deep mutational scanning, an experimental strategy that couples high-throughput DNA sequencing with assays of protein function, to study the effects of sequence variations on the function of a very common RNA-binding domain called the RNA Recognition Motif (RRM). Specifically, we made use of the necessity of a functional poly(A) binding protein (Pab1) for yeast growth and survival to test the in vivo effects of numerous mutations in the Pab1 RRM2 domain (Figure 1). In our system, the endogenous PAB1 gene has been deleted and replaced with a plasmid expressing the wild-type PAB1 from a tetracycline-regulated promoter. A second plasmid within these cells expresses one of many variants carrying a random mutation in the PAB1 RRM2. Adding a tetracycline analog to the culture shuts off the expression of the wild-type gene, making the cells completely reliant on the performance of the mutant PAB1 for growth. High-throughput sequencing of the library variants before and after addition of the tetracycline analog allows us to measure the change in frequency of each variant, which in turn can be used as a proxy for the function of the mutant PAB1 RRM domain.
To date, we have obtained information on the effects of more than 250,000 RRM2 variants on Pab1 performance. These data have allowed us to create a structure-function map of RRM2 that shows the importance of the beta-sheet structure to the function of this RNA-binding domain and points to functionally important residues outside the RNA-binding site, which are currently under study (Figure 2). We also used these data to define functionality-based consensus sequences for the two RNA-binding motifs within RRM2 (Figure 3).
Clustering the mutation sensitivity profiles of RRM2 residues allowed us to differentiate between structurally important positions in this domain, such as the RNA-binding interface and the hydrophobic core (Figure 4), an approach that could be highly useful when studying a protein with poor structural data.
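As an illustration of clustering mutation sensitivity profiles, the sketch below groups positions by average-linkage agglomerative clustering on Euclidean distance. This is one possible choice made for the example and is not necessarily the method we used. The profile layout (one list of scores per position, in a fixed substitution order) is an assumption.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length score profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_positions(profiles, n_clusters):
    """Group positions whose mutation sensitivity profiles are similar.

    profiles: dict mapping a residue position to a list of scores, one per
    substitution, in the same substitution order for every position.
    Returns a list of clusters, each a sorted list of positions.
    """
    clusters = [[pos] for pos in sorted(profiles)]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        dists = [euclidean(profiles[p], profiles[q]) for p in c1 for q in c2]
        return sum(dists) / len(dists)

    while len(clusters) > n_clusters:
        # Find and merge the closest pair of clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]
```

Positions that tolerate almost no substitutions would be expected to group apart from positions that tolerate most substitutions. For a full domain, a library routine such as those in SciPy would be a more practical choice than this quadratic-time loop.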
We compared the ratio scores of naturally occurring single amino acid substitutions in the Pab1 RRM2 domain with those of synthetic substitutions (Figure 5). While most of the natural changes were neutral in their effects, as one should expect for a protein with a highly conserved function such as Pab1, some substitutions had deleterious effects. Mapping these mutations on the RRM2 structure revealed that most of them affect residues at the protein surface. We suspect that this approach allows the identification of protein interaction sites that diverged throughout evolution. Indeed, it was previously shown that some of these residues are involved in binding to a major translation factor, eIF4G. Our Pab1 mutational data allow us to dissect this site at higher resolution to refine the sequence required for binding. Taken together, we suggest that combining evolutionary data with single amino acid substitution-based data may improve our understanding of known protein binding sites as well as identify novel interaction sites on protein surfaces.
•Daniel Melamed, David Young & Christina Miller
Activity-enhancing mutations in an E3 Ubiquitin ligase discovered by deep mutational scanning
Although ubiquitination plays a critical role in virtually all cellular processes, understanding of the mechanistic details of ubiquitin transfer is still rudimentary. To identify the molecular determinants within E3 ligases that modulate activity, we developed a high-throughput assay (Figure 1) to measure the activity of nearly 100,000 protein variants of the U-box domain of murine Ube4b and found rare mutations that enhanced activity both in vitro and in cellular p53 degradation assays. Our results highlight the utility of high-throughput mutagenesis in delineating the molecular basis of enzyme activity.
•Lea Starita & Russell Lo
High-throughput Analysis of a Protein Degradation Signal
The ubiquitin proteasome system (UPS) governs most of the regulated proteolysis in eukaryotes. Substrates destined for proteasomal degradation are often modified with ubiquitin, which is attached to these substrates by a series of enzymes called E1, E2, and E3. A primary degradation signal of UPS substrates that is recognized by E3 enzymes is known as a degron. We designed a high-resolution strategy to map the sequence-function relationships of a known degron in a systematic manner by combining a simple genetic tool with high-throughput sequencing. Our system is based on the fact that yeast cells that express the URA3 gene grow in the absence of uracil. By fusing the Ura3 enzyme to a degron that leads to its rapid degradation, we can alter the stability of the enzyme and thus the ability of the yeast cells to grow without uracil. We optimized this system using a well-characterized degradation signal, Deg1 from the Matα2 protein, fused to Ura3. To query mutations of every amino acid in the degron for their ability to stabilize or destabilize Ura3, we replaced the wild-type Deg1 sequence with a library, constructed from doped oligonucleotides, that was designed to have a million different mutations in the N-terminal 33 amino acid region of Deg1. Plasmids containing the degron clones were prepared from cultures after uracil selection and subjected to Illumina sequencing. By comparing the number of times each degron mutant appears in the input pool versus in the selected pool, we can gain insight into how each mutation affects the stability of Ura3. This simple but powerful technique is also being applied to other biological questions that revolve around protein stability.
For example, to investigate the genes and pathways involved in the degradation of the Deg1-Ura3 substrate, we transformed the pooled library of 6,000 yeast gene deletion strains with this fusion construct. Our idea is that a combination of Deg1-Ura3 expression and the deletion of certain genes responsible for turnover of this fusion protein will lead to a change in the amount of Deg1-Ura3 present, and thus a change in cell growth rate in the absence of uracil. To measure the growth rate of each transformed strain in the pool, we use sequencing to identify the unique 20-nucleotide barcode sequence present in each deletion strain. By comparing the numbers of sequenced barcodes, we are able to find deletion strains that grow better than wild type, allowing us to identify novel genes and pathways associated with the degradation of Deg1-Ura3 or other UPS substrates.
In addition, we have designed and implemented a computational pipeline that can effectively process, annotate, and analyze high-throughput barcode sequencing data.
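The general shape of such a barcode analysis can be sketched as follows. This is a simplified illustration, not the pipeline itself. It assumes each read begins with the 20-nucleotide barcode and that a barcode-to-gene lookup table is available. The function names, the exact-match rule and the `pseudocount` parameter are hypothetical.

```python
import math
from collections import Counter

def count_barcodes(reads, barcode_to_gene, barcode_len=20):
    """Count reads per deletion strain, annotated by the deleted gene.

    Assumes (hypothetically) that the barcode occupies the first
    barcode_len bases of each read; reads with unknown barcodes are skipped.
    """
    counts = Counter()
    for read in reads:
        gene = barcode_to_gene.get(read[:barcode_len])
        if gene is not None:
            counts[gene] += 1
    return counts

def rank_strains(before, after, pseudocount=1.0):
    """Rank strains by log2 change in barcode frequency, best growers first."""
    genes = set(before) | set(after)
    n_before = sum(before.values()) + pseudocount * len(genes)
    n_after = sum(after.values()) + pseudocount * len(genes)
    scores = {}
    for gene in genes:
        f_before = (before.get(gene, 0) + pseudocount) / n_before
        f_after = (after.get(gene, 0) + pseudocount) / n_after
        scores[gene] = math.log2(f_after / f_before)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Exact matching discards reads with a sequencing error in the barcode. A production pipeline would typically tolerate a small number of mismatches and compare each strain against a wild-type control.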
•Griffin Kim & Christina Miller
Systematic Analysis of Large Scale Fitness Data to Identify Mutations that Stabilize Proteins
Enhancing protein stability is often critical for industrial and pharmaceutical applications. Stabilizing mutations permit acquisition of other, destabilizing mutations that improve function. This phenomenon can be observed as epistasis, where multiple mutations combine with unpredictable fitness effects. We identify stabilizing mutations in a WW domain based solely on parallel measurement of the fitness of 47,000 variants in binding a peptide ligand and subsequent calculation of >5,000 epistasis scores (Figure 2A). We introduce an epistasis-based metric, “partner potentiation,” which identified 15 candidate stabilizing mutations, including three known stabilizing mutations (Figure 2B). We tested six novel candidates by thermal denaturation and found two highly stabilizing mutations, one more stabilizing than any previously known mutation. Thus, systematic analysis of large-scale protein fitness data can reveal fundamental physicochemical properties such as stability.
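To make the idea of an epistasis score concrete, the sketch below uses one common definition: the log fitness of a double mutant minus the sum of the log fitness values of its two single mutants. This formula is an assumption for illustration, since the exact definition is not given here, and the partner potentiation metric is not reproduced. The data layout and function name are hypothetical.

```python
def epistasis_scores(log_fitness):
    """Compute pairwise epistasis under a multiplicative (log-additive) model.

    log_fitness: dict mapping a frozenset of mutation labels to the log
    fitness of that variant relative to wild type.
    Returns a dict mapping (mutation_a, mutation_b) to the epistasis score.
    Double mutants lacking data for either single mutant are skipped.
    """
    scores = {}
    for variant, w_double in log_fitness.items():
        if len(variant) != 2:
            continue
        a, b = sorted(variant)
        single_a, single_b = frozenset([a]), frozenset([b])
        if single_a in log_fitness and single_b in log_fitness:
            expected = log_fitness[single_a] + log_fitness[single_b]
            scores[(a, b)] = w_double - expected
    return scores
```

A positive score means the double mutant performs better than expected from its single mutants. A mutation that shows positive epistasis with many otherwise deleterious partners is the kind of pattern one would expect from a stabilizing mutation.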
High-throughput assays to assess the effects of mutation on BRCA1 function
Our goal is to develop technology that allows us to rapidly assess the function, in human cells, of all the variants of a human protein that contain a single amino acid substitution. We will prototype this technology on the tumor suppressor BRCA1 protein, in which germline mutations result in a vastly increased risk of breast and ovarian cancer, and then extend the technology to other proteins implicated in cancer risk. BRCA1 is a large protein that has multiple activities and interactions. In collaboration with the lab of Jay Shendure in the Department of Genome Sciences, we will generate libraries of all single amino acid variants in the BRCA1 protein. We will assess these variants for their proficiency in DNA repair (using the entire protein), in E3 ligase activity (using the RING domain), and in phosphopeptide binding (using the BRCT domain). We will compare the results from our assays to data on disease risk and progression for known variants in order to establish the utility of our approach.
BRCA1 is an 1863 amino acid protein with a RING domain at its N-terminus and tandem BRCT domains at its C-terminus. BRCA1 has at least two biochemical activities. First, together with BARD1, BRCA1 acts as an E3 ubiquitin ligase. Second, the BRCT domain binds to phosphorylated peptides, an activity critical for BRCA1-dependent DNA repair via homologous recombination.
To assess the effect of mutation on ubiquitin ligase function, we will generate a library of coding variants of the RING domains of BRCA1-BARD1 fused to the coat protein of T7 bacteriophage. The E3-containing phage will be subjected to in vitro ubiquitination assays with a tagged version of ubiquitin, followed by antibody selection for phage that contain ubiquitin and that therefore encode an active E3 ligase. Phage that harbor an active E3 ligase will increase in abundance throughout the selection while phage harboring an E3 ligase with a deleterious mutation will decrease in abundance.
To assess the effect of mutations on the ability of the BRCT domain of BRCA1 to bind to a phosphopeptide, we will generate a library of coding variants of the BRCT domain fused to the coat protein of T7 bacteriophage. The BRCT-containing phage will be bound to beads coated with a cognate phosphopeptide, and those phage harboring a binding-proficient BRCT domain will increase in abundance throughout the selection, while phage harboring a BRCT domain with a deleterious mutation will decrease in abundance.
To assess the effect of mutations in BRCA1 on DNA repair via homologous recombination, we will generate a library of full-length BRCA1 coding variants. The library will be transduced into a reporter cell line that contains two defective copies of the GFP gene. One of these copies contains a site for a double-strand break catalyzed by I-SceI, and the other copy serves as a donor to repair the break via homologous recombination. The endogenous BRCA1 gene will be shut down by introduction of an shRNA against the gene. After induction of the double-strand break, cells that contain an active variant of BRCA1 will repair the break and produce green fluorescent protein (GFP). GFP+ cells will be separated from GFP- cells by FACS, and the BRCA1 alleles of the two populations will be determined by DNA sequencing. Cell-based assays are performed in collaboration with Jeff Parvin at the Ohio State University.
•Lea Starita & Russell Lo
Enrich: Software for Analysis of Protein Function by Enrichment and Depletion of Variants
•Doug Fowler & Carlos Araya
We developed Enrich, a tool for analyzing deep mutational scanning data. Enrich identifies all unique variants (mutants) of a protein in high-throughput sequencing data sets and can correct for sequencing errors using overlapping paired-end reads. Enrich uses the frequency of each variant before and after selection to calculate an enrichment ratio, which is used to estimate fitness. Enrich provides an interactive interface to guide users. It generates user-accessible output for downstream analyses as well as several visualizations of the effects of mutation on function, thereby allowing the user to rapidly quantify and comprehend sequence-function relationships. Enrich is implemented in Python and is available for download under a FreeBSD license. Enrich includes detailed documentation as well as a small example data set.
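The idea of correcting sequencing errors with overlapping paired-end reads can be sketched as follows. This is not Enrich's code or interface. It is a simplified illustration that assumes the forward and reverse reads cover exactly the same region, and the rule for resolving disagreements (keep the base with the higher quality score) is an assumption made for the example.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def merge_overlapping_pair(fwd_seq, fwd_qual, rev_seq, rev_qual):
    """Build a consensus from a fully overlapping read pair.

    fwd_qual and rev_qual are lists of integer Phred quality scores.
    Where the two reads disagree, the base with the higher quality is kept
    (hypothetical rule chosen for this sketch).
    """
    if len(fwd_seq) != len(rev_seq):
        raise ValueError("this sketch assumes reads of equal length")
    # Put the reverse read in the same orientation as the forward read.
    rev_seq = reverse_complement(rev_seq)
    rev_qual = rev_qual[::-1]
    merged = []
    for f_base, f_q, r_base, r_q in zip(fwd_seq, fwd_qual, rev_seq, rev_qual):
        if f_base == r_base or f_q >= r_q:
            merged.append(f_base)
        else:
            merged.append(r_base)
    return "".join(merged)
```

Because each position is read twice, a low-quality miscall in one read can be overridden by a confident call in its mate, which reduces the number of spurious variants counted.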
Fowler DM, Araya CL, Gerard W, Fields S. Enrich: Software for Analysis of Protein Function by Enrichment and Depletion of Variants. Bioinformatics. 2011 Oct 17. [Epub ahead of print]
Understanding the Molecular Basis of Selectivity in the Protein Kinase A/AKAP-79 interaction
(with the laboratory of John Scott, HHMI and Dept. of Pharmacology, University of Washington)
Protein Kinase A (PKA) is a central intracellular protein kinase that regulates the activity of many proteins involved in cellular metabolism. PKA activity is controlled via interactions with A Kinase Anchoring Proteins (AKAPs). AKAPs function by binding to the PKA regulatory subunit, localizing PKA within the cell. AKAPs can interact with either the alpha or the beta isoform of the regulatory subunit of PKA, or they can interact with both. The alpha and beta isoforms are highly similar, making it difficult to study the molecular determinants of selectivity between isoforms (Figure 3).
We are using phage display in combination with high-throughput sequencing to identify the sequence determinants of AKAP selectivity. We displayed a library of millions of mutagenized AKAP proteins on the surface of T7 phage and then subjected this library to selection against either the alpha or beta isoform of the regulatory subunit of PKA. By comparing the abundance of each variant before and after selection, we derived enrichment ratios for several hundred thousand variants. Most variants performed similarly in selections against both the alpha and beta isoforms. However, some variants displayed strong selectivity for either the alpha or beta isoform. We are using the results of this assay to develop highly alpha- and beta-specific AKAPs. These highly specific AKAPs will bind only to PKAs with the cognate regulatory isoform. If introduced into cells at high concentrations, they will disrupt the normal regulatory interaction for their cognate isoform, enabling us to study the biological significance of the isoforms.
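One simple way to flag isoform-selective variants from two sets of enrichment ratios is sketched below. It assumes log-scale enrichment ratios from the alpha and beta selections are already available. The difference-based score, the `threshold` value and the labels are illustrative assumptions rather than the criteria used in our analysis.

```python
def selectivity(log_ratio_alpha, log_ratio_beta, threshold=1.0):
    """Label variants by the difference between their two enrichment ratios.

    log_ratio_alpha, log_ratio_beta: dicts mapping a variant label to its
    log enrichment ratio in the selection against each isoform.
    Only variants measured in both selections are scored.
    Returns a dict mapping variant to (difference, label).
    """
    result = {}
    for variant in log_ratio_alpha.keys() & log_ratio_beta.keys():
        diff = log_ratio_alpha[variant] - log_ratio_beta[variant]
        if diff >= threshold:
            label = "alpha-selective"
        elif diff <= -threshold:
            label = "beta-selective"
        else:
            label = "non-selective"
        result[variant] = (diff, label)
    return result
```

Most variants would be expected to fall in the non-selective group, with the selective candidates at the extremes of the difference distribution.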
•Doug Fowler & Jason Stephany