Stephen Mayo (Caltech) Part 1: Protein Design by Computation

Hi, my name is Stephen Mayo. I’m a professor at Caltech, and today I’m going to give a talk entitled Recent Advances in Computational Protein Design. The focus is going to be on designing and testing protein combinatorial libraries.

The first thing you might ask is why we are interested in computational protein design. Well, fundamentally, we’re interested in understanding the underlying physical principles that govern the folding, stability, and function of proteins. Historically, to address these sorts of relationships, we’ve used perturbation-based approaches. For example, if you’re interested in understanding how an enzyme works, you might make mutations, perturbations, in the active site of the enzyme and then assess experimentally what those perturbations did to the function of the protein.

What we’re after ultimately is using a design-based paradigm to study these relationships: folding, stability, and function. In a design-based paradigm, what we’re trying to do is first construct a mathematical model, a computational model, of all the things that we believe are true of proteins: how they fold, how they achieve stability, how they function. Then, within that computational framework, we want to run a design calculation, that is, design new or novel proteins, and ask the question experimentally: are those proteins viable? Do they fold? Are they stable? Do they have function?
And if they have all these properties, then we can conclude that our computational model, our mathematical model, accurately captures the fundamental essence of what a protein is. On the other hand, if the experiments don’t work, then hopefully we can learn from those failures and modify our models in a way that allows us to go forward, building new knowledge into the model as we do designs and experimental validation.

As a chemist, ultimately, I’m interested in designing new systems, engineering new systems. In particular, we’re interested in things like engineering nanobodies, biosensors, industrial-grade enzymes, research reagents, etc., all the way to using these sorts of methods to design human therapeutics.

Okay, so in terms of an outline, I’ll give a fairly long introduction to computational protein design. Then I’ll talk about the specific topic of the presentation, which is the development and testing of methods for designing protein libraries. And then at the end, I’ll give an example of how these methods can be used to design a research reagent, in this case a blue fluorescent protein with enhanced optical properties.

Okay, so before I start on protein design in particular, let me just give you a brief overview of biology. Everything you need to know about biology is actually on this slide. Importantly, we start with DNA, and as you know, DNA gets transcribed into RNA, and then RNA gets translated into protein. We’re going to focus on the protein end of the spectrum. A lot of the hype recently has been in genomics, that is, studies related to DNA, but the real action ultimately is at the protein level, because proteins are the things that actually do stuff in cells. They are the molecular machines; they’re involved in signal transduction, in metabolism, in replication, in essentially all aspects of cellular function.

So in general, proteins are linear polymers of amino acids. The structure of the protein, and oftentimes the dynamical behavior of the protein, dictate the function of the molecule. This is important because we use this idea of the relationship between structure and function in our design effort. As I said earlier, proteins are involved in essentially all cellular processes.

Across the bottom we have examples of three different proteins, just sort of randomly selected. On the left side is a small bacterial protein called protein G, shown in full-atom representation. It is about 60 amino acids, and you can see it’s a fairly complex structure even though it’s a relatively small protein. In the middle is a protein of roughly 300 amino acids, shown as a ribbon diagram. This protein is an enzyme, and if you look closely near the middle of the molecule, you can see a small-molecule substrate. This is around, as I said, 300 amino acids.

On the right side of the slide is a membrane-bound protein. It is all helical and it resides in the membrane; the membrane would form a layer like this, and this thing would stick in the middle. You can also see that in the middle of that protein there is a chromophore that absorbs light and is part of the function of the molecule.

Okay, so when I talk about protein design, I often get questions that are related to protein folding. So let me take a minute to distinguish between protein folding prediction and protein design, which is often referred to as inverse folding. In protein folding prediction, we have an amino acid sequence, and the question is: can we make a prediction about the structure that the protein will fold to in solution? We can think about this in terms of the diagram on the bottom. We have sequence space. Sequence space is large, and the blue dot in that sequence space diagram represents the amino acid sequence of interest. The mapping that we’re interested in goes from sequence space to structure space, and this is a difficult mapping. One reason is that you’re either correct or incorrect, in the sense that there’s a single correct structure that this particular sequence will fold into.

On the other hand, protein design is essentially exactly the opposite problem. We start with some structure that we’re interested in, and the question we’re asking is: can we predict, in our case compute, an amino acid sequence that will fold to that structure if we put it into solution in the laboratory? So the mapping is exactly the opposite. We’re going from a point in structure space to a set of sequences in sequence space. You’ll notice that in the inverse direction, there will be many solutions in sequence space that satisfy our target structure.
And this degeneracy, that is, this one-to-many mapping, aids us in the protein design process. So protein design, fundamentally, is a much more straightforward endeavor than protein fold prediction, because there are many solutions that lead to the correct answer, that is, the correct structure in solution.

Library design, protein library design, is related. In this case, rather than trying to identify some single optimal sequence for our target structure, we’re going to use the same mapping procedure, but we’re now going to specify an entire list of sequences, all of which are predicted to fold to the correct structure. Then in the laboratory, we can take this list of sequences, this sequence alignment, and manifest it as a nucleic acid library, as a protein library, and then screen or run a selection on that library to find those examples which actually satisfy our design objective: folding to the correct structure, and ultimately having some desired functional characteristic.

Okay, so why do this computationally? That can be shown fairly easily on this slide. Suppose we have a single protein comprised of p residues, and as you know, there are 20 naturally occurring amino acid types. So for a single protein, there are on the order of 20 to the p different sequence combinations. There are lots of ways that you can arrange amino acid sequences for some given length of protein. So let’s think about this. If we want to design a protein that is comprised of only 18 amino acids, 18 residues, the total number of combinations is on the order of 10^23 different arrangements. If we think about that from an experimental perspective, if we wanted to chemically synthesize one example of each of the 10^23 different sequence combinations, we would need the mass of a baseball, which is shown on the top. And that’s actually almost tractable.
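The counting argument can be checked with a few lines of Python. This is only a minimal sketch: the function names are illustrative, and the longer lengths (37 and 59 residues) are included simply to show how quickly 20 to the p grows.

```python
import math

AMINO_ACID_TYPES = 20  # naturally occurring amino acid types, as stated in the talk


def count_sequences(length: int) -> int:
    """Number of distinct sequences for a protein of `length` residues (20**length)."""
    return AMINO_ACID_TYPES ** length


def order_of_magnitude(n: int) -> float:
    """Base-10 logarithm of n, i.e. the exponent x in 'about 10^x'."""
    return math.log10(n)


if __name__ == "__main__":
    for length in (18, 37, 59):
        exponent = order_of_magnitude(count_sequences(length))
        print(f"{length} residues: about 10^{exponent:.1f} sequences")
```

Python integers have arbitrary precision, so the exact counts are available even where they far exceed anything that could be synthesized.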
You can almost imagine sending a graduate student into the laboratory to do peptide synthesis and synthesize enough material that’s roughly the size of a baseball. I wouldn’t want to do it, but maybe someone would. But this gets out of hand pretty quickly. By the time you’re at 37 amino acids, which is still a trivial-sized protein, you’re at 10^48 combinations, and the equivalent mass would be the mass of the Earth. And then by 59 amino acids, which is still very small (most functional proteins are probably on the order of hundreds of amino acids long), you’re at 10^77 different sequence combinations, and you’re at the mass of the universe. So clearly there aren’t experimental methods that, in a single shot, will allow you to explore the full sequence diversity available to even small proteins.

Computationally, however, this is actually a very simple problem, in terms of the numbers at least. We can easily think about combinatorial optimization methods that will handle astronomical numbers like the ones we see here. The real challenge is: can we come up with the physics, quote unquote, that describes a protein well enough that when we do the combinatorial optimization on the computer, we end up with sequences that will in fact be viable in the laboratory?

Okay, so over the years we’ve focused both on developing methodologies for computational protein design and on applying these methods in the laboratory to various types of design problems. Most computational protein design methodologies have this set of features in common.

First of all, we have to have some representation of a protein backbone that goes into the calculation. Typically the backbone structures are derived from the protein structure database. So for most of what we do, and most of what others do, the designs are based on the backbones of proteins that are known to exist, whose structures have been solved either by crystallography or by NMR.

Importantly, amino acid side chains have conformational flexibility. A particular conformational state of an amino acid side chain is called a rotamer, and we have to be able to describe this conformational flexibility in the calculation. We do that by building rotamer libraries that are, again, derived from the known protein structure database. So the rotamer library captures, at some level, the conformational flexibility of the amino acids.

Now, importantly, in terms of capturing the underlying chemical physics of the problem, we have to specify atom-based force fields. It’s here that we are actually building our model, and I’ll show you more about that in a minute. These aren’t force fields like on Star Trek; these are simple mathematical equations that describe the interactions between sets of atoms. How we write those equations down, and the parameters that go into those equations, are ultimately the important aspect of our ability to do this design, along with how we’re treating side chain flexibility through the rotamer library and how we are treating, or not treating, the way the protein backbone may relax as the design progresses.

With those three elements, we can move to more computer-science-oriented issues, and that’s where the combinatorial optimization algorithms come into play. We can easily construct a combinatorial problem for design and employ standard, improved, or novel methods for doing those sorts of optimization calculations. So that’s really straightforward computer science and applied math, but applied to the problem of protein design.

And then ultimately there’s this issue of negative design. For many protein design problems, it’s not sufficient to focus on achieving what you want, the positive design result: I want to build a protein that does x. You also have to, in many cases, consider what you don’t want; that is, you must be able to design against the bad outcomes.
So if you’re thinking about building a bridge, for example, clearly you want the positive design aspect, which is to build a bridge so that cars can drive across it, but you also have to guard against negative consequences, like what happens if there’s a high wind; you don’t want that bridge to fall down. Protein design is similar in that regard. There are many types of protein design problems where an element of negative design is actually critical. For example, designing against aggregation is an important feature, as is designing against your amino acid sequence folding to some alternative structure that you don’t want. There are conceptual ways and pragmatic ways to incorporate negative design into your calculation.

Okay, so with that collection of methodology, we’ve been able to write software that implements this, and to use that software to think about design problems related to understanding sequence, structure, stability, and function relationships, and the evolution of protein structure and function. We can do things that are related to hypothesis-driven inquiry, and often we’re interested in using these methods to enable a new protein-based biotechnology.

Okay, so here are just some brief examples of things that we’ve done over the years. Back in 1997, we showed for the first time that you can do all this stuff and it actually works. We selected a small protein fragment, stripped off all the side chains, ran our design calculations, and came up with a new sequence, and then in the laboratory showed that the new protein sequence actually folded and had the same structure as the original design target. You can see that in red is the design target, and in blue is the structure of the actual designed protein.

To the extent that the atom-based force fields are accurate, and to the extent that the combinatorial optimization algorithms are giving us amino acid sequences that are optimal for their target fold, that is, thermodynamically optimal, we should be able to design sequence variants of naturally occurring proteins that are hyperthermostabilized for their folds. On the bottom left is such an example. We took a small bacterial protein and showed that through computation we can design a set of mutations that turn that protein from a normal mesophilic protein with a normal melting temperature into a protein that is hyperthermostable, with a melting temperature in excess of 100 degrees Celsius.

Also, in terms of function, it’s easy to imagine ideas of transition state stabilization being folded into these methods, then doing optimizations with substrates and transition states in the calculation, and hence designing an enzyme. On the lower right, we have an example where we took a non-catalytically active protein and designed a very primitive enzyme by building a very primitive active site. This molecule we refer to as a protozyme. It’s not a real enzyme, because the kinetic parameters aren’t as good as those of a real enzyme, but mechanistically it shows all the standard features of real enzymes: it shows saturation kinetics, it can be inhibited, and we can even see the reaction intermediate by mass spectrometry.

Okay, so in general when we think about doing a design, we refer to this process flow chart. The first thing we obviously want to do is specify the design objective. Once we have a design objective in mind, we can employ our computational design software, which in our case is called ORBIT, which stands for Optimization of Rotamers by Iterative Techniques. The output of that calculation is an amino acid sequence, or a list of amino acid sequences. We can then take that list and use standard molecular biology techniques, gene assembly and expression in our favorite host; we typically use E. coli but have used other hosts when appropriate. Once the protein is expressed, we can isolate and purify it using standard techniques, and then evaluate its properties using standard biophysical approaches. It’s at that second-to-last step that we learn whether or not we’ve been successful in our design. If not, then hopefully we can learn quantitatively where the failings are and iterate on the design process to get closer to our design objective over multiple rounds.

Okay, so let me take you through a very simple design calculation to give you an idea of what we’re actually computing. This is a case where we have, on the lower left, a model of a protein. Maybe it’s a protein with two alpha helices. We’re going to select two positions in this protein for which we are going to do a design: position P1 and position P2. The first thing we have to do is specify the list of amino acids that we’re going to allow at each of these positions in the design calculation. At position P1, we’ve selected alanine, valine, and serine as the allowed amino acids. But you’ll notice that in the valine column, for example, there are three different valines, with the subscripts 1, 2, and 3. These are the three different rotameric states of valine. As you’ll recognize chemically, valine has a single relevant rotatable bond in the side chain, and it can exist in three energetically stable states, three different rotamer states. For serine, in this simple example, we’re also using three different rotamers. For position P2, we’re selecting the same amino acids and allowed rotamers to keep the calculation simple.

Now, for a normal design calculation, there might be tens to hundreds of design positions, and hundreds to thousands of rotamers per design position. So what I’m hoping to convey here is that this is a simple example, and real examples are much larger and much more computationally challenging than what I’ll show right now.

Okay, so there are two types of energies that have to be computed. The first is what we often refer to as the one-body energy, and that is the energy of interaction between an amino acid side chain, a rotamer, and the rest of the protein backbone. And then, as you’ll see momentarily, there is a second type of energy, called the two-body interaction, which is the interaction between pairs of rotamers, pairs of side chains. Okay, so the first thing we have to do is build the rotamer onto the backbone.
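To make the size of this toy problem concrete, here is a small Python sketch that enumerates the rotamer choices at the two design positions and counts how many one-body and two-body energies would need to be computed. The labels such as `Val2` and the dictionary layout are illustrative assumptions, not ORBIT's data structures; alanine is given a single rotamer because its side chain has no rotatable bond to sample.

```python
from itertools import product

# Allowed amino acids and rotamer counts in the toy example:
# one alanine rotamer, three valine rotamers, three serine rotamers.
ROTAMERS_PER_AMINO_ACID = {"Ala": 1, "Val": 3, "Ser": 3}


def rotamer_choices(rotamer_counts):
    """Expand {amino_acid: n} into labels such as 'Val1', 'Val2', 'Val3'."""
    return [f"{aa}{k}" for aa, n in rotamer_counts.items() for k in range(1, n + 1)]


# Both design positions use the same allowed set, as in the talk.
positions = {
    "P1": rotamer_choices(ROTAMERS_PER_AMINO_ACID),
    "P2": rotamer_choices(ROTAMERS_PER_AMINO_ACID),
}

# One one-body energy per rotamer per position.
n_one_body = sum(len(choices) for choices in positions.values())

# One two-body energy per (P1 rotamer, P2 rotamer) pair.
n_two_body = len(list(product(positions["P1"], positions["P2"])))

if __name__ == "__main__":
    print(positions["P1"])
    print(f"one-body energies: {n_one_body}, two-body energies: {n_two_body}")
```

With seven rotamer choices per position, the pair table is only 7 by 7; the tens to hundreds of positions in a real calculation are what make the full problem large.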
So we grab the first rotamer, alanine, which has a single rotamer, and we put it on the protein backbone, shown as the yellow ball. Then we have to compute its interaction energy, the interaction energy between the atoms of the alanine side chain and the rest of the protein backbone. The way we compute the energy is by evaluating the types of potential functions that are shown on the slide.

The first potential function is a van der Waals potential. It scores the interactions between atoms based on their distance. If atoms get too close together, the energy skyrockets, as shown on the graph. The x-axis there is the distance between atoms; as they get closer together the energy becomes large and positive, which is bad. You always want your energies to be negative, as low as possible. If atoms are at some appropriate distance, however, then you get an attractive energy, which is shown as the minimum in this plot. So that’s the van der Waals potential function; it basically scores steric interactions.

There are other types of potential functions. Here is a simple electrostatic potential, based on Coulomb’s law. In the diagram, the amino acid with the blue tip is a lysine, which carries a positive charge, and the amino acid with the forked red tips is either an aspartic acid or a glutamic acid, carrying a negative charge, so there will be an electrostatic attraction between those two amino acids. We can score that, in this case, using a simple Coulombic potential. There’s also a hydrogen bonding potential: certain polar amino acids form strong interactions that are primarily electrostatic in nature but also have a strong directional component, and we can capture that with this potential function.

In addition, we often use a solvation potential, where the solvation terms are divided into various groups. We want to benefit the burial of hydrophobic amino acids, that is, benefit hydrophobic surface area burial. We want to penalize the exposure of hydrophobic surface area; this is actually a negative design term. If we have too much exposed hydrophobic surface area in our designs, then the proteins will most likely aggregate, and that is an alternative state that we don’t want. We also want to penalize the burial of polar surface area. Charged and polar amino acids don’t like to be stuffed into the hydrophobic core of a protein; they’d rather be exposed to solvent or interacting with other polar amino acids. That is captured, at least schematically, in that last equation.

Okay, then for all the rotamers, at each position, one at a time, we evaluate these potential functions and store the energies in this vector, this one-body energy array. As we go through, you’ll see the different amino acids with their individual rotamers getting loaded onto the protein structure, one at a time, and the calculation being done. You’ll see now that it switches to position P2, and its energies are also stored in this one-dimensional array. So now we’ve computed all the one-body energies for the system.

The next step is to compute the two-body energies, and that’s illustrated here. Again, for a normal calculation, we’d have many more positions, so this matrix would be very large. Here we have a single pair of positions, because there are only two positions in the calculation. For the first position, you’ll notice I have all the amino acids and their rotamers down the side, and for the second position, all the amino acids and their rotamers are across the top. I’m going to do the same sort of calculation, but now the protein backbone is not important, so it fades away and is not part of the calculation. At position P1, I select the first choice, alanine number 1, and at position P2 I select its first choice, alanine number 1, as well. And I compute the interaction energy. Now I’m just looking at the interaction energy between the atoms in the first rotamer and the atoms in the second rotamer, and I store that value in the matrix. I’m using the same set of potential functions that I used for the one-body interactions. Then I step through and compute all the pairwise interactions in the system, and eventually I fill up the entire matrix.

So now I’ve computed all the energies. I’ve got a list of one-body energies and a list of two-body energies, and now is where I can do an optimization calculation. For the optimization calculation, the protein structure goes away; we no longer have to worry about the protein backbone.
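The energy bookkeeping described above can be sketched in Python. Everything specific here is an assumption made for illustration: the 12-6 Lennard-Jones form stands in for the van der Waals term, the parameter values (`epsilon`, `r_min`, `dielectric`) are arbitrary, atoms are plain `(x, y, z, charge)` tuples, and the hydrogen bonding and solvation terms are left out. It is not ORBIT's force field.

```python
import math

# Converts e^2/Angstrom to kcal/mol (standard electrostatic conversion factor).
COULOMB_CONSTANT = 332.06


def lennard_jones(r, epsilon=0.1, r_min=3.8):
    """12-6 van der Waals term: steeply repulsive at short range,
    with an attractive minimum of -epsilon at r = r_min."""
    ratio = r_min / r
    return epsilon * (ratio ** 12 - 2.0 * ratio ** 6)


def coulomb(q1, q2, r, dielectric=40.0):
    """Coulomb's law between two point charges, damped by a dielectric constant."""
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * r)


def interaction_energy(atoms_a, atoms_b):
    """Sum both terms over every atom pair between two groups of atoms."""
    total = 0.0
    for xa, ya, za, qa in atoms_a:
        for xb, yb, zb, qb in atoms_b:
            r = math.dist((xa, ya, za), (xb, yb, zb))
            total += lennard_jones(r) + coulomb(qa, qb, r)
    return total


def build_energy_tables(backbone, rotamers_p1, rotamers_p2):
    """Return (one-body list for P1, one-body list for P2, two-body matrix)."""
    # One-body: each rotamer against the fixed backbone.
    one_body_p1 = [interaction_energy(rot, backbone) for rot in rotamers_p1]
    one_body_p2 = [interaction_energy(rot, backbone) for rot in rotamers_p2]
    # Two-body: each P1 rotamer against each P2 rotamer; the backbone plays no part.
    two_body = [[interaction_energy(a, b) for b in rotamers_p2] for a in rotamers_p1]
    return one_body_p1, one_body_p2, two_body
```

Once the tables are filled, the coordinates are no longer needed; the optimization works only on the stored numbers.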
In fact, in the simplest cases, we don’t even care that this is a protein anymore. All we have are a bunch of numbers that we’re trying to optimize. We’re trying to march through this matrix and find a combination of numbers that gives us the lowest energy. What I’ve done for this example is retain the two-body matrix intact and break the one-body interaction vector into two pieces. Down the side is the piece for the first position, and across the top are the energy values for the rotamers at the second position.

In this particular optimization, I’m going to employ a technique called Monte Carlo simulation, which is based on effectively random selections of potential solutions, scoring those choices, and then making decisions about whether we want to keep or reject various changes. Ultimately we’ll be able to sample the matrix in a way that allows us to find solutions that are low in energy.

The first thing we have to do is initialize the search. I have two positions that I’m worried about, P1 and P2. I randomly select a rotamer for position P1, the second serine in this case. I randomly select a rotamer for position P2, the first valine. I grab the two one-body energies and the single two-body energy and I just sum them. The sum of those three numbers is 21.8, so that’s our initial energy. Then I make random changes to the system, keeping track of what I’m doing, and eventually, via this Monte Carlo process, I’ll end up finding the best solution, which in this case has a total energy of -9 kcal/mol. It turns out to be the third serine rotamer at position P1 and the second serine rotamer at position P2, so maybe they’re on the helix making a hydrogen bond.

Once I’ve identified the solution, I can map it back onto the structure. So now I’ve got an atomically detailed model of the result of the calculation, and I can use that for other purposes.
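A Monte Carlo search over the stored energies might look like the following Python sketch. The talk only says that changes are kept or rejected; the Metropolis acceptance rule, the `temperature` parameter, the step count, and all function names used here are assumptions chosen for illustration.

```python
import math
import random


def total_energy(i, j, one_body_p1, one_body_p2, two_body):
    """Energy of choosing rotamer i at P1 and rotamer j at P2:
    two one-body terms plus the single two-body term."""
    return one_body_p1[i] + one_body_p2[j] + two_body[i][j]


def monte_carlo_search(one_body_p1, one_body_p2, two_body,
                       steps=2000, temperature=1.0, seed=0):
    """Return (best_state, best_energy) found by a Metropolis random walk."""
    rng = random.Random(seed)
    sizes = (len(one_body_p1), len(one_body_p2))

    # Initialize with a random rotamer at each position.
    state = [rng.randrange(sizes[0]), rng.randrange(sizes[1])]
    energy = total_energy(state[0], state[1], one_body_p1, one_body_p2, two_body)
    best_state, best_energy = tuple(state), energy

    for _ in range(steps):
        # Propose a random rotamer at one randomly chosen position.
        trial = list(state)
        pos = rng.randrange(2)
        trial[pos] = rng.randrange(sizes[pos])
        trial_energy = total_energy(trial[0], trial[1],
                                    one_body_p1, one_body_p2, two_body)

        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        delta = trial_energy - energy
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            state, energy = trial, trial_energy
            if energy < best_energy:
                best_state, best_energy = tuple(state), energy

    return best_state, best_energy
```

For a table this small, every combination could simply be enumerated; the random walk matters only when the number of positions and rotamers makes enumeration impossible.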
I can look at it, and I can use my intuition to evaluate whether or not this is a reasonable solution. I can take that sequence and pass it on to someone in the laboratory who might actually make that protein.

Okay, now the Monte Carlo simulation is actually very easy; it’s perhaps the simplest way of doing these optimizations. For many real-world design calculations, Monte Carlo is not necessarily sufficient, and there are more sophisticated optimization methods that can be used to do protein design, some of which are listed on the lower right. The method that we prefer is called FASTER, which is, actually, fast, and very effective at finding low-energy solutions for these sorts of design problems.

Okay, so here are other recent results that have been reported from different labs. On the upper left is a case from the Hellinga lab at Duke, where they’ve been able to use these sorts of methods to design biosensors for small molecules. In the middle is another example from that lab, where they were able to design enzymatic functionality into a protein. On the upper right is an example from the Baker lab at the University of Washington, where they were able to design a novel fold, which was an impressive result. On the lower part of the slide is a protein therapeutics result from a company called Xencor, which I was involved in co-founding. In this particular case, we were able to design a variant of a protein that might be useful for anti-inflammatory indications.