Computer Simulation of Protein Folding

Annu. Rev. Biol. Biomol. Struc., in press.

  • Richard A. Friesner Department of Chemistry Columbia University New York, NY 10027
    and
  • John R. Gunn Department of Chemistry University of Montreal C. P. 6128, Succursale A Montreal (Que.), Canada H3C 3J7

    I. Introduction

    The determination of protein structure from the amino acid sequence via computational methodology is a major goal of theoretical molecular biology. The determination of structure by experimental means such as NMR spectroscopy and x-ray crystallography cannot be expected to keep pace with explosion of protein sequence information provided by DNA sequencing techniques; in any case, such techniques are costly in terms of both equipment and human effort. With the cost/performance of computational hardware being reduced by roughly a factor of two every eighteen months, and with improvements in software and algorithm development providing an even greater enhancement of computational efficiency, it is clear that the goal of computation-based structure determination will eventually be reached.

    However, there are still formidable problems, both conceptual and numerical, in making this objective a reality. These problems involve both the issue of the appropriate representation of the protein (most critically, the potential functions that assign energies to any given protein structure) and the algorithmic issue of finding the native structure, starting from an unfolded chain, in a finite amount of computation time. These two central areas are of course intertwined: if computational expense was irrelevant, one could simply solve the Schrodinger equation for each protein configuration (averaging over all positions of the solvent molecules), thereby guaranteeing accurate evaluation of the potential energy. The coupled problem is then to have the total computation time required to obtaining the predicted structure, roughly the cost of evaluating the energy of each structure multiplied by the number of such evaluations, be tractable with current computing technology. The development of new potential functions and the investigation of improved methods for the minimization of functions (to reduce the number of evaluations of the potential function that are necessary) are therefore two key areas in computational studies of protein folding.

    Over the past 5 years, a wide range of approaches to the computational protein folding problem has emerged from many different research groups. At one extreme, simple cubic lattice models have been used to investigate fundamental features of heteropolymer folding which are more or less independent of the details of protein structure [15, 19, 20, 32, 61, 62]. At the other, a small number of atomic level molecular dynamics simulations of protein unfolding including explicit water molecules have been carried out [5, 17, 70]. In between, a number of different "intermediate" or "reduced" protein models have been developed and parametrized. The variety of computational techniques that have been applied to these models is equally diverse, ranging from standard versions of Monte Carlo molecular dynamics, to novel approaches like genetic algorithms [71] and the diffusion equation method of Scheraga and coworkers [44, 64].

    Useful theoretical insights into the folding process have emerged from all of these approaches as well as from analytically based theory, many of which have been discussed in recent reviews [9, 19]. The present paper, however, will be focused primarily upon the practical problem of the determination of structure from sequence by computational means. Furthermore, our emphasis will be on the thermodynamics of the problem as opposed to kinetics, i.e. the detailed pathway taken by the protein to fold in nature. While some insights into kinetics will arise naturally in the course of generating ensembles of structures during the minimization process, care must be exercised in drawing conclusions about real time events from such simulations. Finally, we restrict our discussion to the problem of prediction of novel folds which are not represented in the current protein data base; homology modeling [63] and inverse folding [6], while extremely important areas, assumes the existence of an approximate three dimensional template structure and the methods relevant to this problem differ substantially from those needed to span a wide range of phase space to search for novel folds.

    The paper is organized as follows. In section II, we give a general overview of the folding problem from a physical and computational viewpoint. We then discuss in section III the models and algorithms that we, and others, have utilized to generate folded structures. Section IV present results from these calculations and attempts to assess the degree of progress that has been made by following various lines of attack. In section V, the conclusion, we identify what we believe to be the key issues in making future progress, and how they are likely to be successfully addressed.

    II. Overview of the Protein Folding Problem

    A. Qualitative Physical Features of Protein Folding

    The two most striking features of native protein structures in water are (a) the overwhelming tendency of charged or polar side chains to appear on the "exterior" of the protein (i.e. exposed to solvent) and hydrophobic side chains to be located in the "interior" of the protein (protected from solvent), forming a unique compact globular structure for a given sequence; (b) the formation of secondary structure segments, alpha-helices and beta-sheets, so as to satisfy backbone hydrogen bonds. In a naturally occurring globular protein, the pattern of hydrophobic and hydrophilic residues is such that the formation of a micellar structure, with the hydrophilic groups exposed to water and the hydrophobic groups buried in the interior, is possible. A central feature of proteins, as compared to other polymers or colloidal structures, is that this micellar formation must be compatible with the two secondary structure classes. If the backbone groups inside the protein cannot make a sufficient number of internal hydrogen bonds, the protection of hydrophobic side chains from solvent will not compensate for the removal of the backbone carbonyl and amino moieties from aqueous solution.

    Simple topological arguments would then suggest that a relatively large fraction of the protein (~40-70%) must be in either an alpha-helix or a beta-sheet. Furthermore, each individual secondary structure region cannot be too long (or else the protein would lack the ability to fold into a compact globular structure) and the "loop" regions joining the secondary structure elements also cannot be too long (or else the backbone residues in the interior of this region would be have difficulty efficiently forming hydrogen bonds, as suggested above). The picture that then emerges is one of rigid rod-like segments (ignoring for the moment the twists and deformations of alpha-helices and beta-strands) joined by flexible loops, with each segment of length between 3 and 30 residues. This description in fact characterizes the vast majority of proteins whose structures are available in the protein data bank .

    It is well known from both theoretical and experimental results that secondary structure elements in proteins are marginally stable; indeed, helical and beta-sheet segments in proteins rarely retain their form when removed from the protein environment and studied as small peptides [2, 50]. Thus, protein folding is a highly cooperative phenomenon in which the final secondary structure pattern and tertiary architecture are determined by global optimization of the hydrophobic effect and backbone hydrogen bonding. The native structure is a result of a delicate balance of energetic terms; the overall thermodynamic stability of a native protein at room temperature is many orders of magnitude smaller than the total energy of the system. These features make the computational determination of protein structure extremely demanding with regard to the accuracy of the potential function and the magnitude of the conformational searching problem.

    B. Computational Strategies for Protein Structure Prediction

    We shall assume in what follows that the native structure of a protein corresponds to the global free energy minimum of a physically accurate potential function (this assumption may not be universally valid for all proteins, but is likely to be true for a large number). We will furthermore, as stated above, not be concerned with the kinetics of in vivo protein folding here, for example the issue of the timing of the onset of secondary structure formation as compared to collapse into a compact tertiary structure. The problem is then reduced to construction of a potential function and algorithms for minimization of that potential function. The first critical question, then, in devising a strategy for protein structure prediction, is to evaluate the suitability of the different types of theoretical protein models for this task.

    We begin by eliminating several types of models as unsuitable either on the grounds of accuracy or computational tractability. Firstly, models utilizing explicit solvent representations are too expensive to allow determination of the global energy minimum, starting from an unfolded state, with any degree of efficiency. This is not only because of the substantial CPU cost involved in evaluating the solvent-solvent and solvent protein interactions; it is also because an enormous amount of averaging over solvent configurations is required to determine the free energy of any compact protein structure. The only real hope would be to run energy conserving molecular dynamics and obtain formation of the native state on the same timescale as is observed for the in vivo system. At present, the discrepancy between the timescale for practicable MD runs (nanoseconds) and the timescale for protein folding (milliseconds at best) is a minimum of six orders of magnitude; even parallel computers will not bridge this gap any time in the near future.

    Coming from the other end of the spectrum, we believe that oversimplified models -e.g., simple cubic lattice models with nearest neighbor interactions - are inadequate to obtain predictive results for protein structures at sufficient resolution to be of biological interest. The lack of proper geometrical structure and of the peptide bond and the absence of a reasonably accurate representation of secondary structure are serious flaws which it is difficult to see how to overcome in the potential function. There are a number of results in the literature in which cubic lattice models have yielded coarse representations of native-like geometries which are ranked in the top several thousand structures out of those surveyed [15,16, 32]; however, it is unclear how useful such results would be if the native structure were not known a priori, and virtually all such studies have been carried out on quite small proteins. The question of whether simplified models can provide qualitative insight into folding thermodynamics or kinetics via universal features which are transferable to more accurate representations, as has been quite plausibly suggested by a number of workers [9, 19,61], is a more controversial subject, which we shall not discuss here due to lack of space. Even if this is the case, however, this in no way implies that detailed a priori structural predictions can be made with such models.

    If these judgments are accepted, there remain three important types of models. The first of these is detailed, atomic level models of the protein coupled with a continuum model for the solvation free energy. There is a long history of the development of molecular modeling potential functions for proteins based upon the physical chemistry of intermolecular interactions, and a significant number of protein force fields (CHARMM [7], ECEPP [60], AMBER [72], OPLS [70]) are available. The development of continuum solvent models has made considerably progress over the last decade, and a number of alternatives [21,66] (the most accurate of which are based upon numerical solution of the Poisson-Boltzmann equation [24,34] ) are currently available. While issues about accuracy of both parts of the potential remain, and the computational expense of evaluating the energy with such a potential is far from trivial, these models have the advantages of being systematically improvable as more accurate physical-chemical potential functions are developed, and of providing atomic level detail for the structure and energetics without the necessity of explicitly averaging over ensembles of solvent molecules.

    The second type of model is a reduced model which can realistically represent the geometry of the peptide bond and the various secondary structure elements, but treats side chains and intermolecular forces in an approximate manner, thus substantially reducing the computational effort required to evaluate the potential function. Potential functions must in this case be constructed heuristically, utilizing statistical information from the protein data base as well as physical-chemical energetics. Such models are probably limited to 4-6 Å resolution; however, one can hope to start from a structure of this type and then utilize a detailed atomic level model to refine the resolution further. This approach has been pioneered by Skolnick and coworkers [38-43,55,56], who began with simple lattice models but, by the use of increasingly complex lattices, have ultimately emerged with rather fine grained representation of the torsion angle phase space of the protein, along with an accurate geometric mapping from the lattice to three-dimensional space. The alternative approach to a fine grained lattice representation is to simply utilize off-lattice models with an accurate treatment of the backbone structure. In practice, it is doubtful that there is much difference between these two types of models, from the point of view of either computational efficiency or achievable structural accuracy. The crucial issues for either lattice or off-lattice models of this type lie in the details of the potential functions.

    The third type of modeling is unique to the protein folding problem: prediction of secondary structure from the protein sequence. These methods are typically based exclusively upon statistical analysis of the data base, and employ a variety of mathematical procedures, ranging from simple secondary structure propensities [23] to sophisticated pattern recognition techniques like neural networks [59] or segment based approaches [54], to predict the secondary structure state of each residue. At present, accuracy is on the order of 70% for the most successful methods, and there are clearly difficulties in going beyond this value, due in part to the lack of tertiary contact information inherent in such procedures. However, this is an active area of research, and new methods are continually being developed and tested. For example, recent work of Benner and coworkers [3,4] utilizing heuristic, as opposed to statistical, methods has achieved interesting results in blind test predictions.

    These three types of models in turn suggest three distinct strategies for computational prediction of protein structure. Firstly, one could simply minimize the energy of the atomic level model with continuum solvation. This, in essence, has been the strategy pursued by Scheraga and coworkers for the past 30 years. Putting aside questions concerning the accuracy of the potentials, the minimization of an atomic level model is formidably difficult, not only because of the cost of evaluation of the potential per structure but because the use of atomic level detail leads to a very "rough" energy landscape where it is easy to become trapped in a high energy local minimum. However, recent work of Scheraga's group has led to the development of a novel minimization approach, based upon solution of a diffusion equation [44], which appears to have the potential to circumvent these problems. While only small systems have as yet been successfully minimized, there is real hope that an approach along these lines will render direct location of the global minimum tractable at least for small proteins, albeit at substantial computational expense.

    The second strategy is to carry out reduced model simulations to determine secondary and tertiary structure simultaneously. This approach has been most vigorously pursued by Skolnick and coworkers, who have recently achieved some impressive results for small proteins, described in more detail below. The key problems here are construction of a potential capable of reliably discriminating secondary structure formation and whether the correct structures will be formed on a reasonable simulation timescale. Subsequent mapping of the final structure onto an atomic level model is relatively straightforward.

    The third strategy is to carry out reduced model simulations with fixed secondary structure, obtained from either secondary structure prediction algorithms or from experimental data, for example NMR data. This approach was initiated by Cohen, Richmond, and Richards many years ago [14] and has been pursued by Cohen and coworkers [57] and also in our research groups [28,51,52] over the past several years. It is based upon the idea that once secondary structure is specified the phase space for the folding problem is qualitatively reduced and simplified, thus allowing a relatively rapid determination of the correct tertiary packing. Recent analytical work by Wolynes and coworkers [47] suggests that a dramatic reduction of phase space when secondary structure is specified will in fact allow tractability in the conformational searching problem to be achieved. As described below, considerable success has been achieved by this approach, even for relatively large, complex structures such as myoglobin.

    Despite the improved prospects for direct minimization of atomic level models describe above, it is our belief that the development of reliable potentials and simulation strategies for reduced models is essential for solving the general computational protein folding problem, particularly if one wants to treat the full range of protein sizes and topologies. A typical reduced model evaluation of a protein potential function costs 100-1000 times less than a detailed atomic level model and this advantage will be decisive whatever minimization algorithm is used. The issue of how to best incorporate secondary structure into reduced models is at present far less clearcut, and it is likely that some combination of strategies two and three described above will ultimately be optimal for this purpose. The remainder of the paper is primarily focused upon reduced model and algorithm development and results, with a brief discussion of issues that arise in building atomic level models from reduced templates.

    III. Reduced Models, Potential Functions, and Algorithms for Protein Structure Prediction

    A. Models

    The distinguishing feature of any reduced model of proteins is the goal of lowering the number of degrees of freedom in an effort to render the problem more manageable from a computational point of view. As well as simply reducing the dimensionality of the problem, these models can also incorporate a coarse-graining of the representation of the molecule, reducing the level of detail in the model, as well as the discretization of the space of possible conformations the model can adopt, limiting the sampling required for optimization. The use of simple lattice models provides the most direct form of discretization, however often at the expense of being restricted to model polymers rather than real proteins. For a cubic lattice with each site corresponding to an amino-acid residue, the structural prediction for a real sequence is limited to roughly 7-8 Å RMS for relatively small (50-75 residues) molecules [15, 16]. One novel approach involves determining the topology of the molecule by exhaustive enumeration of paths on a diamond lattice [32], but with a flexible assignment of residues to the lattice sites, allowing them to seek out the most favorable contacts and generating a more realistic distance map.

    An alternate method has shown [25], however, that the flexibility of the backbone can be reasonably approximated by using an expanded set of basis vectors (and therefore possible nearest-neighbor bonds) along with a number of lattice sites to represent the volume of each residue, and that the results do not differ qualitatively from dynamics with continuous degrees of freedom [40]. The asymmetric torsional potential (due to the chirality of the Calpha centers) can be modeled with many-body interactions which can give rise to secondary structure [41,38,16] which would not appear otherwise [39].

    Similar models have also been introduced using the representation of the backbone and sidechains by the Calpha and Cbeta atoms, respectively, but with continuous spatial coordinates [56,26,68,67]. Other methods which use a finite set of discrete values for the internal degrees of freedom (typically backbone dihedral angles) combine the advantage of a discretized conformational space with unrestricted atomic coordinates [51,28,11,35]. The set of values can be selected from a small number of representative local conformations [51,11] shown to be able to reproduce global structure [58], or from a uniform grid in the dihedral angle space [28,35]. This type of discretization has also been extended to side-chains, where the possible conformations of each amino acid are described by a small number of rotamer states [1,42,35].

    An important question arises concerning the formation of secondary structure in the context of a reduced model, since the specific interactions which determine the detailed features of secondary structural motifs are difficult to represent with a coarse-grained model. This can be overcome by including specific potential interactions designed to stabilize the secondary structure [38,56,42,68,67], or by simply holding it fixed a priori during the course of the minimization [51,28,12]. This represents a simplification of the folding problem, but of course begs the question of knowing the secondary structure in advance. If other methods are available to assign the secondary structure, however, this effectively becomes a very simple way to represent the local interactions which stabilize the structure leading not only to a reduction in the number of degrees of freedom but also to a qualitative simplification in the potential function, allowing one to concentrate on sampling the large-scale features of the tertiary structure more efficiently.

    With the secondary structure held in place during the simulation, there is an enormous computational advantage to using a coarse-grained model based on the secondary-structure motifs--for example, a "cylinder-and-sphere" (CS) model which we have successfully employed--however at that level of resolution there is a question of the predictive value of the resulting structures. We have therefore implemented a strategy [28] designed to take advantage of this coarse-graining while at the same time retain the resolution of the residue-based (RB) reduced models described above. Our hierarchical model is based on the nested minimization of the two levels of representation while maintaining a strict one-to-one correspondence between the two. The CS model is used as a crude screen of the trial conformations, with only those structures left at the end of a CS minimization cycle being passed on as trial conformations at the residue-based RB level. The use of multiple levels of resolution has also been introduced in the context of lattice models [42], with a crossover from model to another as the simulation progresses.

    In the hierarchical model, the CS-level representation consists of the endpoints of the cylinder axes which define the secondary structural elements and the endpoints of the connecting loops. These points define the reduced chain which can be simulated by changing the corresponding internal coordinates. The RB-level representation consists of the series of backbone dihedral angles which determine the positions of the Calpha and Cbeta atoms. Since for a given set of f-y dihedral angles the pitch of the corresponding helix is easily calculated, the specification of the axis and starting point completely determines the position of all backbone atoms within the helix. The key to the hierarchical method lies in maintaining a similar identity for the flexible loop regions. This is done by using a precalculated list of loops where each entry lists both the backbone dihedral angles and the corresponding "loop geometry," which specifies the internal coordinates in the CS representation. A rapid screening of structures can therefore be carried out using only the CS model, but the complete set of RB coordinates can be generated at any time in order to evaluate the RB-level potential function. The use of both models throughout the simulation leads to a qualitative improvement of the results relative to a CS-level minimization followed by fitting to a RB representation.

    B. Potential Functions

    In the previous section, the geometrical representations of some reduced models were briefly described, however for each model of the structure there must also be a corresponding model of the force-field, and this is a much more difficult problem. In addition, from a computational point of view, the evaluation of the potential is the dominant cost compared to the manipulation of the coordinates. Reduced potential functions generally have two roles; they must represent the true potential at a comparable level of coarse-graining as the model, and they must compensate for structural features omitted from explicit consideration. The former generally includes average packing forces and the effects of solvent and hydrophobicity while the latter tries to mimic specific short-range forces such as backbone steric interactions and hydrogen bonding.

    The backbone torsional potential can be included explicitly as a function of f and y [68,67], included indirectly as Calpha pseudo-angle and pseudo-dihedral potentials [56,42], or included implicitly in using a biased selection of trial moves [51,28,1,35,68]. Hydrogen bonding can be included with specific multi-body interactions designed to identify backbone conformations characteristic of hydrogen-bonded structures [41,35], and can also be combined with non-local orientation-dependent interactions to further stabilize secondary structure [38,42]. Another approach, which we are currently pursuing as an generalization of using fixed secondary structure, consists of selecting secondary structural elements, including beta-strand pairings, from a pre-determined list of possible choices. This is a way of including the effects of short-range interactions implicitly in the choice of trial structures, rather than as a complicated set of terms in the potential function.

    It is the coarse-grained non-local potential, however, that poses the greatest challenge, because that is what determines the overall folded topology. It is precisely the subtle cooperativity of the entire sequence, which cannot be built up from known elements (excluding the case of homology modeling), that makes the protein-folding problem so difficult in the first place. In the case of globular proteins, it has been estimated [10] that in order to correctly predict the structure, in the sense of identifying the native contacts in some discretized sense, the energy of those contacts needs to be known to within a few percent. In other words, if the error in the contact energy is too great, the structure with the native set of contacts is unlikely to be a minimum of the potential. With a level of coarse-graining that includes solvent effects and sidechain conformational averaging, as well as being discretized in space, this is a formidable challenge.

    Not surprisingly, there are several reduced contact potentials currently in use in the literature [32,8,27,50,73,48]. These are generally in the form of discrete values corresponding to a square well potential (in some cases with several "steps") or fit to a simple functional form [13]. Because of the complexity of the interactions being represented, these potentials are determined empirically from the distribution of known structures in the database. This leads to a compromise between the level of resolution desired and the statistical reliability of each parameter. The pairwise interactions can be supplemented with single-residue "context" potentials [42,27] which take into account the surface exposure of a residue. Solvation can also be represented by a single size-dependent "external pressure" term in the potential [28,35]. Results obtained with a few of these potentials will be discussed further in Sec. IV.

    For the hierarchical model, a further coarse-graining was developed [28] to calculate directly the interactions between entire secondary structure elements. This was done by taking a long-range potential with a smooth distance dependence [13] and expanding it around the center-center vector between cylinders. The second term in the expansion corresponds to a "hydrophobic dipole" interaction, which depends on the distribution of hydrophobic and hydrophilic residues within each helix. While this was found to give a credible approximation to the solvation energy of helices, it fails to represent site-specific contact energies and needs to be supplemented with a specific "pairing" potential in order to reasonably model b-sheet formation [52]. The resolution of any potential at this level of detail is clearly limited, but is nonetheless useful in screening out unlikely conformations.

    Improved resolution within the context of a residue-residue potential can be achieved by incorporating orientation-dependent or many-body interactions [38,42]. This takes into account the fact that the contact potential between two sidechains depends on the relative positions and orientations of the backbone segments as well as the position of the sidechain centers. This is just a renormalization of the averaged-over internal degrees of freedom of the sidechains (and solvent). This is the rationale behind our recently-developed multi-dimensional contact potential [52], in which the interaction between two residues depends simultaneously on all four of the inter-residue Ca and Cb distances. This allows for a significant improvement over a sum of pairwise interactions of the same atoms, since the correlation among the indices contains information on how the residues "fit together". This is important for representing the excluded volume without explicit sidechains. Recent results suggest that the problems of sparse statistics can be partially overcome by fitting the potential to a simple functional form as a means of filling the gaps in the data.

    C. Simulation Algorithms

    A number of methods have been employed in an effort to effectively determine the global minima of the various models discussed up to this point. The predominant technique is that of Monte Carlo simulated annealing [41,38,56,15,16,42,35], however molecular dynamics [26] and Langevin dynamics [40,12] have also been used. In addition, non-canonical sampling methods have been suggested [31,29,30] in an effort to improve the barrier-crossing efficiency. Alternatively, we have introduced a systematic variation of the potential function itself (parameter annealing) [28] in the context of MC annealing as a means of smoothing out the potential barriers in the early stages of the simulation. This is most significant for the excluded volume, where a relaxation of the hard-core constraints substantially speeds up the development of compact structures. The resulting atomic overlaps are then slowly annealed out as the simulation progresses. The same idea is also applied to the coefficients of each term in the potential to generate a crossover from long-range to short-range interactions.

    The use of a biased distribution of trial moves, mentioned above in the context of the potential function, has also been used as a means of improving the efficiency of the conformational sampling [1]. This idea has been extended [68,67] to include the selection of segments of up to several residues from a biased distribution or from a library of fragments from the database. The precalculation of a list of multi-residue trial moves has also been used in the context of a lattice model [41,38,42] as a computational shortcut, and, as mentioned above, forms the basis of the hierarchical model [28]. As well as simply specifying the correspondence between the CS and RB models, the loop lists can also be screened according to the loop geometries in order to bias the distribution towards those loops already present in an ensemble of structures. The acceptance rate of the trial moves can be improved significantly by thus generating moves which are on average smaller displacements from the existing structures. It is highly non-trivial to generate a loop of several residues with endpoints in a desired relative position, and the use of a precalculated list makes the resulting segments available for a large number of subsequent trial moves.

    The substitution of larger and larger segments of the structure as trial moves leads naturally to the concept of genetic algorithms (especially since the system in question has a well-known genetic code!) Generalized algorithms involving cross-breeding and incorporating "offspring" structures into the "parent" population have been established for simple lattice models [71,36], and a simple implementation has been used with the hierarchical model [28]. In this method, the molecule was divided into three parts: the N-terminal end taken from one structure in the ensemble, the C-terminal end taken from another structure, and a loop taken from the list to connect them together. In this way a large number of hybrids (3200 per parent) were generated and the lowest in energy at the CS level (1 per parent) were then used as additional trial moves at the RB level. The genetic algorithm led to a qualitative improvement in the results, since it can generate new conformations which are different from any of the existing structures by more than one simultaneous loop substitution. In addition, the cross-over effectively allows the other structures to be used as a reservoir of trial loops, allowing low-energy fragments of the structure to replicate throughout the ensemble.

    The success of the genetic algorithm is also coupled to the hierarchical structure of the trial moves. The genetic algorithm allows for the effective sampling at a coarse-grained level of a "gene pool" of loop fragments, which themselves are already constructed from smaller fragments and screened independently. In this way, trial structures are built up from an evolving library of structural elements, while the random mutations are carried out independently within the individual fragments.

    D. Reconstructing atomic-level detail

    Clearly, if the results of reduced-model calculations are to provide detailed structural predictions, it is necessary to be able to generate an atomic-level structure based on the reduced representation. This will naturally depend on the simplicity of the model, but the important requirement is that this step be local, or in other words, that it can be carried out systematically without qualitatively changing the global structure. If not, then one is faced with yet another global optimization problem and the utility of the initial coarse-grained structure becomes questionable.

    From a starting point consisting of (some or all) Ca coordinates, there are two methods in use to generate a complete set of backbone coordinates. This first involves fitting to a library of fragments from the database [46,33], and has proven highly accurate in several examples. An alternative method [55] makes use of geometric criteria to place the Cbeta atoms and then deduces the values of f and y by assuming a planar peptide geometry. This method has also shown excellent performance in tests of known structures, confirming that resolution of the Calpha positions is sufficient to construct a (nearly) unique all-atom backbone.

    The addition of sidechains (beyond the Cbeta) is more challenging, as the backbone trace no longer provides a constraint. The method of template-fitting, however, can also be applied to generate complete sidechains [46], albeit with reduced accuracy compared to that of the backbone. Other methods make use of discrete "rotamer states" [53] which correspond to the observed values of c angles of each amino acid. The problem is then one of simultaneously determining the state of each sidechain. While the problem is coupled, due to the excluded volume, it is still local in that a change in the state of any sidechain only affects its neighbors and does not alter the overall structure. This problem has been addressed using an iterative mean-field technique [37] and MC simulated annealing [33,22]. It has also been shown [18] that pairwise interactions can be used to rigorously eliminate possible rotamers from inclusion in the global minimum, thereby drastically reducing the combinatorial problem. Good results can be obtained with these techniques using only a simple Lennard-Jones potential function [37,45,33].

    The success of using rotamer states has some implications for the construction of reduced models; there is a certain flexibility in choosing the point at which the reduced model ends and the "reconstruction" begins. An additional degree of freedom representing the rotamer state can be included in the coarse-grained description of a residue conformation, taking into account more of the local constrains around each Calpha center both in the potential function [42] and the choice of trial moves [1]. Alternatively, the sidechain conformations can be considered as another level of the hierarchical structure, in which new sidechains are selected from an appropriately evolving list. The important principle is the continuity of refinement, that is, the fact that successive levels of detail can be built upon simpler representations and that the knowledge of the folded structure is passed on intact to a more detailed calculation. Reduced models (or series of models) for which this holds thus offer the promise of bridging the gap between effective sampling of conformations and accurate structural resolution.

    IV. Results

    A. Recent Progress

    There has been a substantial amount of work carried out with simple lattice models of proteins in the past few years, but relatively little aimed at the predictive folding of real sequences. Some recent results in this area have shown only moderate success [15,16,32], with typical RMS deviations from the native structures on the order of 7-8 Å for small proteins in the range of 50-70 residues. Although the simplicity of the models makes them easy to work with, the lack of sufficient flexibility will likely hinder their extension to more complex proteins.

    The more sophisticated lattice models of Skolnick and co-workers have fared much better, in large part due to a much more detailed representation of secondary structure. Simple examples of both four-helix bundles [38,43] and beta-sheets [31] have been assembled from random initial conditions. The 46-residue protein crambin has also recently been folded to less than 4 Å RMS deviation from its native structure [43]. The authors state however, that results with larger, more complex proteins have shown mixed results, with secondary structural elements generally being stable, but with limited success at reproducing the overall folded topology. It is not yet known if further refinements to the potential function and simulation algorithm will be able to surmount this barrier. The off-lattice reduced model of Sun [68] has also shown mixed results. The model is capable of reproducing segments of detailed structure, but fails at finding the a globally correct conformation. Results for small proteins typically show larger errors, including, for example, a deviation on the order of 8 Å for crambin.

    On the other hand, work to date with larger proteins has proceeded generally with simplified representations of the secondary structure elements, with the emphasis on overcoming the combinatorial problem of packing them into a compact structure. In a recent example [12], a model of myoglobin (with 153 residues and eight helices) using rigid helices connected by flexible loops was folded to roughly 6 Å RMS deviation, as determined by the helical Calpha positions, with a clear resemblance to the native topology at the cylinder-and-spring level of resolution. This result shows the efficacy of coarse-grained models in determining the overall features of the tertiary structure, however lacks a clear path to atomic resolution.

    B. The Hierarchical Model

    The initial tests of the hierarchical model were concerned with the question of whether or not rigid secondary structure elements could be packed together into the correct tertiary structure. Simulations were carried out [51] for proteins forming four-helix bundles using a residue-based reduced model and a simple long-range potential function [13]. The trial moves were restricted to changes in the backbone dihedral angles of the loop residues and the conformation was minimized with a simple MC quenching algorithm. Results on the order of 4 A RMS were obtained for cytochrome b-52 and myohemerythrin (shown in Figure 1), each of which folded into essentially the correct overall structure. Despite the simplicity of the structures in this case, the significance of this result is highlighted by the fact that the corresponding native structures differ by 7.5 Å RMS. In addition, experiments using randomized or monomeric sequences failed to generate comparable structures. Most importantly, the results showed a clear difference in energy between correctly folded and misfolded structures, with the native-like folds always being lower in energy and the crystal structure itself lowest of all.

    Despite the ease with which the above results were obtained, positive results for more complex structures required considerable development of the minimization algorithm. This led to the hierarchical combination of the CS and RB models [28] as introduced in Section III. In particular, the use of the genetic algorithm and the annealing in of the excluded volume constraints were found to have an enormous influence on the speed with which compact structures were generated. It also became clear that it was necessary to work with a large ensemble of structures in order to generate results which were reproducible in a statistical sense. This underscores the magnitude of the increase in complexity associated with a mere doubling of the sequence length.

    A final distribution of the 256 lowest energy structure from a typical myoglobin simulation (~48 hours of run time on a 16 node partition of a CM-5 massively parallel supercomputer) is shown in Figure 2; these structures represent the final results of examining more than 1010 structures in the course of the simulations. The distribution was analyzed by calculating the RMS deviation between all pairs of structures and grouping those within 5 Å into distinct clusters. The results can thus be seen to fall into only two different folded topologies, one of which is clearly native-like. This result is remarkable in that the native structure itself (being more than 5 Å from any of the calculated structures) was not used in any way to identify the clusters. This suggests that there is indeed a native-like "attractor" in the potential surface, since the simulation is producing variations on a common theme rather than a uniform distribution of wrong structures. This is even more striking in the case of the misfolded structures, given that the number of ways of generating such structures grows extremely rapidly with distance from the native, and yet again only one distinct topology is observed.

    The low-energy structures from each cluster are shown in Figure 3 (for the low-RMS, superimposed on the native) and Figure 4 (for the high-RMS). Clearly, the 6 A structure closely resembles the native in the packing of the helices and has the same overall topology. In fact, most of the RMS error comes from the loop regions with the helical residues differing by only 4 Å from the native. Also striking is that the misfolded structure shares many of the same features, including a densely packed hydrophobic core and a solvated flexible loop region. The entire C-terminal half of the molecule is actually virtually the same. Without knowing the correct structure, it would be very difficult to choose between these possibilities by any intuitive means, thus demonstrating the ability of the simulation to generate "intelligently" misfolded structures which are viable tests of the potential function.

    Comparable results were also obtained for the C-terminal fragment of L7/L12 ribosomal protein, which is a mixed a/b protein with 66 residues [52]. The low-energy structure obtained from the simulation is shown superimposed on the native in Figure 5. This structure shows the correct overall topology, but considerable error in the structure of the beta-sheet. This is due to the complete neglect of hydrogen bonding and electrostatic interactions in the model, causing the beta-strands to bunch together without any specific structure (although a simple strand-strand angular term was included in the potential in an attempt to compensate for this). In the case of helical proteins, most of the backbone hydrogen bonding is inside the helices, and therefore well accounted for by holding the secondary structure fixed. That approximation clearly breaks down, however, in the case of beta-sheet formation.

    C. Evaluation of Potential Functions

    The next example studied was the small a-helical protein R69 (N-terminal domain of the phage 434 repressor protein), for which the distribution of structures is shown in Figure 6. In this case, there is clearly no native-like cluster of structures and the crystal structure itself is nowhere near the minimum in energy. This result, as well as similar problems with larger beta-sheet proteins, prompted us to look more deeply at the characteristics of the potential function. This was done by taking selected structures from the hierarchical simulations and comparing them in the context of an all-atom model with an atomic level continuum solvent potential function, in this case the AMBER protein model of Kollman and coworkers and the generalized Born (GB) model of Still and coworkers.

    Sidechains were added using the RSA algorithm [22] followed by simulated annealing of the structure to find a local minimum of the all-atom potential. The results of repeated trials [52] indicate that this can be done reliably to give structures which differ by typically 1-2 Å from the reduced model coordinates, comparable to the change observed in directly minimizing the crystal structure with the same potential. This confirms that the reduced model can provide a reasonable starting point for all-atom refinement. The resulting distributions (albeit with only a handful of structures) show that the all-atom potential is capable of distinguishing the native from the calculated structures for 1MBO and ICTF. In the case of 1R69, there are some structures slightly lower in energy than the native, however this still represents a substantial improvement over the reduced-model potential. On the other hand, the all-atom potential still did not show any significant difference between the energies of the two competing topologies for myoglobin. Further analysis of the different contributions to the energy shows that all structures have roughly the same total electrostatic energy (Coulombic plus solvation), but that the native has a much lower van der Waals packing energy. Furthermore, the native structure generally has a lower internal electrostatic energy while the calculated structures have lower solvation energies. This suggests that the reduced-model potential does a good job of predicting the overall solvation energy (consistent with the idea of hydrophobic-hydrophilic segregation), but fails to account for the details of the internal structure such as the close-packing of sidechains and close-range electrostatic interactions. While the native structure can simultaneously accommodate these interactions, the calculated structures have more unfavorable contacts, or alternatively, must relax the all-atom structure at the expense of overall compactness.

    V. Conclusions

    A central conclusion of the work described above is that the problem of determining tertiary structure once secondary structure is specified, while nontrivial from both the point of view of algorithms and potential functions, is tractable with current computing technology. The success of relatively simple potential functions for a significant number of proteins, including several with rather complicated structures, suggests that a refined potential of similar structure, in conjunction with some of the algorithmic improvements suggested above (particularly for complex b-sheet topologies), will be able to robustly generate tertiary structures in the 4-6 Å range. Such a potential can be developed by optimizing parameters and functional form against an increasingly large data base of misfolded structures. This procedure must be carried out via an iterative cycle, optimizing a potential with the current data base, adding the new low energy misfolded structures, and repeating until reasonable results are obtained for a substantial number of proteins in the data base (the "training set") using a single potential function. The resulting potential can then be tested in its ability to determine the structure of proteins it was not optimized for, i.e. those proteins in the "test set". The ultimate test, of course, is the prediction of new structures. We are currently pursuing this strategy utilizing the orientation-dependent reduced potential mentioned above [] and preliminary results indicate substantial improvement over the simplified potential functions that we have used in the past.

    Determination of secondary structure from first principles is clearly considerably more difficult. However, progress is being made with novel simulation approaches (for example in the work of Skolnick's group) and heuristic prediction methods (e.g., the work of Benner's group). These methods can perhaps be synthesized with the secondary structure packing algorithms; indeed, Skolnick already freezes in secondary structure components once the simulation their location as an algorithmic device. It is certainly straightforward to allow secondary structure elements to fluctuate in length during a simulation. A more difficult issue is whether for large proteins one can obtain the native structure if the initial guess for the secondary structure contains a major flaw. Alternatively, one could try several different initial guesses for the secondary structure pattern and see which one produced a better final structure. These approaches will likely require significantly more computation time, but this should be available with the next generation of computer hardware, particularly massively parallel machines such as the IBM SP2.

    An alternative approach, which we are pursuing in our laboratories, is to utilize experimental data in new ways in combination with reduced model simulations to determine low resolution structures. The most obvious step along these lines is to extract secondary structure from NMR data; this is considerably easier than obtaining sufficient long range NOE constraints to determine a high resolution tertiary structure. Using a small subset of NMR constraints (available from the initial experiments required to assign resonances), we have been able to improve the resolution of the structures described above by ~2 Å, as well as fold even more complicated proteins such as lysozyme to satisfactory resolution. However, this effort is only a beginning along this path. Once it is established that determination of secondary structure plus a very small number of additional distance constraints is sufficient to determine structure, there will undoubtedly be clever new experimental techniques, not only NMR but other types of spectroscopy as well, designed to reach this goal with as little effort as possible. The reduction of effort in structure determination, while not as spectacular a goal as de novo folding, may still prove to be extremely valuable in a practical sense.

    Finally, methods are still needed to take a 4-6 Å RMS structure and reduce the error to 1-2 Å RMS if the structures are to be used in applications like rational drug design. This process must necessarily involve atomic level models with improved accuracy as compared to current potential functions, which are not yet capable of achieving this level of resolution, at least in the tests that have been carried out to date. While the paths toward progress are clear - better solvation models, hydrogen bonding energies, and torsional potentials - the precision needed to obtain high resolution is not yet apparent. Nevertheless, significant advances in the next 5 years can be expected towards this goal.

    Acknowledgments

    This work was supported in part by grants from the NIH Institute of General Medicine (GM-52018) and by the NIH Division of Research Resources (P41-RR-06892) (RAF) and by a grant from the Canadian government (JRG). We thank Barry Honig and his research group for generously sharing their knowledge of all aspects of protein modeling with us over the past several years, and Peter Shenkin and Clark Still for providing access to the Macromodel molecular modeling program.

    Literature Cited:

    1. Abagyan R, Totrov M. 1994. Biased Probability Monte Carlo Conformational Searches and Electrostatic Calculations for Peptides and Proteins. J. Mol. Biol. 235:983-1002

    2. Baldwin RL. 1995. Alpha-Helix Formation By Peptides of Defined Sequence. Biophy. Chem. 1-2:127-135.

    3. Benner SA, Gerloff D. 1991. Patterns of Divergence in Homologous Proteins as Indicators of Secondary and Tertiary Structure: A Prediction of the Structure of the Catalytic Domain of Protein Kinases. Advances in Enzyme Regulation 31:121-181.

    4. Benner SA, Gerloff DL, Jenny TF. 1994. Predicting Protein Crystal Structures. Science 265:1642-1644.

    5. Boczko EM, Brooks CL. 1995. First-Principles Calculation of the Folding Free Energy of a Three-Helix Bundle Protein. Science 222:393-396.

    6. Bowie JU, Eisenberg D. 1993. Inverted Protein Structure Prediction. Current Opinion In Structural Biology 3:437-444.

    7. Brooks CR, Bruccoleri R, Olafson B, States D, Swaminathan S, Karplus M. 1983. CHARMM: A Program for Macromolecular Energy, Minimizations and Dynamics Calculations. J. Comp. Chem. 4:187.

    8. Bryant SH, Lawrence CE. 1993. An Empirical Energy Function for Threading Protein Sequence Through the Folding Motif. Proteins16:92-112

    9. Bryngelson JD, Onuchic JN, Socci ND, Wolynes PG. 1995. Funnels, Pathways, and The Energy Landscape Of Protein Folding - A Synthesis. Proteins 3:167-195.

    10. Bryngelson JD. 1994. When is a potential accurate enough for structure prediction? Theory and application to a random heteropolymer model of protein folding. J. Chem. Phys. 100:6038-45

    11. Buturovic LJ, Smith TF, Vajda S. 1994. Finite-State and Reduced-Parameter Representations of Protein Backbone Conformations. J. Comp. Chem. 15:300-12

    12. Callaway DJE. 1994. Solvent-Induced Organization: A Physical Model of Folding Myoglobin. Proteins 20:124-38

    13. Casari G, Sippl MJ. 1992. Structure-derived Hydrophobic Potential: Hydrophobic Potential Derived from X-ray Structures of Globular Proteins is able to Identify Native Folds. J. Mol. Biol. 224:725-32

    14. Cohen FE, Richmond TJ, Richards FM. 1980. Protein Folding; Evaluation of Some Simple Rules for the Assembly of Helices into Tertiary Structures with Myoglobin as an Example. J. Mol. Biol. 137:9-22.

    15. Covell DG. 1992. Folding Protein a-Carbon Chains Into Compact Forms by Monte Carlo Methods. Proteins 14:409-20

    16. Covell DG. 1994. Lattice Model Simulations of Polypeptide Chain Folding. J. Mol. Biol. 235:1032-43

    17. Daggett V, Levitt M. 1994. Protein Folding and Unfolding Dynamics. Current Opinion In Structural Biology 2:291-295.

    18. Desmet J, De Maeyer M, Hazes B, Lasters I. 1992. The dead-end elimination theorem and its use in protein side-chain positioning. Nature 356:539-42

    19. Dill KA, Bromberg S, Yue KZ, Fiebig KM. 1995. Principles of Protein Folding - A Perspective From Simple Exact Models. Protein Science 4:561-602.

    20. Dinner A, Sali A, Karplus M, Shakhnovich E. 1994. Phase Diagram Of A Model Protein Derived By Exhaustive Enumeration Of The Conformations. J. Chem. Phys. 2:1444-1451.

    21. Eisenberg D, Wesson M, and Yamahshita M. 1989. Interpretation of protein folding and binding with atomic solvation parameters. Chem. Scr. 29a:217-221.

    22. Farid H, Shenkin PS, Greene J, Fetroe J. 1992. Prediction of side chain conformations in protein core and loops from rotamer libraries. Biophys. J. 61:A350

    23. Fasman G. 1989. in The Development of the Prediction of Protein Structure (Fasman, G. ed.), pp. 193-316, Plenum.

    24. Gilson MK, Davis ME, Luty BA, and McCammon JA. 1993. Computation of Electrostatic Forces on Solvated Molecules Using the Poisson-Boltzmann Equat