Why did the protein structure prediction accuracy in terms of GDT-TS decrease from 2008 (CASP 8) to 2014 (CASP 11)?


I read on

Why did the protein structure prediction accuracy in terms of GDT-TS (Global Distance Test - Total Score) decrease from 2008 (CASP 8) to 2014 (CASP 11)?

Target difficulty

Comparison of performance across targets within a CASP or between CASPs requires consideration of the relative target difficulty. As in previous papers (2–5), we consider the difficulty in terms of two factors: first, the extent to which the most similar existing experimental structure may be superimposed on a modeling target, providing a template for comparative modeling; and second, the sequence identity between the target and the best template over the superimposed residues. Figure 1 shows the difficulty of targets in all CASPs. Targets are divided into domains using the procedure described in Methods. CASP10 domains span a wide range of structure and sequence similarity, as did those in earlier CASPs. Labeled targets are discussed later. The inset shows average target difficulties. Here it is apparent that the full set of CASP10 targets (‘10’) is of similar average difficulty to that in CASP9, and substantially easier than that in CASP5. CASP10 human/server targets (6) (‘10h’), on the other hand, are of similar difficulty to that of the full CASP5 set, by these measures.

Relative modeling difficulty of CASP targets, as a function of the fraction of each target that can be superimposed on a known structure (horizontal axis) and the sequence identity between target and template for the superimposed region (vertical axis). Each point represents one target. Inset shows the average values for each CASP. For recent CASPs, averages are shown for server only targets (marked with an “_s” suffix), human/server targets (“_h”), and complete set of targets (“_all”). CASP10 human/server targets are on average of similar difficulty to those of CASP5, by these measures.

Improvement over a best template

Historically, in template based modeling there was very limited ability to model parts of a structure not present in a template. Already in CASP5 we had seen progress in this regard. Figure 2 shows the fraction of residues that are not covered by the best structural template but are correctly modeled in the best model (by the criterion of Cα errors less than 3.8 Å) in CASPs 5, 9 and 10, as a function of target difficulty. (A single parameter difficulty index is used, based on a linear combination of the coverage and sequence identity used in figure 1 (see Methods)). Only the targets in which at least 15 residues could not be aligned to the best template are considered. There has been significant progress in this area since CASP5: For the relatively easy targets coverage of non-best template residues has increased from […] 40%, and in the mid-range of difficulty from […] 35%. The average improvement over the full difficulty range is 5%. Also somewhat larger is the scatter of the values for CASPs 9 and 10 compared with CASP5, clearly visible on the plot. In CASP10, there are 13 targets where predictors were able to model more than 40% of residues not covered by the best templates, while in CASP5 there are only 4 such cases. As discussed later, recent CASPs contain a number of targets that are more difficult to model accurately in ways not captured by the standard scale, and these pull down the apparent overall performance. A balanced comparison with CASP5 is very difficult, but it appears that the real improvement since then is substantially more than the averages and trend lines suggest. As also discussed later, the improvement is further disguised by the increased difficulty of picking the best template in recent CASPs.

% of residues successfully modeled that were not available from the single best template. Each point represents the best model for a human/server target for CASPs 9 and 10, and all targets for CASP5. CASP10 performance is similar to that found in CASP9, and markedly improved over CASP5.
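
The quantity plotted in figure 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-residue Cα errors and the template-coverage mask are taken as given inputs, and the function name and argument layout are hypothetical rather than taken from the CASP evaluation code. The 3.8 Å criterion and the 15-residue minimum come from the text above.

```python
import numpy as np

def non_template_success(ca_errors, covered_by_template,
                         max_error=3.8, min_uncovered=15):
    """Fraction of residues NOT covered by the best template that the
    model nevertheless places within `max_error` Angstroms (C-alpha).

    Returns None when fewer than `min_uncovered` residues are uncovered,
    mirroring the rule that such targets are left out of the analysis.
    """
    errors = np.asarray(ca_errors, dtype=float)
    covered = np.asarray(covered_by_template, dtype=bool)
    uncovered = ~covered
    if uncovered.sum() < min_uncovered:
        return None
    return float(np.mean(errors[uncovered] < max_error))

# Toy target: 30 residues, the last 20 are absent from the best template.
covered = np.array([True] * 10 + [False] * 20)
errors = np.array([1.0] * 10 + [1.0] * 10 + [5.0] * 10)
fraction = non_template_success(errors, covered)
print(fraction)
```

In the toy input, half of the 20 uncovered residues are within 3.8 Å, so the function reports that half of the non-template region was modeled successfully.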

Overall model accuracy

Figure 3 shows the trend in overall backbone accuracy for the best models submitted for each target as a function of target difficulty and using the GDT_TS measure (7) . GDT_TS of 100 would correspond to exact agreement between the Cα co-ordinates of a model and the corresponding experimental structure. In practice, GDT_TS of 90 reflects an essentially perfect model, as at that GDT_TS level model deviations are comparable to experimental error and deviations due to varying experimental conditions. Random structures typically return a GDT_TS between 20 and 30. As previously noted, progress between CASP1 and 5 is dramatic. Progress by this measure since CASP 5 is not apparent. Although several recent CASPs have trend lines above that of CASP5, the CASP10 line is essentially the same as CASP5.
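
As a rough illustration of the GDT_TS measure, the following Python sketch averages the fraction of Cα atoms lying within 1, 2, 4 and 8 Å of their experimental positions. These are the standard GDT_TS cutoffs; they are not stated in the text above. The sketch assumes the model has already been superimposed on the experimental structure. The LGA program searches over many superpositions to maximize each fraction, so this single-superposition version can only underestimate the real score. The function name and array layout are illustrative.

```python
import numpy as np

def gdt_ts(model_ca, native_ca, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Approximate GDT_TS for two pre-superimposed (N, 3) C-alpha arrays."""
    model_ca = np.asarray(model_ca, dtype=float)
    native_ca = np.asarray(native_ca, dtype=float)
    # Per-residue C-alpha distance between model and experimental structure.
    dist = np.linalg.norm(model_ca - native_ca, axis=1)
    # Fraction of residues within each distance cutoff.
    fractions = [np.mean(dist <= c) for c in cutoffs]
    return 100.0 * float(np.mean(fractions))

# Toy example: four residues displaced by 0.5, 1.5, 3.0 and 9.0 Angstroms.
native = np.zeros((4, 3))
model = native.copy()
model[:, 0] = [0.5, 1.5, 3.0, 9.0]
score = gdt_ts(model, native)
print(score)  # 56.25
```

The toy model has 1, 2, 3 and 3 of its 4 residues inside the four cutoffs, giving (25 + 50 + 75 + 75) / 4 = 56.25. An identical model and target give 100.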

Best GDT_TS scores of submitted models for targets in all CASPs, as a function of target difficulty. For recent CASPs, human/server targets are included, and in earlier CASPs, all targets. Trend lines show little significant change in this measure since CASP5.

Given the obvious progress in modeling non-template regions seen in figure 2 , this is a very puzzling result. One observable effect in figure 3 is that there are some CASP10 targets which fall way below the trend lines, pulling the overall performance down. They include the four domains of target 739, a large, elongated, intimately trimeric, phage tail spike protein (8) . Targets of this difficulty were seldom found in early CASPs. There are also targets that fall well above the trend lines in CASP 10, for example 743 and 717-D2, corresponding to some of those with greatest non-template region success, as seen in figure 2 . We have investigated several general factors that may explain the similarity of CASP5 and recent CASP performance. First, the ‘human/server’ subsets of targets are used for recent CASPs, as opposed to all targets for CASP7 and earlier. Figure S1 shows the same plot using all targets in all CASPs. Here the CASP10 line is above that of CASP5, but only by a little. Second, it may be that as CASP has progressed, targets have tended to become more complex, multi-domain, and multi-chain. Interdomain and interchain interactions influence structure in a manner not easily modeled. A plot for the single domain targets is also similar, though (figure S2).

One significant difference between CASP5 and CASP10 targets is structure irregularity, as measured by radius of gyration, R. Figure 4 shows the radius of gyration of domain targets from CASPs 5, 9 and 10, as a function of target length. Also shown are the boundaries in which most PDB structures fall, 2.5 Å on either side of a line derived by fitting to the radii of PDB crystal structures determined at 1.7 Å or better resolution. The form of this line ($R = 2.77\,L^{0.34}$, where L is target length) is similar to that found in an earlier study (9). While almost all targets fall within these boundaries, there are twelve outliers constituting 17% of all human/server predictor perspective domains in CASP10 (one of the outliers is at a radius of 60 Å, and not shown for clarity) and only four (constituting 6% of all domains) in CASP5.
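
The outlier test described here can be sketched as follows. The fitted line R = 2.77 L^0.34 and the 2.5 Å band are taken from the text; whether the radius is computed over Cα atoms or all atoms is not stated, so the unweighted coordinate-based version below is an assumption, and the function names are hypothetical.

```python
import numpy as np

def radius_of_gyration(coords):
    """Unweighted radius of gyration of an (N, 3) coordinate array."""
    coords = np.asarray(coords, dtype=float)
    centered = coords - coords.mean(axis=0)
    return float(np.sqrt((centered ** 2).sum(axis=1).mean()))

def expected_rg(length):
    """Fitted line from the text: R = 2.77 * L**0.34 (Angstroms)."""
    return 2.77 * length ** 0.34

def is_outlier(rg, length, tolerance=2.5):
    """True if rg lies more than `tolerance` Angstroms from the fitted line."""
    return abs(rg - expected_rg(length)) > tolerance

# A 100-residue domain is expected to have R of roughly 13.3 Angstroms.
print(round(expected_rg(100), 2))
print(is_outlier(60.0, 100))  # the 60 Angstrom target is far outside the band
```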

Radius of gyration of CASP targets as a function of target length. Dashed lines mark the boundaries +/−2.5 Å on either side of a line (not shown) derived from fitting to high resolution crystal structures. CASP10 has a number of unusually high radius targets (one at 60 Å, not shown).

We also consulted members of the prediction community for possible explanations of the apparent lack of progress. Several suggested that although by our criteria the average structural coverage provided by the best available CASP10 templates is similar overall to that in CASP5, best templates have become more difficult to identify in practice, making CASP10 targets effectively harder. To investigate this factor, we compared three sets of templates for targets from CASP10 and CASP5. One set is the one used for the standard analysis of target difficulty. That is, the template is taken from the PDB structure that has maximum coverage of the target, as determined by structure superposition using LGA (7). The second set of templates is derived from the PDB structures with the best PSI-BLAST score to each target sequence (10), a method in use from roughly CASP2 through CASP4. The third set has templates derived from the PDB structures with the best HHsearch score (11), one of the most effective profile-profile type methods. This class of methods came into general use in CASP5, and although some improvements have been made, probably has not essentially changed since. Figure 5A shows the comparison of coverage using these three template sets, as a function of target difficulty. The following points are clear: First, LGA derived templates provide essentially the same average coverage in CASP5 (red line) and in CASP10 (black), at all levels of difficulty. Second, except at the easy target end of the scale, PSI-BLAST derived templates from CASP5 (dotted red) and CASP10 (dotted black) provide very substantially lower coverage than the LGA ones ([…] 75 in the mid-range of difficulty). Third, PSI-BLAST coverage for CASP10 is significantly worse than for CASP5 (about 8% in the mid-range). Fourth, HHsearch derived templates also provide substantially lower coverage than LGA ones ([…] 15 difference in the midrange), although not as low as with PSI-BLAST. Fifth, coverage by CASP10 HHsearch templates is lower than the corresponding CASP5 ones by up to 10%, though this difference disappears at the more difficult end of the scale.

(A): Target coverage provided by three classes of template: best available (solid lines), best detectable using HHsearch (long dashes), and best using PSI-BLAST (short dashes). With both sequence-based methods, achievable coverage is substantially lower than that provided by the best available template, and lower in CASP10 (black lines) than in CASP5 (red lines), showing that good templates are harder to find in recent CASPs. The dash-dotted line shows coverage of CASP10 targets obtained using HHsearch and the CASP5 structure database. The low coverage indicates that increased database size is not the primary cause of increased difficulty in finding good templates in CASP10.

(B): Average loss of coverage relative to the best available template for the best templates found with the methods shown in panel (A) and for templates declared by three of the best performing CASP10 servers. With both PSI-BLAST and HHsearch, loss of coverage is substantial and larger for the CASP10 human targets than for those of CASP5. Declared parent lists for best models do contain near optimal templates, but typically amongst many others. Best templates for CASP10 human targets returned by the selected servers have similar coverage to HHsearch. This view of the data further supports the conclusion that identification of near optimal templates has become substantially harder since CASP5.

Figure 5B shows the reduction in average template coverage using PSI-BLAST and HHsearch compared with the coverage provided by the best available template, for CASP5 and CASP10 (the latter for all and for human/server targets separately). For both methods, the loss of coverage is quite substantial (between 17 and 25% with PSI-BLAST and 7 and 13% with HHsearch). Further, there is a significant difference between the coverage loss for different CASP target sets. In particular, for HHsearch, the most relevant for recent CASPs, CASP10 human targets suffer a 6% greater loss than CASP5 targets. We also examined a fourth way of assigning templates, based on the sets of templates that prediction submissions stated were used in the building of models (normally provided in the ‘PARENT’ field in a standard CASP prediction file). The ‘PARENT’ analysis in figure 5 shows these data. Typically, a number of templates are declared in a prediction file. We superimposed all of the declared templates onto the target structure and selected the one with the highest coverage. As the plot shows, the PARENT template lists do usually contain an entry with nearly as good coverage as that of the best available template (the left 3 bars in the PARENT section). Note though, that in the underlying calculations we took into account all the templates acknowledged by the predictor groups, in this way establishing the maximum achievable performance by the community as a whole. To check if a specific group can consistently include the best template in the list, we examined predictions from three of the better performing CASP10 servers: Rosetta, Zhang-server and Tasser-VMT on all CASP10 targets. As can be seen, the methods are roughly equal in their ability to pick a good template, on average losing about 12% of coverage compared with the best possible template and 5% compared with the best HHsearch template.
Note that the weight given to that template in the modeling method may be small, and therefore these results can be deceptive. Overall, it does appear that it has become harder to pick a good template since CASP5, resulting in about an average 6% loss of coverage.

Since template quality goes a long way to determining overall model quality, these data suggest that CASP10 models would be expected to be worse than those of CASP5, because of the greater difficulty of choosing a good template using HHsearch-like methods. In fact, figure 3 shows they are of similar quality, suggesting that improvements in modeling methods have roughly compensated for the increasing difficulty of the targets.

The question remains as to why it is harder to identify a near-optimal template in recent CASPs. Both structure and sequence databases have grown enormously in the last decade (the PDB roughly quintupled, while NCBI’s NR database grew twenty-fold), so that there is a much larger background effect to deal with. It has been shown that including too many sequences in a multiple sequence alignment leads to less accurate alignments (12), but so far there are no published methods of optimizing sequence inclusion. For structure, if increased database size made good candidates harder to find, then better template choices would be made for CASP10 targets using the structure database available at the time of CASP5. We tested this possibility. In fact, templates chosen in this way using HHsearch provide very substantially (13% on average) less coverage than those found using the CASP10 structure database (figure 5A). It should be mentioned, though, that inclusion of all CASP10 targets in such a comparison is not restrictive enough, as many targets that have very good, easily identifiable templates in the CASP10 structural database would not have had such at the time of CASP5. To eliminate this bias, we repeated the analysis comparing only the targets that had quite good templates (coverage >40%) in the CASP5 database (i.e., essentially eliminating free modeling targets) and where the difference in coverage between the best CASP10 and CASP5 templates was below 20% (i.e., eliminating those TBM targets where in the last decade a much better template has become available). For the remaining subset of 67 CASP10 targets the difference in coverage was much smaller (only 3%), but still in favor of the CASP10 dataset. So, while it is true that picking good templates has become harder, it is not clear why that is the case.

Alignment accuracy

Figure 6 shows alignment accuracy as a function of target difficulty over all the CASPs. Trends here are very similar to those of figure 3 for backbone accuracy. The similarity of the two plots suggests that overall model quality continues to be dominated by alignment accuracy, in spite of the improvement in non-template region modeling discussed earlier. There is no apparent improvement in alignment since CASP5, consistent with the increased difficulty of finding a near optimal template, discussed above. The large fall-off in overall alignment quality as a function of target difficulty in figure 6 is a combination of two effects: actual alignment errors and the extent to which the best template covers the target. Figure 7 shows the difference between the achieved alignment accuracy and that theoretically possible using the best template, for the template based modeling targets in CASPs 5, 9 and 10. It is apparent that already in CASP5, errors are quite small: close to zero for easy models, about 10% in mid-range, and rising to 25% at the difficult end of the scale. While there is evidence of improvement in CASP9, compared to CASP5, CASP10 and CASP5 results are very similar. It is likely this is because remaining errors are sufficiently small that they cannot be resolved at the sequence level, and further improvement will only come from the use of methods that test alignment alternatives at the three dimensional structure level.


Relative target difficulty

The difficulty presented by a modeling target depends on many factors. Two primary ones are the similarity of the protein sequence to that of a protein with known structure, and the similarity of the structure of the target protein to potential templates. Other factors are known to affect modeling difficulty but their influence cannot be easily quantified (8-10). In particular, the number and phylogenetic distribution of related sequences affect the difficulty of detecting an evolutionary relationship to the target and also the quality of the alignment that can be generated; the number and structural distribution of available templates influence modeling of the regions of structure not present in the single best template.

Consistent with previous analyses, we define the difficulty of a target based on a combination of the structural coverage of the target by the best template and the target-template sequence similarity of the superimposed regions (see Methods for details). Omitting additional factors such as those mentioned earlier will add some noise to the relationship between model quality and our difficulty scale.


Some targets may consist of two or more structural domains. Since domains within the same structure may present modeling problems of different difficulty, assessment treats each identifiable domain as a separate target. As domain definitions are nearly always subjective, for most of the analysis we subdivide template-based modeling (TBM) targets into domains only if these divisions are likely identifiable by a predictor and require different modeling approaches (i.e., belong to different difficulty categories), or the domains are sequentially related to different templates. There are 15 such targets in CASP8. For evaluation of nontemplate-based models, all domains identified by the assessors are treated as separate targets. In CASP6, 7, and 8 some single-domain targets were trimmed by the assessors to avoid evaluation on uncertain residues or residues strongly influenced by crystal packing. To maintain consistency with the previous articles (4-7), our analysis was based on the untrimmed domains, following the notion that predictors had no means to establish a priori which residues in the target would be discarded by the assessors. Obviously, this choice affects the results presented here, as approximately one-fifth of CASP8 targets were trimmed in the assessors' analysis because of crystal contacts and additional reasons (see Methods for details). We do use official (trimmed) domain definitions for the multidomain targets containing at least one free modeling (FM) domain and for some of the NMR targets (see Methods), where deviation of the models in the ensemble is very large.


All models, correct structures and MQAP predictions were downloaded from the Prediction Center. Both the global and local quality for all models were then calculated locally. To provide a baseline, two of the most frequently used MQAPs, Verify3D and ProsaII, were also added to the analysis.

All comparisons are performed on a set of models or residues (in the case of local prediction) where all MQAPs do have a prediction, that is, if one MQAP did not make a prediction for a particular model or residue this is not used in the analysis. Care was taken to make sure that not too many residues or models were excluded. If one method only made a few predictions that method was excluded instead. In total, 19,221 models from 87 targets and 1,932,288 residues from 76 targets were used in the assessment of global and local quality prediction, respectively.

Evaluation of global quality

In the present analysis, GDT_TS (27) is used as an evaluation measure of model correctness, as this is nowadays a CASP standard. There are some discrepancies in performance depending on which evaluation measure is used, in particular in terms of correlation (see Discussion), but for the calculation of the overall quality of the models the difference is marginal.

The analysis is performed using the full-length PDB chains, that is, not using the domain definitions provided by the CASP organizers. The reason for this is that the quality that was predicted, was the quality for the full-length PDB chain and not the quality of the individual domains.

For assessing the overall performance two measures were used: (i) Pearson's correlation coefficient and (ii) the average GDT_TS for the highest ranked models for each target (GDT1). During the evaluation other measures such as the best of the top five ranked models and Receiver Operating Characteristic curves (ROC) were used. The overall ranking using any of these measures is consistent with the two measures used.
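
The two global measures can be sketched as below. The data layout (a dictionary mapping each target to a list of predicted-score and GDT_TS pairs) and the function names are assumptions made for illustration. The text does not say whether the correlation is computed per target or over all models pooled together; the sketch pools them.

```python
import numpy as np

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])

def gdt1(models_by_target):
    """Average GDT_TS of the model each MQAP ranks highest, per target.

    models_by_target: {target_id: [(predicted_score, gdt_ts), ...]}
    """
    top = [max(models, key=lambda m: m[0])[1]
           for models in models_by_target.values()]
    return float(np.mean(top))

# Toy data: two targets with two models each (hypothetical values).
predictions = {
    "T1": [(0.9, 70.0), (0.5, 80.0)],
    "T2": [(0.2, 30.0), (0.8, 50.0)],
}
pooled = [m for models in predictions.values() for m in models]
r = pearson([p for p, _ in pooled], [g for _, g in pooled])
top_quality = gdt1(predictions)
print(r, top_quality)
```

For T1 the top-ranked model has GDT_TS 70 (not the best available, 80), and for T2 it has 50, so GDT1 is 60 in this toy case. The gap between GDT1 and the best available model is what a better ranking could recover.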

Evaluation of local quality

Assessing the performance of MQAPs that predict local quality is tricky. The MQAPs were supposed to predict how much (in Angstrom) each residue in a model deviates from its correct position. The simplest performance measure would be to compare the absolute differences between the correct and predicted values. However, this measure has the severe flaw that it does not recognize the ability of a method to capture the good and bad regions in a model. The real value of local quality prediction is when the method is able to distinguish between high and low quality regions.

In this study, a simple measure that analyzes a fraction (here 10%) of the highest and lowest scoring residues in terms of correctness is used. The idea is that a good method should have more correct residues among its highest scoring residues and more incorrect residues among the lowest scoring. A similar scheme has been used in an earlier study and was shown to highlight relevant performance differences in agreement with what is observed using ROC curves (24). The scheme only requires a ranking of the residues, and it does not matter if the prediction is consistently off by a couple of Angstroms as long as it separates good from bad residues.

Two quality measures are used to give a quantitative estimate of the performance for the high and low ranking residue sets: average local CA–CA deviation and the fraction of wrongly predicted residues, defined as residues with CA–CA deviation >3 Å or <3 Å for the highest and lowest ranked residues, respectively. To avoid the effect of a few large CA–CA deviations, the average deviation was calculated by scaling all deviations to the interval [0,1] using the S-score (24) before taking the average. This scaling effectively puts all deviations larger than 10 Å at 0.
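
A minimal sketch of this ranking-based evaluation follows. It assumes the S-score has the common form S = 1 / (1 + (d/d0)^2); the value d0 = 3 Å is a placeholder, since the text does not give it. The function names and the returned dictionary keys are hypothetical. "Highest ranked" is taken to mean the residues with the smallest predicted deviation.

```python
import numpy as np

def s_score(deviation, d0=3.0):
    """Map a deviation in Angstroms to (0, 1]; d0 is an assumed scale."""
    deviation = np.asarray(deviation, dtype=float)
    return 1.0 / (1.0 + (deviation / d0) ** 2)

def rank_extremes(predicted_dev, actual_dev, fraction=0.10, cutoff=3.0):
    """Evaluate the top and bottom `fraction` of residues by predicted deviation."""
    predicted = np.asarray(predicted_dev, dtype=float)
    actual = np.asarray(actual_dev, dtype=float)
    n = max(1, int(round(fraction * len(predicted))))
    order = np.argsort(predicted)           # smallest predicted deviation first
    best, worst = order[:n], order[-n:]
    return {
        "best_mean_s": float(s_score(actual[best]).mean()),
        "worst_mean_s": float(s_score(actual[worst]).mean()),
        # A highest-ranked residue is wrong if it actually deviates > cutoff.
        "best_frac_wrong": float(np.mean(actual[best] > cutoff)),
        # A lowest-ranked residue is wrong if it actually deviates < cutoff.
        "worst_frac_wrong": float(np.mean(actual[worst] < cutoff)),
    }

# Toy case: a predictor that ranks 20 residues perfectly.
true_dev = np.arange(20, dtype=float)
result = rank_extremes(true_dev, true_dev)
print(result)
```

Because only the ordering of the predicted values is used, adding a constant offset to every prediction leaves the result unchanged, which matches the point that a consistent error of a few Angstroms does not matter.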

In addition to the measures described above, the Pearson's correlation coefficient between the predicted and the actual CA–CA deviation is calculated for each model. It should be noted that the correlation does not take accuracy into account, but the accuracy was found to be generally acceptable for correlation coefficients higher than 0.6.

Easy and hard targets

The average model quality differs significantly between targets, and particular MQAPs might perform differently depending on target difficulty. It was found that the distribution of the GDT_TS per target median was bimodal with peaks around GDT_TS = 20 and 60 (data not shown); therefore, a cutoff of GDT_TS = 40 was used to separate easy from hard targets, that is, all targets with a median GDT_TS below 40 were classified as hard and those above as easy. In most cases, this provides a good estimate of the difficulty of the target. The exception is T0328, where most servers missed the only good template that was released during the CASP season, resulting in a too low median.
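
The easy/hard split can be sketched as follows. The cutoff of 40 is from the text; the data layout, the function name and the treatment of a median of exactly 40 as easy are assumptions. The target identifiers and scores in the example are made up.

```python
from statistics import median

def classify_targets(gdt_by_target, cutoff=40.0):
    """Label each target 'hard' if its median model GDT_TS is below cutoff."""
    return {
        target: ("hard" if median(scores) < cutoff else "easy")
        for target, scores in gdt_by_target.items()
    }

# Hypothetical per-target GDT_TS values for all submitted models.
example = {
    "target_a": [15.0, 22.0, 18.0, 55.0],   # median 20.0
    "target_b": [62.0, 58.0, 71.0],         # median 62.0
}
labels = classify_targets(example)
print(labels)
```

The first toy target shows the weakness noted above for T0328: one good model (55) does not move the median, so the target is still labeled hard even though a good template existed.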


Multiple rounds of unrestrained MD simulations were employed to build Markov state models covering the conformational space between a given homology model and the experimental structure. In the first round, ten 100-ns MD simulations were started from each of the homology model and the experimental structure. Residues present in only one of the structures were removed from the other so that the systems matched. The resulting conformations were projected with tICA from Cα−Cα distance matrices and clustered using Euclidean distances in tICA space. New clusters that were located between the native and homology structures were then preferentially used as starting points for subsequent simulations. Several simulation trajectories were generated for each starting structure at the next iteration, and the procedure was repeated until there was sufficient overlap between the sampling initiated from the experimental structure and the homology model to build a single combined Markov state model. Further details of the simulation methodology, MSM construction, and scoring and reweighting protocols are given in SI Appendix.


We compare results of the community efforts in modeling protein structures in the tenth CASP experiment, with those in earlier CASPs particularly in CASP5, a decade ago. There is a substantial improvement in template based model accuracy as reflected in more successful modeling of regions of structure not easily derived from a single experimental structure template, most likely reflecting intensive work within the modeling community in developing methods that make use of multiple templates, as well as the increased number of experimental structures available. Deriving structural information not obvious from a template is the most demanding as well as one of the most useful tasks that modeling can perform. Thus this is gratifying progress. By contrast, overall backbone accuracy of models appears little changed in the last decade. This puzzling result is explained by two factors—increased database size in some ways makes it harder to choose the best available templates, and the increased intrinsic difficulty of CASP targets as experimental work has progressed to larger and more unusual structures. There is no detectable recent improvement in template-free modeling, but again, this may reflect the changing nature of CASP targets. Proteins 2014 82(Suppl 2):164–174. © 2013 Wiley Periodicals, Inc.


A longstanding challenge in using computational methods for protein structure prediction is the refinement of low-resolution structural models derived from comparative modeling methods into highly accurate atomistic models useful for detailed structural studies. Previously, we have developed and demonstrated the utility of the internal coordinate molecular dynamics (MD) technique, generalized Newton–Euler inverse mass operator (GNEIMO), for refinement of small proteins. Using GNEIMO, the high-frequency degrees of freedom are frozen and the protein is modeled as a collection of rigid clusters connected by torsional hinges. This physical model allows larger integration time steps and focuses the conformational search on the low frequency torsional degrees of freedom. Here, we have applied GNEIMO with temperature replica exchange to refine low-resolution models of 30 proteins taken from the Critical Assessment of Structure Prediction (CASP) competition. We have shown that the GNEIMO torsional MD method leads to refinement of up to 1.3 Å in the root-mean-square deviation in coordinates for 30 CASP target proteins without using any experimental data as restraints in performing the GNEIMO simulations. This is in contrast with the unconstrained all-atom Cartesian MD method performed under the same conditions, where refinement requires the use of restraints during the simulations.

1 Introduction

While the 3D structures of proteins arise as a consequence of their amino acid sequences, computational design methods operate by creating sequences that minimize the energy of a pre-defined protein backbone. This approach has provided solutions to challenging design problems ranging from enzyme catalysis to viral inhibition (Bialas et al., 2016; Fleishman et al., 2011; Jiang et al., 2008; Joh et al., 2017; Pejchal et al., 2011; Röthlisberger et al., 2008; Smadbeck et al., 2014; Tinberg et al., 2013). Computational design begins with defining a set of constraints that constitute the design problem, followed by either acquiring a natural backbone [from the Protein Data Bank (PDB)] or building one from scratch (de novo design). Once a backbone is defined, amino acid sequences are designed onto it (Leaver-Fay et al., 2011).

Obtaining a suitable backbone is the most challenging step in computational protein design. Highly evolved native backbones tolerate only minimal deviations from their original sequences, leading to a restricted set of solutions for a given design problem. De novo design, which uses physical and chemical principles to build proteins from scratch, offers at least three advantages over native-based methods (Huang et al., 2016b). First, because the role of each residue in a de novo structure is precisely defined, the range of tolerated modifications is well understood, allowing for more controlled customization. Second, de novo proteins often have greater thermodynamic stability, which facilitates the introduction of function (Bloom et al., 2006; Tokuriki et al., 2008). Lastly, by not being restricted to existing structures, de novo backbones offer solutions otherwise unattainable from native scaffolds, and are likely more adaptable to a wider range of tasks (Huang et al., 2016b).

Despite these advantages, de novo design is challenging because it requires the construction of a protein with neither a known structure nor sequence. Since the vastness of the protein torsion space (φ, ψ, χ) prohibits an exhaustive search, current de novo design protocols generate backbones by combining small, continuous segments of φ-ψ torsions (‘fragments’) collected from the PDB. While this allows for a significant reduction in search space, an enormous amount of sampling is still required to find the lengths, types (e.g. helix, beta-sheet) and order of secondary structure elements (here on, ‘topologies’) that result in viable structures. The process of identifying successful models and topologies currently relies on a combination of scoring functions, hydrogen-bonding patterns, and other discernible regularities to screen for models that satisfy the desired criteria (Brunette et al., 2015; Dou et al., 2018; Huang et al., 2016a; Koga et al., 2012; Marcos et al., 2017, 2018; Rocklin et al., 2017). However, such heuristics are often subjective and chosen ad hoc, and there are currently no unbiased, generalizable methods that can perform automated structure selection based on the overall organization of a protein.

In the field of computer vision, convolutional neural networks (CNNs) have revolutionized pattern-recognition tasks ranging from facial recognition (Schroff et al., 2015) to object detection (Redmon et al., 2015). In applications to proteins, several groups have used 1D and 3D CNNs to process protein sequence and structure data, performing tasks such as domain prediction (Hou et al., 2018) and mutation stability evaluation (Torng and Altman, 2017). Nonetheless, several limitations have prevented the use of these models in protein design. In the case of 1D CNNs, low-dimensionality often results in loss of important features needed to describe realistic structures. In the case of 3D CNNs, model sizes (i.e. the number of weights in a network) scale quickly with input size, resulting in significant hardware requirements for efficient processing of full-length protein structures. Most 3D representations also lack rotational and translational invariance, thus requiring large quantities of data to train deep models. Surprisingly, few reported studies have used 2D CNNs to perform protein structure analysis, despite the fact that 2D CNNs are the most well-studied and widely implemented class of neural network.

In this study, we demonstrate how CNNs intended for 2D image processing can be used as powerful 3D structure analysis tools to guide computational protein design. A number of recently reported 2D CNN architectures have enabled advanced forms of image classification, namely instance (Tokuoka et al., 2018) and semantic (Long et al., 2014) segmentation, which predict class labels for individual pixels as opposed to entire images. Segmentation thus provides information about exactly where an object is in an image, rather than simply indicating whether or not an object is present. We speculated that this capability is pertinent to protein structure analysis and hypothesized that image segmentation networks could be adapted to create a model that quantifies structural features at various scales within a protein (e.g. per-residue, per-domain).

We trained a 2D CNN that classifies each residue in a protein into one of 38 architecture classes designated by the CATH protein database (Dawson et al., 2017) (Fig. 1a). We represent a protein using the pairwise distance matrix (hereafter, ‘contact map’) between all α-carbons in its structure (Anand and Huang, 2018; Wang et al., 2017) (Fig. 1b). This representation is rotationally and translationally invariant, and provides an image-like rendering of 3D protein structures in 2D. In the same way that image segmentation predicts a class for the basic unit of an image, a pixel, our model predicts classes for the basic unit of a protein, a residue.
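The invariance claim can be checked numerically. The sketch below builds a pairwise α-carbon distance matrix and confirms that it is unchanged, up to floating-point error, after the coordinates are rotated and translated. The four coordinates, the rotation angle, the shift, and the helper names `distance_matrix` and `rotate_z_and_shift` are made up for illustration and do not come from the study.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between 3D points (e.g. alpha-carbon positions)."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def rotate_z_and_shift(coords, angle_rad, shift):
    """Rotate points about the z axis, then translate them by `shift`."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in coords
    ]


# Made-up coordinates in angstroms, for illustration only.
ca_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.5, 3.4, 0.0), (5.5, 3.4, 3.8)]
original = distance_matrix(ca_coords)
moved = distance_matrix(rotate_z_and_shift(ca_coords, 1.1, (10.0, -4.0, 2.5)))

# Largest element-wise difference between the two matrices.
max_diff = max(
    abs(p - q) for row_p, row_q in zip(original, moved) for p, q in zip(row_p, row_q)
)
```

Because only inter-residue distances enter the matrix, no structural alignment or data augmentation over orientations is needed before a structure is passed to a 2D network.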

Semantic segmentation of protein structures using contact maps. (a) A comparison of semantic segmentation for objects in images, and domains in proteins. The two-domain protein shown (PDB ID: 1mlaA) contains a 3-layer-αβα-sandwich (cyan), and a 2-layer sandwich (green). Multiple distal regions in a contact map can correspond to a single domain in a protein. The example segmentation and contact maps are real inputs and outputs obtained from our model. (b) Two examples of single-domain proteins colored as chainbows (left), with α-carbons shown as spheres. Red indicates the beginning, and blue indicates the end of the chain. The corresponding contact maps are shown to the right of each protein with units in Å. Axes correspond to amino acid index.

While the most obvious function of our network is in protein classification, its true significance lies in its utility as a de novo design tool. On the single-residue level, Shannon entropies (Shannon and Weaver, 1964) of the CNN-predicted probability distributions can be used as indicators of local structure quality, allowing for identification of low-confidence regions that require further refinement or reconstruction. On the full-protein level, per-residue class probabilities predicted by our classifier can be averaged across an entire protein to provide a quantitative measure of the degree to which a structure assumes a fold. This function is useful for quickly searching large design trajectories for proteins that adopt specific architectures.
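Both quantities described above are simple to compute from per-residue class probabilities. The following sketch assumes the classifier output is a list of probability vectors, one per residue. The three-residue, four-class example is hypothetical (the model in the text uses 38 classes), and the use of base-2 logarithms is an assumption, since the text does not state the units of the entropy.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of one distribution; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def mean_class_probabilities(per_residue_probs):
    """Average the per-residue class probabilities over all residues of a protein."""
    n = len(per_residue_probs)
    return [sum(column) / n for column in zip(*per_residue_probs)]


# Hypothetical classifier output for a 3-residue protein and 4 classes.
predicted = [
    [1.0, 0.0, 0.0, 0.0],      # confident prediction
    [0.5, 0.5, 0.0, 0.0],      # split between two classes
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
]
entropies = [shannon_entropy(p) for p in predicted]   # per-residue confidence indicator
fold_scores = mean_class_probabilities(predicted)     # per-class score for the whole protein
```

Residues with high entropy would be the candidates for further refinement, while the largest entry of the averaged vector indicates which architecture the structure most resembles overall.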

1 Introduction

Simulations of molecular biosystems can give insights into the molecular mechanisms of folding, 1,2 functionally important protein motions, 3 protein–ligand affinity, 4 lipid bilayer behavior, 5 and DNA–drug interactions. 6 Theoretical and experimental studies in this field are complementary. 7 Molecular dynamics (MD) approaches to the study of ligand–receptor binding have been used in preliminary in silico experiments in drug development, successfully reducing the cost of designing new drugs. 8,9

Steady progress is being made in extending the time and size scales of all-atom simulations. Many approaches to improving calculation speed have been proposed, such as the use of world-distributed computing (e.g. Folding@home), 10 the development of very efficient load-balanced parallel codes such as GROMACS, 11 NAMD, 12 and DESMOND, 13 the implementation of all-atom molecular dynamics programs on graphics processing units (GPUs), 14 and the construction of dedicated machines such as ANTON. 15 Recent advances in computational methods have facilitated simulations of very large systems at all-atom resolution. 7 With the ANTON machine, all-atom simulations of smaller systems (e.g. up to 100 residues) can be performed at the submillisecond time scale. 16,17 However, access to the ANTON 15 supercomputer is limited, and even calculations with ANTON are restricted either to relatively short (microsecond) time scales 18 or to small systems (up to 120,000 atoms with solvent). 16,17 Owing to recent improvements in all-atom force fields and simulation techniques, ab initio folding simulations at all-atom resolution have become feasible for small proteins. 16,19

The need for simulating large systems at long time scales is addressed by coarse-graining approaches, in which some of the details of a system are omitted from the model. One such approach for proteins is the UNited RESidue (UNRES) model of polypeptide chains, which is being developed in our laboratory. 22 Owing to the use of a coarse-grained representation of polypeptide chains, simulations with UNRES are 3–4 orders of magnitude faster than all-atom molecular dynamics simulations in explicit water, or two orders of magnitude faster than all-atom simulations in implicit solvent 31,32 (implicit solvent is assumed in UNRES). Part of the speed-up results from the extension of the effective time scale, because fast-moving degrees of freedom such as the solvent degrees of freedom are averaged out. Thus, 1 µs of UNRES simulations corresponds to 1 ms of all-atom simulations or 1 ms of real time. 31,32 Another part of the speed-up results from the reduction in the number of interaction sites and, thereby, the lower cost of energy and force evaluation.

The UNRES force field performs well in the prediction of overall folds, 33 including domain packing, 37 as demonstrated in the Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiments. 33 In the CASP10 experiment, the predictions made with UNRES for targets T0663 and T0740 were featured by the assessors as the best for these targets. 37 UNRES performs well because it uses anisotropic potentials for side chain–side chain interactions, represented by the Gay-Berne functional form, which has spheroidal symmetry, 38 and because it introduces multibody terms for the potential of mean force of polypeptide chains in water, 24 derived in a systematic way through Kubo’s cluster-cumulant expansion. 39 Recently, other coarse-grained models of proteins with Gay-Berne potentials for the side chain–side chain interactions have been developed. 40,41

On the other hand, UNRES does not reproduce local chain conformations as well. To address this problem, in our recent work 42 we developed new torsional potentials that depend on the angles involving side-chain centers. These potentials, derived from statistics from the Protein Data Bank (PDB), 43 improved the quality of UNRES-simulated structures of proteins, especially in loop regions. 42 However, statistical potentials depend on a database and, moreover, cannot be used with confidence to handle D-amino-acid residues and non-standard residues. Therefore, in this work, we focused on improving the specificity of local interactions. We introduced new physics-based energy terms that account for the coupling between backbone and side-chain conformational states.

In this study, we tuned only the weight of the new terms in the UNRES energy function to obtain the best performance of the force field with eight selected training proteins. We did not change the weights of the other energy terms, except for reducing the weight of the torsional terms following the introduction of the new terms, to avoid double-counting the same interactions. This approach enabled us to assess the improvement resulting from the introduction of the new terms and not from optimization of the other terms already present in the energy function. To test the force field with the new terms, we used a set of 22 proteins (Table 2), none of which was present in the training set. Both the calculations with the new terms and the reference calculations without the new terms were run on this set of test proteins.

Table 2

Proteins used to test the performance of the force field with the new terms.

Figure 5. MolProbity scores of models refined with locPREFMD vs Cα RMSD from the initial model for the server prediction test set using the regular protocol that restrains Cα positions to the initial model (black) and a modified protocol where larger deviations of Cα positions are allowed (red, see Methodology section and Figure 2 for details).