Implementation of a perturbation propagation algorithm
A perturbation propagation algorithm is the foundation for allosteric network analysis in Ohm; this algorithm predicts allosteric sites, pathways, critical residues, and inter-residue correlations. The workflow of Ohm is illustrated in Fig. 1. The perturbation propagation algorithm is a repeated stochastic process of perturbation propagation on a network of interacting residues in a given protein. First, contacts are extracted from the tertiary structure of the protein. Next, the algorithm calculates the number of contacts between each pair of residues, and further divides the number of contacts by the number of atoms in each residue. This information, in turn, is used to obtain a probability matrix Pij (via Eq. (3) (“Methods”)). Each probability matrix element, Pij, is a measure of the potential that the perturbation from one residue is propagated to another residue. Next, the algorithm perturbs residues in the active site and the perturbation is propagated to other residues according to the probability matrix. At each step in the propagation, a random number between 0 and 1 is generated. If this random number is less than the perturbation propagation probability between residue i and residue j, Pij, then we propagate this perturbation from i to j; if the number is greater than the perturbation propagation probability, the perturbation is not propagated. This process is repeated 104 times. Finally, the frequency with which each residue is affected by a perturbation is calculated. We call this frequency the allosteric coupling intensity (ACI). Next, we cluster all residues according to their ACI values and their three-dimensional (3D) coordinates (“Methods”). Each cluster is predicted as an allosteric hotspot. To further utilize the perturbation propagation algorithm to identify allosteric pathways, in each step of the described perturbation propagation process, the residues through which the perturbation passes are recorded. This process yields the allosteric pathways that connect the active site with the allosteric site or sites.
Validation of Ohm predictions
The primary objective of the perturbation propagation algorithm is to identify candidate allosteric sites in proteins. We utilized a dataset compiled by Amor et al.22, which consists of 20 known allosteric proteins (Supplementary Table 1), to test how accurately Ohm predicts the locations of allosteric sites. The dataset was compiled from SCOP database, which is a manually curated database using a hierarchical classification scheme to collect protein domains into structurally similar groups: α, β, α/β, α + β, and multi-domain, which cover all the major fold-types for cytosolic proteins. They randomly selected proteins from each of the five major classes. The dataset includes seven monomers, two dimers, one trimer, seven tetramers, two hexamers, and one dodecamer proteins. The ligands of these proteins include chemical compounds, nucleosides, peptides, and DNA molecules. The lengths of proteins range from 147 to 3311 amino acids. Thus, the dataset covers a broad region of protein structure space. For each of the proteins, we used the 3D structure and the position of active sites as input and calculated the ACIs of all residues using Ohm. One of the proteins in the dataset is Caspase-1, a protein involved in cellular apoptosis and inflammation processes that is allosterically regulated41. Despite strong pharmacological interest, targeting the active sites of caspases with drug-like molecules has been very difficult. However, small molecules that bind to the allosteric sites of the protein have been shown to be potent inhibitors of caspase-1 enzymatic activity42. The structure of Caspase-1 contains two asymmetric dimers further organized into a tetrameric structure43. Each dimer contains one active site that is allosterically coupled to a distal active site42. With the tertiary structure of Caspase-1 (PDB ID: 2HBQ) as input, Ohm was used to calculate atom contacts, construct the perturbation propagation probability matrix, and determine ACI values of all residues. There are 6 prominent peaks in ACI values along the sequence (Fig. 2a). The known allosteric site, which is at the center of the dimer, exactly corresponds to peak P6, indicating that the allosteric site of Caspase-1 was successfully identified by Ohm. In fact, these peaks are all around the allosteric site, and our clustering algorithm clustered them into one allosteric hotspot. We observed that residues having high ACI values typically have low solvent-accessible surface areas (SASA) (Fig. 2a), and most low ACI areas are located on the surface.
To further evaluate how reliably Ohm predicts the allosteric pathways, we compared the predictions with previously reported experimental data. Datta and coworkers41 mutated nine residues of Caspase-1 to test their roles in allosteric pathways (Supplementary Table 2). Of these nine residues, R286A and E390A mutants strongly altered allosteric regulation, and S332A, S333A, and S339A had moderate effects, and S334A, S336A, S337A, and S388A mutants had little effect (Supplementary Table 2). Ohm predicted that R286 and E390 are the most critical residues in the Caspase-1 allosteric network (Supplementary Table 2 and Fig. 2b, c); S332, S333, and S339 were the next most important residues, and S334, S336, S337, and S388 were the least important for allostery (Supplementary Table 2). These predictions are in perfect accord with experimental results. In addition, Ohm predicts that another residue C285, should be of importance to the network, although this has yet to be experimentally verified.
A second example of the accuracy of Ohm predictions is the 129-residue response regulator protein, CheY, which has allosteric properties and numerous crystal structures are available for both wild-type and mutant proteins44,45,46. The phosphorylation of D57 in CheY allosterically activates binding of FliM within bacterial flagellar motors45,47,48. There were four prominent peaks in the calculated ACI of residues in CheY (Fig. 3a). Remarkably, BeF3− (the ligand binding in the allosteric site) and D57 (the allosteric site) have the highest and second highest ACI, respectively. The four ACI peaks P1, P2, P3, and P4 are all around the allosteric site, and they are finally clustered into one allosteric hotspot. Based on the analysis of pathways, the most important allosteric pathway is D57-T87-Y106-FliM (Fig. 3b–d). The two residues, T87 and Y106 (Fig. 3b), are of significantly higher importance than other residues based on ACI scores. Cho and coworkers49 solved the NMR structure of activated CheY and proposed a Y–T coupling between residues Y106 and T87 to explain the mechanism of the allosteric activation of CheY: The phosphorylation of D57 causes T87 to move toward the phosphorylation site due to enhanced hydrogen bonding interaction between the two residues, which leaves more space for Y106 to occupy the buried rotameric state. The Y–T coupling is the reason for the importance of T87 and Y106 in the allosteric communication between the allosteric site and FliM, which is in agreement with our predictions. Thus, for both Caspase-1 and CheY, the residues predicted by Ohm to be the most critical components of the allosteric pathways in these two proteins are those previously established to be of high importance.
The perturbation propagation algorithm in allosteric pathways identification starts at the allosteric site, because the perturbation in protein is propagating from the allosteric site to the active site, but the perturbation propagation algorithm in allosteric site prediction actually starts at the active site, because the active site is known and the objective is to find the allosteric site. To interrogate the difference of perturbation propagation directions, we used the allosteric site D57 in CheY to predict the active site (Supplementary Fig. 1A). There are three major ACI peaks and the third one that includes residues 100-105 is exactly the active site. We have also identified the pathways from the active site to the allosteric site (Supplementary Fig. 1B). The most critical residues in the identified allosteric pathways are still 87 and 106. These results indicate that the allosteric correlation between the allosteric site and the active site in CheY is reversible, while the allosteric correlation in other proteins could also be irreversible50.
We performed allosteric analysis for all 20 proteins (Fig. 4 and Supplementary Figs. 2–21) and compared the allosteric site prediction results to that of Amor’s method (Supplementary Fig. 22 and Supplementary Table 4). We utilized the clustering algorithm (“Methods” section) to identify allosteric hotspots based on ACI values and calculated the true-positive ratio (TPR)—the ratio of the number of true hotspots to the total number of predicted hotspots. Ohm identifies several allosteric hotspots for small proteins and less than 15 hotspots for large proteins (such as 1EYI, 6DHD, and 7GPB). In stark contrast, if we apply the clustering algorithm to the quantile scores, which is the metric in Amor’s method to evaluate the allosteric correlation, the number of predicted hotspots is much larger than that predicted by Ohm (Supplementary Fig. 22a). A plethora of identified hotspots create hurdles for users to identify the true allosteric site. For large proteins such as 1D09, 1XTT, 1EFA, 7GPB, and 1YBA, >30 hotspots are identified based on quantile scores because the quantile scores are scattered around the structure (Supplementary Fig. 23). Most importantly, the TPR of hotspots predicted by Ohm is much higher than that predicted by Amor’s method for most proteins in the dataset (Supplementary Fig. 22b). The average TPR of Ohm is 0.57, compared to 0.23 of Amor’s method. TPR of Ohm-predicted hotspots for the four small proteins—1F4V, 2HBQ, 1PTY, and 3K8Y—are all equal to 1. Besides, although 1XTT is a large tetramer protein composed of 868 residues, the TPR of Ohm is still equal to 1. We also calculated the positive predictive value (PPV)—the ratio of the number of identified allosteric site residues to the total number of all allosteric site residues—of Ohm and Amor’s method, respectively (Supplementary Fig. 22c). Ohm can recapitulate more allosteric site residues than Amor’s method for most proteins. The PPV of Ohm is 0.72, compared to 0.48 of Amor’s method. These results indicate that Ohm outperforms Amor’s method in the ability to identify allosteric sites by featuring both higher TPR and higher PPV.
The pathways are identified by the perturbation propagation algorithm and the critical residues are identified by the importance in the pathways according to Eq. (4). Dijkstra’s algorithm51 has also been utilized by network models to identify optimal pathways38 and betweenness centrality is usually utilized to measure the importance of nodes in the network. For each of the 20 proteins, we compared the top-10 ranked residues identified by importance of residue in Ohm pathways to residues identified by betweenness centrality in Ohm pathways and residues identified by betweenness centrality in Dijkstra optimal pathways (Supplementary Table 5). The three methods share some common residues but these residues are ranked differently. For CheY (1F4V), the most critical residues identified by Ohm are 87 and 106, while these two residues are lower-ranked by the other two methods.
Finally, we utilized a designed four-helix bundle protein (1MFT) as the negative control (Supplementary Fig. 24). The tertiary structure colored by the calculated ACI values shows that apart from the designated pseudo active site—the N-terminus—there are no ACI hotspots in this un-allosteric protein structure.
Comparison of allosteric correlations from CHESCA and Ohm
Taking advantage of the celerity of ACI calculation, Ohm could yield the ACIs of all residue–residue pairs of even a large protein in a short period of time. Experimentally, inter-residue allosteric correlations can be determined using CHESCA based on the combination of agglomerative clustering and singular value decomposition as proposed by Selvaratnam et al.52. Aimed at revealing extended networks of coupled residues, CHESCA is essentially a covariance analysis of NMR chemical shift changes caused by local perturbations, such as a series of related compounds or mutations53,54. We analyzed seven different mutants of unphosphorylated CheY (F8V, D13K, M17A, V21A, T87I, Y106W, and A113P) using CHESCA (see “Methods”). Inter-residue cross-correlations are determined for residue pairs (Fig. 5a). As there are errors in the calculated allosteric correlations55 and missing data for certain residues, we subjected the allosteric correlation matrix to a Gaussian low-pass filter (Fig. 5b). The most dramatic correlations were found in residues belonging to the C-terminal helix α5 (Fig. 5b, e region V), indicating that most residue pairs in this helix respond in a correlated fashion to the seven mutations. This finding held true even when the A113P dataset, which is the data collected on the protein with a mutation in a residue in α5, was removed from the calculation. Thus, the C-terminal helix appears to be sensitively coupled to the rest of CheY.
We next utilized Ohm to predict the inter-residue correlations within CheY (Fig. 5c). As we used unphosphorylated protein for the CHESCA, we also used the unphosphorylated state of CheY (PDB ID: 1JBE) for Ohm analysis. There were 12 regions in common in the matrices obtained from CHESCA and Ohm. Regions I, II, III, IV, and V each involve consecutive residues (Fig. 5e): region I consists of residues 6 to 9; region II consists of residues 33 to 43; region III consists of residues 50 to 58 and includes the allosteric site residue D57; region IV consists of residues 85 to 107, including T87 and Y106, which are the two critical residues in the dominant allosteric pathway (Fig. 5e, green cylinders); and region V consists of residues 112 to 124, which are all in the C-terminal helix, corroborating the speculation that the C-terminal helix is coupled to the rest of CheY. Besides, the coupling in the C-terminal helix can also be demonstrated by Anisotropic Network Model (ANM) through ANM 2.1 server56. The other seven regions involve correlations between residues located in regions I, II, III, IV, and V. The similarities of the matrices obtained from CHESCA and Ohm indicate the ability of Ohm to reliably identify inter-residue allosteric correlations in proteins.
Since PRS39,40, the hitting time in Chennubhotla and Bahar’s Markov propagation model34,35, and the shortest paths lengths have also been used38 to evaluate the correlation between residues in proteins, we further quantitatively compared Ohm to these methods, as well as the native contacts (Fig. 5d). We ranked all residue pairs by the predicted correlation and selected a certain number of top-ranked residue pairs. We then measured TPR—the ratio of the number of residue pairs that match CHESCA results (true positives) to the total number of selected residue pairs (positives)—to evaluate the accuracy of the calculated correlations. A selected residue pair matches CHESCA results when the corresponding CHESCA correlation is higher than 0.5. The TPR decreases as the number of positives increases, while the TPR of Ohm is higher than other methods when the number of positives is less than 60. When 20 top-ranked residue pairs are selected, the TPR of Ohm and the hitting time are both 47.6%, while the TPR of native contacts, shortest paths lengths, and PRS are 38.1%, 33.3%, and 33.3%, respectively.
Stability of allosteric pathways over an MD simulation
The perturbation propagation algorithm relies only on an input static structure, but protein structures are dynamic. Therefore we analyzed whether the conformational changes induced by in the CheY protein structure during the time course of a molecular dynamics (MD) simulation57 caused significant variation in the allosteric network. We used Gromacs to perform a 100-ns molecular dynamics simulation of CheY using the crystal structure (PDB ID: 1F4V) as the starting structure. We extracted five snapshots at 0, 25, 50, 75, and 100 ns, and used Ohm to analyze the allosteric pathways. The results indicated that the allosteric network of CheY changes moderately over the simulation time course but that the core allosteric pathway, D57-T87-Y106-FliM, remains stable (Supplementary Fig. 25). Only at 75 ns, were T87 and Y106 not the most critical residues, but even at this time point the pathway D57-T87-Y106-FliM existed. In addition, W58, I95, G105, and K109 are among the critical residues in allosteric pathways in all five snapshots. Thus, although Ohm uses only the static tertiary structure of a protein as input, under conditions where conformational change is induced the allosteric pathway calculated by Ohm is stable. This observation is consistent with our proposed view of allostery, whereby protein structure is coupled to dynamics and the allosteric phenomenon is a manifestation of this coupling.
Factors that affect perturbation propagation
As the perturbation propagation algorithm is de facto a random process, the results of each calculation differ. Hence, to evaluate the statistical significance of the perturbation propagation algorithm outcome, we performed multiple perturbation propagation simulations and ensure convergence of the results. To determine the number of steps required for convergence and for the statistical error to reach an acceptable level, we tested the standard deviation of ACI when different numbers of perturbation propagation rounds were used (Supplementary Fig. 26A). As the number of perturbation propagation rounds was increased from 100 to 10,000, the standard deviation of ACI decreased (Supplementary Fig. 26B). The standard deviations of ACI values were negligible when the number of perturbation propagation rounds was 10,000.
To interrogate the difference of using different experimental structures as input for Ohm analysis, we calculated ACI of residues in CheY by using four different experimental structures: 1FQW (apo CheY-BeF3−), 3CHY (apo unphosphorylated CheY), 1F4V (CheY-BeF3−FliM), and 2B1J (unphosphorylated CheY-FliM). Because of the very short lifetime of phosphorylated CheY, its structure is not available. BeF3− can bind at the allosteric site of CheY to mimic the phosphorylation effect to finally result in the binding of FliM at the active site. We observe that BeF3− in 1FQW and 1F4V both have the highest ACI values (Supplementary Fig. 27). On the other hand, the ACI peak at the allosteric site D57 is much more prominent in apo structures than in holo structures. In the unphosphorylated holo structure (2B1J), the ACI peak at the allosteric site is even less prominent than other regions. Based on this result, we propose a four-state hypothesis for CheY. In the unphosphorylated state of CheY, multiple regions are allosterically correlated with the active site, including the allosteric site. When CheY is activated by phosphorylation or the binding with BeF3−, the allosteric site is so strongly affecting the active site that it finally leads to the binding with FliM. After the binding event, the allosteric correlation in the allosteric site becomes less prominent. When the holo structure is unphosphorylated, the allosteric correlation at allosteric site is even lower than other regions so as to protect the holo structure from being separated by any remotely propagated perturbation at the allosteric site.
Another factor that largely affects the performance of allosteric site prediction performance is the selection of residues in the active site to serve as the start of the propagation algorithm. We calculated ACI of residues in CheY-FliM complex (1F4V) by setting all residues in FliM as the start of the algorithm, and also calculated ACI of residues in apo CheY structure (3CHY) by setting all residues, residue Y106, or K119 on the CheY-FliM binding surface as the start of the propagation algorithm, respectively. By using FliM as the start for the bound structure (Supplementary Fig. 28A), or using all binding surface residues as the start for the unbound CheY structure (Supplementary Fig. 28B), we can successfully find the allosteric site D57. However, if we choose Y106 (Supplementary Fig. 28C) or K119 (Supplementary Fig. 28D) for unbound CheY structure, we cannot find the allosteric site D57. Thus, we recommend to use all residues on the active site for the unbound structure.
Parameter α in Eq. (3) is a user-defined parameter to amplify or reduce probability. We evaluated the dependence of the predictions as a function of α on the allosteric interaction network of CheY (Supplementary Fig. 29). When α was increased from 1 to 5, the absolute values of ACI increased. However, the positions of the peaks in ACI did not change. The network of allosteric interactions within CheY changed only slightly, and the most important pathway, D57-T87-Y106-FliM, was unaffected, and the most critical residues remained T87 and Y106. Thus, the value of α does not affect the accuracy of the results. To further determine the value of α that can result in the best performance, we calculate ACI of residues in CheY by using α from 0.5 to 10 (Supplementary Fig. 30). When α is 10, the ACI values of nearly all residues are approaching 1, driving the identification of ACI hotspots intractable. Therefore, we prefer low α-values (0.5–3) than high α-values (>3).
For proteins, the backbone–backbone atom contacts between two sequence-adjacent residues are excluded when computing the perturbation propagation probability matrix. If contacts between backbone atoms of two adjacent residues are not excluded, perturbation propagates mainly through the backbone from one residue to its sequence-adjacent residues, which generates a large number of non-dominant pathways, and the identification of genuine allosteric pathways becomes prohibitively inefficient. When versions of Ohm excluding or including backbone–backbone contacts between sequence-adjacent residues were used to predict the allosteric pathways of CheY, only the version that excluded backbone–backbone contacts identified the correct allosteric site and allosteric pathways (Supplementary Fig. 31).
Determining the impact of mutations on allosteric pathways
Although allosteric pathways are insensitive to protein dynamics, mutations on certain residues, taking CheY as an example45,58,59,60,61,62, especially those located in core allosteric pathways, can influence the allosteric behavior of proteins. To interrogate to what extent the mutations on peripheral residues, which are irrelevant to allosteric pathways, influence coupling between the allosteric site and the active site, we used Ohm to identify the allosteric pathways of two CheY mutants, V21A and I55V. The tertiary structures of these two mutants were generated using Eris63. The Ohm analysis showed that the allosteric pathways in these two mutants changed moderately relative to that in the wild-type protein (Supplementary Fig. 32). The core pathway, D57-T87-Y106-FliM, remained dominant in both mutants, however. The ACI values between the active site and the allosteric site are 0.2, 0.3, and 0.3 in the wild-type protein and V21A and I55V mutants, respectively. Thus, surprisingly, the allosteric correlations in the two mutants are even stronger than that in the wild-type structure.
To validate this result, we performed autophosphorylation experiments with wild-type CheY and the two mutants. CheY can either be phosphorylated by its sensor kinase CheA or autophosphorylated in the presence of an appropriate small-molecule phosphodonor47. Autophosphorylation results in a change in the fluorescence emission intensity at 346 nm, which derives from W58; this is a useful measure of functional activity45,59. When the data were analyzed, the two CheY mutants, V21A and I55V, both had higher values, by about 50%, of the slope of the plot of the observed rate constant versus phosphodonor concentration (Supplementary Fig. 33), which is consistent with the Ohm predictions.