CN115938488B

CN115938488B - Recognition method of protein allosteric modulator based on deep learning and computational simulation

Info

Publication number: CN115938488B
Application number: CN202211500668.3A
Authority: CN
Inventors: 蒲雪梅; 陈建芳; 陈欣; 毛俊; 刘静
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2022-11-28
Filing date: 2022-11-28
Publication date: 2023-09-08
Anticipated expiration: 2042-11-28
Also published as: CN115938488A

Abstract

The application discloses a recognition method of a protein allosteric modulator based on deep learning and calculation simulation, which is used for obtaining an MD simulation track of a protein complex combined with an endogenous agonist; preliminary classification is carried out on the MD simulation track to generate a clustering label; inputting MD simulation tracks and cluster labels into MDNN to find key residues of each conformational state, selecting valuable conformational states, inputting FTSite to perform allosteric site prediction, and finding potential allosteric sites; obtaining potential drug molecules with the most stable binding by using structure-based virtual screening aiming at the predicted potential allosteric sites; the allosteric regulation mechanism of the potential drug molecule is revealed by means of dynamic network analysis, confirming the properties of the potential drug molecule. By means of molecular dynamics simulation, deep learning, virtual screening and dynamic network analysis, potential allosteric sites of proteins are identified, potential allosteric modulators are screened out and the allosteric modulation mechanisms thereof are studied.

Description

Recognition method of protein allosteric modulator based on deep learning and computational simulation

Technical Field

The application relates to the technical field of recognition of a G protein coupled receptor allosteric modulator, in particular to a recognition method of a protein allosteric modulator based on deep learning and calculation simulation.

Background

Allosteric mechanisms provide a new paradigm for modulating receptor function, and rational design of allosteric modulators is therefore of increasing interest. The orthosteric site of a protein is the site at which endogenous activating ligand binds. The strong evolutionary conservation of orthosteric sites leads to cross-reactivity problems of orthosteric ligands, which may lead to adverse therapeutic side effects. While allosteric modulators bind to a site that is topologically different from the orthosteric site and thus do not compete with the orthosteric ligand. Allosteric sites are less evolutionarily conserved than orthosteric sites, and allosteric modulators may exhibit superior properties in subtype selectivity and specificity, potentially reducing side effects compared to orthosteric ligands. The allosteric regulation of proteins is quite fine. For example, positive allosteric modulators may enhance downstream signaling in four different ways: (1) promotes normal agonist binding affinity but does not directly affect signaling, (2) directly enhances signaling without affecting normal agonist binding, (3) increases normal ligand binding affinity and simultaneously increases signaling itself, and (4) decreases normal ligand binding affinity but increases signaling itself. Negative allosteric modulators may use similar combinations to reduce downstream signals. Allosteric modulators stabilize the unique conformation of the protein pool, providing new pharmacology for the receptor. Thus, an increasing number of allosteric modulators are found as potential drugs.

However, since the discovery process is very challenging, only a few allosteric modulators are approved as drugs or for clinical trials. This is because detecting the allosteric behaviour of a modulator from pharmacological experiments is a challenging process and false positives are often observed using mutation experiments to determine the binding site of an allosteric modulator. Since 2013, resolving complex structures has been the most successful method of identifying allosteric modulator binding sites and gestures in GPCRs, however it is costly. In recent years, advances in structural biology and computing technology have revealed a number of targeted allosteric mechanisms, which make rational design of allosteric modulators a new approach to drug discovery.

The recognition of allosteric sites is a prerequisite for virtual screening of structural-based allosteric modulators, however, allosteric sites are often mystery and difficult to find in the protein crystals being resolved. An allosteric pocket is typically not present in an apo structure that does not bind a ligand, and the relaxed state of the allosteric site is dominant in the conformational set only in the presence of the ligand. Molecular modeling is an effective method for generating conformational sets and exploring these sites. Combining site prediction with a set of MD-based GPCR conformations may detect sites that are not apparent in static experimental structures, which is attractive for the discovery of new allosteric sites.

Although the design of computer-aided allosteric modulators has found some applications, the following problems still remain:

allosteric effects are typically mediated by microseconds (mus, 10 ^-6 ) Or milliseconds (ms, 10) ^-3 ) Occurs on a time scale, classical conventional molecular dynamics simulation (cMD) generally captures only nanoseconds (ns, 10 ^-9 ) Conformational changes on a time scale may therefore be difficult to capture in stealth allosteric sites.

Extensive molecular dynamics modeling is required to capture the cryptic allosteric sites, which will yield a large amount of modeling data. Manually analyzing such data is very difficult, time consuming and subject to some a priori knowledge bias. Screening for valuable conformational intermediates is a prerequisite for rapid and efficient prediction of allosteric sites.

Virtual screening can screen out the small molecules that bind most stably, but the nature of the small molecule and its allosteric regulation mechanism cannot be determined. The mechanism of action of allosteric modulators is one of the most important subjects in the related studies and is difficult to reveal in experiments due to its complexity.

Disclosure of Invention

The application aims to provide a recognition method of a protein allosteric modulator based on deep learning and calculation simulation, which is used for recognizing potential allosteric sites of proteins, screening out potential allosteric modulators and researching an allosteric modulation mechanism thereof.

The application solves the problems by the following technical proposal:

a method of identifying a protein allosteric modulator based on deep learning and computational simulation comprising:

step S100, obtaining MD simulation tracks of a protein complex combined with an endogenous agonist by using Gaussian acceleration molecular dynamics simulation; gaussian acceleration molecular dynamics simulation is an enhanced sampling mode, and by adding potential energy to reduce energy barrier, the conformational change characteristics of millisecond level can be sampled in the simulation time scale of nanoseconds, so that allosteric sites on the millisecond time scale can be captured;

step S200, performing preliminary classification on MD simulation tracks of the protein complex by using unsupervised cluster analysis, and generating a cluster label;

step S300, inputting the MD simulation track and the clustering label into a classification model MDCNN based on CNN, identifying different conformation states from the MD simulation track, and finding a key structure and a key residue of each conformation state by a model interpreter LIME in the MDCNN while identifying the functional state, wherein the key residue fed back by the LIME assists in selecting valuable conformation states for subsequent allosteric site prediction;

step S400, inputting the selected conformational state into a site prediction tool FTSite for predicting the allosteric site, wherein the site with the highest score except the orthosteric site is considered as the potential allosteric site;

step S500, aiming at the predicted potential allosteric site, using structure-based virtual screening to obtain the potential drug molecule with the most stable combination;

step S600, revealing allosteric regulation mechanism of the potential drug molecules by means of dynamic network analysis, confirming the properties of the potential drug molecules, wherein the dynamic network analysis can identify allosteric paths and important residues which play an important role in structural information transmission, so that the allosteric regulation mechanism of the potential drug molecules is revealed.

The step S100 specifically includes:

step S110, acquiring an inactive state crystal structure of a target protein;

step S120, deleting other components except the target protein in the crystal structure, and reconstructing a missing structural region in the crystal structure to ensure that the structure of the target protein is complete;

s130, acquiring the structure of a protein endogenous agonist, and then constructing a complex structure of the protein endogenous agonist by using molecular docking and selecting a reasonable docking pose with the highest score;

step S140, protonating the protein and the ligand in the physiological environment of the target protein to construct a simulation system similar to the physiological environment, wherein the simulation system generally comprises a protein compound, a solvent molecule, ions, (lipid membrane components);

step S150, for the simulation system after the system is minimized and heated, performing unconstrained cMD under NPT system to operate the simulation system to a relatively balanced state. The final structure after cMD equilibrium was used as the starting structure for the gaussian acceleration molecular dynamics simulation, starting the gaussian acceleration molecular dynamics simulation procedure.

The step S200 specifically includes:

step S210, extracting protein conformations from protein complex GaMD tracks at intervals to form a protein conformation set capable of representing the whole track. Calculating conformational features for distinguishing conformational states according to a research system;

and S220, using the conformational features as clustering indexes, clustering the protein conformations by using an unsupervised clustering analysis algorithm, and selecting the optimal clustering result to be used as a label of the protein conformational set in the MDNN model.

The step S300 specifically includes:

step S310, data processing: using the protein conformational set obtained in S210, protein ca atom superposition was used to eliminate overall rotation and translation. Deleting all hydrogen atoms, and then converting the coordinates of other atoms into RGB coordinates, thereby obtaining a data set;

step S320, adding a label: reading the protein conformation aggregation class result obtained in the step S220 as tag data of a data set, wherein the data set corresponds to the data tag one by one so as to indicate which class the conformations in the data set belong to;

step S330, data set division: the data are grouped to eliminate the influence of the simulation time sequence, and then are randomly divided into a training set and a verification set according to a certain proportion. Carrying out K-fold division on the data set to obtain a K-fold cross-validation data set;

step S340, constructing a model: training a model MDNN for protein conformational state classification recognition based on a Convolutional Neural Network (CNN) by taking a data set as an input, adopting K-fold cross validation in the training and validation process, and then evaluating the performance of the classifier by using an accuracy ACC;

step S350, constructing a model interpreter: constructing a LIME interpreter to interpret the prediction result of the MDCNN in a local linear fitting mode;

and step S360, adding all the atomic scores contained in the residues to obtain importance scores of each residue of the protein, and selecting the residues with the top scores in sequence to be regarded as important residues in the conformational state.

The step S400 specifically includes:

step S410, projecting important residues of each conformation reflected by MDNN into a protein structure, using Pymol to carry out visualization to select out valuable conformation intermediate states,

step S420, extracting a representative structure from the target conformation intermediate state track and storing the representative structure in a pdb format;

step S430, uploading the representative protein structure file to a server of the FTSite (https:// FTSite. Bu. Edu /), and obtaining three potential ligand binding site prediction results, wherein the site with the highest score except the orthosteric site is used as a potential allosteric site for virtual screening.

The step S500 specifically includes:

step S510, preparing a micromolecule data set for virtual screening, generating a butt joint box to enable the butt joint box to exactly and completely cover all residues in potential allosteric sites, and starting the virtual screening based on the structure;

step S520, selecting the most stable molecule combined with protein as a potential allosteric modulator by combining the butt joint scoring result of small molecules with a plurality of modes such as manual inspection or combination free energy calculation with higher precision, and outputting a protein-allosteric modulator compound structure.

The step S600 specifically includes:

step S610, constructing a simulation system for the protein-allosteric modulator compound obtained in the step S500 in a physiological environment, and obtaining a section of simulation track by using molecular dynamics simulation;

step S620, generating a dynamic network: inputting the simulation track of the protein-allosteric modulator compound into VMD software, and generating a dynamic network by using a NetworkView plug-in;

step S630, analyzing an allosteric regulation mechanism of the potential drug molecule by using a dynamic network, including:

community analysis: dividing the dynamic network into different sub-networks by using Girvan-Newman algorithm in the VMD, and carrying out visual analysis on the community network in the VMD so as to obtain the communication network distribution condition of each structural domain of the protein under the influence of the allosteric modulator;

shortest path analysis: the shortest distance path between two nodes in the protein network is searched using the subpt program in the VMD using the Floyd-Warshall algorithm. The shortest path is often the most likely or biologically relevant signaling pathway, whereby allosteric communication pathways between functional domains of proteins under the action of allosteric modulators can be obtained, revealing the allosteric regulatory mechanisms of the allosteric modulators.

Compared with the prior art, the application has the following advantages:

(1) The application recognizes potential allosteric sites of protein, screens out potential allosteric modulators and researches the allosteric modulation mechanism thereof by means of molecular dynamics simulation, deep learning, virtual screening and dynamic network analysis. Taking the case of the β2 adrenergic receptor (β2ar), we successfully identified a novel modification site of β2ar, which is located between the highly conserved molecular switches W6.48 and D2.50. Screening for this site resulted in a potential negative allosteric modulator, ZINC5042, which stabilized the receptor in a non-activated state with a negative synergistic effect with the positive agonist. The application is equally applicable to the study of other allosteric modulators.

(2) The application uses Gaussian acceleration molecular dynamics simulation to overcome the time scale limitation of traditional molecular dynamics simulation, and can sample conformational changes on the millisecond scale, thereby being beneficial to capturing potential allosteric sites. The Gaussian acceleration molecular dynamics simulation does not need to manually set acceleration parameters, and the operation is simple, convenient and quick. Gaussian acceleration molecular dynamics simulation can be performed by adding potential energy to reduce the energy barrier, and conformational changes in the millisecond level can be sampled in the nanosecond simulation time scale, so that the calculation cost is greatly reduced.

(3) The application combines the unsupervised clustering and the supervised classification Model (MDNN) based on the convolutional neural network, and realizes the automatic processing of the molecular simulation track. The user only needs to input the track into the model for operation without complex pretreatment, the MDNN can automatically complete modeling, explanation and analysis processes, the analysis efficiency of the simulated track is greatly improved, and the deviation of manual analysis can be effectively avoided. At the same time, the MDCNN integrates a LIME interpreter to interpret the CNN model, which can help us capture the specificity and important residue distribution between different conformations. This advantageously helps us screen and identify valuable intermediates for subsequent allosteric site prediction.

(4) The application integrates dynamic network analysis for revealing the action mechanism of the allosteric modulator, the dynamic network analysis can analyze the transmission efficiency of the allosteric information and identify the allosteric path and important residues which play an important role in the transmission of the structural information, thereby revealing the drug property and action mechanism of the allosteric modulator and assisting the experimental and theoretical research in the field.

Drawings

FIG. 1 is a flow chart of the present application;

FIG. 2 is a score and profile of importance of 20 important residues identified by the LIME interpreter;

FIG. 3 is a schematic diagram of a novel allosteric site for predicting β2AR in two representative conformations conf1 and conf2 of conformational intermediate cluster1 by FTSite;

FIG. 4 is a schematic diagram of a virtual screening strategy;

FIG. 5 is a schematic representation of optimal docking positions of ligands at the conf1 and conf2 allosteric sites;

FIG. 6 is a diagram of dynamic network analysis;

FIG. 7 is a graph showing the results of MDNN cross-validation at a rate of 2AR five times;

FIG. 8 is details of the predicted resulting allosteric pockets in conf1 and conf 2;

FIG. 9 is a graph of docking scores for four candidate compounds;

FIG. 10 shows the binding energy composition (kcal/mol) between β2AR and four candidate ligands ^a Is a schematic diagram of (a).

Detailed Description

The present application will be described in further detail with reference to examples, but embodiments of the present application are not limited thereto.

Examples:

referring to fig. 1, a method for identifying a protein allosteric modulator based on deep learning and calculation simulation is exemplified by identifying an allosteric modulator of β2 adrenergic, comprising:

step 1: acquiring a β2ar complex molecule mimetic trajectory dataset that binds to endogenous agonist NE: performing extensive Gaussian acceleration molecular dynamics simulation for a total of 5*3 mu s to obtain a conformation space of the protein; specifically:

preparation:

we obtained from the protein PDB database (Protein Data Bank, PDB) the inactive crystal structure of β2ar (PDB ID:2RH 1), which is the highest resolution of all reported β2ar crystal structures, and is commonly used as a template for β2ar in many studies. Other components in the crystal structure than protein are deleted. The deleted ICL3 region (residue number: 231-262) was reconstructed using MODELLER V9.21. The 3D structure of the agonist norepinephrine NE was downloaded from the PubChem database and optimized at DFT/B3LYP/6-31G levels using gaussian 09 procedures prior to docking. Ligand docking was performed using autodock4.2 and the highest scoring rational docking pose was selected for subsequent simulation.

Hydrogen atoms were added in an environment at ph=7 prior to dynamic simulation. Protein structures were aligned in a membrane-oriented protein (Orientation of protein in membrane, OPM) database and then inserted into a cell containing 80% phosphatidylcholine (POPC) and 20% cholesterolIn the lipid bilayer. Then use TIP3P water molecules inAnd (3) internal solvation, and adding 0.15mol/L NaCl into the water phase to neutralize the system. All of the above steps were performed using a CHARMM-GUI server.

MD simulation settings:

first, by adding proteins and lipids separatelyAnd->To achieve energy minimization. The system is then heated to 310K in the NVT ensemble in 250 ps. Thereafter, an unconstrained balancing of 10ns cMD was performed using a Langevin thermostat and a Monte Carlo barometer at 310K and an NPT ensemble of 1 bar. All hydrogen bond containing lengths are constrained by the SHAKE algorithm, and the cutoff distance for non-bond interactions is set to +.>Long range electrostatic interactions were calculated using the particle grid Ewald (PME) method. The time step is 2fs. Protein, lipid and salt ions use CHARMM36 force field, water ions use CHARMM TIP P model, and CHARMM universal force field (CGenFF) is used to generate ligand parameters. Track snapshots are saved every 10 ps. All of the above steps are performed in Amber 18. The above simulation settings were used in all cMD and GaMD.

Gaussian acceleration molecular dynamics simulation:

an unconstrained balance of 210ns cMD was performed before the Gaussian acceleration molecular dynamics simulation was performed, the acceleration parameters were calculated from the cMD trajectory of the last 10ns, and the final structure of the cMD simulation was selected as the starting structure for the subsequent Gaussian acceleration MD (GaMD) simulation. Gaussian acceleration molecular dynamics (Gaussian Accelerated Molecular Dynamics, gaMD) can enhance conformational sampling of biomolecules (eqns (1) - (2)) by adding a harmonic gain potential to the system when the system potential V (r) is below the reference energy E

k is a simple harmonic force constant, E and k are adjustable parameters automatically determined by three enhanced sampling principles, and reference energy E can be calculated by eqn (3):

where V_min and V_max are the minimum and maximum potential energy of the system, respectively. In order to ensure that eqn (3) is effective, k must satisfy k.ltoreq.1/(V) _max -V _min )。

Step 2: calculating the RMSD of each conformation relative to a protein skeleton of a reference conformation by taking a first simulated frame as a reference, and performing unsupervised clustering by using k-means by taking the RMSD as a clustering index to obtain a clustering label of each conformation in the track; specifically:

to explore the conformational diversity of MD sets, we generated MD conformational sets representing receptor trajectories using a k-means clustering algorithm based on the receptor backbone atoms RMSD. All snapshots use protein C alpha atom superposition to eliminate overall rotation and translation. Cluster analysis was performed using the cpstraj module in Ambertools 18. Clustering results were evaluated using DBI (Davies-Bouldin Index), pSF (pseudo-F statistical), and SSR/SST (R-squared value). The clustering result can be input into the MDNN model as a label for learning. The centroid structure of each cluster is used as a cluster representation in site prediction and conformational analysis.

Step 3: the trajectory file is deleted for hydrogen atoms and aligned with the first frame conformation, and then entered into the MDCNN model along with the cluster labels for training. Selecting important conformational intermediate states according to the results of the LIME interpreter; specifically:

a frame structure was extracted every 100ps from each 3 μs GaMD trace, each trace taking 30000 snapshots as a conformational dataset, all MD coordinates required to be aligned to remove translation and rotation, and then used for subsequent CNN analysis. The other components except the receptor and the ICL3 region with large flexibility are removed, in addition, hydrogen atoms are also removed, the coordinates of other atoms are converted into RGB coordinates, and each frame of conformation is converted into a pixel map as the input of a model, as shown in eq 4

C _RGB ＝M ^-1 C _XYZ (4)

C _RGB And C _XYZ Respectively representing three-dimensional RGB coordinates and XYZ coordinates of one color C. M is the transformation matrix (see eq 5).

M＝[w _r R,w _g G,w _b B] (5)

R, G, B are the three-dimensional color coordinates of the three primary colors (red, green, blue), respectively. W is the three-dimensional coordinates of the white point, passing through W _r 、w _g And w _b The lengths of the three primary colors in the vector space are calibrated. R, G, B and W are fixed under a certain color system. w (w) _r 、w _g And w _b Can be calculated by eq 6.

W＝w _r R+w _g G+w _b B (6)

And reading the clustering result as tag data of the conformational data set, wherein the conformations are in one-to-one correspondence with the tags so as to indicate which type the conformations in the data set belong to. Grouping the data to eliminate the effects of analog timing, then following 8: the proportion of 2 is randomly divided into a training set and a verification set. Performing five-fold division on the data set to obtain a five-fold cross-validation data set;

a convolutional neural network CNN model was constructed, with a total of four convolutional layers, the first two containing 32 3*3 convolutional kernels and the second two containing 64 3*3 convolutional kernels, using ReLU as the activation function. After every two convolutional layers, a 2 x 2 max pooling layer is added and a dropout (0.25) operation is performed to prevent overfitting. The classification task was implemented using the fully connected layer after convolution and pooling, with 512 and n (n=number of classes) neurons in MDCNN, respectively, and adding dropout (0.5) to provide the generalization capability of the model, using ReLU and Sigmoid as activation functions. The five-fold cross-validation is used to train the machine learning model, the accuracy is used to evaluate the performance of the machine-learned classification model, and the accuracy can be calculated by eq 7.

Here, TP is the number of positive samples that the model predicts as positive. FP is the number of negative samples predicted to be positive. FN is the number of positive samples predicted to be negative, and TN is the number of negative samples predicted to be negative.

LIME may interpret the classifier's predictions using local linear approximations. For each constellation, the LIME interpreter generates a LIME matrix that evaluates the importance scores for each pixel in the classification result. The LIME matrix has the same size as the image, with each element corresponding to a pixel representing an atom. Each element in the LIME matrix has a value of 0 or 1.0 indicates that the element has less effect on the classification selection, and 1 indicates that the element has significant effect on the classification decision. We sum all LIME matrices for each constellation and average them to get a score between 0 and 1. The larger the value, the more important the atom is in the classification result. The scores of all atoms in the residue are then averaged to represent the important score for the residue. In general, we can get an approximate interpretation of CNN by combining these linear models. The top 20 residue of the score rank is considered to be the important specific residue for this conformational state. Depending on the distribution of important residues, we can pick out valuable conformational intermediates.

Step 4: inputting the conformational intermediate state obtained in the previous step into FTSite for site prediction, wherein the site with the highest score except the orthosteric site is considered as a potential allosteric site; specifically:

the k-means cluster centroid structure in the conformational intermediate state is taken as a representative structure, the k-means cluster centroid structure is uploaded to an FTSite (https:// FTSite. Bu. Edu /) website in a pdb file format, a downloaded prediction result is visually checked in Pymol, positive structural sites of proteins are eliminated, and the allosteric sites with highest scores are selected according to the ranking of site scores for subsequent study.

Step 5: aiming at the potential allosteric site obtained by the previous prediction, the small molecule ZINC5042 with the most stable combination is obtained by virtual screening based on the structure;

for example, the ligand set used in this example has 103,862 ligand molecules and consists of two compound data sets, the doubles-lib and the drugs-lib. The diversity-lib is a diversity compound database and contains 99,288 diversified drug-like small molecules, so that the novel skeleton molecules can be found. While Drugs-lib contains 4,574 commercially available drug molecules on the market, which allows drug repositioning and thus reduces development costs. These ligand sets are provided by the mtibopenscreen software we use. We obtained about 6000 molecules (including isomers) by a preliminary screening of mtihopenscreen and used autodock4.2 for further docking assessment. All docking input files are written by autodocktools1.5.6 software package, and the active site lattice file is generated by AutoGrid 4.2. For purposes of our docking box is sized to cover exactly the predicted allosteric binding sites of FTSite with a pitch ofSemi-flexible docking (flexible ligand and rigid receptor) was performed. To ensure accuracy of the calculation results, we performed 100 docking calculations for each ligand separately and 1,750,000 energy assessments for each docking calculation using the Lamarck genetic algorithm (Lamarckian genetic algorithm), with the conformation with the lowest ligand binding energy as the optimal binding means for further analysis.

Molecular mechanics/generalized born surface area method (MM/GBSA) is an effective tool for obtaining protein-ligand interactions and free energy of binding of protein-protein interactions. We calculated the free energy of ligand-receptor interaction using MM/GBSA method as follows:

ΔG _binding ＝G _complex -(G _receptor +G _ligand ) (8)

wherein G is _complex ，G _receptor And G _ligand The free energy of the receptor-ligand complex, receptor and ligand, respectively, can be calculated by the following formula:

G＝E _gas +G _sol -TS (9)

E _gas ＝E _int +E _ele +E _vdw (10)

G _sol ＝G _psolv +G _npsolv (11)

wherein E is _gas Is gas phase energy, is thermodynamic energy E _int Van der Waals energy E _vdw And electrostatic interaction energy E _ele And (3) summing. G _psolv And G _npsolv Is to solvation energy (G) _sil ) Polar and nonpolar contributions of (a). T is the temperature and S is the total conformational entropy.

All free energy calculations were performed in the SANDER program of AMBER 18.

Step 6: a dynamic network is respectively constructed aiming at a beta 2AR-ZINC5042 compound system and an apo-beta 2AR system which is not combined with a ligand, and community analysis and shortest path analysis are carried out based on the dynamic network, so that the ZINC5042 is combined to enable a communication network to be looser, the information transmission efficiency is reduced, and a negative allosteric regulation effect is presented.

First, we use the netview plug-in the VMD to generate a dynamic network for each system using all MD simulated traces (4.5 μs total) of each system. In a dynamic network, the cα and ligand critical atoms of the acceptor residues are represented by nodes if the heavy atoms of both nodes are present within 75% of the sampling timeAnd (3) adding an edge between the two nodes. Correlation value C _ij Defining information transfer of two nodes i and j in a given simulation time, wherein the information transfer can be carried out by using eqns (12) - (-)13 A) calculating.

Where i and j are the nodes and where,and->The position vector corresponding to the time t. />The average position of node i in a given simulation time. Distance (d) between two nodes (i and j) in a dynamic network _ij ) Calculated with eqn (14), which represents the probability of passing information across edges between nodes. All calculations were performed in the Carma program.

d _ij ＝-log(|C _ij |) (14)

The thickness of each edge in the dynamic network is scaled by distance, thicker edges indicating greater relevance. The original network is then further divided into different sub-networks, called communities, using the Girvan-Newman algorithm, where there are more and stronger connections within nodes within the community than nodes of other communities.

The shortest path is generated by the subpt program of the dynamic network matrix. The Floyd-Warshall algorithm is adopted to search the path with the shortest distance between two nodes, and the shortest path is often the most probable or biologically relevant path, so that potential allosteric regulation paths and mechanisms are revealed. Path length between two nodes in a dynamic network (D _ij ) Equal to the sum of the individual path lengths involved between node sets i, j, can be calculated by eqn (15):

D _ij ＝∑ _k,l d _k,l (15)

the effect of allosteric modulators on the communication of the various regions of the protein can be revealed by the strength of the connection between each community in the community analysis. While the shortest path may reveal two parts, the length of the shortest path may reflect the efficiency of the ligation of two regions of the protein, the length being inversely proportional to the efficiency, and furthermore the residues involved in the shortest path may be considered as residues that play an important role in the allosteric regulation of the protein. Thus we have shown by such an analysis how the potential drug molecule has an effect on the allosteric communication of the protein and also how this effect may be caused by which protein residues, i.e. the allosteric modulator mechanism.

In summary, the application realizes:

(one) recognition of novel allosteric sites of β2AR

First, a large set of receptor conformations was obtained by Gaussian-accelerated molecular dynamics simulation (GaMD) of a total of 15 μs (5*3 μs) on the complex structure of inactive β2AR (β2AR-NE) binding to endogenous agonist NE to fully capture receptor flexible movements. The trajectories are clustered according to the RMSD of the acceptor framework atoms and then input into the MDCNN model for training. The larger the number of clusters, the smaller the cluster-to-cluster difference, and the higher the similarity. Thus, as the number of clusters increases, the model predictive accuracy of the MDCNN decreases somewhat, as shown in fig. 7. When the clustering number is equal to 3, the model accuracy of the MDNN is 0.903+/-0.003%, and the clustering index shows a good clustering effect. Thus, based on the model LIME interpreter results at cluster number 3, we calculated the importance scores for residues of β2AR, we looked at the three classes of conformationally important top-ranked 20 residues, as shown in FIG. 2, and when distinguishing cluster0 (FIG. 2A), cluster1 (FIG. 2B) and cluster2 (FIG. 2C), the importance scores and distributions of the 20 important residues identified by the LIME interpreter, the residues were identified using the Ballesteros-Weinstein number. The important residues of cluster0 and cluster2 are mainly distributed near ECL2 and ECL3, ECL2 and ECL3 belong to extracellular loop regions, and have higher flexibility and are therefore easy to undergo conformational changes so as to obtain higher scores. In addition, cluster0 has a partially specific residue distributed at the extracellular end of TM1 and near ICL 1. Surprisingly, the specific residues of cluster1 are mostly distributed in the middle of TM6 and near the intracellular end of TM7, including a number of molecular switches that affect receptor activation, such as F6.44 in PIF motif and N7.49, P7.50 in NPxxY motif, among other important residues Y7.43 and N7.45. This suggests that this class of conformational states is likely to be important functional intermediates. We therefore subsequently performed site prediction for this functional intermediate cluster 1.

The research shows that the accuracy of the site prediction tool FTSite for predicting the allosteric site in the protein is up to 88%. Two representative conformations (conf 1 and conf 2) were obtained from the cluster1 conformational set using k-means clustering according to the acceptor framework atom RMSD, which represented 70% of the conformations, with very high representativeness. Similar allosteric sites were predicted from these two representative conformations using FTSite, which is characterized as being intermediate between the two important molecular switches W6.48 and D2.50, as shown in fig. 3. However, the predicted pocket of conf1 contains more residues than conf2, which suggests that the allosteric pocket of conf1 assumes a more open state, as shown in fig. 8, in which pocket residues are numbered using the ballisteos-Weinstein. The specific residues possessed by the allosteric pocket of conf1 are underlined. We predict that the resulting allosteric site has never been reported in β2ar, so this newly discovered allosteric site should be a potential β2ar allosteric site.

(II) screening for potential allosteric modulators against predicted sites

Virtual screening has been successfully applied to the identification of allosteric modulators, including GPCRs. Considering that protein flexibility is critical for efficient drug design, combinations of two or three conformations of target proteins generally perform better than virtual screening of random single conformations. Thus, we performed a two-pass virtual screening using conf1 and conf2, as shown in fig. 4. The ligand set consisted of two parts, diverse-lib and Drugs-lib, for a total of 103,862 ligand molecules. The diversity-lib contains 99,288 molecules with chemical diversity that facilitates screening out new backbone drug molecules, while Drugs-lib contains 4,574 commercially approved drug molecules, which allows us to perform drug reuse. Based on the score of two molecular docking (ad4.2energy, vinascore) and visual inspection to exclude the top-ranked potential false positive molecules, we selected four candidate compounds, as shown in fig. 9, where compound ZINC5042 gave a very high score in both conformational screens. Notably, all candidate compounds formed hydrogen or ionic bonds with Asp79, indicating that this interaction is critical for binding and recognition of the ligands at the allosteric sites, as shown in fig. 5, the optimal docking positions of the ligands at the conf1 (a) and conf2 (B) allosteric sites. MD is widely used to study complex stability and interaction patterns and can be used as a virtual screening post-treatment tool to verify and improve docking protocols. Thus, MD simulations of five ligand complexes for 120ns were performed as a post-treatment for virtual screening to verify the binding strength of the ligand-receptor complex, and we used the more stable results for the subsequent discussion in the text for the simulation results of the binding of the same ligand ZINC5042 in both conformations. RMSDs showed that all systems reached equilibrium in a 120ns MD simulation and that RMSDs for other ligands, except for the complex macromolecular ligand ZINC252008995, were less than 2 angstroms, indicating that the ligand-receptor complex obtained by virtual screening was relatively stable, demonstrating the reliability of the docking protocol. MM/GBSA was used to calculate ligand-receptor binding free energy as shown in figure 10. The results show that ZINC5042 binds to the receptor most stably and the binding energy reaches-54.1681 kcal/mol. Since the structural differences of the four ligands are large, the contribution of conformational entropy to free energy is calculated, and the result shows that the conformational entropy of ZINC5042 is best-52.5650 kcal/mol, which indicates that the ligand is the most stable after binding, and that the total free energy after entropy change is-1.6030 kcal/mol, which is far smaller than the other three small molecules, which indicates that ZINC5042 is the most likely potential allosteric modulator, and therefore is selected for subsequent mechanism investigation.

(III) revealing allosteric regulatory mechanisms of allosteric modulators

Proteins, upon local stimulation, induce conformational changes in the protein, thereby causing coupling reactions in other areas away from the stimulated area, which long-range coupling reactions are central to allosteric regulation. Many experimental and computational studies have now shown that the ability of proteins to undergo conformational changes comes from being conferred by a network of interactions between residues. Calculation of the protein dynamic network containing dynamic correlations from the simulated trajectories approximates the allosteric signal intensities associated with experimental observations.

Therefore, to understand the allosteric regulation mechanism of ZINC5042, we performed a conventional molecular dynamics simulation of 1500ns for the β2ar-ZINC5042 complex system and the ligand-unbound apo- β2ar, respectively, and constructed a dynamics-dependent network for the system based on covariance matrix. Residue community networks are highly internal to the community nodes but loosely connected between communities, nodes within a single community can communicate through multiple paths but communication between different communities must occur through a small number of critical edges or interactions. The community network indicates the community network, and as shown in fig. 6 a, the APO- β2ar and β2ar-ZINC5042 are subjected to community analysis and displayed as a three-dimensional structure of β2ar and a two-dimensional (2D) community residue interaction network, respectively. The web communities are respectively colored according to the ID numbers. In the 2D graph, one community is represented by a node, and the size of the community is determined by the number of residues in the community. Communities containing the allosteric modulator ZINC5042 are represented by diamond nodes, others by circular nodes. The connecting edges are represented by grey lines, the width of which is proportional to the strength of the information flow between the connecting communities. The apo- β2ar system was shown to have 10 communities and 8 isolated clusters, and the β2ar-ZINC5042 system had 13 communities and 3 isolated clusters. The flexibility of the protein ends and flexible loop regions is greater, so there is a dramatic change in conformation and less interaction with surrounding residues, and thus easier independent clustering. The greater the number of communities, the more loose the network, and the less efficient the information transfer of the system. Thus community analysis indicates that the dynamic network community of the receptor becomes more loose when the allosteric ligand ZINC5042 is combined, the information transmission efficiency of each region of the receptor is reduced, and negative allosteric regulation is presented.

Floyd-Warshall algorithm was used to study the shortest path between critical extracellular binding pocket residues in different systems to intracellular critical molecular switches, e.gAs shown in fig. 6B, which is a schematic diagram of the shortest path of the APO- β2ar and β2ar-ZINC5042 systems from the extracellular ligand binding pocket to the intracellular region of the receptor, the shortest path is generally considered the most likely or biologically relevant path. The shortest allosteric pathway for the Apo- β2ar system is N312 ^7.39 -Y316 ^7.43 -G320 ^7.47 -P323 ^7.50 -Y326 ^7.53 While the shortest path N312 of the beta 2AR-ZINC5042 system ^7.39 -Y316 ^7.43 -G320 ^7.47 -N322 ^7.49 -Y326 ^7.53 The shortest path between both systems is to transmit the signal directly into the cell via TM7 and there is only one constituent residue difference. But apo has a shortest path length of 35 and β2ar-ZINC5042 has a shortest path length of 88. The path length is inversely proportional to the signaling efficiency, which suggests that ZINC5042 binding does not significantly alter the shortest signaling pathway between the inside and outside of the receptor, but greatly impairs signaling efficiency between the inside and outside of the cell, which is detrimental to receptor activation, consistent with community analysis findings. Thus, we analyzed through dynamic network that ZINC5042 binding does not alter the shortest signaling pathway between the receptor's cell and cell, but as this ligand binding would make the protein network more loose, thus reducing the intracellular and extracellular signaling efficiency, which is detrimental to receptor activation, ZINC5042 exhibits negative allosteric regulation and is a negative allosteric modulator of β2ar.

Although the application has been described herein with reference to the above-described illustrative embodiments thereof, the foregoing embodiments are merely preferred embodiments of the present application, and it should be understood that the embodiments of the present application are not limited to the above-described embodiments, and that numerous other modifications and embodiments can be devised by those skilled in the art that will fall within the scope and spirit of the principles of this disclosure.

Claims

1. A method for identifying a protein allosteric modulator based on deep learning and computational simulation comprising:

step S100, obtaining MD simulation tracks of a protein complex combined with an endogenous agonist by using Gaussian acceleration molecular dynamics simulation;

step S300, using MD simulation tracks and cluster labels to train a classification model MDCNN based on a convolutional neural network CNN, wherein the CNN model identifies different conformation states from the MD simulation tracks, a model interpreter LIME in the MDCNN finds out a key structure and key residues of each conformation state while identifying functional states, and the key residues fed back by the LIME assist in selecting valuable conformation states for subsequent allosteric site prediction;

step S400, selecting a valuable conformational input site prediction tool FTSite to predict allosteric sites according to key residues of conformational in MDNN, wherein the sites with highest scores except for the orthosteric sites are considered as potential allosteric sites;

step S500, aiming at the predicted potential allosteric site, using a structure-based virtual screening method to obtain the potential allosteric modulator with the most stable combination, and outputting a protein complex structure;

step S600, revealing an allosteric regulation mechanism of the potential drug molecules by means of dynamic network analysis, and confirming the properties of the potential drug molecules.

2. The method for identifying a protein allosteric modulator based on deep learning and calculation simulation according to claim 1, wherein the step S100 specifically comprises:

step S110, acquiring an inactive state crystal structure of a target protein;

step S140, protonating the protein and the ligand in the physiological environment of the target protein, and constructing a simulation system similar to the physiological environment;

step S150, for the simulation system which is completed by system minimization and heating, unconstrained dynamics simulation cMD is carried out under the NPT system, the simulation system is operated to a relatively balanced state, the last structure after cMD is balanced is used as the initial structure of Gaussian acceleration molecular dynamics simulation, and a Gaussian acceleration molecular dynamics simulation program is started to operate.

3. The method for identifying a protein allosteric modulator based on deep learning and calculation simulation according to claim 2, wherein the step S200 specifically comprises:

step S210, extracting protein conformations from the Gaussian acceleration MD track of the protein compound at intervals to form a protein conformational set representing the whole track, and calculating conformational features for distinguishing conformational states;

and S220, clustering protein conformations by using the conformational features as clustering indexes and using an unsupervised clustering analysis algorithm, and selecting an optimal clustering result as a label of a protein conformation set.

4. The method for identifying a protein allosteric modulator based on deep learning and calculation simulation according to claim 3, wherein the step S300 specifically comprises:

step S310, data processing: using the protein conformation set obtained in S210, using protein cα atom superposition to eliminate overall rotation and translation, deleting all hydrogen atoms, and then converting the coordinates of other atoms into RGB coordinates, thereby obtaining a data set;

step S320, adding a label: reading the tag of the protein conformation set obtained in the step S220 as a data tag, wherein the data set corresponds to the data tag one by one and is used for marking the conformation in the data set to which type;

step S330, data set division: grouping data to eliminate the influence of a simulation time sequence, then randomly dividing the data into a training set and a verification set according to a preset proportion, and carrying out K-fold division on the data set to obtain a K-fold cross verification data set, wherein K is more than or equal to 1 and less than 10, and K is an integer;

step S340, constructing a model: taking the data set as input, training a classification model MDNN based on a convolutional neural network CNN for classifying and identifying protein conformational states, adopting K-fold cross verification in the training and verification process, and using accuracy ACC to evaluate the performance of the classifier;

step S350, constructing a model interpreter: constructing a LIME interpreter, interpreting the prediction result of MDCNN in a local linear fitting mode, and finding out a key structure and a key residue of each conformational state;

5. The method for identifying a protein allosteric modulator based on deep learning and calculation simulation according to claim 4, wherein the step S400 specifically comprises:

step S410, projecting important residues of each conformational state reflected by MDNN into a protein structure, and visually selecting a valuable conformational intermediate state, namely a target conformational intermediate state, by using Pymol;

and S420, extracting a representative structure from the target conformational intermediate state track, storing the representative structure in a pdb format, uploading the representative structure to a server of the FTSite, and obtaining three potential ligand binding site prediction results, wherein the site with the highest score except the orthosteric site is used as a potential allosteric site for virtual screening.

6. The method for identifying a protein allosteric modulator based on deep learning and calculation simulation according to claim 5, wherein the step S500 specifically comprises:

step S510, preparing a small molecule data set for virtual screening, generating a butt joint box to completely cover all residues in potential allosteric sites, and starting structure-based virtual screening;

step S520, selecting the most stable molecule combined with the protein as a potential allosteric modulator, and outputting a protein-allosteric modulator complex structure.

7. The method for identifying a protein allosteric modulator based on deep learning and calculation simulation according to claim 6, wherein the step S600 specifically comprises:

step S610, constructing a simulation system for the protein-allosteric modulator compound in a physiological environment, and obtaining a section of simulation track by using molecular dynamics simulation;

step S630, analyzing an allosteric regulation mechanism of the potential drug molecules by utilizing a dynamic network.