CN115938488A

CN115938488A - Method for identifying protein allosteric modulator based on deep learning and computational simulation

Info

Publication number: CN115938488A
Application number: CN202211500668.3A
Authority: CN
Inventors: 蒲雪梅; 陈建芳; 陈欣; 毛俊; 刘静
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2022-11-28
Filing date: 2022-11-28
Publication date: 2023-04-07
Anticipated expiration: 2042-11-28
Also published as: CN115938488B

Abstract

The invention discloses a method for identifying protein allosteric modulators based on deep learning and computational simulation. The method comprises: obtaining an MD simulation trajectory of a protein complex bound to an endogenous agonist; preliminarily classifying the MD simulation trajectory to generate cluster labels; inputting the MD simulation trajectory and the cluster labels into MDCNN to find the key residues of each conformational state, selecting a valuable conformational state, and inputting it into FTSite to predict potential allosteric sites; obtaining the most stably bound potential drug molecule by structure-based virtual screening against the predicted potential allosteric site; and revealing the allosteric regulation mechanism of the potential drug molecule by dynamic network analysis, thereby confirming its properties. By combining molecular dynamics simulation, deep learning, virtual screening and dynamic network analysis, the method identifies potential allosteric sites of a protein, screens out potential allosteric modulators, and investigates their allosteric regulation mechanism.

Description

Method for identifying protein allosteric modulator based on deep learning and computational simulation

Technical Field

The invention relates to the technical field of identifying allosteric modulators of G protein-coupled receptors, and in particular to a method for identifying protein allosteric modulators based on deep learning and computational simulation.

Background

Allosteric mechanisms provide a new paradigm for modulating receptor function, and the rational design of allosteric modulators is therefore of increasing interest. The orthosteric site of a protein is the site where the endogenous activating ligand binds. Strong evolutionary conservation of orthosteric sites leads to cross-reactivity of orthosteric ligands, which may cause undesirable therapeutic side effects. Allosteric modulators, by contrast, bind to a site that is topologically distinct from the orthosteric site and therefore do not compete with orthosteric ligands. Because allosteric sites are less evolutionarily conserved than orthosteric sites, allosteric modulators may show better subtype selectivity and specificity, and fewer side effects, than orthosteric ligands. Allosteric regulation of proteins is rather subtle. For example, a positive allosteric modulator can enhance downstream signaling in four different ways: (1) increasing orthosteric agonist binding affinity without directly affecting signaling, (2) directly enhancing signaling without affecting orthosteric agonist binding, (3) increasing orthosteric ligand binding affinity while also increasing signaling by itself, and (4) decreasing orthosteric ligand binding affinity while increasing signaling by itself. Negative allosteric modulators may use similar combinations to reduce downstream signaling. Allosteric modulators stabilize unique conformations of the protein ensemble, providing new pharmacology for the receptor. Thus, more and more allosteric modulators are being identified as potential drugs.

However, because the discovery process is very challenging, only a few allosteric modulators have been approved as drugs or have entered clinical trials. Detecting the allosteric behavior of modulators in pharmacological experiments is difficult, and mutation experiments used to determine the binding sites of allosteric modulators often give false positives. Since 2013, solving the structures of complexes has been the most successful way to identify the binding sites and poses of allosteric modulators in GPCRs, but it is costly. In recent years, advances in structural biology and computational technology have uncovered the allosteric mechanisms of a large number of targets, making the rational design of allosteric modulators a new avenue for drug discovery.

The identification of allosteric sites is a prerequisite for structure-based virtual screening of allosteric modulators; however, allosteric sites are usually cryptic and difficult to find in solved protein crystal structures. An allosteric pocket is usually absent from the apo (ligand-free) structure, and only in the presence of a ligand does the relaxed state of the allosteric site dominate the conformational ensemble. Molecular simulation is an effective method for generating conformational ensembles and exploring these sites. Combining site prediction with MD-based GPCR conformational ensembles can detect sites that are not apparent in static experimental structures, which is attractive for the discovery of new allosteric sites.

Although computer-aided design of allosteric modulators has already been applied in some cases, the following problems remain:

(1) Allosteric effects usually occur on microsecond (μs, 10^-6 s) or millisecond (ms, 10^-3 s) time scales, whereas conventional molecular dynamics simulation (cMD) can generally capture only conformational changes on the nanosecond (ns, 10^-9 s) time scale, so cryptic allosteric sites may be difficult to capture.

(2) Extensive molecular dynamics simulation is required to capture cryptic allosteric sites, which yields large amounts of simulation data. Analyzing these data manually is very difficult and time-consuming, and is subject to bias from prior knowledge. Screening for valuable conformational intermediates is a prerequisite for rapid and efficient prediction of allosteric sites.

(3) Virtual screening can select the most stably bound small molecule, but it cannot determine the properties of that molecule or its allosteric regulation mechanism. The mechanism of action of allosteric modulators is one of the most interesting questions in related research and, because of its complexity, is difficult to reveal experimentally.

Disclosure of Invention

The invention aims to provide a method for identifying a protein allosteric modulator based on deep learning and computational simulation, which is used for identifying potential allosteric sites of a protein, screening potential allosteric modulators and further researching the allosteric modulation mechanism of the protein.

The invention solves the problems through the following technical scheme:

A method for identifying protein allosteric modulators based on deep learning and computational simulation, comprising:

Step S100, obtaining an MD simulation trajectory of a protein complex bound to an endogenous agonist by Gaussian accelerated molecular dynamics (GaMD) simulation; GaMD is an enhanced sampling method that lowers energy barriers by adding a boost potential, so that conformational changes on the millisecond scale can be sampled within a nanosecond-scale simulation time, allowing allosteric sites that appear on the millisecond time scale to be captured;

Step S200, preliminarily classifying the MD simulation trajectory of the protein complex by unsupervised cluster analysis to generate cluster labels;

Step S300, inputting the MD simulation trajectory and the cluster labels into MDCNN, a CNN-based classification model, to identify different conformational states from the MD simulation trajectory; while the functional states are identified, the model interpreter LIME in MDCNN finds the key structures and key residues of each conformational state, and the key residues reported by LIME are used to select valuable conformational states for the subsequent allosteric site prediction;

Step S400, inputting the selected conformational state into the site prediction tool FTSite for allosteric site prediction, and taking the highest-scoring site other than the orthosteric site as the potential allosteric site;

Step S500, obtaining the most stably bound potential drug molecule by structure-based virtual screening against the predicted potential allosteric site;

Step S600, revealing the allosteric regulation mechanism of the potential drug molecule and confirming its properties by dynamic network analysis, which identifies the allosteric pathways and important residues that play a key role in the transmission of structural information.

The step S100 specifically includes:

Step S110, obtaining an inactive-state crystal structure of the target protein;

Step S120, deleting all components of the crystal structure other than the target protein, and rebuilding the structural regions missing from the crystal structure so that the structure of the target protein is complete;

Step S130, obtaining the structure of the endogenous agonist of the protein, then constructing the structure of the protein-endogenous agonist complex by molecular docking and selecting the reasonable docking pose with the highest score;

Step S140, protonating the protein and the ligand according to the physiological environment of the target protein, and constructing a simulation system resembling the physiological environment, the simulation system generally comprising the protein complex, solvent molecules, ions and lipid membrane components;

Step S150, after energy minimization and heating of the constructed simulation system, running unrestrained cMD under the NPT ensemble until the system reaches a relatively equilibrated state; then taking the last structure of the equilibrated cMD as the initial structure of the Gaussian accelerated molecular dynamics simulation and starting the GaMD run.

The step S200 specifically includes:

Step S210, extracting protein conformations from the GaMD trajectory of the protein complex at regular intervals to form a protein conformation set representative of the whole trajectory, and calculating conformational features suitable for distinguishing conformational states in the system under study;

Step S220, clustering the protein conformations with an unsupervised clustering algorithm using the conformational features as the clustering index, and selecting the optimal clustering result for subsequent use as the labels of the protein conformation set in the MDCNN model.
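The conformational feature of step S210 is left open in general; in the embodiment it is the backbone RMSD relative to the first frame. A minimal sketch of that feature is given below, assuming the frames have already been superimposed. The function names and the toy coordinates are illustrative only and are not part of the claimed method.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two pre-aligned coordinate sets.

    Each argument is a list of (x, y, z) tuples for the same atoms in the
    same order. Superposition (removal of rotation and translation) is
    assumed to have been done beforehand and is NOT performed here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and equal in length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def rmsd_series(frames, reference_index=0):
    """RMSD of every frame relative to one reference frame."""
    reference = frames[reference_index]
    return [rmsd(frame, reference) for frame in frames]


# Toy trajectory: three frames of a two-atom "backbone" (illustrative values).
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(3.0, 4.0, 0.0), (4.0, 4.0, 0.0)],
]
features = rmsd_series(frames)
```

The resulting one-value-per-frame list is the kind of clustering index that step S220 would consume.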

The step S300 specifically includes:

Step S310, data processing: for the protein conformation set obtained in step S210, superimposing the protein Cα atoms to eliminate global rotation and translation, deleting all hydrogen atoms, and converting the coordinates of the remaining atoms into RGB coordinates to obtain a data set;

Step S320, adding labels: reading the clustering result of the protein conformation set obtained in step S220 as the label data of the data set, the data and the labels being in one-to-one correspondence so as to indicate which class each conformation in the data set belongs to;

Step S330, data set division: grouping the data to eliminate the influence of the simulation time ordering, then randomly dividing the data into a training set and a validation set in a given proportion, and performing K-fold division on the data set to obtain a K-fold cross-validation data set;

Step S340, model construction: training the model MDCNN, based on a convolutional neural network (CNN), for classifying and recognizing protein conformational states with the data set as input, using K-fold cross-validation during training and validation, and then evaluating the performance of the classifier by the accuracy (ACC);

Step S350, constructing a model interpreter: constructing a LIME interpreter to interpret the predictions of MDCNN by local linear fitting;

Step S360, summing the scores of all atoms contained in each residue to obtain an importance score for each residue of the protein, ranking the residues, and taking the top-ranked residues as the important residues of the conformational state.

The step S400 specifically includes:

Step S410, projecting the important residues of each conformational state identified by MDCNN onto the protein structure, and visually picking out valuable conformational intermediate states using PyMOL;

Step S420, extracting a representative structure from the trajectory of the target conformational intermediate state and saving it in pdb format;

Step S430, uploading the representative protein structure file to the FTSite server (https://ftsite.bu.edu/) to obtain three predicted potential ligand binding sites, and taking the highest-scoring site other than the orthosteric site as the potential allosteric site for virtual screening.

The step S500 specifically includes:

Step S510, preparing a small-molecule dataset for virtual screening, generating a docking box that just covers all residues of the potential allosteric site, and starting structure-based virtual screening;

Step S520, selecting the molecule that binds most stably to the protein as the potential allosteric modulator by combining the docking scores of the small molecules with other approaches such as manual inspection or higher-precision binding free energy calculation, and outputting the structure of the protein-allosteric modulator complex.
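The docking box of step S510 is only required to just cover all residues of the predicted site. One simple way to obtain such a box, sketched here under stated assumptions, is to take the axis-aligned bounding box of the site atoms. The `padding` and `spacing` parameters and the function name are hypothetical, and the coordinates are made up for illustration.

```python
import math


def docking_box(site_atoms, padding=0.0, spacing=1.0):
    """Return (center, size, npts) of an axis-aligned box covering site_atoms.

    site_atoms: list of (x, y, z) coordinates of all atoms of the site residues.
    padding:    extra margin added on every side (hypothetical parameter).
    spacing:    grid spacing used to turn edge lengths into grid intervals.
    """
    if not site_atoms:
        raise ValueError("site_atoms must not be empty")
    mins = [min(atom[i] for atom in site_atoms) - padding for i in range(3)]
    maxs = [max(atom[i] for atom in site_atoms) + padding for i in range(3)]
    center = tuple((lo + hi) / 2 for lo, hi in zip(mins, maxs))
    size = tuple(hi - lo for lo, hi in zip(mins, maxs))
    # Round up so the gridded box is never smaller than the requested size.
    npts = tuple(math.ceil(edge / spacing) for edge in size)
    return center, size, npts


# Illustrative coordinates only.
center, size, npts = docking_box(
    [(0.0, 0.0, 0.0), (10.0, 4.0, 2.0)], padding=1.0, spacing=0.5
)
```

The center and number of grid intervals per axis are the quantities a grid-based docking program typically asks for; the actual values depend on the site predicted for the target protein.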

The step S600 specifically includes:

Step S610, constructing a simulation system in a physiological environment for the protein-allosteric modulator complex obtained in step S500, and obtaining a simulation trajectory by molecular dynamics simulation;

Step S620, generating a dynamic network: inputting the simulation trajectory of the protein-allosteric modulator complex into the VMD software and generating a dynamic network with the NetworkView plugin;

Step S630, analyzing the allosteric regulation mechanism of the potential drug molecule using the dynamic network, comprising:

Community analysis: the dynamic network is further divided into sub-networks using the Girvan-Newman algorithm in VMD, and the community network is analyzed visually in VMD to obtain the distribution of the communication network over the structural domains of the protein under the influence of the allosteric modulator;

Shortest-path analysis: the path with the shortest distance between two nodes in the protein network is found with the Floyd-Warshall algorithm using the subopt program in VMD. The shortest path is often the most probable or most biologically relevant signaling pathway; from it, the allosteric communication pathways between the functional domains of the protein under the action of the allosteric modulator can be derived, thus revealing the allosteric regulation mechanism of the allosteric modulator.
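The shortest-path analysis above relies on the Floyd-Warshall algorithm. The following is a minimal, self-contained sketch of that algorithm with path reconstruction. It is not the VMD implementation; the node numbering and edge lengths are illustrative, and in practice the edge lengths would come from the dynamic network.

```python
import math


def floyd_warshall(n, edges):
    """All-pairs shortest paths on an undirected weighted graph.

    n:     number of nodes, labelled 0..n-1
    edges: dict mapping (i, j) -> non-negative edge length
    Returns (dist, nxt); nxt is used to reconstruct the paths.
    """
    dist = [[math.inf] * n for _ in range(n)]
    nxt = [[None] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
        nxt[i][i] = i
    for (i, j), length in edges.items():
        if length < dist[i][j]:
            dist[i][j] = dist[j][i] = length
            nxt[i][j] = j
            nxt[j][i] = i
    # Allow each node k in turn to act as an intermediate stop.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via = dist[i][k] + dist[k][j]
                if via < dist[i][j]:
                    dist[i][j] = via
                    nxt[i][j] = nxt[i][k]
    return dist, nxt


def shortest_path(nxt, source, target):
    """Node sequence of the shortest path, or [] if target is unreachable."""
    if nxt[source][target] is None:
        return []
    path = [source]
    while source != target:
        source = nxt[source][target]
        path.append(source)
    return path


# Illustrative 5-node network; node 4 is isolated.
edges = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 3.0, (2, 3): 0.5}
dist, nxt = floyd_warshall(5, edges)
route = shortest_path(nxt, 0, 3)
```

Here the direct edge 0-2 (length 3.0) is bypassed in favour of the two-step route through node 1, which is the behaviour that makes the shortest path a proxy for the most efficient communication route.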

Compared with the prior art, the invention has the following advantages and beneficial effects:

(1) The invention identifies potential allosteric sites of a protein, screens out potential allosteric modulators, and investigates their allosteric regulation mechanism by means of molecular dynamics simulation, deep learning, virtual screening and dynamic network analysis. Taking the β2-adrenergic receptor (β2AR) as an example, we successfully identified a novel allosteric site of β2AR located between the highly conserved molecular switches W6.48 and D2.50. Screening against this site yielded a potential negative allosteric modulator, ZINC5042, which stabilizes the receptor in the inactive state and shows negative cooperativity with the orthosteric agonist. The invention is equally applicable to the study of other allosteric modulators.

(2) The invention uses Gaussian accelerated molecular dynamics simulation to overcome the time-scale limitation of conventional molecular dynamics simulation and can sample conformational changes on the millisecond scale, which helps capture potential allosteric sites. GaMD does not require acceleration parameters to be set manually, so it is simple and quick to run. By adding a boost potential that lowers energy barriers, GaMD can sample millisecond-scale conformational changes within a nanosecond-scale simulation time, which greatly reduces the computational cost.

(3) The invention combines unsupervised clustering with a supervised classification model based on a convolutional neural network (MDCNN), achieving automated processing of molecular simulation trajectories. The user only needs to input the trajectory into the model, without complex preprocessing, and MDCNN automatically completes the modeling, interpretation and analysis, which greatly improves the efficiency of trajectory analysis and effectively avoids the bias of manual analysis. In addition, the CNN model is interpreted by the LIME interpreter integrated in MDCNN, which helps capture the specific and important residue distributions that distinguish different conformational states, and thus helps screen and identify valuable intermediate states for subsequent allosteric site prediction.

(4) The invention integrates dynamic network analysis to reveal the mechanism of action of the allosteric modulator. Dynamic network analysis can evaluate the efficiency of allosteric information transfer and identify the allosteric pathways and important residues that play a key role in the transmission of structural information, thereby revealing the pharmacological properties and mechanism of action of the allosteric modulator and assisting experimental and theoretical research in the field.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 shows the importance scores and distribution of the 20 important residues identified by the LIME interpreter;

FIG. 3 is a schematic diagram of the new allosteric site of β2AR predicted by FTSite in two representative conformations, conf1 and conf2, of the conformational intermediate state cluster 1;

FIG. 4 is a schematic diagram of the virtual screening strategy;

FIG. 5 is a schematic diagram of the optimal docking poses of the ligands at the conf1 and conf2 allosteric sites;

FIG. 6 is a diagram of the dynamic network analysis;

FIG. 7 shows the results of five-fold cross-validation of MDCNN on β2AR;

FIG. 8 shows details of the allosteric pockets predicted in conf1 and conf2;

FIG. 9 shows the docking scores of the four candidate compounds;

FIG. 10 shows the components of the binding energy (kcal/mol) between β2AR and the four candidate ligands.

Detailed Description

The present invention will be described in further detail with reference to examples, but the embodiments of the present invention are not limited thereto.

Example:

Referring to FIG. 1, a method for identifying protein allosteric modulators based on deep learning and computational simulation, illustrated here with allosteric modulators of the β2-adrenergic receptor (β2AR), comprises:

Step 1: obtaining the molecular simulation trajectory dataset of the β2AR complex bound to the endogenous agonist NE: a total of 5 × 3 μs of extensive Gaussian accelerated molecular dynamics simulation is performed to sample the conformational space of the protein. Specifically:

System preparation:

We obtained the inactive-state crystal structure of β2AR (PDB ID: 2RH1) from the Protein Data Bank (PDB); it has the highest resolution of all reported β2AR crystal structures and is commonly used as the β2AR template in many studies. All components of the crystal structure other than the protein were deleted. The missing ICL3 region (residues 231-262) was rebuilt using MODELLER v9.21. The 3D structure of the agonist norepinephrine (NE) was downloaded from the PubChem database and optimized at the DFT/B3LYP/6-31G* level with the Gaussian 09 program before docking. Ligand docking was carried out with AutoDock 4.2, and the reasonable docking pose with the highest score was selected for the subsequent simulation.

Hydrogen atoms were added at pH = 7 before the dynamics simulation. The protein structure was oriented according to the OPM database and inserted into a […] lipid bilayer consisting of 80% phosphatidylcholine (POPC) and 20% cholesterol. The system was then solvated with TIP3P water molecules within […], and 0.15 mol/L NaCl was added to the aqueous phase to neutralize the system. All of the above steps were performed with the CHARMM-GUI server.

MD simulation setup:

First, energy minimization was achieved by adding […] and […] to the protein and the lipids, respectively. The system was then heated to 310 K in the NVT ensemble over 250 ps. Thereafter, 10 ns of unrestrained cMD equilibration was performed under the NPT ensemble at 310 K and 1 bar using a Langevin thermostat and a Monte Carlo barostat. All bonds involving hydrogen were constrained with the SHAKE algorithm, and the cutoff distance for non-bonded interactions was set to […]. Long-range electrostatic interactions were calculated with the particle mesh Ewald (PME) method. The time step was 2 fs. The CHARMM36 force field was used for the protein, lipids and salt ions, the CHARMM TIP3P model for water, and the CHARMM General Force Field (CGenFF) to generate the ligand parameters. Trajectory snapshots were saved every 10 ps. All of the above steps were performed in Amber 18. The same simulation settings were used for all cMD and GaMD runs.

Gaussian accelerated molecular dynamics simulation:

Before the Gaussian accelerated molecular dynamics simulation, 210 ns of unrestrained cMD equilibration was performed; the acceleration parameters were calculated from the last 10 ns of the cMD trajectory, and the final structure of the cMD simulation was taken as the starting structure of the subsequent Gaussian accelerated MD (GaMD) simulation. GaMD enhances the conformational sampling of biomolecules by adding a harmonic boost potential to the system when the system potential V(r) is lower than the reference energy E (eqns (1)-(2)):

$$V^{*}(\vec{r}) = V(\vec{r}) + \Delta V(\vec{r}) \tag{1}$$

$$\Delta V(\vec{r}) = \begin{cases} \frac{1}{2}k\left(E - V(\vec{r})\right)^{2}, & V(\vec{r}) < E \\ 0, & V(\vec{r}) \ge E \end{cases} \tag{2}$$

where $V^{*}$ is the modified potential, $\Delta V$ is the boost potential and k is the harmonic force constant. E and k are adjustable parameters determined automatically according to three enhanced sampling principles, and the reference energy E can be determined from eqn (3):

$$V_{max} \le E \le V_{min} + \frac{1}{k} \tag{3}$$

where $V_{min}$ and $V_{max}$ are the minimum and maximum potential energies of the system, respectively. For eqn (3) to hold, k must satisfy $k \le 1/(V_{max} - V_{min})$.
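To make the boost potential concrete, the sketch below evaluates a harmonic boost of the standard GaMD form, ΔV = ½·k·(E − V)² for V < E and 0 otherwise, together with the upper bound on k stated in the text. The function names and the energy values are illustrative assumptions; in a real run the parameters are derived by the simulation engine from the cMD statistics.

```python
def gamd_boost(v, e, k):
    """Harmonic boost potential: 0.5 * k * (E - V)^2 when V < E, otherwise 0."""
    return 0.5 * k * (e - v) ** 2 if v < e else 0.0


def boosted_energy(v, e, k):
    """Modified potential V* = V + dV."""
    return v + gamd_boost(v, e, k)


def max_force_constant(v_min, v_max):
    """Upper bound on k stated in the text: k <= 1 / (V_max - V_min)."""
    if v_max <= v_min:
        raise ValueError("v_max must exceed v_min")
    return 1.0 / (v_max - v_min)


# Illustrative numbers (arbitrary energy units), with E set to V_max.
v_min, v_max = -100.0, -50.0
k = max_force_constant(v_min, v_max)   # 0.02
e = v_max
low_energy_boost = gamd_boost(-60.0, e, k)    # inside the boosted region
high_energy_boost = gamd_boost(-40.0, e, k)   # above E, no boost
```

With these values a frame at V = -60 receives a boost of 1.0, which raises the bottom of the well relative to the barrier, while frames above E are left untouched.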

Step 2: taking the first frame of the simulation as the reference, the backbone RMSD of each conformation relative to the reference conformation is calculated, and unsupervised k-means clustering is carried out with the RMSD as the clustering index to obtain a cluster label for each conformation in the trajectory. Specifically:

To explore the conformational diversity of the MD ensemble, we used the k-means clustering algorithm on the RMSD of the receptor backbone atoms to generate an MD conformational ensemble representing the receptor trajectory. All snapshots were superimposed on the protein Cα atoms to eliminate global rotation and translation. Cluster analysis was performed with the CPPTRAJ program in AmberTools 18. The clustering results were evaluated using DBI (Davies-Bouldin index), pSF (pseudo-F statistic) and SSR/SST (R-squared value). The clustering result is used as the label input to the MDCNN model for learning. The centroid structure of each cluster is used as the cluster representative in site prediction and conformational analysis.
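As an illustration of how cluster labels can be derived from a per-frame RMSD series, the following sketch runs Lloyd's k-means on scalar values. It is not the CPPTRAJ implementation used in the example: the deterministic initialisation, the function name and the RMSD values are assumptions made for illustration.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd's k-means on scalar features (for example per-frame RMSD).

    Initial centroids are spread evenly over the sorted values so that the
    result is deterministic (an assumption of this sketch).
    Returns (labels, centroids).
    """
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    ordered = sorted(values)
    n = len(ordered)
    centroids = [ordered[(2 * c + 1) * n // (2 * k)] for c in range(k)]
    labels = [0] * len(values)
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every value.
        new_labels = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        # Update step: mean of the members; keep the old centroid if empty.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(values, new_labels) if lab == c]
            new_centroids.append(
                sum(members) / len(members) if members else centroids[c]
            )
        if new_labels == labels and new_centroids == centroids:
            break
        labels, centroids = new_labels, new_centroids
    return labels, centroids


# Hypothetical RMSD values (in angstroms) for six frames.
rmsd_values = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8]
labels, centroids = kmeans_1d(rmsd_values, k=2)
```

The `labels` list is the per-frame label vector that would accompany the conformations into the classifier; choosing k would still rely on indices such as DBI and pSF, as described above.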

Step 3: hydrogen atoms are stripped from the trajectory file and all frames are aligned to the first-frame conformation; the trajectory is then input into the MDCNN model together with the cluster labels for training. Important conformational intermediates are selected according to the results of the LIME interpreter. Specifically:

One frame was extracted every 100 ps from each 3 μs GaMD trajectory, giving 30,000 snapshots per trajectory as the conformational dataset. All MD coordinates were aligned to remove translation and rotation before the subsequent CNN analysis. Components other than the receptor and the flexible ICL3 region were removed, hydrogen atoms were deleted, and the coordinates of the remaining atoms were converted into RGB coordinates, so that each frame was converted into a pixel map used as the model input, as shown in eq 4:

$$C_{RGB} = M^{-1} C_{XYZ} \tag{4}$$

$C_{RGB}$ and $C_{XYZ}$ denote the three-dimensional RGB coordinates and XYZ coordinates of a color C, respectively. M is the transformation matrix (see eq 5).

$$M = [w_r R,\; w_g G,\; w_b B] \tag{5}$$

R, G and B are the three-dimensional color coordinates of the three primary colors (red, green and blue), respectively. W is the three-dimensional coordinate of the white point, and $w_r$, $w_g$ and $w_b$ scale the lengths of the three primary colors in the vector space. R, G, B and W are fixed for a given color system, and $w_r$, $w_g$ and $w_b$ can be calculated from eq 6.

$$W = w_r R + w_g G + w_b B \tag{6}$$
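The three relations above can be checked numerically. The sketch below solves eq 6 for the scaling factors, builds M according to eq 5, and applies eq 4. The color system used for the conversion is not stated in the description; the sRGB primaries and D65 white point used here are assumptions for illustration, as are the function names.

```python
def inverse_3x3(m):
    """Inverse of a 3x3 matrix (list of rows) via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("matrix is singular")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]


def mat_vec(m, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]


def build_transform(R, G, B, W):
    """Build M = [w_r R, w_g G, w_b B] with W = w_r R + w_g G + w_b B."""
    primaries = [[R[r], G[r], B[r]] for r in range(3)]   # columns R, G, B
    w_r, w_g, w_b = mat_vec(inverse_3x3(primaries), W)   # solve eq 6
    return [[w_r * R[r], w_g * G[r], w_b * B[r]] for r in range(3)]  # eq 5


def xyz_to_rgb(c_xyz, M):
    """Eq 4: C_RGB = M^-1 C_XYZ."""
    return mat_vec(inverse_3x3(M), c_xyz)


# Assumed color system: sRGB primaries (chromaticities) and D65 white point.
R = (0.64, 0.33, 0.03)
G = (0.30, 0.60, 0.10)
B = (0.15, 0.06, 0.79)
W = (0.9505, 1.0, 1.0890)
M = build_transform(R, G, B, W)
white_rgb = xyz_to_rgb(W, M)
```

By construction the white point maps to equal RGB components and each scaled primary maps to a unit vector, which is a convenient sanity check on any choice of R, G, B and W.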

The clustering result is read in as the label data of the conformational dataset, with conformations and labels in one-to-one correspondence, indicating which class each conformation in the dataset belongs to. The data are grouped to eliminate the influence of the simulation time ordering and then randomly divided into a training set and a validation set in a ratio of 8:2. The dataset is divided into five folds to obtain a five-fold cross-validation dataset.
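The description does not specify how the data are grouped. One plausible reading, sketched here as an assumption, is that consecutive frames are kept together in blocks so that highly correlated neighbouring frames do not end up on both sides of the split. The block size, seed and function names are hypothetical.

```python
import random


def grouped_split(n_frames, group_size, train_fraction=0.8, seed=0):
    """Split frame indices into training and validation sets by whole groups.

    Consecutive frames are placed in blocks of `group_size` (an assumed
    interpretation of "grouping"), and the blocks are shuffled and divided.
    """
    groups = [
        list(range(start, min(start + group_size, n_frames)))
        for start in range(0, n_frames, group_size)
    ]
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_train = round(len(groups) * train_fraction)
    train = [i for group in groups[:n_train] for i in group]
    valid = [i for group in groups[n_train:] for i in group]
    return train, valid


def k_fold(indices, k=5):
    """Yield (train, held_out) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        rest = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield rest, folds[i]


train, valid = grouped_split(n_frames=100, group_size=10, seed=1)
folds = list(k_fold(train, k=5))
```

With 100 frames in blocks of 10, eight blocks go to training and two to validation, matching the 8:2 ratio, and each of the five folds holds out one fifth of the training indices.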

A convolutional neural network (CNN) model with four convolutional layers was constructed: the first two convolutional layers each contain 32 convolution kernels of size 3 × 3, and the last two each contain 64 convolution kernels of size 3 × 3, with ReLU as the activation function. After every two convolutional layers, a 2 × 2 max-pooling layer was added, followed by a dropout (0.25) operation to prevent overfitting. Classification is performed by fully connected layers after the convolution and pooling layers: MDCNN uses two fully connected layers with 512 and n (n = number of classes) neurons, with ReLU and Sigmoid as the respective activation functions, and dropout (0.5) is added to improve the generalization ability of the model. The model was trained with five-fold cross-validation, and the performance of the classification model was evaluated by the accuracy (ACC), which can be calculated by eq 7:

$$ACC = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$

Here, TP is the number of positive samples that the model predicts as positive, FP is the number of negative samples predicted as positive, FN is the number of positive samples predicted as negative, and TN is the number of negative samples predicted as negative.
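The accuracy metric can be written directly from the four counts defined above, as (TP + TN) divided by the total number of samples. The short sketch below also includes an equivalent label-based form for the multi-class case; the function names and the numbers are illustrative.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy from confusion-matrix counts: (TP + TN) / all samples."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


def accuracy_from_labels(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and equal in length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


acc_counts = accuracy(tp=40, tn=45, fp=5, fn=10)
acc_labels = accuracy_from_labels([0, 1, 2, 2, 1], [0, 1, 2, 1, 1])
```

In five-fold cross-validation this value would be computed once per fold and the five results reported together.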

LIME interprets the predictions of the classifier by local linear approximation. For each conformation, the LIME interpreter generates a LIME matrix that gives the importance of each pixel to the classification result. The LIME matrix has the same size as the image, and each element corresponds to a pixel representing an atom. Each element of the LIME matrix has a value of 0 or 1: 0 indicates that the element has little influence on the classification decision, and 1 indicates that it has a significant influence. We sum the LIME matrices over the conformations and average them to obtain a score between 0 and 1; the larger the value, the more important the atom is to the classification result. The scores of all atoms in a residue are then averaged to give the importance score of the residue. In summary, by combining these linear models we obtain an approximate interpretation of the CNN. The 20 highest-scoring residues are taken as the important specific residues of the conformational state. Based on the distribution of the important residues, valuable conformational intermediates can be picked out.
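The aggregation from binary LIME matrices to residue importance scores can be sketched as follows. For simplicity each LIME matrix is flattened to one 0/1 value per atom, which is an assumption of this sketch; the residue labels and mask values are hypothetical.

```python
from collections import defaultdict


def atom_scores(lime_matrices):
    """Average binary LIME masks (one flat 0/1 list per conformation) per atom."""
    n = len(lime_matrices)
    return [sum(column) / n for column in zip(*lime_matrices)]


def residue_scores(scores, atom_to_residue):
    """Average the scores of the atoms belonging to each residue."""
    grouped = defaultdict(list)
    for score, residue in zip(scores, atom_to_residue):
        grouped[residue].append(score)
    return {res: sum(vals) / len(vals) for res, vals in grouped.items()}


def top_residues(res_scores, n=20):
    """Residues sorted by decreasing score (ties broken by name), top n."""
    return sorted(res_scores, key=lambda r: (-res_scores[r], r))[:n]


# Two conformations, four atoms; labels are hypothetical.
masks = [[1, 1, 0, 0], [1, 0, 0, 1]]
atom_to_residue = ["A1", "A1", "B2", "C3"]
per_atom = atom_scores(masks)
per_residue = residue_scores(per_atom, atom_to_residue)
important = top_residues(per_residue, n=2)
```

In the example of the description the final cut would be n = 20; the smaller value here only keeps the toy data readable.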

Step 4: the selected conformational intermediate is input into FTSite for site prediction, and the highest-scoring site other than the orthosteric site is taken as the potential allosteric site. Specifically:

The k-means cluster centroid structure of the conformational intermediate is taken as the representative structure and uploaded in pdb format to the FTSite website (https://ftsite.bu.edu/). The prediction results are downloaded and visually inspected in PyMOL; the orthosteric site of the protein is excluded, and the highest-scoring allosteric site according to the site score ranking is selected for subsequent study.
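In the description the orthosteric site is excluded by visual inspection. An automated approximation of the same selection rule could look like the sketch below, which walks down the ranked sites and returns the first one that does not overlap the known orthosteric residues. The overlap threshold, function name and residue labels are all hypothetical, and the parsing of the FTSite output is not shown.

```python
def pick_allosteric_site(ranked_sites, orthosteric_residues, max_overlap=0.0):
    """Return (rank, residues) of the best-ranked non-orthosteric site.

    ranked_sites:         list of residue-ID sets, best score first
    orthosteric_residues: set of residue IDs lining the orthosteric pocket
    max_overlap:          largest tolerated fraction of a site's residues
                          shared with the orthosteric pocket (hypothetical)
    Returns None if every predicted site overlaps the orthosteric pocket.
    """
    for rank, residues in enumerate(ranked_sites, start=1):
        if not residues:
            continue
        overlap = len(residues & orthosteric_residues) / len(residues)
        if overlap <= max_overlap:
            return rank, residues
    return None


# Hypothetical residue labels for three predicted sites.
sites = [
    {"res113", "res203", "res207"},
    {"res79", "res286", "res319"},
    {"res60", "res62"},
]
orthosteric = {"res113", "res203", "res207", "res312"}
choice = pick_allosteric_site(sites, orthosteric)
```

Here the top-ranked site coincides with the orthosteric pocket and is skipped, so the second-ranked site is returned as the candidate allosteric site.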

Step 5: for the potential allosteric site predicted in the previous step, the most stably bound small molecule, ZINC5042, is obtained by structure-based virtual screening.

The ligand set used in this example contains 103,862 ligand molecules and consists of two compound datasets, Diverse-lib and Drugs-lib. Diverse-lib is a diverse compound library containing 99,288 diverse drug-like small molecules, which favors the discovery of molecules with new scaffolds. Drugs-lib contains 4,574 commercially available drug molecules, which allows drug repositioning and thus reduces development costs. Both ligand sets are provided by the MTiOpenScreen software that we used. We first obtained approximately 6,000 molecules (including isomers) by preliminary screening with MTiOpenScreen and then evaluated them further by docking with AutoDock 4.2. All docking input files were prepared with the AutoDockTools 1.5.6 software package, and the grid map files of the active site were generated with AutoGrid 4.2. The size of the docking box was set to just cover the allosteric binding site predicted by FTSite, with a grid spacing of […]. Semi-flexible docking (flexible ligand, rigid receptor) was performed. To ensure the accuracy of the results, we carried out 100 docking runs for each ligand, with 1,750,000 energy evaluations per run, using the Lamarckian genetic algorithm, and the conformation with the lowest ligand binding energy was taken as the best binding mode for further analysis.

The molecular mechanics/generalized Born surface area (MM/GBSA) method is an effective tool for obtaining the binding free energy of protein-ligand and protein-protein interactions. We used the MM/GBSA method to calculate the free energy of the ligand-receptor interaction as follows:

$$\Delta G_{binding} = G_{complex} - (G_{receptor} + G_{ligand}) \tag{8}$$

where $G_{complex}$, $G_{receptor}$ and $G_{ligand}$ are the free energies of the receptor-ligand complex, the receptor and the ligand, respectively, each of which can be calculated by the following formulas:

$$G = E_{gas} + G_{sol} - TS \tag{9}$$

$$E_{gas} = E_{int} + E_{ele} + E_{vdw} \tag{10}$$

$$G_{sol} = G_{psolv} + G_{npsolv} \tag{11}$$

where $E_{gas}$ is the gas-phase energy, the sum of the internal energy $E_{int}$, the van der Waals energy $E_{vdw}$ and the electrostatic interaction energy $E_{ele}$. $G_{psolv}$ and $G_{npsolv}$ are the polar and non-polar contributions to the solvation free energy ($G_{sol}$). T is the temperature and S is the total conformational entropy.
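The arithmetic of eqs (8)-(11) is summarized in the sketch below. It only combines energy terms that would be produced by an MM/GBSA calculation; the function names and all numerical values are made up for illustration and are not results of the example.

```python
def free_energy(e_int, e_ele, e_vdw, g_psolv, g_npsolv, temperature, entropy):
    """G = E_gas + G_sol - T*S for one species (eqs 9-11)."""
    e_gas = e_int + e_ele + e_vdw      # eq 10: gas-phase energy
    g_sol = g_psolv + g_npsolv         # eq 11: solvation free energy
    return e_gas + g_sol - temperature * entropy   # eq 9


def binding_free_energy(g_complex, g_receptor, g_ligand):
    """Eq 8: free energy of the complex minus that of the separated partners."""
    return g_complex - (g_receptor + g_ligand)


# Illustrative values in kcal/mol (entropy in kcal/(mol*K)).
g_complex = free_energy(
    e_int=10.0, e_ele=-50.0, e_vdw=-30.0,
    g_psolv=40.0, g_npsolv=-5.0,
    temperature=300.0, entropy=0.01,
)
dg_binding = binding_free_energy(g_complex, g_receptor=-20.0, g_ligand=-5.0)
```

A more negative binding free energy indicates more stable binding, which is the criterion used to rank the candidate ligands.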

All free energy calculations were performed in the SANDER program of AMBER 18.
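As a minimal sketch of how equations (8) to (11) combine, the following Python snippet assembles the binding free energy from per-species energy terms. The `EnergyTerms` container, the function names and all numerical values are hypothetical and serve only to illustrate the arithmetic; in practice the terms come from the AMBER calculation.

```python
from dataclasses import dataclass


@dataclass
class EnergyTerms:
    """Energy terms of one species in kcal/mol (hypothetical container)."""
    e_int: float     # internal energy E_int
    e_ele: float     # electrostatic interaction energy E_ele
    e_vdw: float     # van der Waals energy E_vdw
    g_psolv: float   # polar solvation contribution G_psolv
    g_npsolv: float  # non-polar solvation contribution G_npsolv
    ts: float        # entropy term T*S

    def free_energy(self) -> float:
        e_gas = self.e_int + self.e_ele + self.e_vdw  # eq. (10)
        g_sol = self.g_psolv + self.g_npsolv          # eq. (11)
        return e_gas + g_sol - self.ts                # eq. (9)


def binding_free_energy(complex_: EnergyTerms,
                        receptor: EnergyTerms,
                        ligand: EnergyTerms) -> float:
    """Eq. (8): free energy of the complex minus that of its separated parts."""
    return complex_.free_energy() - (receptor.free_energy() + ligand.free_energy())


# Made-up values, for illustration only.
complex_terms = EnergyTerms(120.0, -310.0, -95.0, -180.0, -12.0, 40.0)
receptor_terms = EnergyTerms(100.0, -250.0, -60.0, -170.0, -10.0, 45.0)
ligand_terms = EnergyTerms(20.0, -30.0, -5.0, -25.0, -1.0, 10.0)

dg_binding = binding_free_energy(complex_terms, receptor_terms, ligand_terms)
print(f"dG_binding = {dg_binding:.1f} kcal/mol")
```

A more negative value of `dg_binding` corresponds to more stable binding, which is the criterion used to rank the candidate ligands.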

Step 6: Dynamic networks are constructed for the β2AR-ZINC5042 complex system and for the ligand-free apo-β2AR system, and community analysis and shortest-path analysis are carried out on these networks. The analysis shows that ZINC5042 binding makes the communication network looser and reduces the efficiency of information transmission, i.e. it exerts a negative allosteric regulation effect.

First, the NetworkView plugin in VMD was used to generate a dynamic network for each system from all of its MD simulation trajectories (4.5 μs in total). In a dynamic network, the Cα atom of each receptor residue and the key atoms of the ligand are represented by nodes, and an edge is added between two nodes if their heavy atoms stay within a cutoff distance of each other for 75% of the sampled time. The correlation value C_ij, which defines the information transfer between two nodes i and j over a given simulation time, is calculated by eqns (12)-(13) from the position vectors of nodes i and j at time t and from their average positions over the given simulation time. The distance d_ij between two nodes i and j in the dynamic network, given by eqn (14), represents the probability of information passing across the edge between them. All calculations were performed with the Carma program.

$$d_{ij} = -\log(|C_{ij}|) \qquad (14)$$
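The network construction described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the NetworkView or Carma implementation: the correlation is taken to be the conventional normalized covariance of the node displacements from their mean positions, the contact test uses the node positions themselves rather than all heavy atoms, the logarithm is taken as natural, and the cutoff value and toy coordinates are hypothetical.

```python
import math


def contact_edges(frames, cutoff, min_fraction=0.75):
    """Node pairs lying within `cutoff` in at least `min_fraction` of the frames."""
    n_frames, n_nodes = len(frames), len(frames[0])
    edges = []
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            close = sum(math.dist(f[i], f[j]) <= cutoff for f in frames)
            if close / n_frames >= min_fraction:
                edges.append((i, j))
    return edges


def correlation(frames, i, j):
    """Normalized covariance of the displacements of nodes i and j (assumed form)."""
    n = len(frames)
    mean_i = [sum(f[i][k] for f in frames) / n for k in range(3)]
    mean_j = [sum(f[j][k] for f in frames) / n for k in range(3)]
    dot = sq_i = sq_j = 0.0
    for f in frames:
        d_i = [f[i][k] - mean_i[k] for k in range(3)]
        d_j = [f[j][k] - mean_j[k] for k in range(3)]
        dot += sum(a * b for a, b in zip(d_i, d_j))
        sq_i += sum(a * a for a in d_i)
        sq_j += sum(b * b for b in d_j)
    return dot / math.sqrt(sq_i * sq_j)  # the 1/n averaging factors cancel


def edge_distance(c_ij):
    """Eq. (14); undefined for c_ij == 0, so apply it only to correlated pairs."""
    return -math.log(abs(c_ij))


# Toy trajectory: 4 frames, 3 nodes, (x, y, z) coordinates (made-up values).
frames = [
    [(0, 0, 0), (1, 0, 0), (2, 0, 0)],
    [(1, 0, 0), (2, 1, 0), (6, 0, 0)],
    [(2, 0, 0), (3, 0, 0), (4, 0, 0)],
    [(1, 0, 0), (2, -1, 0), (6, 0, 0)],
]

edges = contact_edges(frames, cutoff=1.5)  # hypothetical cutoff
weights = {(i, j): edge_distance(correlation(frames, i, j)) for i, j in edges}
print(edges, weights)
```

In this toy trajectory only nodes 0 and 1 stay in contact for at least 75% of the frames, so the network has a single edge; node 2 is close to node 1 in only half of the frames and remains unconnected. Strongly correlated pairs (|C_ij| close to 1) receive a short distance, weakly correlated pairs a long one.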

The thickness of each edge in the dynamic network is scaled according to this distance, with thicker edges indicating stronger correlation. The original network is then divided into sub-networks, called communities, using the Girvan-Newman algorithm; nodes within a community have more and stronger connections with each other than with nodes of other communities.

The shortest paths are generated from the dynamic network matrix by the subopt program. The Floyd-Warshall algorithm is used to find the path with the shortest distance between two nodes; the shortest path is regarded as the most probable or most biologically relevant path, and it therefore reveals potential allosteric regulation pathways and mechanisms. The path length D_ij between two nodes i and j in the dynamic network equals the sum of the lengths of the individual edges along the path between them, and can be calculated by eqn (15):

$$D_{ij} = \sum_{k,l} d_{k,l} \qquad (15)$$
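The shortest-path search can be illustrated with a compact Floyd-Warshall implementation that also records the path itself. This is a generic sketch rather than the code of the program used in this example; the node indices and edge lengths are made up, and the edge lengths play the role of the distances d_ij of eqn (14).

```python
import math


def floyd_warshall(n_nodes, edge_lengths):
    """All-pairs shortest paths on an undirected graph with non-negative edge lengths."""
    dist = [[math.inf] * n_nodes for _ in range(n_nodes)]
    nxt = [[None] * n_nodes for _ in range(n_nodes)]
    for v in range(n_nodes):
        dist[v][v] = 0.0
        nxt[v][v] = v
    for (i, j), d in edge_lengths.items():
        dist[i][j] = dist[j][i] = d
        nxt[i][j], nxt[j][i] = j, i
    for k in range(n_nodes):
        for i in range(n_nodes):
            for j in range(n_nodes):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]
    return dist, nxt


def shortest_path(nxt, i, j):
    """Rebuild the node sequence of the shortest path from i to j."""
    if nxt[i][j] is None:
        return []  # the two nodes are not connected
    path = [i]
    while i != j:
        i = nxt[i][j]
        path.append(i)
    return path


# Made-up network: keys are node pairs, values are edge lengths d_kl.
edge_lengths = {(0, 1): 0.2, (1, 2): 0.3, (0, 2): 0.9, (2, 3): 0.1}
dist, nxt = floyd_warshall(4, edge_lengths)

path = shortest_path(nxt, 0, 3)
path_length = dist[0][3]  # D_ij of eqn (15): sum of edge lengths along the path
print(path, round(path_length, 3))
```

Here the route through nodes 1 and 2 is shorter than the direct edge between nodes 0 and 2, so the path from node 0 to node 3 passes through both intermediates. Because |C_ij| never exceeds 1, the distances d_ij are non-negative and the algorithm needs no handling of negative cycles.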

The effect of an allosteric modulator on the communication between the various regions of the protein can be revealed by the strength of the connections between communities in the community analysis. The shortest path, in turn, reveals how two regions of the protein are connected: its length reflects the efficiency of communication between the two regions, with a longer path meaning lower efficiency, and the residues along the shortest path can be regarded as residues that play an important role in the allosteric regulation of the protein. Such an analysis therefore shows how the potential drug molecule affects the allosteric communication of the protein and through which residues this effect is transmitted, which together constitute the mechanism of the allosteric modulator.

In summary, the invention achieves the following:

(I) Identification of a novel allosteric site of β2AR

First, Gaussian accelerated molecular dynamics (GaMD) simulations totalling 15 μs (5 × 3 μs) were run on the complex of inactive β2AR bound to the endogenous agonist NE (β2AR-NE) to obtain a large conformational ensemble of the receptor and fully capture its flexible motion. The trajectories were clustered according to the RMSD of the receptor backbone atoms and then fed into the MDCNN model for training. The larger the number of clusters, the smaller the differences between clusters and the higher their similarity. Thus, the prediction accuracy of the MDCNN model decreases somewhat as the number of clusters increases, as shown in fig. 7. When the number of clusters is 3, the accuracy of the MDCNN model is 0.903 ± 0.003 and the clustering index indicates good clustering. We therefore calculated the importance score of each β2AR residue from the LIME interpreter results of the three-cluster model and examined the 20 most important residues of each of the three conformational classes. Fig. 2 shows the importance scores and distribution of the 20 important residues identified by the LIME interpreter when distinguishing cluster0 (A in fig. 2), cluster1 (B in fig. 2) and cluster2 (C in fig. 2); residues are labelled with Ballesteros-Weinstein numbering. The important residues of cluster0 and cluster2 are mainly distributed near ECL2 and ECL3. These extracellular loops are highly flexible, so they readily undergo conformational changes and obtain higher scores. In addition, cluster0 has some specific residues around the extracellular end of TM1 and near ICL1.

Surprisingly, the residues specific to cluster1 are mostly distributed in the middle of TM6 and near the intracellular end of TM7. They include a number of molecular switches that influence receptor activation, such as F6.44 of the PIF motif and N7.49 and P7.50 of the NPxxY motif, as well as other important residues such as Y7.43 and N7.45. This suggests that this class of conformations is likely to be an important functional intermediate state. Site prediction was therefore subsequently performed on this functional intermediate state, cluster1.

Studies have shown that the site prediction tool FTSite identifies allosteric sites in proteins with an accuracy of up to 88%. Two representative conformations (conf1 and conf2) were obtained from the cluster1 conformation set by k-means clustering on the RMSD of the receptor backbone atoms; together they represent 70% of the conformations and are therefore highly representative. For both representative conformations, FTSite predicted a similar allosteric site, located between the two important molecular switches W6.48 and D2.50, as shown in fig. 3. However, the predicted pocket of conf1 contains more residues than that of conf2, indicating that the allosteric pocket of conf1 is more open, as shown in fig. 8, where the pocket residues are labelled with Ballesteros-Weinstein numbering and the residues specific to the allosteric pocket of conf1 are underlined. The predicted allosteric site has never been reported for β2AR, so this newly discovered site is a potential allosteric site of β2AR.

(II) Screening of potential allosteric modulators for the predicted site

Virtual screening has been successfully applied to the identification of allosteric modulators, including those of GPCRs. Because protein flexibility is critical to efficient drug design, screening against a combination of two or three conformations of a target protein generally performs better than virtual screening against a single random conformation. We therefore performed a two-channel virtual screening using conf1 and conf2, as shown in fig. 4. The ligand set consists of two parts, Diverse-lib and Drugs-lib, with 103,862 ligand molecules in total. Diverse-lib contains 99,288 chemically diverse molecules, which facilitates the screening of drug molecules with new scaffolds, while Drugs-lib contains 4,574 commercially approved drug molecules, which allows existing drugs to be repurposed. Based on the scores of the two molecular docking methods (AutoDock 4.2 energy and Vina score), and after visual inspection to exclude top-ranked potential false positives, we selected four candidate compounds, as shown in fig. 9; among them, compound ZINC5042 obtained very high scores in the screening against both conformations. Notably, all candidate compounds form hydrogen bonds or ionic bonds with Asp79, suggesting that this interaction is critical for ligand binding and recognition at this allosteric site, as shown in fig. 5, which presents the ligands in their optimal docking poses at the allosteric sites of conf1 (A) and conf2 (B). MD is widely used to study the stability and interaction patterns of complexes and can serve as a post-processing tool for virtual screening to validate and improve docking protocols. Therefore, 120 ns MD simulations of the five ligand complexes were run as post-processing of the virtual screening to verify the binding strength of the ligand-receptor complexes. For ZINC5042, which binds in both conformations, the more stable of the two simulations is used in the subsequent discussion.

The RMSDs show that all systems reached equilibrium within the 120 ns MD simulations and that the RMSD of every ligand except the complex macromolecular ligand ZINC252008995 was less than 2 angstroms. The ligand-receptor complexes obtained by virtual screening are therefore relatively stable, which demonstrates the reliability of the docking protocol. MM/GBSA was used to calculate the free energy of ligand-receptor binding, as shown in fig. 10. The results show that ZINC5042 binds to the receptor most stably, with a binding energy of -54.1681 kcal/mol. Because the four ligands differ greatly in structure, the conformational entropy contribution to the free energy was also calculated. ZINC5042 had the most favorable conformational entropy term, -52.5650 kcal/mol, indicating that its conformation after binding is the most stable. With the entropy change taken into account, its total free energy is -1.6030 kcal/mol, much lower than that of the other three small molecules. ZINC5042 is therefore the most likely potential allosteric modulator and was selected for the subsequent mechanistic studies.

(III) Revealing the allosteric regulation mechanism of the allosteric modulator

Local stimulation induces conformational changes in a protein that cause coupled responses in regions remote from the stimulated region; these long-range coupled responses are central to allosteric modulation. Many experimental and computational studies now suggest that the ability of proteins to undergo conformational transitions results from the network of interactions between residues. Computing a protein dynamic network that contains dynamic correlations from simulated trajectories makes it possible to approximate the allosteric signal strength associated with experimental observations.

Therefore, to understand the allosteric regulation mechanism of ZINC5042, we performed 1500 ns of conventional molecular dynamics simulation on the β2AR-ZINC5042 complex system and on the ligand-free apo-β2AR system, and constructed a dynamic correlation network for each system from the covariance matrix. In a residue community network, nodes within a community are highly interconnected, whereas different communities are only loosely connected: nodes within a single community may communicate over multiple paths, but communication between different communities must pass through a small number of critical edges or interactions. As shown in A of fig. 6, community analysis was performed on apo-β2AR and β2AR-ZINC5042, and the three-dimensional structure of β2AR and the two-dimensional (2D) community residue interaction network are shown for each. The network communities are colored according to their ID numbers. In the 2D plot, each community is represented by a node whose size is determined by the number of residues in the community. Communities containing the allosteric modulator ZINC5042 are represented by diamond-shaped nodes and the others by circular nodes. The connecting edges are drawn as gray lines whose width is proportional to the strength of the information flow between the connected communities. The apo-β2AR system has 10 communities and 8 isolated clusters, whereas the β2AR-ZINC5042 system has 13 communities and 3 isolated clusters. The protein termini and flexible loop regions are more flexible, undergo drastic conformational changes and have fewer interactions with surrounding residues, and therefore tend to form independent clusters. The larger the number of communities, the looser the network and the lower the information transmission efficiency of the system.

Community analysis therefore indicates that when the allosteric ligand ZINC5042 is bound, the dynamic network communities of the receptor become looser and the information transmission efficiency between the regions of the receptor is reduced, which corresponds to a negative allosteric regulation effect.

The Floyd-Warshall algorithm was used to study the shortest path from the key residues of the extracellular binding pocket to the key intracellular molecular switches in the different systems. B of fig. 6 is a schematic diagram of the shortest path from the extracellular ligand binding pocket to the intracellular domain of the receptor for the apo-β2AR and β2AR-ZINC5042 systems; the shortest path is generally considered to be the most probable or most biologically relevant path. The shortest allosteric path of the apo-β2AR system is N312^7.39-Y316^7.43-G320^7.47-P323^7.50-Y326^7.53, and that of the β2AR-ZINC5042 system is N312^7.39-Y316^7.43-G320^7.47-N322^7.49-Y326^7.53. In both systems the shortest path transmits the signal directly into the cell through TM7, and the two paths differ by only one residue. However, the shortest path length is 35 for apo-β2AR and 88 for β2AR-ZINC5042. Since a longer path means lower signaling efficiency, ZINC5042 binding does not significantly change the shortest signaling path between the extracellular and intracellular sides of the receptor but greatly weakens the efficiency of signaling along it, which is unfavorable for receptor activation and is consistent with the conclusion of the community analysis. The dynamic network analysis therefore shows that ZINC5042 binding leaves the shortest signal transduction pathway essentially unchanged but makes the protein network looser, which reduces the efficiency of signaling between the extracellular and intracellular sides and hinders receptor activation. ZINC5042 thus exhibits negative allosteric regulation and is a negative allosteric modulator of β2AR.

Although the present invention has been described herein with reference to the illustrated embodiments thereof, which are intended to be preferred embodiments of the present invention, it is to be understood that the invention is not limited thereto, and that numerous other modifications and embodiments can be devised by those skilled in the art that will fall within the spirit and scope of the principles of this disclosure.

Claims

1. A method for identifying a protein allosteric modulator based on deep learning and computational simulation, comprising:

step S100, acquiring an MD simulation trajectory of a protein complex bound to an endogenous agonist by using Gaussian accelerated molecular dynamics simulation;

step S300, training a classification model MDCNN based on a convolutional neural network (CNN) with the MD simulation trajectory and the clustering labels, wherein the CNN model identifies different conformational states from the MD simulation trajectory, the model interpreter LIME in the MDCNN finds the key structure and key residues of each conformational state while the functional states are identified, and a valuable conformational state is selected for the subsequent allosteric site prediction with the aid of the key residues reported by LIME;

step S400, according to the key residues of each conformational state in the MDCNN, inputting the selected valuable conformational state into the site prediction tool FTSite for allosteric site prediction, and taking the highest-scoring site other than the orthosteric site as a potential allosteric site;

step S500, for the predicted potential allosteric site, obtaining the most stably bound potential allosteric modulator by using a structure-based virtual screening method, and outputting the protein complex structure;

step S600, revealing the allosteric regulation mechanism of the potential drug molecule by means of dynamic network analysis, and confirming the properties of the potential drug molecule.

2. The method for identifying a protein allosteric modulator based on deep learning and computational simulation according to claim 1, wherein the step S100 specifically comprises:

step S110, acquiring a crystal structure of the target protein in the inactive state;

step S120, deleting all components other than the target protein from the crystal structure, and rebuilding the structural regions missing from the crystal structure so that the structure of the target protein is complete;

step S140, protonating the protein and the ligand according to the physiological environment of the target protein, and constructing a simulation system resembling the physiological environment;

step S150, for the constructed simulation system, after system minimization and heating, performing unconstrained conventional molecular dynamics simulation (cMD) under the NPT ensemble to bring the simulation system to a relatively equilibrated state, taking the final structure after cMD equilibration as the initial structure of the Gaussian accelerated molecular dynamics simulation, and starting the Gaussian accelerated molecular dynamics simulation program.

3. The method for identifying a protein allosteric modulator based on deep learning and computational simulation of claim 2, wherein the step S200 specifically comprises:

step S210, extracting protein conformations at intervals from the Gaussian accelerated MD trajectory of the protein complex to form a protein conformation set representing the whole trajectory, and calculating conformational features for distinguishing conformational states;

step S220, clustering the protein conformations with an unsupervised clustering algorithm, using the conformational features as clustering indices, and selecting the optimal clustering result as the labels of the protein conformation set.

4. The method for identifying protein allosteric modulators based on deep learning and computational simulation of claim 3, wherein the step S300 specifically comprises:

step S310, data processing: using the protein conformation set obtained in step S210, superimposing the protein Cα atoms to eliminate overall rotation and translation, deleting all hydrogen atoms, and then converting the coordinates of the remaining atoms into RGB coordinates, thereby obtaining a data set;

step S320, adding labels: reading in the labels of the protein conformation set obtained in step S220 as data labels, wherein the data labels correspond one-to-one to the entries of the data set and indicate the conformation type of each entry;

step S330, data set division: grouping the data to eliminate the influence of the simulation time series, then randomly dividing the data into a training set and a validation set according to a preset proportion, and performing K-fold division on the data set to obtain a K-fold cross-validation data set, wherein K is an integer greater than or equal to 1 and less than 10;

step S340, model construction: training a classification model MDCNN based on a convolutional neural network (CNN) with the data set as input for classifying and identifying protein conformational states, wherein K-fold cross-validation is adopted in the training and validation process, and the performance of the classifier is evaluated by the accuracy ACC;

step S350, constructing a model interpreter: constructing a LIME interpreter to interpret the prediction results of the MDCNN by local linear fitting, and finding the key structure and key residues of each conformational state.

5. The method for identifying a protein allosteric modulator based on deep learning and computational simulation according to claim 4, wherein the step S400 specifically comprises:

step S410, projecting the important residues of each conformational state identified by the MDCNN onto the protein structure, and visually selecting a valuable conformational intermediate state, namely the target conformational intermediate state, by using PyMOL;

step S420, extracting a representative structure from the trajectory of the target conformational intermediate state, saving it in pdb format, uploading it to the FTSite server to obtain three predicted potential ligand binding sites, and taking the highest-scoring site other than the orthosteric site as the potential allosteric site for virtual screening.

6. The method for identifying a protein allosteric modulator based on deep learning and computational simulation of claim 5, wherein the step S500 comprises:

step S510, preparing a small-molecule data set for virtual screening, generating a docking box that completely covers all residues of the potential allosteric site, and starting structure-based virtual screening;

step S520, selecting the molecule that binds most stably to the protein as the potential allosteric modulator, and outputting the structure of the protein-allosteric modulator complex.

7. The method for identifying a protein allosteric modulator based on deep learning and computational simulation according to claim 6, wherein the step S600 specifically comprises:

step S610, constructing a simulation system for the protein-allosteric modulator complex in a physiological environment, and obtaining a simulation trajectory by molecular dynamics simulation;

step S630, analyzing the allosteric regulation mechanism of the potential drug molecule by using a dynamic network.