CN116386764A

CN116386764A - Method for predicting three-dimensional folding and drug molecule binding model of G protein coupled receptor protein

Info

Publication number: CN116386764A
Application number: CN202211727118.5A
Authority: CN
Inventors: 袁曙光; 胡祯全; 王艳萍
Original assignee: Shenzhen Alpha Molecular Technology Co ltd
Current assignee: Shenzhen Alpha Molecular Technology Co ltd
Priority date: 2022-12-30
Filing date: 2022-12-30
Publication date: 2023-07-04

Abstract

The invention discloses a method for predicting a three-dimensional folding and drug molecule binding model of a G protein-coupled receptor (GPCR) protein. The method comprises the following steps: selecting a plurality of initial models of the target GPCR by using three-dimensional sequence alignment and a template three-dimensional structure, and optimizing the flexible regions and the drug-binding site of the GPCR initial model to obtain a target GPCR model for molecular docking and drug design; and selecting the optimal GPCR-drug molecule binding mode by cross-validating the predictive performance of multiple types of GPCR-drug molecule artificial intelligence models. The invention improves the prediction accuracy of both the three-dimensional folding of GPCR drug targets and the binding mode of GPCR drug molecules, and can also accurately capture the interaction pattern between the relevant drug molecules and the GPCR target.

Description

Method for predicting three-dimensional folding and drug molecule binding model of G protein coupled receptor protein

Technical Field

The invention belongs to the technical field of medicines, and particularly relates to a method for predicting a three-dimensional folding and drug molecule binding model of a G protein coupled receptor protein (GPCR).

Background

Membrane proteins are a large class of lipid-soluble proteins embedded in the cell membrane, as opposed to water-soluble proteins. Membrane proteins include GPCRs, ion channels, transport proteins, core proteins, ion pumps, and the like, whereas water-soluble proteins include enzymes, kinases, chaperones, and so on. According to statistics, membrane proteins account for more than 70% of all marketed drug targets, water-soluble proteins for only about 20%, and the remaining targets are nucleic acid macromolecules. GPCRs, also known as seven-transmembrane helix receptors, are the most important membrane protein drug targets: nearly 40% of currently marketed drugs were developed against GPCRs.

Proteins are the basis of life, and their three-dimensional structures directly determine their physiological functions. Resolving and predicting the three-dimensional structure of proteins has therefore become an essential foundation of modern drug development. Although traditional experimental methods (including cryo-electron microscopy, X-ray diffraction and nuclear magnetic resonance) have made great progress in the structural analysis of water-soluble proteins, progress on membrane proteins has been very slow, mainly because the techniques and conditions for expressing and purifying membrane proteins are not yet mature. Resolving a completely new membrane protein structure often takes years. With the rapid development of biotechnology and artificial intelligence (AI), the use of computers to predict protein folding and three-dimensional structure has achieved a major breakthrough. For example, the protein folding prediction tool AlphaFold2, developed by Google, has improved prediction accuracy. Another protein folding tool, RoseTTAFold, can also predict protein three-dimensional structures accurately.

However, with the release of the AlphaFold2 tool and of the associated predicted target structure models, more and more structural biologists have found that AlphaFold2 predicts small water-soluble proteins well but membrane proteins very poorly. In February 2021, Alex T. Brunger et al. published a review article entitled "The protein-folding problem: Not yet solved" in Science, the top international academic journal of the American Association for the Advancement of Science, stating that they do not agree with the AlphaFold2 team's declaration that the protein-folding problem has been completely solved by AI. Subsequently, Michael J.E. Sternberg et al. published the article "The AlphaFold Database of Protein Structures: A Biologist's Guide" in the Journal of Molecular Biology. The article clearly states that, in the evaluation of multiple membrane proteins, the AlphaFold2 predictions show no reliable correlation with the experimental structures. In April 2022, Professor Bryan Roth, a well-known structural biologist and pharmacologist in the GPCR field, commented in the Nature article "What's next for AlphaFold and the AI protein-folding revolution" that his laboratory has solved dozens of GPCR structures that are not yet published; for about half of these structures the AlphaFold predictions are passable, but for many of them the predictions are of no value. For example, for some structures AlphaFold itself reports high confidence, yet the actual experimental structures prove the predictions to be completely wrong.

In July 2022, Professor Xu Huaqiang (Eric Xu) of the Shanghai Institute of Materia Medica, Chinese Academy of Sciences, published an academic paper entitled "AlphaFold2 versus experimental structures: evaluation on G protein-coupled receptors" in the journal Acta Pharmacologica Sinica. The paper states that "AlphaFold2 predicts only the general backbone trend of a GPCR target protein, and its predictions of the critical transmembrane helices and of the drug molecule binding pocket differ significantly from the experimental structures. The AlphaFold2 predictions of GPCR membrane protein structures cannot be used to guide drug development work."

In summary, existing schemes for predicting the three-dimensional structure of GPCRs and the binding mode of drug molecules to GPCRs need to be improved in both the reliability and the application range of their predictions.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a method for predicting a three-dimensional folding and drug molecule binding model of a G protein coupled receptor protein. The method comprises the following steps:

selecting a plurality of initial models of the target GPCR by using three-dimensional sequence alignment and a template three-dimensional structure, and optimizing the flexible regions and the drug-binding site of the GPCR initial model to obtain a target GPCR model for molecular docking and drug design;

selecting the optimal binding mode of the GPCR and the drug molecule by cross-validating the predictive performance of multiple types of GPCR-drug molecule artificial intelligence models;

wherein the input of the GPCR-drug molecule artificial intelligence model is the three-dimensional interaction fingerprint feature of a ligand in a receptor, in which each column represents the number of interactions formed between the ligand and the amino acid of that column, and the output is a root mean square deviation (RMSD) value measuring the distance between the predicted 3D conformation and the true conformation.

Compared with the prior art, the invention has the advantage that it can accurately predict both the three-dimensional folding of G protein-coupled receptors (GPCRs), the most important class of drug targets, and the binding mode of drug molecules to GPCRs, which provides a strong foundation for accurate GPCR drug design.

Other features of the present invention and its advantages will become apparent from the following detailed description of exemplary embodiments of the invention, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.

FIG. 1 is a flow chart of a method for predicting a model of three-dimensional folding of a G-protein coupled receptor protein (GPCR) and drug molecule binding in accordance with one embodiment of the invention;

FIG. 2 is a schematic illustration of a calculation process for prediction and optimization of a three-dimensional folding model of a GPCR according to one embodiment of the invention;

FIG. 3 is a schematic diagram of a GPCR-drug molecule binding AI model building process in accordance with one embodiment of the invention;

FIG. 4 is a schematic representation of renumbering and extending the relative positions of amino acids of GPCRs in accordance with one embodiment of the invention;

FIG. 5 is a schematic diagram of generating an interaction fingerprint matrix according to one embodiment of the invention;

FIG. 6 is a schematic representation of predicted results for 5 brand-new GPCR structures according to an embodiment of the present invention;

FIG. 7 is a graph comparing the prediction results with those of AlphaFold2 and RoseTTAFold according to one embodiment of the present invention;

FIG. 8 is a schematic representation of a model for accurately predicting binding of a complex drug small molecule to a GPCR receptor in accordance with one embodiment of the invention;

FIG. 9 is a graph comparing three-dimensional fold predictions for the orphan receptor GPR158 with the prior art, according to one embodiment of the invention.

Detailed Description

Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless it is specifically stated otherwise.

The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the invention, its application, or uses.

Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.

In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of exemplary embodiments may have different values.

It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.

Referring to FIG. 1, the provided method for predicting a three-dimensional folding and drug molecule binding model of a G protein-coupled receptor (GPCR) protein comprises: step S110, selecting a plurality of initial models of the target GPCR by using three-dimensional sequence alignment and a template three-dimensional structure, and optimizing the flexible regions and the drug-binding site of the GPCR initial model to obtain a target GPCR model for molecular docking and drug design; and step S120, selecting the optimal binding mode of the GPCR and the drug molecule by cross-validating the predictive performance of multiple types of GPCR-drug molecule artificial intelligence models. In brief, the technical scheme of the invention mainly comprises GPCR folding model prediction and optimization, GPCR-drug molecule AI model construction, and GPCR-drug molecule binding mode prediction. A detailed description is given below with reference to the accompanying drawings.

In particular, provided methods for predicting a model of three-dimensional folding of G-protein coupled receptor proteins (GPCRs) and drug molecule binding include the following steps.

Step S1.1, predicting and optimizing a GPCR three-dimensional folding model

Referring to FIG. 2, to obtain an accurate three-dimensional folding model of GPCRs, the present invention employs the following calculation method.

Step S1.1.1, template selection and three-dimensional sequence alignment

First, an experimental structure with high homology is searched for as a homology template for computer modeling at a protein structure database website (www.rcsb.org). After determining the primary amino acid sequence of the target GPCR, the secondary structure of the target GPCR is predicted.

To determine the helical secondary structure, the backbone atoms of each amino acid are first identified, and the hydrogen bonds between amino acids i and i+4 on the polypeptide chain are then calculated, since a helix is held together by these spiraling hydrogen bonds. The central carbon atom (alpha carbon) of every amino acid backbone is labeled "CA" in the structure file. CA is bonded to three main functional groups: the amine group, the carboxyl group and the side chain. Both the carboxyl group and the side chain are covalently bonded to CA through a carbon atom, and the side-chain carbon is labeled "CB" in the structure file, which makes it possible to distinguish the side-chain carbon from the carboxyl carbon. The backbone atoms are thus confirmed by calculating the covalent distances between the carboxyl carbon, the amine nitrogen and the alpha carbon.

Once the three-dimensional coordinates of the hydrogen, nitrogen and oxygen atoms of the polypeptide chain have been obtained and the protein sequence has been confirmed to be longer than 4 amino acids, the backbone hydrogen bonds can be calculated. Because experimental structures lack hydrogen atom information, nitrogen can be used in place of hydrogen when calculating hydrogen bonds. Proline is structurally special among the amino acids, so when the helix sequence is calculated, proline can be added to it automatically. Finally, the calculated helix length must be greater than 2 for a segment to be marked as a helix.
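The helix assignment described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used in the invention: the function name `helix_flags`, the 3.5 angstrom O...N distance cutoff, the reading of "length greater than 2" as a run of at least 3 residues, and the rule that a proline joins a helix when the preceding residue is helical are all assumptions introduced for the example.

```python
import math

def helix_flags(o_xyz, n_xyz, sequence, cutoff=3.5, min_len=3):
    """Flag helical residues from i -> i+4 backbone hydrogen bonds.

    o_xyz / n_xyz: per-residue (x, y, z) of the carbonyl O and the amide N.
    cutoff (angstroms) and min_len are illustrative assumptions.
    """
    n_res = len(sequence)
    flags = [False] * n_res
    if n_res <= 4:  # the chain must be longer than 4 residues
        return flags
    for i in range(n_res - 4):
        # the N of residue i+4 stands in for the missing amide hydrogen
        flags[i] = math.dist(o_xyz[i], n_xyz[i + 4]) <= cutoff
    for i in range(1, n_res):
        # proline is added automatically when it continues a helix (assumed rule)
        if sequence[i] == "P" and flags[i - 1]:
            flags[i] = True
    start = None
    for i in range(n_res + 1):  # discard runs shorter than min_len
        inside = i < n_res and flags[i]
        if inside and start is None:
            start = i
        elif not inside and start is not None:
            if i - start < min_len:
                flags[start:i] = [False] * (i - start)
            start = None
    return flags
```

The function returns one boolean per residue, so contiguous runs of `True` mark the helical segments that would later be compared with the template.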

The transmembrane helix regions, extracellular regions and intracellular regions in the primary sequence are then determined. The hydrophobic and hydrophilic portions of each transmembrane helix (TM) in the primary sequence are analyzed and used as constraints for subsequent modeling. The conserved amino acids of each transmembrane helix and the position of the disulfide bond involving the flexible extracellular loop ECL2 are determined. The primary sequence of the target GPCR is then aligned in three dimensions with the homologous experimental structure, ensuring that all conserved amino acids are correctly aligned.

Step S1.1.2, generating a structural model

Using the three-dimensional sequence alignment and the template three-dimensional structure, a large number of initial models of the target GPCR are generated, for example 50,000 to 100,000. Their structural stability is evaluated with the CHARMM molecular force field energy scoring function, and the model with the lowest energy is selected as the initial model to be optimized.

Step S1.1.3, optimizing the flexible regions and the drug-binding site

Because the non-transmembrane regions of GPCRs contain many flexible regions that lie close to the drug-binding site, these regions must be structurally optimized. First, the initial model output by step S1.1.2 is used to construct a physiological cell membrane environment with the GROMACS software: the GPCR is embedded in the cell membrane, and the intracellular and extracellular sides of the membrane are filled with water molecules containing 0.15 mol/L NaCl, so that the system mimics real physiological conditions. Then, position restraints are applied to the backbone of the transmembrane helical regions of the GPCR and a long-timescale all-atom molecular dynamics simulation is performed (simulation length 500-1000 ns). The molecular force field used for the simulation may be the CHARMM or the Amber force field. Finally, cluster analysis is carried out on the conformations from the last 50 ns of the simulation, and a representative structure of the largest cluster is selected for energy minimization and used as the final model of the target GPCR for subsequent molecular docking and drug design.
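The final selection step, picking a representative structure from the largest cluster, might look like the sketch below. The text does not name a clustering algorithm, so the neighbor-count criterion used here (the frame with the most neighbors within an RMSD cutoff is the center of the largest cluster) is an assumption, as are the function name `largest_cluster` and the cutoff value. The pairwise RMSD matrix between trajectory frames is assumed to have been computed beforehand.

```python
def largest_cluster(rmsd_matrix, cutoff):
    """Return (representative_frame, member_frames) of the most populated cluster.

    rmsd_matrix[i][j] is the pairwise RMSD between frames i and j.
    The frame with the most neighbours within `cutoff` is taken as the
    representative (an assumed, neighbour-count based criterion).
    """
    n = len(rmsd_matrix)
    neighbours = [
        [j for j in range(n) if rmsd_matrix[i][j] <= cutoff] for i in range(n)
    ]
    centre = max(range(n), key=lambda i: len(neighbours[i]))
    return centre, neighbours[centre]
```

The returned frame index would then be the structure passed on to energy minimization.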

Step S1.2, constructing a GPCR-drug molecule AI model

Referring to FIG. 3, the construction of the GPCR-drug molecule AI model includes the following steps:

Step S1.2.1, data acquisition, cleaning and augmentation

First, the experimental structures of all GPCRs and their drug molecule complexes are downloaded from the public PDB database, together with the UniProt ID of each corresponding target. Polypeptide molecules with more than 10 amino acids are deleted from the structures. All GPCR structures are structurally aligned onto a single GPCR structure; the selected alignment template has PDB ID 5C1M.

All molecular structure information reported for the relevant GPCR targets, together with the corresponding bioactivity information (including Kd, Ki, EC₅₀ and IC₅₀ values), is downloaded from PubChem, ChEMBL, PDBbind, BindingDB and some commercial databases. Molecules with an activity value greater than 1 μM in the relevant database are deleted, leaving 148,733 compounds.
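The activity filter can be illustrated with a short sketch. The record layout (dictionaries with `value` and `unit` keys), the unit table and the function name `keep_potent` are hypothetical; the only element taken from the text is the 1 μM threshold.

```python
# Conversion factors to micromolar; the set of units is an assumption.
UNIT_TO_MICROMOLAR = {"nM": 1e-3, "uM": 1.0, "mM": 1e3}

def keep_potent(records, max_um=1.0):
    """Keep records whose Kd/Ki/EC50/IC50 value is at most max_um micromolar."""
    kept = []
    for rec in records:
        value_um = rec["value"] * UNIT_TO_MICROMOLAR[rec["unit"]]
        if value_um <= max_um:
            kept.append(rec)
    return kept
```

Converting every measurement to a single unit before filtering avoids silently mixing nanomolar and micromolar values from different databases.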

Step S1.2.2, preparing model data

Step S1.2.2.1, molecular clustering

The molecules are docked into the experimental structures by computational cross-docking (several molecular docking programs are run in parallel and the intersection of their results is taken). The ligand in the original experimental structure serves as the reference ligand for docking. The root mean square deviation (RMSD) of the atom-to-atom distances between the new molecule and the ligand in the experimental structure is calculated. Clustering according to the RMSD values yields 20+1 different classes. Finally, the structures of the new molecules in complex with the relevant GPCRs are obtained and energy-minimized with GROMACS.
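For reference, the RMSD between two sets of atoms can be computed as in the sketch below. It assumes that the atoms of the docked molecule have already been matched one-to-one with atoms of the reference ligand and that both are in the same coordinate frame; how that matching is done is not specified in the text, and the function name `rmsd` is illustrative.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two matched lists of (x, y, z) atoms."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))
```

A value of 0 means the two poses coincide; larger values indicate a pose that deviates further from the reference ligand.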

Step S1.2.2.2, renumbering the GPCR amino acids

GPCR amino acids are typically numbered with the Ballesteros-Weinstein scheme, which defines the position of an amino acid relative to its transmembrane helix rather than to the entire protein sequence. For each transmembrane helix, a reference amino acid is chosen first, preferably the motif residue closest to the middle of the helix, and is annotated as position 50 of that helix. The numbering is then extended over the entire transmembrane sequence, as shown in FIG. 4. The transmembrane length must be tightly controlled. When a transmembrane sequence is too short, amino acids are added at its beginning or end, depending on the required number of amino acids before and after the reference residue. Conversely, when a transmembrane sequence is too long, amino acids are deleted from its beginning or end under the same restrictions.
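The renumbering can be sketched as a mapping from sequence indices to labels of the form "helix.position". The function name `bw_labels` and the fixed window sizes `n_before` and `n_after` are assumptions; they stand in for the length control described above, whose actual limits are not given in the text.

```python
def bw_labels(helix, ref_index, n_before, n_after):
    """Map sequence indices to Ballesteros-Weinstein labels such as '3.50'.

    ref_index is the sequence index of the reference residue (position 50);
    n_before / n_after fix how many residues are labelled on each side, so
    every receptor yields the same set of columns for a given helix.
    """
    return {
        idx: f"{helix}.{50 + idx - ref_index}"
        for idx in range(ref_index - n_before, ref_index + n_after + 1)
    }
```

For example, `bw_labels(3, 131, 2, 2)` labels residues 129 to 133 as 3.48 through 3.52. Using a fixed window per helix is one way to make fingerprints from different receptors line up column by column.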

Step S1.2.2.3, calculating GPCR-drug molecule interaction fingerprint

The molecular interactions between the receptor and the ligand can be characterized by several parameters, including the type of interaction, the amino acids involved, the atoms of the amino acid and of the ligand that take part, and the three-dimensional positions (x, y and z) of those atoms. The interaction types include hydrophobic contacts, hydrogen bonds, water bridges, salt bridges, pi-stacking interactions, pi-cation interactions and halogen bonds.

After the calculation, two interaction fingerprint matrix files are generated. Their data fall broadly into three types of information: the identifier of the complex structure, the count of each interaction type, and the interaction count of each amino acid. The two result files differ in how amino acids are presented in the third type of information; see FIG. 5, in which the interaction fingerprint is numbered with the Ballesteros-Weinstein scheme and covers amino acids both with and without interactions.
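A per-residue fingerprint matrix of this kind could be assembled as sketched below. The input layout (a list of `(residue_label, interaction_type)` pairs per complex), the function name `fingerprint_rows` and the example labels are hypothetical; detecting the interactions themselves is assumed to have been done by a separate tool.

```python
from collections import Counter

def fingerprint_rows(contacts_by_complex, columns):
    """Count interactions per residue column for every complex.

    contacts_by_complex: {complex_id: [(residue_label, interaction_type), ...]}
    columns: ordered residue labels (e.g. Ballesteros-Weinstein numbers).
    Residues without any interaction get a count of 0.
    """
    rows = {}
    for complex_id, contacts in contacts_by_complex.items():
        counts = Counter(residue for residue, _kind in contacts)
        rows[complex_id] = [counts[col] for col in columns]
    return rows
```

Keeping zero-count columns in the output is what makes the matrix sparse but rectangular, so every complex has the same feature layout.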

Step S1.2.3, establishing an AI model

Step S1.2.3.1, selecting a feature

The input X is the 3D interaction fingerprint (three-dimensional interaction fingerprint) of a ligand in a receptor; each column holds the number of interactions formed between the ligand and the amino acid of that column, as an integer. The output y is an RMSD value measuring the distance between the predicted 3D conformation and the true conformation, as a continuous floating-point number. The fingerprint features generated in step S1.2.2.3 for the 83,852 complexes are sparse, so to help the machine learning model learn them, a brief feature screening and dimension reduction is performed. The final data dimension is 83,852 rows by 326 columns.

Step S1.2.3.1.1 screening features based on standard deviation

After the data are normalized, the standard deviation of each feature column is calculated. The standard deviation reflects how dispersed a feature is across samples: a small standard deviation means that the feature differs little between samples, stays close to its mean, and is unlikely to have a specific influence on the target value. For example, with 0.05 as the threshold, columns whose standard deviation is less than 0.05 are removed.
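The standard-deviation screen might be sketched as follows. The text does not say which normalization precedes the screen, so min-max scaling to [0, 1] is an assumption here, as is the function name `low_variance_mask`; the 0.05 threshold is the example value given above.

```python
from statistics import pstdev

def low_variance_mask(rows, threshold=0.05):
    """Return one boolean per column: True if the column should be kept.

    Each column is min-max scaled to [0, 1] (assumed normalization) before
    its population standard deviation is compared with the threshold.
    """
    keep = []
    for column in zip(*rows):
        lo, hi = min(column), max(column)
        span = (hi - lo) or 1.0  # constant columns scale to all zeros
        scaled = [(v - lo) / span for v in column]
        keep.append(pstdev(scaled) >= threshold)
    return keep
```

Columns that are constant, or non-zero for only a handful of complexes, fall below the threshold and are dropped.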

Step S1.2.3.1.2, screening features based on relevance

The degree of correlation between two features can be measured by the correlation coefficient between them, whose value lies in the interval [-1, 1]. For example, the Pearson correlation coefficient is calculated between the columns of the molecular fingerprint; if the coefficient of two columns is large (for example, greater than 0.9), the two columns are considered strongly correlated and one of them is removed. The calculation method is as follows:

$$\rho_{X,Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$

where X and Y are two fingerprint columns, μ is the mean of a fingerprint column, σ is the standard deviation of a fingerprint column, and E is the expected value.
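A correlation screen of this kind can be sketched in a few lines. The greedy order (keep a column only if it is not strongly correlated with any column already kept) and the function names are assumptions; the 0.9 threshold is the example value from the text. The sketch also assumes that constant columns were removed beforehand, since their standard deviation of zero would make the coefficient undefined.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def uncorrelated_columns(rows, threshold=0.9):
    """Return indices of columns to keep, dropping the later one of each
    pair whose correlation coefficient exceeds the threshold."""
    columns = list(zip(*rows))
    kept = []
    for j, col in enumerate(columns):
        if all(pearson(col, columns[k]) <= threshold for k in kept):
            kept.append(j)
    return kept
```

Which column of a correlated pair is dropped is arbitrary in this sketch; it simply keeps the one that appears first.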

Step S1.2.3.1.3, Z-score normalization of the screened samples

The screened sample fingerprint x is normalized so that the processed data have a mean of 0 and a standard deviation of 1. For example, the normalization method is:

$$z = \frac{x - \mu}{\sigma}$$

where μ is the mean of the fingerprint column and σ is the standard deviation of the fingerprint column.

Finally, a total of 322 columns of fingerprints are screened from 875 columns of fingerprints as features of the machine learning model input.
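The Z-score normalization applied to the screened columns can be written as a short sketch. Using the population standard deviation and the function name `zscore_columns` are assumptions, and the sketch presumes that no column is constant, which the standard-deviation screen should already guarantee.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Scale every column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) for c in columns]
    return [
        [(v - mu) / sigma for v, mu, sigma in zip(row, mus, sigmas)]
        for row in rows
    ]
```

In a train/test setting the means and standard deviations would normally be computed on the training set only and reused for the test set.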

Step S1.2.3.2, establishing a regression model

In one embodiment, the regression model may be built with the Scikit-learn, XGBoost, Auto-sklearn, H2O or PyCaret tools.

Scikit-learn contains a variety of classification, regression and clustering algorithms, including machine learning algorithm models such as support vector machines, random forests, K-means, and the like.

Auto-sklearn includes 12 machine learning models: adaboost, ard_regression, decision_tree, extra_trees, gaussian_process, gradient_boosting, k_nearest_neighbors, liblinear_svr, libsvm_svr, mlp (Multi-layer Perceptron), random_forest and sgd (Stochastic Gradient Descent).

The AutoML function of H2O can be used to automatically train and tune many models within a user-specified time limit. H2O provides a number of explainability methods for AutoML objects (sets of models) and for individual models; the explanations can be generated automatically, and a simple interface is provided for exploring and interpreting AutoML models. To build a predictive model with H2O, the training set is passed to the regression module of the H2O toolkit. The model is built with a hold-out strategy: the data are divided into an 80% training set (on which 5-fold cross-validation is also used) and a 20% test set, and the run time is set to 1, 2, 3, 12 and 24 hours.

PyCaret is a low-code Python machine learning library, inspired by the popular caret library in R, that integrates many machine learning methods. With a few lines of code and minimal manual effort it covers the workflow from data preprocessing through modeling to model deployment. The ability to compare and tune many models with simple commands also improves efficiency and productivity while reducing the time needed to create useful models. The PyCaret team added NVIDIA GPU support in version 2.2, including the latest functionality from RAPIDS. With GPU acceleration, PyCaret modeling can be 2 to 200 times faster, depending on the workload. To build a predictive model with PyCaret, the entire dataset is passed to the regression module of PyCaret 2.2, which splits it into a training set and a test set containing 80% (67,081) and 20% (16,771) of the records, respectively. All 19 regression models in the available machine learning libraries and frameworks are trained on the training set and ranked according to their R² scores.

Step S1.2.3.2.1, splitting data

To build the RMSD regression model for GPCR complexes more accurately, the existing data are split for model training and validation. In the experiment, the data are split and trained with 5-fold cross-validation, as follows: the data are divided equally into 5 groups; each group is used once as the validation set while the other 4 groups serve as the training set; cycling through the 5 groups yields 5 models, and the average RMSD error is the mean of the errors of these 5 models.

In practice, this is done with the KFold class of the scikit-learn toolkit. The parameter shuffle is set to True so that the data are shuffled before being divided, which makes the result more reliable, and the random_state parameter is set to ensure that the experiment is reproducible.
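To make the splitting behavior concrete, the sketch below reproduces the idea of a shuffled, seeded 5-fold split using only the Python standard library. It is not scikit-learn's implementation, and the function name `kfold_indices` is illustrative; in scikit-learn the corresponding object would be created with `KFold(n_splits=5, shuffle=True, random_state=...)`.

```python
import random

def kfold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, valid_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        valid = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, valid
```

Fixing the seed plays the same role as `random_state`: rerunning the split gives identical folds, so reported errors can be reproduced.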

Step S1.2.3.2.2, adjusting parameters

The parameters of the model are tuned with grid search in Scikit-learn. Grid search is an exhaustive parameter tuning method: every combination of the candidate parameters is tried in a loop, and the best-performing parameter set is taken as the final parameters of the model. The candidate parameter ranges are defined from routine experience; the parameters to be tuned and their ranges differ between models, and the specific models and selected parameter ranges are described in the model selection step below.
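The exhaustive search itself is simple enough to sketch without any library. The function name `grid_search` and the callback `score_fn` (which would typically return a mean cross-validated score for a given parameter set) are assumptions for illustration; scikit-learn provides the same idea ready-made as `GridSearchCV`.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every parameter combination and return (best_params, best_score).

    param_grid: {parameter_name: [candidate values]}
    score_fn:   callable taking a parameter dict and returning a score,
                where higher is better (e.g. mean cross-validated R^2).
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Because the number of combinations grows multiplicatively with each added parameter, candidate ranges are usually kept short, which matches the reliance on routine experience described above.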

Step S1.2.3.2.3, selecting a model

In accordance with the object of the invention, the RMSD values of GPCR complexes are predicted from the calculated molecular fingerprints, and a regression model is built to fit the relationship between the feature values and the target value. With linear regression as the baseline, a variety of regression models were tried, including ridge regression, LASSO, support vector machines, decision trees, random forests, K-nearest neighbors and XGBoost. Each type of model is described below.

(a) Linear regression (Linear Regression)

Linear regression models the relationship between the molecular fingerprint and the RMSD with a least-squares function. It is the simplest and fastest model and is used as the baseline for comparison.

Executing tools: sklearn.linear_model.LinearRegression

Parameter adjustment range: none

(b) Support vector machine (SVR)

The support vector machine is a supervised learning model that can be used for both classification and regression. For regression, its basic principle is to fit a hyperplane such that as many data points as possible lie within a margin around it.

Executing tools: sklearn.svm.LinearSVR

(c) Ridge regression (Ridge) and LASSO regression

Ridge and LASSO are both regularized multiple linear regression models. The difference is that LASSO replaces the L2-norm penalty term of the loss function with an L1-norm penalty, so that unimportant regression coefficients can be shrunk to 0; this eliminates variables and can make the prediction more accurate.

Executing tools: sklearn.linear_model.Ridge, sklearn.linear_model.Lasso

(d) Decision tree (Decision Tree)

The decision tree is a basic classification and regression method with a tree structure. Regression decision trees mainly refer to the CART (classification and regression tree) algorithm, in which each internal node tests a feature with a "yes" or "no" outcome, giving a binary tree.

Executing tools: sklearn.tree.DecisionTreeRegressor

(e) K Nearest Neighbor (KNN)

The K-nearest neighbor regression algorithm finds the K samples closest to a compound and assigns the average of their target values to that compound, giving its RMSD value.

Executing tools: sklearn.neighbors.KNeighborsRegressor

(f) Random Forest (Random Forest)

A random forest consists of many independent decision trees. Each tree is trained in parallel on randomly sampled compounds and fingerprint columns, and the regression prediction of the whole forest is the average of the results of all trees.

Executing tools: sklearn.ensemble.RandomForestRegressor

(g)XGBoost

XGBoost is also an ensemble model based on regression decision trees, in which multiple associated decision trees predict jointly. Unlike in a random forest, in XGBoost the input to each new decision tree depends on the training and predictions of the preceding trees.

Executing tools: XGBRegressor (Python API)

Step S1.2.3.2.4, selecting an evaluation index for measuring the performance of the model

(a) Coefficient of determination (R² score)

The coefficient of determination is a commonly used score for regression models and serves as a measure of a model's predictive ability. It represents the proportion of the variance of the target variable that is explained by the predicted values; the closer it is to 1, the better the prediction. It is calculated as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

where $\hat{y}_i$ is the predicted RMSD value, $y_i$ is the true value and $\bar{y}$ is the mean of the true values.

(b) Mean absolute error (Mean Absolute Error, MAE)

The mean absolute error is the average distance between the model's predicted values and the true sample values; the smaller the value, the better the prediction. It is calculated as follows:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

(c) Mean square error (Mean Square Error, MSE)

The mean square error is the average of the squared distance between the model's predicted values and the true sample values; the smaller the value, the better the prediction. It is calculated as follows:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
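The three evaluation indices can be computed together, as in the following sketch. The function name `regression_metrics` and the dictionary keys are illustrative; the formulas are the standard definitions of R², MAE and MSE.

```python
def regression_metrics(y_true, y_pred):
    """Return the R^2, MAE and MSE of predictions against true values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return {"r2": 1.0 - ss_res / ss_tot, "mae": mae, "mse": ss_res / n}
```

Note that MAE and MSE depend on the scale of the target values, so they are only comparable between models evaluated on identically scaled targets, whereas R² is scale-free.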

Step S1.2.3.2.5, analyzing the test results

The test results are shown in Table 1. Among all tested models, PyCaret, H2O, auto-sklearn, and XGBoost perform very well on the R², MAE, and MSE scores.

Table 1: Experimental results

Model	Train R²	Test R²	MAE	MSE
auto-sklearn	0.9377	0.8317	1.0488	2.4206
H2O	0.9676	0.8485	1.0230	2.1430
PyCaret	0.9970	0.8543	0.9106	2.0761
XGBoost (baseline model)	0.9968	0.8351	0.2675	0.1649
Linear Regressor	0.5060	0.5007	0.5564	0.4992
SVR (linear)	0.4929	0.4878	0.5501	0.5121
Ridge	0.5060	0.5008	0.5564	0.4992
Lasso	0.4843	0.4812	0.5676	0.5187
Decision Tree	0.7215	0.6594	0.4129	0.3405
RF	0.7716	0.7204	0.3830	0.2796
KNN	0.8603	0.7797	0.2997	0.2203

Step S1.3, GPCR-drug molecule binding mode prediction

Step S1.3.1, prediction process

First, a reliable three-dimensional folding model of the relevant GPCR target is obtained using the algorithm and procedure in FIG. 2. Once the chemical structure of the small-molecule drug under study has been obtained, it is docked by the method in FIG. 3 and 100 different conformations are generated. Finally, cross prediction is carried out with the AI models from S1.2, such as PyCaret, H2O, auto-sklearn, and XGBoost, and the intersection of their predictions is selected as the final optimal predicted binding mode.

Step S1.3.2, predicting the result

To further verify whether the three-dimensional folding of GPCRs and the binding modes of related drug molecules to GPCRs can be accurately predicted, and thereby provide theoretical guidance for GPCR drug development, structure predictions were carried out for 5 brand-new GPCR targets (the APJ, GPR139, KOR, NPY1R, and NMUR2 receptors). Compared with the experimental structures, the root-mean-square deviation of the atom-to-atom distances (RMSD; the lower the value, the higher the accuracy) is very small, so the model accuracy is very high. The RMSD of the overall structural backbone was:

The RMSD of the backbone in the critical transmembrane helical region was:

FIG. 6 shows the predictions for the 5 brand-new GPCR structures, where the dark models are the experimental structures and the light models are the predictions of the present invention. As can be seen from FIG. 6, each predicted model is completely consistent with the experimental structure after superposition.
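The RMSD used in this comparison can be sketched in plain Python. The sketch assumes that the two structures have already been superposed and that their backbone atoms are listed in the same order; the function name and the toy coordinates are illustrative and are not taken from the experiments.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered lists of (x, y, z) atoms."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("both structures need the same non-zero number of atoms")
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))


# Toy example: every predicted atom is shifted by 1 unit along z.
experimental = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)]

deviation = rmsd(predicted, experimental)  # every atom is 1 unit away, so RMSD = 1.0
```

Restricting both coordinate lists to the atoms of the transmembrane helices gives the RMSD of that region alone.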

FIG. 7 compares the predictions of the present invention with those of AlphaFold2 and RosettaFold. The AlphaFold2 and RosettaFold models were downloaded from the official AlphaFold2 model database website (https://alphafold.ebi.ac.uk) and the official RosettaFold GPCR model website (http://www.rosettaGPCR.org), respectively, and compared with the experimental structures. In terms of RMSD, the accuracy of the present invention on the 5 brand-new GPCR target structures is 60% higher than that of AlphaFold2 and 100% higher than that of RosettaFold.

In addition, the AI model of the present invention also accurately predicts the binding mode of drug molecules to GPCRs. Prediction for the KOR receptor is currently very challenging because its small-molecule drug is very large. It has been verified that the present invention successfully captures all important interactions of this complex drug molecule, and the predicted results are highly consistent with the experimental structure. As shown in FIG. 8, all critical interactions in the experimental structure (left panel) are captured by the prediction of the present invention (right panel).

To further verify the reliability of the present invention, predictions were also made for a more complex, brand-new Class C GPCR. FIG. 9 shows the three-dimensional folding prediction results for the orphan receptor GPR158, where the dark model represents the experimental structure and the light model represents the predicted structure; the left panel corresponds to the present invention, the middle panel to AlphaFold2, and the right panel to RosettaFold. The experimental results show that the RMSD between the model predicted by the present invention and the experimental structure is only a minor difference, whereas the RMSDs of the AlphaFold2 and RosettaFold predictions are far larger: those models differ greatly from the experimental structure, and even their secondary structure is predicted incorrectly.

In summary, the present invention provides a novel method for predicting the three-dimensional folding of G protein-coupled receptors (GPCRs), the most important class of drug targets, and the binding modes of GPCR drug molecules. Experiments show that the prediction accuracy of the present invention is higher than that of Google's AlphaFold2 and that the interaction modes between drug molecules and GPCR targets can be accurately captured, thereby providing a solid foundation for structure-based drug design against GPCR targets.

The present invention may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present invention.

The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disk read-only memory (CD-ROM), digital versatile disks (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. Computer readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.

The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.

Computer program instructions for carrying out operations of the present invention may be assembly instructions, instruction set architecture (ISA) instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++, and Python, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), with state information of the computer readable program instructions; the electronic circuitry can then execute the computer readable program instructions.

Various aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, implementation by software, and implementation by a combination of software and hardware are all equivalent.

The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims

1. A method for predicting a model of three-dimensional folding of G protein-coupled receptor proteins and drug molecule binding comprising the steps of:

2. The method of claim 1, wherein the molecular docking and drug design target GPCR model is obtained according to the steps of:

selecting, according to homology, an experimental structure as a homology template for computer modeling; performing a three-dimensional sequence alignment of the primary sequence of the target GPCR against the homologous experimental structure; determining the primary amino acid sequence of the target GPCR; and predicting the secondary structure of that primary sequence, so as to obtain a plurality of target GPCR initial models;

evaluating the structural stability of the GPCR initial model through a CHARMM molecular force field energy scoring function, and selecting the model with the lowest energy as an initial model to be optimized;

and performing all-atom molecular dynamics simulation on the initial model to be optimized, performing cluster analysis on the conformations within a set time range of the simulation, and selecting a representative structure as the target GPCR model for molecular docking and drug design, wherein the molecular force field used in the simulation is the CHARMM or Amber force field.

3. The method of claim 1, wherein the GPCR-drug molecule artificial intelligence model is constructed according to the steps of:

obtaining, from public databases, the experimental structures of GPCRs and their drug molecule complexes, together with the molecular structure information and bioactivity information of the related GPCR targets, and screening a plurality of sample compounds based on the bioactivity information;

calculating a fingerprint of the interaction of the GPCR and the drug molecules for the sample compound to obtain molecular interaction information between the receptor and the ligand;

establishing the GPCR-drug molecule artificial intelligence model based on the molecular interaction information between the receptor and the ligand, and determining the input and output of the model.

4. The method of claim 1, wherein the GPCR-drug molecule artificial intelligence model comprises one or more of a ridge regression model, a Lasso model, a support vector machine model, a decision tree model, a random forest model, a K-nearest neighbor model, and an XGBoost model.

5. The method of claim 1, wherein the input features of the GPCR-drug molecule artificial intelligence model are determined according to the steps of:

performing preliminary feature screening and dimension reduction on the molecular fingerprint features generated from a plurality of complexes to obtain dimension-reduced molecular fingerprint feature data;

normalizing the dimension-reduced molecular fingerprint feature data, calculating the standard deviation of each feature column, and removing the columns whose standard deviation is smaller than a set standard deviation threshold to obtain first-screened feature data;

calculating the correlation coefficients between the features of the first-screened feature data, and removing one column of each feature pair whose correlation coefficient is larger than a set correlation coefficient threshold to obtain second-screened feature data, wherein the correlation coefficient represents the degree of correlation between two features and takes values in the interval [-1, 1];

using the second-screened feature data, after Z-score normalization, as the input features of the GPCR-drug molecule artificial intelligence model.

6. The method of claim 5, wherein the correlation coefficient threshold is set to 0.9 and the correlation coefficient is a Pearson correlation coefficient calculated by:

7. The method of claim 1, wherein in cross-validation of the GPCR-drug molecule artificial intelligence model, the predictive performance of the model is assessed using the following criteria:

a coefficient of determination, representing the squared degree of correlation between the predicted value and the actual value of the target variable;

a mean absolute error, representing the average distance between the model predicted values and the true sample values;

a mean square error, representing the average of the squared distance between the model predicted values and the true sample values.

8. The method of claim 3, wherein, for the fingerprint of the interaction between the GPCR and the drug molecule, the molecular interaction between receptor and ligand is characterized by the type of interaction, the amino acids involved in the interaction, the atoms of the amino acids and the ligand involved in the interaction, and the three-dimensional positions of those atoms, wherein the types of interaction comprise: hydrophobic contacts, hydrogen bonds, water bridges, salt bridges, pi-stacking interactions, pi-cation interactions, and halogen bonds.

9. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor realizes the steps of the method according to any of claims 1 to 8.

10. A computer device comprising a memory and a processor, on which memory a computer program is stored which can be run on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 8 when the computer program is executed.