CN111383708B

CN111383708B - Small molecular target prediction algorithm based on chemical genomics and application thereof

Info

Publication number: CN111383708B
Application number: CN202010165489.3A
Authority: CN
Inventors: 曹东升; 杨素青
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2020-03-11
Filing date: 2020-03-11
Publication date: 2023-05-12
Anticipated expiration: 2040-03-11
Also published as: CN111383708A

Abstract

The invention discloses a small molecule target prediction algorithm based on chemical genomics. A model established by the algorithm can be used for target prediction of small molecules, and the construction method of the prediction model comprises modeling data collection, batch division of positive and negative data sets, combination of ligand and protein features, model construction and the like. Given a query molecule, the algorithm obtains a ranked list of predicted targets through the model; the higher a target ranks in the list, the more likely it is to be a real target. By combining a plurality of models built on different aspects of information into a consensus model, the algorithm obtains stable and robust target prediction performance. The method is applied to the prediction of small molecule targets, and the prediction accuracy is high.

Description

Small molecular target prediction algorithm based on chemical genomics and application thereof

Technical Field

The invention relates to the technical field of agricultural biology, in particular to a small molecular target prediction algorithm based on chemical genomics and application thereof.

Background

The interaction between a drug and macromolecules such as proteins is an important precondition for the drug to take effect. In the drug discovery and development stage, target determination is the basis of modern new drug development. Determining drug targets helps to elucidate the clinical use of drugs more thoroughly. This is especially important for natural products of Chinese medicine and will favor the global development of Chinese medicine. For example, the anti-tumor effect of a Chinese medicine containing bufalin has been shown to result from its main component, bufalin, inhibiting Na⁺/K⁺-ATPase. Determining the off-targets of a drug aids its structural modification, helps optimize its selectivity, and leaves more room for its development. The discovery of new targets of drug action facilitates drug repositioning. Because bringing new drugs to market is difficult, drug repositioning has become the most cost-effective route to marketing drugs today. For example, sildenafil, originally developed for treating angina pectoris, treats male erectile dysfunction through inhibition of its target PDE5 and has become a widely used drug.

Experimental target validation is difficult because testing large numbers of protein targets is costly in both time and money. In contrast, computational target prediction has gained favor in recent years as an auxiliary means. By picking out, from a huge protein space, the few proteins most likely to act on a query compound, such methods enrich candidate proteins while maintaining a high recovery rate, and so reduce the experimental burden. Computational methods fall mainly into two types: those based on protein structures and those based on ligand structures. Protein structure-based methods use the interaction between the compound and the protein to search for targets, but their need for the three-dimensional structure of the protein narrows their range of application. Ligand structure-based methods assign targets by similarity between ligands, but they do not consider the influence of protein information on the prediction. Moreover, if the active molecules of a target are too few or too structurally uniform, the prediction is unreliable. Thus, a new method is needed to compensate for these drawbacks of ligand-based methods.

In recent years, the more sophisticated approach of chemical genomics has developed. It predicts compound-target interactions by combining the features of the compound with those of the protein. The method explores the small molecule space and the protein space simultaneously: ligand data of similar targets are shared across the whole model, and the inclusion of protein features gives the protein a voting weight in target prediction. These properties remedy the deficiencies of the methods described above. However, existing chemical genomics methods are limited to predicting individual drug-target interaction pairs and cannot provide a ranked list of predicted targets, which greatly limits their development in the field of drug target prediction. Accordingly, the present invention aims to exploit the value of chemical genomics in target prediction and to make the target prediction performance of this approach available to a broad range of researchers.

Disclosure of Invention

The technical problem to be solved by the invention is to overcome the defects of the prior art by providing a small molecule target prediction algorithm based on chemical genomics and an application thereof. A model established by the algorithm can be used for target prediction of small molecules: given a query molecule, the model produces a ranked list of predicted targets, and the higher a target ranks in the list, the more likely it is to be a real target. The invention overcomes the defects of existing computer-aided drug target prediction methods, realizes the prediction of human protein targets for small molecules, and establishes an innovative protein target prediction method using a high-quality human protein target data set.

In order to achieve the above purpose, the invention provides a small molecular target prediction algorithm based on chemical genomics, which is constructed by adopting the following method:

S1, collecting modeling data: collecting human protein data, and taking as modeling samples the ligand-protein interaction pairs formed by the collected human proteins and the ligands that interact with them;

S2, batch division of positive and negative data sets: taking ligand-protein interaction pairs with activity values lower than 0.1 μM in the modeling samples as positive samples and ligand-protein interaction pairs with activity values higher than 0.1 μM as negative samples;

S3, combination of ligand and protein features: selecting the ECFP4 fingerprint, the MACCS fingerprint, and the Mol2d descriptor as characterizations of the ligand; selecting the Proa and Prob features as characterizations of the protein; combining the three ligand characterizations and the two protein characterizations pairwise to construct 6 characterizations;

S4, constructing a model: constructing a model with the XGBoost algorithm for the samples under each of the 6 characterizations; and combining the 6 models to establish a consensus model, wherein the result of the consensus model is the average result of the 6 models.
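The labeling rule of S2 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `label_pair` and the sample records are hypothetical, and because the text does not say how a Ki of exactly 0.1 μM is treated, the sketch returns `None` for that case.

```python
from typing import Optional

KI_THRESHOLD_UM = 0.1  # threshold from S2, in micromolar


def label_pair(ki_um: float) -> Optional[int]:
    """Return 1 (positive) if Ki < 0.1 uM, 0 (negative) if Ki > 0.1 uM.

    A Ki of exactly 0.1 uM is not covered by the text, so None is returned
    (an assumption of this sketch).
    """
    if ki_um < KI_THRESHOLD_UM:
        return 1
    if ki_um > KI_THRESHOLD_UM:
        return 0
    return None


# Hypothetical (ligand, protein, Ki in uM) records, for illustration only.
pairs = [("lig_a", "P00001", 0.02), ("lig_b", "P00001", 3.5), ("lig_c", "P00002", 0.1)]
labels = [label_pair(ki) for _, _, ki in pairs]
print(labels)
```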

In the small molecule target prediction algorithm described above, further, the human protein data collected in S1 are single human proteins from the ChEMBL database.

In the small molecule target prediction algorithm described above, further, the ligand-protein interaction pairs in S1 are derived from the ChEMBL and BindingDB databases.

In the small molecule target prediction algorithm described above, further, the activity value in S2 is the half dissociation concentration Ki of the drug.

In the small molecule target prediction algorithm described above, further, in S3, the Proa feature comprises structural and physicochemical features of the amino acid sequence and a protein chemometrics modeling descriptor; the Prob feature comprises gene ontology similarity and sequence similarity information between every two proteins.

In the small molecule target prediction algorithm described above, further, the Resnik algorithm is used to calculate, for every two proteins, the gene ontology similarity in the three domains of cellular component, molecular function and biological process; the sequence similarity of proteins is calculated using the BLOSUM62 local algorithm.
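As a rough illustration of the Resnik measure, the similarity of two ontology terms is the information content, -log p, of their most informative common ancestor. The sketch below uses a hypothetical toy ontology and made-up annotation probabilities, not real GO data, and it stops at term-to-term similarity: how term similarities are aggregated into a protein-to-protein score is not stated in the text.

```python
import math

# Hypothetical toy ontology: term -> set of parent terms (not real GO terms).
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "kinase": {"catalysis"},
    "protease": {"catalysis"},
}
# Hypothetical annotation probabilities p(term), counting descendants as well.
PROB = {"root": 1.0, "binding": 0.4, "catalysis": 0.6, "kinase": 0.3, "protease": 0.2}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def resnik(term_a, term_b):
    """Information content of the most informative common ancestor."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(-math.log(PROB[t]) for t in common)


print(resnik("kinase", "protease"))  # shares "catalysis"
print(resnik("kinase", "binding"))   # shares only "root"
```

Terms that share only the root score zero, because the root annotates everything and therefore carries no information.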

Based on a general technical concept, the invention provides application of the small molecular target prediction algorithm in predicting a small molecular target.

The application described above, further comprising the steps of:

(1) Calculating the ECFP4 fingerprint, the MACCS fingerprint and the Mol2d descriptor of the molecule to be tested, respectively;

(2) Combining the three molecular descriptors with the Proa and Prob features of each protein in the small molecule target prediction model to obtain 6 classes of features;

(3) Inputting the prediction samples of the 6 classes of features into the target prediction model to obtain the consensus model's predicted probability values for all targets;

(4) Ranking all targets by probability value to obtain a predicted drug target list for the small molecule.

In the application described above, further, in step (1), the ECFP4 fingerprint and the MACCS fingerprint of the molecule to be tested are calculated using the RDKit package, and the Mol2d descriptor of the molecule is calculated using the PyBioMed package.

In the application described above, further, in step (2), each of the 6 classes of features comprises 1 × 859 ligand-protein interaction prediction samples.

In the application described above, further, in step (3), the 859 prediction samples of each feature class are input into the corresponding model to obtain the consensus model's predicted probability values for the 859 targets, and the 859 targets are ranked by probability value to obtain the predicted target list of the small molecule.
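The way one query molecule turns into one prediction sample per target can be sketched as below. The sketch assumes that "combining" means concatenating the ligand vector with each protein vector, which matches the way the stated feature dimensions add up; the function name and the toy vectors are hypothetical, with three proteins standing in for the 859.

```python
def build_prediction_samples(ligand_vec, protein_vecs):
    """Concatenate one ligand vector with every protein vector (one sample per target)."""
    return {pid: list(ligand_vec) + list(pvec) for pid, pvec in protein_vecs.items()}


# Toy stand-ins: a 4-bit ligand fingerprint and three proteins with 2-dimensional features.
ligand = [1, 0, 1, 1]
proteins = {"T1": [0.2, 0.5], "T2": [0.9, 0.1], "T3": [0.4, 0.4]}
samples = build_prediction_samples(ligand, proteins)
print(len(samples), len(samples["T1"]))
```

With the real data, repeating this for each of the 6 ligand/protein feature pairings would give 6 sets of 859 samples per query molecule.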

Compared with the prior art, the invention has the advantages that:

(1) The invention provides a small molecule target prediction algorithm based on chemical genomics, and a model established by the algorithm is used in the field of target prediction for the first time.

(2) The invention provides a small molecule target prediction algorithm based on chemical genomics, wherein a model established by the algorithm obtains stable target prediction performance by combining a plurality of models built on different aspects of information into a consensus model.

(3) The invention provides an application method of the chemical genomics-based small molecule target prediction algorithm, which can be effectively applied to the protein target prediction of compounds. When the method is applied to the small molecules of the test set, 36.22% of real targets rank first in the prediction list, 56.44% rank within the top five, and 64.61% rank within the top ten; when it is applied to molecules from the NPASS and PDSP Ki databases, on average more than 50% of real targets rank within the top ten of the prediction list, so the prediction accuracy is high.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart showing the construction of a small molecule target prediction algorithm in example 1 of the present invention.

FIG. 2 is a distribution of protein classes and protein-ligand interactions involved in the small molecule target prediction algorithm of example 1 of the present invention.

FIG. 3 shows the application process of the small molecule target prediction algorithm in example 3 of the present invention.

FIG. 4 shows the result of applying the small molecule target prediction algorithm in example 3 of the present invention to the test set.

Detailed Description

The invention is further described below in connection with specific preferred embodiments, but it is not intended to limit the scope of the invention.

Examples

The materials and instruments used in the examples below are all commercially available.

Example 1:

This example describes a model established by the chemical genomics-based small molecule target prediction algorithm of the invention. The model is constructed as shown in FIG. 1, specifically by the following steps:

S1, collecting modeling data: single human proteins were collected from the ChEMBL database, and the ligand-protein interactions involving the collected human proteins were then collected from the ChEMBL and BindingDB databases, with the interaction activity expressed as the half dissociation concentration Ki of the drug. Each protein was required to have a well-defined sequence, and each ligand small molecule a well-defined molecular structure. In total, 153,281 ligand-protein interaction pairs involving 859 proteins and 93,282 ligands were collected as modeling samples. The information collected is detailed in Table 1.

Table 1: Collected data information

Number of targets	Number of ligand molecules	Number of ligand-protein interactions
859	93,282	153,281

Information on the 859 proteins is given in Table 2 below.

Table 2: Information on all proteins in the model (UniProt ID)

FIG. 2 shows the protein classes and protein-ligand interactions of the protein targets: FIG. 2a shows the classes of the targets, and FIG. 2b shows the number of ligands per target. As can be seen from the data, most of the protein targets are protein kinases, G protein-coupled receptors, proteases and other enzymes, and the remainder are ion channels, transporters, transcription factors, and the like.

S2, batch division of data sets: all ligand-protein interaction samples were divided into a training set and a test set at a ratio of about 9:1. Specifically, from all targets with 5 or more ligands, 10% of the ligand-protein interaction pairs were extracted as the test set, and the remaining ligand-protein interactions were used as the training set. The training set contained 107,441 relationship pairs involving 859 protein targets, and the test set contained 15,280 relationship pairs involving 623 proteins. The training set is used to construct the model, and the test set is used to evaluate its performance.
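One way to realize such a split is sketched below. It rests on assumptions the text does not confirm: that the 10% is drawn separately within each eligible target, that the count is rounded down, and that the draw is random with a fixed seed. The function name and the data are hypothetical.

```python
import random
from collections import defaultdict


def split_pairs(pairs, min_ligands=5, test_fraction=0.1, seed=0):
    """Hold out a fraction of pairs from every target that has enough ligands."""
    by_target = defaultdict(list)
    for ligand, target in pairs:
        by_target[target].append((ligand, target))

    rng = random.Random(seed)
    train, test = [], []
    for target in sorted(by_target):
        items = by_target[target]
        n_test = 0
        if len({lig for lig, _ in items}) >= min_ligands:
            n_test = int(len(items) * test_fraction)  # rounding down is an assumption
        rng.shuffle(items)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Hypothetical data: target "A" has 20 ligands, target "B" only 3.
pairs = [(f"a{i}", "A") for i in range(20)] + [(f"b{i}", "B") for i in range(3)]
train, test = split_pairs(pairs)
print(len(train), len(test))
```

Under this scheme, targets with too few ligands contribute nothing to the test set, which would be consistent with the test set covering fewer proteins than the training set.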

S3, batch division of positive and negative sets: ligand-protein interaction pairs with a Ki below 0.1 μM were used as positive samples for modeling, and ligand-protein interaction pairs with a Ki above 0.1 μM were used as negative samples. The numbers of positive and negative samples are shown in Table 3; they are relatively balanced in both the training set and the test set.

Table 3: Number of positive and negative samples

	Positive samples	Negative samples
Training set	72,582	65,419
Test set	8,026	7,254

S4, feature representation of ligand and protein: the ligand was represented by the 1024-dimensional ECFP4 (Extended-Connectivity Fingerprints) circular fingerprint and the 166-dimensional MACCS substructure fingerprint, both derived from the RDKit package, and by the 188-dimensional Mol2d descriptor derived from the PyBioMed package, which combines molecular physicochemical properties, topology and other types of features. Some features of the Mol2d descriptor were discarded because they took infinite or null values; the descriptors actually used are shown in Table 4.

Table 4: Specific features of the Mol2d descriptor

The Proa and Prob features, derived from the protr package, are used to characterize the proteins, as described below:

Proa comprises structural and physicochemical features of the amino acid sequence and a protein chemometrics modeling descriptor;

Prob comprises the sequence similarity and gene ontology (GO) similarity among the 859 proteins.

The Resnik algorithm was used to calculate the GO similarity between proteins in the three domains of cellular component (CC), molecular function (MF) and biological process (BP). The sequence similarity of proteins was calculated using the BLOSUM62 local algorithm. To reduce memory requirements and time complexity, principal component analysis (PCA) was used to reduce the dimensionality of each class of protein features, with descriptors of more than 50 dimensions reduced to 50 dimensions. After dimensionality reduction, the Proa descriptor has 762 dimensions in total and the Prob descriptor has 200 dimensions in total. The specific protein features and the percentage of variance explained by the principal components (%VAR) are shown in Table 5.
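The 200-dimension total for Prob can be reproduced with simple bookkeeping. The sketch assumes, without confirmation from the text, that Prob consists of four similarity blocks (GO similarity in CC, MF and BP, plus sequence similarity) and that each block is a vector of similarities to all 859 proteins before PCA. The block names are hypothetical.

```python
PCA_CAP = 50  # descriptors with more than 50 dimensions are reduced to 50


def reduced_dim(raw_dim, cap=PCA_CAP):
    """Dimension of one descriptor block after the PCA step described above."""
    return min(raw_dim, cap)


# Assumption: each Prob block is a similarity vector against all 859 proteins.
prob_blocks = {"GO_CC": 859, "GO_MF": 859, "GO_BP": 859, "sequence": 859}
prob_dim = sum(reduced_dim(d) for d in prob_blocks.values())
print(prob_dim)  # 200
```

The same rule leaves any block of 50 or fewer dimensions unchanged, which is why the Proa total of 762 is not a multiple of 50.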

Table 5: protein characteristics and principal component characteristic information after PCA dimension reduction

S5, combination of ligand and protein features: the three ligand characterizations and the two protein characterizations were combined pairwise into six combinations, namely ECFP4_Proa, ECFP4_Prob, Mol2d_Proa, Mol2d_Prob, MACCS_Proa and MACCS_Prob. With these 6 combinations, each ligand-protein interaction sample contained 1786-, 1224-, 950-, 388-, 928- and 366-dimensional features, respectively.
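Assuming that combining means concatenating the ligand and protein vectors, the size of each combined feature follows from adding the dimensions stated for the individual descriptors (1024, 188 and 166 for the ligand; 762 and 200 for the protein). A short sketch of that arithmetic:

```python
from itertools import product

# Dimensions as stated in the text for each individual descriptor.
LIGAND_DIMS = {"ECFP4": 1024, "Mol2d": 188, "MACCS": 166}
PROTEIN_DIMS = {"Proa": 762, "Prob": 200}

# Pair every ligand descriptor with every protein descriptor (3 x 2 = 6 combinations).
combined_dims = {
    f"{lig}_{prot}": LIGAND_DIMS[lig] + PROTEIN_DIMS[prot]
    for lig, prot in product(LIGAND_DIMS, PROTEIN_DIMS)
}
for name, dim in combined_dims.items():
    print(name, dim)
```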

S6, constructing a model: a model was built with the XGBoost (eXtreme Gradient Boosting) algorithm for the training set samples under each of the 6 combined features described above, with the final parameters of each model being eta: 0.3, gamma: 0, max depth: 6, and number of boosting rounds: 500. The 6 models were then combined to build a consensus model, whose result is the average prediction result of the 6 models.
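In the XGBoost Python package, the listed settings could be expressed roughly as below. This is a sketch under assumptions, not the patent's training code: the `binary:logistic` objective is not stated in the text, the function name `train_one_model` is hypothetical, and `features` is assumed to be a 2-D NumPy array with one row per ligand-protein sample.

```python
# Parameters as listed in the text, written with XGBoost's native parameter names.
XGB_PARAMS = {
    "eta": 0.3,
    "gamma": 0,
    "max_depth": 6,
    "objective": "binary:logistic",  # assumption: not stated in the text
}
NUM_BOOST_ROUND = 500


def train_one_model(features, labels):
    """Train a single booster on one of the 6 combined feature sets."""
    import xgboost as xgb  # imported here so the constants above load without it

    dtrain = xgb.DMatrix(features, label=labels)
    return xgb.train(XGB_PARAMS, dtrain, num_boost_round=NUM_BOOST_ROUND)
```

Calling `train_one_model` once per combined feature set would yield the 6 boosters whose predicted probabilities are then averaged.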

Example 2:

To measure the classification performance of the model, the small molecule target prediction algorithm of example 1 of the invention was evaluated as follows:

Validation sets, drawn at random 50 times from the modeling samples of example 1, and the test set were used to evaluate the individual models and the consensus model constructed in example 1. For consistency, each validation set was extracted from the training set in the same way as the test set was extracted from the whole data set. After each extraction, the rest of the training set was used to build the model, and the extracted validation set was used to assess its accuracy. The entire process was repeated fifty times, and the performance of the model on the validation set is the average of the 50 runs. The full training set was then used for model construction, and the test set for model evaluation.

Model performance was assessed by accuracy, Accuracy = (TP + TN)/(TP + TN + FP + FN); sensitivity, Sensitivity = TP/(TP + FN); specificity, Specificity = TN/(TN + FP); and the area under the ROC curve (AUC; ROC, receiver operating characteristic curve). Here, true positive (TP) is the number of positive samples predicted as positive, false negative (FN) is the number of positive samples predicted as negative, false positive (FP) is the number of negative samples predicted as positive, and true negative (TN) is the number of negative samples predicted as negative. AUC is an index of the overall performance of a binary classification model and directly indicates the quality of the model: an AUC at or below 0.5 means that the classifier performs no better than random guessing and is of no use, whereas an AUC above 0.9 indicates excellent prediction.
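These definitions translate directly into code. The sketch below computes the three count-based metrics from a confusion matrix and computes AUC through its rank interpretation, the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. The counts and scores are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical counts and scores, for illustration only.
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(m, auc)
```

The pairwise AUC computation is quadratic in the number of samples, which is fine for an illustration but not for data sets of the size used here.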

The results of the classification properties of the model are shown in Table 6.

Table 6: performance of the model

The results in Table 6 demonstrate that the model constructed in example 1 has excellent predictive performance in the classification prediction of drug-target interactions. The average performance of the single models based on the 6 different feature sets is: for the validation set, an accuracy of 0.851 and an AUC of 0.928; for the test set, an accuracy of 0.854 and an AUC of 0.929. The high predictive performance of each model ensures the reliability of the consensus model. The predictive performance of the consensus model is better than that of the single model: accuracy for the validation set was 0.833 and AUC was 0.912; accuracy for the test set was 0.826 and AUC was 0.949. These results all indicate that the consensus model can more accurately distinguish molecule-target interactions from non-interactions. That is, the small molecule target prediction algorithm has excellent classification performance.

Example 3:

This example shows the practical application of the small molecule target prediction algorithm of example 1 to predicting the targets of compounds in the test set. The implementation steps and results are as follows:

The prediction set consisted of the positive-set molecules of the test set from S2 of example 1, comprising 8,025 molecule-protein interaction pairs involving 7,719 active molecules. These molecules interact with 421 of the 859 proteins in the database: 7,432 interact with only one protein target, 269 interact with two proteins, 17 interact with three proteins, and 1 interacts with four proteins. The application method is shown in FIG. 3 and comprises the following steps:

(1) The ECFP4 and MACCS fingerprints of each molecule were extracted using the cheminformatics package RDKit; the Mol2d descriptor of the molecule was extracted using the PyBioMed package.

(2) For each molecule, the three types of molecular features were combined pairwise with the Proa and Prob features of each protein in the model to obtain the 6 types of combined features ECFP4_Proa, ECFP4_Prob, Mol2d_Proa, Mol2d_Prob, MACCS_Proa and MACCS_Prob.

(3) For each model, each query molecule yields 859 prediction samples. The samples are input into each model to obtain predicted probability values, and the average of these probability values is the prediction result of the consensus model, namely the predicted probability of the molecule for each of the 859 targets.

(4) The 859 protein targets are ranked by probability to obtain the predicted target profile of each small molecule.
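Steps (3) and (4) amount to averaging the per-model probabilities for every target and sorting by the average. A minimal sketch, with hypothetical probabilities from two models and three targets standing in for the six models and 859 targets:

```python
def rank_targets(model_probs):
    """Average per-model probabilities for each target and sort by that average.

    model_probs maps a model name to a dict of {target_id: predicted probability}.
    """
    targets = next(iter(model_probs.values())).keys()
    n_models = len(model_probs)
    consensus = {
        t: sum(probs[t] for probs in model_probs.values()) / n_models for t in targets
    }
    return sorted(consensus.items(), key=lambda item: item[1], reverse=True)


# Hypothetical outputs of two of the six models for three targets.
model_probs = {
    "ECFP4_Proa": {"T1": 0.25, "T2": 0.75, "T3": 0.5},
    "MACCS_Prob": {"T1": 0.75, "T2": 1.0, "T3": 0.0},
}
ranking = rank_targets(model_probs)
print(ranking)
```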

The results of applying the model are shown in FIG. 4: 36.22% of the real targets rank first in the prediction list, 56.44% rank within the top five, and 64.61% rank within the top ten. This result suggests that the algorithm has potential for the target prediction of molecules.
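Figures of this kind can be computed from the rank of each real target in its molecule's predicted list. The sketch below assumes 1-based ranks and uses hypothetical rank values; the function name `top_k_recovery` is not from the source.

```python
def top_k_recovery(true_target_ranks, k):
    """Fraction of real targets whose 1-based rank in the predicted list is <= k."""
    hits = sum(1 for rank in true_target_ranks if rank <= k)
    return hits / len(true_target_ranks)


# Hypothetical ranks of five real targets in their molecules' predicted lists.
ranks = [1, 3, 12, 1, 7]
print(top_k_recovery(ranks, 1), top_k_recovery(ranks, 5), top_k_recovery(ranks, 10))
```

Because the top-k sets are nested, the recovery rate can only stay the same or grow as k increases.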

Example 4:

This example applies the small molecule target prediction algorithm of example 1 to predicting the protein targets of external compounds. The data sources and results are as follows:

Compounds were obtained from the PDSP Ki database (Psychoactive Drugs Screening Programme Ki Database) and the natural product database NPASS (Natural Product Activity & Species Source Database). From the PDSP Ki database, 87 molecule-protein interaction pairs involving 56 molecules were used for model application. These molecules interact with 24 proteins of the database; 36, 9 and 11 of them interact with 1, 2 and 3 targets, respectively. From the NPASS database, 44 molecule-protein interaction pairs involving 36 molecules were used for model application. These molecules interact with 34 proteins of the database; 30 interact with only 1 protein target, 3 interact with 2 proteins, 1 interacts with 3 proteins, and 1 interacts with 5 proteins.

The specific method for target prediction application is the same as in example 3, with results shown in table 7.

Table 7: prediction result of model on PDSP Ki and NPASS foreign molecules

From the results in Table 7, it can be seen that, for the molecules of the PDSP Ki database, 22 real targets ranked first in the predicted target list, a recovery rate of 25.29%, and 48 real targets ranked within the top ten, a recovery rate of 55.17%. For the molecules of the NPASS database, 12 real targets ranked first in the predicted target list, a recovery rate of 27.27%, and 21 real targets ranked within the top ten, a recovery rate of 47.73%. This result suggests that the algorithm has potential for the target prediction of external molecules, including natural product molecules.
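The recovery rates follow from dividing the number of recovered real targets by the number of interaction pairs in each set (87 for PDSP Ki, 44 for NPASS). A short check of that arithmetic, using only the counts given in the text:

```python
# (real targets ranked first, real targets in the top ten, total interaction pairs)
RESULTS = {"PDSP Ki": (22, 48, 87), "NPASS": (12, 21, 44)}

recovery = {
    name: (round(100 * top1 / total, 2), round(100 * top10 / total, 2))
    for name, (top1, top10, total) in RESULTS.items()
}
mean_top10 = sum(r[1] for r in recovery.values()) / len(recovery)
print(recovery, round(mean_top10, 2))
```

The mean of the two top-ten recovery rates is above 50%, in line with the average figure stated among the advantages of the invention.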

The above description is only of the preferred embodiment of the present invention, and is not intended to limit the present invention in any way. While the invention has been described in terms of preferred embodiments, it is not intended to be limiting. Any person skilled in the art can make many possible variations and modifications to the technical solution of the present invention or equivalent embodiments using the method and technical solution disclosed above without departing from the spirit and technical solution of the present invention. Therefore, any simple modification, equivalent substitution, equivalent variation and modification of the above embodiments according to the technical substance of the present invention, which do not depart from the technical solution of the present invention, still fall within the scope of the technical solution of the present invention.

Claims

1. A method for constructing a small molecule target prediction model based on chemical genomics, characterized in that the small molecule target prediction model is constructed by the following method:

S2, batch division of data sets: dividing all ligand-protein interaction samples into a training set and a test set in proportion;

S3, batch division of positive and negative data sets: taking ligand-protein interaction pairs with activity values lower than 0.1 μM in the modeling samples as positive samples and ligand-protein interaction pairs with activity values higher than 0.1 μM as negative samples;

S4, combination of ligand and protein features: selecting the ECFP4 fingerprint, the MACCS fingerprint, and the Mol2d descriptor as characterizations of the ligand; selecting the Proa and Prob features as characterizations of the protein; combining the three ligand characterizations and the two protein characterizations pairwise to obtain 6 characterizations;

wherein the Proa feature comprises structural and physicochemical features of the amino acid sequence and a protein chemometrics modeling descriptor; the Prob feature comprises gene ontology similarity and sequence similarity information between every two proteins;

S5, constructing a model: constructing a model with the XGBoost algorithm for the samples under each of the 6 characterizations; and combining the 6 models to establish a consensus model, wherein the result of the consensus model is the average result of the 6 models.

2. The method according to claim 1, wherein the human protein data collected in S1 are single human proteins from the ChEMBL database.

3. The method of claim 1, wherein the ligand-protein interaction pairs in S1 are derived from the ChEMBL and BindingDB databases.

4. The method according to claim 1, wherein the activity value in S3 is the half dissociation concentration Ki of the drug.

5. The method for constructing a small molecule target prediction model according to claim 1, wherein the Resnik algorithm is used to calculate, for every two proteins, the gene ontology similarity in the three domains of cellular component, molecular function and biological process; and the sequence similarity of proteins is calculated using the BLOSUM62 local algorithm.

6. A chemical genomics-based small molecule target prediction method implemented using the model constructed by any one of the methods of claims 1-5.

7. The chemical genomics-based small molecule target prediction method of claim 6, wherein the application includes the steps of:

8. The chemical genomics-based small molecule target prediction method according to claim 7, wherein in step (1), the ECFP4 fingerprint and the MACCS fingerprint of the test molecule are calculated using the RDKit package, and the Mol2d descriptor of the molecule is calculated using the PyBioMed package.

9. The chemical genomics-based small molecule target prediction method of claim 7, wherein, of the 6 classes of features in step (2), each class contains 1 × 859 ligand-protein interaction prediction samples;

in step (3), the 859 prediction samples of each feature class are input into the corresponding model to obtain the consensus model's predicted probability values for the 859 targets, and the 859 targets are ranked by probability value to obtain the predicted target list of the small molecule.