CN111383708B - Small molecular target prediction algorithm based on chemical genomics and application thereof - Google Patents

Small molecular target prediction algorithm based on chemical genomics and application thereof

Info

Publication number
CN111383708B
Authority
CN
China
Prior art keywords
protein
model
target prediction
ligand
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010165489.3A
Other languages
Chinese (zh)
Other versions
CN111383708A (en)
Inventor
曹东升
杨素青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN202010165489.3A
Publication of CN111383708A
Application granted
Publication of CN111383708B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a small-molecule target prediction algorithm based on chemical genomics. A model established by the algorithm can be used to predict the targets of small molecules. The method for constructing the prediction model comprises collecting modeling data, dividing the data into batches and into positive and negative sets, combining ligand and protein features, and constructing the model. Given a query molecule, the algorithm produces a ranked list of predicted targets through the model; the higher a target ranks in the list, the more likely it is to be a true target. By combining several models that draw on different kinds of information into a consensus model, the algorithm achieves stable and robust target prediction performance. Applied to the prediction of small-molecule targets, the method achieves high prediction accuracy.

Description

Small molecular target prediction algorithm based on chemical genomics and application thereof
Technical Field
The invention relates to the technical field of agricultural biology, in particular to a small molecular target prediction algorithm based on chemical genomics and application thereof.
Background
The interaction between a drug and macromolecules such as proteins is an important precondition for the drug to exert its effect. In drug discovery and development, target identification is the basis of modern new drug development. Identifying drug targets allows the clinical effects of drugs to be explained more thoroughly. This is especially important for natural products from traditional Chinese medicine and will favor the globalization of Chinese medicine. For example, the anti-tumor effect of a traditional Chinese medicine has been shown to result from inhibition of Na+/K+-ATPase by its main component, bufalin. Identifying the off-targets of a drug helps guide its structural modification and optimize its selectivity, and leaves more room for further development. The discovery of new targets facilitates drug repositioning. Because bringing new drugs to market is difficult, drug repositioning has become the most cost-effective route to market for modern drugs. For example, sildenafil, originally developed to treat angina pectoris, was found to treat male erectile dysfunction through inhibition of its target PDE5, and has been developed into the widely used drug Viagra.
Experimental target validation is difficult because testing large numbers of protein targets is costly in both time and money. In contrast, computational target prediction has gained favor in recent years as an auxiliary approach. By picking out, from a huge protein space, the few proteins most likely to interact with a query compound, such methods enrich for candidate proteins while maintaining a high recovery rate, and so reduce the experimental burden. Computational methods fall mainly into two types: those based on protein structure and those based on ligand structure. Protein structure-based methods search for targets using the interaction between compound and protein, but their need for a three-dimensional protein structure limits their range of application. Ligand structure-based methods infer targets from the similarity between ligands, but they do not consider the influence of protein information on the prediction. Moreover, if the active molecules of a target are too few or lack structural diversity, the prediction is unreliable. A new method is therefore needed to compensate for these drawbacks of ligand-based methods.
The more sophisticated approach of chemical genomics has developed in recent years. It predicts compound-target interactions by combining the features of the compound with those of the protein. The approach explores the small-molecule space and the protein space at the same time: ligand data for similar targets are shared across the whole model, and the inclusion of protein features gives the protein itself weight in target prediction. These properties remedy the deficiencies of the methods described above. However, existing chemical genomics methods are limited to predicting individual drug-target interaction pairs and cannot provide a ranked list of predicted targets, which greatly restricts their use in drug target prediction. Accordingly, the present invention aims to exploit the value of chemical genomics in target prediction and to make the target prediction performance of this approach available to a broad range of researchers.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects of the prior art by providing a small-molecule target prediction algorithm based on chemical genomics and an application thereof. A model established by the algorithm can be used for target prediction of small molecules: given a query molecule, the model yields a ranked list of predicted targets, and targets ranked higher in the list are more likely to be true targets. The invention overcomes the defects of existing computer-aided drug target prediction methods, realizes the prediction of human protein targets for small molecules, and establishes an innovative protein target prediction method using a high-quality data set of human protein targets.
In order to achieve the above purpose, the invention provides a small molecular target prediction algorithm based on chemical genomics, which is constructed by adopting the following method:
S1, collecting modeling data: collecting human protein data, and taking as modeling samples the ligand-protein interaction pairs formed by the collected human proteins and the ligands that interact with them;
S2, batched division of positive and negative data sets: taking ligand-protein interaction pairs in the modeling samples with activity values lower than 0.1 μM as positive samples, and ligand-protein interaction pairs with activity values higher than 0.1 μM as negative samples;
S3, combination of ligand and protein features: selecting the ECFP4 fingerprint, the MACCS fingerprint, and the Mol2d descriptor as representations of the ligand; selecting the Proa and Prob features as representations of the protein; combining the three ligand representations and the two protein representations pairwise to construct 6 representations;
S4, constructing a model: constructing a model with the XGBoost algorithm for the samples under each of the 6 representations; and combining the 6 models to establish a consensus model, wherein the result of the consensus model is the average of the results of the 6 models.
In the small-molecule target prediction algorithm described above, further, the human protein data collected in S1 are single human proteins from the ChEMBL database.
In the small-molecule target prediction algorithm described above, further, the ligand-protein interaction pairs in S1 are derived from the ChEMBL and BindingDB databases.
In the small-molecule target prediction algorithm described above, further, the activity value in S2 is the half dissociation concentration Ki of the drug.
In the small-molecule target prediction algorithm described above, further, in S3 the Proa features comprise structural features of the amino acid sequence, physicochemical property features, and protein chemometrics modeling descriptors; the Prob features comprise pairwise gene ontology similarity and sequence similarity information between proteins.
In the small-molecule target prediction algorithm described above, further, the Resnik algorithm is used to calculate the pairwise gene ontology similarity between proteins in the three domains of cellular component, molecular function, and biological process; the sequence similarity of proteins is calculated using the BLOSUM62 local algorithm.
Based on a general technical concept, the invention provides application of the small molecular target prediction algorithm in predicting a small molecular target.
The application described above, further comprising the steps of:
(1) Respectively calculating ECFP4 fingerprint, MACCS fingerprint and Mol2d descriptor of the molecule to be detected;
(2) Combining the three molecular descriptors with Proa characteristics and Prob characteristics of each protein in a small molecular target prediction model to obtain 6 types of characteristics;
(3) Inputting the prediction samples related to the 6 types of characteristics into the target prediction model to obtain the prediction probability values of the consensus model on all targets;
(4) And sequencing all targets according to the probability value to obtain a drug target prediction list of the small molecules.
In the application described above, further, in step (1) the ECFP4 fingerprint and the MACCS fingerprint of the molecule to be tested are calculated using the RDKit package, and the Mol2d descriptor of the molecule is calculated using the PyBioMed package.
In the application described above, further, in step (2) each of the 6 classes of features comprises 1 × 859 ligand-protein interaction prediction samples.
In the application described above, further, in step (3) the 859 prediction samples for each class of features are input into the corresponding model to obtain the consensus model's predicted probability values for the 859 targets, and the 859 targets are sorted by probability value to obtain the predicted target list of the small molecule.
Compared with the prior art, the invention has the advantages that:
(1) The invention provides a small-molecule target prediction algorithm based on chemical genomics, and a model established by the algorithm applies chemical genomics to the field of target prediction for the first time.
(2) The invention provides a small-molecule target prediction algorithm based on chemical genomics, wherein a model established by the algorithm obtains stable target prediction performance by combining several models that draw on different kinds of information into a consensus model.
(3) The invention provides a method of applying the chemical genomics-based small-molecule target prediction algorithm, which can be used effectively to predict the protein targets of compounds. When applied to predict the targets of small molecules in the test set, 36.22% of the real targets rank first in the prediction list, 56.44% rank within the top five, and 64.61% rank within the top ten. When applied to molecules from the NPASS and PDSP Ki databases, on average more than 50% of the real targets rank within the top ten of the prediction list, so the prediction accuracy is high.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart showing the construction of a small molecule target prediction algorithm in example 1 of the present invention.
FIG. 2 is a distribution of protein classes and protein-ligand interactions involved in the small molecule target prediction algorithm of example 1 of the present invention.
Fig. 3 is an application process of the small molecule target prediction algorithm in embodiment 3 of the present invention.
Fig. 4 shows the application result of the small molecule target prediction algorithm in example 3 of the present invention to the test set.
Detailed Description
The invention is further described below in connection with specific preferred embodiments, but it is not intended to limit the scope of the invention.
Examples
The materials and instruments used in the examples below are all commercially available.
Example 1:
A model established by the chemical genomics-based small-molecule target prediction algorithm of the invention is constructed as shown in FIG. 1, specifically by the following steps:
S1, collecting modeling data: single human proteins were collected from the ChEMBL database, and the ligand-protein interactions involving these proteins were then collected from the ChEMBL and BindingDB databases, with the strength of each ligand-protein interaction expressed as the half dissociation concentration Ki of the drug. Each protein was required to have a defined sequence identifier and each small-molecule ligand a defined molecular structure identifier. In total, 153,281 ligand-protein target interaction pairs involving 859 proteins and 93,282 ligands were collected as modeling samples. The collected information is detailed in Table 1.
Table 1: Collected data information
Number of targets | Number of ligand molecules | Number of ligand-protein interactions
859               | 93,282                     | 153,281
Information on the 859 proteins is given in Table 2 below.
Table 2: Information on all proteins in the model (UniProt ID)
[Table 2 appears as images in the original publication: Figures BDA0002407299060000051, BDA0002407299060000061, BDA0002407299060000071.]
FIG. 2 shows the protein classes and the protein-ligand interactions involved, where FIG. 2a shows the classes of targets and FIG. 2b shows the number of ligands per target. As the data show, most of the protein targets are protein kinases, G protein-coupled receptors, proteases, and other enzymes, and the remainder are ion channels, transporters, transcription factors, and the like.
S2, batch division of data sets: all ligand-protein interaction samples were divided into training and test sets at a ratio of about 9:1, namely: from all targets containing a ligand number of 5 or more, 10% of ligand-protein interaction pairs were extracted as test sets, and the remaining ligand-protein interactions were used as training sets. The training set contained 107,441 relationship pairs involving 859 protein targets and the test set contained 15,280 relationship pairs involving 623 proteins. The data of the training set is used for constructing the model, and the data of the testing set is used for evaluating the performance of the model.
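The hold-out procedure in S2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `split_pairs`, the fixed random seed, and the choice to sample the 10% separately within each eligible target (rather than from the pooled pairs) are assumptions, since the text does not specify them.

```python
import random
from collections import defaultdict

def split_pairs(pairs, min_ligands=5, test_fraction=0.10, seed=0):
    """Hold out a fraction of ligand-protein pairs from well-populated targets.

    pairs: list of (ligand_id, target_id) tuples.
    Returns (train_pairs, test_pairs).
    Assumption: sampling is done per target; the source does not say.
    """
    rng = random.Random(seed)
    by_target = defaultdict(list)
    for pair in pairs:
        by_target[pair[1]].append(pair)

    train, test = [], []
    for items in by_target.values():
        n_ligands = len({ligand for ligand, _ in items})
        if n_ligands >= min_ligands:
            shuffled = items[:]
            rng.shuffle(shuffled)
            n_test = int(len(shuffled) * test_fraction)
            test.extend(shuffled[:n_test])
            train.extend(shuffled[n_test:])
        else:
            # Targets with fewer than min_ligands ligands stay in training only.
            train.extend(items)
    return train, test
```

Under this per-target reading, a target with only a handful of ligands may contribute no test pairs at all, which would be one way for the test set to cover fewer proteins than the training set.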
S3, dividing positive and negative sets in batches: ligand-protein interaction pairs with a Ki value below 0.1 μM were used as positive samples for modeling, and ligand-protein interaction pairs with a Ki value above 0.1 μM were used as negative samples. The numbers of positive and negative samples are shown in Table 3. The positive and negative samples are relatively balanced in both the training set and the test set.
Table 3: Number of positive and negative samples
             | Positive samples | Negative samples
Training set | 72,582           | 65,419
Test set     | 8,026            | 7,254
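The labeling rule of S3 amounts to a single threshold on Ki. The sketch below assumes Ki is already expressed in micromolar; the names `THRESHOLD_UM` and `label_pair` are illustrative. The text does not state how a Ki of exactly 0.1 μM is treated, so the sketch returns `None` in that case rather than guessing.

```python
THRESHOLD_UM = 0.1  # Ki cutoff in micromolar, as stated in S3

def label_pair(ki_um):
    """Return 1 (positive), 0 (negative), or None when Ki equals the cutoff."""
    if ki_um < THRESHOLD_UM:
        return 1
    if ki_um > THRESHOLD_UM:
        return 0
    return None  # handling of Ki == 0.1 uM is not specified in the source
```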
S4, feature representation of ligand and protein: the ligand is represented by the 1024-dimensional ECFP4 (Extended-Connectivity Fingerprints) circular fingerprint and the 166-dimensional MACCS substructure fingerprint, both computed with the RDKit package, and by the 188-dimensional Mol2d descriptor computed with the PyBioMed package, which combines molecular physicochemical properties, topology, and other types of features. Some features of the Mol2d descriptor were discarded because they contained infinite or null values; the descriptors actually used are shown in Table 4.
Table 4: Specific features of the Mol2d descriptor
[Table 4 appears as images in the original publication: Figures BDA0002407299060000081, BDA0002407299060000091, BDA0002407299060000101.]
The Proa and Prob features, derived from the protr package, are used to represent the proteins and are described below:
Proa comprises structural and physicochemical features of the amino acid sequence and protein chemometrics modeling descriptors;
Prob comprises the pairwise sequence similarity and gene ontology (GO, Gene Ontology) similarity between the 859 proteins.
The Resnik algorithm was used to calculate the GO similarity between proteins in the three domains of cellular component (CC), molecular function (MF), and biological process (BP). The sequence similarity of proteins was calculated using the BLOSUM62 local algorithm. To reduce memory requirements and time complexity, PCA (principal component analysis) was used to reduce the dimensionality of each class of protein features: descriptors with more than 50 dimensions were reduced to 50 dimensions. After dimensionality reduction, the Proa descriptor has 762 dimensions in total and the Prob descriptor has 200 dimensions in total. The specific protein features and the percentage of variance explained by the principal components (% VAR) are shown in Table 5.
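For readers unfamiliar with the Resnik measure, the sketch below shows its standard term-level definition: the similarity of two ontology terms is the information content, -log p(t), of their most informative common ancestor. The toy ontology, the term names, and the annotation probabilities are all hypothetical, and the source does not describe how term-level scores are aggregated into a protein-level similarity, so that step is not shown.

```python
import math

# Hypothetical toy ontology: term -> set of parent terms.
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "kinase": {"catalysis"},
    "atp_binding": {"binding"},
}
# Hypothetical annotation probabilities p(term).
PROB = {"root": 1.0, "binding": 0.5, "catalysis": 0.4,
        "kinase": 0.1, "atp_binding": 0.2}

def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def information_content(term):
    """IC(t) = -log p(t); rarer terms carry more information."""
    return -math.log(PROB[term])

def resnik(term_a, term_b):
    """Information content of the most informative common ancestor."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(information_content(t) for t in common)
```

In this toy example, two terms that share only the root have similarity 0, because the root annotates everything and so carries no information.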
Table 5: Protein features and principal component information after PCA dimensionality reduction
[Table 5 appears as an image in the original publication: Figure BDA0002407299060000111.]
S5, combination of ligand and protein features: the three ligand representations and the two protein representations are combined pairwise into six combinations, namely ECFP4_Proa, ECFP4_Prob, Mol2d_Proa, Mol2d_Prob, MACCS_Proa and MACCS_Prob. Under these 6 combinations, each ligand-protein interaction sample contains 1,786, 1,224, 950, 388, 928 and 366 feature dimensions, respectively.
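The pairwise combination in S5 can be illustrated with a short sketch. It assumes that "combining" means concatenating the ligand vector with the protein vector, so the combined width is the sum of the two block widths stated in S4 (1024, 188, 166 for the ligand; 762, 200 for the protein). The helper names are illustrative.

```python
from itertools import product

# Block widths as stated in S4.
LIGAND_DIMS = {"ECFP4": 1024, "Mol2d": 188, "MACCS": 166}
PROTEIN_DIMS = {"Proa": 762, "Prob": 200}

def combined_dims():
    """Map each ligand_protein combination to the width of the joined vector."""
    return {
        f"{lig}_{prot}": LIGAND_DIMS[lig] + PROTEIN_DIMS[prot]
        for lig, prot in product(LIGAND_DIMS, PROTEIN_DIMS)
    }

def concatenate(ligand_vec, protein_vec):
    """Assumed combination rule: ligand features followed by protein features."""
    return list(ligand_vec) + list(protein_vec)
```

Iterating over `combined_dims()` yields the six names in the same order as the text: ECFP4_Proa, ECFP4_Prob, Mol2d_Proa, Mol2d_Prob, MACCS_Proa, MACCS_Prob.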
S6, constructing a model: using the XGBoost (eXtreme Gradient Boosting) algorithm, a model was built on the training-set samples for each of the 6 combined feature representations described above. The final parameters of each model were eta: 0.3, gamma: 0, max depth: 6, number of boosting rounds: 500. The 6 models were then combined into a consensus model, whose result is the average of the predictions of the 6 models.
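The parameters and the consensus rule of S6 can be written down as below. The dictionary keys follow the parameter names used by the xgboost Python package; the `binary:logistic` objective is an assumption, since the text does not name one. The averaging function is plain Python and does not call xgboost.

```python
# Values from S6; key names follow the xgboost Python package.
XGB_PARAMS = {
    "eta": 0.3,
    "gamma": 0,
    "max_depth": 6,
    "objective": "binary:logistic",  # assumption: not stated in the source
}
NUM_BOOST_ROUND = 500

def consensus(probabilities):
    """Average the per-model probabilities for one ligand-protein sample."""
    if not probabilities:
        raise ValueError("need at least one model output")
    return sum(probabilities) / len(probabilities)
```

With the xgboost package installed, each of the six models would typically be trained with a call of the form `xgboost.train(XGB_PARAMS, dtrain, num_boost_round=NUM_BOOST_ROUND)`, where `dtrain` is a `DMatrix` built from one combined feature representation.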
Example 2:
To measure the classification performance of the model, the small-molecule target prediction algorithm of Example 1 of the invention was evaluated as follows:
From the modeling samples of Example 1, validation and test sets were randomly drawn 50 times to evaluate the individual models and the consensus model constructed in Example 1. For consistency, the validation set was extracted from the training set in the same way that the test set was extracted from the whole data set. After extraction, the remainder of the training set was used to build the model, and the extracted validation set was used to assess the accuracy of the model. The whole process was repeated fifty times, and the performance of the model on the validation set is the average of the 50 runs. The training set was then used for model construction and the test set for model evaluation.
Model performance was assessed by accuracy, Accuracy = (TP + TN)/(TP + TN + FP + FN); sensitivity, Sensitivity = TP/(TP + FN); specificity, Specificity = TN/(TN + FP); and the area under the ROC curve (AUC, Area Under the Curve; ROC, Receiver Operating Characteristic). Here, True Positive (TP) is the number of positive samples predicted as positive, False Negative (FN) is the number of positive samples predicted as negative, False Positive (FP) is the number of negative samples predicted as positive, and True Negative (TN) is the number of negative samples predicted as negative. AUC is an index of the overall performance of a binary classification model and directly indicates model quality: an AUC of 0.5 or lower means the classifier performs no better than random guessing and is therefore ineffective, while an AUC above 0.9 means the classifier predicts excellently.
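The metrics above can be computed as in the following sketch. The confusion-matrix formulas are those given in the text; the AUC function uses the standard rank interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half), which is equivalent to the area under the ROC curve. Function names are illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC computation is quadratic in the number of samples, which is fine for illustration but too slow for data sets of the size described here; a sort-based implementation would be used in practice.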
The results of the classification properties of the model are shown in Table 6.
Table 6: performance of the model
Figure BDA0002407299060000131
The results in Table 6 show that the model constructed in Example 1 has excellent predictive performance in the classification of drug-target interactions. The average performance of the single models based on the 6 different feature sets is: accuracy 0.851 and AUC 0.928 on the validation set; accuracy 0.854 and AUC 0.929 on the test set. The high predictive performance of each model ensures the reliability of the consensus model. The predictive performance of the consensus model is better than that of the single model: accuracy for the validation set was 0.833 and AUC was 0.912; accuracy for the test set was 0.826 and AUC was 0.949. These results indicate that the consensus model distinguishes more accurately between molecule-target interactions and non-interactions. That is, the small-molecule target prediction algorithm has excellent classification performance.
Example 3:
The small-molecule target prediction algorithm of Example 1 was applied to predict the targets of compounds in the test set. The implementation steps and results are as follows:
The prediction set is the positive set of the test set from S2 of Example 1, comprising 8,025 molecule-protein interaction pairs involving 7,719 active molecules. These molecules interact with 421 of the 859 proteins in the database: 7,432 interact with only one protein target, 269 with two proteins, 17 with three proteins, and 1 with four proteins. The application method is shown in FIG. 3 and comprises the following steps:
(1) The ECFP4 and MACCS fingerprints of each molecule are extracted using the cheminformatics package RDKit; the Mol2d descriptor of the molecule is extracted using the PyBioMed package.
(2) For each molecule, the three types of molecular features are combined pairwise with the Proa and Prob features of each protein in the model to obtain the 6 classes of combined features: ECFP4_Proa, ECFP4_Prob, Mol2d_Proa, Mol2d_Prob, MACCS_Proa and MACCS_Prob.
(3) For each model, each query molecule yields 859 prediction samples. The samples are input into each model to obtain predicted probability values, and the average of these probability values is the prediction of the consensus model, that is, the predicted probability of the molecule for each of the 859 targets.
(4) The 859 protein targets are sorted by probability to obtain the predicted target profile of each small molecule.
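Steps (3) and (4) can be sketched as an averaging step followed by a sort. The sketch assumes that the per-model probabilities for one query molecule are already available as dictionaries keyed by target identifier; the function name `rank_targets` and the data layout are illustrative, not taken from the source.

```python
def rank_targets(per_model_probs):
    """Rank targets for one molecule by consensus probability.

    per_model_probs: {model_name: {target_id: probability}}.
    Returns [(target_id, consensus_probability), ...], most likely first.
    """
    models = list(per_model_probs.values())
    if not models:
        raise ValueError("need at least one model")
    targets = models[0].keys()
    consensus = {
        t: sum(m[t] for m in models) / len(models)
        for t in targets
    }
    return sorted(consensus.items(), key=lambda item: item[1], reverse=True)
```

For the model described here, `per_model_probs` would hold six entries (one per combined feature representation), each covering the same 859 targets.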
The results of applying the model are shown in FIG. 4: 36.22% of the real targets rank first in the prediction list, 56.44% rank within the top five, and 64.61% rank within the top ten. This result suggests that the algorithm has potential for predicting the targets of molecules.
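The top-k percentages reported above can be computed as in the sketch below. It assumes the denominator is the number of true molecule-protein pairs rather than the number of molecules, which matches the way Example 4 reports its recovery rates (for instance 22 of 87 pairs). The function name and data layout are illustrative.

```python
def top_k_recovery(ranked_lists, true_pairs, k):
    """Fraction of true (molecule, target) pairs whose target is in the top k.

    ranked_lists: {molecule_id: [target_id, ...]}, most likely target first.
    true_pairs: iterable of (molecule_id, target_id).
    """
    true_pairs = list(true_pairs)
    if not true_pairs:
        raise ValueError("need at least one true pair")
    hits = sum(target in ranked_lists[mol][:k] for mol, target in true_pairs)
    return hits / len(true_pairs)
```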
Example 4:
The small-molecule target prediction algorithm of Example 1 was applied to predict the protein targets of other, external compounds. The data sources and results are as follows:
Compounds were obtained from the PDSP Ki database (Psychoactive Drug Screening Program Ki Database) and the natural product database NPASS (Natural Product Activity & Species Source Database). From the PDSP Ki database, 87 molecule-protein interaction pairs involving 56 molecules were used. These molecules interact with 24 proteins of the database; 36, 9, and 11 of them interact with 1, 2, and 3 targets, respectively. From the NPASS database, 44 molecule-protein interaction pairs involving 36 molecules were used. These molecules interact with 34 proteins of the database; 30 interact with only 1 protein target, 3 with 2 proteins, 1 with 3 proteins, and 1 with 5 proteins.
The target prediction procedure is the same as in Example 3, and the results are shown in Table 7.
Table 7: Prediction results of the model on external molecules from PDSP Ki and NPASS
[Table 7 appears as an image in the original publication: Figure BDA0002407299060000151.]
The results in Table 7 show the following. For the molecules from the PDSP Ki database, 22 real targets rank first in the predicted target list, a recovery rate of 25.29%, and 48 real targets rank within the top ten, a recovery rate of 55.17%. For the molecules from the NPASS database, 12 real targets rank first in the predicted target list, a recovery rate of 27.27%, and 21 real targets rank within the top ten, a recovery rate of 47.73%. This result suggests that the algorithm has potential for predicting the targets of external molecules, including natural product molecules.
The above description is only of the preferred embodiment of the present invention, and is not intended to limit the present invention in any way. While the invention has been described in terms of preferred embodiments, it is not intended to be limiting. Any person skilled in the art can make many possible variations and modifications to the technical solution of the present invention or equivalent embodiments using the method and technical solution disclosed above without departing from the spirit and technical solution of the present invention. Therefore, any simple modification, equivalent substitution, equivalent variation and modification of the above embodiments according to the technical substance of the present invention, which do not depart from the technical solution of the present invention, still fall within the scope of the technical solution of the present invention.

Claims (9)

1. The method for constructing the small molecular target prediction model based on chemical genomics is characterized in that the small molecular target prediction algorithm is constructed by adopting the following method:
S1, collecting modeling data: collecting human protein data, and taking as modeling samples the ligand-protein interaction pairs formed by the collected human proteins and the ligands that interact with them;
S2, batch division of data sets: dividing all ligand-protein interaction samples into a training set and a test set in proportion;
S3, batched division of positive and negative data sets: taking ligand-protein interaction pairs in the modeling samples with activity values lower than 0.1 μM as positive samples, and ligand-protein interaction pairs with activity values higher than 0.1 μM as negative samples;
S4, combination of ligand and protein features: selecting the ECFP4 fingerprint, the MACCS fingerprint, and the Mol2d descriptor as representations of the ligand; selecting the Proa and Prob features as representations of the protein; combining the three ligand representations and the two protein representations pairwise to obtain 6 representations;
wherein the Proa features comprise structural features of the amino acid sequence, physicochemical property features, and protein chemometrics modeling descriptors; the Prob features comprise pairwise gene ontology similarity and sequence similarity information between proteins;
S5, constructing a model: constructing a model with the XGBoost algorithm for the samples under each of the 6 representations; and combining the 6 models to establish a consensus model, wherein the result of the consensus model is the average of the results of the 6 models.
2. The method according to claim 1, wherein the human protein data collected in S1 are single human proteins from the ChEMBL database.
3. The method according to claim 1, wherein the ligand-protein interaction pairs in S1 are derived from the ChEMBL and BindingDB databases.
4. The method according to claim 1, wherein the activity value in S3 is characterized by the half dissociation concentration Ki of the drug.
5. The method for constructing a small molecule target prediction model according to claim 1, wherein the Resnik algorithm is used to calculate the pairwise gene ontology similarity between proteins in the three domains of cellular component, molecular function, and biological process; and the sequence similarity of proteins is calculated using a BLOSUM62 local alignment algorithm.
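For orientation, the Resnik measure scores two ontology terms by the information content, -log p(t), of their most informative common ancestor. The sketch below shows only this term-level definition on a toy ontology. The term names, frequencies, and ancestor sets are invented for illustration, and the claim does not state how term-level scores are aggregated into a protein-level similarity, so that step is not shown.

```python
import math


def information_content(term, term_frequency):
    """IC(t) = -log p(t), where p(t) is the annotation frequency of t."""
    return -math.log(term_frequency[term])


def resnik_similarity(term_a, term_b, ancestors, term_frequency):
    """IC of the most informative common ancestor of two terms."""
    common = (ancestors[term_a] | {term_a}) & (ancestors[term_b] | {term_b})
    if not common:
        return 0.0
    return max(information_content(t, term_frequency) for t in common)


# Toy ontology (hypothetical): root -> binding -> {atp_binding, ion_binding}
toy_frequency = {"root": 1.0, "binding": 0.5, "atp_binding": 0.2, "ion_binding": 0.25}
toy_ancestors = {
    "root": set(),
    "binding": {"root"},
    "atp_binding": {"binding", "root"},
    "ion_binding": {"binding", "root"},
}

toy_similarity = resnik_similarity("atp_binding", "ion_binding", toy_ancestors, toy_frequency)
```

Here the two terms share the ancestors "binding" and "root", and the score is the information content of "binding", the more specific of the two.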
6. A chemical genomics-based small molecule target prediction method, implemented using a model constructed by the method of any one of claims 1-5.
7. The chemical genomics-based small molecule target prediction method of claim 6, wherein the method includes the following steps:
(1) calculating the ECFP4 fingerprint, the MACCS fingerprint, and the Mol2d descriptor of the test molecule;
(2) combining the three molecular descriptors with the Proa features and Prob features of each protein in the small molecule target prediction model to obtain 6 classes of features;
(3) inputting the prediction samples of the 6 classes of features into the target prediction model to obtain the consensus model's prediction probability values for all targets;
(4) ranking all targets by probability value to obtain a predicted drug target list for the small molecule.
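Steps (3) and (4) amount to averaging the six models' probabilities per target and sorting the targets by that average. A minimal sketch follows, assuming each model has already produced a probability for every target. The function name, model names, and target identifiers are hypothetical placeholders.

```python
def rank_targets(per_model_probs):
    """Rank targets by consensus probability.

    per_model_probs maps model name -> {target id -> probability}.
    Returns (target id, consensus probability) tuples, highest first.
    """
    model_names = list(per_model_probs)
    targets = per_model_probs[model_names[0]].keys()
    consensus = {
        target: sum(per_model_probs[m][target] for m in model_names) / len(model_names)
        for target in targets
    }
    return sorted(consensus.items(), key=lambda item: item[1], reverse=True)


# Two hypothetical models and three hypothetical targets, for brevity.
demo_probs = {
    "ECFP4+Proa": {"target_A": 0.2, "target_B": 0.9, "target_C": 0.5},
    "MACCS+Prob": {"target_A": 0.4, "target_B": 0.7, "target_C": 0.5},
}
demo_ranking = rank_targets(demo_probs)
```

The first entry of the returned list is the most probable predicted target for the small molecule.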
8. The chemical genomics-based small molecule target prediction method of claim 7, wherein in step (1), the ECFP4 fingerprint and the MACCS fingerprint of the test molecule are calculated using the RDKit package, and the Mol2d descriptor of the molecule is calculated using the PyBioMed package.
9. The chemical genomics-based small molecule target prediction method of claim 7, wherein each of the 6 classes of features in step (2) contains 1 × 859 ligand-protein interaction prediction samples; and
in step (3), the 859 prediction samples of each feature class are input into the corresponding model to obtain the consensus model's prediction probability values for the 859 targets, and the 859 targets are ranked by probability value to obtain the predicted target list of the small molecule.
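The 1 × 859 figure follows from pairing one test molecule with each of the 859 targets in the model. The sketch below illustrates this for a single feature class. It assumes that "combining" means concatenating the ligand vector with the protein vector, which the claims do not state explicitly, and the vectors and target identifiers are dummy placeholders.

```python
N_TARGETS = 859  # number of targets stated in claim 9


def build_prediction_samples(ligand_vector, protein_vectors):
    """Pair one ligand vector with every target's protein vector.

    Assumption: the combined representation is a plain concatenation.
    """
    return {
        target_id: list(ligand_vector) + list(protein_vector)
        for target_id, protein_vector in protein_vectors.items()
    }


# Dummy protein features: one short vector per hypothetical target.
dummy_protein_vectors = {f"target_{i}": [float(i), 0.0] for i in range(N_TARGETS)}
dummy_ligand_vector = [1, 0, 1]

samples = build_prediction_samples(dummy_ligand_vector, dummy_protein_vectors)
```

Repeating this for each of the six ligand-protein representation pairs gives six sets of 859 samples, one set per model.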
CN202010165489.3A 2020-03-11 2020-03-11 Small molecular target prediction algorithm based on chemical genomics and application thereof Active CN111383708B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010165489.3A CN111383708B (en) 2020-03-11 2020-03-11 Small molecular target prediction algorithm based on chemical genomics and application thereof

Publications (2)

Publication Number Publication Date
CN111383708A (en) 2020-07-07
CN111383708B (en) 2023-05-12

Family

ID=71218824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010165489.3A Active CN111383708B (en) 2020-03-11 2020-03-11 Small molecular target prediction algorithm based on chemical genomics and application thereof

Country Status (1)

Country Link
CN (1) CN111383708B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112133367A (en) * 2020-08-17 2020-12-25 Central South University Method and device for predicting interaction relation between medicine and target spot

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107038348A (en) * 2017-05-04 2017-08-11 Sichuan University Drug targets Forecasting Methodology based on protein ligands interaction finger-print
CN109872781A (en) * 2019-02-26 2019-06-11 Harbin Institute of Technology Drug target recognition methods based on Xgboost

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
BR112014032104A2 (en) * 2012-06-21 2017-08-01 Univ Georgetown method for identifying protein-drug interactions, and, computer product.
WO2016201575A1 (en) * 2015-06-17 2016-12-22 Uti Limited Partnership Systems and methods for predicting cardiotoxicity of molecular parameters of a compound based on machine learning algorithms
US20190050537A1 (en) * 2017-08-08 2019-02-14 International Business Machines Corporation Prediction and generation of hypotheses on relevant drug targets and mechanisms for adverse drug reactions

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Laurent Jacob, et al. Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics. 2008, Vol. 24 (No. 24), pp. 2149-2156. *
SMH Mahmub, et al. iDti-CSsmoteB: identification of drug-target interaction based on drug chemical structure and protein sequence using XGBoost with over-sampling technique SMOTE. IEEE Access. 2019, (No. 7), pp. 48699-48714. *
朱木春. Construction of a disease protein-ligand database and prediction of drug-target interactions. China Master's Theses Full-text Database (Medicine and Health Sciences). 2018, (No. 9), p. E059-20. *
闫芳芳. Molecular dynamics study of the mechanism of ligand-target protein interaction. China Master's Theses Full-text Database (Medicine and Health Sciences). 2019, (No. 9), pp. 16-23. *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant