CN106529205B

CN106529205B - Drug-target relationship prediction method based on drug substructure and molecular character description information

Info

Publication number: CN106529205B
Application number: CN201610953873.3A
Authority: CN
Inventors: 王建新; 严承; 王伟平; 李敏
Original assignee: Central South University
Current assignee: HUNAN CREATOR INFORMATION TECHNOLOGIES Co.,Ltd.
Priority date: 2016-11-03
Filing date: 2016-11-03
Publication date: 2019-03-26
Anticipated expiration: 2036-11-03
Also published as: CN106529205A

Abstract

The invention discloses a drug-target relationship prediction method based on drug substructure and molecular character description information. Drug substructure information, molecular character description information and known drug-target relationships are first obtained from databases. A separate drug-drug similarity matrix is then constructed from each of these three sources, and the individual matrices are combined into a final drug similarity matrix according to a set of weights. Finally, the target relationships of a drug are predicted on the principle that similar drugs tend to act on similar targets. The method needs only the drug's molecular character description and substructure information to construct similarity, does not depend on information such as target sequences, and can predict target relationships for completely new drug compounds, avoiding the large amount of manpower and material resources consumed by biochemical experiments. Experimental results show that the method predicts drug-target relationships accurately.

Description

Drug-target relationship prediction method based on drug substructure and molecular character description information

Technical field

The invention belongs to the field of systems biology and relates to a drug-target relationship prediction method based on drug substructure and molecular character description information.

Background technique

At present, a drug target refers to a biological macromolecule in the body, such as certain proteins and nucleic acids, that has a pharmacodynamic function and can be acted on by a drug; the genes encoding target proteins are also called target genes. Identifying the target molecules associated with a specific disease is the basis of modern new drug development, so identifying drug-target interactions has become an important foundational step in drug development. Although drug-target interactions can be identified by bioassay, such experiments are expensive, time-consuming and challenging for current drug development. With the development of computing technology, various computational models have therefore emerged to predict potential drug-target associations on a large scale.

At present, drug-target relationship prediction methods fall into three main categories:

(1) Biological experimental assay methods

This traditional mode of drug research and development achieved some success in its early period. Since the beginning of the new century, however, this mode, based on "one gene, one drug, one disease", has faced many intractable challenges, such as a high clinical attrition rate, an overly long development cycle, and the need for large amounts of manpower, money and materials for experiments.

(2) Network-based prediction methods

These methods rest on the assumption that similar drugs tend to act on similar targets. By integrating information such as drug similarity networks, target similarity networks, known DTIs (drug-target interactions) and drug side-effect networks, they can predict potential drug-target relationships quickly and conveniently and provide important help for drug repositioning. Network-based methods have become a powerful tool for predicting potential drug-target associations.

For example, the NRWRH method integrates the drug-drug similarity network, the protein-protein similarity network and the known drug-target interaction network into one heterogeneous network, and predicts drug-target relationships by random walk with restart on that network. Unlike traditional random walk methods, it integrates three networks and can predict in both directions, from drug to target and from target to drug.

In addition, three typical inference models, DBSI, TBSI and NBI, have been compared. They infer drug-target associations based respectively on the drug structural similarity calculated by SIMCOMP, the target similarity given by Smith-Waterman scores, and the network topology similarity of the DTI network. The recently proposed SDTNBI method provides a prediction model for new drug compounds and achieves good results, but the drug information it integrates still needs further improvement.

(3) Machine-learning-based prediction methods

Machine-learning-based methods currently make predictions by integrating the chemical structures of drugs, the sequence information of target proteins and the known drug-target relationships, and are divided into supervised and semi-supervised learning methods. In supervised learning, positive and negative samples are determined by whether a known association currently exists. This raises a negative-sample selection problem, because experimentally verified data can only confirm that an association exists and cannot confirm that one does not. A typical example is the BLM method, which uses support vector machines to obtain predicted values from the drug side and the target side separately and then averages them to obtain the final prediction score; its inaccurate negative-sample selection, however, affects prediction accuracy to a large extent. To address this problem, semi-supervised learning methods that combine a small amount of labeled data with a large amount of unlabeled data have been proposed. A typical example is NetLapRLS, which integrates drug chemical information, target gene information and known associations to predict new relationships; by using both labeled and unlabeled relationships rather than labeled relationships alone, it improves prediction accuracy.

Although the above methods have been applied successfully to predicting associations for existing drugs and targets and to drug repositioning, they have shortcomings: some cannot predict target associations for new chemical entities, while others can make such predictions for new compounds but would need to integrate more drug information to obtain better results. Such prediction is very important for drug development and further research.

Therefore, it is necessary to design a new method for predicting target associations for new compounds.

Summary of the invention

The technical problem to be solved by the present invention is, in view of the deficiencies of the prior art, to provide a drug-target relationship prediction method based on drug substructure and molecular character description information that can accurately predict the target relationships of a drug and effectively avoid the large amount of manpower and material resources consumed by biochemical experiments.

The technical solution of the invention is as follows:

A drug-target relationship prediction method based on drug substructure and molecular character description information, comprising the following steps:

Step 1: construct the drug substructure similarity matrix S_SubSim from the substructure information of all drugs;

Step 2: construct the drug molecular character description similarity matrix S_SmiSim from the molecular character description information (SMILES) of all drugs;

Step 3: determine whether the drug to be predicted has known drug-target relationships; if it does, construct the drug-target relationship similarity matrix S_DTISim from the known drug-target relationships (DTI) of the drugs; if it does not, do not construct S_DTISim;

Step 4: integrate the drug similarity matrices constructed above into the final drug similarity matrix S_Sim;

Step 5: calculate the relationship score between the drug to be predicted and each target from the known drug-target relationships of the other drugs similar to it; rank the scores, where a higher rank means that the drug-target pair is more likely to have a relationship.

The detailed process of step 1 is:

First, define {drug_i, i=1,2,...,m} as the set of all drugs, where m is the number of drugs; d_i is the substructure feature vector of the i-th drug drug_i, and the number of feature values in the vector equals the substructure dimension K; if the drug contains a given substructure, the corresponding feature value is 1, otherwise it is 0;

Then, the structural similarity of drug_i and drug_j is calculated as the cosine correlation coefficient of their substructure feature vectors. The calculation formula is as follows:

Wherein d_ik and d_jk respectively denote the k-th feature values of the substructure feature vectors d_i and d_j of drug_i and drug_j; W_k is the weight of the k-th substructure, calculated as follows:

Wherein f_k is the frequency with which the k-th substructure occurs across all drugs, δ is the corresponding standard deviation, and h is a preset parameter (set to 0.1 in this study). The purpose of the weight is that a substructure with a low frequency of occurrence carries a higher proportion than a high-frequency substructure when the drug substructure similarity is calculated;

Finally, the structural similarities of all drug pairs form the drug substructure similarity matrix S_SubSim, in which the structural similarity of drug_i and drug_j is the element in row i, column j.
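Step 1 can be sketched as follows. This is a minimal illustration, not the patented formula: it assumes the weight W_k multiplies the k-th term of both the dot product and the norms of the cosine coefficient, and it takes the weights as an input because the weight formula itself is not reproduced in this text. The function names are hypothetical.

```python
import math


def weighted_cosine(d_i, d_j, weights):
    """Weighted cosine similarity of two binary substructure vectors.

    Assumption: W_k scales the k-th term in the numerator and in both norms.
    """
    if not (len(d_i) == len(d_j) == len(weights)):
        raise ValueError("vectors and weights must all have dimension K")
    dot = sum(w * a * b for w, a, b in zip(weights, d_i, d_j))
    norm_i = math.sqrt(sum(w * a * a for w, a in zip(weights, d_i)))
    norm_j = math.sqrt(sum(w * b * b for w, b in zip(weights, d_j)))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # a drug with no substructure bits set is similar to nothing
    return dot / (norm_i * norm_j)


def substructure_similarity_matrix(fingerprints, weights):
    """Build S_SubSim as an m x m nested list from m fingerprint vectors."""
    m = len(fingerprints)
    return [
        [weighted_cosine(fingerprints[i], fingerprints[j], weights) for j in range(m)]
        for i in range(m)
    ]
```

With uniform weights the function reduces to the ordinary cosine coefficient, which makes it easy to check against a hand calculation before plugging in frequency-based weights.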

The detailed process of step 2 is:

First, a LINGO dictionary set of substrings of length 4 is extracted from the molecular character description of each drug (for example, the molecular character description of drug DB00217 in the DrugBank database is "CN/C (=N C)/NCc1ccccc1", and the LINGO dictionary set extracted from it includes "CN/C", "N/C(", "/C(=", etc.); each element of a LINGO dictionary set is treated as a term; the LINGO dictionary set of the i-th drug is denoted D_i; the union of the LINGO dictionary sets of all drugs is denoted D, that is:

D=D₁∪......∪D_m;

Then, the weight idf(t, D) of each term in D is calculated; the more frequently a term occurs in the molecular character descriptions of all drugs, the lower its weight. The calculation formula is as follows:

Wherein t is a specific term in the set D; M is the total number of molecular character descriptions, which equals the number of drugs; m here is the number of molecular character descriptions that contain the term. The weights of all terms in D are thus obtained;

The molecular character description similarity of any two drugs drug_i and drug_j is then calculated according to the following formula:

Finally, the molecular character description similarities of all drug pairs form the molecular character description similarity matrix S_SmiSim, in which the similarity of drug_i and drug_j is the element in row i, column j.
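The LINGO extraction and term weighting of step 2 can be illustrated with the sketch below. Two parts are assumptions, since the formulas are not reproduced in this text: the weight is taken to be the standard inverse document frequency log(M/m), and the pairwise similarity is taken to be an idf-weighted Tanimoto overlap of the two LINGO sets. The function names are hypothetical, and the sketch does no SMILES normalization.

```python
import math
from collections import Counter


def lingos(smiles, q=4):
    """Return all overlapping substrings of length q, in order of occurrence."""
    return [smiles[i:i + q] for i in range(len(smiles) - q + 1)]


def idf_weights(term_sets):
    """Assumed weight: idf(t, D) = log(M / m), with M drugs and m drugs containing t."""
    M = len(term_sets)
    doc_freq = Counter(t for terms in term_sets for t in terms)
    return {t: math.log(M / n) for t, n in doc_freq.items()}


def smiles_similarity(terms_i, terms_j, idf):
    """Assumed similarity: idf-weighted Tanimoto overlap of two LINGO sets."""
    union_weight = sum(idf[t] for t in terms_i | terms_j)
    if union_weight == 0.0:
        return 0.0  # only terms shared by every drug, which carry no weight
    return sum(idf[t] for t in terms_i & terms_j) / union_weight
```

For a string beginning "CN/C(=", `lingos` yields "CN/C", "N/C(" and "/C(=" as its first three terms, matching the example in the text. A term that occurs in every drug gets weight log(1) = 0 under this assumption, so it does not contribute to any similarity.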

In step 3, the detailed process of constructing the drug-target relationship similarity matrix S_DTISim from the known target relationships (DTI) of the drugs is:

First, define {drug_i, i=1,2,...,m} as the set of all drugs, where m is the number of drugs; {target_l, l=1,2,...,n} is the set of all targets, where n is the number of targets; A is the known drug-target relationship matrix, and the element in row i, column l of A, denoted a_il, is the relationship value between the i-th drug drug_i and the l-th target target_l; if drug_i and target_l have a relationship, a_il is 1, otherwise it is 0;

Then, the drug-target relationship similarity of drug_i and drug_j is calculated from matrix A. The calculation formula is as follows:

Wherein the function sign(a_il, a_jl) returns 1 if either of a_il and a_jl is 1, and returns 0 otherwise.

Finally, the drug-target relationship similarities of all drug pairs form the drug-target relationship similarity matrix S_DTISim, in which the similarity of drug_i and drug_j is the element in row i, column j.
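The drug-target relationship similarity can be sketched as a Jaccard-style ratio. This form is an inference rather than a quotation of the formula: it counts shared targets in the numerator and, using the sign function described above, targets hit by either drug in the denominator, which agrees with the embodiment's worked example (4 shared targets out of 6 distinct targets giving 0.6667). The function name is hypothetical.

```python
def dti_similarity(a_i, a_j):
    """Shared targets divided by targets hit by either drug (rows of matrix A)."""
    shared = sum(1 for x, y in zip(a_i, a_j) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a_i, a_j) if x == 1 or y == 1)  # sign(a_il, a_jl)
    return shared / either if either else 0.0


# Two hypothetical drugs covering 6 distinct targets, 4 of them shared.
row_a = [1, 1, 1, 1, 1, 0, 0]
row_b = [1, 1, 1, 1, 0, 1, 0]
print(round(dti_similarity(row_a, row_b), 4))
```

A drug with no known targets gets similarity 0 to every other drug here, which is one reason the method skips S_DTISim entirely for completely new drugs.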

The detailed process of step 4 is: the substructure similarity, the molecular character description similarity and the drug-target relationship similarity are integrated with the weights (α, β, 1-α-β). When the drug to be predicted has no known target relationships (a completely new drug), there is no S_DTISim similarity data, so the final drug similarity of drug_i and drug_j is calculated as:

When the drug to be predicted has known target relationships, the final drug similarity of drug_i and drug_j is calculated as:

Wherein 0 < α, β < 1;

The final drug similarities of all drug pairs form the final drug similarity matrix S_Sim, in which the final similarity of drug_i and drug_j is the element in row i, column j.
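The weighted integration of step 4 can be sketched as below. The assignment of weights is an assumption drawn from the surrounding text: α is applied to the substructure similarity, β to the molecular character description similarity and 1-α-β to the drug-target relationship similarity, while for a completely new drug the two remaining similarities are assumed to be combined with α and 1-α. The function name and default values are illustrative only.

```python
def integrate_similarity(s_sub, s_smi, s_dti=None, alpha=0.1, beta=0.1):
    """Combine similarity matrices (nested lists) into S_Sim.

    Assumed forms:
      known drug: alpha*S_SubSim + beta*S_SmiSim + (1-alpha-beta)*S_DTISim
      new drug  : alpha*S_SubSim + (1-alpha)*S_SmiSim   (pass s_dti=None)
    """
    m = len(s_sub)
    if s_dti is None:
        w_sub, w_smi, w_dti = alpha, 1.0 - alpha, 0.0
        s_dti = [[0.0] * m for _ in range(m)]
    else:
        w_sub, w_smi, w_dti = alpha, beta, 1.0 - alpha - beta
    return [
        [w_sub * s_sub[i][j] + w_smi * s_smi[i][j] + w_dti * s_dti[i][j] for j in range(m)]
        for i in range(m)
    ]
```

Passing `s_dti=None` keeps a single code path for both prediction cases, so the scoring step does not need to know which kind of drug is being predicted.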

The detailed process of step 5 is: the relationship scores between drugs and targets are predicted from the final drug similarity matrix S_Sim;

The relationship score of drug_i and target_l is:

Wherein a_jl is the relationship value between drug_j and target_l in the known drug-target relationship matrix A. The relationship score between the drug to be predicted, drug_i, and target target_l is determined by whether the drugs similar to drug_i have a relationship with target_l. The parameters Threshold and Ksim determine the range of drugs similar to drug_i that are used when the score is calculated: the former restricts the calculation to drugs whose similarity to drug_i is greater than Threshold, while the latter takes the drugs ranked in the top Ksim by similarity to drug_i; drug_iKsim denotes the set of drugs ranked in the top Ksim by similarity to drug_i. A drug takes part in the calculation as long as it satisfies either of the two criteria. The values of Threshold and Ksim can be obtained by cross-validation.
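The neighbor selection and scoring of step 5 can be sketched as follows. The selection rule follows the text (a drug takes part if its similarity exceeds Threshold or if it ranks in the top Ksim). The score itself is an assumption: it is taken to be the unnormalized sum of the similarities of the selected drugs that are known to hit the target, and the drug being predicted is excluded from its own neighborhood. Function names are hypothetical.

```python
def select_neighbours(sim_row, i, threshold, ksim):
    """Indices of drugs similar to drug i: above Threshold OR in the top Ksim."""
    candidates = [j for j in range(len(sim_row)) if j != i]
    above = {j for j in candidates if sim_row[j] > threshold}
    ranked = sorted(candidates, key=lambda j: sim_row[j], reverse=True)
    return above | set(ranked[:ksim])


def relation_scores(S, A, i, threshold, ksim):
    """Assumed score: sum over selected drugs j of S[i][j] * A[j][l], per target l."""
    neighbours = select_neighbours(S[i], i, threshold, ksim)
    n_targets = len(A[0])
    return [sum(S[i][j] * A[j][l] for j in neighbours) for l in range(n_targets)]


def rank_targets(scores):
    """Target indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda l: scores[l], reverse=True)
```

Under this reading, Ksim=0 contributes no rank-based neighbors, so only the Threshold criterion applies; with Threshold=0.0 as well, every drug with a positive similarity takes part.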

In the present invention the number of drugs is m, comprising 1 drug to be predicted and m-1 drugs with known drug-target relationships. The drug to be predicted may itself have known drug-target relationships, or may have none (a completely new drug). If it has known drug-target relationships, the drug-target relationship similarity matrix S_DTISim can be constructed in step 3 from the known drug-target relationships of the m drugs, and in step 5 the scores of its unknown drug-target relationships are calculated from the final drug similarity matrix that integrates S_DTISim, thereby predicting its unknown drug-target relationships. If it has no known drug-target relationships (a completely new drug), S_DTISim is not constructed in step 3, and in step 5 the scores of its potential drug-target relationships are calculated from the final drug similarity matrix that does not integrate S_DTISim, thereby predicting its potential drug-target relationships.

Beneficial effects:

Based on existing drug-target associations and on the molecular character description information and substructure information of drugs, the present invention proposes a drug-target relationship prediction method. The molecular character description information (SMILES) refers to the simplified molecular input line entry specification, a specification that describes molecular structure with a character string. The method constructs drug similarity matrices from known drug-target relationships, substructure information and SMILES strings, integrates them according to weights, and, on the principle that similar drugs act on similar targets, can accurately predict the target relationships of a drug. The invention needs only the drug's molecular character description and substructure information to construct similarity, does not depend on information such as target sequences, and can predict target relationships for completely new drug compounds, avoiding the large amount of manpower and material resources consumed by biochemical experiments. Predictions fall into two classes: one for drugs with known drug-target relationships, the other for completely new drug compounds. For the former, the similarity matrix is constructed from known drug-target relationships, substructure similarity and SMILES string similarity; for the latter, it is constructed from substructure similarity and SMILES string similarity only. For a drug with known drug-target relationships, the invention can predict its relationships with other targets; for a completely new drug (one without known drug-target relationships), it can predict its relationships with every target.

The invention overcomes the limitation that drug-target relationships could not be predicted for completely new compounds, enriches the drug information integrated in the SDTNBI method and further improves its prediction performance. It does not rely on specific biochemical experimental conditions to predict drug-target relationships, and provides important reference information for drug repositioning and for the further research needed in drug development. By using a prediction model different from the prior art, one that predicts on the principle that similar drugs act on similar targets, it reduces the prediction bias caused by missing attributes in the prior art and obtains better prediction results.

Brief description of the drawings

Fig. 1 is the overall flow chart of drug-target relationship prediction based on drug substructure and molecular character description information;

Fig. 2 compares the ten-fold cross-validation results of the present invention and the SDTNBI method; Fig. 2(a) to Fig. 2(e) show the comparison on the data sets GPCRs, Kinases, ICs, NRs and Global, respectively;

Fig. 3 compares the external validation results of the present invention and the SDTNBI method; Fig. 3(a) and Fig. 3(b) show the comparison on the data sets GPCRs and Kinases, respectively.

Specific embodiment

The present invention is described in further details below with reference to the drawings and specific embodiments:

Embodiment 1:

For a drug with some known drug-target relationships, a hybrid drug similarity is constructed from the drug-target relationships, substructures and molecular character description information, and the target relationships of the drug are finally predicted. For a completely new compound, the hybrid similarity is constructed only from the molecular character description information and chemical substructure information, and the target relationships of the compound are finally predicted. For the prediction of known drugs, a benchmark data set is used that provides the substructures, SMILES strings and known target relationships of all its drugs, and new drug-target relationships are predicted by the integrated similarity model. For the prediction of new compounds, the substructure and molecular character description information of the compound is used to calculate its similarity to the drugs in the benchmark data set, and its target relationships are predicted from their known information.

In total, five benchmark data sets, GPCRs (G protein-coupled receptors), Kinases (kinases), ICs (ion channels), NRs (nuclear receptors) and Global, were collected from the ChEMBL and BindingDB databases, while the target relationship prediction for completely new drugs uses the external data sets ExGPCRs and ExKinases, collected from the DrugBank database.

The whole flow of drug-target relationship prediction based on drug substructure and molecular character description information is shown in Fig. 1 and can be divided into the following steps:

(1) Construct the drug substructure similarity matrix S_SubSim from the substructure information of all drugs. The drug substructure information comprises the following seven types: CDK fingerprint (CDK), CDK extended fingerprint (CDKExt), CDK graph-only fingerprint (Graph), MACCS fingerprint (MACCS), PubChem fingerprint (PubChem), Substructure fingerprint (FP4) and Klekota-Roth fingerprint (KR). The drug substructure data used in this study were calculated with the PaDEL-Descriptor software (version 2.18).

Define {drug_i, i=1,2,...,m} as the set of all drugs, where m is the number of drugs; d_i is the substructure feature vector of the i-th drug drug_i, and the number of feature values in the vector equals the substructure dimension K (for example, the dimension of the MACCS substructure is 153); if the drug contains a given substructure, the corresponding feature value is 1, otherwise it is 0. The structural similarity of drug_i and drug_j is the cosine correlation coefficient of their substructure vectors, given by the following formula:

Wherein d_ik and d_jk respectively denote the k-th feature values of the substructure feature vectors d_i and d_j of drug_i and drug_j; W_k is the weight of the k-th substructure, calculated as follows:

The GPCRs data set contains 4741 drugs in total and the MACCS substructure dimension is 153; after the weights {W_k} are calculated as above, the similarity of Drug105250 and Drug100109 is 0.0408.

(2) Calculate the similarity S_SmiSim from the molecular character description information (SMILES) of the drugs. Each SMILES string is split into a LINGO dictionary set of substrings of length 4 (for example, in GPCRs the SMILES string of Drug81951 is "(NC (=O) OCC [N+] (C) (C) C", and the LINGO dictionary set extracted from it includes "(NC(", "NC(=", "C(=O", etc.). Each element of a LINGO dictionary set is treated as a term, the LINGO dictionary set of the i-th drug is denoted D_i, and the total term set of all drugs, denoted D, is the union of the terms in the D_i of all drugs, defined as:

D=D₁∪......∪D_m (3)

Finally, the molecular character description similarities of all drug pairs form the molecular character description similarity matrix S_SmiSim, in which the similarity of drug_i and drug_j is the element in row i, column j. When predicting for a known drug that has drug-target relationships, the similarity of its existing drug-target relationships also needs to be integrated.

Before the DTI similarity is constructed, a drug-target relationship network is built. In the drug-target interaction set, define D={drug_i, i=1,2,...,m} as the set of all drugs, where m is the number of drugs, and T={target_l, l=1,2,...,n} as the set of all targets, where n is the number of targets. Following the bipartite network principle, the drug-target interactions can be represented as a bipartite drug-target network, wherein E={e_il: drug_i∈D, target_l∈T}; if an experimentally determined interaction exists between drug_i and target_l, they are connected by a solid line (an edge). Mathematically, the bipartite drug-target network can be expressed as an m×n adjacency matrix {a_il}, where a_il=1 indicates that an experimentally determined interaction exists between drug_i and target_l, and a_il=0 otherwise.
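As an illustration of the adjacency matrix just described, the sketch below builds the m×n binary matrix from a list of known interactions. The input format (identifier lists plus drug-target pairs) and the function name are assumptions made for the example; the identifiers used are placeholders.

```python
def build_adjacency(drugs, targets, interactions):
    """Return the m x n matrix {a_il}: 1 where (drug_i, target_l) is a known pair."""
    drug_index = {d: i for i, d in enumerate(drugs)}
    target_index = {t: l for l, t in enumerate(targets)}
    A = [[0] * len(targets) for _ in drugs]
    for drug, target in interactions:
        A[drug_index[drug]][target_index[target]] = 1
    return A


# Hypothetical identifiers, for illustration only.
A = build_adjacency(
    ["drugA", "drugB"],
    ["target1", "target2", "target3"],
    [("drugA", "target1"), ("drugB", "target1"), ("drugB", "target3")],
)
```

Each row of the result is the target profile of one drug, which is the input the DTI similarity calculation works on.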

The similarity is calculated from the known DTI data; the drug-target relationship similarity of drug_i and drug_j is calculated by the following formula:

Wherein the function sign(a_il, a_jl) returns 1 if either of a_il and a_jl is 1, and returns 0 otherwise. The drug-target relationship similarities of all drug pairs form the drug-target relationship similarity matrix S_DTISim, in which the similarity of drug_i and drug_j is the element in row i, column j.

For example, in the GPCRs data set, the drug-target relationships of drug82068 and drug82198 involve 6 distinct targets in total, of which 4 are shared, so their target relationship similarity is 4/6 = 0.6667.

(3) Integrate the substructure similarity, the molecular character description similarity and the drug-target relationship similarity with the weights (α, β, 1-α-β). Depending on the available similarity data and the prediction case, the calculation formulas are as follows:

When a completely new compound is predicted, there is no S_DTISim similarity data, so the final similarity of drug_i and drug_j is calculated as:

When an existing drug is predicted, the final similarity of drug_i and drug_j is calculated as:

(4) Based on the final drug similarity matrix S_Sim and the idea of drug similarity inference (if a drug interacts with a target protein, then drugs similar to it are also likely to act on that target), the relationship score between the drug to be predicted, drug_i, and target target_l is:

Wherein a_jl is the relationship value between drug_j and target_l in the known drug-target relationship matrix A. The relationship score between the drug to be predicted, drug_i, and target target_l is determined by whether the drugs similar to drug_i have a relationship with target_l. The parameters Threshold and Ksim determine the range of drugs similar to drug_i that are used when the score is calculated: the former restricts the calculation to drugs whose similarity to drug_i is greater than Threshold, while the latter takes the drugs ranked in the top Ksim by similarity to drug_i; drug_iKsim denotes the set of drugs ranked in the top Ksim by similarity to drug_i. A drug takes part in the calculation as long as it satisfies either of the two criteria.

To verify the validity of the method, two kinds of validation were carried out. One is internal validation, in which prediction is validated by ten-fold cross-validation on the five benchmark data sets GPCRs, Kinases, ICs, NRs and Global. The other is external validation, in which the drugs of two external data sets from DrugBank, GPCRs and Kinases, are predicted as completely new drugs against their corresponding benchmark data sets.
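The internal validation above can be sketched as a ten-fold split of the known drug-target pairs, with each fold's pairs withheld from the training data and used as the test set. How the original study formed its folds is not stated here, so the random shuffling, the seed and the function name are all assumptions of this sketch.

```python
import random


def ten_fold_splits(pairs, k=10, seed=0):
    """Yield (train, test) lists of known drug-target pairs for k-fold validation."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # assumed: folds are drawn at random
    folds = [shuffled[f::k] for f in range(k)]
    for f in range(k):
        test = folds[f]
        # Training data is every known pair that is not held out in this fold.
        train = [p for g, fold in enumerate(folds) if g != f for p in fold]
        yield train, test
```

Each known pair appears in exactly one test fold, so every relationship is predicted once from a model that did not see it.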

The data sets, including the validation data sets, are summarized in Table 1 below, where N_d, N_t and N_dt are respectively the numbers of drugs, targets and drug-target relationships in each data set, and Sparsity is the ratio of N_dt to the number of all possible drug-target relationships.

Table 1: Summary of the data sets

To assess the accuracy of the prediction method, for each pair of training and test sets in cross-validation, the relationship data corresponding to all nodes in the test set are deleted from the training set, and the model's predictions are then compared with the DTI relationships in the test set. For each drug taking part in the assessment, the targets are ranked by predicted drug-target relationship score and then compared with the DTI relationships in the test set. Several evaluation indices are used to show the precision and robustness of the method: precision (P), recall (R), precision enhancement (e_p) and recall enhancement (e_r). They are briefly described as follows:

Wherein M and N are the numbers of drugs and targets taking part in the prediction, X is the total number of DTI relationships deleted for the M drugs, and X_i is the number of DTI relationships deleted for drug_i. X_i(L) is the number of correct DTI relationships among the top L entries of the prediction list of drug_i. In addition, the performance of the algorithm is also reflected by the AUC (area under the ROC curve).
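The four indices can be illustrated with the sketch below. Their formulas are not reproduced in this text, so the code assumes the definitions commonly used for network-based inference evaluation: P is the mean of X_i(L)/L, R is the mean of X_i(L)/X_i, e_p = P·M·N/X and e_r = R·N/L. It also assumes every evaluated drug has at least one deleted relationship. The function name is hypothetical.

```python
def evaluate_top_l(hits_at_l, removed, L, n_targets):
    """Return (P, R, e_p, e_r) under the assumed definitions.

    hits_at_l[i] is X_i(L); removed[i] is X_i (assumed > 0); n_targets is N.
    """
    M = len(removed)
    X = sum(removed)
    precision = sum(h / L for h in hits_at_l) / M
    recall = sum(h / x for h, x in zip(hits_at_l, removed)) / M
    e_p = precision * M * n_targets / X  # gain over picking L targets at random
    e_r = recall * n_targets / L
    return precision, recall, e_p, e_r
```

Under these definitions an enhancement value of 1 corresponds to random ranking, so values well above 1 indicate that the ranked list concentrates true relationships near the top.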

Table 2 gives the performance indices for each data set after prediction in ten-fold cross-validation, with formula (6) used to integrate the similarities. The parameters for the GPCRs, ICs, Kinases and NRs data sets are (α=0.1, β=0.1, Ksim=100, Threshold=0.5), and those for the Global data set are (α=0.1, β=0.1, Ksim=50, Threshold=0.5). Analysis of the data in ten-fold cross-validation showed that the drug-target relationship similarity has the greatest influence on the final prediction, so its weight is set to 0.8 and the other two similarities are each given 0.1. To eliminate low-similarity noise and obtain better predictions, the threshold is set to 0.5. Besides the threshold of 0.5, the prediction model also limits the number of neighbors with which each drug takes part in prediction; according to Table 1 the number of drugs in GPCRs, ICs, Kinases and NRs is comparatively larger than in Global, so Ksim is set to 100 for the former and 50 for the latter.

Table 2: Algorithm performance indices in ten-fold cross-validation

On the GPCRs data set, the indices P, R, e_p, e_r and AUC are fairly stable across the seven substructures and differ little; the AUC is at least 0.953 with a mean of 0.962, a good prediction result. On the Global data set the average indices are lower than on GPCRs: the AUC is worst with the FP4 substructure, at only 0.817, but lies between 0.923 and 0.938 with the other substructures. On the ICs data set the highest AUC reaches 0.971 and the lowest is 0.947. On the Kinases data set the results are more stable, with every AUC at least 0.961 and the highest 0.963. On the NRs data set performance is relatively poorer, with AUC between 0.909 and 0.928.

Table 3 gives the evaluation indices for each data set in external validation (using the GPCRs and Kinases data sets from the DrugBank database), where predictions are made for completely new compounds and formula (5) is used to integrate the similarities. The parameters for the external GPCRs data set are (α=0.5, Ksim=0, Threshold=0.1), where Ksim=0 means that no limit on similarity rank is applied; the parameters for the Kinases data set are (α=0.5, Ksim=0, Threshold=0.0), where Ksim=0 and Threshold=0.0 mean that drugs take part in the prediction regardless of their similarity value and rank. The substructure similarity and the molecular character description similarity are thus integrated with equal weight, the neighbor count Ksim is set to 0 (no restriction) throughout, and Threshold is set to 0.1 on the GPCRs data set and 0 (no restriction) on the Kinases data set.

Table 3: Algorithm performance indices in external validation

As Table 3 shows, because predictions for new compounds lack known drug-target relationships, the AUC values are lower than those of the ten-fold cross-validation in Table 2, which indirectly illustrates the importance of known DTI relationships. In validation on the GPCRs data set, the AUC reaches its highest value, 0.841, with the integrated FP4+smi prediction; on the Kinases data set, the highest AUC is obtained with the integrated KR+smi prediction. The other indices also reach fairly high values.

The data sets used in the present invention are those published with the SDTNBI method, which uses a network distribution model and defines the same performance indices as above, so the prediction performance of the present invention is compared with that of SDTNBI. Because the four indices precision (P), recall (R), precision enhancement (e_p) and recall enhancement (e_r) depend on the specific parameter L=20 used in validation and do not correspond closely to overall performance, they are left out of the comparison, and only the AUC, which does not depend on any such parameter, is compared.

Fig. 2 shows that, on the Global data set, both the maximum and the minimum AUC obtained by integrating different substructures are worse than the predictions of the SDTNBI method, and the mean AUC is 0.928 for SDTNBI versus 0.913 for the new method. This may be related to the density of known drug-target relationships in the Global data set: the ten-fold cross-validation for known drugs integrates the similarity information derived from known drug-target relationships, and, as described in Table 1, the Global data set is the only one in which the density of known drug-target relationships is below 1% (0.54%). This affects the similarity data constructed by the present invention and, in turn, the final prediction results.

On the other four data sets (GPCRs, Kinases, ICs, and NRs), the mean and minimum AUC obtained by integrating different substructures are clearly better than those of the SDTNBI method, and the maximum values differ little. This shows that on these four data sets the prediction results of the new method are more stable than those of the original SDTNBI method. In particular, when it is difficult to obtain multiple types of substructure information for a drug, the new method can be more reliable than the SDTNBI method.

In Fig. 3, in the validation test for predicting entirely new drugs, the results on the Kinases data set are slightly worse than those of SDTNBI: the mean AUC ratio is 0.853:0.848, the maximum AUC ratio is 0.863:0.856, and the minimum AUC ratio is 0.847:0.841. In the validation test on the GPCRs data set, the prediction performance of the new method improves on SDTNBI: the mean AUC ratio is 0.817:0.766, and the maximum and minimum AUC values are likewise better than those of SDTNBI.

The comparison in these two respects shows that, when the known drug-target relationships satisfy certain conditions, the new integrated method can provide more accurate prediction results than the original SDTNBI method, and it provides a meaningful basis for further research in drug repositioning and new drug development.

Claims

1. A drug-target relationship prediction method based on drug substructure and molecular character description information, characterized by comprising the following steps:

Step 2: construct the drug molecular character description information similarity matrix S_SmiSim from the molecular character description information of all drugs;

Step 3: determine whether the drug to be predicted has known drug-target relationships; if the drug to be predicted has known drug-target relationships, construct the drug-target relationship similarity matrix S_DTISim from the known drug-target relationships of the drugs; if the drug to be predicted has no known drug-target relationships, do not construct the drug-target relationship similarity matrix S_DTISim;

Step 5: calculate the relationship score between the drug to be predicted and each target from the known drug-target relationships of other drugs that are similar to the drug to be predicted; rank the scores, a higher-ranked score indicating a greater likelihood that the drug-target pair has a relationship.

2. The drug-target relationship prediction method based on drug substructure and molecular character description information according to claim 1, characterized in that the detailed process of step 1 is:

First, define {drug_i, i=1,2,...,m} as the set of all drugs, where m is the number of drugs; d_i is the substructure feature value vector of the i-th drug drug_i, and the number of feature values in the vector equals the substructure dimension K; if the drug contains a given substructure, the corresponding feature value is 1, otherwise it is 0;

Then, the structural similarity of drug_i and drug_j is calculated from the cosine correlation coefficient of their substructure feature value vectors; the calculation formula is as follows:

where d_ik and d_jk denote the k-th feature values of the substructure feature value vectors d_i and d_j of drug_i and drug_j, respectively; W_k is the weight of the k-th substructure, calculated as follows:

where f_k is the frequency with which the k-th substructure occurs in all drugs, δ is the standard deviation, and h is a preset parameter;

Finally, the structural similarities of all drug pairs, i, j=1,2,...,m, constitute the drug substructure similarity matrix S_SubSim; the structural similarity of drug_i and drug_j is the element in row i, column j of S_SubSim.
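The construction in this claim can be sketched as a weighted cosine similarity over binary substructure vectors. This is a minimal sketch under stated assumptions: the function name is hypothetical, the weights W_k are passed in as a ready-made vector `w` because the weight formula is not reproduced in this text, and the weighted cosine form (each term scaled by W_k in the numerator and in both norms) is one common choice rather than a confirmed reading of the claimed formula.

```python
import numpy as np

def weighted_cosine_similarity(D, w):
    """Weighted cosine similarity between all rows of D.

    D: (m, K) matrix, D[i, k] = 1 if drug i contains substructure k, else 0.
    w: (K,) non-negative substructure weights (assumed to be given).
    Returns an (m, m) similarity matrix.
    """
    D = np.asarray(D, dtype=float)
    w = np.asarray(w, dtype=float)
    Dw = D * np.sqrt(w)                  # scaling by sqrt(w) puts w in every dot product
    norms = np.linalg.norm(Dw, axis=1)
    norms[norms == 0] = 1.0              # a drug with no substructure gets similarity 0
    Dn = Dw / norms[:, None]
    return Dn @ Dn.T
```

With uniform weights this reduces to the ordinary cosine coefficient; raising the weight of a shared substructure raises the similarity of the two drugs that share it.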

3. The drug-target relationship prediction method based on drug substructure and molecular character description information according to claim 2, characterized in that the detailed process of step 2 is:

First, the LINGO Dictionary set of length 4 is extracted from the molecular character description information of each drug; each element of a LINGO Dictionary set is treated as a term; the LINGO Dictionary set of the i-th drug is denoted D_i; the union of the LINGO Dictionary sets of all drugs is denoted D, that is:

D = D_1 ∪ ... ∪ D_m;

Then, the weight idf(t, D) of each term in the set D is calculated; the calculation formula is as follows:

where t is a specific term in the set D, and M is the number of molecular character descriptions that contain the term;

Finally, the molecular character description similarities of all drug pairs, i, j=1,2,...,m, constitute the drug molecular character description information similarity matrix S_SmiSim; the similarity of drug_i and drug_j is the element in row i, column j of S_SmiSim.
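The steps of this claim can be sketched as follows, taking the molecular character description to be a SMILES string. Several points are assumptions not confirmed by this text: the function names are hypothetical, the idf weight is taken in the standard form log(m / M), each drug is represented by a term-frequency times idf vector, and the pairwise similarity is the cosine of those vectors. Any normalization of the SMILES string before extracting the length-4 substrings is omitted.

```python
import math
from collections import Counter

def lingos(smiles, q=4):
    """All overlapping substrings of length q of a SMILES string."""
    return [smiles[i:i + q] for i in range(len(smiles) - q + 1)]

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def smiles_similarity_matrix(smiles_list, q=4):
    """Pairwise tf-idf cosine similarity over LINGO terms (assumed form)."""
    m = len(smiles_list)
    counts = [Counter(lingos(s, q)) for s in smiles_list]
    # M for each term: number of descriptions that contain it
    doc_freq = Counter(t for c in counts for t in c)
    idf = {t: math.log(m / doc_freq[t]) for t in doc_freq}  # assumed idf formula
    vecs = [{t: tf * idf[t] for t, tf in c.items()} for c in counts]
    return [[cosine(vecs[i], vecs[j]) for j in range(m)] for i in range(m)]
```

Under this idf form a term that occurs in every drug gets weight zero, so only terms that distinguish drugs contribute to the similarity.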

4. The drug-target relationship prediction method based on drug substructure and molecular character description information according to claim 3, characterized in that, in step 3, the detailed process of constructing the drug-target relationship similarity matrix S_DTISim from the known target relationships of the drugs is:

First, define {drug_i, i=1,2,...,m} as the set of all drugs, where m is the number of drugs; {target_l, l=1,2,...,n} is the set of all targets, where n is the number of targets; A is the known drug-target relationship matrix, and the element in row i, column l of A is denoted a_il and represents the relation value between the i-th drug drug_i and the l-th target target_l; if drug_i and target_l have a relationship, a_il is 1, otherwise it is 0;

where the function sign(a_il, a_jl) returns 1 if either of a_il and a_jl is 1, and returns 0 otherwise;
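One way to read the role of sign(a_il, a_jl) is as the counter of targets related to at least one of the two drugs, which would make the similarity a Jaccard-style ratio of shared targets to the union of targets. The sketch below follows that reading; it is an assumption, since the similarity formula itself is not reproduced in this text, and the function name is hypothetical.

```python
import numpy as np

def dti_similarity_matrix(A):
    """Jaccard-style similarity between drugs from a binary drug-target matrix.

    A: (m, n) matrix with A[i, l] = 1 if drug i and target l are related.
    Assumed form: sum_l a_il * a_jl / sum_l sign(a_il, a_jl).
    """
    A = np.asarray(A, dtype=float)
    shared = A @ A.T                               # targets related to both drugs
    degree = A.sum(axis=1)
    union = degree[:, None] + degree[None, :] - shared  # equals sum_l sign(a_il, a_jl)
    S = np.zeros_like(shared)
    np.divide(shared, union, out=S, where=union > 0)    # 0 when neither drug has targets
    return S
```

A drug with no known targets has similarity 0 to every drug under this form, which matches the statement in step 3 that this matrix is not constructed for drugs without known relationships.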

5. The drug-target relationship prediction method based on drug substructure and molecular character description information according to claim 4, characterized in that the detailed process of step 4 is: the similarities of drug_i and drug_j constructed in steps 1 to 3 are integrated with the weights (α, β, 1-α-β): when the drug to be predicted has no known target relationships, the final drug similarity of drug_i and drug_j is calculated as:

When the drug to be predicted has known target relationships, the final drug similarity of drug_i and drug_j is calculated as:

where 0 < α, β < 1;

The final drug similarities of all drug pairs constitute the final drug similarity matrix S_Sim; the final drug similarity of drug_i and drug_j is the element in row i, column j of S_Sim.
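The integration step can be sketched as a weighted sum of the similarity matrices. The assignment of weights to matrices below is an assumption: α is applied to the substructure similarity, β to the molecular character description similarity, and 1-α-β to the drug-target relationship similarity, and for a drug without known target relationships the two remaining matrices are assumed to be combined with α and 1-α. The function name and default values are hypothetical.

```python
import numpy as np

def integrate_similarities(S_sub, S_smi, S_dti=None, alpha=0.5, beta=0.3):
    """Weighted integration of drug similarity matrices (assumed weight order)."""
    S_sub = np.asarray(S_sub, dtype=float)
    S_smi = np.asarray(S_smi, dtype=float)
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie in (0, 1)")
    if S_dti is None:
        # Drug without known target relationships: two-matrix case (assumed form)
        return alpha * S_sub + (1 - alpha) * S_smi
    if not 0 < beta < 1 or alpha + beta > 1:
        raise ValueError("beta must lie in (0, 1) with alpha + beta <= 1")
    S_dti = np.asarray(S_dti, dtype=float)
    return alpha * S_sub + beta * S_smi + (1 - alpha - beta) * S_dti
```

With alpha=0.5 and no S_dti, the two similarities contribute equally, which corresponds to the equal-weight setting reported for the external validation.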

6. The drug-target relationship prediction method based on drug substructure and molecular character description information according to claim 5, characterized in that the detailed process of step 5 is: predict the relationship score between drug and target from the final drug similarity matrix S_Sim;

drug_iAnd target_lRelationship score are as follows:

where a_jl is the relation value between drug_j and target_l in the known drug-target relationship matrix A; the parameters Threshold and Ksim determine which drugs similar to drug_i are used when calculating the relationship score: the former restricts the calculation to drugs whose similarity to drug_i is greater than Threshold; the latter takes the drugs whose similarity to drug_i ranks in the top Ksim; drug_iKsim denotes the set of the top Ksim drugs ranked by similarity to drug_i; a drug takes part in the calculation as long as it satisfies either of the two parameters.
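The neighbor selection and scoring described in this claim can be sketched as follows. The score is assumed to be the similarity-weighted sum of the known relation values a_jl over the selected neighbors; the score formula is not reproduced in this text, so any normalization it may contain is omitted. The function name is hypothetical.

```python
import numpy as np

def relation_scores(S, A, i, threshold=0.0, ksim=0):
    """Scores of drug i against every target, from its similar drugs.

    S: (m, m) final drug similarity matrix; A: (m, n) known relation matrix.
    A drug j != i is used if S[i, j] > threshold OR j is among the ksim
    drugs most similar to drug i.
    """
    S = np.asarray(S, dtype=float)
    A = np.asarray(A, dtype=float)
    sims = S[i]
    others = np.arange(len(sims)) != i                    # never use the drug itself
    ranked = np.argsort(-np.where(others, sims, -np.inf), kind="stable")
    in_top = np.zeros(len(sims), dtype=bool)
    in_top[ranked[:ksim]] = True                          # empty when ksim == 0
    neighbours = others & ((sims > threshold) | in_top)
    # Assumed score: sum over neighbours of similarity times known relation value
    return (sims[neighbours][:, None] * A[neighbours]).sum(axis=0)
```

For example, `relation_scores(S, A, i, threshold=0.1, ksim=0)` uses only drugs whose similarity to drug i exceeds 0.1, which mirrors the GPCRs setting of the external validation, and `threshold=0.0, ksim=0` uses every drug with positive similarity, as in the Kinases setting. Sorting the returned scores in descending order gives the target ranking of step 5.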