CN109872781A

CN109872781A - Drug target identification method based on Xgboost

Info

Publication number: CN109872781A
Application number: CN201910141417.2A
Authority: CN
Inventors: 胡杨; 逄龙; 程亮; 张凝一
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology
Priority date: 2019-02-26
Filing date: 2019-02-26
Publication date: 2019-06-11

Abstract

The present invention provides a drug target identification method based on Xgboost and belongs to the field of drug target identification. The method comprises the following steps: (1) composition analysis: calculate the average percentage of each of the 20 amino acids in drug targets and in non-drug targets; (2) dissociation constant: divide the 20 amino acids into 6 groups according to their respective hydrophilicity; (3) PEST regions: identify the potential PEST regions in the amino acid sequence with the Epestfind program; (4) extract the 3 kinds of features of drug targets according to steps 1 to 3; (5) identify drug targets from the features extracted in step 4 using the Xgboost algorithm. The method can identify potential drug targets quickly, efficiently, and at low cost. Discovering potential drug targets not only advances research on disease mechanisms and on drugs, but also provides guidance on the potential side effects and the commercialization of drugs.

Description

Drug target identification method based on Xgboost

Technical field

The present invention relates to a drug target identification method based on Xgboost and belongs to the field of drug target identification.

Background art

A drug target is the binding site between a drug and a biological macromolecule. Drug targets include receptors, enzymes, ion channels, transport proteins, the immune system, genes, and so on. More than 50% of existing drugs act on receptors, which makes receptors the main and most important class of targets. Because drug target research is the source of modern drug research, it can provide important information for the prevention and treatment of major diseases, and new drug development based on new targets brings great social and economic benefit. Drug targets have therefore become a hot topic in the medical field.

Most protein drug targets are G protein-coupled receptors (GPCRs) (23%) and enzymes (50%). Some researchers predict that there are more than 2000 druggable proteins. However, only a few hundred drug targets have been reported, and the number of clinically validated drug targets is still small. Part of the reason is that, as redundant data accumulate, simple analysis methods can no longer meet the needs of large-scale, high-throughput data analysis. Experimental methods, in turn, are difficult to apply widely because of limits on throughput, precision, and cost. As a fast and inexpensive way to handle large amounts of data, drug target prediction based on machine learning is attracting more and more attention.

Huang Chen et al. combined the primary sequence, secondary structure, and subcellular localization of proteins to predict potential drug targets among ion channels with an SVM. Hopkins A L et al. analyzed known drug targets on the basis of sequence homology and domain analysis and applied the results to the search for new targets. Based on the 3D structure of proteins, Kinnings S L et al. studied the regions that can bind drug compounds. Campillos M predicted potential drug targets from the similarity of side effects. Zheng et al. found that drug binding sites consistently have certain structural and physicochemical properties. In addition, Kleywegt G used the percentage of hydrophobic amino acids to predict drug targets. Tala M. Bakheet and Andrew J. Doig analyzed 9 attributes of drug targets; they not only used these 9 attributes to find differences between drug targets and non-drug targets, but also identified drug targets with an SVM.

Although researchers have made great progress in identifying drug targets, recognizing huge and complex amino acid sequences requires an algorithm with both high computational efficiency and high identification accuracy. Chen T proposed a new method called extreme gradient boosting (Xgboost) in 2014. It improves on the boosting algorithm: its multi-threaded parallelism and its regularization term not only improve the accuracy of the algorithm but also shorten the running time. Xgboost is therefore a suitable algorithm for the drug target identification problem.

Summary of the invention

The purpose of the present invention is to solve the above problems of the prior art by providing a drug target identification method based on Xgboost.

The purpose of the present invention is achieved through the following technical solution:

A drug target identification method based on Xgboost, whose specific steps are:

Step 1: composition analysis: calculate the average percentage of each of the 20 amino acids in drug targets and in non-drug targets;

Step 2: dissociation constant: divide the 20 amino acids into 6 groups according to their respective hydrophilicity;

Step 3: PEST regions: identify the potential PEST regions in the amino acid sequence with the Epestfind program;

Step 4: extract the 3 kinds of features of drug targets according to step 1, step 2, and step 3;

Step 5: identify drug targets from the features extracted in step 4 using the Xgboost algorithm.

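In the drug target identification method based on Xgboost of the present invention, the Xgboost algorithm is specifically as follows:

The objective function consists of a loss function and a regularization term:

Obj(Θ) = L(Θ) + Ω(Θ)

where L(Θ) is the loss function and Ω(Θ) is the regularization term;

A model of T trees is built according to the following formula:

$$\hat{y}_i = \sum_{t=1}^{T} f_t(x_i)$$

The base classifier of Xgboost is CART, so the objective function can be written as:

$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{t=1}^{T} \Omega(f_t)$$

where n is the number of training samples and l is the per-sample loss;

The goal is to obtain the parameters of each tree f_t; the t-th tree is trained on top of the previous (t-1) trees:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$

Therefore, the t-th objective function is:

$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \text{const}$$

The loss function is expanded as a second-order Taylor series:

$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + \text{const}$$

where $g_i = \partial_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})$ and $h_i = \partial^2_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})$ are the first and second derivatives of the loss;

The decision tree is defined as:

$$f_t(x) = w_{q(x)}, \quad w \in \mathbb{R}^M, \quad q: \mathbb{R}^d \to \{1, 2, \dots, M\}$$

w records the score of each leaf node, M is the number of leaf nodes, and q is a function that determines which leaf node each input sample finally falls into;

In Xgboost, the regularization term is defined as:

$$\Omega(f_t) = \gamma M + \frac{1}{2} \lambda \sum_{j=1}^{M} w_j^2$$

λ and γ are parameters that control the complexity of the model;

So, after dropping the terms that do not depend on f_t, the objective function of the t-th tree is:

$$Obj^{(t)} \approx \sum_{j=1}^{M} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma M$$

where $I_j = \{ i \mid q(x_i) = j \}$ is the set of samples that fall into leaf node j;

Defining $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$ gives:

$$Obj^{(t)} \approx \sum_{j=1}^{M} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma M$$

Here the w_j are independent of one another, so the optimal score of the j-th leaf node and the optimal objective are:

$$w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad Obj^{*} = -\frac{1}{2} \sum_{j=1}^{M} \frac{G_j^2}{H_j + \lambda} + \gamma M$$

Finally, the tree is pruned according to the split gain:

$$Gain = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$

where G_L, H_L and G_R, H_R are the sums of g_i and h_i over the samples in the left and right child nodes of a candidate split;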

The drug target identification method based on Xgboost of the present invention can identify potential drug targets quickly, efficiently, and at low cost. Discovering potential drug targets not only advances research on disease mechanisms and on drugs, but also provides guidance on the potential side effects and the commercialization of drugs.

Brief description of the drawings

Fig. 1 is the feature extraction block diagram of the present invention.

Fig. 2 is the amino acid composition of drug targets and non-drug targets.

Fig. 3 is the accuracy curve.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the drawings. The embodiments are implemented on the premise of the technical solution of the present invention and give detailed implementations, but the protection scope of the present invention is not limited to the following embodiments.

Embodiment one: as shown in Figs. 1-2, the specific steps of the drug target identification method based on Xgboost in this embodiment are as follows:

Composition analysis: since the composition of real drug targets differs markedly from that of non-drug targets, the frequencies of occurrence of the 20 amino acids in these targets may differ widely. To find the differences between drug targets and non-drug targets, the average amino acid composition is plotted, as shown in Figure 2. To this end, the average percentage of each amino acid in drug targets and in non-drug targets is calculated.

The average amino acid composition is calculated for the 2596 drug targets and for the non-drug targets. As can be seen, 'L' is the most abundant amino acid in drug targets, and the proportions of 'G', 'A', 'V', 'E', and 'S' are also high.

In short, there is a significant difference between the composition of drug targets and that of non-drug targets. The composition is therefore used as a feature for identifying drug targets.
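The composition feature described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function names and the toy sequences are assumptions, and non-standard residue symbols are simply ignored.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Return the percentage of each of the 20 amino acids in one sequence."""
    counts = Counter(sequence.upper())
    # Only the 20 standard residues are counted; other symbols are ignored.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [100.0 * counts[aa] / total for aa in AMINO_ACIDS]


def average_composition(sequences):
    """Average the per-sequence percentages over a set of sequences."""
    rows = [composition(s) for s in sequences]
    return [sum(column) / len(rows) for column in zip(*rows)]


# Toy sequences, purely illustrative (not taken from the patent's data set).
toy_targets = ["MLLAGV", "LLSEAV"]
for aa, pct in zip(AMINO_ACIDS, average_composition(toy_targets)):
    print(aa, round(pct, 2))
```

Computing `average_composition` once for the drug targets and once for the non-drug targets gives the two profiles that the figure compares.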

Dissociation constant: the pattern of hydrophobic and hydrophilic residues is extremely important in determining protein structure. Because the hydrophilicity of amino acids spans a wide range, the amino acids can be divided into groups according to their respective hydrophilicity, and drug targets and non-drug targets should differ considerably in these groups. Table 1 shows the six groups of the 20 amino acids.

Table 1. Amino acids divided into 6 classes

Therefore, the sequence of each drug target can be mapped onto these 6 groups. Each dimension is the average composition of one of the six groups.
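A minimal Python sketch of the six-dimensional group feature follows. The grouping used here is a hypothetical placeholder, because the content of Table 1 is not reproduced in this text; only the mechanics (map each residue to a group, then take the fraction per group) reflect the description above. The names `EXAMPLE_GROUPS` and `group_composition` are likewise illustrative.

```python
# HYPOTHETICAL grouping for illustration only: the patent's Table 1 is not
# reproduced in this text, so these six groups are placeholders.
EXAMPLE_GROUPS = {
    "g1": "AVLIMC",
    "g2": "FWYH",
    "g3": "STNQ",
    "g4": "KR",
    "g5": "DE",
    "g6": "GP",
}


def group_composition(sequence, groups=EXAMPLE_GROUPS):
    """Return the fraction of residues that fall into each group."""
    # Build a residue -> group lookup once, then count residues per group.
    residue_to_group = {
        aa: name for name, members in groups.items() for aa in members
    }
    counts = {name: 0 for name in groups}
    total = 0
    for aa in sequence.upper():
        name = residue_to_group.get(aa)
        if name is not None:
            counts[name] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no grouped amino acids")
    return [counts[name] / total for name in groups]


print(group_composition("MLLAGVSTKKDE"))
```

Passing the real grouping from Table 1 as the `groups` argument would replace the placeholder without changing the rest of the code.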

PEST regions: in 1986, Rechsteiner M and Rogers SW hypothesized that the amino acids 'P', 'E', 'S', and 'T' can act as a proteolytic signal. A growing number of reports now confirm that sequences containing PEST regions lead to rapid degradation of the protein. The Epestfind program can be used to identify all poor and potential PEST sequences in a protein. Only the potential PEST regions are used as a feature for identifying drug targets: the number of potential PEST regions in each sequence is counted.

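Therefore, we extract 3 kinds of features, 27 dimensions in total (20 for the amino acid composition, 6 for the hydrophilicity groups, and 1 for the number of potential PEST regions), to distinguish drug targets from non-drug targets.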
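The three feature types can be assembled into one 27-dimensional vector as in the sketch below. It assumes the split is 20 + 6 + 1 dimensions, uses a hypothetical six-way grouping because Table 1 is not available in this text, and takes the PEST count as an input that has already been obtained elsewhere (for example from Epestfind output); the function name is illustrative.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# HYPOTHETICAL six-way grouping; the patent's Table 1 is not available here.
PLACEHOLDER_GROUPS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]


def feature_vector(sequence, pest_count, groups=PLACEHOLDER_GROUPS):
    """Build a 27-dimensional vector: 20 + 6 + 1 features."""
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    n = len(residues)
    # 20 dimensions: fraction of each amino acid.
    aa_part = [residues.count(aa) / n for aa in AMINO_ACIDS]
    # 6 dimensions: fraction of residues in each group.
    group_part = [sum(residues.count(aa) for aa in g) / n for g in groups]
    # 1 dimension: number of potential PEST regions, computed elsewhere
    # and passed in by the caller.
    return aa_part + group_part + [float(pest_count)]


vec = feature_vector("MLLAGVSTPEEK", pest_count=1)
print(len(vec))  # 27
```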

The number of suitable drug targets is still limited. Compared with the drug targets that remain unknown, the known drug targets are only the tip of the iceberg. The choice of target plays a crucial role in the entire drug development process. In modern drug research, establishing a new target is often the precondition and guarantee of new drug innovation. With the development of modern molecular biology and the completion of the Human Genome Project, a large number of new molecular targets for therapeutic intervention have appeared, but not all of them can become effective disease-related targets, so discovering and validating new targets has become very important work. Traditional biological experiments are both costly and inefficient, whereas the Xgboost-based drug target identification method developed in the present invention can identify potential drug targets quickly, efficiently, and at low cost. Discovering potential drug targets not only advances research on disease mechanisms and on drugs, but also provides guidance on the potential side effects and the commercialization of drugs.

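Embodiment two: as shown in Figure 1, in the drug target identification method based on Xgboost of this embodiment, the Xgboost algorithm is specifically as follows:

The objective function consists of a loss function and a regularization term:

Obj(Θ) = L(Θ) + Ω(Θ)

where L(Θ) is the loss function and Ω(Θ) is the regularization term;

A model of T trees is built according to the following formula:

$$\hat{y}_i = \sum_{t=1}^{T} f_t(x_i)$$

The t-th tree is trained on top of the previous (t-1) trees, that is, $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$. Therefore, the t-th objective function is:

$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \text{const}$$

where n is the number of training samples and l is the per-sample loss;

The loss function is expanded as a second-order Taylor series:

$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + \text{const}$$

where $g_i = \partial_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})$ and $h_i = \partial^2_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})$ are the first and second derivatives of the loss;

The decision tree is defined as:

$$f_t(x) = w_{q(x)}, \quad w \in \mathbb{R}^M, \quad q: \mathbb{R}^d \to \{1, 2, \dots, M\}$$

where w records the score of each of the M leaf nodes and q maps each input sample to a leaf node;

In Xgboost, the regularization term is defined as:

$$\Omega(f_t) = \gamma M + \frac{1}{2} \lambda \sum_{j=1}^{M} w_j^2$$

λ and γ are parameters that control the complexity of the model;

So, after dropping the terms that do not depend on f_t, the objective function of the t-th tree is:

$$Obj^{(t)} \approx \sum_{j=1}^{M} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma M$$

where $I_j = \{ i \mid q(x_i) = j \}$ is the set of samples that fall into leaf node j;

Defining $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$ gives:

$$Obj^{(t)} \approx \sum_{j=1}^{M} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma M$$

Finally, the tree is pruned according to the split gain:

$$Gain = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$

where G_L, H_L and G_R, H_R are the sums of g_i and h_i over the samples in the left and right child nodes of a candidate split;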

Extreme Gradient Boosting (Xgboost) improves on the traditional gradient boosting decision tree (GBDT). The traditional GBDT algorithm uses only the first-derivative information of the loss function during optimization, whereas Xgboost performs a second-order Taylor expansion of the loss function and uses both first- and second-derivative information. In addition, with the help of OpenMP, Xgboost can automatically use the multi-core parallel computation of the CPU, which greatly increases the running speed. Furthermore, unlike the GBDT algorithm, Xgboost supports sparse matrix input. Xgboost defines a new data matrix, DMatrix, and preprocesses the training set when training starts, which improves the efficiency of each iteration of the training process and reduces the model training time.

The process of GBDT is as follows:

The objective function is commonly used to measure the quality of different models. It always consists of two parts: a loss function and a regularization term.

Obj(Θ) = L(Θ) + Ω(Θ)

L(Θ) is the loss function. If we use only the loss function to assess the quality of the model, the model easily overfits. The regularization term Ω(Θ), which represents the complexity of the model, is therefore also taken into account, and the final model should strike a balance between the loss function and the regularization term.

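If T trees have been trained, the model can be constructed as follows:

$$\hat{y}_i = \sum_{t=1}^{T} f_t(x_i)$$

The base classifier of both Xgboost and GBDT is CART, so the objective function can be written as:

$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{t=1}^{T} \Omega(f_t)$$

where n is the number of training samples and l is the per-sample loss. The goal is to obtain the parameters of each tree f_t. We train the t-th tree on top of the previous (t-1) trees:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$

Therefore, the t-th objective function is:

$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \text{const}$$

Then, the loss function is expanded as a second-order Taylor series:

$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + \text{const}$$

where $g_i = \partial_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})$ and $h_i = \partial^2_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})$ are the first and second derivatives of the loss.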
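The second-order expansion needs, for every sample, the first derivative g_i and the second derivative h_i of the loss with respect to the previous prediction. The sketch below computes them for the logistic loss. The choice of logistic loss is an assumption made for illustration (the text does not name the loss function), and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Map a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_grad_hess(y_true, raw_pred):
    """Return (g, h) for the logistic loss, one value per sample.

    Per-sample loss: l = -[y*log(p) + (1-y)*log(1-p)], with p = sigmoid(raw score).
    Differentiating with respect to the raw score gives g = p - y and h = p*(1-p).
    """
    grads, hessians = [], []
    for y, z in zip(y_true, raw_pred):
        p = sigmoid(z)
        grads.append(p - y)             # g_i: first derivative
        hessians.append(p * (1.0 - p))  # h_i: second derivative
    return grads, hessians


# Labels follow the text's convention: 1 = drug target, 0 = pseudo drug target.
g, h = logistic_grad_hess([1, 0], [0.0, 0.0])
print(g, h)
```

With all raw scores at zero every sample has p = 0.5, so the gradients only differ in sign between the two classes and every sample carries the same second-derivative weight.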

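Next, we need to calculate the regularization term. First, we define the decision tree as:

$$f_t(x) = w_{q(x)}, \quad w \in \mathbb{R}^M, \quad q: \mathbb{R}^d \to \{1, 2, \dots, M\}$$

w records the score of each leaf node, M is the number of leaf nodes, and q is a function that decides which leaf node each input sample finally falls into. In Xgboost, we define the regularization term as follows:

$$\Omega(f_t) = \gamma M + \frac{1}{2} \lambda \sum_{j=1}^{M} w_j^2$$

λ and γ are parameters that control the complexity of the model. So, after dropping the terms that do not depend on f_t, the objective function of the t-th tree is as follows, with g_i and h_i the per-sample first and second derivatives of the loss:

$$Obj^{(t)} \approx \sum_{j=1}^{M} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma M$$

where $I_j = \{ i \mid q(x_i) = j \}$ is the set of samples that fall into leaf node j. We can define $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$, and then obtain:

$$Obj^{(t)} \approx \sum_{j=1}^{M} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma M$$

Here the w_j are independent of one another, so we can obtain the optimal score of the j-th leaf node and the optimal objective:

$$w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad Obj^{*} = -\frac{1}{2} \sum_{j=1}^{M} \frac{G_j^2}{H_j + \lambda} + \gamma M$$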
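As a worked illustration, the sketch below evaluates the optimal leaf scores and the optimal objective for a fixed tree structure. It assumes the standard Xgboost closed form, in which the optimal score of leaf j is -G_j / (H_j + λ) and the optimal objective is -1/2 Σ G_j² / (H_j + λ) + γM; the function name, the 0-based leaf indices, and the toy numbers are illustrative.

```python
def optimal_leaves(grads, hessians, leaf_index, num_leaves, lam=1.0, gamma=0.0):
    """Return (weights, objective) for a fixed tree structure.

    leaf_index[i] plays the role of q(x_i): the leaf (0-based here) that
    sample i falls into. lam and gamma are the complexity parameters.
    """
    G = [0.0] * num_leaves
    H = [0.0] * num_leaves
    # Accumulate G_j and H_j over the samples assigned to each leaf.
    for g, h, j in zip(grads, hessians, leaf_index):
        G[j] += g
        H[j] += h
    # Each leaf is optimized independently of the others.
    weights = [-G[j] / (H[j] + lam) for j in range(num_leaves)]
    objective = (
        -0.5 * sum(G[j] ** 2 / (H[j] + lam) for j in range(num_leaves))
        + gamma * num_leaves
    )
    return weights, objective


# Three samples, two leaves: samples 0 and 1 fall into leaf 0, sample 2 into leaf 1.
w, obj = optimal_leaves([-0.5, -0.5, 0.5], [0.25, 0.25, 0.25], [0, 0, 1], 2)
print(w, obj)
```

A larger `lam` shrinks every leaf score toward zero, while `gamma` adds a fixed cost per leaf, which is how the two parameters control model complexity.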

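Finally, we should prune the tree according to a certain rule. The gain of splitting a node into a left child L and a right child R is:

$$Gain = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$

where G_L, H_L and G_R, H_R are the sums of g_i and h_i over the samples in the left and right child nodes. It can be seen that if the improvement obtained by the split, that is, the first term above, is less than γ, it is better not to add the branch.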
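The pruning rule can be checked numerically with the sketch below. It assumes the standard Xgboost split gain, 1/2 [G_L²/(H_L+λ) + G_R²/(H_R+λ) - (G_L+G_R)²/(H_L+H_R+λ)] - γ; the function name and the toy values are illustrative.

```python
def split_gain(G_left, H_left, G_right, H_right, lam=1.0, gamma=0.0):
    """Gain of splitting one node into a left and a right child."""

    def score(G, H):
        # Contribution of one node to the (negated, doubled) optimal objective.
        return G * G / (H + lam)

    improvement = 0.5 * (
        score(G_left, H_left)
        + score(G_right, H_right)
        - score(G_left + G_right, H_left + H_right)
    )
    # gamma is the fixed cost of adding one more leaf.
    return improvement - gamma


# Same toy statistics, two different complexity penalties.
print(split_gain(-1.0, 0.5, 0.5, 0.25, gamma=0.0))  # positive: keep the split
print(split_gain(-1.0, 0.5, 0.5, 0.25, gamma=0.5))  # negative: do not add the branch
```

The same split is accepted or rejected depending only on `gamma`, which matches the rule that a branch is not worth adding when its improvement is smaller than γ.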

Embodiment three: as shown in Fig. 3, the experimental verification process of the drug target identification method based on Xgboost in this embodiment is as follows. We obtained 2596 real drug targets and generated 2596 pseudo drug targets. To verify the effectiveness of Xgboost in identifying drug targets, we performed ten-fold cross-validation.

We randomly divided these 5192 sequences into 10 groups. For each group, we selected 519 sequences as the test set and the remaining 4673 sequences as the training set, so we carried out ten experiments in total. In this way, each sequence is used in both the training set and the test set. The parameters of Xgboost were set as listed in Table 2.
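The ten-fold split described above can be sketched with the Python standard library. The sizes (5192 sequences, 519 per test set, 4673 per training set) come from the text; the shuffling seed and the function name are assumptions.

```python
import random


def ten_fold_splits(n_samples=5192, n_folds=10, test_size=519, seed=0):
    """Yield (train_indices, test_indices) for each of the ten experiments."""
    indices = list(range(n_samples))
    # Shuffle once so that each fold is a random group of sequences.
    random.Random(seed).shuffle(indices)
    for k in range(n_folds):
        test = indices[k * test_size:(k + 1) * test_size]
        held_out = set(test)
        # Everything outside the current test fold is used for training.
        train = [i for i in indices if i not in held_out]
        yield train, test


for fold, (train, test) in enumerate(ten_fold_splits(), start=1):
    print(fold, len(train), len(test))
```

Because 10 × 519 = 5190, two of the 5192 sequences never appear in a test set under this scheme, which agrees with the total of 5190 tested sequences.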

Table 2. Parameter settings of Xgboost

We evaluate the performance of Xgboost in identifying drug targets with four evaluation measures. The results of the ten experiments are given in Table 3. A total of 5190 sequences were tested.

Table 3. Results of the 10 experiments

From these results we obtain Accuracy = 99.13%, Precision = 99.04%, Recall = 99.23%, and Specificity = 99.04%. In this study, pseudo drug targets are labeled 0 and drug targets are labeled 1. The accuracy curve of the 10 experiments is shown in Figure 3.
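The four measures follow from the confusion-matrix counts in the usual way, as in the sketch below. The counts in the example call are hypothetical placeholders, not the values of Table 3, and the function name is illustrative.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the four measures from confusion-matrix counts.

    Positive class = drug target (label 1), negative = pseudo target (label 0).
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts for illustration only.
for name, value in classification_metrics(tp=90, fp=5, tn=95, fn=10).items():
    print(name, round(100 * value, 2))
```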

The above is only a preferred embodiment of the present invention. These specific embodiments are all different implementations under the general concept of the present invention, and the protection scope of the present invention is not limited to them. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

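1. A drug target identification method based on Xgboost, characterized in that the specific steps of the drug target identification method based on Xgboost are:

Step 1: composition analysis: calculate the average percentage of each of the 20 amino acids in drug targets and in non-drug targets;

Step 2: dissociation constant: divide the 20 amino acids into 6 groups according to their respective hydrophilicity;

Step 3: PEST regions: identify the potential PEST regions in the amino acid sequence with the Epestfind program;

Step 4: extract the 3 kinds of features of drug targets according to step 1, step 2, and step 3;

Step 5: identify drug targets from the features extracted in step 4 using the Xgboost algorithm.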

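2. The drug target identification method based on Xgboost according to claim 1, characterized in that the specific steps of the Xgboost algorithm are:

The objective function consists of a loss function and a regularization term:

Obj(Θ) = L(Θ) + Ω(Θ)

where L(Θ) is the loss function and Ω(Θ) is the regularization term;

A model of T trees is built according to the following formula:

$$\hat{y}_i = \sum_{t=1}^{T} f_t(x_i)$$

The t-th tree is trained on top of the previous (t-1) trees, that is, $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$. Therefore, the t-th objective function is:

$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \text{const}$$

where n is the number of training samples and l is the per-sample loss;

The loss function is expanded as a second-order Taylor series:

$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + \text{const}$$

where $g_i = \partial_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})$ and $h_i = \partial^2_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})$ are the first and second derivatives of the loss;

The decision tree is defined as:

$$f_t(x) = w_{q(x)}, \quad w \in \mathbb{R}^M, \quad q: \mathbb{R}^d \to \{1, 2, \dots, M\}$$

where w records the score of each of the M leaf nodes and q maps each input sample to a leaf node;

In Xgboost, the regularization term is defined as:

$$\Omega(f_t) = \gamma M + \frac{1}{2} \lambda \sum_{j=1}^{M} w_j^2$$

λ and γ are parameters that control the complexity of the model;

So, after dropping the terms that do not depend on f_t, the objective function of the t-th tree is:

$$Obj^{(t)} \approx \sum_{j=1}^{M} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma M$$

where $I_j = \{ i \mid q(x_i) = j \}$ is the set of samples that fall into leaf node j;

Defining $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$ gives:

$$Obj^{(t)} \approx \sum_{j=1}^{M} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma M$$

Finally, the tree is pruned according to the split gain:

$$Gain = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$

where G_L, H_L and G_R, H_R are the sums of g_i and h_i over the samples in the left and right child nodes of a candidate split;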