CN109872781A - Drug target recognition methods based on Xgboost - Google Patents

Drug target recognition methods based on Xgboost

Info

Publication number
CN109872781A
Authority
CN
China
Prior art keywords
xgboost
drug
drug target
amino acid
tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910141417.2A
Other languages
Chinese (zh)
Inventor
胡杨
逄龙
程亮
张凝一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN201910141417.2A
Publication of CN109872781A
Current legal status: Pending

Landscapes

  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The present invention provides a drug target recognition method based on Xgboost and belongs to the field of drug target identification. The method comprises the following steps: (1) composition analysis: calculating the average percentage of each of the 20 amino acids in drug targets and non-drug targets; (2) dissociation constant: dividing the 20 amino acids into 6 groups according to their hydrophilicity; (3) PEST regions: identifying potential PEST regions in the protein sequence with the Epestfind program; (4) extracting 3 kinds of features of drug targets according to steps 1 to 3; (5) identifying drug targets from the features extracted in step 4 using the Xgboost algorithm. The method can identify potential drug targets quickly, efficiently, and at low cost. Discovering potential drug targets not only advances research on disease mechanisms and drugs, but also provides guidance on the potential side effects and the commercialization of drugs.

Description

Drug target recognition methods based on Xgboost
Technical field
The present invention relates to a drug target recognition method based on Xgboost and belongs to the field of drug target identification.
Background art
The binding site between a drug and a biological macromolecule is the drug target. Drug targets include receptors, enzymes, ion channels, transport proteins, the immune system, genes, and so on. More than 50% of existing drugs act on receptors, which makes receptors the main and most important class of targets. Because drug target research is the starting point of modern drug research, it can provide important information for the prevention and treatment of major diseases, and new drug development based on new targets brings great social and economic benefit. Drug targets have therefore become a hot topic in the medical field.
Most protein drug targets are G protein-coupled receptors (GPCRs) (23%) and enzymes (50%). Some researchers predict that there are more than 2,000 druggable proteins. However, only a few hundred drug targets have been reported, and the number of clinically validated drug targets is still small. Part of the reason is that, as redundant data accumulate, simple analysis methods can no longer meet the needs of large-scale, high-throughput data analysis. Experimental methods, in turn, are difficult to apply widely because of limits on throughput, precision, and cost. As a fast and low-cost way to handle large amounts of data, drug target prediction based on machine learning has received increasing attention.
Huang Chen et al. combined protein primary sequence, secondary structure, and subcellular localization to predict potential drug targets among ion channels using an SVM. Hopkins A L et al. analyzed known drug targets based on sequence homology and domain analysis and applied the results to the search for new targets. Based on the 3D structure of proteins, Kinnings S L et al. studied the regions that can bind drug compounds. Campillos M predicted potential drug targets based on the similarity of side effects. Zheng et al. found that drug binding sites consistently share certain structural and physicochemical properties. In addition, Kleywegt G used the percentage of hydrophobic amino acids to predict drug targets. Tala M. Bakheet and Andrew J. Doig analyzed nine properties of drug targets; they not only used these nine properties to find differences between drug targets and non-drug targets, but also identified drug targets using an SVM.
Although researchers have made great progress in identifying drug targets, identifying them among large and complex amino acid sequences requires an algorithm with both high computational efficiency and high recognition accuracy. Chen T proposed a new method called extreme gradient boosting (Xgboost) in 2014. It improves on the boosting algorithm: its multi-threaded parallelism and regularization term both improve the accuracy of the algorithm and shorten the running time. Xgboost is therefore a suitable algorithm for the drug target identification problem.
Summary of the invention
The purpose of the present invention is to solve the above problems of the prior art by providing a drug target recognition method based on Xgboost.
This purpose is achieved through the following technical solution:
A drug target recognition method based on Xgboost, comprising the following steps:
Step 1: composition analysis: calculate the average percentage of each of the 20 amino acids in drug targets and non-drug targets;
Step 2: dissociation constant: divide the 20 amino acids into 6 groups according to their hydrophilicity;
Step 3: PEST regions: identify potential PEST regions in the protein sequence with the Epestfind program;
Step 4: extract 3 kinds of features of drug targets according to steps 1, 2, and 3;
Step 5: identify drug targets from the features extracted in step 4 using the Xgboost algorithm.
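In the drug target recognition method based on Xgboost of the present invention, the Xgboost algorithm is specifically as follows:
The objective function consists of a loss function and a regularization term:
$$\mathrm{Obj}(\Theta) = L(\Theta) + \Omega(\Theta)$$
where $L(\Theta)$ is the loss function and $\Omega(\Theta)$ is the regularization term;
A model of T trees is built according to the following formula:
$$\hat{y}_i = \sum_{k=1}^{T} f_k(x_i)$$
The base classifier of Xgboost is CART, so the objective function can be written as:
$$\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{T} \Omega(f_k)$$
The goal is to obtain the parameters $f_k$ of each tree; the t-th tree is trained on the basis of the previous (t-1) trees:
$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$
Therefore, the t-th objective function is:
$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \mathrm{const}$$
A second-order Taylor expansion of the loss function gives:
$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + \mathrm{const}$$
where $g_i$ and $h_i$ are the first and second derivatives of $l(y_i, \hat{y}_i^{(t-1)})$ with respect to $\hat{y}_i^{(t-1)}$;
The decision tree is defined as:
$$f_t(x) = w_{q(x)}, \quad w \in \mathbb{R}^M, \quad q: \mathbb{R}^d \to \{1, 2, \ldots, M\}$$
w records the score of each leaf node, and q is a function that determines which leaf node each input sample finally falls on;
In Xgboost, the regularization term is defined as:
$$\Omega(f_t) = \gamma M + \tfrac{1}{2} \lambda \sum_{j=1}^{M} w_j^2$$
λ and γ are parameters that control the complexity of the model;
So, dropping constant terms, the objective function of the t-th tree is:
$$\mathrm{Obj}^{(t)} \approx \sum_{j=1}^{M} \left[ \Big(\sum_{i \in I_j} g_i\Big) w_j + \tfrac{1}{2} \Big(\sum_{i \in I_j} h_i + \lambda\Big) w_j^2 \right] + \gamma M$$
where $I_j = \{ i \mid q(x_i) = j \}$ is the set of samples on leaf j;
Defining $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$ gives:
$$\mathrm{Obj}^{(t)} \approx \sum_{j=1}^{M} \left[ G_j w_j + \tfrac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma M$$
Here the $w_j$ are independent of one another, so the optimal score of the j-th leaf node and the optimal objective are:
$$w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad \mathrm{Obj}^* = -\frac{1}{2} \sum_{j=1}^{M} \frac{G_j^2}{H_j + \lambda} + \gamma M$$
Finally, the tree is split according to the gain of each candidate split into a left (L) and a right (R) child:
$$\mathrm{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$
The drug target recognition method based on Xgboost of the present invention can identify potential drug targets quickly, efficiently, and at low cost. Discovering potential drug targets not only advances research on disease mechanisms and drugs, but also provides guidance on the potential side effects and the commercialization of drugs.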
Description of the drawings
Fig. 1 is the feature extraction block diagram of the invention.
Fig. 2 is the amino acid composition of drug targets and non-drug targets.
Fig. 3 is the accuracy curve.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the drawings. The embodiments are implemented on the premise of the technical solution of the invention and give detailed implementations, but the scope of protection of the invention is not limited to the following embodiments.
Embodiment one: as shown in Figs. 1-2, the drug target recognition method based on Xgboost of this embodiment comprises the following steps:
Step 1: composition analysis: calculate the average percentage of each of the 20 amino acids in drug targets and non-drug targets;
Step 2: dissociation constant: divide the 20 amino acids into 6 groups according to their hydrophilicity;
Step 3: PEST regions: identify potential PEST regions in the protein sequence with the Epestfind program;
Step 4: extract 3 kinds of features of drug targets according to steps 1, 2, and 3;
Step 5: identify drug targets from the features extracted in step 4 using the Xgboost algorithm.
Composition analysis: because the composition of real drug targets differs markedly from that of non-drug targets, the frequency of occurrence of the 20 amino acids may differ widely between the two. To find the differences between drug targets and non-drug targets, the average amino acid composition is plotted, as shown in Fig. 2. The average percentage of each amino acid in drug targets and non-drug targets is therefore calculated.
The average amino acid composition was calculated for 2596 drug targets and for the non-drug targets. As can be seen, 'L' is the most abundant residue in drug targets, and 'G', 'A', 'V', 'E', and 'S' are also highly represented.
In short, there are significant differences between the composition of drug targets and that of non-drug targets. The composition is therefore used as a feature for identifying drug targets.
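The composition feature described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function names `composition` and `mean_composition` and the toy sequences are assumptions introduced here, and non-standard residue codes are simply ignored.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequence):
    """Return the percentage of each of the 20 amino acids in one sequence."""
    counts = Counter(sequence.upper())
    # Count only standard residues so ambiguous codes (e.g. 'X') are ignored.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [100.0 * counts[aa] / total for aa in AMINO_ACIDS]

def mean_composition(sequences):
    """Average the per-sequence percentages over a set of proteins."""
    vectors = [composition(s) for s in sequences]
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(AMINO_ACIDS))]

# Toy sequences, for illustration only (not data from the patent).
toy = ["LLGA", "LVES"]
profile = mean_composition(toy)
```

Calling `mean_composition` once on the drug targets and once on the non-drug targets would give the two 20-value profiles that a plot such as Fig. 2 compares.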
Dissociation constant: the pattern of hydrophobic and hydrophilic residues is very important in determining protein structure. Because the hydrophilicity of amino acids spans a wide range, the amino acids can be divided into groups according to their hydrophilicity, and the group composition should differ considerably between drug targets and non-drug targets. Table 1 shows the six groups of the 20 amino acids.
Table 1. Amino acids divided into 6 classes
The sequence of each drug target can therefore be mapped onto these 6 groups. Each dimension of the resulting feature is the average composition of one of the six groups.
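As a rough sketch of how a sequence can be reduced to a 6-dimensional group composition, the following Python snippet counts residues per group. The partition `PLACEHOLDER_GROUPS` is purely hypothetical (alphabetical chunks chosen so that it cannot be mistaken for a chemical classification); the actual six classes are those of Table 1 and would have to be substituted.

```python
# HYPOTHETICAL partition: alphabetical chunks used only as a placeholder.
# Replace with the six hydrophilicity classes of Table 1.
PLACEHOLDER_GROUPS = ["ACDE", "FGHI", "KLM", "NPQ", "RST", "VWY"]

def group_composition(sequence, groups=PLACEHOLDER_GROUPS):
    """Return the percentage of residues that fall in each group."""
    # Map every residue letter to the index of its group.
    lookup = {aa: idx for idx, members in enumerate(groups) for aa in members}
    counts = [0] * len(groups)
    total = 0
    for residue in sequence.upper():
        idx = lookup.get(residue)
        if idx is not None:  # skip non-standard residue codes
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * len(groups)
    return [100.0 * c / total for c in counts]

# Toy sequence, for illustration only.
features = group_composition("ACKLRV")
```

Passing the grouping as a parameter keeps the counting logic independent of whichever classification is finally used.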
PEST regions: in 1986, Rechsteiner M and Rogers SW hypothesized that the amino acids 'P', 'E', 'S', and 'T' can act as a proteolytic signal. A growing number of reports now confirm that sequences containing PEST regions can lead to rapid degradation of the protein. The Epestfind program can be used to identify all poor and potential PEST sequences in a protein. Only the potential PEST regions are used as a feature for identifying drug targets: the number of potential PEST regions in each sequence is counted.
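Counting the potential PEST regions can be sketched as a simple scan over a text report. The snippet below is an assumption-laden illustration: it presumes that the report introduces each motif with a line beginning `Potential PEST motif` or `Poor PEST motif`, which should be checked against the real output of the program, and the sample text is hand-written rather than actual program output.

```python
def count_potential_pest(report_text):
    """Count motifs labeled 'Potential' in an epestfind-style text report.

    ASSUMPTION: each motif is introduced by a line that starts with
    'Potential PEST motif' or 'Poor PEST motif'; verify on real output.
    """
    count = 0
    for line in report_text.splitlines():
        if line.strip().startswith("Potential PEST motif"):
            count += 1
    return count

# Hand-written sample in the assumed layout (not real program output).
sample = """\
Potential PEST motif with 21 amino acids between position 12 and 34.
Poor PEST motif with 14 amino acids between position 50 and 65.
Potential PEST motif with 18 amino acids between position 80 and 99.
"""
pest_feature = count_potential_pest(sample)
```

Only the lines marked as potential motifs contribute, which mirrors the choice in the text to ignore poor motifs.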
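We therefore extracted 3 kinds of features, 27 dimensions in total (20 for amino acid composition, 6 for the hydrophilicity groups, and 1 for the number of potential PEST regions), to distinguish drug targets from non-drug targets.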
The number of suitable drug targets is still limited. Compared with the drug targets that remain unknown, the known ones are only the tip of the iceberg. Target selection plays a crucial role in the whole drug development process. In modern drug research, establishing a new target is often the precondition and guarantee of an innovative drug. With the development of modern molecular biology and the completion of the Human Genome Project, a large number of new molecular targets for therapeutic intervention have emerged, but not all of them can become effective disease-related targets, so discovering and validating new targets has become very important work. Traditional biological experiments are both costly and inefficient. The Xgboost-based drug target recognition method developed in the present invention can identify potential drug targets quickly, efficiently, and at low cost. Discovering potential drug targets not only advances research on disease mechanisms and drugs, but also provides guidance on the potential side effects and the commercialization of drugs.
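Embodiment two: as shown in Fig. 1, in the drug target recognition method based on Xgboost of this embodiment, the Xgboost algorithm is specifically as follows:
The objective function consists of a loss function and a regularization term:
$$\mathrm{Obj}(\Theta) = L(\Theta) + \Omega(\Theta)$$
where $L(\Theta)$ is the loss function and $\Omega(\Theta)$ is the regularization term;
A model of T trees is built according to the following formula:
$$\hat{y}_i = \sum_{k=1}^{T} f_k(x_i)$$
The base classifier of Xgboost is CART, so the objective function can be written as:
$$\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{T} \Omega(f_k)$$
The goal is to obtain the parameters $f_k$ of each tree; the t-th tree is trained on the basis of the previous (t-1) trees:
$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$
Therefore, the t-th objective function is:
$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \mathrm{const}$$
A second-order Taylor expansion of the loss function gives:
$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + \mathrm{const}$$
where $g_i$ and $h_i$ are the first and second derivatives of $l(y_i, \hat{y}_i^{(t-1)})$ with respect to $\hat{y}_i^{(t-1)}$;
The decision tree is defined as:
$$f_t(x) = w_{q(x)}, \quad w \in \mathbb{R}^M, \quad q: \mathbb{R}^d \to \{1, 2, \ldots, M\}$$
w records the score of each leaf node, and q is a function that determines which leaf node each input sample finally falls on;
In Xgboost, the regularization term is defined as:
$$\Omega(f_t) = \gamma M + \tfrac{1}{2} \lambda \sum_{j=1}^{M} w_j^2$$
λ and γ are parameters that control the complexity of the model;
So, dropping constant terms, the objective function of the t-th tree is:
$$\mathrm{Obj}^{(t)} \approx \sum_{j=1}^{M} \left[ \Big(\sum_{i \in I_j} g_i\Big) w_j + \tfrac{1}{2} \Big(\sum_{i \in I_j} h_i + \lambda\Big) w_j^2 \right] + \gamma M$$
where $I_j = \{ i \mid q(x_i) = j \}$ is the set of samples on leaf j;
Defining $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$ gives:
$$\mathrm{Obj}^{(t)} \approx \sum_{j=1}^{M} \left[ G_j w_j + \tfrac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma M$$
Here the $w_j$ are independent of one another, so the optimal score of the j-th leaf node and the optimal objective are:
$$w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad \mathrm{Obj}^* = -\frac{1}{2} \sum_{j=1}^{M} \frac{G_j^2}{H_j + \lambda} + \gamma M$$
Finally, the tree is split according to the gain of each candidate split into a left (L) and a right (R) child:
$$\mathrm{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$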
Extreme Gradient Boosting (Xgboost) improves on the traditional gradient boosting decision tree (GBDT). The traditional GBDT algorithm uses only the first-derivative information of the loss function during optimization, whereas Xgboost performs a second-order Taylor expansion of the loss function and uses both first- and second-derivative information. In addition, with the help of OpenMP, Xgboost automatically uses the multiple cores of the CPU for parallel computation, which greatly increases the running speed. Furthermore, unlike the GBDT algorithm, Xgboost supports sparse matrix input. Xgboost defines its own data matrix, DMatrix, and preprocesses the training set when training starts, which improves the efficiency of each training iteration and reduces the model training time.
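The process of GBDT is as follows:
The objective function is commonly used to measure the quality of different models. It always consists of two parts: a loss function and a regularization term.
$$\mathrm{Obj}(\Theta) = L(\Theta) + \Omega(\Theta)$$
$L(\Theta)$ is the loss function. If only the loss function is used to assess the quality of the model, the model easily overfits. A regularization term $\Omega(\Theta)$, which represents the complexity of the model, is therefore also included. The final model should strike a balance between the loss function and the regularization term.
If T trees have been trained, the model can be constructed as follows:
$$\hat{y}_i = \sum_{k=1}^{T} f_k(x_i)$$
The base classifier of both Xgboost and GBDT is CART, so the objective function can be written as:
$$\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{T} \Omega(f_k)$$
The goal is to obtain the parameters $f_k$ of each tree. The t-th tree is trained on the basis of the previous (t-1) trees:
$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$
Therefore, the t-th objective function is:
$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \mathrm{const}$$
Then a second-order Taylor expansion of the loss function gives:
$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + \mathrm{const}$$
where $g_i$ and $h_i$ are the first and second derivatives of $l(y_i, \hat{y}_i^{(t-1)})$ with respect to $\hat{y}_i^{(t-1)}$.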
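To make the first- and second-order terms of the Taylor expansion concrete, the following Python sketch computes them for one common choice of loss. The patent does not state which loss function is used; the binary logistic loss here is an assumption, as are the function names, and the prediction is taken to be the raw log-odds score before the sigmoid.

```python
import math

def sigmoid(z):
    """Map a raw score (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def grad_hess_logistic(y_true, raw_pred):
    """First and second derivatives of the logistic loss per sample.

    ASSUMPTION: loss = -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(raw).
    Derivatives are taken with respect to the raw prediction.
    """
    g, h = [], []
    for y, z in zip(y_true, raw_pred):
        p = sigmoid(z)
        g.append(p - y)          # first derivative
        h.append(p * (1.0 - p))  # second derivative
    return g, h

# One positive (label 1) and one negative (label 0), both predicted at 0.
g, h = grad_hess_logistic([1, 0], [0.0, 0.0])
```

These per-sample values are what a second-order boosting step sums over the samples of each leaf.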
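Next, the regularization term needs to be calculated. First, the decision tree is defined as:
$$f_t(x) = w_{q(x)}, \quad w \in \mathbb{R}^M, \quad q: \mathbb{R}^d \to \{1, 2, \ldots, M\}$$
w records the score of each leaf node. q is a function that determines which leaf node each input sample finally falls on. In Xgboost, the regularization term is defined as follows:
$$\Omega(f_t) = \gamma M + \tfrac{1}{2} \lambda \sum_{j=1}^{M} w_j^2$$
λ and γ are parameters that control the complexity of the model. So, dropping constant terms, the objective function of the t-th tree is as follows:
$$\mathrm{Obj}^{(t)} \approx \sum_{j=1}^{M} \left[ \Big(\sum_{i \in I_j} g_i\Big) w_j + \tfrac{1}{2} \Big(\sum_{i \in I_j} h_i + \lambda\Big) w_j^2 \right] + \gamma M$$
where $I_j = \{ i \mid q(x_i) = j \}$ is the set of samples on leaf j, and $g_i$ and $h_i$ are the first- and second-order derivative terms of the Taylor expansion above.
Defining $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$, we obtain:
$$\mathrm{Obj}^{(t)} \approx \sum_{j=1}^{M} \left[ G_j w_j + \tfrac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma M$$
Here the $w_j$ are independent of one another, so the optimal score of the j-th leaf node and the optimal objective can be obtained:
$$w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad \mathrm{Obj}^* = -\frac{1}{2} \sum_{j=1}^{M} \frac{G_j^2}{H_j + \lambda} + \gamma M$$
Finally, the tree is split according to the gain of each candidate split into a left (L) and a right (R) child:
$$\mathrm{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$
It can be seen that if the reduction in the objective obtained by a split is less than γ, the gain is negative and the branch should not be added.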
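The optimal leaf score and the split rule can be checked numerically with a short Python sketch. It follows the standard second-order boosting formulas, with the optimal leaf score equal to -G/(H+λ) and the gain equal to half the improvement in G²/(H+λ) minus γ; the function names and the toy numbers are assumptions for illustration only.

```python
def leaf_weight(G, H, lam):
    """Optimal score of a leaf with gradient sum G and Hessian sum H."""
    return -G / (H + lam)

def leaf_score(G, H, lam):
    """Structure score G^2 / (H + lambda) of a single leaf."""
    return G * G / (H + lam)

def split_gain(G_left, H_left, G_right, H_right, lam, gamma):
    """Gain of splitting one leaf into a left and a right child."""
    parent = leaf_score(G_left + G_right, H_left + H_right, lam)
    children = leaf_score(G_left, H_left, lam) + leaf_score(G_right, H_right, lam)
    # gamma is the complexity cost of adding one more leaf.
    return 0.5 * (children - parent) - gamma

# Toy sums, for illustration only.
gain = split_gain(G_left=-2.0, H_left=1.0, G_right=2.0, H_right=1.0,
                  lam=1.0, gamma=0.5)
keep_split = gain > 0  # a split with non-positive gain is not added
```

With these toy values the two children separate samples whose gradients have opposite signs, so the split is worth more than its complexity cost γ.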
Embodiment three: as shown in Fig. 3, the experimental verification of the drug target recognition method based on Xgboost of this embodiment is as follows. We obtained 2596 real drug targets and generated 2596 pseudo drug targets. To verify the effectiveness of Xgboost in identifying drug targets, we performed ten-fold cross-validation.
We randomly divided these 5192 sequences into 10 groups. For each group, we selected 519 sequences as the test set and the remaining 4673 sequences as the training set, so ten experiments were carried out in total. In this way the sequences are used for both training and testing. The parameters of Xgboost were set as described in Table 2.
Table 2. Parameter settings of Xgboost
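The ten-fold split described above (519 test sequences and 4673 training sequences per experiment, out of 5192) can be sketched as follows. This is an illustrative reconstruction using only the Python standard library; the function name, the fixed random seed, and the use of contiguous slices of a shuffled index list are assumptions, not details given in the patent.

```python
import random

def ten_fold_splits(n_samples=5192, n_folds=10, test_size=519, seed=0):
    """Yield (train_indices, test_indices) for each of the folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary assumption
    for k in range(n_folds):
        # Each fold takes the next block of 519 shuffled indices as test set.
        test = indices[k * test_size:(k + 1) * test_size]
        test_set = set(test)
        # Everything else (4673 sequences) is used for training.
        train = [i for i in indices if i not in test_set]
        yield train, test

splits = list(ten_fold_splits())
```

Because 10 × 519 = 5190, two of the 5192 sequences never appear in a test set under this scheme, which matches the total number of tested sequences reported in the text.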
We used four evaluation metrics to assess the performance of Xgboost in identifying drug targets. The results of the ten experiments are given in Table 3. In total, 5190 sequences were tested (10 × 519).
Table 3. Results of the 10 experiments
From these results, Accuracy = 99.13%, Precision = 99.04%, Recall = 99.23%, and Specificity = 99.04%. In this study, pseudo drug targets are labeled 0 and drug targets are labeled 1. The accuracy curve of the 10 experiments is shown in Fig. 3.
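The four metrics named above follow from the entries of a confusion matrix, with drug targets (label 1) as the positive class. The sketch below uses their usual definitions; the function name and the toy counts are assumptions chosen for easy arithmetic and are not the counts of Table 3.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the four metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),    # predicted targets that are real
        "recall": tp / (tp + fn),       # real targets that are found
        "specificity": tn / (tn + fp),  # pseudo targets correctly rejected
    }

# Toy counts, for illustration only (not the results of the patent).
metrics = classification_metrics(tp=90, fp=20, tn=80, fn=10)
```

Summing the true and false counts over the ten test sets before applying these formulas gives one pooled value per metric.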
The above are only preferred embodiments of the present invention. These embodiments are different implementations under the general concept of the invention, and the scope of protection of the invention is not limited to them. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the claims.

Claims (2)

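1. A drug target recognition method based on Xgboost, characterized in that the method comprises the following steps:
Step 1: composition analysis: calculate the average percentage of each of the 20 amino acids in drug targets and non-drug targets;
Step 2: dissociation constant: divide the 20 amino acids into 6 groups according to their hydrophilicity;
Step 3: PEST regions: identify potential PEST regions in the protein sequence with the Epestfind program;
Step 4: extract 3 kinds of features of drug targets according to steps 1, 2, and 3;
Step 5: identify drug targets from the features extracted in step 4 using the Xgboost algorithm.
2. The drug target recognition method based on Xgboost according to claim 1, characterized in that the Xgboost algorithm is specifically as follows:
The objective function consists of a loss function and a regularization term:
$$\mathrm{Obj}(\Theta) = L(\Theta) + \Omega(\Theta)$$
where $L(\Theta)$ is the loss function and $\Omega(\Theta)$ is the regularization term;
A model of T trees is built according to the following formula:
$$\hat{y}_i = \sum_{k=1}^{T} f_k(x_i)$$
The base classifier of Xgboost is CART, so the objective function can be written as:
$$\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{T} \Omega(f_k)$$
The goal is to obtain the parameters $f_k$ of each tree; the t-th tree is trained on the basis of the previous (t-1) trees:
$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$
Therefore, the t-th objective function is:
$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \mathrm{const}$$
A second-order Taylor expansion of the loss function gives:
$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + \mathrm{const}$$
where $g_i$ and $h_i$ are the first and second derivatives of $l(y_i, \hat{y}_i^{(t-1)})$ with respect to $\hat{y}_i^{(t-1)}$;
The decision tree is defined as:
$$f_t(x) = w_{q(x)}, \quad w \in \mathbb{R}^M, \quad q: \mathbb{R}^d \to \{1, 2, \ldots, M\}$$
w records the score of each leaf node, and q is a function that determines which leaf node each input sample finally falls on;
In Xgboost, the regularization term is defined as:
$$\Omega(f_t) = \gamma M + \tfrac{1}{2} \lambda \sum_{j=1}^{M} w_j^2$$
λ and γ are parameters that control the complexity of the model;
So, dropping constant terms, the objective function of the t-th tree is:
$$\mathrm{Obj}^{(t)} \approx \sum_{j=1}^{M} \left[ \Big(\sum_{i \in I_j} g_i\Big) w_j + \tfrac{1}{2} \Big(\sum_{i \in I_j} h_i + \lambda\Big) w_j^2 \right] + \gamma M$$
where $I_j = \{ i \mid q(x_i) = j \}$ is the set of samples on leaf j;
Defining $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$ gives:
$$\mathrm{Obj}^{(t)} \approx \sum_{j=1}^{M} \left[ G_j w_j + \tfrac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma M$$
Here the $w_j$ are independent of one another, so the optimal score of the j-th leaf node and the optimal objective are:
$$w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad \mathrm{Obj}^* = -\frac{1}{2} \sum_{j=1}^{M} \frac{G_j^2}{H_j + \lambda} + \gamma M$$
Finally, the tree is split according to the gain of each candidate split into a left (L) and a right (R) child:
$$\mathrm{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$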
CN201910141417.2A 2019-02-26 2019-02-26 Drug target recognition methods based on Xgboost Pending CN109872781A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910141417.2A CN109872781A (en) 2019-02-26 2019-02-26 Drug target recognition methods based on Xgboost

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910141417.2A CN109872781A (en) 2019-02-26 2019-02-26 Drug target recognition methods based on Xgboost

Publications (1)

Publication Number Publication Date
CN109872781A true CN109872781A (en) 2019-06-11

Family

ID=66919180

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910141417.2A Pending CN109872781A (en) 2019-02-26 2019-02-26 Drug target recognition methods based on Xgboost

Country Status (1)

Country Link
CN (1) CN109872781A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110265085A (en) * 2019-07-29 2019-09-20 安徽工业大学 A kind of protein-protein interaction sites recognition methods
CN110791543A (en) * 2019-09-30 2020-02-14 中国海洋大学 Method for identifying action target of natural product medicine
CN111383708A (en) * 2020-03-11 2020-07-07 中南大学 Small molecule target prediction algorithm based on chemical genomics and application thereof
CN111383708B (en) * 2020-03-11 2023-05-12 中南大学 Small molecular target prediction algorithm based on chemical genomics and application thereof
CN112837743A (en) * 2021-02-04 2021-05-25 东北大学 Medicine repositioning method based on machine learning
CN112837743B (en) * 2021-02-04 2024-03-26 东北大学 Drug repositioning method based on machine learning

Similar Documents

Publication Publication Date Title
CN109872781A (en) Drug target recognition methods based on Xgboost
US20210407622A1 (en) Neural network architectures for linking biological sequence variants based on molecular phenotype, and systems and methods therefor
Merkley et al. Applications and challenges of forensic proteomics
Thireou et al. Bidirectional long short-term memory networks for predicting the subcellular localization of eukaryotic proteins
CN112086129B (en) Method and system for predicting cfDNA of tumor tissue
CN105849279A (en) Methods and systems for identifying disease-induced mutations
Webb-Robertson et al. A support vector machine model for the prediction of proteotypic peptides for accurate mass and time proteomics
Gewehr et al. SSEP-Domain: protein domain prediction by alignment of secondary structure elements and profiles
JP6644672B2 (en) Characterization of biological materials using unassembled sequence information, stochastic methods, and trait-specific database catalogs
Liu et al. MetaDecoder: a novel method for clustering metagenomic contigs
CN112837743B (en) Drug repositioning method based on machine learning
CN110400605A (en) A kind of the ligand bioactivity prediction technique and its application of GPCR drug targets
WO2006129401A1 (en) Screening method for specific protein in proteome comprehensive analysis
Jain et al. Quantitative proteomic analysis of formalin fixed paraffin embedded oral HPV lesions from HIV patients
Palviainen et al. Kidney-derived proteins in urine as biomarkers of induced acute kidney injury in sheep
Murphy et al. Self-supervised learning of cell type specificity from immunohistochemical images
Wani et al. Raw sequence to target gene prediction: An integrated inference pipeline for ChIP-seq and RNA-seq datasets
Washburn The H-Index of ‘an approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database’
CN114822690A (en) Multi-class multifunctional intelligent classification method applied to whole genome expression profile data
Wibowo et al. XGB5hmC: Identifier based on XGB model for RNA 5-hydroxymethylcytosine detection
US7805257B2 (en) Comparison of molecules using field points
Li et al. Fast and accurate classification of meta-genomics long reads with deSAMBA
Bangert et al. Pattern Recognition for Mass-Spectrometry-Based Proteomics
CN112041933A (en) System and method for interpreting transcript expression levels of RNA sequencing data using locally unique features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190611
