CN104992079A

CN104992079A - Sampling learning based protein-ligand binding site prediction method

Info

Publication number: CN104992079A
Application number: CN201510368016.2A
Authority: CN
Inventors: 胡俊; 何雪; 李阳; 於东军; 沈红斌; 杨静宇
Original assignee: Nanjing University of Science and Technology
Current assignee: Nanjing University of Science and Technology
Priority date: 2015-06-29
Filing date: 2015-06-29
Publication date: 2015-10-21
Anticipated expiration: 2035-06-29
Also published as: CN104992079B

Abstract

The invention provides a protein-ligand binding site prediction method based on sampling learning. The method comprises the following steps: first, the PSI-BLAST and PSIPRED programs are used to obtain the evolutionary information and secondary structure information of a protein, and a sliding window technique is used to extract the features of each amino acid residue (sample); second, random down-sampling is applied to the non-binding site samples, and the resulting non-binding site sample subset together with the binding site sample set is used to train an SVM that predicts all to-be-predicted samples; third, according to the feature information of each to-be-predicted sample, a KNN dynamic sampling learning technique is used to sample the binding site samples and the non-binding site samples separately, and the sampled binding site and non-binding site subsets are combined to train an SVM specific to that to-be-predicted sample; finally, a threshold-based ensemble technique is used to integrate the two trained SVMs. The method has the following advantages: first, random down-sampling and KNN dynamic sampling learning reduce the size of the training sets and speed up model training; second, KNN dynamic sampling learning trains a different SVM model for each to-be-predicted sample and thereby incorporates the differences among the to-be-predicted samples; third, SVM ensembling reduces the information loss caused by sampling learning and improves prediction accuracy.

Description

Protein-ligand binding site prediction method based on sampling learning

Technical field

The present invention relates to the field of protein-ligand binding site prediction in bioinformatics, and in particular to a protein-ligand binding site prediction method based on sampling learning, more particularly a high-accuracy protein-ligand binding site prediction method based on random down-sampling, KNN dynamic sampling learning and a support vector machine ensemble strategy.

Background technology

In life processes, ligands large and small play indispensable roles, for example adenosine triphosphate (ATP) and vitamins. ATP is an important biomolecule that is significant for membrane transport, muscle contraction, signal transduction, cell motility, DNA replication and transcription, and other life processes in organisms. Most of these ligands interact with proteins through protein-ligand binding sites, and the proteins then carry out various biochemical functions such as transport and decomposition. In addition, the binding sites between proteins and certain ligands are important targets for antibacterial and anticancer drugs. Locating the protein-ligand binding sites in a protein sequence quickly and accurately is therefore of great significance.

However, determining the binding sites between proteins and ligands by biological experiments consumes a great deal of time and money and is inefficient. Moreover, with the rapid development of sequencing technology and the continuing progress of human structural genomics, proteomics has accumulated a large number of protein sequences whose protein-ligand binding sites have not been annotated. There is therefore an urgent need to apply bioinformatics to develop intelligent methods that predict protein-ligand binding sites quickly and accurately, directly from the protein sequence, which is also of great significance for discovering and understanding protein structure and physiological function.

At present, sequence-based prediction models for protein-ligand binding sites are still scarce. A review of the literature shows that the computational models currently designed specifically for sequence-based protein-ligand binding site prediction include ATPint, ATPsite, GTPbinder, NsitePred, TargetATP, TargetATPsite, TargetS and TargetSOS. Among them, ATPint (J. S. Chauhan, N. K. Mishra, and G. P. Raghava, "Identification of ATP binding residues of a protein from its primary sequence," BMC Bioinformatics, vol. 10, pp. 434, 2009) and ATPsite (K. Chen, M. J. Mizianty, and L. Kurgan, "ATPsite: sequence-based prediction of ATP-binding residues," Proteome Sci, vol. 9 Suppl 1, pp. S4, 2011) are two early sequence-based prediction models for protein-ATP binding sites. GTPbinder (Chauhan, J. S., et al. (2010) Prediction of GTP interacting residues, dipeptides and tripeptides in a protein from its evolutionary information. BMC Bioinformatics, 11, 301) is a computational model designed specifically for predicting protein-GTP binding sites. TargetATP (Dong-Jun Yu, Jun Hu, Zhen-Min Tang, Hong-Bin Shen, Jian Yang, and Jing-Yu Yang. Improving Protein-ATP Binding Residues Prediction by Boosting SVMs with Random Under-Sampling. Neurocomputing. 2013, 104: 180-190) and TargetATPsite (Dong-Jun Yu, Jun Hu, Yan Huang, Hong-Bin Shen, Yong Qi, Zhen-Min Tang and Jing-Yu Yang: TargetATPsite: A Template-free Method for ATP Binding Sites Prediction with Residue Evolution Image Sparse Representation and Classifier Ensemble, Journal of Computational Chemistry. 2013, 34: 974-985) are also designed specifically for predicting protein-ATP binding sites. NsitePred (Chen K, Mizianty M J, Kurgan L. Prediction and analysis of nucleotide-binding residues using sequence and sequence-derived structural descriptors. Bioinformatics, 2012, 28 (3): 331-341) and TargetSOS (Jun Hu, Xue He, Dong-Jun Yu*, Xi-Bei Yang, Jing-Yu Yang, and Hong-Bin Shen. A New Supervised Over-Sampling Algorithm with Application to Protein-Nucleotide Binding Residues Prediction, PLOS ONE. 2014, 9 (9): e107676) are prediction models designed for predicting the binding sites between proteins and nucleotides (ATP, ADP, AMP, GTP and GDP). TargetS (Dong-Jun Yu, Jun Hu, Jing Yang, Hong-Bin Shen, Jinhui Tang, and Jing-Yu Yang. Designing template-free predictor for targeting protein-ligand binding sites with classifier ensemble and spatial clustering, IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2013, 10 (4): 994-1008) is a computational model that can predict the binding sites between proteins and nucleotides (ATP, ADP, AMP, GTP and GDP) as well as metal ions (Ca²⁺, Mg²⁺, Mn²⁺, Fe³⁺ and Zn²⁺).

However, there are many kinds of ligands, and the computational models above do not cover them comprehensively. Moreover, protein-ligand binding site prediction is a typical imbalanced learning problem. Although some computational models use random down-sampling to overcome part of the impact of imbalanced data, they do not treat different to-be-predicted samples differently and do not exploit the differences among them. As a result, the poor interpretability of protein-ligand binding site prediction models remains to be overcome, and prediction accuracy is still far from what practical applications require and urgently needs further improvement.

Summary of the invention

To address the shortcomings of existing protein-ligand binding site prediction, namely limited generality caused by incomplete coverage of ligand types, and a large gap between prediction accuracy and practical requirements together with poor interpretability caused by insufficient consideration of the differences among to-be-predicted samples, the object of the present invention is to propose a protein-ligand binding site prediction method based on sampling learning that combines random down-sampling, KNN dynamic sampling learning and an ensemble technique, and that offers high prediction accuracy and strong model interpretability.

To achieve the above object, the present invention adopts the following technical solution:

A protein-ligand binding site prediction method based on sampling learning, comprising the following steps:

Step 1: feature extraction. Each amino acid residue in the protein sequence to be predicted is converted into a numeric representation. For a protein composed of n amino acids, the position-specific scoring matrix (PSSM) of the protein is obtained with the PSI-BLAST program; the matrix has size n × 20 (n rows, 20 columns). The PSSM is first normalized row by row with the sigmoid function s(x) = 1/(1 + e^(-x)), and a sliding window of length winsize is then used to obtain the evolutionary information matrix of each amino acid residue. The evolutionary information matrix is flattened into a feature vector of length 20 × winsize for the i-th residue of the protein sequence. The protein sequence is also input to the PSIPRED program to obtain its predicted secondary structure (PSS) probability matrix of size n × 3 (n rows, 3 columns); a sliding window of the same size yields the secondary structure information matrix of each amino acid residue, which is flattened into a feature vector of length 3 × winsize. Finally, the feature vectors of the two kinds of information are concatenated to form the final feature vector used for prediction.

Step 2: random down-sampling is applied to the non-binding site samples. The resulting non-binding site sample subset and the binding site sample set form a training set, on which an SVM is trained. A training set built this way keeps the positive and negative samples balanced, but it also makes the computational model insensitive to the differences among to-be-predicted samples. The KNN dynamic sampling learning technique in the next step compensates for this.

Step 3: for each to-be-predicted sample, features are first extracted as in step 1; the KNN dynamic sampling learning technique is then used to sample the binding site samples and the non-binding site samples separately; finally, the sampled binding site and non-binding site sample sets are merged and an SVM dedicated to predicting this to-be-predicted sample is trained. This ensures that the differences among to-be-predicted samples are preserved as far as possible, and allows the computational model to handle more ligand types.

Step 4: SVM ensembling. The two SVMs trained in steps 2 and 3 are integrated with a threshold-based ensemble technique, and a thresholding method is applied to the integrated output to determine whether each residue belongs to a binding site.

As the above technical solution shows, the beneficial effects of the present invention are:

1. Improved prediction accuracy: combining random down-sampling with KNN dynamic sampling learning gives the computational model access to both the commonality and the differences among to-be-predicted samples, so that more effective sample distribution information can be exploited and the accuracy of protein-ligand binding site prediction is improved;

2. Improved model interpretability: KNN dynamic sampling learning allows the computational model to train a dedicated prediction model for each to-be-predicted sample. Incorporating the differences among to-be-predicted samples also makes the prediction results fairer and more reasonable, which improves the interpretability of the model.

Brief description of the drawings

Fig. 1 is a schematic diagram of the protein-ligand binding site prediction method combining random down-sampling, KNN dynamic sampling learning and the threshold-based ensemble technique.

Detailed description of the embodiments

For a better understanding of the technical content of the present invention, the invention is further described below with reference to the accompanying drawing.

Fig. 1 shows a schematic diagram of the system architecture of the prediction method of the present invention. With reference to Fig. 1, according to an embodiment of the invention, a protein-ligand binding site prediction method based on sampling learning comprises the following steps:

First, the PSI-BLAST and PSIPRED programs are used to obtain, respectively, the evolutionary information matrix (Position Specific Scoring Matrix, PSSM) and the predicted secondary structure probability matrix (Predicted Secondary Structure, PSS) of each training protein. Second, a sliding window technique is used to build the feature vector of each amino acid residue from the PSSM and the PSS matrix, and the feature vectors of the two kinds of information are concatenated to form the final feature vector used for prediction. Third, random down-sampling is applied to the non-binding site residues, the resulting non-binding site sample subset and the binding site samples form a training set, and an SVM is trained on this training set. Fourth, the KNN dynamic sampling learning technique is used to down-sample the binding site residues and the non-binding site residues separately, the resulting binding site and non-binding site sample sets form a training set, and an SVM is trained on this training set. Finally, a threshold-based ensemble strategy is used to integrate the two SVMs obtained above.

The foregoing process is described in more detail below with reference to the accompanying drawing.

Step 1: feature extraction

For a protein composed of n amino acid residues, the position-specific scoring matrix (PSSM) is obtained with the PSI-BLAST program. Its size is n × 20 (n rows, 20 columns), and it expresses the protein sequence information in matrix form as follows:

Each value in the PSSM is normalized with the sigmoid function:

s(x) = \frac{1}{1 + e^{-x}} \qquad (2)
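As an illustration of formula (2), the following minimal Python sketch applies the sigmoid to every entry of a PSSM stored as a list of rows. The function names `sigmoid` and `normalize_pssm` and the toy matrix are hypothetical and are not part of the original disclosure; a real PSSM row has 20 columns.

```python
import math

def sigmoid(x):
    """Logistic function s(x) = 1 / (1 + e^(-x)), as in formula (2)."""
    return 1.0 / (1.0 + math.exp(-x))

def normalize_pssm(pssm):
    """Apply the sigmoid to every entry of a PSSM given as a list of rows."""
    return [[sigmoid(v) for v in row] for row in pssm]

# Hypothetical fragment with 2 residues and only 3 columns, for brevity.
toy_pssm = [[-3, 0, 5], [2, -1, 0]]
normalized = normalize_pssm(toy_pssm)
```

After normalization every score lies strictly between 0 and 1, and a raw score of 0 maps to 0.5.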

A sliding window of size winsize is used to extract the PSSM feature matrix of each amino acid residue:

The feature matrix of this amino acid residue is then flattened into a feature vector of dimension 20 × winsize:

x_{pssm}^{i} = \left( \mathrm{pssm}_{i - \frac{winsize - 1}{2},\,1}^{normalized},\; \mathrm{pssm}_{i - \frac{winsize - 1}{2},\,2}^{normalized},\; \ldots,\; \mathrm{pssm}_{i + \frac{winsize - 1}{2},\,20}^{normalized} \right)^{T} \qquad (4)

For a protein sequence composed of n amino acid residues, the predicted secondary structure probability matrix (PSS) is obtained with the PSIPRED program. Its size is n × 3 (n rows, 3 columns):

Using a sliding window of the same size as above, the PSS feature matrix of each amino acid residue is obtained:

The PSS feature matrix of this amino acid residue is then flattened into a feature vector of dimension 3 × winsize:

x_{pss}^{i} = \left( \mathrm{pss}_{i - \frac{winsize - 1}{2},\,1},\; \mathrm{pss}_{i - \frac{winsize - 1}{2},\,2},\; \ldots,\; \mathrm{pss}_{i + \frac{winsize - 1}{2},\,3} \right)^{T} \qquad (7)

Finally, formula (4) and formula (7) are concatenated to obtain the feature vector of the to-be-predicted sample used for prediction.
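The window extraction and concatenation of formulas (4) and (7) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the source does not say how window positions beyond the ends of the sequence are handled, so zero padding is assumed here; indices are 0-based, whereas the formulas are 1-based; and the names `window_features` and `residue_feature_vector` are hypothetical.

```python
def window_features(matrix, i, winsize, pad_value=0.0):
    """Flatten the winsize rows centred on residue i (0-based) into one list.

    Rows outside the sequence are filled with pad_value. The padding rule
    is an assumption; the source does not specify one.
    """
    if winsize % 2 == 0:
        raise ValueError("winsize must be odd so that the window has a centre")
    half = (winsize - 1) // 2
    ncols = len(matrix[0])
    features = []
    for r in range(i - half, i + half + 1):
        if 0 <= r < len(matrix):
            features.extend(matrix[r])
        else:
            features.extend([pad_value] * ncols)
    return features

def residue_feature_vector(pssm_norm, pss, i, winsize):
    """Concatenate the PSSM window (20 * winsize) and PSS window (3 * winsize)."""
    return window_features(pssm_norm, i, winsize) + window_features(pss, i, winsize)

# Toy protein with n = 4 residues and constant, made-up matrix entries.
n = 4
pssm_norm = [[0.5] * 20 for _ in range(n)]
pss = [[0.2, 0.3, 0.5] for _ in range(n)]
vec = residue_feature_vector(pssm_norm, pss, 0, winsize=3)
```

With winsize = 3 the resulting vector has 20 × 3 + 3 × 3 = 69 entries. For the first residue, the leading 20 entries of the PSSM part and the leading 3 entries of the PSS part are padding.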

Step 2: random down-sampling is applied to the non-binding site samples; the sampled non-binding site subset and the binding site samples form a training set, on which an SVM is trained.

A training set built this way keeps the positive and negative samples balanced, but it also makes the computational model insensitive to the differences among to-be-predicted samples. The KNN dynamic sampling learning technique in the next step compensates for this.
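A minimal Python sketch of the random down-sampling in step 2 is given below. The source does not state how many non-binding site samples are kept, so the `ratio` parameter (non-binding samples kept per binding sample) and the `seed` parameter are assumptions, and the function name `random_undersample` is hypothetical. SVM training itself is not shown and would be done with an SVM library of choice on the returned set.

```python
import random

def random_undersample(samples, labels, ratio=1.0, seed=0):
    """Keep every binding sample (label 1) and a random subset of the
    non-binding samples (label -1).

    ratio is the number of non-binding samples kept per binding sample;
    this parameter is an assumption, not taken from the source.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == -1]
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    kept = sorted(pos + rng.sample(neg, n_keep))
    return [samples[i] for i in kept], [labels[i] for i in kept]

# Toy data: 2 binding samples and 10 non-binding samples.
toy_samples = [[float(i)] for i in range(12)]
toy_labels = [1, 1] + [-1] * 10
X_bal, y_bal = random_undersample(toy_samples, toy_labels)
```

With the default ratio of 1.0 the returned training set contains the 2 binding samples and 2 randomly chosen non-binding samples.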

Step 3: the KNN dynamic sampling learning technique is used to down-sample the binding site samples and the non-binding site samples separately; the sampled binding site and non-binding site sample sets form a training set, on which an SVM is then trained.

Let S^{tr} be the original amino acid residue training set, in which each sample i consists of a feature vector and a label indicating whether the sample is a binding site (-1 denotes a non-binding site and 1 denotes a binding site), and let x_{j}^{tst} be the to-be-predicted amino acid residue numbered j.

So that the KNN dynamic sampling learning technique can sample the binding site samples and the non-binding site samples separately, formula (8) is first used to split S^{tr} into binding site samples and non-binding site samples according to their binding state:

(S_{binding}^{tr},\; S_{non\text{-}binding}^{tr}) = \mathrm{DivideDataset}(S^{tr}) \qquad (8)

where S_{binding}^{tr} is the binding site sample set and S_{non-binding}^{tr} is the non-binding site sample set.

Then, within S_{binding}^{tr} and S_{non-binding}^{tr} respectively, the KNN algorithm is used, based on the feature information x_{j}^{tst} of the to-be-predicted sample, to search for its neighbors in the binding site sample set and in the non-binding site sample set:

\mathrm{neighbor}_{j}^{binding} = \mathrm{KNNSelection}(x_{j}^{tst},\; S_{binding}^{tr}) \qquad (9)

\mathrm{neighbor}_{j}^{non\text{-}binding} = \mathrm{KNNSelection}(x_{j}^{tst},\; S_{non\text{-}binding}^{tr}) \qquad (10)

The two neighbor sets neighbor_{j}^{binding} and neighbor_{j}^{non-binding} are then merged to form a training set dedicated to predicting x_{j}^{tst}:

\mathrm{neighborSet} = \mathrm{Union}(\mathrm{neighbor}_{j}^{binding},\; \mathrm{neighbor}_{j}^{non\text{-}binding}) \qquad (11)

An SVM dedicated to predicting this to-be-predicted sample is then trained on neighborSet.
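The KNN dynamic sampling of step 3 can be sketched in Python as follows. The sketch mirrors the DivideDataset, KNNSelection and Union operations named in the formulas, but the source specifies neither the distance measure nor the number of neighbors, so Euclidean distance and the parameters `k_binding` and `k_non_binding` are assumptions, and all function names are hypothetical. Training the dedicated SVM on the returned set is not shown.

```python
import math

def divide_dataset(samples, labels):
    """Split the training set into binding (1) and non-binding (-1) samples."""
    binding = [x for x, y in zip(samples, labels) if y == 1]
    non_binding = [x for x, y in zip(samples, labels) if y == -1]
    return binding, non_binding

def euclidean(u, v):
    """Euclidean distance; the distance measure is an assumption."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_selection(query, candidates, k):
    """Return the k candidates closest to the query sample."""
    return sorted(candidates, key=lambda x: euclidean(query, x))[:k]

def dynamic_training_set(query, samples, labels, k_binding, k_non_binding):
    """Build the query-specific training set from the two neighbor sets."""
    binding, non_binding = divide_dataset(samples, labels)
    nb_pos = knn_selection(query, binding, k_binding)
    nb_neg = knn_selection(query, non_binding, k_non_binding)
    return nb_pos + nb_neg, [1] * len(nb_pos) + [-1] * len(nb_neg)

# Toy two-dimensional data with made-up coordinates.
train_x = [[0, 0], [1, 0], [5, 5], [6, 5], [0, 1], [9, 9]]
train_y = [1, 1, 1, -1, -1, -1]
X_local, y_local = dynamic_training_set([0.2, 0.1], train_x, train_y, 2, 2)
```

Because the training set is rebuilt for every to-be-predicted sample, each query is classified by an SVM trained only on the samples closest to it in each class.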

Step 4: the threshold-based ensemble technique is used to integrate the SVMs from step 2 and step 3.

Let pro_rand and pro_dynamic be the prediction probabilities given for the same to-be-predicted sample by the SVMs of step 2 and step 3 respectively. The threshold-based ensemble is defined as follows:

pro_{ensemble} = \underset{p \in \{pro\_rand,\; pro\_dynamic\}}{\arg\max} \; | p - cthres | \qquad (12)

where cthres is an adjustable threshold parameter in the range 0 to 1.
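The ensemble rule can be read as keeping whichever of the two probabilities lies farther from cthres, that is, the more decisive of the two predictions. A minimal Python sketch is given below; the function name `ensemble_probability` and the default value cthres = 0.5 are assumptions, since the source only states that cthres is adjustable between 0 and 1.

```python
def ensemble_probability(pro_rand, pro_dynamic, cthres=0.5):
    """Return whichever probability lies farther from cthres.

    If both are equally far from cthres, pro_rand is returned, because
    max() keeps the first of several equal maxima.
    """
    return max((pro_rand, pro_dynamic), key=lambda p: abs(p - cthres))

# Made-up probabilities for two to-be-predicted samples.
first = ensemble_probability(0.55, 0.9)   # 0.9 is farther from 0.5
second = ensemble_probability(0.1, 0.6)   # 0.1 is farther from 0.5
```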

Finally, a thresholding method is used to determine whether each residue belongs to a binding site:

f(x_{j}^{tst}) = \begin{cases} -1, & \text{if } pro_{ensemble} \geq T \\ 1, & \text{otherwise} \end{cases} \qquad (13)

where T is the decision threshold, which takes values in the range 0 to 1 and is chosen so that the Matthews correlation coefficient (MCC) of the prediction results is maximized.
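One possible way to choose T is a simple grid search that evaluates the MCC on labeled samples for each candidate threshold. The Python sketch below is illustrative only: the grid search, the step count and the function names are assumptions. The decision rule follows the formula as printed above (label -1 when pro_ensemble ≥ T); if the probabilities instead refer to the binding class, the two branches of `classify` would be swapped.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for labels in {1, -1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def classify(pro, T):
    """Decision rule as printed in the formula above: -1 if pro >= T, else 1."""
    return -1 if pro >= T else 1

def best_threshold(pros, y_true, steps=100):
    """Grid search over T in [0, 1]; returns the first T with the highest MCC."""
    best_T, best_score = 0.0, -2.0
    for s in range(steps + 1):
        T = s / steps
        score = mcc(y_true, [classify(p, T) for p in pros])
        if score > best_score:
            best_T, best_score = T, score
    return best_T, best_score

# Made-up ensemble outputs and labels that are separable under the rule above.
toy_pros = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
toy_true = [-1, -1, -1, 1, 1, 1]
best_T, best_score = best_threshold(toy_pros, toy_true)
```

On this toy data any threshold above 0.3 and up to 0.7 separates the two classes perfectly, so the search reaches an MCC of 1.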

In summary, compared with existing prediction methods, the notable advantages of the present invention are that the method can handle the imbalanced data learning problem in protein-ligand binding site prediction and can exploit in depth the differences among to-be-predicted samples. This allows the differences between ligands to be distinguished as far as possible, strengthens the interpretability of the prediction model, and improves its prediction accuracy.

Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit the invention. A person of ordinary skill in the art to which the invention belongs may make various modifications and variations without departing from the spirit and scope of the invention. The scope of protection of the invention is therefore defined by the claims.

Claims

1. A protein-ligand binding site prediction method based on sampling learning, characterized in that it comprises the following steps:

Step 1: feature extraction: the PSI-BLAST and PSIPRED programs are used to extract the evolutionary information and secondary structure information of the protein to be predicted; on this basis, a sliding window technique is used to represent each amino acid residue in the protein sequence as a feature vector, and the feature vectors of the two kinds of information are then concatenated to form the final feature vector used for prediction;

Step 2: random down-sampling is applied to the non-binding site samples; the resulting non-binding site sample subset and the binding site sample set form a training set, and an SVM is trained on the training set thus built;

Step 3: for each to-be-predicted sample, features are first extracted in the manner of step 1; the KNN dynamic sampling learning technique is then used to sample the binding site samples and the non-binding site samples separately; finally, the sampled binding site and non-binding site sample sets are merged and an SVM dedicated to predicting this to-be-predicted sample is trained; and

Step 4: a threshold-based ensemble technique is used to integrate the two SVMs obtained in step 2 and step 3.

2. The protein-ligand binding site prediction method based on sampling learning according to claim 1, characterized in that: in step 1, for a protein sequence composed of n amino acids, the position-specific scoring matrix PSSM of the protein is obtained with the PSI-BLAST program, the size of the matrix being n × 20; the position-specific scoring matrix PSSM is then normalized row by row, a sliding window of length winsize is used to obtain the evolutionary information matrix of each amino acid residue, and the evolutionary information matrix is flattened into a feature vector of length 20 × winsize.

3. The protein-ligand binding site prediction method based on sampling learning according to claim 2, characterized in that: in step 1, a protein sequence composed of n amino acids is input to the PSIPRED program to obtain the predicted secondary structure probability matrix PSS of the protein sequence, the size of the matrix being n × 3; a sliding window of the same size as above is then used to obtain the secondary structure information matrix of each amino acid residue; finally, the secondary structure information matrix is flattened into a feature vector of length 3 × winsize.

4. The protein-ligand binding site prediction method based on sampling learning according to claim 1, characterized in that: in step 3, the KNN dynamic sampling learning technique samples the binding site sample set and the non-binding site sample set separately.

5. The protein-ligand binding site prediction method based on sampling learning according to claim 1, characterized in that: in step 4, the integrated SVM uses a thresholding method to determine whether each amino acid residue belongs to a binding site.

6. The protein-ligand binding site prediction method based on sampling learning according to claim 5, characterized in that: when the thresholding method is used to determine whether each amino acid residue belongs to a binding site, the selected threshold takes values in the range 0 to 1 and is chosen so that the Matthews correlation coefficient of the prediction results is maximized.