CN110265085A

CN110265085A - Method for identifying protein-protein interaction sites

Info

Publication number: CN110265085A
Application number: CN201910686641.XA
Authority: CN
Inventors: 王兵; 张欢; 汪文艳; 周郁明; 王彦; 程竹明
Original assignee: Anhui University of Technology AHUT
Current assignee: Anhui University of Technology AHUT
Priority date: 2019-07-29
Filing date: 2019-07-29
Publication date: 2019-09-20

Abstract

The invention discloses a method for identifying protein-protein interaction sites and belongs to the field of bioinformatics analysis. The method comprises the following steps: protein chain data are first collected and preprocessed, and the preprocessed protein chain data are divided into interface residues and non-interface residues; features are then extracted from databases and fused to obtain a data set; the imbalance of the data set is handled; the processed data set is divided into a training set and a test set; an XGBoost model is trained on the training set; and finally the protein-protein interaction sites are obtained with the XGBoost model. The invention aims to overcome a deficiency of the prior art, in which predictions of protein-protein interaction sites carry varying degrees of "false positives" and "false negatives" that make the results difficult to interpret. The invention overcomes this deficiency and improves the accuracy with which protein-protein interaction sites are identified.

Description

Method for identifying protein-protein interaction sites

Technical field

The present invention relates to the technical field of bioinformatics analysis, and more specifically to a method for identifying protein-protein interaction sites.

Background art

Proteins are the most important components of all cells, and interactions between proteins are the most basic activity underlying most cellular functions. Protein-protein interactions form an important part of the cellular biochemical reaction network; the protein-protein interaction network is the main means by which biological information is regulated and is a key factor in determining cell fate. Studying protein-protein interactions is the basis for understanding life activities and is one of the most important research fields of the post-genomic era. With the implementation of the Human Genome Project, the amount of data in protein sequence databases has risen sharply, and the structures of proteins and the interactions between them cannot all be measured one by one by experiment. Moreover, the results of experimental screening usually yield imbalanced samples, and handling this sample imbalance in protein data effectively has become a challenge for researchers. Research on protein interactions plays a very important role in modern molecular biology. Revealing the interaction relationships between proteins and building networks of those relationships has therefore become a hot topic in proteomics research.

A search found an invention titled "Method for quantitatively analyzing interactions between proteins and application thereof" (Publication No. CN101982778A; publication date: March 2, 2011). Based on the inventors' finding that Fluc and Rluc do not interact with each other, that scheme constructs a biotinylated Fluc expression vector and an Rluc expression vector, and inserts the proteins X and Y to be quantitatively analyzed into these vectors respectively, obtaining a biotinylated Fluc-X fusion protein expression vector and an Rluc-Y fusion protein expression vector. The two fusion proteins are either co-expressed, or expressed separately and then co-incubated in vitro. Biotinylated Fluc-X is then purified with streptavidin-coated magnetic beads, and the DLR activity of the purified Fluc-X and Rluc-Y fusion proteins is measured by DLR assay. Because the activities of the purified Fluc and Rluc are closely related to the amount of fusion protein added and to the strength of the interaction between proteins X and Y, the method can quantitatively determine the strength of the interaction between proteins X and Y.

There is also an invention titled "Protein interaction detection method with a low false positive rate" (Publication No. CN103290091A; publication date: September 11, 2013). That scheme relates to a detection system and belongs to the field of molecular biotechnology. It uses the bimolecular fluorescence complementation (BiFC) technique based on fluorescent proteins, and builds the BiFC detection system, which usually requires two single-expression plasmid vectors, into one dual-expression plasmid vector system. This can significantly reduce the false positive rate of the BiFC detection method in protein-protein interaction research and enables quantitative analysis. The scheme can also be used for gene-library-based screening of unknown protein-protein interactions and for detecting protein-protein interactions in living animals. However, the shortcoming of that application is that it cannot distinguish between different pests and diseases.

The methods above are experimental methods. They can predict protein-protein interaction sites, but they share the same shortcomings: they are time-consuming and labor-intensive, and they carry varying degrees of "false positives" and "false negatives", which make the results difficult to interpret. How to predict protein-protein interaction sites is therefore a problem that the prior art urgently needs to solve.

Summary of the invention

1. Problem to be solved

The object of the invention is to overcome a deficiency of the prior art, in which predictions of protein-protein interaction sites carry varying degrees of "false positives" and "false negatives" that make the results difficult to interpret. The invention provides a method for identifying protein-protein interaction sites that can avoid such "false positives" and "false negatives", greatly reduces space and time overhead, and also improves the accuracy with which protein-protein interaction sites are identified.

2. Technical solution

To solve the above-mentioned problems, the technical solution adopted in the present invention is as follows:

In the method for identifying protein-protein interaction sites of the invention, protein chain data are first collected and preprocessed, and the preprocessed protein chain data are divided into interface residues and non-interface residues. Features of the protein chains are then extracted from databases and fused to obtain a data set, and the imbalance of the data set is handled. The processed data set is divided into a training set and a test set, an XGBoost model is trained on the training set, and finally the protein-protein interaction sites are obtained with the XGBoost model.

Further, the specific steps are as follows:

1) protein chain data are first collected and preprocessed;

2) the preprocessed protein chain data are divided into interface residues and non-interface residues according to the relative accessible surface area of each amino acid and the distance between the α-carbon atoms of two residues;

3) features of the protein chains are extracted from databases on the basis of the evolutionary conservation of amino acids;

4) the extracted features are fused, and the features of each residue to be tested are expanded to obtain a data set;

5) the imbalance of the data set is handled, the processed data set is divided into a training set and a test set, an XGBoost model is trained on the training set, and finally the protein-protein interaction sites are obtained with the XGBoost model.

Further, the protein chain data are preprocessed as follows: protein chains with fewer than 50 residues are discarded first; protein chains whose sequence similarity is greater than or equal to 30% are then removed; some obsolete protein chains are then discarded; and a non-redundant set of protein chain data is finally obtained.

Further, the preprocessed protein chain data are divided into interface residues and non-interface residues as follows: if the accessible surface area of an amino acid residue is at least 0.16 of its maximum accessible surface area (that is, its relative accessible surface area is at least 0.16), the residue is classified as a surface residue. Among the surface residues, if the distance d between the α-carbon atoms of any two residues satisfies d < 1.2 nm, the residue is classified as an interface residue; if d ≥ 1.2 nm, the residue is classified as a non-interface residue.

Further, the features of each residue to be tested are combined in turn with the features of its ten spatially nearest surface residues, which expands the dimension of each feature vector.

Further, the imbalance of the data set is handled as follows:

A target sample is first selected, and the classes of its three nearest samples in the data set are checked using the K-nearest-neighbor rule. If two or more of those samples belong to a class different from that of the selected target sample, the target sample is regarded as noise and is deleted. The operation is repeated until no noisy data remain.

Further, the imbalance of the data set is handled as follows: the number of interface residues is first compared with the number of non-interface residues, the more numerous of the two residue classes is selected, and the following formula is used to delete the data points in the selected class whose IH value is greater than 0.7:

$$IH(\langle x_i, y_i \rangle) = 1 - p(y_i \mid x_i, h)$$

where $p(y_i \mid x_i, h)$ denotes the probability that the mapping function $h$ assigns the label $y_i$ to the input feature vector $x_i$.

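Further, the formula used to train the XGBoost model on the training set is:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$

where $F$ is the space of all classification and regression trees, $\hat{y}_i$ is the prediction result for $x_i$, and $f_k(x_i)$ is the prediction score of the leaf node reached when sample $x_i$ is input to the $k$-th tree.

The objective function of the XGBoost model is:

$$Obj = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

where $\sum_{i} l(y_i, \hat{y}_i)$ is the error function of the model and $\sum_{k=1}^{K} \Omega(f_k)$ is the regularization term, which represents the complexity of the $K$ trees.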

Further, the protein-protein interaction sites obtained are verified using the test set.

Further, the extracted features are the residue spatial sequence profile, residue sequence information entropy, relative entropy, residue sequence conservation weight, and residue evolutionary rate.

3. Beneficial effects

Compared with the prior art, the beneficial effects of the invention are as follows:

By preprocessing the sample data, the method of the invention avoids the problem of uneven protein chain data quality and so makes the subsequent work easier. Features are extracted from databases on the basis of the evolutionary conservation of amino acids and then fused, so that protein interactions are characterized better. Handling the imbalance of the data set balances the data set and thereby improves the accuracy with which protein-protein interaction sites are identified. Obtaining the protein-protein interaction sites with the XGBoost model avoids the varying degrees of "false positives" and "false negatives", greatly reduces space and time overhead, and further improves the prediction of protein-protein interaction sites.

Brief description of the drawings

Fig. 1 is a flow diagram of the method for identifying protein-protein interaction sites of Embodiment 1;

Fig. 2 is a schematic comparison of the prediction evaluation of the sampling techniques combined with XGBoost in Embodiment 1;

Fig. 3 is a schematic comparison of the prediction results of the method of the invention and of other methods for protein-protein interaction sites.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the embodiments of the invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the invention. Moreover, the embodiments are not mutually independent and can be combined with one another as needed to achieve a better effect. The following detailed description of the embodiments shown in the drawings is therefore not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the invention.

For a further understanding of the invention, it is described in detail below with reference to the drawings and embodiments.

Embodiment 1

In the method for identifying protein-protein interaction sites of the invention, protein chain data are first collected and preprocessed. Preprocessing avoids the problem of uneven protein chain data quality and makes the subsequent work easier. The preprocessed protein chain data are then divided into interface residues and non-interface residues, and features of the protein chains are extracted from databases and fused to obtain a data set. It is worth noting that extracting and fusing features gives a better representation of protein interactions. The imbalance of the data set is then handled, and classification prediction is performed on the processed data set to obtain the protein-protein interaction sites.

With reference to Fig. 1, the specific steps of the method are as follows:

1) Protein chain data are first collected and preprocessed. The preprocessing is carried out as follows: protein chains with fewer than 50 residues are discarded first; protein chains whose sequence similarity is greater than or equal to 30% are then removed; some obsolete protein chains are discarded; and a non-redundant set of protein chain data is finally obtained. In this embodiment, 170 protein chains are collected, and 91 non-redundant protein chains remain after preprocessing.
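The three filters in step 1) can be sketched as a short function. This is a minimal illustration under stated assumptions, not the inventors' implementation: the names `filter_chains`, `toy_identity`, and `obsolete_ids` are hypothetical, the source does not say which alignment tool measures sequence similarity or how "obsolete" chains are identified, and the greedy order in which redundant chains are dropped is a choice made here.

```python
from typing import Callable, Dict, List, Set


def filter_chains(
    chains: Dict[str, str],
    similarity: Callable[[str, str], float],
    obsolete_ids: Set[str],
    min_length: int = 50,
    max_similarity: float = 0.30,
) -> List[str]:
    """Return the IDs of chains that survive the three preprocessing filters."""
    kept: List[str] = []
    for chain_id in sorted(chains):
        seq = chains[chain_id]
        # Filter 1: discard chains with fewer than 50 residues.
        if len(seq) < min_length:
            continue
        # Filter 3 (checked early so an obsolete chain never displaces a valid one).
        if chain_id in obsolete_ids:
            continue
        # Filter 2: greedy redundancy removal against chains already kept
        # (similarity >= 30% counts as redundant).
        if any(similarity(seq, chains[k]) >= max_similarity for k in kept):
            continue
        kept.append(chain_id)
    return kept


def toy_identity(a: str, b: str) -> float:
    """Hypothetical stand-in for a real alignment-based similarity score:
    fraction of identical positions over the shorter sequence."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0
```

In practice `toy_identity` would be replaced by a proper pairwise alignment score; only the thresholds (50 residues, 30% similarity) come from the text.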

2) The preprocessed protein chain data are divided into interface residues and non-interface residues according to the relative accessible surface area of each amino acid and the distance between the α-carbon atoms of two residues. Specifically, if the accessible surface area of an amino acid residue is at least 0.16 of its maximum accessible surface area (that is, its relative accessible surface area is at least 0.16), the residue is classified as a surface residue. Among the surface residues, if the distance d between the α-carbon atoms of any two residues satisfies d < 1.2 nm, the residue is classified as an interface residue; if d ≥ 1.2 nm, the residue is classified as a non-interface residue. In this embodiment there are 10430 surface residues, of which 2299 (22.04%) are interface residues and the remaining 8131 are non-interface residues.
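The labelling rule in step 2) can be sketched as follows. This is an illustration under assumptions rather than the patented procedure itself: it assumes the distance is measured between a residue's α-carbon and the α-carbons of a partner chain, that coordinates are given in nanometres, and that relative accessible surface area values have already been computed by some external tool. The function name `label_residues` and its parameters are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def label_residues(
    rsa: Sequence[float],
    ca_self: Sequence[Coord],
    ca_partner: Sequence[Coord],
    rsa_cutoff: float = 0.16,
    dist_cutoff_nm: float = 1.2,
) -> List[str]:
    """Label each residue as 'buried', 'interface', or 'non-interface'."""
    labels: List[str] = []
    for r, ca in zip(rsa, ca_self):
        # Residues below the RSA cutoff are not surface residues at all.
        if r < rsa_cutoff:
            labels.append("buried")
            continue
        # Smallest C-alpha distance to any residue of the partner chain (assumed, in nm).
        d_min = min((math.dist(ca, p) for p in ca_partner), default=math.inf)
        labels.append("interface" if d_min < dist_cutoff_nm else "non-interface")
    return labels
```

Only residues labelled "interface" or "non-interface" would enter the data set; with the counts reported in this embodiment that is 2299 + 8131 = 10430 surface residues.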

3) Features of the protein chains are extracted from databases on the basis of the evolutionary conservation of amino acids. In this embodiment, features are extracted from the HSSP database and the ConSurf Server database. The extracted features are the residue spatial sequence profile, residue sequence information entropy, relative entropy, residue sequence conservation weight, and residue evolutionary rate. The five features are described as follows:

Residue spatial sequence profile: obtained from a multiple sequence alignment and commonly used in research; it gives the frequency with which each amino acid appears at a given residue position in the protein primary structure.

Residue sequence information entropy: a measure, based on Shannon information theory, of the variability (conservation score) of a sequence position.

Relative entropy: the normalized form of the residue sequence information entropy.

Residue sequence conservation weight: the computed conservation of a position in the protein sequence.

Residue evolutionary rate: describes the evolutionary information of a residue. The Rate4Site algorithm is applied to the conservation score of each residue position in the sequence, and the conservation of each amino acid position is computed from the maximum likelihood estimate of its evolutionary rate.

In the entropy formulas, Entropy denotes the sequence information entropy, d is the total number of amino acid types, and f_i is the frequency with which the i-th amino acid occurs at the sequence position.
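The formulas for the two entropy features are not preserved in this text. As a minimal sketch, assuming the usual Shannon form Entropy = -Σ f_i ln f_i over the d amino acid types and a normalization by ln d (so that the relative entropy lies between 0 and 1), the two quantities can be computed as below. The logarithm base and the scale of the normalized value are assumptions; databases may report the normalized value on a different scale.

```python
import math
from typing import Sequence


def sequence_entropy(freqs: Sequence[float]) -> float:
    """Shannon entropy of the amino acid frequencies at one position (natural log assumed)."""
    # Terms with f_i = 0 contribute nothing, by the convention 0 * ln 0 = 0.
    return -sum(f * math.log(f) for f in freqs if f > 0)


def relative_entropy(freqs: Sequence[float]) -> float:
    """Entropy divided by its maximum ln(d); this normalization is an assumed convention."""
    d = len(freqs)
    return sequence_entropy(freqs) / math.log(d) if d > 1 else 0.0
```

A fully conserved position (one amino acid with frequency 1) gives an entropy of 0, and a position where all 20 amino acids are equally frequent gives a relative entropy of 1 under this normalization.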

4) feature of extraction is merged, and the feature of each residue to be measured is expanded to obtain data set；Wherein The detailed process that the feature of each residue to be measured is expanded are as follows: successively by the feature of each residue to be measured and its space most phase The feature of ten close surface residues is combined, and is expanded the dimension of each feature；It is finally obtained in the present embodiment every 264 dimensional feature vectors of a residue；It should be noted that residue to be measured refers to interface residue and non-interface residue.

5) The imbalance of the data set is handled with a sampling technique. The sampling techniques used for this purpose are as follows:

A. Repeated Edited Nearest Neighbours (RENN)

This algorithm applies the Edited Nearest Neighbours (ENN) algorithm repeatedly. The steps of ENN are as follows: a target sample is first selected, and the classes of its three nearest samples in the data set are checked using the K-nearest-neighbor rule. Specifically, the distances between the target sample and all other samples in the data set are computed, and the three samples nearest to the target sample in the whole data set are examined. If two or more of them belong to a class different from that of the selected target sample, the target sample is regarded as noise and is deleted. RENN repeats this procedure until no more samples can be removed.
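The RENN procedure described above can be sketched in plain Python. This is a minimal sketch, assuming Euclidean distance and applying the deletion rule to every sample; the source does not say whether only the more numerous class is edited (as common library implementations do by default), so that choice is an assumption. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def edited_nearest_neighbours(
    X: Sequence[Sequence[float]], y: Sequence[int], k: int = 3
) -> List[int]:
    """One ENN pass: return the indices of the samples that are kept."""
    keep: List[int] = []
    for i in range(len(X)):
        # The k nearest samples to sample i, excluding i itself.
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )[:k]
        disagree = sum(y[j] != y[i] for j in neighbours)
        # Rule from the text: two or more differing neighbours mark the sample as noise.
        if disagree < 2:
            keep.append(i)
    return keep


def repeated_enn(
    X: Sequence[Sequence[float]], y: Sequence[int], k: int = 3
) -> Tuple[list, list]:
    """Repeat ENN until a pass removes nothing."""
    X, y = list(X), list(y)
    while True:
        keep = edited_nearest_neighbours(X, y, k)
        if len(keep) == len(X):
            return X, y
        X = [X[i] for i in keep]
        y = [y[i] for i in keep]
```

Each pass decides all deletions against the same snapshot of the data, then the next pass runs on the reduced set; this brute-force version is quadratic in the number of samples and is meant only to show the rule.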

Alternatively, the imbalance of the data set is handled with the following method:

B. Instance Hardness Threshold (IHT)

This method uses the concept of instance hardness (IH), which represents the probability that a data point in the training set is misclassified. Data samples that lie on the boundary between two or more classes, or that have noisy characteristics, have higher IH values, because a learning algorithm would have to overfit in order to classify them correctly. IH is derived from P(h | t) by Bayes' theorem, where h denotes the mapping function that maps input features to their associated labels and t denotes the training data. In the invention, the number of interface residues is first compared with the number of non-interface residues, the more numerous of the two residue classes is selected, and the following formula is used to delete the data points in the selected class whose IH value is greater than 0.7:

$$IH(\langle x_i, y_i \rangle) = 1 - p(y_i \mid x_i, h)$$

where $p(y_i \mid x_i, h)$ denotes the probability that the mapping function $h$ assigns the label $y_i$ to the input feature vector $x_i$. The larger $p(y_i \mid x_i, h)$ is, the more likely it is that the correct label is assigned to $x_i$.

It is worth noting that the idea behind IHT comes from Bayes' theorem:

$$P(h \mid t) = \frac{P(t \mid h)\,P(h)}{P(t)}$$

where h is the function that maps an input feature vector to its corresponding label and t is the training set. Decomposing P(h | t) over the training instances by Bayes' theorem yields the concept of instance hardness.
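The decomposition itself is not preserved in this text. One standard way to write it, assuming the training instances $\langle x_i, y_i \rangle$ in $t$ are independent and identically distributed, is sketched below; it is offered as an illustration consistent with the definition $IH = 1 - p(y_i \mid x_i, h)$, not as a transcription of the original formula.

```latex
\begin{aligned}
P(h \mid t) &= \frac{P(t \mid h)\,P(h)}{P(t)}
             = \frac{\prod_{i=1}^{|t|} p(x_i, y_i \mid h)\;P(h)}{P(t)} \\
            &= \frac{\prod_{i=1}^{|t|} p(y_i \mid x_i, h)\,p(x_i \mid h)\;P(h)}{P(t)}
\end{aligned}
```

Each factor $p(y_i \mid x_i, h)$ measures how well $h$ maps $x_i$ to its label $y_i$, so an instance with a small factor is hard to classify, which is what one minus that factor expresses.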

Based on this concept, IHT is used as an undersampling method: the data points with higher IH values in the majority class are removed until the data set is balanced.
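The thresholding step of IHT can be sketched as follows, assuming that an estimate of $p(y_i \mid x_i, h)$ for each sample's true label is already available, for example from out-of-fold predictions of some classifier. How those probabilities are obtained is not specified in the source, and the names `instance_hardness` and `iht_undersample` are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def instance_hardness(prob_true_label: Sequence[float]) -> List[float]:
    """IH = 1 - p(y_i | x_i, h) for every sample."""
    return [1.0 - p for p in prob_true_label]


def iht_undersample(
    labels: Sequence[int],
    prob_true_label: Sequence[float],
    threshold: float = 0.7,
) -> List[int]:
    """Return indices kept after deleting hard samples of the majority class."""
    # The more numerous class is the only one that gets undersampled.
    majority = Counter(labels).most_common(1)[0][0]
    hardness = instance_hardness(prob_true_label)
    return [
        i
        for i, (lab, ih) in enumerate(zip(labels, hardness))
        if not (lab == majority and ih > threshold)
    ]
```

Minority-class samples are kept regardless of their hardness; only the 0.7 threshold and the restriction to the more numerous class come from the text.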

It is worth noting that an imbalanced data set biases the model's predictions toward the majority class, so that interface residues are predicted very poorly; handling the imbalance of the data set improves the prediction of interface residues. Here, the majority class is the more numerous of the two residue classes, interface residues and non-interface residues; for example, if there are more interface residues than non-interface residues, the interface residues form the majority class.

Further, the processed data set is divided into a training set and a test set by ten-fold cross-validation. In this embodiment the data set is divided into ten subsets, of which nine form the training set and the remaining one forms the test set.

Further, the protein-protein interaction sites are obtained using the training set. Specifically, an XGBoost classification model is first trained on the training set as follows. For a given training set of samples $(x_i, y_i)$, each of the $K$ trained classification or regression trees $F = \{f_1(x), f_2(x), \ldots, f_K(x)\}$ assigns a sample to one of its leaf nodes according to split points on the attribute values, and each leaf node corresponds to a real-valued score $f_k$. When a sample $x_i$ to be predicted is given, the prediction result for that sample is the sum of the prediction results of all the trees. The model is as follows:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$

where $F$ is the space of all classification and regression trees, $\hat{y}_i$ is the prediction result for $x_i$, and $f_k(x_i)$ is the prediction score of the leaf node reached when sample $x_i$ is input to the $k$-th tree.

The objective function of the model can be defined as:

$$Obj = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

The optimization objective of the model consists of two parts. The first part, $\sum_{i} l(y_i, \hat{y}_i)$, is the error function of the model; the second part, $\sum_{k=1}^{K} \Omega(f_k)$, is the regularization term of the model, which represents the complexity of the $K$ trees.
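The additive structure of the model and its two-part objective can be illustrated with a toy ensemble. This is a minimal sketch, not the XGBoost training algorithm: the trees are hand-written depth-1 stumps, the error function is assumed to be the logistic loss, and the regularizer is assumed to take the form γT + ½λΣw² used in the XGBoost literature, since the source does not define the error function or Ω. All names are hypothetical.

```python
import math
from typing import Sequence, Tuple

# A depth-1 tree ("stump"): (feature index, threshold, left leaf score, right leaf score).
Stump = Tuple[int, float, float, float]


def tree_score(tree: Stump, x: Sequence[float]) -> float:
    """Route x to a leaf by the split point and return that leaf's real-valued score."""
    feature, threshold, left, right = tree
    return left if x[feature] < threshold else right


def predict_raw(trees: Sequence[Stump], x: Sequence[float]) -> float:
    """Prediction = sum over all trees of the score of the leaf that x reaches."""
    return sum(tree_score(t, x) for t in trees)


def objective(
    trees: Sequence[Stump],
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    gamma: float = 0.0,
    lam: float = 1.0,
) -> float:
    """Error term (logistic loss, assumed) plus regularization term (assumed form)."""
    loss = 0.0
    for x, label in zip(X, y):
        p = 1.0 / (1.0 + math.exp(-predict_raw(trees, x)))
        loss += -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
    # Each stump has T = 2 leaves with weights t[2] and t[3].
    penalty = sum(gamma * 2 + 0.5 * lam * (t[2] ** 2 + t[3] ** 2) for t in trees)
    return loss + penalty
```

In an actual run the trees would be learned by a gradient boosting library rather than written by hand; the sketch only shows how per-tree leaf scores add up to a prediction and how the objective splits into an error part and a complexity part.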

With reference to Fig. 2, the protein-protein interaction sites are obtained with the XGBoost classification model described above. It is worth noting that the XGBoost model classifies residues as interface or non-interface residues, and the accuracy of this classification represents its ability to predict protein-protein interaction sites. Further, the protein-protein interaction sites obtained are tested and evaluated using the test set. The evaluation indices are:

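Accuracy:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

Sensitivity (Recall):

$$Sensitivity = \frac{TP}{TP + FN}$$

Precision:

$$Precision = \frac{TP}{TP + FP}$$

Specificity:

$$Specificity = \frac{TN}{TN + FP}$$

F value:

$$F = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

MCC value:

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$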

where TP is the number of true positives, that is, positive samples that are correctly predicted; TN is the number of true negatives, that is, negative samples that are correctly predicted; FP is the number of false positives, that is, negative samples that are predicted as positive; and FN is the number of false negatives, that is, positive samples that are wrongly predicted as negative. The F value is the weighted harmonic mean of Precision and Recall and combines the two; a higher F value indicates a more effective method. MCC is a good measure for imbalanced problems. It is essentially the correlation coefficient between the true values and the predicted values and lies between -1 and 1, where -1 indicates the worst prediction and 1 indicates the best.
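The six evaluation indices can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definitions and assumes the balanced (F1) form of the F value, since the weighting is not stated in the text; the function name `classification_metrics` is hypothetical, and any counts passed to it in an example are made up for illustration, not results from the patent.

```python
import math
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute the evaluation indices from confusion-matrix counts."""

    def ratio(num: float, den: float) -> float:
        # Return 0.0 for an empty denominator instead of raising an error.
        return num / den if den else 0.0

    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    sensitivity = ratio(tp, tp + fn)  # also called recall
    precision = ratio(tp, tp + fp)
    specificity = ratio(tn, tn + fp)
    # Balanced harmonic mean of precision and recall (F1 assumed).
    f_value = ratio(2 * precision * sensitivity, precision + sensitivity)
    mcc = ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "specificity": specificity,
        "f_value": f_value,
        "mcc": mcc,
    }
```

Returning 0.0 when a denominator is empty is a convenience chosen here so that degenerate folds (for example, no predicted positives) do not raise an exception.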

The results of this embodiment in identifying protein-protein interaction sites are detailed in Table 1. The data in Table 1 show that the accuracy of this embodiment reaches as high as 80.7%.

Table 1. Classification performance evaluation of the XGBoost model based on the two sampling techniques

With reference to Fig. 3, XGB denotes the method of the invention, and Li and Luo denote SVM (support vector machine) methods. Compared with these other methods, the method of the invention identifies protein-protein interaction sites more effectively. Further, by preprocessing the sample data, the method avoids the problem of uneven protein chain data quality and makes the subsequent work easier. To address the feature problem of the sample data, the invention extracts features from databases on the basis of the evolutionary conservation of amino acids and fuses them, so that protein interactions are characterized better. Handling the imbalance of the data set balances the data set and thereby improves the accuracy with which protein-protein interaction sites are identified. Obtaining the protein-protein interaction sites with the XGBoost model avoids the varying degrees of "false positives" and "false negatives", greatly reduces space and time overhead, and further improves the prediction of protein-protein interaction sites.

The invention has been described in detail above with reference to specific exemplary embodiments. It should be understood, however, that various modifications and variations can be made without departing from the scope of the invention as defined by the appended claims. The detailed description and the drawings are to be regarded as illustrative rather than restrictive, and any such modifications and variations fall within the scope of the invention described herein. In addition, the background section is intended to illustrate the state of development and the significance of the technology, and is not intended to limit the invention or its applications and fields of application.

Claims

1. A method for identifying protein-protein interaction sites, characterized in that protein chain data are first collected and preprocessed, and the preprocessed protein chain data are divided into interface residues and non-interface residues; features of the protein chains are then extracted from databases and fused to obtain a data set, and the imbalance of the data set is handled; the processed data set is divided into a training set and a test set, an XGBoost model is trained on the training set, and finally the protein-protein interaction sites are obtained with the XGBoost model.

2. The method for identifying protein-protein interaction sites according to claim 1, characterized in that the specific steps are as follows:

2) the preprocessed protein chain data are divided into interface residues and non-interface residues according to the relative accessible surface area of each amino acid and the distance between the α-carbon atoms of two residues;

5) the imbalance of the data set is handled, the processed data set is divided into a training set and a test set, an XGBoost model is trained on the training set, and finally the protein-protein interaction sites are obtained with the XGBoost model.

3. The method for identifying protein-protein interaction sites according to claim 1, characterized in that the protein chain data are preprocessed as follows: protein chains with fewer than 50 residues are discarded first; protein chains whose sequence similarity is greater than or equal to 30% are then removed; some obsolete protein chains are then discarded; and a non-redundant set of protein chain data is finally obtained.

4. The method for identifying protein-protein interaction sites according to claim 1, characterized in that the preprocessed protein chain data are divided into interface residues and non-interface residues as follows: if the accessible surface area of an amino acid residue is at least 0.16 of its maximum accessible surface area, the residue is classified as a surface residue; among the surface residues, if the distance d between the α-carbon atoms of any two residues satisfies d < 1.2 nm, the residue is classified as an interface residue; if d ≥ 1.2 nm, the residue is classified as a non-interface residue.

5. The method for identifying protein-protein interaction sites according to claim 2, characterized in that the features of each residue to be tested are combined in turn with the features of its ten spatially nearest surface residues, which expands the dimension of each feature vector.

6. The method for identifying protein-protein interaction sites according to claim 2, characterized in that the imbalance of the data set is handled as follows:

A target sample is first selected, and the classes of its three nearest samples in the data set are checked using the K-nearest-neighbor rule; if two or more of those samples belong to a class different from that of the selected target sample, the target sample is regarded as noise and is deleted; the operation is repeated until no noisy data remain.

7. The method for identifying protein-protein interaction sites according to claim 2, characterized in that the imbalance of the data set is handled as follows: the number of interface residues is first compared with the number of non-interface residues, the more numerous of the two residue classes is selected, and the following formula is used to delete the data points in the selected class whose IH value is greater than 0.7:

$$IH(\langle x_i, y_i \rangle) = 1 - p(y_i \mid x_i, h)$$

where $p(y_i \mid x_i, h)$ denotes the probability that the mapping function $h$ assigns the label $y_i$ to the input feature vector $x_i$.

8. The method for identifying protein-protein interaction sites according to claim 2, characterized in that the formula used to train the XGBoost model on the training set is:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$

where $F$ is the space of all classification and regression trees, $\hat{y}_i$ is the prediction result for $x_i$, and $f_k(x_i)$ is the prediction score of the leaf node reached when sample $x_i$ is input to the $k$-th tree;

the objective function of the XGBoost model is:

$$Obj = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

9. The method for identifying protein-protein interaction sites according to any one of claims 1 to 8, characterized in that the protein-protein interaction sites obtained are verified using the test set.

10. The method for identifying protein-protein interaction sites according to any one of claims 1 to 8, characterized in that the extracted features are the residue spatial sequence profile, residue sequence information entropy, relative entropy, residue sequence conservation weight, and residue evolutionary rate.