CN109326329B - Zinc binding protein action site prediction method

Info

Publication number: CN109326329B
Application number: CN201811353819.0A
Authority: CN
Inventors: 李慧
Original assignee: Jinling Institute of Technology
Current assignee: Jinling Institute of Technology
Priority date: 2018-11-14
Filing date: 2018-11-14
Publication date: 2020-07-07
Anticipated expiration: 2038-11-14
Also published as: CN109326329A

Abstract

The invention discloses a method for predicting zinc-binding protein action sites. Protein source data are first preprocessed according to the characteristics of zinc-binding protein action sites. The class imbalance of the action sites is then balanced by random down-sampling to obtain a plurality of sub-balanced data sets. Discriminative biochemical features of the proteins are selected on each sub-balanced data set and encoded as feature vectors. The feature vectors are used as the input of a support vector machine base classifier, sample weights are calculated, a sample-weighted probabilistic neural network model is constructed, and the support vector machine and the sample-weighted probabilistic neural network model are integrated to obtain a prediction model. Finally, the prediction model is used to identify the zinc-binding protein action sites in a target sample.

Description

Zinc binding protein action site prediction method

Technical Field

The invention relates to a zinc-binding protein action site prediction method that identifies zinc-binding protein action sites with an ensemble learning classification model under class-imbalanced conditions, and belongs to the interdisciplinary field of proteomics and computer science.

Background

With the completion of the Human Genome Project, the life sciences have entered the post-genome era, and the proteins expressed by genes have become one of the important research subjects in the life sciences and natural sciences. Proteins are the basic organic substances that make up cells; they are the material basis of life and play a decisive role in biological processes. However, this decisive role is not determined by a single protein alone: in most cases a protein must interact with other proteins or ligands in order to perform a specific biological function.

In cells, proteins are both the agents and the carriers of vital activities, and by interacting with ligands they perform specific critical functions such as DNA synthesis, signal transduction, transcriptional activation of genes, key metabolic processes and protection against viruses. In addition, the study of protein interactions greatly benefits the treatment of various diseases, in particular those involving invading viral proteins such as those of the Ebola virus: it can reveal the pathogenesis of some diseases, help to find drug targets, and guide the development of new drugs.

Metal ions bind to proteins as cofactors and play a decisive role in the biological functions of proteins and even in some life processes. Zinc is the second most abundant metal ion in organisms, after iron, and has important regulatory effects on growth and development, disease control, DNA synthesis and the like. Zinc deficiency can lead to diseases such as age-related degenerative diseases, malignancies and Wilson's disease. In addition, zinc has important effects on aging, apoptosis, immune function and oxidative stress. Zinc ions bind to proteins to perform biological functions such as catalysis, structure stabilization and coordination.

At present, zinc-binding protein action sites are identified mainly by biochemical experiments. Although experimental methods can determine the interaction sites between proteins and zinc ions, they are costly, time-consuming and labor-intensive; moreover, because different experiments require different conditions and rely on different principles, the experimental results contain a certain number of false negatives and false positives. Therefore, the data obtained by experimental techniques alone are far from meeting the needs of biological research.

With the development of information technology and the emergence of massive biological data, automatically identifying zinc-binding protein action sites by computational methods such as data mining and machine learning has become an inevitable trend. Such methods are inexpensive and fast, can compensate for the shortcomings of experiments, and can in turn provide direct support and guidance for the expensive experimental determination of these interactions.

The prediction of zinc-binding protein action sites is a two-class problem in which the truly bound sites are few and the non-binding sites make up a high proportion, so it is a typical imbalanced classification problem. Existing prediction methods build classification models by data mining and similar techniques, treat the two classes of samples equally and do not consider the imbalance of the data, so their prediction accuracy for zinc-binding protein action sites is very low. Therefore, studying the imbalance in zinc-binding protein action site prediction and improving the classification accuracy of the minority class is of significant research value.

Disclosure of Invention

The invention aims to provide an ensemble-learning-based method for predicting zinc-binding protein action sites under class imbalance, addressing the imbalanced classification problem in zinc-binding protein action site prediction.

In order to solve the technical problems, the technical scheme adopted by the invention is as follows:

A method for predicting zinc-binding protein action sites, comprising the following steps:

step one: preprocessing protein source data according to the characteristics of zinc-binding protein action sites;

step two: balancing the class imbalance of the zinc-binding protein action sites by random down-sampling to obtain a plurality of sub-balanced data sets;

step three: selecting discriminative biochemical features of the proteins on each of the sub-balanced data sets and encoding them as feature vectors;

step four: using the feature vectors as the input of a support vector machine base classifier, calculating sample weights, then constructing a sample-weighted probabilistic neural network model, and finally integrating the support vector machine base classification model and the sample-weighted probabilistic neural network model to obtain a prediction model;

step five: identifying the zinc-binding protein action sites in the target sample with the prediction model obtained in step four.

In the first step, the preprocessing removes the following noise data:

(1) removing peptide chain structure with homology higher than 70%;

(2) elimination of repetitive, shorter protein chains and erroneous and unreliable data;

(3) chains satisfying less than 20% sequence redundancy are removed.

In step two, the balancing treatment uses random down-sampling: the majority-class samples are randomly down-sampled so that their number equals that of the minority-class samples, forming a plurality of sub-balanced data sets; the majority-class samples are non-binding sites and the minority-class samples are zinc-binding protein action sites.

In step three, the discriminative biochemical features comprise the position-specific scoring matrix (PSSM), the conservation score and RW-GRMTP (relative weight of gapless real matches to pseudocounts). The position-specific scoring matrix is normalized and processed with a histogram and a sliding window to obtain a 20-dimensional vector; the 20-dimensional conservation score is converted into a single value; RW-GRMTP is normalized to obtain a 2-dimensional vector; together these form a 23-dimensional feature vector.

In step four, a support vector machine (SVM) is trained on each of the sub-balanced data sets, and the prediction error rate e_j and the importance weight α_j of each classification model are calculated according to equation (1) and equation (2);

where the full data set is D = {(x₁,y₁),(x₂,y₂),…,(x_n,y_n)}, x_i ∈ X, X denotes the instance space of the classification problem, y_i ∈ {1, −1}, i = 1,2,…,n, and n is the number of samples; w_mi is the sample weight, whose initial value is set to 1/n, i.e. w₁ = (w₁₁,w₁₂,…,w_1n), where w_1i = 1/n; i = 1,2,…,n; m = 1,2. The base classifier SVM is trained on each of the k sub-balanced data sets to obtain k classification prediction results C_{svm_j}(x), j = 1,…,k.

The weight of each sample is then calculated and normalized: if a sample is correctly classified, its weight is reduced; if it is misclassified, its weight is increased. The calculation is given by formula (3):

A sample-weighted probabilistic neural network model is then constructed: the protein feature data are weighted, the weighted sample data are used as the input of the probabilistic neural network model, and the probabilistic neural network is used for prediction. This method is denoted SWPNN (sample-weighted probabilistic neural network), and its prediction result is SWPNN(x).

The support vector machine base classification model and the sample-weighted probabilistic neural network model are integrated to obtain the prediction model SSWPNN, where SSWPNN = {SVM, SWPNN, kernelopt, spread, f}; kernelopt and spread are the parameters of the SVM and SWPNN classifiers respectively, and f is defined by formula (4). At the same time, the corresponding weight β_j is calculated from the error rate;

where δ is the threshold, and C_{svm_j}(x) and SWPNN(x) are the classification results of the SVM and SWPNN classifiers respectively; a sample is predicted as positive if the value is greater than 0 and as negative if the value is less than 0. If the value of SVM(x) is positive but smaller than the threshold δ and SWPNN(x) predicts a negative example, the final integrated prediction is negative; otherwise the SVM(x) result is used as the final decision.

In step five, the integrated SSWPNN models are each used to predict the whole test data set, giving different classification results; these results are then combined by weighted integration to identify the zinc-binding protein action sites in the target sample, as shown in formula (5):

Advantageous effects:

From the perspective of machine learning, the invention provides a novel ensemble-learning-based method for predicting zinc-binding protein action sites that addresses the recognition of such sites under class imbalance, thereby effectively handling the prediction of zinc-binding protein action sites as an imbalanced classification problem and achieving a certain prediction accuracy. With extension, the invention can also be applied to the prediction and identification of the action sites of other types of metal-ion-binding proteins.

Drawings

The foregoing and/or other advantages of the invention will become further apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.

FIG. 1 is an overall block diagram of the method of the present invention.

FIG. 2 is a framework diagram of zinc binding protein action site classifier based on SVM and SWPNN models.

FIG. 3 is a prediction process diagram of the SSWPNN classifier.

Detailed Description

The invention will be better understood from the following examples.

The general flow of the present invention is shown in FIG. 1.

To address the prediction of zinc-binding protein action sites on an imbalanced data set, the invention balances the data by down-sampling. Using an ensemble technique, a classifier model that combines a support vector machine with a sample-weighted probabilistic neural network is constructed and used to classify and identify zinc-binding protein action sites.

The specific implementation steps are as follows:

1. Balancing treatment

The zinc-binding protein action sites are called the minority class (negative class); the non-binding sites are called the majority class (positive class). The majority-class samples are randomly under-sampled without replacement; to avoid losing useful majority-class information through random under-sampling, the under-sampling without replacement is repeated multiple times over the full data set. Each time, as many majority-class samples as there are minority-class samples are drawn at random without replacement, so that the majority class is divided into k subsets, and each subset is merged with the minority-class samples to form the balanced data sets D₁,D₂,…,D_k. The process is described by Algorithm 1:

Algorithm 1: data balancing algorithm

Input: protein sequence sample data D

Output: sub-balanced data sets D₁,D₂,…,D_k

BEGIN

    Divide(D);

    N = CountUp(MinoritySample);

    For (i = 1; i <= k; i++)

        ExtractedSample_i = RandomExtract(MajoritySample, N);

        D_i = Merge(MinoritySample, ExtractedSample_i);

        MajoritySample = MajoritySample - ExtractedSample_i;

    End for;

END.
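Algorithm 1 can be sketched in Python as follows. This is only an illustrative sketch: the function name `make_balanced_subsets`, the `seed` argument and the choice k = floor(|majority| / |minority|) are assumptions, since the text does not state how k is chosen.

```python
import random

def make_balanced_subsets(minority, majority, seed=0):
    """Pair the full minority class with disjoint random majority subsets."""
    n = len(minority)
    if n == 0:
        raise ValueError("minority class is empty")
    rng = random.Random(seed)
    pool = list(majority)
    # Shuffling once and slicing is equivalent to repeatedly drawing N samples
    # without replacement and removing them from the majority pool.
    rng.shuffle(pool)
    k = len(pool) // n  # assumption: as many full-size subsets as fit
    return [list(minority) + pool[i * n:(i + 1) * n] for i in range(k)]

# Toy data: 3 minority samples and 10 majority samples give k = 3 balanced sets.
minority = [("minority", i) for i in range(3)]
majority = [("majority", i) for i in range(10)]
subsets = make_balanced_subsets(minority, majority)
```

Majority samples left over after the last full subset (one sample in the toy data) are simply dropped in this sketch.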

2. Attribute feature representation

The discriminative biochemical features selected are the position-specific scoring matrix (PSSM), the conservation score and RW-GRMTP (relative weight of gapless real matches to pseudocounts); they are encoded to form the feature vector set. The position-specific scoring matrix is normalized and processed with a histogram and a sliding window to obtain a 20-dimensional vector; the 20-dimensional conservation score is converted into a single value; RW-GRMTP is normalized to obtain a 2-dimensional vector; together these form a 23-dimensional feature vector.
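The 20 + 1 + 2 = 23 layout of the feature vector can be illustrated with a small Python sketch. The text does not give the exact transforms, so everything below is an assumption made for illustration: logistic scaling of PSSM scores, a window mean in place of the histogram and sliding-window step, an entropy-based conservation value, min-max scaling, and the choice of the two per-position columns (`info` and `rw`) as the 2-dimensional RW-GRMTP part.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def window_mean(rows, center, half_width):
    """Average 20-d rows over a window centred on one residue, clipped at chain ends."""
    lo, hi = max(0, center - half_width), min(len(rows), center + half_width + 1)
    span = rows[lo:hi]
    return [sum(r[c] for r in span) / len(span) for c in range(len(rows[0]))]

def conservation(freqs):
    """Collapse a 20-d frequency row into one value: 1 - normalised Shannon entropy."""
    total = sum(freqs)
    if total == 0:
        return 0.0
    h = -sum((f / total) * math.log(f / total) for f in freqs if f > 0)
    return min(1.0, max(0.0, 1.0 - h / math.log(len(freqs))))

def min_max(values):
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def residue_features(pssm, freqs, info, rw, half_width=1):
    """Return one 23-d vector per residue: 20 (PSSM) + 1 (conservation) + 2."""
    norm = [[logistic(s) for s in row] for row in pssm]
    info_n, rw_n = min_max(info), min_max(rw)
    return [window_mean(norm, i, half_width) + [conservation(freqs[i]), info_n[i], rw_n[i]]
            for i in range(len(pssm))]

# Toy chain of 4 residues with made-up scores.
pssm = [[(i + j) % 7 - 3 for j in range(20)] for i in range(4)]
freqs = [[5 if j == i else 0 for j in range(20)] for i in range(4)]
features = residue_features(pssm, freqs, info=[0.2, 0.9, 0.4, 0.6], rw=[0.1, 0.5, 0.3, 0.8])
```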

3. Probabilistic neural network model integrating support vector machine and sample weighting

A support vector machine is first trained as the base classifier, the samples are weighted according to its classification results, and a weighted probabilistic neural network model is trained to handle the "hard-to-classify" samples near the decision boundary that are easily misclassified.

Let the full data set be D = {(x₁,y₁),(x₂,y₂),…,(x_n,y_n)}, x_i ∈ X, where X denotes the instance space of the classification problem, y_i ∈ {1, −1}, i = 1,2,…,n, and n is the number of samples. The process is as follows:

Step 1, training an SVM classifier on each of the sub-balanced data sets;

The base classifier SVM is trained on each of the k sub-balanced data sets with 5-fold cross-validation, giving k classification prediction results C_{svm_j}(x), j = 1,…,k. The prediction error rate is denoted e_j and the importance weight of the classification model is denoted α_j; they are calculated by equations (1) and (2). In equation (1), w_mi is the sample weight, whose initial value is set to 1/n, i.e. w₁ = (w₁₁,w₁₂,…,w_1n), where w_1i = 1/n; i = 1,2,…,n; m = 1,2.
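Equations (1) and (2) are not reproduced in this text. The following Python sketch therefore assumes the standard AdaBoost forms, a weighted misclassification rate for e_j and α_j = ½·ln((1 − e_j)/e_j); the function names are hypothetical and the patent's actual formulas may differ.

```python
import math

def weighted_error(weights, y_true, y_pred):
    """Assumed form of e_j: share of total sample weight on misclassified samples."""
    wrong = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    return wrong / sum(weights)

def model_weight(error, eps=1e-10):
    """Assumed form of alpha_j: 0.5 * ln((1 - e_j) / e_j), clipped to stay finite."""
    error = min(max(error, eps), 1.0 - eps)
    return 0.5 * math.log((1.0 - error) / error)

# Toy example: 5 samples with uniform initial weights w_1i = 1/n, 2 misclassified.
y_true = [1, 1, -1, -1, -1]
y_pred = [1, -1, -1, -1, 1]
w1 = [1.0 / len(y_true)] * len(y_true)
e_j = weighted_error(w1, y_true, y_pred)
alpha_j = model_weight(e_j)
```

Under this assumed form, a classifier that does better than chance (e_j < 0.5) receives a positive importance weight, and the weight grows as the error rate falls.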

Step 2, calculating the weight of the current sample and carrying out normalization processing;

After the first round of prediction by the SVM, if a sample is correctly classified, its weight is reduced in the next round of prediction; conversely, if a sample is misclassified, its weight is increased in the next round. The sample weight is calculated as in equation (3):
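Equation (3) is not reproduced in this text. As a hedged illustration of the behaviour described (weights fall for correctly classified samples, rise for misclassified ones, then are normalized), the sketch below assumes the usual AdaBoost update w_i ← w_i·exp(−α·y_i·C(x_i)) / Z; the function name `update_weights` is hypothetical.

```python
import math

def update_weights(weights, y_true, y_pred, alpha):
    """Assumed AdaBoost-style update with labels and predictions in {1, -1}."""
    # t * p is +1 for a correct prediction (weight shrinks) and -1 for an error (weight grows).
    raw = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    z = sum(raw)  # normalisation constant so the new weights sum to 1
    return [r / z for r in raw]

y_true = [1, 1, -1, -1, -1]
y_pred = [1, -1, -1, -1, 1]   # samples 1 and 4 are misclassified
w_old = [0.2] * 5
w_new = update_weights(w_old, y_true, y_pred, alpha=0.2)
```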

Step 3, training the sample-weighted PNN predictor SWPNN;

The feature sample data are weighted with the weights calculated in Step 2, and a probabilistic neural network model is trained on the weighted data; the proposed method is denoted SWPNN and its prediction result is SWPNN(x). The framework of the zinc-binding site classifier based on the SVM and SWPNN models is shown in FIG. 2.
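One way to read "sample-weighted probabilistic neural network" is a Parzen-window classifier in which each training pattern's Gaussian kernel is scaled by its sample weight. The sketch below follows that reading; it is an assumption rather than the patent's stated construction, and the name `swpnn_score` and the signed-score output are illustrative only. The `spread` parameter plays the role of the kernel width.

```python
import math

def swpnn_score(x, train_x, train_y, weights, spread=0.5):
    """Signed score: weighted kernel mass of class +1 minus that of class -1."""
    score = 0.0
    for xi, yi, wi in zip(train_x, train_y, weights):
        dist2 = sum((a - b) ** 2 for a, b in zip(x, xi))
        # Pattern unit: Gaussian kernel, scaled by the sample weight (assumption).
        score += yi * wi * math.exp(-dist2 / (2.0 * spread ** 2))
    return score

train_x = [[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [3.0, 4.0]]
train_y = [1, 1, -1, -1]
weights = [0.25, 0.25, 0.25, 0.25]
near_pos = swpnn_score([0.0, 0.5], train_x, train_y, weights)
near_neg = swpnn_score([3.0, 3.5], train_x, train_y, weights)
```

A score greater than 0 is read as a positive prediction and a score less than 0 as a negative one, matching the sign convention used for SWPNN(x) in the text. Raising the weight of a hard sample increases its pull on nearby query points.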

Step 4, integrating a base classification model SVM and a sample weighted SWPNN classifier;

The base classifier SVM and the sample-weighted probabilistic neural network model are integrated to give a new prediction method, SSWPNN, i.e. SSWPNN = {SVM, SWPNN, kernelopt, spread, f}, where kernelopt and spread are the parameters of the SVM and SWPNN classifiers respectively, and f is defined by formula (4). At the same time, the corresponding weight β_j (the weight of the base classifier in the final classifier) is calculated from the error rate.

where δ is the threshold, and C_{svm_j}(x) and SWPNN(x) are the classification results of the SVM and SWPNN classifiers respectively; a sample is predicted as positive if the value is greater than 0 and as negative if the value is less than 0. If the value of SVM(x) is positive but small, i.e. smaller than the threshold δ, and SWPNN(x) predicts a negative example, the final integrated prediction is negative; otherwise the SVM(x) result is used as the final decision.
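The decision rule described for f can be written as a short Python function. Formula (4) itself is not reproduced in this text, so this is a sketch of the verbal rule only; the function name `combine` and the handling of an SVM value of exactly 0 (treated as negative) are assumptions.

```python
def combine(svm_value, swpnn_value, delta):
    """Return +1 or -1 following the rule described for f."""
    # A weakly positive SVM output is overruled when SWPNN predicts negative.
    if 0 < svm_value < delta and swpnn_value < 0:
        return -1
    # In every other case the SVM decision stands.
    return 1 if svm_value > 0 else -1

decisions = [
    combine(0.1, -0.3, delta=0.5),   # weak SVM positive, SWPNN negative
    combine(0.8, -0.3, delta=0.5),   # confident SVM positive
    combine(0.1, 0.2, delta=0.5),    # weak SVM positive, SWPNN agrees
    combine(-0.4, 0.9, delta=0.5),   # SVM negative
]
```

The rule only ever turns a low-confidence positive into a negative, so SWPNN acts as a second opinion on samples close to the SVM decision boundary.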

Step 5, the integrated SSWPNN models from Step 4 are each used to predict the whole data set, giving different classification results; these results are combined by weighted integration using formula (5) to finally identify the zinc-binding protein action sites. The framework is shown in FIG. 3.
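Formula (5) is not reproduced in this text. A common form of weighted integration is a weighted vote, sign(Σ β_j·f_j(x)), and the sketch below assumes that form; the function name `weighted_vote` and the tie-breaking choice (a total of 0 counts as negative) are assumptions.

```python
def weighted_vote(labels, betas):
    """Combine k labels in {1, -1} using the classifier weights beta_j."""
    total = sum(b * l for l, b in zip(labels, betas))
    return 1 if total > 0 else -1

# Three ensemble members: one votes positive, two vote negative.
strong_positive = weighted_vote([1, -1, -1], [0.9, 0.3, 0.3])
weak_positive = weighted_vote([1, -1, -1], [0.2, 0.3, 0.3])
```

With weights derived from the error rates, a single accurate member can outvote two less accurate ones, which is the point of weighting the integration rather than using a simple majority.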

The method was tested on a data set of 392 protein chains and compared with four existing methods (meta-zincPrediction, ZincExplorer, ZincFinder and ZincPred). The method of the invention outperforms the other methods both in overall prediction performance over the four residue types (CHED) and in prediction performance for each individual residue type.

The invention provides a method and concept for predicting zinc-binding protein action sites, and there are many ways to implement it; the above description is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make a number of improvements and modifications without departing from the principle of the invention, and these improvements and modifications should also be regarded as falling within the protection scope of the invention. All components not specified in this embodiment can be realized with the prior art.

Claims

1. A method for predicting a zinc binding protein site of action, comprising the steps of:

step five: identifying the zinc binding protein action site in the target sample by adopting the prediction model obtained in the step four;

in the first step, the preprocessing removes the following noise data:

(1) removing peptide chain structure with homology higher than 70%;

(3) removing chains satisfying sequence redundancy less than 20%;

in step two, the balancing treatment uses random down-sampling: the majority-class samples are randomly down-sampled so that their number equals that of the minority-class samples, forming a plurality of sub-balanced data sets; the majority-class samples are non-binding sites and the minority-class samples are zinc-binding protein action sites;

in step three, the distinguishable biochemical characteristics comprise a characteristic position specificity score matrix, a conservative score and RW-GRMTP; carrying out normalization processing on the position specificity scoring matrix, and adopting a histogram and a sliding window to process to obtain a 20-dimensional vector; converting the 20-dimensional conservation score into a value; normalization processing is carried out on RW-GRMTP to obtain a 2-dimensional vector; finally forming a 23-dimensional feature vector;

wherein the full data set is D = {(x₁,y₁),(x₂,y₂),…,(x_n,y_n)}, x_i ∈ X, X denotes the instance space of the classification problem, y_i ∈ {1, −1}, i = 1,2,…,n, n is the number of samples, and w_mi is the sample weight, whose initial value is set to 1/n, i.e. w₁ = (w₁₁,w₁₂,…,w_1n), where w_1i = 1/n; i = 1,2,…,n; m = 1,2; the base classifier SVM is trained on each of the k sub-balanced data sets to obtain k classification prediction results C_{svm_j}(x), j = 1,…,k;

in step four, the weight of each sample is calculated and normalized: if a sample is correctly classified, its weight is reduced; if it is misclassified, its weight is increased; the calculation is given by formula (3):

in the fourth step, a probabilistic neural network model based on sample weighting is constructed to weight protein characteristic data, the weighted sample data is used as the input of the probabilistic neural network model, and the probabilistic neural network is used for prediction, wherein the method is marked as SWPNN, and the prediction result is SWPNN (x);

in step four, the prediction model SSWPNN is obtained by integrating the support vector machine base classification model and the sample-weighted probabilistic neural network model, where SSWPNN = {SVM, SWPNN, kernelopt, spread, f}, kernelopt and spread are the parameters of the SVM and SWPNN classifiers respectively, and f is defined by formula (4); at the same time, the corresponding weight β_j is calculated from the error rate;

where δ is the threshold, and C_{svm_j}(x) and SWPNN(x) are the classification results of the SVM and SWPNN classifiers respectively; a sample is predicted as positive if the value is greater than 0 and as negative if the value is less than 0; if the value of SVM(x) is positive but smaller than the threshold δ and SWPNN(x) predicts a negative example, the final integrated prediction is negative, and in all other cases the SVM(x) result is used as the final decision;

in the fifth step, the whole data set is respectively predicted by using an integration model SSWPNN to obtain different classification results, then the results are subjected to weighted integration, and finally the action sites of the zinc binding protein in the target sample are identified, as shown in the formula (5):