CN107607723A

CN107607723A - A method for measuring protein-protein interactions based on random projection ensemble classification

Info

Publication number: CN107607723A
Application number: CN201710653339.5A
Authority: CN
Inventors: 宋晓宇; 邱泽阳; 孙向阳; 赵阳
Original assignee: Lanzhou Jiaotong University
Current assignee: Lanzhou Jiaotong University
Priority date: 2017-08-02
Filing date: 2017-08-02
Publication date: 2018-01-19

Abstract

The present invention relates to the field of biology, specifically to a method for measuring protein-protein interactions based on random projection ensemble classification. The method comprises the following steps: A, protein data screening; B, substitution matrix representation; C, discrete cosine transform; D, establishing a random projection ensemble model; and E, model evaluation. The invention obtains a prediction model of protein interactions from their classification features and uses the model to detect disease-related protein interactions, thereby solving the problem that the association between protein interactions and disease could not be predicted. The invention is applicable to the verification and screening of animal and cell proteins. Representing sequences by a substitution matrix and applying the discrete cosine transform before building the random projection ensemble model facilitates later analysis and gives a more accurate representation. The accuracy, sensitivity, positive predictive value and Matthews correlation coefficient are more stable and more accurate than those of existing conventional methods.

Description

Method for measuring protein-protein interactions based on random projection ensemble classification

Technical Field

The invention relates to the field of biology, and in particular to a method for measuring protein-protein interactions based on random projection ensemble classification.

Background

Protein-protein interactions (PPIs) arise from interactions in time and space. They are the basis on which proteins realize their functions and are key to the study of cellular life activities. Although the prior art discloses methods for obtaining PPI data from organisms using biotechnology, these methods suffer from low efficiency, high cost and a high false-positive rate, and do not meet the requirements of technical development. A high-efficiency, low-cost method for measuring PPIs is therefore urgently needed.

Disclosure of Invention

The invention overcomes the defects of the prior art and provides a method for measuring protein-protein interactions based on random projection ensemble classification that has high efficiency, low cost and a high positive rate.

The technical scheme adopted by the invention for solving the technical problems is as follows:

A method for measuring protein-protein interactions based on random projection ensemble classification comprises the following steps:

A. protein data screening

Screening protein interaction pairs from a protein database DIP;

B. substitution matrix representation

Using the BLOSUM62 matrix, a protein sequence of length N produces an N × 20 matrix, the substitution matrix representation (SMR), defined as:

$$SMR(i, j) = B(P(i), j), \quad i = 1, \dots, N, \; j = 1, \dots, 20$$

where P denotes the protein sequence consisting of N amino acids, P(i) is the amino acid at position i of the sequence, and B(P(i), j) is the BLOSUM62 entry reflecting how readily the amino acid P(i) mutates to amino acid j;

C. discrete cosine transform

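The discrete cosine transform (DCT) formula is as follows:

$$DCT(i, j) = k_i k_j \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} Sig(m, n) \cos\frac{\pi (2m+1) i}{2M} \cdot \cos\frac{\pi (2n+1) j}{2N}, \quad 0 \le i \le M-1, \; 0 \le j \le N-1$$

wherein k_i and k_j are normalization coefficients.

D. establishing the random projection ensemble model

An n × d original matrix X_{n×d} is selected and multiplied by a d × k random matrix A to obtain the mapped low-dimensional n × k matrix, denoted here X^{RP}_{n×k}. The random projection can be expressed as:

$$X^{RP}_{n \times k} = X_{n \times d} A_{d \times k}$$

wherein k is less than or equal to d.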

The method further comprises step E, model evaluation:

determining the accuracy Acc, the sensitivity Sen, the positive predictive value PE and the Matthews correlation coefficient MCC.

In step A, protein interaction pairs are screened from the protein database DIP: first, proteins with fewer than 50 residues are removed; second, protein pairs with more than 40% sequence identity are removed; the remaining protein pairs are retained for use.

The remaining protein pairs are used as the positive data set, and the same number of protein pairs are selected from different subcellular locations, following step A, as the negative data set.

The invention has the beneficial effects that:

According to the invention, a prediction model of protein interactions is obtained from their classification features, and the prediction model is used to detect disease-related protein interactions. This solves the problem that the association between protein interactions and disease could not be predicted. The method is suitable for screening animal and strain protein pairs. Building the random projection ensemble model on the substitution matrix representation and the discrete cosine transform facilitates later analysis and gives a more accurate representation. Compared with conventional methods, the accuracy, sensitivity, positive predictive value and Matthews correlation coefficient are more stable and more accurate.

Drawings

FIG. 1 is a flow chart of the measurement method of the present invention.

Detailed Description

A method for measuring protein-protein interactions based on random projection ensemble classification, characterized by comprising the following steps:

A. protein data screening

Screening protein interaction pairs from a protein database DIP;

B. substitution matrix representation

C. discrete cosine transform

The discrete cosine transform (DCT) formula is as follows:

$$DCT(i, j) = k_i k_j \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} Sig(m, n) \cos\frac{\pi (2m+1) i}{2M} \cdot \cos\frac{\pi (2n+1) j}{2N}, \quad 0 \le i \le M-1, \; 0 \le j \le N-1$$

wherein k_i and k_j are normalization coefficients.

D. establishing the random projection ensemble model

E. model evaluation

In step A, protein interaction pairs are screened from the protein database DIP: first, proteins with fewer than 50 residues are removed; second, protein pairs with more than 40% sequence identity are removed; the remaining protein pairs are retained for use. The remaining protein pairs are used as the positive data set, and the same number of protein pairs are selected from different subcellular locations, following step A, as the negative data set.

Specific proteins are selected for measurement; the specific scheme is as follows:


To construct a standard data set, the 5594 selected protein pairs were used as the positive data set, and 5594 non-interacting protein pairs were selected from different subcellular locations to construct the negative data set. The data set used for the experiment thus consisted of 11188 protein pairs, 50% positive and 50% negative samples.

To verify whether the method is applicable to other types of protein pairs, two further data sets were constructed. The first is a human protein data set collected from the Human Protein Reference Database (HPRD). Human protein pairs with more than 25% sequence identity were removed, and screening yielded 3899 interacting protein pairs, which were used as the positive data set. On the principle that proteins located in different subcellular compartments of an organism cannot interact with each other, 4262 non-interacting human protein pairs from different subcellular compartments were selected as the negative data set. The human data set therefore consisted of 8161 protein pairs. The second data set consisted of the 2916 Helicobacter pylori protein pairs described by Martin et al., comprising 1458 interacting pairs and 1458 non-interacting pairs.

B. Substitution matrix representation

The substitution matrix method based on the BLOSUM62 matrix is illustrated using the protein sequence "MNEDIEAYFERIGYKNSRNKL" as an example. The following table is the BLOSUM62 matrix.

TABLE 1 BLOSUM62 matrix

Based on the above table, the amino acid M in the sequence is replaced, according to the BLOSUM62 matrix, by the vector "-1 -1 -1 -2 -1 -3 -2 -3 -2 0 -2 -1 -1 5 1 2 -2 0 -1 -1". In the same way, the amino acid N is replaced by "-3 1 0 -2 -2 0 6 1 0 0 -1 0 0 -2 -3 -3 -3 -3 -2 -4", and the substitution vectors of the remaining residues EDIEAYFERIGYKNSRNKL are obtained likewise, giving the substitution matrix of the entire sequence shown in the following table. Each row represents one amino acid, so the 21 amino acids are represented by a 21 × 20 matrix.

TABLE 2 Protein sequence substitution matrix

M	-1	-1	-1	-2	-1	-3	-2	-3	-2	0	-2	-1	-1	5	1	2	-2	0	-1	-1
N	-3	1	0	-2	-2	0	6	1	0	0	-1	0	0	-2	-3	-3	-3	-3	-2	-4
E	-4	0	0	-1	-1	-2	0	2	5	2	0	0	1	-2	-3	-3	-3	-3	-2	-3
D	-3	0	1	-1	-2	-1	1	6	2	0	-1	-2	-1	-3	-3	-4	-3	-3	-3	-4
I	-1	-2	-2	-3	-1	-4	-3	-3	-3	-3	-3	-3	-3	1	4	2	1	0	-1	-3
E	-4	0	0	-1	-1	-2	0	2	5	2	0	0	1	-2	-3	-3	-3	-3	-2	-3
A	0	1	-1	-1	4	0	-1	-2	-1	-1	-2	-1	-1	-1	-1	-1	-2	-2	-2	-3
Y	-2	-2	-2	-3	-2	-3	-2	-3	-2	-1	2	-2	-2	-1	-1	-1	-1	3	7	2
F	-2	-2	-2	-4	-2	-3	-3	-3	-3	-3	-1	-3	-3	0	0	0	-1	6	3	1
E	-4	0	0	-1	-1	-2	0	2	5	2	0	0	1	-2	-3	-3	-3	-3	-2	-3
R	-3	-1	-1	-2	-1	-2	0	-2	0	1	0	5	2	-1	-3	-2	-3	-3	-2	-3
I	-1	-2	-2	-3	-1	-4	-3	-3	-3	-3	-3	-3	-3	1	4	2	1	0	-1	-3
G	-3	0	1	-2	0	6	-2	-1	-2	-2	-2	-2	-2	-3	-4	-4	0	-3	-3	-2
Y	-2	-2	-2	-3	-2	-3	-2	-3	-2	-1	2	-2	-2	-1	-1	-1	-1	3	7	2
K	-3	0	0	-1	-1	-2	0	-1	1	1	-1	2	5	-1	-3	-2	-3	-3	-2	-3
N	-3	1	0	-2	-2	0	6	1	0	0	-1	0	0	-2	-3	-3	-3	-3	-2	-4
S	-1	4	1	-1	1	0	1	0	0	0	-1	-1	0	-1	-2	-2	-2	-2	-2	-3
R	-3	-1	-1	-2	-1	-2	0	-2	0	1	0	5	2	-1	-3	-2	-3	-3	-2	-3
N	-3	1	0	-2	-2	0	6	1	0	0	-1	0	0	-2	-3	-3	-3	-3	-2	-4
K	-3	0	0	-1	-1	-2	0	-1	1	1	-1	2	5	-1	-3	-2	-3	-3	-2	-3
L	-1	-2	-2	-3	-1	-4	-3	-4	-3	-2	-3	-2	-2	2	2	4	3	0	-1	-2
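
As an illustration only, the row-by-row lookup behind the substitution matrix can be sketched in Python. The two rows below are copied from Table 2 above (the rows for M and N). The names `BLOSUM_ROWS` and `substitution_matrix` are hypothetical, and a complete lookup would need all 20 BLOSUM62 rows in the same column order.

```python
from typing import Dict, List

# Substitution rows copied from Table 2 of this document (residues M and N only).
# Assumption: a real lookup table would hold one such row for each of the 20 amino acids.
BLOSUM_ROWS: Dict[str, List[int]] = {
    "M": [-1, -1, -1, -2, -1, -3, -2, -3, -2, 0, -2, -1, -1, 5, 1, 2, -2, 0, -1, -1],
    "N": [-3, 1, 0, -2, -2, 0, 6, 1, 0, 0, -1, 0, 0, -2, -3, -3, -3, -3, -2, -4],
}


def substitution_matrix(sequence: str, rows: Dict[str, List[int]]) -> List[List[int]]:
    """Return the N x 20 matrix SMR with SMR[i][j] = B(P(i), j)."""
    # Each residue P(i) selects its 20-entry substitution row.
    return [list(rows[residue]) for residue in sequence]


# Toy sequence restricted to the two residues available in BLOSUM_ROWS.
smr = substitution_matrix("MNM", BLOSUM_ROWS)
```

A sequence of length N yields N rows of 20 entries each, so repeated residues simply repeat the same row.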

C. Discrete cosine transform

The discrete cosine transform (DCT) is a transform defined on a signal that yields the signal's representation in the frequency domain. The DCT has an important property, energy concentration: after the transform, most of the energy of the signal is concentrated in the low-frequency part. The DCT is therefore widely used in information transformation.

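The discrete cosine transform (DCT) is calculated as follows:

$$DCT(i, j) = k_i k_j \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} Sig(m, n) \cos\frac{\pi (2m+1) i}{2M} \cdot \cos\frac{\pi (2n+1) j}{2N}, \quad 0 \le i \le M-1, \; 0 \le j \le N-1$$

wherein k_i and k_j are normalization coefficients.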

Different proteins may contain amino acid sequences of different lengths. To obtain feature vectors of uniform size for different proteins, the discrete cosine transform is applied to the substitution matrix, where Sig ∈ R^{M×N} is the input signal matrix, i.e. the substitution matrix above. Because the information after the DCT is concentrated in the upper-left corner of the matrix, the first 20 × 20 block, i.e. the first 400 coefficients of the DCT result, is selected as the protein feature matrix. The feature matrix of the protein sequence "MNEDIEAYFERIGYKNSRNKL" obtained using this method is shown in the following table.

TABLE 3 Protein sequence feature matrix

Traversing the feature matrix row by row gives the feature vector of the protein: -20.84, -0.33, ..., -1.13, 2.11, ..., -6.47, -8.70, ..., 3.79, ..., -0.07.
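
The transform and truncation described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation. The definition of k_i and k_j is not reproduced in this text, so the sketch assumes the common orthonormal DCT-II scaling (the square root of 1/M for index 0 and of 2/M otherwise). The function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def dct_1d(x: List[float]) -> List[float]:
    """One-dimensional DCT-II of a sequence with orthonormal scaling (assumed)."""
    n = len(x)
    out = []
    for k in range(n):
        # Assumed normalization coefficient: sqrt(1/n) for k = 0, sqrt(2/n) otherwise.
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        total = sum(x[t] * math.cos(math.pi * (2 * t + 1) * k / (2 * n)) for t in range(n))
        out.append(scale * total)
    return out


def dct_2d(sig: Matrix) -> Matrix:
    """Two-dimensional DCT: transform every row, then every column."""
    rows = [dct_1d(list(r)) for r in sig]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    # cols is indexed [column][row]; transpose back to the input layout.
    return [list(r) for r in zip(*cols)]


def feature_vector(smr: Matrix, size: int = 20) -> List[float]:
    """Keep the upper-left size x size block of DCT coefficients, read row by row."""
    coeffs = dct_2d(smr)
    if len(coeffs) < size or len(coeffs[0]) < size:
        raise ValueError("matrix is smaller than the requested block")
    return [coeffs[i][j] for i in range(size) for j in range(size)]
```

With the default `size=20`, any substitution matrix with at least 20 rows gives a 400-element vector regardless of the protein length, which is what makes the features uniform across proteins.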

D. Establishing the random projection ensemble model

Random projection (RP) is a very effective dimension-reduction technique that uses a random projection matrix to project high-dimensional data into a low-dimensional subspace. Ensemble learning combines several models to obtain a better result: the combined model generalizes better, and an ensemble algorithm usually outperforms a single classifier. The random projection ensemble algorithm combines the two: the high-dimensional data are projected several times to obtain low-dimensional data, and a base classifier is trained on each low-dimensional data set and used for classification. In the RP algorithm, the original d-dimensional data are projected onto a k-dimensional subspace (k ≤ d) by a d × k random matrix A.

An n × d original matrix X_{n×d} is selected, then a d × k random matrix A is drawn, and the mapped low-dimensional n × k matrix, denoted here X^{RP}_{n×k}, is obtained. The random projection can be expressed as:

$$X^{RP}_{n \times k} = X_{n \times d} A_{d \times k}$$

The random projection ensemble algorithm is illustrated with a specific case. The training data set is initialized as a 50 × 100 matrix with a label vector, and the test data is a 100 × 100 matrix. The training set is projected, and an error estimation method is set during projection so that the projected data block expresses the original information with minimum error; in this example the leave-one-out (LOO) method is selected. The projection matrix is set to 100 × 10, which gives a 50 × 10 low-dimensional training matrix and a 100 × 10 low-dimensional test matrix. Ten projections are carried out to obtain several low-dimensional matrices, and a base classifier, such as the K nearest neighbor (KNN) classifier, classifies each sample on the basis of the training data and the label vector. Since the matrix is projected 10 times, each sample in the training set and the test set receives 10 classification results.

A label threshold is then set according to the classification results of the training set and the label vector: if the average of a sample's labels exceeds the threshold, the sample belongs to label 2; otherwise it belongs to label 1. In this case, the average of the 10 classification results of each sample is calculated from the 50 × 10 classification results of the 50 training samples. The averages of samples with label 2 are mostly greater than 1.6 and those of samples with label 1 are mostly less than 1.6, so the threshold is 1.6. That is, if the average of the 10 classification results of a test sample is greater than 1.6, the sample has label 2; otherwise it has label 1.

TABLE 4 Part of the 50 × 100 training data set

TABLE 5 Training data set labels

TABLE 6 Part of the 100 × 100 test data set

TABLE 7 Classification results after projection

TABLE 8 Final classification results
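
The worked case above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the projection matrices are drawn with Gaussian entries (an assumption), the leave-one-out selection of projections is omitted, and the label threshold is passed in as a parameter rather than derived from the training-set results. All function names are hypothetical.

```python
import random
from typing import List, Sequence

Vector = List[float]


def random_matrix(d: int, k: int, rng: random.Random) -> List[Vector]:
    """Draw a d x k projection matrix A with independent Gaussian entries (assumed)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(d)]


def project(X: Sequence[Sequence[float]], A: List[Vector]) -> List[Vector]:
    """Return the n x k product X A."""
    d, k = len(A), len(A[0])
    return [[sum(row[t] * A[t][c] for t in range(d)) for c in range(k)] for row in X]


def knn_label(train: List[Vector], labels: Sequence[int],
              query: Vector, n_neighbors: int) -> int:
    """Majority vote (labels 1 or 2) among the nearest training samples."""
    def sq_dist(i: int) -> float:
        return sum((a - b) ** 2 for a, b in zip(train[i], query))

    nearest = sorted(range(len(train)), key=sq_dist)[:n_neighbors]
    votes = [labels[i] for i in nearest]
    return 2 if votes.count(2) > votes.count(1) else 1


def rp_ensemble_predict(X_train, y_train, X_test, k=10, n_projections=10,
                        threshold=1.6, n_neighbors=3, seed=0) -> List[int]:
    """Average the KNN labels over several random projections, then threshold."""
    rng = random.Random(seed)
    d = len(X_train[0])
    totals = [0] * len(X_test)
    for _ in range(n_projections):
        A = random_matrix(d, k, rng)
        low_train = project(X_train, A)   # n_train x k
        low_test = project(X_test, A)     # n_test x k
        for idx, query in enumerate(low_test):
            totals[idx] += knn_label(low_train, y_train, query, n_neighbors)
    # A sample gets label 2 only if its mean label exceeds the threshold.
    return [2 if total / n_projections > threshold else 1 for total in totals]
```

With 10 projections and a threshold of 1.6, a test sample receives label 2 only when at least 7 of its 10 votes are label 2, which mirrors the averaging rule in the case description.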

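E. Model evaluation

The proposed method is evaluated using the following parameters: accuracy Acc, sensitivity Sen, positive predictive value PE and Matthews correlation coefficient MCC:

$$Acc = \frac{TP + TN}{TP + FP + TN + FN}$$

$$Sen = \frac{TP}{TP + FN}$$

$$PE = \frac{TP}{TP + FP}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN) \times (TN + FP) \times (TP + FP) \times (TN + FN)}}$$

where TP, FP, FN and TN denote the numbers of true positive, false positive, false negative and true negative samples, respectively, as shown in Table 9.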
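
A minimal Python sketch of the four measures is given below, assuming the standard confusion-matrix definitions of accuracy, sensitivity and positive predictive value together with the MCC formula stated in the claims. The function name `evaluate` and the example counts are hypothetical.

```python
import math
from typing import Dict


def evaluate(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute Acc, Sen, PE and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sen = tp / (tp + fn)
    pe = tp / (tp + fp)
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # Convention assumed here: report MCC as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sen": sen, "PE": pe, "MCC": mcc}


# Illustrative counts only, not results from this document.
scores = evaluate(tp=90, fp=10, tn=80, fn=20)
```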

Experiments were performed on the three data sets using the proposed method. To avoid overfitting and to ensure stability, five-fold cross-validation was used: each data set was divided into five parts, four of which were used as training samples and the remaining one as test samples. Five sets of experiments were performed per data set, with the following results:

TABLE 10 Results on the yeast data set under five-fold cross-validation

TABLE 11 Results on the human data set under five-fold cross-validation

TABLE 12 Results on the H. pylori data set under five-fold cross-validation
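
The five-fold split described above can be sketched in Python as follows. This is an illustration only: the shuffling, the seed and the function name `five_fold_splits` are assumptions, and the document does not state how the samples were assigned to the five parts.

```python
import random
from typing import Iterator, List, Tuple


def five_fold_splits(n_samples: int, n_folds: int = 5,
                     seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the n_folds experiments."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds parts of nearly equal size.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 11188 is the size of the yeast data set given in this document.
splits = list(five_fold_splits(11188))
```

Each sample appears in exactly one test part, so the five experiments together test every protein pair once.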

Comparison with SVM classifier

To evaluate the proposed method, we compared it with the mainstream support vector machine (SVM) classifier, using the LIBSVM toolbox. The parameters c and g were optimized by grid search: c = 0.5 and g = 0.6 on the yeast data, and c = 0.5 and g = 0.5 on the human data. In the Helicobacter pylori experiment, the RBF kernel function was used with c set to 0.08 and g set to 22. The experimental results are given below, where RPEC denotes the method proposed herein, the random projection ensemble classification algorithm. The comparison shows that the proposed method performs better than the SVM method.

TABLE 13 Results of the comparison with SVM

Comparison with other methods

We collected the results of other prediction methods and compared them with the proposed method, as shown in Tables 14 and 15. Taking yeast protein prediction as an example, the accuracy of the other methods generally lies between 86.15% and 94.72%. In Table 15, other ensemble algorithms are compared with the proposed ensemble algorithm, and the experimental data show that the proposed method is superior on most of the reference indexes. The abbreviations of the conventional methods in the tables are as follows: AC (Auto Covariance), auto-covariance feature extraction method; RoF (Rotation Forest), rotation forest classifier; LDA (Linear Discriminant Analysis), linear discriminant analysis method; RF (Random Forest), random forest classifier; LD (Local Descriptors), local protein sequence descriptor method; ACC (Auto Cross Covariance), auto cross covariance feature extraction method; KNN (K Nearest Neighbor), K nearest neighbor classification algorithm; PR-LPQ (Physicochemical Property Response matrix combined with Local Phase Quantization descriptor); MAC, auto-covariance transformation feature extraction method; MCD (Multi-scale Continuous and Discontinuous), multi-scale continuous and discontinuous description method; MLD (Multi-scale Local Descriptor), multi-scale local feature description method; CT (Cosine Transform), cosine transform.

TABLE 14 comparison of different prediction methods for yeast proteins

TABLE 15 comparison of different prediction methods for human proteins

Claims

1. A method for measuring protein-protein interactions based on random projection ensemble classification, characterized by comprising the following steps:

A. protein data screening

Screening protein interaction pairs from a protein database DIP;

B. substitution matrix representation

C. discrete cosine transform

The discrete cosine transform (DCT) formula is as follows:

$$DCT(i, j) = k_i k_j \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} Sig(m, n) \cos\frac{\pi (2m+1) i}{2M} \cdot \cos\frac{\pi (2n+1) j}{2N}, \quad 0 \le i \le M-1, \; 0 \le j \le N-1$$

wherein k_i and k_j are normalization coefficients.

D. establishing the random projection ensemble model

2. The method for measuring protein-protein interactions based on random projection ensemble classification according to claim 1, further comprising step E, model evaluation:

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN) \times (TN + FP) \times (TP + FP) \times (TN + FN)}}$$

3. The method according to claim 1, wherein in step A protein interaction pairs are screened from the protein database DIP, proteins with fewer than 50 residues are removed, protein pairs with more than 40% sequence identity are removed, and the remaining protein pairs are retained for use.

4. The method according to claim 3, wherein the remaining protein pairs are used as the positive data set, and the same number of protein pairs are selected from different subcellular locations, following step A, as the negative data set.