CN107607723A - A kind of protein-protein interaction assay method based on accidental projection Ensemble classifier - Google Patents
- Publication number: CN107607723A
- Application number: CN201710653339.5A
- Authority: CN (China)
- Prior art keywords: protein, matrix, protein interaction
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Investigating Or Analysing Biological Materials (AREA)
Abstract
The present invention relates to the field of biology, specifically to a method for determining protein-protein interactions based on random projection ensemble classification. The method comprises the following steps: A. protein data screening; B. substitution matrix representation; C. discrete cosine transform; D. establishing a random projection ensemble model; and E. model evaluation. The invention obtains a prediction model of protein interactions from the classification features of protein interactions and uses that model to detect disease-related protein interactions, which addresses the problem that the association between protein interactions and disease could not previously be predicted. The invention is applicable to verifying and screening animal and cell proteins. Representing sequences with a substitution matrix and applying the discrete cosine transform before establishing the random projection ensemble model facilitates later analysis and gives a more accurate representation; compared with existing conventional methods, the accuracy, sensitivity, positive predictive value and Matthews correlation coefficient are better and more stable.
Description
Technical Field
The invention relates to the field of biology, in particular to a method for determining protein-protein interactions based on random projection ensemble classification.
Background
Protein-protein interactions (PPIs) arise from interactions in time and space, form the basis of protein function, and are key to studying the life activities of cells. Although the prior art discloses methods for obtaining PPI data from organisms by biotechnological means, these methods suffer from low efficiency, high cost and a high false-positive rate, and clearly do not meet current technical needs. A method for determining PPIs with high efficiency and low cost is therefore urgently needed.
Disclosure of Invention
The invention addresses the defects of the prior art and provides a method for determining protein-protein interactions based on random projection ensemble classification that has high efficiency, low cost and a high positive rate.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A method for determining protein-protein interactions based on random projection ensemble classification comprises the following steps:
A. Protein data screening
Screening protein interaction pairs from the protein database DIP;
B. Substitution matrix representation
Using the BLOSUM62 matrix, a protein sequence of length N produces an N × 20 matrix, the substitution matrix representation (SMR), expressed as:
$$SMR(i, j) = B(P(i), j), \quad i = 1 \ldots N, \quad j = 1 \ldots 20,$$
where P is the protein sequence of N amino acids, P(i) is the amino acid at position i, and B(P(i), j) is the BLOSUM62 entry reflecting how likely the amino acid at position i is to mutate to amino acid j;
C. Discrete cosine transform
The discrete cosine transform (DCT) formula is as follows:
$$DCT(i, j) = k_i k_j \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} Sig(m, n) \cos\frac{\pi (2m+1) i}{2M} \cdot \cos\frac{\pi (2n+1) j}{2N}, \quad 0 \le i \le M-1, \quad 0 \le j \le N-1$$
where $k_i$ and $k_j$ are normalization coefficients and $Sig$ is the input signal matrix;
D. Establishing the random projection ensemble model
Select an n × d original matrix $X_{n \times d}$ and obtain the mapped low-dimensional matrix $X^{RP}_{n \times k}$. The random projection can be expressed as:
$$X^{RP}_{n \times k} = X_{n \times d} A_{d \times k}$$
where A is a d × k random matrix and k ≤ d.
The method further comprises E. Model evaluation:
Determining the accuracy (Acc), the sensitivity (Sen), the positive predictive value (PE) and the Matthews correlation coefficient (MCC);
In step A, protein interaction pairs are screened from the protein database DIP: first, proteins of fewer than 50 residues are removed; second, protein pairs with greater than 40% sequence identity are removed; the remaining protein pairs are retained for use.
The remaining protein pairs form the positive data set, and, following step A, the same number of protein pairs drawn from different subcellular locations is selected as the negative data set.
The invention has the beneficial effects that:
The invention obtains a prediction model of protein interactions from the classification features of protein interactions and uses that model to detect disease-related protein interactions, which addresses the problem that the association between protein interactions and disease could not be predicted. The method is suitable for screening protein pairs from animals and microbial strains. Establishing the random projection ensemble model from the substitution matrix representation and the discrete cosine transform facilitates later analysis and gives a more accurate representation; compared with conventional methods, the accuracy, sensitivity, positive predictive value and Matthews correlation coefficient are higher and more stable.
Drawings
FIG. 1 is a flow chart of the measurement of the present invention.
Detailed Description
A method for determining protein-protein interactions based on random projection ensemble classification, characterized by comprising the following steps:
A. Protein data screening
Screening protein interaction pairs from the protein database DIP;
B. Substitution matrix representation
Using the BLOSUM62 matrix, a protein sequence of length N produces an N × 20 matrix, the substitution matrix representation (SMR), expressed as:
$$SMR(i, j) = B(P(i), j), \quad i = 1 \ldots N, \quad j = 1 \ldots 20,$$
where P is the protein sequence of N amino acids, P(i) is the amino acid at position i, and B(P(i), j) is the BLOSUM62 entry reflecting how likely the amino acid at position i is to mutate to amino acid j;
C. Discrete cosine transform
The discrete cosine transform (DCT) formula is as follows:
$$DCT(i, j) = k_i k_j \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} Sig(m, n) \cos\frac{\pi (2m+1) i}{2M} \cdot \cos\frac{\pi (2n+1) j}{2N}, \quad 0 \le i \le M-1, \quad 0 \le j \le N-1$$
where $k_i$ and $k_j$ are normalization coefficients and $Sig$ is the input signal matrix;
D. Establishing the random projection ensemble model
Select an n × d original matrix $X_{n \times d}$ and obtain the mapped low-dimensional matrix $X^{RP}_{n \times k}$. The random projection can be expressed as:
$$X^{RP}_{n \times k} = X_{n \times d} A_{d \times k}$$
where A is a d × k random matrix and k ≤ d.
E. Model evaluation
Determining the accuracy (Acc), the sensitivity (Sen), the positive predictive value (PE) and the Matthews correlation coefficient (MCC);
In step A, protein interaction pairs are screened from the protein database DIP: first, proteins of fewer than 50 residues are removed; second, protein pairs with greater than 40% sequence identity are removed; the remaining protein pairs are retained for use. The remaining protein pairs form the positive data set, and, following step A, the same number of protein pairs drawn from different subcellular locations is selected as the negative data set.
Selecting specific protein for determination, wherein the specific scheme is as follows:
To construct a standard data set, the 5594 selected protein pairs were used as the positive data set, and a corresponding 5594 non-interacting protein pairs were selected from different subcellular locations to construct the negative data set. The data set used for the experiment therefore consisted of 11188 protein pairs, 50% positive samples and 50% negative samples.
To verify whether the method is applicable to other types of protein pairs, two further data sets were constructed. The first is a human protein data set collected from the Human Protein Reference Database (HPRD). Human protein pairs with more than 25% sequence identity were removed, which left 3899 interacting protein pairs that were used as the positive data set. On the principle that proteins from different subcellular locations in an organism cannot interact with each other, 4262 non-interacting human protein pairs from different subcellular locations were selected as the negative data set. The final human data set therefore consisted of 8161 protein pairs. The second data set consisted of the 2916 Helicobacter pylori protein pairs described by Martin et al., comprising 1458 interacting pairs and 1458 non-interacting pairs.
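The negative-set construction described above can be sketched as follows. This is only an illustration under stated assumptions: the source does not say how non-interacting pairs are drawn, so uniform random sampling is assumed here, and the function name `sample_negative_pairs`, the `locations` mapping and the toy protein identifiers are all hypothetical.

```python
import random

def sample_negative_pairs(locations, positive_pairs, count, seed=0):
    """Draw `count` distinct protein pairs whose members have different locations.

    locations: dict mapping a protein id to its subcellular location (hypothetical input).
    positive_pairs: known interacting pairs, which are never returned.
    """
    rng = random.Random(seed)
    proteins = sorted(locations)
    known = {frozenset(pair) for pair in positive_pairs}
    # Enumerating every pair is quadratic; acceptable for a sketch only.
    candidates = [
        (a, b)
        for idx, a in enumerate(proteins)
        for b in proteins[idx + 1:]
        if locations[a] != locations[b] and frozenset((a, b)) not in known
    ]
    if count > len(candidates):
        raise ValueError("not enough candidate pairs")
    return rng.sample(candidates, count)

# Toy data, invented for illustration only.
locations = {"P1": "nucleus", "P2": "nucleus", "P3": "cytoplasm", "P4": "membrane"}
positives = [("P1", "P3")]
negatives = sample_negative_pairs(locations, positives, count=3)
print(negatives)
```

Sampling the same number of negatives as positives, as the text does for the 5594-pair set, keeps the two classes balanced.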
B. Substitution matrix representation
Using the BLOSUM62 matrix, a protein sequence of length N produces an N × 20 matrix, the substitution matrix representation (SMR), expressed as:
$$SMR(i, j) = B(P(i), j), \quad i = 1 \ldots N, \quad j = 1 \ldots 20,$$
where P is the protein sequence of N amino acids, P(i) is the amino acid at position i, and B(P(i), j) is the BLOSUM62 entry reflecting how likely the amino acid at position i is to mutate to amino acid j;
The substitution matrix method based on the BLOSUM62 matrix is illustrated using the protein sequence "MNEDIEAYFERIGYKNSRNKL" as an example. Table 1 is the BLOSUM62 matrix.
Table 1. BLOSUM62 matrix
Based on the above table, amino acid M in the sequence is replaced by its BLOSUM62 row, "-1 -1 -1 -2 -1 -3 -2 -3 -2 0 -2 -1 -1 5 1 2 -2 0 -1 -1", and amino acid N by "-3 1 0 -2 -2 0 6 1 0 0 -1 0 0 -2 -3 -3 -3 -3 -2 -4". The same approach gives the substitution vectors for the remaining residues EDIEAYFERIGYKNSRNKL and thus the substitution matrix of the entire sequence, shown in the following table, where each row represents one amino acid; the 21 amino acids are represented as a 21 × 20 matrix.
Table 2. Substitution matrix of the protein sequence
-1 | -1 | -1 | -2 | -1 | -3 | -2 | -3 | -2 | 0 | -2 | -1 | -1 | 5 | 1 | 2 | -2 | 0 | -1 | -1 |
-3 | 1 | 0 | -2 | -2 | 0 | 6 | 1 | 0 | 0 | -1 | 0 | 0 | -2 | -3 | -3 | -3 | -3 | -2 | -4 |
-4 | 0 | 0 | -1 | -1 | -2 | 0 | 2 | 5 | 2 | 0 | 0 | 1 | -2 | -3 | -3 | -3 | -3 | -2 | -3 |
-3 | 0 | 1 | -1 | -2 | -1 | 1 | 6 | 2 | 0 | -1 | -2 | -1 | -3 | -3 | -4 | -3 | -3 | -3 | -4 |
-1 | -2 | -2 | -3 | -1 | -4 | -3 | -3 | -3 | -3 | -3 | -3 | -3 | 1 | 4 | 2 | 1 | 0 | -1 | -3 |
-4 | 0 | 0 | -1 | -1 | -2 | 0 | 2 | 5 | 2 | 0 | 0 | 1 | -2 | -3 | -3 | -3 | -3 | -2 | -3 |
0 | 1 | -1 | -1 | 4 | 0 | -1 | -2 | -1 | -1 | -2 | -1 | -1 | -1 | -1 | -1 | -2 | -2 | -2 | -3 |
-2 | -2 | -2 | -3 | -2 | -3 | -2 | -3 | -2 | -1 | 2 | -2 | -2 | -1 | -1 | -1 | -1 | 3 | 7 | 2 |
-2 | -2 | -2 | -4 | -2 | -3 | -3 | -3 | -3 | -3 | -1 | -3 | -3 | 0 | 0 | 0 | -1 | 6 | 3 | 1 |
-4 | 0 | 0 | -1 | -1 | -2 | 0 | 2 | 5 | 2 | 0 | 0 | 1 | -2 | -3 | -3 | -3 | -3 | -2 | -3 |
-3 | -1 | -1 | -2 | -1 | -2 | 0 | -2 | 0 | 1 | 0 | 5 | 2 | -1 | -3 | -2 | -3 | -3 | -2 | -3 |
-1 | -2 | -2 | -3 | -1 | -4 | -3 | -3 | -3 | -3 | -3 | -3 | -3 | 1 | 4 | 2 | 1 | 0 | -1 | -3 |
-3 | 0 | 1 | -2 | 0 | 6 | -2 | -1 | -2 | -2 | -2 | -2 | -2 | -3 | -4 | -4 | 0 | -3 | -3 | -2 |
-2 | -2 | -2 | -3 | -2 | -3 | -2 | -3 | -2 | -1 | 2 | -2 | -2 | -1 | -1 | -1 | -1 | 3 | 7 | 2 |
-3 | 0 | 0 | -1 | -1 | -2 | 0 | -1 | 1 | 1 | -1 | 2 | 5 | -1 | -3 | -2 | -3 | -3 | -2 | -3 |
-3 | 1 | 0 | -2 | -2 | 0 | 6 | 1 | 0 | 0 | -1 | 0 | 0 | -2 | -3 | -3 | -3 | -3 | -2 | -4 |
-1 | 4 | 1 | -1 | 1 | 0 | 1 | 0 | 0 | 0 | -1 | -1 | 0 | -1 | -2 | -2 | -2 | -2 | -2 | -3 |
-3 | -1 | -1 | -2 | -1 | -2 | 0 | -2 | 0 | 1 | 0 | 5 | 2 | -1 | -3 | -2 | -3 | -3 | -2 | -3 |
-3 | 1 | 0 | -2 | -2 | 0 | 6 | 1 | 0 | 0 | -1 | 0 | 0 | -2 | -3 | -3 | -3 | -3 | -2 | -4 |
-3 | 0 | 0 | -1 | -1 | -2 | 0 | -1 | 1 | 1 | -1 | 2 | 5 | -1 | -3 | -2 | -3 | -3 | -2 | -3 |
-1 | -2 | -2 | -3 | -1 | -4 | -3 | -4 | -3 | -2 | -3 | -2 | -2 | 2 | 2 | 4 | 3 | 0 | -1 | -2 |
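The row lookup behind Table 2 can be sketched in a few lines. The two rows below are copied from the first two lines of Table 2 (residues M and N); a full run would need all 20 BLOSUM62 rows in the same column order as Table 1, which is not reproduced in this text. The names `BLOSUM62_ROWS` and `substitution_matrix_representation` are illustrative, not taken from the source.

```python
# Rows for M and N, copied from the first two lines of Table 2.
# The column order is whatever Table 1 uses; it is not restated here.
BLOSUM62_ROWS = {
    "M": [-1, -1, -1, -2, -1, -3, -2, -3, -2, 0, -2, -1, -1, 5, 1, 2, -2, 0, -1, -1],
    "N": [-3, 1, 0, -2, -2, 0, 6, 1, 0, 0, -1, 0, 0, -2, -3, -3, -3, -3, -2, -4],
}

def substitution_matrix_representation(sequence, rows):
    """Return the N x 20 SMR: row i is the substitution row of residue P(i)."""
    missing = sorted(set(sequence) - set(rows))
    if missing:
        raise KeyError(f"no substitution row for residues: {missing}")
    return [list(rows[residue]) for residue in sequence]

# "MNM" is a made-up three-residue sequence that only uses the two rows above.
smr = substitution_matrix_representation("MNM", BLOSUM62_ROWS)
print(len(smr), len(smr[0]))  # 3 20
```

Each residue simply selects one 20-element row, so the result always has as many rows as the sequence has residues.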
C. Discrete cosine transform
The discrete cosine transform (DCT) is a transform defined on a signal whose result is a representation of the signal in the frequency domain. The DCT has a very important property, energy concentration: after the transform, most of the signal's energy is concentrated in the low-frequency part. The DCT is therefore widely used in information transformation.
The DCT is calculated as follows:
$$DCT(i, j) = k_i k_j \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} Sig(m, n) \cos\frac{\pi (2m+1) i}{2M} \cdot \cos\frac{\pi (2n+1) j}{2N}, \quad 0 \le i \le M-1, \quad 0 \le j \le N-1$$
where $k_i$ and $k_j$ are normalization coefficients.
Different proteins may contain amino acid sequences of different lengths. To obtain feature vectors of uniform size for different proteins, the discrete cosine transform is applied to the substitution matrix, with $Sig \in \mathbb{R}^{N \times M}$ denoting the input signal matrix, i.e. the substitution matrix above. Because the information after the DCT is concentrated in the upper-left corner of the result, the first 20 × 20 block, i.e. the first 400 coefficients of the DCT result, is selected as the protein feature matrix. The feature matrix of the protein sequence "MNEDIEAYFERIGYKNSRNKL" obtained with this method is shown in the following table.
Table 3 is a protein sequence feature matrix
Traversing the feature matrix row by row gives the feature vector of the protein: -20.84, -0.33, ..., -1.13, 2.11, ..., -6.47, -8.70, ..., 3.79, ..., -0.07.
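The feature extraction in this step can be sketched with a direct, unoptimized evaluation of the DCT sum. The scaling used below, $k_0 = \sqrt{1/M}$ and $k_i = \sqrt{2/M}$ for $i > 0$ (and likewise with N for $k_j$), is the common orthonormal DCT-II convention and is an assumption, since the text does not define $k_i$ and $k_j$. The function names are illustrative.

```python
import math

def dct2(sig, rows=None, cols=None):
    """First rows x cols coefficients of the 2D DCT-II of an M x N matrix."""
    M, N = len(sig), len(sig[0])
    rows = M if rows is None else min(rows, M)
    cols = N if cols is None else min(cols, N)

    def k(index, size):
        # Assumed orthonormal scaling; the source leaves k_i and k_j undefined.
        return math.sqrt((1.0 if index == 0 else 2.0) / size)

    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = 0.0
            for m in range(M):
                cos_m = math.cos(math.pi * (2 * m + 1) * i / (2 * M))
                for n in range(N):
                    cos_n = math.cos(math.pi * (2 * n + 1) * j / (2 * N))
                    total += sig[m][n] * cos_m * cos_n
            out[i][j] = k(i, M) * k(j, N) * total
    return out

def dct_features(smr, size=20):
    """Flatten the top-left size x size block of the DCT row by row."""
    if len(smr) < size or len(smr[0]) < size:
        raise ValueError("matrix is smaller than the requested block")
    block = dct2(smr, size, size)
    return [value for row in block for value in row]

# A constant 21 x 20 matrix stands in for a real substitution matrix.
features = dct_features([[1.0] * 20 for _ in range(21)])
print(len(features))  # 400
```

Computing only the retained 20 × 20 block keeps the cost proportional to the sequence length; a library FFT-based DCT would be the usual choice for long sequences.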
D. Establishing random projection integrated model
Random projection (RP) is a very effective dimensionality-reduction technique that uses a random projection matrix to project high-dimensional data into a low-dimensional subspace. Ensemble learning combines multiple models to obtain a better result; the combined model generalizes better, and an ensemble algorithm usually outperforms a single classifier. The random projection ensemble algorithm combines the two: the high-dimensional data are projected several times to obtain low-dimensional data, and a base classifier is trained on and classifies each low-dimensional data set. In the RP algorithm, the original d-dimensional data are projected into a k-dimensional subspace (k ≤ d) by a d × k random matrix A.
Select an n × d original matrix $X_{n \times d}$ and a d × k random matrix $A_{d \times k}$ to obtain the mapped low-dimensional matrix $X^{RP}_{n \times k}$. The random projection can be expressed as:
$$X^{RP}_{n \times k} = X_{n \times d} A_{d \times k}$$
The random projection ensemble algorithm is illustrated with a specific case. The training data set is initialized as a 50 × 100 matrix with a label vector, and the test data is a 100 × 100 matrix. The training set is projected, and an error estimation method is set during projection so that the projected data express the original information with minimum error; in this example, Leave-One-Out (LOO) is chosen as the error estimation method. With a 100 × 10 projection matrix, a 50 × 10 low-dimensional training matrix and a 100 × 10 low-dimensional test matrix are obtained.
Ten projections are carried out to obtain multiple low-dimensional matrices, and a base classifier, such as a K-nearest-neighbor (KNN) classifier, classifies each sample on the basis of the training data and the label vector. After 10 projections, each sample in the training set and the test set has 10 classification results. A label threshold is set according to the classification results of the training set and the label vector: if the average of a sample's labels exceeds the threshold, the sample belongs to label 2; otherwise it belongs to label 1.
In this case, the average of the 10 classification results of each sample is calculated from the 50 × 10 classification results of the 50 training samples. The averages for label 2 samples are mostly greater than 1.6 and those for label 1 samples are mostly less than 1.6, so the threshold is 1.6: if the average of the 10 classification results of a test sample is greater than 1.6, the sample is label 2; otherwise it is label 1.
Table 4 is a partial 50 × 100 training data set matrix
TABLE 5 training data set labels
Table 6 is a partial 100 by 100 matrix test data set
Table 7 shows the classification results after projection
Table 8 shows the final classification results
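The worked case above (10 projections to 10 dimensions, one vote per projection, label threshold 1.6) can be sketched as follows. Several details are assumptions rather than the patent's procedure: the projection entries are drawn from a standard Gaussian, the base classifier is a 1-nearest-neighbour rule, the Leave-One-Out selection of projections is omitted, and the threshold is passed in instead of being derived from the training votes. All names and the toy data are illustrative.

```python
import math
import random

def project(data, matrix):
    """Multiply an n x d data set by a d x k projection matrix."""
    k = len(matrix[0])
    return [[sum(x * row[c] for x, row in zip(point, matrix)) for c in range(k)]
            for point in data]

def nearest_label(point, train, labels):
    """1-nearest-neighbour base classifier (a stand-in for the KNN mentioned above)."""
    distances = [math.dist(point, other) for other in train]
    return labels[distances.index(min(distances))]

def rp_ensemble_predict(train, labels, test, k=10, rounds=10, threshold=1.6, seed=0):
    """Average the labels (1 or 2) voted over several random projections."""
    rng = random.Random(seed)
    d = len(train[0])
    votes = [[] for _ in test]
    for _ in range(rounds):
        # Gaussian entries are an assumption; the source does not fix the distribution.
        matrix = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(d)]
        low_train, low_test = project(train, matrix), project(test, matrix)
        for sample_votes, point in zip(votes, low_test):
            sample_votes.append(nearest_label(point, low_train, labels))
    means = [sum(v) / len(v) for v in votes]
    return [2 if mean > threshold else 1 for mean in means], means

# Toy 100-dimensional data, invented for illustration: two well-separated groups.
train = [[0.0] * 100, [0.2] * 100, [10.0] * 100, [11.0] * 100]
labels = [1, 1, 2, 2]
test = [[0.1] * 100, [10.5] * 100]
predicted, means = rp_ensemble_predict(train, labels, test)
print(predicted)  # [1, 2]
```

Averaging integer labels and thresholding is equivalent to a weighted majority vote; a threshold above 1.5 makes label 2 require more than a simple majority.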
E. Model determination
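The proposed method is evaluated with the following measures: accuracy (Acc), sensitivity (Sen), positive predictive value (PE) and Matthews correlation coefficient (MCC):
$$Acc = \frac{TP + TN}{TP + FP + TN + FN}$$
$$Sen = \frac{TP}{TP + FN}$$
$$PE = \frac{TP}{TP + FP}$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN) \times (TN + FP) \times (TP + FP) \times (TN + FN)}}$$
where TP, FP, FN and TN denote the numbers of true positive, false positive, false negative and true negative samples, respectively, as defined in Table 9.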
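The four measures can be computed directly from the confusion-matrix counts, as in the minimal sketch below. The counts in the usage line are made-up numbers for illustration, not results from the patent, and returning 0.0 for MCC when its denominator is zero is a convention chosen here.

```python
import math

def evaluate(tp, fp, tn, fn):
    """Return Acc, Sen, PE and MCC from the four confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sen = tp / (tp + fn)
    pe = tp / (tp + fp)
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # MCC is undefined when a row or column of the confusion matrix is empty;
    # 0.0 is returned in that case by choice.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sen": sen, "PE": pe, "MCC": mcc}

# Illustrative counts only.
print(evaluate(tp=40, fp=10, tn=45, fn=5))
```

Unlike accuracy, MCC uses all four counts symmetrically, which is why it is often reported alongside Acc.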
Experiments were performed on the 3 data sets using the proposed method. To avoid overfitting and ensure stability, five-fold cross-validation was used: each data set was divided into 5 parts, with 4/5 used as training samples and the remaining 1/5 as test samples. Five sets of experiments were performed per data set, with the following results:
TABLE 10 results of yeast data set experiments under five-fold cross-validation
TABLE 11 results of human data set experiments under five-fold cross-validation
TABLE 12 results of H. pylori data set experiments under five-fold cross-validation
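The five-fold protocol described above can be sketched as an index splitter. Shuffling before splitting and the fixed seed are assumptions; the text does not say how samples were assigned to folds or whether the split was stratified by class. The function name is illustrative.

```python
import random

def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) for each of n_folds rounds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Every n_folds-th index goes to the same fold, so fold sizes differ by at most 1.
    folds = [indices[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield train, folds[f]

# 11188 is the size of the yeast data set given earlier in the text.
for train, test in five_fold_splits(11188):
    print(len(train), len(test))
```

Each sample is used for testing exactly once, so the five reported results together cover the whole data set.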
Comparison with SVM classifier
To evaluate the proposed method, it was compared with a mainstream classifier, the support vector machine (SVM), using the LIBSVM toolbox. The parameters c and g were optimized by grid search: c = 0.5, g = 0.6 on the yeast data set and c = 0.5, g = 0.5 on the human data set. In the Helicobacter pylori experiment, the RBF kernel function was used with c set to 0.08 and g to 22. The experimental results are as follows, where RPEC denotes the method proposed herein, the random projection ensemble algorithm. The comparison shows that the proposed method performs better than the SVM method.
Table 13 shows the results of the SVM comparisons
Comparison with other methods
We collected the results of other prediction methods and compared them with the proposed method, as shown in Tables 14 and 15. Taking yeast protein prediction as an example, the accuracy of the other methods generally ranges from 86.15% to 94.72%. Table 15 compares other ensemble algorithms with the proposed ensemble algorithm, and the experimental data show that the proposed method is superior on most of the reference measures. The abbreviations of the conventional methods in the tables are as follows: AC (Auto Covariance), an auto-covariance transformation feature extraction method; RoF (Rotation Forest classifier); LDA (Linear Discriminant Analysis); RF (Random Forest classifier); LD (Local Protein Sequence Descriptors); ACC (Auto Covariance), an auto-covariance transformation feature extraction method; KNN (K-Nearest Neighbor classifier); PR-LPQ (Physicochemical Property Response Matrix combined with Local Phase Quantization descriptor); MAC, an auto-covariance transformation feature extraction method; MCD (Multi-scale Continuous and Discontinuous descriptor); MLD (Multi-scale Local Descriptor); CT (Cosine Transform).
TABLE 14 comparison of different prediction methods for yeast proteins
TABLE 15 comparison of different prediction methods for human proteins
Claims (4)
1. A method for determining protein-protein interactions based on random projection ensemble classification, characterized by comprising the following steps:
A. Protein data screening
Screening protein interaction pairs from the protein database DIP;
B. Substitution matrix representation
Using the BLOSUM62 matrix, a protein sequence of length N produces an N × 20 matrix, the substitution matrix representation (SMR), expressed as:
$$SMR(i, j) = B(P(i), j), \quad i = 1 \ldots N, \quad j = 1 \ldots 20,$$
where P is the protein sequence of N amino acids, P(i) is the amino acid at position i, and B(P(i), j) is the BLOSUM62 entry reflecting how likely the amino acid at position i is to mutate to amino acid j;
C. Discrete cosine transform
The discrete cosine transform (DCT) formula is as follows:
$$DCT(i, j) = k_i k_j \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} Sig(m, n) \cos\frac{\pi (2m+1) i}{2M} \cdot \cos\frac{\pi (2n+1) j}{2N}, \quad 0 \le i \le M-1, \quad 0 \le j \le N-1$$
where $k_i$ and $k_j$ are normalization coefficients;
D. Establishing the random projection ensemble model
Select an n × d original matrix $X_{n \times d}$ and obtain the mapped low-dimensional matrix $X^{RP}_{n \times k}$. The random projection can be expressed as:
$$X^{RP}_{n \times k} = X_{n \times d} A_{d \times k}$$
where A is a d × k random matrix and k ≤ d.
2. The method for determining protein-protein interactions based on random projection ensemble classification according to claim 1, further comprising E. model evaluation:
determining the accuracy (Acc), the sensitivity (Sen), the positive predictive value (PE) and the Matthews correlation coefficient (MCC);
$$Acc = \frac{TP + TN}{TP + FP + TN + FN}$$
$$Sen = \frac{TP}{TP + FN}$$
$$PE = \frac{TP}{TP + FP}$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN) \times (TN + FP) \times (TP + FP) \times (TN + FN)}}$$
3. The method according to claim 1, wherein in step A protein interaction pairs are screened from the protein database DIP by first removing proteins of fewer than 50 residues and then removing protein pairs with greater than 40% sequence identity, the remaining protein pairs being retained for use.
4. The method according to claim 3, wherein the remaining protein pairs are selected as the positive data set, and, following step A, the same number of protein pairs from different subcellular locations are selected as the negative data set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710653339.5A CN107607723A (en) | 2017-08-02 | 2017-08-02 | A kind of protein-protein interaction assay method based on accidental projection Ensemble classifier |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710653339.5A CN107607723A (en) | 2017-08-02 | 2017-08-02 | A kind of protein-protein interaction assay method based on accidental projection Ensemble classifier |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107607723A true CN107607723A (en) | 2018-01-19 |
Family
ID=61064844
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710653339.5A Pending CN107607723A (en) | 2017-08-02 | 2017-08-02 | A kind of protein-protein interaction assay method based on accidental projection Ensemble classifier |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107607723A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106096636A (en) * | 2016-05-31 | 2016-11-09 | 安徽工业大学 | A kind of Advancement Type mild cognition impairment recognition methods based on neuroimaging |
CN106778065A (en) * | 2016-12-30 | 2017-05-31 | 同济大学 | A kind of Forecasting Methodology based on multivariate data prediction DNA mutation influence interactions between protein |
Non-Patent Citations (2)
Title |
---|
SHIBIAO WAN ET AL.: "Ensemble Random Projection for Multi-label Classification with Application to Protein Subcellular Localization", 《IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS》 * |
YU-AN HUANG ET AL.: "Using Weighted Sparse Representation Model Combined with Discrete Cosine Transformation to Predict Protein-Protein Interactions from Protein Sequence", 《BIOMED RESEARCH INTERNATIONAL》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108280236A (en) * | 2018-02-28 | 2018-07-13 | 福州大学 | A kind of random forest visualization data analysing method based on LargeVis |
CN108280236B (en) * | 2018-02-28 | 2022-03-15 | 福州大学 | Method for analyzing random forest visual data based on LargeVis |
CN111916148A (en) * | 2020-08-13 | 2020-11-10 | 中国计量大学 | Method for predicting protein interaction |
CN111916148B (en) * | 2020-08-13 | 2023-01-31 | 中国计量大学 | Method for predicting protein interaction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180119 |