CN111370068A - Method and device for predicting interaction of protein isomer pairs - Google Patents
- Publication number
- CN111370068A (application CN202010157694.5A)
- Authority
- CN
- China
- Prior art keywords
- protein
- pair
- pairs
- interaction
- isomer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
Landscapes
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Medical Informatics (AREA)
- General Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Biophysics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Bioethics (AREA)
- Data Mining & Analysis (AREA)
- Chemical & Material Sciences (AREA)
- Molecular Biology (AREA)
- Genetics & Genomics (AREA)
- Artificial Intelligence (AREA)
- Analytical Chemistry (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Databases & Information Systems (AREA)
- Epidemiology (AREA)
- Evolutionary Computation (AREA)
- Public Health (AREA)
- Software Systems (AREA)
- Investigating Or Analysing Biological Materials (AREA)
Abstract
The invention discloses a method and a device for predicting the interaction of protein isoform pairs. First, n features are determined for each protein isoform pair from the Pearson correlation coefficients of the pair's expression data in n tissues. Next, protein interaction data containing interacting protein pairs are obtained; the protein pairs that correspond to exactly one protein isoform pair are selected, and the corresponding isoform pair is labeled 1. Non-interacting protein pairs are generated by random sampling, and all of their corresponding isoform pairs are labeled 0. The labeled isoform pairs serve as samples on which a prediction model is trained. Finally, the feature data of an isoform pair to be classified are input into the trained model to obtain its prediction result. The invention can predict protein interaction relationships more accurately.
Description
Technical Field
The invention belongs to the field of bioinformatics and relates to a method and a device for predicting the interaction of protein isoform pairs.
Background
Protein-protein interaction (PPI) refers to the process by which two or more protein molecules form a protein complex through non-covalent bonds. PPIs are the basis of cellular life activities. In organisms, some proteins function as monomers, but most function together with chaperones or other proteins. Interactions between proteins form a main component of the cell's biochemical reaction network, and the protein-protein interaction network and the transcriptional regulation network are of great significance for regulating cells and their signaling.
Because of alternative splicing, one gene can encode multiple protein isoforms, so one protein can have several isoforms and one protein pair can correspond to several protein isoform pairs. For example, if protein A has 3 isoforms and protein B has 2 isoforms, then by combination there are 3 × 2 = 6 candidate isoform pairs between the two proteins, any of which may interact. In short, the relationship between two proteins can comprise several different interaction relationships at the isoform level. The isoform diversity produced by alternative splicing makes the study of protein interactions complex and challenging, and current protein interaction data do not distinguish between isoforms.
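The counting in the example above can be sketched in a few lines of Python. The isoform names (A1, B1, and so on) and the function name `isoform_pairs` are hypothetical and serve only as illustration.

```python
from itertools import product

def isoform_pairs(isoforms_a, isoforms_b):
    """Return every candidate isoform pair between two proteins."""
    return list(product(isoforms_a, isoforms_b))

# Protein A with 3 isoforms and protein B with 2 isoforms (hypothetical names).
pairs = isoform_pairs(["A1", "A2", "A3"], ["B1", "B2"])
print(len(pairs))  # 3 * 2 = 6 candidate isoform pairs
```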
At present, research on protein isoform interactions is still at a preliminary stage; related studies are relatively scarce and progress is slow.
Current methods for predicting protein isoform pair interactions fall into two major categories:
(1) Detection by biological experiments:
The results of such conventional biological experiments are accurate to some extent. The reported experimental method first clones the different mRNA splice isoforms of the same gene, then translates them into different protein isoforms, and finally uses the yeast two-hybrid method to determine the interactions between the protein isoforms. Such experiments require a great deal of manpower, experimental resources, and time, and can only determine the interactions among a small number of protein isoforms. To date, the most comprehensive experiments have only obtained interaction data for the protein isoforms of over 100 genes.
(2) Prediction by machine learning:
In the published IIIDB work on protein isoforms, researchers used the RNA-seq data corresponding to protein isoforms and the isoform expression data as features, combined them with protein domain interaction data, and built a logistic regression (LR) model to predict the interactions between protein isoforms. Compared with experimental methods, this approach greatly saves manpower and experimental resources and can predict the interactions between all possible protein isoforms. Its limitations are that the model cannot make full use of the information of all protein isoforms and that logistic regression has difficulty capturing the nonlinear relationship between isoform expression levels and isoform pair interactions, so the prediction accuracy needs further improvement.
Therefore, it is necessary to propose new schemes for predicting the interaction between protein isomers.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the shortcomings of the prior art, a method and a device for predicting the interaction of protein isoform pairs, so that these interactions can be predicted more accurately and the large amount of manpower and material resources consumed by biochemical experiments can be avoided.
The technical scheme provided by the invention is as follows:
In one aspect, a method for predicting protein isoform pair interactions is provided, comprising the following steps:
A feature extraction step:
combining the protein isoforms pairwise to form protein isoform pairs; for each protein isoform pair, determining its n feature values based on the Pearson correlation coefficients of its expression data in the n tissues, respectively;
A training set construction and model training step:
obtaining protein interaction data comprising interacting protein pairs, selecting from these the protein pairs that correspond to only one protein isoform pair, and setting the label of the corresponding isoform pair to 1;
generating non-interacting protein pairs by random sampling, and setting the labels of all protein isoform pairs corresponding to these protein pairs to 0. This labeling follows the idea of multi-instance learning, in which each protein pair is a bag and each of its corresponding protein isoform pairs is an instance. If a bag is positive (the protein pair interacts, label 1), at least one instance in the bag is positive, that is, at least one of the corresponding isoform pairs interacts (label 1); consequently, if the protein pair corresponds to only one isoform pair, that isoform pair necessarily interacts (label 1). If a bag is negative (the protein pair does not interact, label 0), every instance in the bag is negative, that is, none of the corresponding isoform pairs interacts (label 0).
The protein isoform pairs with determined labels are taken as samples; the features of all samples are denoted F0 and their labels L0; a prediction model is trained on (F0, L0);
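The initial labeling rule described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function name `initial_training_set`, the `isoforms_of` mapping, and the protein and isoform names are assumptions introduced here.

```python
def initial_training_set(positive_pairs, negative_pairs, isoforms_of):
    """Derive initial instance labels from bag (protein-pair) labels.

    positive_pairs / negative_pairs: iterables of (protein_a, protein_b).
    isoforms_of: dict mapping a protein name to a list of its isoform names.
    Returns a dict {(isoform_a, isoform_b): label}.
    """
    labels = {}
    for a, b in positive_pairs:
        candidates = [(x, y) for x in isoforms_of[a] for y in isoforms_of[b]]
        # A positive bag only guarantees at least one positive instance, so the
        # instance label is unambiguous only when the bag holds one instance.
        if len(candidates) == 1:
            labels[candidates[0]] = 1
    for a, b in negative_pairs:
        # Every instance of a negative bag is negative.
        for x in isoforms_of[a]:
            for y in isoforms_of[b]:
                labels[(x, y)] = 0
    return labels

# Hypothetical toy data: A and B have one isoform each, C and D have two.
isoforms_of = {"A": ["A1"], "B": ["B1"], "C": ["C1", "C2"], "D": ["D1", "D2"]}
labels = initial_training_set([("A", "B"), ("A", "C")], [("C", "D")], isoforms_of)
```

In this toy input, the positive pair (A, C) contributes no initial sample because it corresponds to two isoform pairs, while the negative pair (C, D) contributes four samples labeled 0.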
A prediction step:
inputting the feature data of the protein isoform pair to be classified into the trained prediction model to obtain the prediction result for that pair, where the prediction result is either the probability of interaction or a label indicating whether the pair interacts.
Further, in the feature extraction step, for any protein isoform pair, the Pearson correlation coefficient r of its expression data in any tissue is calculated as:
$$r = \frac{\sum_{i=1}^{m}\left(X_i-\bar{X}\right)\left(Y_i-\bar{Y}\right)}{\sqrt{\sum_{i=1}^{m}\left(X_i-\bar{X}\right)^2}\;\sqrt{\sum_{i=1}^{m}\left(Y_i-\bar{Y}\right)^2}}$$
wherein $X_i$ and $Y_i$ respectively represent the expression levels of the two protein isoforms of the pair in the i-th sample of the tissue, $\bar{X}$ and $\bar{Y}$ respectively represent the mean values of $X_i$ and $Y_i$, and m is the number of samples of the tissue.
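As a sketch of how the per-tissue Pearson features might be computed, the following uses only the Python standard library. The function names and the choice to return 0.0 for a constant expression profile are assumptions made here; the source does not say how a zero-variance case is handled.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    m = len(x)
    mean_x = sum(x) / m
    mean_y = sum(y) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant profile as uncorrelated
    return cov / math.sqrt(var_x * var_y)

def pair_features(expr_a, expr_b):
    """expr_a / expr_b hold one list of per-sample expression values per tissue.

    Returns one Pearson coefficient per tissue, i.e. the n features of the pair.
    """
    return [pearson(xa, xb) for xa, xb in zip(expr_a, expr_b)]

# Two tissues with three samples each (made-up numbers).
features = pair_features(
    [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]],
    [[2.0, 4.0, 6.0], [3.0, 2.0, 1.0]],
)
```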
Further, the feature extraction step specifically includes:
firstly, storing Pearson correlation coefficients of expression data of all protein isomer pairs in n tissues in a matrix M, wherein each row in M corresponds to one protein isomer pair, each column in M corresponds to one tissue, and each element in M represents the Pearson correlation coefficient of the expression data of one protein isomer pair in one tissue;
then, the Fisher-Z transformation is applied to the matrix M, with the formula:
$$Z = \frac{1}{2}\ln\frac{1+r}{1-r}$$
wherein r represents an element value of the matrix M before transformation, and Z is the value obtained by applying the Fisher-Z transformation to r. The transformation expands the range of the matrix elements from [-1, 1] to (-∞, +∞). Expanding the range of the feature values helps in analyzing the influence of the feature values in different tissues on the prediction result;
finally, recording the matrix obtained after transformation as M'; for each protein isoform pair, its n elements in the corresponding row in M' are taken as its n eigenvalues.
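The transformation of M into M' can be sketched as below. The clipping parameter `eps` is an assumption added here so that r = ±1 does not produce an infinite value; the source does not describe how such values are handled.

```python
import math

def fisher_z(r, eps=1e-6):
    """Fisher-Z transform of one Pearson coefficient."""
    # Assumption: clip r slightly inside (-1, 1) to avoid division by zero.
    r = max(-1.0 + eps, min(1.0 - eps, r))
    return 0.5 * math.log((1.0 + r) / (1.0 - r))

def transform_matrix(M):
    """Apply the transform element-wise; rows are isoform pairs, columns tissues."""
    return [[fisher_z(r) for r in row] for row in M]

M = [[0.0, 0.5], [-0.5, 0.9]]  # made-up coefficients
M_prime = transform_matrix(M)  # each row holds the n feature values of one pair
```

The transform is odd and equals the inverse hyperbolic tangent of r, so `math.atanh` would give the same values away from the clipped endpoints.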
Further, the prediction model is a random forest model.
Further, the training set constructing and model training step comprises the following steps:
step 1, forming a set P from the interacting protein pairs and setting the label of each protein pair in P to 1; forming a set N from the non-interacting protein pairs and setting the label of each protein pair in N to 0; taking the union of P and N to obtain the protein interaction data set Q; for any protein isoform pair, if its corresponding protein pair is present in Q, adding the isoform pair to the set R;
the initial training set is (F0, L0), and the iteration counter is t = 1;
step 2, iterative training of a random forest model:
Step A: in the t-th iteration, a random forest model is first trained on (F0, L0); then, for each protein isoform pair in the set R, its feature data are input into the trained random forest model to obtain a prediction result W, where W contains the score of every isoform pair in R; the higher the score, the more likely the corresponding isoform pair is to interact;
Step B: if t ≥ 2, judge whether the prediction result W has converged; if it has, end the iteration and take the random forest model trained in this iteration as the trained prediction model; otherwise go to Step C; if t < 2, skip this step and go to Step C. W is judged to have converged when the prediction result of the current (t-th) iteration changes only slightly relative to that of the previous ((t-1)-th) iteration;
Step C: select the data for the next iteration: for each protein pair labeled 1 in the set Q, select its corresponding isoform pair with the highest score as a positive sample and set its label to 1; for each protein pair labeled 0 in Q, select its corresponding isoform pair with the highest score as a negative sample and set its label to 0; take the isoform pairs thus selected as the new samples, their feature data as the new F0, and their labels as the new L0; set t = t + 1 and return to Step A for the next iteration.
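The loop formed by Steps A to C can be sketched as follows. To keep the sketch dependent only on the standard library, a simple difference-of-centroids linear scorer stands in for the random forest; it is not the model described in the text. The convergence test (maximum absolute score change below `tol`), the `max_iter` cap, all function names, and the toy data are assumptions, since the source only says the prediction result should change slightly between iterations.

```python
def train_linear_scorer(features, labels):
    """Stand-in for the random forest (assumption): project onto the direction
    from the negative centroid to the positive centroid. Both classes must be
    present in the training data."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    pos = centroid([f for f, y in zip(features, labels) if y == 1])
    neg = centroid([f for f, y in zip(features, labels) if y == 0])
    w = [p - n for p, n in zip(pos, neg)]
    mid = [(p + n) / 2 for p, n in zip(pos, neg)]
    return lambda f: sum(wi * (fi - mi) for wi, fi, mi in zip(w, f, mid))

def mil_train(bags, bag_labels, F0, L0, train=train_linear_scorer,
              tol=1e-6, max_iter=50):
    """bags: protein pair -> feature vectors of its isoform pairs (set R).
    bag_labels: protein pair -> 0/1 (set Q)."""
    prev = None
    for t in range(1, max_iter + 1):
        score = train(F0, L0)                                   # Step A
        W = {b: [score(f) for f in inst] for b, inst in bags.items()}
        if prev is not None:                                    # Step B (t >= 2)
            change = max(abs(w - p) for b in W for w, p in zip(W[b], prev[b]))
            if change < tol:
                break
        prev = W
        F0, L0 = [], []                                         # Step C
        for b, inst in bags.items():
            best = max(range(len(inst)), key=lambda i: W[b][i])
            F0.append(inst[best])          # highest-scoring instance of the bag
            L0.append(bag_labels[b])       # inherits the bag label
    return score, W, t

bags = {
    "P1": [[2.0, 2.0]],
    "P2": [[1.8, 2.1], [-1.0, -1.2]],
    "N1": [[-2.0, -2.0], [-0.5, -0.4]],
}
bag_labels = {"P1": 1, "P2": 1, "N1": 0}
F0 = [[2.0, 2.0], [-2.0, -2.0], [-0.5, -0.4]]   # initial samples
L0 = [1, 0, 0]
score, W, t = mil_train(bags, bag_labels, F0, L0)
```

Passing a different `train` callable, for example one that wraps a random forest classifier and returns a scoring function, would bring the sketch closer to the described method.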
In another aspect, a device for predicting the interaction of protein isoform pairs is provided, comprising the following modules:
a feature extraction module:
for combining two protein isomers to form a protein isomer pair; for each protein isoform pair, determining its n eigenvalues based on the pearson correlation coefficients of its expression data in n tissues, respectively;
a training set construction and model training module:
obtaining protein interaction data which comprises protein pairs with interaction, and screening the protein pairs corresponding to only one protein isomer pair for the protein pairs to enable the label of the corresponding protein isomer pair to be 1;
generating protein pairs without interaction by using a random sampling method, and enabling labels of all corresponding protein isomer pairs to be 0 for the protein pairs;
taking the protein isoform pairs with determined labels as samples; the features of all samples are denoted F0 and their labels L0; training a prediction model on (F0, L0);
a prediction module:
and inputting the characteristic data of the protein isomer pair to be classified into a trained prediction model to obtain a prediction result of the protein isomer pair.
The working principle of each module of the device is described in the detailed description of the corresponding steps in the protein isomer pair interaction prediction method.
In another aspect, an electronic device is provided, which includes a memory and a processor, wherein the memory stores a computer program, and the computer program, when executed by the processor, causes the processor to implement the protein isoform pair interaction prediction method.
In another aspect, a computer readable storage medium is provided, on which a computer program is stored, which computer program, when executed by a processor, implements the protein isoform pair interaction prediction method.
Advantageous Effects:
Based on existing protein isoform expression data and protein interaction data, the invention can accurately predict the interaction of protein isoform pairs. During training, the sample set is updated iteratively, which addresses the large errors a traditional machine learning model may produce when only a few labeled samples are known, and yields better prediction results. Measuring isoform pair interactions with traditional biological methods consumes a great deal of manpower and material resources and can cover only a small number of samples; the invention addresses these problems as well. Given the raw data, the method can predict whether any protein isoform pair interacts, so that the isoform pairs most likely to interact can be screened out before actual biological experiments and then verified experimentally, avoiding the consumption of a large amount of manpower and material resources.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
FIG. 2 shows precision-recall curves of predictions made with the MIL-RF model on gene expression data downloaded from GTEx and divided into 29 tissues; FIG. 2(a), 2(b) and 2(c) correspond to the prediction results for version1, version2 and version3, respectively.
Detailed Description
The invention will be described in further detail below with reference to the following figures and specific examples:
Example 1:
the embodiment provides a method for predicting the interaction of protein isomer pairs, which comprises the following steps:
a characteristic extraction step:
combining the protein isomers pairwise to form a protein isomer pair; for each protein isoform pair, determining its n eigenvalues based on the pearson correlation coefficients of its expression data in n tissues, respectively;
training set construction and model training:
downloading protein interaction data from a public database, wherein the protein interaction data comprise protein pairs with interaction, a set P is formed by the protein pairs, and the label of each protein pair in the set P is 1; generating protein pairs without interaction by using a random sampling method, forming a set N by the protein pairs, and enabling the label of each protein pair in the set N to be 0; taking a union of the set P and the set N to obtain a protein interaction data set Q;
screening protein pairs corresponding to only one protein isomer pair from the protein pairs in the set P, and enabling the labels of the corresponding protein isomer pairs to be 1; for protein pairs without interaction, the labels of all corresponding protein isomer pairs are 0;
taking the protein isoform pairs with determined labels as samples; the features of all samples are denoted F0 and their labels L0; training a prediction model on (F0, L0);
a prediction step:
and inputting the characteristic data of the protein isomer pair to be classified into a trained prediction model to obtain a prediction result of the protein isomer pair.
Example 2:
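This example builds on Example 1. In the feature extraction step, for any protein isoform pair, the Pearson correlation coefficient r of its expression data in any tissue is calculated as:
$$r = \frac{\sum_{i=1}^{m}\left(X_i-\bar{X}\right)\left(Y_i-\bar{Y}\right)}{\sqrt{\sum_{i=1}^{m}\left(X_i-\bar{X}\right)^2}\;\sqrt{\sum_{i=1}^{m}\left(Y_i-\bar{Y}\right)^2}}$$
wherein $X_i$ and $Y_i$ respectively represent the expression levels of the two protein isoforms of the pair in the i-th sample of the tissue, $\bar{X}$ and $\bar{Y}$ respectively represent the mean values of $X_i$ and $Y_i$, and m is the number of samples of the tissue.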
Example 3:
in this embodiment, on the basis of embodiment 2, the feature extraction step specifically includes:
firstly, storing Pearson correlation coefficients of expression data of all protein isomer pairs in n tissues in a matrix M, wherein each row in M corresponds to one protein isomer pair, each column in M corresponds to one tissue, and each element in M represents the Pearson correlation coefficient of the expression data of one protein isomer pair in one tissue;
then, the Fisher-Z transformation is applied to the matrix M, with the formula:
$$Z = \frac{1}{2}\ln\frac{1+r}{1-r}$$
wherein r represents an element value of the matrix M before transformation, and Z is the value obtained by applying the Fisher-Z transformation to r;
finally, recording the matrix obtained after transformation as M'; for each protein isoform pair, its n elements in the corresponding row in M' are taken as its n eigenvalues.
Example 4:
in this embodiment, based on embodiment 3, the prediction model is a random forest model.
Example 5:
in this embodiment, on the basis of embodiment 4, the training set constructing and model training step includes the following steps:
step 1, for any protein isoform pair, if its corresponding protein pair is present in the protein interaction data set Q, adding the isoform pair to the set R; the initial training set is (F0, L0), and the iteration counter is t = 1;
step 2, iterative training of a random forest model:
Step A: in the t-th iteration, a random forest model is first trained on (F0, L0); then, for each protein isoform pair in the set R, its feature data are input into the trained random forest model to obtain a prediction result W, where W contains the score of every isoform pair in R; the higher the score, the more likely the corresponding isoform pair is to interact;
Step B: if t ≥ 2, judge whether the prediction result W has converged; if it has, end the iteration and take the random forest model trained in this iteration as the trained prediction model; otherwise go to Step C;
Step C: select the data for the next iteration: for each protein pair labeled 1 in the set Q, select its corresponding isoform pair with the highest score as a positive sample and set its label to 1; for each protein pair labeled 0 in Q, select its corresponding isoform pair with the highest score as a negative sample and set its label to 0; take the isoform pairs thus selected as the new samples, their feature data as the new F0, and their labels as the new L0; set t = t + 1 and return to Step A for the next iteration.
Example 6:
This embodiment provides a device for predicting the interaction of protein isoform pairs, comprising the following modules:
a feature extraction module:
for combining two protein isomers to form a protein isomer pair; for each protein isoform pair, determining its n eigenvalues based on the pearson correlation coefficients of its expression data in n tissues, respectively;
a training set construction and model training module:
for downloading protein interaction data from a public database, comprising pairs of protein pairs having interactions, from which a set P is formed, with the label of each pair in the set P being 1; generating protein pairs without interaction by using a random sampling method, forming a set N by the protein pairs, and enabling the label of each protein pair in the set N to be 0; taking a union of the set P and the set N to obtain a protein interaction data set Q;
screening protein pairs corresponding to only one protein isomer pair from the protein pairs in the set P, and enabling the labels of the corresponding protein isomer pairs to be 1; for protein pairs without interaction, the labels of all corresponding protein isomer pairs are 0;
taking the protein isoform pairs with determined labels as samples; the features of all samples are denoted F0 and their labels L0; training a prediction model on (F0, L0);
a prediction module:
and inputting the characteristic data of the protein isomer pair to be classified into a trained prediction model to obtain a prediction result of the protein isomer pair.
The working principle of each module in this embodiment refers to the detailed description of the corresponding steps in embodiments 1 to 5.
Example 7:
the present embodiment provides an electronic device, including a memory and a processor, where the memory stores a computer program, and when the computer program is executed by the processor, the processor is enabled to implement the method according to any one of embodiments 1 to 5.
Example 8:
the present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to any of embodiments 1-5.
Experimental verification:
Experiment 1:
A simulation test was performed to verify the feasibility of Example 5 by comparing the AUC and AUPRC values of the experimental results under different parameter settings.
First, simulated protein isoform data were generated.
The parameters MD ∈ {0.1, 0.2, 0.3} and MGR ∈ {0.2, 0.3, 0.5} were combined pairwise, where MD denotes the difference in mean expression level between interacting and non-interacting protein isoform pairs, and MGR denotes the proportion of proteins that have multiple isoforms among all proteins. Simulated protein isoform data were generated at random, comprising the isoform information and the expression levels of the isoforms in different tissues. The three MD values and three MGR values give 9 groups of simulated data of equal size. The feature dimension of the simulated data was set to 50, with each dimension representing a different tissue.
Then, the protein isomer pair interaction prediction method provided in example 5 was used to train a prediction model;
In the training set construction and model training step, the protein isoform pairs with determined labels are taken as samples. All samples are divided completely at random into 5 equal groups; the feature data of the groups are denoted F1, F2, …, F5 and the corresponding labels L1, L2, …, L5. Over 5 experiments, 4 groups in turn serve as the training set (F0, L0) and the remaining group as the test set, giving a 5-fold cross-validation of the accuracy of the algorithm.
In each experiment, the feature data of the test-set samples are input into the prediction model trained in that experiment to obtain their prediction results, and the accuracy of that model is calculated from these prediction results and the true labels of the test-set samples. The calculated accuracy shows that the algorithm is highly accurate.
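The random 5-group split used for cross-validation can be sketched as below. The function name `five_fold_splits` and the fixed `seed` are assumptions for reproducibility of the illustration; the source only states that the split is completely random and into equal groups.

```python
import random

def five_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # completely random assignment
    folds = [idx[i::k] for i in range(k)]     # k groups of (near-)equal size
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

splits = list(five_fold_splits(10))
for train, test in splits:
    print(len(train), len(test))  # 8 training and 2 test samples per experiment
```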
The prediction results of the 5 experiments are combined to obtain predictions for all samples. The samples are divided into 9 groups according to the values of MD and MGR, and the AUC and AUPRC of each group are computed. AUC is the area under the ROC curve and AUPRC is the area under the precision-recall (PR) curve; for both, a larger value indicates better prediction (classification) performance. The method provided by the embodiment of the invention (called MIL-RF) is compared with the existing method, which builds a protein isoform pair interaction prediction model with logistic regression (abbreviated LR); the results are given in Tables 1-3. For every combination of MD and MGR, the AUC and AUPRC of MIL-RF are higher than those of LR, which shows that the method provided by the embodiment of the invention predicts well and with high accuracy. Moreover, as the sample information becomes richer, that is, as MD and MGR increase, the accuracy of the method also increases, and both AUC and AUPRC reach their maximum at MD = 0.3 and MGR = 0.5. This is consistent with expectations and further verifies the feasibility of the scheme.
TABLE 1 AUC and AUPRC of the two methods for MD = 0.1 and MGR = 0.2, 0.3 and 0.5
MGR | Metric | MIL-RF | LR |
0.2 | AUC | 0.823±0.009 | 0.782±0.011 |
0.2 | AUPRC | 0.230±0.015 | 0.171±0.013 |
0.3 | AUC | 0.866±0.008 | 0.827±0.010 |
0.3 | AUPRC | 0.292±0.017 | 0.220±0.016 |
0.5 | AUC | 0.933±0.004 | 0.904±0.006 |
0.5 | AUPRC | 0.477±0.019 | 0.375±0.016 |
TABLE 2 AUC and AUPRC of the two methods for MD = 0.2 and MGR = 0.2, 0.3 and 0.5
MGR | Metric | MIL-RF | LR |
0.2 | AUC | 0.9088±0.006 | 0.894±0.006 |
0.2 | AUPRC | 0.442±0.016 | 0.395±0.014 |
0.3 | AUC | 0.929±0.007 | 0.916±0.007 |
0.3 | AUPRC | 0.500±0.020 | 0.450±0.016 |
0.5 | AUC | 0.961±0.003 | 0.952±0.003 |
0.5 | AUPRC | 0.638±0.012 | 0.583±0.010 |
TABLE 3 AUC and AUPRC of the two methods for MD = 0.3 and MGR = 0.2, 0.3 and 0.5
MGR | Metric | MIL-RF | LR |
0.2 | AUC | 0.961±0.001 | 0.856±0.003 |
0.2 | AUPRC | 0.688±0.013 | 0.660±0.014 |
0.3 | AUC | 0.970±0.002 | 0.965±0.002 |
0.3 | AUPRC | 0.730±0.013 | 0.702±0.014 |
0.5 | AUC | 0.984±0.002 | 0.981±0.002 |
0.5 | AUPRC | 0.823±0.011 | 0.795±0.012 |
Experiment 2:
This experiment uses protein isoform expression data downloaded from GTEx.
First, protein isoform expression data for 29 different tissues (e.g., head, chest, blood) are obtained, and a protein isoform expression matrix is constructed for each tissue; the matrices are denoted E1, E2, ..., En. For any tissue, each row of the corresponding matrix corresponds to one protein isoform, each column to one sample, and each element gives the expression level of one isoform in one sample of that tissue.
For each tissue, the protein isoforms with an expression rate above 50% are retained for subsequent analysis. The expression rate of a protein isoform in a tissue is calculated as follows: if the tissue has m samples and the expression level of the isoform is greater than 1 in k of them, the expression rate of the isoform in that tissue is k/m.
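The expression-rate filter described above can be sketched as follows. The function names, the dictionary layout of the expression matrix, and the toy values are assumptions for illustration; the threshold of 1 and the 50% cutoff come from the text.

```python
def expression_rate(values, threshold=1.0):
    """Fraction k/m of the tissue's samples whose expression level exceeds 1."""
    return sum(v > threshold for v in values) / len(values)

def filter_isoforms(E, min_rate=0.5):
    """E: isoform name -> expression values across the samples of one tissue.

    Keeps only the isoforms whose expression rate is above min_rate.
    """
    return {iso: vals for iso, vals in E.items()
            if expression_rate(vals) > min_rate}

# Made-up expression matrix for one tissue with m = 4 samples.
E = {"iso1": [0.2, 3.1, 5.0, 2.2], "iso2": [0.0, 0.4, 1.5, 0.9]}
kept = filter_isoforms(E)  # iso1 has rate 3/4, iso2 has rate 1/4
```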
Then, the protein isomer pair interaction prediction method provided in example 5 was used to train a prediction model;
Similarly, in the training set construction and model training step, the protein isoform pairs with determined labels are taken as samples. All samples are divided completely at random into 5 equal groups; the feature data of the groups are denoted F1, F2, …, F5 and the corresponding labels L1, L2, …, L5. Over 5 experiments, 4 groups in turn serve as the training set (F0, L0) and the remaining group as the test set, giving a 5-fold cross-validation of the accuracy of the algorithm.
In each experiment, the feature data of the test-set samples are input into the prediction model trained in that experiment to obtain their prediction results, and the accuracy of that model is calculated from these prediction results and the true labels of the test-set samples. The calculated accuracy shows that the algorithm is highly accurate.
The accuracy of protein isoform prediction of interaction relationships is aided based on the accuracy of protein pair interaction relationship prediction below. And (5) drawing a corresponding accuracy-recall rate curve according to the prediction result obtained in the step (5), and comparing and analyzing. See fig. 2. Three types of data are selected in fig. 2, wherein version1 selects only protein pairs corresponding to only one protein isoform pair in set Q, version2 selects only protein pairs corresponding to two or more protein isoform pairs in set Q, and version3 selects all protein pairs in set Q.
Protein pairs in version1 are randomly sampled according to the positive-negative sample ratio p of version3 (the ratio of protein pairs labeled 1 to protein pairs labeled 0), selecting the largest possible set of protein pairs whose positive-negative sample ratio is p. For each selected protein pair, the feature data of its corresponding protein isomer pair are input into the trained prediction model to obtain the score of that protein isomer pair, and this score is taken as the prediction score of the protein pair. The true labels and prediction scores of all selected protein pairs are collected, and the corresponding precision-recall curve is drawn. The curve is drawn as follows: all protein pair samples are sorted in descending order of prediction score; each sample score is taken in turn, from largest to smallest, as a threshold; samples scoring above the threshold are predicted as positive and the rest as negative; the predictions are compared with the true labels to compute the corresponding TP and FP, and from them the Precision and Recall at that threshold; once all samples have been traversed, all (Precision, Recall) points of the precision-recall curve have been computed and the curve is drawn. The random sampling and prediction process is repeated 100 times to obtain 100 corresponding precision-recall curves, as shown in FIG. 2(a).
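The curve-drawing procedure can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function name and toy data are hypothetical, the sample whose score serves as the threshold is itself counted as predicted positive (score greater than or equal to the threshold), scores are assumed to be distinct, and at least one positive sample is assumed to exist.

```python
def precision_recall_points(labels, scores):
    """Return one (precision, recall) point per threshold, thresholds taken in descending score order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_positives = sum(labels)
    tp = fp = 0
    points = []
    for i in order:
        # Lowering the threshold to scores[i] adds sample i to the predicted positives.
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / total_positives
        points.append((precision, recall))
    return points


# Toy protein pairs: true labels and prediction scores (hypothetical values).
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
points = precision_recall_points(labels, scores)
print(points[0])   # (1.0, 0.5)
print(points[-1])  # (0.5, 1.0)
```

The first point has high precision and low recall, which is the region of the curve discussed in connection with FIG. 2.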
Protein pairs in version2 are randomly sampled according to the positive-negative sample ratio p of version3, selecting the largest possible set of protein pairs whose positive-negative sample ratio is p. For each selected protein pair, the feature data of each of its corresponding protein isomer pairs are input into the trained prediction model to obtain the score of each of those protein isomer pairs, and the highest score is taken as the prediction score of the protein pair. The true labels and prediction scores of all selected protein pairs are collected, and the corresponding precision-recall curve is drawn; the curve is drawn in the same way as described above. The random sampling and prediction process is repeated 100 times to obtain 100 corresponding precision-recall curves, as shown in FIG. 2(b); most of these curves overlap and differ only slightly.
For each protein pair in version3, the feature data of each of its corresponding protein isomer pairs are input into the trained prediction model to obtain the score of each of those protein isomer pairs, and the highest score is taken as the prediction score of the protein pair. The true labels and prediction scores of all protein pairs in version3 are collected, and the corresponding precision-recall curve is drawn, as shown in FIG. 2(c).
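Taking the highest isomer-pair score as the protein pair score can be sketched as below. The dictionary layout, the function name, and the identifiers in the toy data are assumptions made for illustration.

```python
def protein_pair_scores(isoform_scores, isoform_to_protein):
    """Map each protein pair to the highest score among its isoform pairs."""
    best = {}
    for isoform_pair, score in isoform_scores.items():
        protein_pair = isoform_to_protein[isoform_pair]
        if protein_pair not in best or score > best[protein_pair]:
            best[protein_pair] = score
    return best


# Hypothetical model scores for three isoform pairs belonging to two protein pairs.
isoform_scores = {("A1", "B1"): 0.2, ("A2", "B1"): 0.9, ("C1", "D1"): 0.4}
isoform_to_protein = {
    ("A1", "B1"): ("A", "B"),
    ("A2", "B1"): ("A", "B"),
    ("C1", "D1"): ("C", "D"),
}
pair_scores = protein_pair_scores(isoform_scores, isoform_to_protein)
print(pair_scores)  # {('A', 'B'): 0.9, ('C', 'D'): 0.4}
```

For a protein pair with a single isomer pair, as in version1, the maximum reduces to that isomer pair's own score.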
As can be seen from the precision-recall curves drawn in FIG. 2 for the protein pairs in version1, version2, and version3, a protein pair that receives a high prediction score under the above experimental method has a high probability of being a true positive sample (corresponding to points on the curve with high precision and low recall). This indicates that the experimental method can accurately screen out protein pairs that are highly likely to interact, which in turn helps verify that the method provided in embodiment 5 of the present invention can accurately screen out protein isomer pairs that are highly likely to interact. Moreover, each protein pair in version1 corresponds to a unique protein isomer pair, so the precision-recall curve drawn for the protein pairs selected from version1 is in fact a precision-recall curve for the predictions on the corresponding protein isomer pairs; this directly shows that the method provided in embodiment 5 of the present invention can accurately screen out protein isomer pairs that are likely to interact, with good prediction performance.
Claims (8)
1. A method for predicting a protein isoform pair interaction, comprising the steps of:
a feature extraction step:
combining the protein isomers pairwise to form protein isomer pairs; for each protein isoform pair, determining its n feature values based on the Pearson correlation coefficients of its expression data in n tissues, respectively;
a training set construction and model training step:
obtaining protein interaction data comprising protein pairs that have an interaction; from these protein pairs, selecting those that correspond to only one protein isomer pair, and setting the label of the corresponding protein isomer pair to 1;
generating protein pairs that have no interaction by random sampling, and setting the labels of all protein isomer pairs corresponding to these protein pairs to 0;
taking the protein isomer pairs with determined labels as samples; denoting the feature data of all samples as $F_0$ and the labels of all samples as $L_0$; training a prediction model based on $(F_0, L_0)$;
a prediction step:
inputting the feature data of a protein isomer pair to be classified into the trained prediction model to obtain a prediction result for that protein isomer pair.
2. The method for predicting protein isoform pair interaction according to claim 1, wherein said feature extraction step calculates Pearson's correlation coefficient of expression data in any tissue for any protein isoform pair as follows:
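For reference, the standard sample Pearson correlation coefficient can be written as below. The symbols are assumptions introduced here rather than taken from the claim: $x_i$ and $y_i$ are the expression levels of the two protein isomers of a pair in sample $i$ of a given tissue, $m$ is the number of samples of that tissue, and $\bar{x}$, $\bar{y}$ are the corresponding sample means.

```latex
r = \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{m} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{m} (y_i - \bar{y})^2}}
```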
3. The method for predicting protein isoform pair interactions according to claim 1, wherein said feature extraction step comprises:
firstly, storing Pearson correlation coefficients of expression data of all protein isomer pairs in n tissues in a matrix M, wherein each row in M corresponds to one protein isomer pair, each column in M corresponds to one tissue, and each element in M represents the Pearson correlation coefficient of the expression data of one protein isomer pair in one tissue;
then, applying the Fisher-Z transformation to the matrix M according to the formula:
$$Z = \frac{1}{2}\ln\frac{1+r}{1-r}$$
wherein r represents an element value of the matrix M before transformation, and Z is the value obtained by applying the Fisher-Z transformation to r;
finally, denoting the transformed matrix as M'; for each protein isoform pair, taking the n elements of its corresponding row in M' as its n feature values.
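A minimal sketch of the feature computation for a single protein isomer pair follows, using the usual definitions of the Pearson correlation coefficient and the Fisher-Z transformation, Z = 0.5 * ln((1 + r) / (1 - r)). The function names and the toy expression vectors are hypothetical. Clipping r slightly inside (-1, 1) and assuming non-constant expression vectors are choices made here to keep the arithmetic finite; the claim does not specify how such cases are handled.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def fisher_z(r, eps=1e-6):
    """Fisher-Z transform; r is clipped to (-1, 1) so the logarithm stays finite (assumption)."""
    r = max(-1.0 + eps, min(1.0 - eps, r))
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def pair_features(expr_a, expr_b):
    """One feature per tissue: Fisher-Z of the Pearson correlation of the two isoforms."""
    return [fisher_z(pearson(a, b)) for a, b in zip(expr_a, expr_b)]


# Two hypothetical tissues; each entry is one isoform's expression across that tissue's samples.
expr_a = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0]]
expr_b = [[2.0, 4.0, 6.0, 9.0], [1.0, 3.0, 2.0]]
features = pair_features(expr_a, expr_b)
print(features)
```

The resulting list corresponds to one row of M'; stacking the rows of all protein isomer pairs gives the feature matrix.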
4. The method of predicting protein isoform pair interactions according to claim 1, wherein said prediction model is a random forest model.
5. The method of predicting protein isoform pair interactions according to claim 4, wherein said training set constructing and model training steps comprise the steps of:
step 1, for protein pairs that have an interaction, setting their labels to 1; for protein pairs that have no interaction, setting their labels to 0; forming a protein interaction data set Q from the protein pairs labeled 1 and the protein pairs labeled 0; for any protein isoform pair, if its corresponding protein pair is present in the protein interaction data set Q, adding the protein isoform pair to a set R;
the initial training set is $(F_0, L_0)$, and the iteration number t is 1;
step 2, iteratively training a random forest model:
step A: in the t-th iteration, first training a random forest model based on $(F_0, L_0)$; then, for each protein isomer pair in the set R, inputting its feature data into the trained random forest model to obtain a prediction result W, wherein W comprises the score of each protein isomer pair in the set R, and a higher score indicates a higher likelihood that the corresponding protein isomer pair has an interaction;
step B: if the iteration number is greater than or equal to 2, judging whether the prediction result W has converged; if W has converged, ending the iteration and taking the random forest model trained in this iteration as the trained prediction model; otherwise, performing step C;
step C: selecting the data for the next iteration, namely: for each protein pair labeled 1 in the set Q, selecting its corresponding protein isomer pair with the highest score and setting the label of that protein isomer pair to 1; for each protein pair labeled 0 in the set Q, selecting its corresponding protein isomer pair with the highest score and setting the label of that protein isomer pair to 0; taking the protein isomer pairs thus selected as new samples, their feature data as the new $F_0$, and their corresponding labels as the new $L_0$; setting t = t + 1 and returning to step A for the next iteration.
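The iteration of steps A to C can be outlined with a model-agnostic sketch. Several parts are assumptions and not taken from the claim: the convergence test (largest absolute change in W below `tol`), the `max_iter` safeguard, the `train_fn` interface, and the toy one-dimensional `centroid_trainer`, which only stands in for a random forest so that the example runs without external libraries.

```python
def iterative_training(train_fn, F0, L0, R_features, protein_of, protein_label,
                       tol=1e-6, max_iter=50):
    """train_fn(F, L) returns score_fn; score_fn(feature_vector) returns a float score.

    R_features: isoform pair -> feature vector (the set R)
    protein_of: isoform pair -> protein pair in Q
    protein_label: protein pair -> 0 or 1 (the labels in Q)
    """
    prev_W = None
    for t in range(1, max_iter + 1):
        # Step A: train on the current (F0, L0) and score every isoform pair in R.
        score_fn = train_fn(F0, L0)
        W = {pair: score_fn(feat) for pair, feat in R_features.items()}
        # Step B: from the second iteration on, stop once W no longer changes (assumed criterion).
        if prev_W is not None and max(abs(W[p] - prev_W[p]) for p in W) < tol:
            return score_fn, W, t
        # Step C: per protein pair, keep its highest-scoring isoform pair and inherit the label.
        best = {}
        for pair, score in W.items():
            protein = protein_of[pair]
            if protein not in best or score > W[best[protein]]:
                best[protein] = pair
        F0 = [R_features[pair] for pair in best.values()]
        L0 = [protein_label[protein] for protein in best]
        prev_W = W
    return score_fn, W, max_iter


def centroid_trainer(F, L):
    """Toy stand-in for a random forest: scores 1-D features by closeness to the positive mean."""
    pos = [f[0] for f, label in zip(F, L) if label == 1]
    neg = [f[0] for f, label in zip(F, L) if label == 0]
    mu_pos, mu_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    return lambda f: (f[0] - mu_neg) ** 2 - (f[0] - mu_pos) ** 2


# Hypothetical data: protein pair "A" (label 1) and "B" (label 0), two isoform pairs each.
R_features = {"a1": [0.9], "a2": [0.2], "b1": [0.1], "b2": [0.4]}
protein_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
protein_label = {"A": 1, "B": 0}
model, W, iterations = iterative_training(
    centroid_trainer, [[0.8], [0.0]], [1, 0], R_features, protein_of, protein_label)
print(iterations, max(W, key=W.get))  # 3 a1
```

Passing the trainer as a function keeps the relabeling loop independent of the classifier, so a random forest implementation exposing the same two-call interface could be substituted for the toy trainer.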
6. A device for predicting protein isoform pair interactions, comprising the following modules:
a feature extraction module:
configured to combine the protein isomers pairwise to form protein isomer pairs and, for each protein isoform pair, to determine its n feature values based on the Pearson correlation coefficients of its expression data in n tissues, respectively;
a training set construction and model training module:
configured to obtain protein interaction data comprising protein pairs that have an interaction, to select from these protein pairs those that correspond to only one protein isomer pair, and to set the label of the corresponding protein isomer pair to 1;
to generate protein pairs that have no interaction by random sampling, and to set the labels of all protein isomer pairs corresponding to these protein pairs to 0;
and to take the protein isomer pairs with determined labels as samples, denote the feature data of all samples as $F_0$ and the labels of all samples as $L_0$, and train a prediction model based on $(F_0, L_0)$;
a prediction module:
configured to input the feature data of a protein isomer pair to be classified into the trained prediction model to obtain a prediction result for that protein isomer pair.
7. An electronic device comprising a memory and a processor, the memory having stored therein a computer program, wherein the computer program, when executed by the processor, causes the processor to implement the method of any of claims 1-5.
8. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010157694.5A CN111370068B (en) | 2020-03-09 | 2020-03-09 | Protein isomer pair interaction prediction method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010157694.5A CN111370068B (en) | 2020-03-09 | 2020-03-09 | Protein isomer pair interaction prediction method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111370068A (en) | 2020-07-03 |
CN111370068B (en) | 2022-11-04 |
Family
ID=71210433
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010157694.5A Active CN111370068B (en) | 2020-03-09 | 2020-03-09 | Protein isomer pair interaction prediction method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111370068B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105354441A (en) * | 2015-10-23 | 2016-02-24 | 上海交通大学 | Vegetable protein interaction network construction method |
US20170298418A1 (en) * | 2014-10-21 | 2017-10-19 | uBiome, Inc. | Method and system for microbiome-derived diagnostics and therapeutics for conditions associated with microbiome functional features |
US20180080079A1 (en) * | 2001-11-07 | 2018-03-22 | Bioventures, Llc | Diagnosis, prognosis and identification of potential therapeutic targets of multiple myeloma based on gene expression profiling |
CN108763861A (en) * | 2018-04-16 | 2018-11-06 | 深圳大学 | Prediction technique, device, terminal and the medium of protein-protein interaction |
CN109801674A (en) * | 2019-01-30 | 2019-05-24 | 长沙学院 | A kind of key protein matter recognition methods based on the fusion of isomery bio-networks |
Also Published As
Publication number | Publication date |
---|---|
CN111370068B (en) | 2022-11-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112735535B (en) | Prediction model training method, prediction model training device, data prediction method, data prediction device and storage medium | |
CN113299346B (en) | Classification model training and classifying method and device, computer equipment and storage medium | |
WO2023035745A1 (en) | Olfactory receptor screening method and apparatus, model training method and apparatus, and wine product identification method and apparatus | |
CN108537005B (en) | A kind of crucial lncRNA prediction technique based on BPSO-KNN model | |
CN109559781A (en) | A kind of two-way LSTM and CNN model that prediction DNA- protein combines | |
Golugula et al. | Evaluating feature selection strategies for high dimensional, small sample size datasets | |
CN114881343B (en) | Short-term load prediction method and device for power system based on feature selection | |
CN115798730A (en) | Method, apparatus and medium for circular RNA-disease association prediction based on weighted graph attention and heterogeneous graph neural networks | |
CN106951728B (en) | Tumor key gene identification method based on particle swarm optimization and scoring criterion | |
CN112017730B (en) | Cell screening method and device based on expression quantity prediction model | |
CN111048145B (en) | Method, apparatus, device and storage medium for generating protein prediction model | |
CN112163632A (en) | Application of semi-supervised extreme learning machine based on bat algorithm in industrial detection | |
CN111370068B (en) | Protein isomer pair interaction prediction method and device | |
CN110400605A (en) | A kind of the ligand bioactivity prediction technique and its application of GPCR drug targets | |
CN112102882B (en) | Quality control system and method for NGS detection process of tumor sample | |
CN113936804B (en) | System for constructing model for predicting risk of continuous air leakage after lung cancer resection | |
CN114819151A (en) | Biochemical path planning method based on improved agent-assisted shuffled frog leaping algorithm | |
CN108595914A (en) | One grows tobacco mitochondrial RNA (mt RNA) editing sites high-precision forecasting method | |
CN104636636A (en) | Protein remote homology detecting method and device | |
CN111104950A (en) | K value prediction method and device in k-NN algorithm based on neural network | |
CN113035363B (en) | Probability density weighted genetic metabolic disease screening data mixed sampling method | |
Chin et al. | Optimized local protein structure with support vector machine to predict protein secondary structure | |
CN112885409B (en) | Colorectal cancer protein marker selection system based on feature selection | |
CN116994652B (en) | Information prediction method and device based on neural network and electronic equipment | |
US20230116904A1 (en) | Selecting a cell line for an assay |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||