CN111370068A - Method and device for predicting interaction of protein isomer pairs - Google Patents

Publication number
CN111370068A
Authority
CN
China
Prior art keywords
protein
pair
pairs
interaction
isomer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010157694.5A
Other languages
Chinese (zh)
Other versions
CN111370068B (en)
Inventor
王建新
文骥威
李洪东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN202010157694.5A
Publication of CN111370068A
Application granted
Publication of CN111370068B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Abstract

The invention discloses a method and a device for predicting the interaction of protein isoform pairs. The method comprises the following steps: first, determining n features for each protein isoform pair based on the Pearson correlation coefficients of the expression data of the pair in each of n tissues; then obtaining protein interaction data comprising protein pairs with interaction, selecting from these the protein pairs that correspond to only one protein isoform pair, and setting the label of the corresponding protein isoform pair to 1; generating protein pairs without interaction by random sampling and setting the labels of all corresponding protein isoform pairs to 0; taking the protein isoform pairs with determined labels as samples and training a prediction model on the sample data; and finally, inputting the feature data of a protein isoform pair to be classified into the trained prediction model to obtain its prediction result. The invention can predict protein interaction relationships more accurately.

Description

Method and device for predicting interaction of protein isomer pairs
Technical Field
The invention belongs to the field of bioinformatics and relates to a method and a device for predicting the interaction of protein isoform pairs.
Background
Protein-protein interaction (PPI) refers to the process by which two or more protein molecules form a protein complex through non-covalent bonds. PPIs underlie the vital activities of the cell: although some proteins function as monomers, most function together with chaperones or other proteins. Protein-protein interactions are a major component of the cell's biochemical reaction network, and the protein-protein interaction network and the transcription regulation network are important for regulating cells and their signals.
Because of alternative splicing, one gene can encode multiple protein isoforms, so one protein can have several isoforms and one protein pair can correspond to several protein isoform pairs. For example, if protein A has 3 isoforms and protein B has 2 isoforms, then by combination there are 3 × 2 = 6 candidate isoform pairs between the two proteins, any of which may actually interact. In short, an interaction between two proteins may take several different forms at the isoform level. The isoform diversity produced by alternative splicing makes the study of protein interactions complex and challenging, and current protein interaction data do not distinguish between isoforms.
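The counting in the example above can be illustrated with a minimal sketch. The isoform identifiers are hypothetical placeholders, not names taken from any database:

```python
from itertools import product

def isoform_pairs(isoforms_a, isoforms_b):
    """Return every candidate isoform pair between two proteins."""
    return list(product(isoforms_a, isoforms_b))

# Hypothetical isoform identifiers, for illustration only.
protein_a = ["A-1", "A-2", "A-3"]   # protein A has 3 isoforms
protein_b = ["B-1", "B-2"]          # protein B has 2 isoforms

pairs = isoform_pairs(protein_a, protein_b)
print(len(pairs))  # 6 candidate isoform pairs for one protein pair
```

A single protein-level interaction record therefore leaves open which of these candidate pairs actually interact.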
At present, research on interactions between protein isoforms is still at a preliminary stage; related studies are scarce and progress is slow.
Current methods for predicting protein isoform pair interactions fall into two major categories:
(1) Detection by biological experiment:
The results obtained by such conventional biological experiments are accurate to some extent. The reported experimental method first clones the different mRNA splice isoforms of the same gene, translates them into different protein isoforms, and finally uses the yeast two-hybrid method to determine the interactions between the protein isoforms. Such experiments require substantial manpower, experimental resources and time, and can only measure interactions among a small number of protein isoforms. To date, the most comprehensive experiment has obtained interaction data for the protein isoforms of only a little over 100 genes.
(2) Prediction by machine learning:
In the published work on IIIDB, researchers used the RNA-seq data corresponding to protein isoforms and the expression data of the protein isoforms as features, combined them with protein domain interaction data, and built a logistic regression (LR) model to predict the interactions between protein isoforms. Compared with the experimental approach, this method greatly saves manpower and experimental resources and can predict the interactions between all possible protein isoforms. Its limitations are that the model cannot make full use of the information of all protein isoforms, and that logistic regression has difficulty capturing the nonlinear relationship between protein isoform expression levels and protein isoform pair interactions, so the prediction accuracy needs further improvement.
Therefore, it is necessary to propose new schemes for predicting the interaction between protein isomers.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the shortcomings of the prior art, a method and a device for predicting the interaction of protein isoform pairs, so that such interactions can be predicted more accurately and the large amount of manpower and material resources consumed by biochemical experiments can be avoided.
The technical scheme provided by the invention is as follows:
In one aspect, a method for predicting protein isoform pair interactions is provided, comprising the following steps:
A feature extraction step:
combining the protein isoforms pairwise to form protein isoform pairs; for each protein isoform pair, determining its n feature values from the Pearson correlation coefficients of its expression data in each of n tissues;
A training set construction and model training step:
obtaining protein interaction data, which comprise protein pairs with interaction; from these protein pairs, selecting those that correspond to only one protein isoform pair, and setting the label of the corresponding protein isoform pair to 1;
generating protein pairs without interaction by random sampling, and setting the labels of all protein isoform pairs corresponding to these protein pairs to 0. This labelling follows the idea of multi-instance learning, in which each protein pair is a bag and its corresponding protein isoform pairs are the instances in the bag. If a bag has an interaction (label 1), at least one instance in the bag has an interaction (label 1), i.e. at least one of the protein isoform pairs corresponding to the protein pair interacts; if the protein pair corresponds to only one protein isoform pair, that isoform pair necessarily interacts (label 1). If a bag has no interaction (label 0), no instance in the bag has an interaction (label 0), i.e. none of the protein isoform pairs corresponding to the protein pair interacts;
taking the protein isoform pairs with determined labels as samples; denoting the features of all samples as $F_0$ and the labels of all samples as $L_0$; training a prediction model based on $(F_0, L_0)$;
A prediction step:
inputting the feature data of a protein isoform pair to be classified into the trained prediction model to obtain its prediction result, which is either the likelihood of interaction or a label indicating whether the pair interacts.
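The multi-instance labelling rule of the training set construction step can be sketched as follows. This is a minimal illustration, not the patented implementation; the function name `initial_samples`, the dictionary layout and the identifiers are assumptions made for the example:

```python
def initial_samples(bags, bag_labels):
    """Build the initially labelled isoform pairs.

    bags:       protein pair -> list of its isoform pairs (the instances)
    bag_labels: protein pair -> 1 (interacting) or 0 (sampled, non-interacting)
    """
    samples = []
    for protein_pair, instances in bags.items():
        label = bag_labels[protein_pair]
        if label == 1 and len(instances) == 1:
            # A positive bag with a single instance: that instance must interact.
            samples.append((instances[0], 1))
        elif label == 0:
            # A negative bag: every instance is labelled non-interacting.
            samples.extend((inst, 0) for inst in instances)
        # Positive bags with several instances stay unlabelled at this stage.
    return samples

# Hypothetical toy data.
bags = {
    ("A", "B"): ["A1-B1"],
    ("C", "D"): ["C1-D1", "C1-D2"],
    ("E", "F"): ["E1-F1", "E2-F1"],
}
bag_labels = {("A", "B"): 1, ("C", "D"): 1, ("E", "F"): 0}
labelled = initial_samples(bags, bag_labels)
print(labelled)
```

In this toy input the positive protein pair (C, D) contributes no initial sample, because it is not known which of its two isoform pairs interacts.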
Further, in the feature extraction step, for any protein isoform pair, the Pearson correlation coefficient of its expression data in any tissue is calculated as:
$$r = \frac{\sum_{i=1}^{m}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{m}(X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{m}(Y_i - \bar{Y})^2}}$$
where $X_i$ and $Y_i$ denote the expression levels of the two protein isoforms of the pair in the $i$-th sample of the tissue, $\bar{X}$ and $\bar{Y}$ denote the means of $X_i$ and $Y_i$, i.e. $\bar{X} = \frac{1}{m}\sum_{i=1}^{m} X_i$ and $\bar{Y} = \frac{1}{m}\sum_{i=1}^{m} Y_i$, and $m$ is the number of samples of the tissue.
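As a minimal sketch of this feature computation, the following computes one Pearson coefficient per tissue for a single isoform pair. The function names, the toy expression values and the handling of constant profiles are assumptions for illustration:

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    m = len(x)
    if m != len(y) or m < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / m
    mean_y = sum(y) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant profile as uncorrelated
    return cov / math.sqrt(var_x * var_y)

def pair_features(expr_a, expr_b):
    """One coefficient per tissue; expr_a[t] and expr_b[t] hold the
    per-sample expression values of the two isoforms in tissue t."""
    return [pearson(xa, xb) for xa, xb in zip(expr_a, expr_b)]

# Toy data: two tissues with four samples each.
iso_a = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]]
iso_b = [[2.0, 4.0, 6.0, 8.0], [3.0, 4.0, 1.0, 2.0]]
features = pair_features(iso_a, iso_b)
print(features)
```

With n tissues, `pair_features` returns the n raw coefficients of one isoform pair, i.e. one row of the matrix of correlation coefficients.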
Further, the feature extraction step specifically includes:
firstly, storing Pearson correlation coefficients of expression data of all protein isomer pairs in n tissues in a matrix M, wherein each row in M corresponds to one protein isomer pair, each column in M corresponds to one tissue, and each element in M represents the Pearson correlation coefficient of the expression data of one protein isomer pair in one tissue;
then, Fisher-Z transformation is carried out on the matrix M, and the formula is as follows:
$$Z = \frac{1}{2}\ln\frac{1+r}{1-r}$$
where $r$ is an element of the matrix M before transformation and $Z$ is the value obtained by applying the Fisher-Z transformation to $r$. The Fisher-Z transformation expands the range of the matrix elements from [-1, 1] to (-∞, +∞); this wider range of feature values helps in analysing how the feature values in different tissues influence the prediction result;
finally, the transformed matrix is denoted M'; for each protein isoform pair, the n elements of its row in M' are taken as its n feature values.
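The transformation of the matrix M into M' can be sketched as below. The clipping constant `eps` is an assumption added so that coefficients of exactly ±1 map to finite values; the source does not say how such values are handled:

```python
import math

def fisher_z(r, eps=1e-6):
    """Fisher-Z transform of a correlation coefficient r in [-1, 1].

    eps is an assumed clipping margin that keeps |r| = 1 finite.
    """
    r = max(-1.0 + eps, min(1.0 - eps, r))
    return 0.5 * math.log((1.0 + r) / (1.0 - r))

def transform_matrix(M):
    """Apply the Fisher-Z transform element-wise (rows = isoform pairs,
    columns = tissues)."""
    return [[fisher_z(r) for r in row] for row in M]

# Toy matrix: two isoform pairs, two tissues.
M = [[0.0, 0.5], [-0.9, 1.0]]
M_prime = transform_matrix(M)
print(M_prime)
```

Each row of `M_prime` is then the feature vector of one isoform pair.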
Further, the prediction model is a random forest model.
Further, the training set constructing and model training step comprises the following steps:
Step 1: form a set P from the protein pairs with interaction and set the label of each protein pair in P to 1; form a set N from the protein pairs without interaction and set the label of each protein pair in N to 0; take the union of P and N to obtain the protein interaction data set Q; for any protein isoform pair, if its corresponding protein pair is in Q, add the protein isoform pair to a set R;
the initial training set is $(F_0, L_0)$, and the iteration counter is t = 1;
Step 2: iteratively train a random forest model:
Step A: in the t-th iteration, first train a random forest model based on $(F_0, L_0)$; then input the feature data of each protein isoform pair in R into the trained random forest model to obtain a prediction result W, where W contains a score for each protein isoform pair in R, and a higher score indicates a higher likelihood that the pair interacts;
Step B: if t is less than 2, skip this step and go to Step C. Otherwise, judge whether the prediction result W has converged, i.e. whether the prediction result of the current (t-th) iteration differs only slightly from that of the previous ((t-1)-th) iteration; if W has converged, end the iteration and take the random forest model trained in this iteration as the trained prediction model; otherwise go to Step C;
Step C: select the data for the next iteration: for each protein pair with label 1 in Q, select its corresponding protein isoform pair with the highest score as a positive sample and set its label to 1; for each protein pair with label 0 in Q, select its corresponding protein isoform pair with the highest score as a negative sample and set its label to 0; take the selected protein isoform pairs as the new samples, their feature data as the new $F_0$ and their labels as the new $L_0$; set t = t + 1 and return to Step A for the next iteration.
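The loop of Steps A to C can be sketched as follows. This is a hedged illustration rather than the patented implementation: the classifier is passed in as a callable `train`, and the toy `centroid_trainer` merely stands in for the random forest so that the example runs with the standard library (a wrapper around a random forest implementation such as scikit-learn's `RandomForestClassifier` could be passed instead). The convergence measure (largest absolute score change below `tol`), the `max_iter` cap and all identifiers are assumptions:

```python
def centroid_trainer(X, y):
    """Toy stand-in for the random forest: scores a feature vector by its
    relative closeness to the centroid of the positive samples."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

    pos = centroid([x for x, lab in zip(X, y) if lab == 1])
    neg = centroid([x for x, lab in zip(X, y) if lab == 0])

    def score(x):
        d_pos, d_neg = dist(x, pos), dist(x, neg)
        return d_neg / (d_pos + d_neg) if d_pos + d_neg > 0 else 0.5
    return score

def mil_train(bags, bag_labels, features, samples, train, tol=1e-3, max_iter=20):
    """bags: protein pair -> isoform pairs (their union is the set R);
    bag_labels: protein pair -> 0/1 (the set Q);
    samples: initial list of (isoform pair, label), i.e. (F0, L0)."""
    previous = None
    for t in range(1, max_iter + 1):
        # Step A: train on the current samples, then score every pair in R.
        X = [features[i] for i, _ in samples]
        y = [label for _, label in samples]
        score = train(X, y)
        W = {i: score(features[i]) for insts in bags.values() for i in insts}
        # Step B: from the second iteration on, stop once the scores are stable.
        if previous is not None:
            if max(abs(W[i] - previous[i]) for i in W) < tol:
                return score, W, t
        previous = W
        # Step C: keep the top-scoring isoform pair of every protein pair.
        samples = [(max(insts, key=W.get), bag_labels[pair])
                   for pair, insts in bags.items()]
    return score, W, max_iter

# Hypothetical toy data with two features per isoform pair.
bags = {("A", "B"): ["a1b1"], ("C", "D"): ["c1d1", "c1d2"],
        ("E", "F"): ["e1f1", "e1f2"], ("G", "H"): ["g1h1"]}
bag_labels = {("A", "B"): 1, ("C", "D"): 1, ("E", "F"): 0, ("G", "H"): 0}
features = {"a1b1": [1.0, 0.9], "c1d1": [0.8, 1.1], "c1d2": [-0.2, 0.1],
            "e1f1": [0.3, 0.3], "e1f2": [-0.5, -0.5], "g1h1": [-0.4, -0.6]}
initial = [("a1b1", 1), ("e1f1", 0), ("e1f2", 0), ("g1h1", 0)]

model, W, iterations = mil_train(bags, bag_labels, features, initial, centroid_trainer)
print(iterations, W)
```

Passing the trainer as a parameter keeps the selection logic of Step C independent of the classifier, which makes the loop easy to test with a deterministic scorer.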
In another aspect, a device for predicting protein isoform pair interactions is provided, comprising the following modules:
A feature extraction module:
configured to combine the protein isoforms pairwise to form protein isoform pairs and, for each protein isoform pair, to determine its n feature values from the Pearson correlation coefficients of its expression data in each of n tissues;
A training set construction and model training module:
configured to obtain protein interaction data, which comprise protein pairs with interaction, to select from these the protein pairs that correspond to only one protein isoform pair, and to set the label of the corresponding protein isoform pair to 1;
to generate protein pairs without interaction by random sampling and to set the labels of all protein isoform pairs corresponding to these protein pairs to 0;
and to take the protein isoform pairs with determined labels as samples, denote the features of all samples as $F_0$ and the labels of all samples as $L_0$, and train a prediction model based on $(F_0, L_0)$;
A prediction module:
configured to input the feature data of a protein isoform pair to be classified into the trained prediction model to obtain its prediction result.
The working principle of each module of the device is described in the detailed description of the corresponding steps in the protein isomer pair interaction prediction method.
In another aspect, an electronic device is provided, which includes a memory and a processor, wherein the memory stores a computer program, and the computer program, when executed by the processor, causes the processor to implement the protein isoform pair interaction prediction method.
In another aspect, a computer readable storage medium is provided, on which a computer program is stored, which computer program, when executed by a processor, implements the protein isoform pair interaction prediction method.
Advantageous effects:
Based on existing protein isoform expression data and protein interaction data, the invention can accurately predict the interaction of protein isoform pairs. During training of the prediction model, the sample set is updated iteratively; this mitigates the large errors that a conventional machine learning model may produce when too few labelled samples are known, and yields better prediction results. Measuring isoform pair interactions by conventional biological methods consumes a large amount of manpower and material resources and can cover only a small number of interactions; the invention addresses this problem well. Given the raw data, the method can predict whether any protein isoform pair interacts, so that the pairs with a higher likelihood of interaction can be screened out before actual biological experiments and then verified experimentally, avoiding the consumption of a large amount of manpower and material resources.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
FIG. 2 shows precision-recall curves obtained by dividing the gene expression data downloaded from GTEx into 29 tissues and predicting with the MIL-RF model; FIGS. 2(a), 2(b) and 2(c) correspond to the prediction results for version1, version2 and version3, respectively.
Detailed Description
The invention will be described in further detail below with reference to the following figures and specific examples:
Example 1:
This embodiment provides a method for predicting protein isoform pair interactions, comprising the following steps:
A feature extraction step:
combining the protein isoforms pairwise to form protein isoform pairs; for each protein isoform pair, determining its n feature values from the Pearson correlation coefficients of its expression data in each of n tissues;
A training set construction and model training step:
downloading protein interaction data from a public database, the data comprising protein pairs with interaction; forming a set P from these protein pairs and setting the label of each protein pair in P to 1; generating protein pairs without interaction by random sampling, forming a set N from them and setting the label of each protein pair in N to 0; taking the union of P and N to obtain the protein interaction data set Q;
from the protein pairs in P, selecting those that correspond to only one protein isoform pair and setting the label of the corresponding protein isoform pair to 1; for the protein pairs without interaction, setting the labels of all corresponding protein isoform pairs to 0;
taking the protein isoform pairs with determined labels as samples; denoting the features of all samples as $F_0$ and the labels of all samples as $L_0$; training a prediction model based on $(F_0, L_0)$;
A prediction step:
inputting the feature data of a protein isoform pair to be classified into the trained prediction model to obtain its prediction result.
Example 2:
This example builds on Example 1. In the feature extraction step, for any protein isoform pair, the Pearson correlation coefficient of its expression data in any tissue is calculated as:
$$r = \frac{\sum_{i=1}^{m}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{m}(X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{m}(Y_i - \bar{Y})^2}}$$
where $X_i$ and $Y_i$ denote the expression levels of the two protein isoforms of the pair in the $i$-th sample of the tissue, $\bar{X}$ and $\bar{Y}$ denote the means of $X_i$ and $Y_i$, i.e. $\bar{X} = \frac{1}{m}\sum_{i=1}^{m} X_i$ and $\bar{Y} = \frac{1}{m}\sum_{i=1}^{m} Y_i$, and $m$ is the number of samples of the tissue.
Example 3:
in this embodiment, on the basis of embodiment 2, the feature extraction step specifically includes:
firstly, storing Pearson correlation coefficients of expression data of all protein isomer pairs in n tissues in a matrix M, wherein each row in M corresponds to one protein isomer pair, each column in M corresponds to one tissue, and each element in M represents the Pearson correlation coefficient of the expression data of one protein isomer pair in one tissue;
then, Fisher-Z transformation is carried out on the matrix M, and the formula is as follows:
$$Z = \frac{1}{2}\ln\frac{1+r}{1-r}$$
where $r$ is an element of the matrix M before transformation and $Z$ is the value obtained by applying the Fisher-Z transformation to $r$;
finally, the transformed matrix is denoted M'; for each protein isoform pair, the n elements of its row in M' are taken as its n feature values.
Example 4:
in this embodiment, based on embodiment 3, the prediction model is a random forest model.
Example 5:
in this embodiment, on the basis of embodiment 4, the training set constructing and model training step includes the following steps:
Step 1: for any protein isoform pair, if its corresponding protein pair is in the protein interaction data set Q, add the protein isoform pair to a set R; the initial training set is $(F_0, L_0)$, and the iteration counter is t = 1;
Step 2: iteratively train a random forest model:
Step A: in the t-th iteration, first train a random forest model based on $(F_0, L_0)$; then input the feature data of each protein isoform pair in R into the trained random forest model to obtain a prediction result W, where W contains a score for each protein isoform pair in R, and a higher score indicates a higher likelihood that the pair interacts;
Step B: if t is greater than or equal to 2, judge whether the prediction result W has converged; if W has converged, end the iteration and take the random forest model trained in this iteration as the trained prediction model; otherwise go to Step C;
Step C: select the data for the next iteration: for each protein pair with label 1 in Q, select its corresponding protein isoform pair with the highest score as a positive sample and set its label to 1; for each protein pair with label 0 in Q, select its corresponding protein isoform pair with the highest score as a negative sample and set its label to 0; take the selected protein isoform pairs as the new samples, their feature data as the new $F_0$ and their labels as the new $L_0$; set t = t + 1 and return to Step A for the next iteration.
Example 6:
This embodiment provides a device for predicting protein isoform pair interactions, comprising the following modules:
A feature extraction module:
configured to combine the protein isoforms pairwise to form protein isoform pairs and, for each protein isoform pair, to determine its n feature values from the Pearson correlation coefficients of its expression data in each of n tissues;
A training set construction and model training module:
configured to download protein interaction data from a public database, the data comprising protein pairs with interaction; to form a set P from these protein pairs and set the label of each protein pair in P to 1; to generate protein pairs without interaction by random sampling, form a set N from them and set the label of each protein pair in N to 0; and to take the union of P and N to obtain the protein interaction data set Q;
to select, from the protein pairs in P, those that correspond to only one protein isoform pair and set the label of the corresponding protein isoform pair to 1; and, for the protein pairs without interaction, to set the labels of all corresponding protein isoform pairs to 0;
and to take the protein isoform pairs with determined labels as samples, denote the features of all samples as $F_0$ and the labels of all samples as $L_0$, and train a prediction model based on $(F_0, L_0)$;
A prediction module:
configured to input the feature data of a protein isoform pair to be classified into the trained prediction model to obtain its prediction result.
The working principle of each module in this embodiment refers to the detailed description of the corresponding steps in embodiments 1 to 5.
Example 7:
the present embodiment provides an electronic device, including a memory and a processor, where the memory stores a computer program, and when the computer program is executed by the processor, the processor is enabled to implement the method according to any one of embodiments 1 to 5.
Example 8:
the present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to any of embodiments 1-5.
Experimental verification:
Experiment 1:
A simulation test was performed to verify the feasibility of Example 5 above by comparing the AUC and AUPRC values of the experimental results under different parameter settings.
First, generating simulated protein isoform data;
The parameters MD ∈ {0.1, 0.2, 0.3} and MGR ∈ {0.2, 0.3, 0.5} are combined pairwise, where MD is the difference in average expression level between interacting and non-interacting protein isoform pairs, and MGR is the proportion of proteins that have multiple isoforms among all proteins. Simulated protein isoform data are generated randomly; the data contain the protein isoforms and their expression levels in different tissues. The three MD values and three MGR values give 9 groups of simulated data of equal size. The feature dimension of the simulated data is set to 50, with each dimension representing a different tissue.
Then, the protein isomer pair interaction prediction method provided in example 5 was used to train a prediction model;
In the training set construction and model training step, the protein isoform pairs with determined labels are taken as samples. All samples are divided at random into 5 equal groups; the feature data of the groups are denoted $F_1, F_2, \ldots, F_5$ and the corresponding labels $L_1, L_2, \ldots, L_5$. Over 5 experiments, 4 groups are used in turn as the training set $(F_0, L_0)$ and the remaining group as the test set, giving a 5-fold cross-validation of the accuracy of the algorithm.
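The 5-fold split described above can be sketched as follows; the function name, the fixed random seed and the sample identifiers are assumptions for illustration:

```python
import random

def k_fold_splits(samples, k=5, seed=0):
    """Yield (train, test) lists; each of the k random groups is the
    test set exactly once and the other k - 1 groups form the training set."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)          # fully random assignment
    folds = [indices[i::k] for i in range(k)]     # k groups of near-equal size
    for i in range(k):
        test = [samples[j] for j in folds[i]]
        train = [samples[j] for f in range(k) if f != i for j in folds[f]]
        yield train, test

# Hypothetical labelled isoform pairs.
samples = [f"pair{i}" for i in range(23)]
splits = list(k_fold_splits(samples))
for train, test in splits:
    print(len(train), len(test))
```

When the number of samples is not a multiple of 5, the group sizes differ by at most one.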
In each experiment, the feature data of the test-set samples are input into the prediction model trained in that experiment to obtain their prediction results, and the accuracy of the model is calculated from these prediction results and the true labels of the test-set samples. The calculated accuracy shows that the algorithm is highly accurate.
The prediction results of the 5 experiments are combined to obtain the predicted labels of all samples. All samples are divided into 9 groups according to the values of MD and MGR, and the AUC and AUPRC of each group are computed. AUC is the area under the ROC curve and AUPRC is the area under the precision-recall (PR) curve; for both, a larger value indicates better prediction (classification) performance. The method provided by the embodiment of the invention (called MIL-RF) is compared with the existing method, which builds a protein isoform pair interaction prediction model with logistic regression (abbreviated LR); see Tables 1-3. For every combination of MD and MGR, the AUC and AUPRC of MIL-RF are larger than those of LR, which shows that the method provided by the embodiment has good prediction performance and high accuracy. Moreover, as the sample information becomes richer, i.e. as MD and MGR increase, the accuracy of the method increases, and both AUC and AUPRC reach their maximum at MD = 0.3 and MGR = 0.5. This is consistent with expectation and also verifies the feasibility of the scheme.
TABLE 1. AUC and AUPRC values of the two methods for MD = 0.1 and MGR = 0.2, 0.3 and 0.5
MD    MGR   Metric   MIL-RF         LR
0.1   0.2   AUC      0.823±0.009    0.782±0.011
0.1   0.2   AUPRC    0.230±0.015    0.171±0.013
0.1   0.3   AUC      0.866±0.008    0.827±0.010
0.1   0.3   AUPRC    0.292±0.017    0.220±0.016
0.1   0.5   AUC      0.933±0.004    0.904±0.006
0.1   0.5   AUPRC    0.477±0.019    0.375±0.016
TABLE 2. AUC and AUPRC values of the two methods for MD = 0.2 and MGR = 0.2, 0.3 and 0.5
MD    MGR   Metric   MIL-RF         LR
0.2   0.2   AUC      0.9088±0.006   0.894±0.006
0.2   0.2   AUPRC    0.442±0.016    0.395±0.014
0.2   0.3   AUC      0.929±0.007    0.916±0.007
0.2   0.3   AUPRC    0.500±0.020    0.450±0.016
0.2   0.5   AUC      0.961±0.003    0.952±0.003
0.2   0.5   AUPRC    0.638±0.012    0.583±0.010
TABLE 3. AUC and AUPRC values of the two methods for MD = 0.3 and MGR = 0.2, 0.3 and 0.5
MD    MGR   Metric   MIL-RF         LR
0.3   0.2   AUC      0.961±0.001    0.856±0.003
0.3   0.2   AUPRC    0.688±0.013    0.660±0.014
0.3   0.3   AUC      0.970±0.002    0.965±0.002
0.3   0.3   AUPRC    0.730±0.013    0.702±0.014
0.3   0.5   AUC      0.984±0.002    0.981±0.002
0.3   0.5   AUPRC    0.823±0.011    0.795±0.012
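The two metrics reported in the tables above can be computed from labels and scores as in the following sketch. It is not the evaluation code used for the tables: AUC is computed by direct pairwise comparison (quadratic in the number of samples), and AUPRC is approximated by average precision, which is one common estimator and is an assumption here. The toy labels and scores are invented for illustration:

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative
    (ties count one half); equals the area under the ROC curve."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive,
    a step-wise estimate of the area under the PR curve.
    Assumption: tied scores are ranked in input order."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / n_pos

# Invented toy predictions.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(round(auc, 4), round(ap, 4))
```

Because interacting pairs are typically much rarer than non-interacting ones, AUPRC is usually the more sensitive of the two metrics, which matches the larger gaps between the methods in the AUPRC rows.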
Experiment 2:
This section describes an experiment using protein isoform expression data downloaded from GTEx.
First, protein isoform expression data of 29 different tissues (e.g., head, chest, blood) are obtained, and a protein isoform expression data matrix is constructed for each tissue, the matrices being denoted E1, E2, ..., En; for any tissue, each row of the corresponding matrix corresponds to one protein isomer, each column corresponds to one sample, and each element represents the expression level of one protein isomer in one sample of the tissue;
for each tissue, protein isomers with an expression rate greater than 50% are retained for subsequent analysis; for any tissue, the expression rate of a protein isomer is calculated as follows: if the tissue has m samples and the expression level of the protein isomer is greater than 1 in k of them, the expression rate of the protein isomer in that tissue is k/m.
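The expression-rate screening described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: one tissue is represented as a dictionary mapping a protein isomer identifier to its list of expression levels across the m samples, and the names `expression_rate` and `filter_isoforms` are hypothetical.

```python
def expression_rate(expression_levels, threshold=1.0):
    """Return k/m: the fraction of samples whose expression level exceeds threshold."""
    m = len(expression_levels)
    if m == 0:
        raise ValueError("a tissue must contain at least one sample")
    k = sum(1 for value in expression_levels if value > threshold)
    return k / m


def filter_isoforms(tissue_matrix, min_rate=0.5):
    """Keep the rows (protein isomers) whose expression rate is greater than min_rate."""
    return {
        isoform: row
        for isoform, row in tissue_matrix.items()
        if expression_rate(row) > min_rate
    }


if __name__ == "__main__":
    # Hypothetical tissue with three protein isomers and four samples.
    tissue = {
        "iso_a": [2.0, 3.0, 0.5, 1.5],  # rate 3/4, kept
        "iso_b": [0.0, 0.2, 5.0, 0.9],  # rate 1/4, dropped
        "iso_c": [1.0, 2.0, 0.0, 3.0],  # rate 2/4, not greater than 50%, dropped
    }
    print(list(filter_isoforms(tissue)))
```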
Then, the protein isomer pair interaction prediction method provided in example 5 was used to train a prediction model;
similarly, in the training set construction and model training step, the protein isomer pairs with determined labels are taken as samples; all samples are divided completely at random into 5 equal groups, the feature data of the groups being denoted F1, F2, …, F5 and the corresponding labels L1, L2, …, L5. Over 5 experiments, 4 groups are used in turn as the training set (F0, L0) and the remaining group as the test set, giving a 5-fold cross-validation of the accuracy of the algorithm.
In each experiment, the feature data of the test-set samples are input into the prediction model trained in that experiment to obtain their prediction results, and the accuracy of the model is calculated from these prediction results and the true labels of the test-set samples. The calculated accuracy shows that the algorithm is highly accurate.
Below, the accuracy of protein pair interaction prediction is used to support the accuracy of protein isomer pair interaction prediction. Precision-recall curves are drawn from the prediction results obtained in step (5) and compared; see FIG. 2. Three types of data are used in FIG. 2: version1 contains only the protein pairs in set Q that correspond to exactly one protein isomer pair, version2 contains only the protein pairs in set Q that correspond to two or more protein isomer pairs, and version3 contains all protein pairs in set Q.
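A minimal sketch of the random 5-group division and the rotation of training and test sets is given below. It works on sample indices only, the function name `k_fold_splits` and the `seed` parameter are assumptions introduced for illustration, and when the number of samples is not a multiple of 5 the group sizes differ by at most one.

```python
import random


def k_fold_splits(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices, cut them into n_folds groups, rotate the test group."""
    if n_samples < n_folds:
        raise ValueError("need at least one sample per fold")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # completely random assignment to groups
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for held_out in range(n_folds):
        test = folds[held_out]
        # The other n_folds - 1 groups together form the training set.
        train = [i for f in range(n_folds) if f != held_out for i in folds[f]]
        splits.append((train, test))
    return splits


if __name__ == "__main__":
    for train, test in k_fold_splits(10):
        print(sorted(train), sorted(test))
```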
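The model trained in each experiment follows the iterative scheme recited in claim 5: train, score every protein isomer pair in set R, stop once the scores have converged, and otherwise relabel the highest-scoring protein isomer pair of each protein pair in set Q and retrain. The sketch below shows only that control flow and rests on several assumptions that the source does not fix: the model is supplied as a hypothetical `train_model` callable that returns a scoring function, convergence is taken to mean that no score changes by more than `tol` between two iterations, and `max_iter` is an added safeguard.

```python
def iterative_training(init_features, init_labels, bags, bag_labels,
                       features_of, train_model, tol=1e-3, max_iter=20):
    """Iteratively retrain on the top-scoring isomer pair of each protein pair.

    bags:        protein pair -> list of its protein isomer pairs (set R, grouped)
    bag_labels:  protein pair -> 0 or 1 (set Q)
    features_of: protein isomer pair -> feature vector
    train_model: callable (features, labels) -> function scoring one feature vector
    """
    if max_iter < 1 or not bags or any(len(ips) == 0 for ips in bags.values()):
        raise ValueError("need max_iter >= 1 and non-empty bags")
    features, labels = list(init_features), list(init_labels)
    previous = None
    for t in range(1, max_iter + 1):
        # Step A: train on the current (F0, L0) and score every pair in R.
        score = train_model(features, labels)
        current = {ip: score(features_of[ip]) for ips in bags.values() for ip in ips}
        # Step B: from the second iteration on, stop when the scores have converged.
        if previous is not None and all(
                abs(current[ip] - previous[ip]) <= tol for ip in current):
            break
        previous = current
        # Step C: keep the highest-scoring isomer pair of each protein pair and
        # give it the label of that protein pair.
        features, labels = [], []
        for protein_pair, ips in bags.items():
            best = max(ips, key=lambda ip: current[ip])
            features.append(features_of[best])
            labels.append(bag_labels[protein_pair])
    return score, current, t
```

With scikit-learn, `train_model` could for instance fit a `RandomForestClassifier` and return a function that reads the positive-class probability from `predict_proba`; that wiring is left out here so the sketch stays free of external dependencies.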
The protein pairs in version1 are randomly sampled according to the positive-to-negative sample ratio p of version3 (the ratio of protein pairs labeled 1 to protein pairs labeled 0), selecting the largest possible set of protein pairs whose positive-to-negative ratio is p. For each selected protein pair, the feature data of its corresponding protein isomer pair are input into the trained prediction model to obtain the score of that protein isomer pair, which is taken as the prediction score of the protein pair. The true labels and prediction scores of all selected protein pairs are collected, and the corresponding precision-recall curve is drawn as follows: all protein pair samples are sorted in descending order of prediction score; each sample score is taken in turn, from largest to smallest, as a threshold; samples scoring above the threshold are predicted to be positive and the rest negative; the predictions are compared with the true labels to obtain the corresponding TP and FP, from which the Precision and Recall at that threshold are calculated; once all samples have been traversed, all (Precision, Recall) points of the precision-recall curve are available and the curve is drawn. The random sampling and prediction process is repeated 100 times, giving 100 precision-recall curves, as shown in FIG. 2(a).
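The curve-drawing procedure described above can be sketched as follows. This is an illustration rather than the code used to produce FIG. 2: the function name `precision_recall_points` is hypothetical, the sample whose score serves as the threshold is itself counted as predicted positive (an assumption, since otherwise the first point would have no predicted positives), and tied scores are not merged.

```python
def precision_recall_points(labels, scores):
    """Return the (precision, recall) point obtained at each score threshold."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])  # descending scores
    n_pos = sum(1 for _, y in ranked if y == 1)
    if n_pos == 0:
        raise ValueError("need at least one positive sample")
    points = []
    tp = fp = 0
    for _, y in ranked:
        # Lowering the threshold to this sample's score adds it to the
        # predicted positives: a true positive if y == 1, else a false positive.
        if y == 1:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / n_pos  # n_pos equals TP + FN
        points.append((precision, recall))
    return points


if __name__ == "__main__":
    for precision, recall in precision_recall_points([1, 0, 1], [0.9, 0.5, 0.4]):
        print(round(precision, 3), round(recall, 3))
```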
The protein pairs in version2 are randomly sampled according to the positive-to-negative sample ratio p of version3, selecting the largest possible set of protein pairs whose positive-to-negative ratio is p. For each selected protein pair, the feature data of each of its corresponding protein isomer pairs are input into the trained prediction model to obtain the score of each such protein isomer pair, and the highest score is taken as the prediction score of the protein pair. The true labels and prediction scores of all selected protein pairs are collected, and the corresponding precision-recall curve is drawn in the same way as described above. The random sampling and prediction process is repeated 100 times, giving 100 precision-recall curves, as shown in FIG. 2(b), in which most of the curves overlap and differ only slightly.
For each protein pair in version3, the feature data of each of its corresponding protein isomer pairs are input into the trained prediction model to obtain the score of each such protein isomer pair, and the highest score is taken as the prediction score of the protein pair. The true labels and prediction scores of all protein pairs in version3 are collected, and the corresponding precision-recall curve is drawn, as shown in FIG. 2(c).
The precision-recall curves drawn in FIG. 2 for the protein pairs in version1, version2 and version3 show that a protein pair given a high prediction score by the above experimental method has a high probability of being a true positive sample (corresponding to the points of the curve with high precision and low recall). This indicates that the experimental method can accurately screen out protein pairs that are very likely to interact, which in turn supports the conclusion that the method provided in embodiment 5 of the present invention can accurately screen out protein isomer pairs that are very likely to interact. Moreover, each protein pair in version1 corresponds to a unique protein isomer pair, so the precision-recall curve drawn for the protein pairs selected from version1 is in fact a precision-recall curve for the prediction results of the corresponding protein isomer pairs. This directly shows that the method provided in embodiment 5 of the present invention can accurately screen out protein isomer pairs that are likely to interact, and that its prediction performance is good.
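The rule used for version2 and version3, in which a protein pair takes the highest score among its protein isomer pairs, can be sketched as below. The dictionary layout and the name `protein_pair_scores` are assumptions made for the example; protein pairs with no scored protein isomer pair are simply skipped.

```python
def protein_pair_scores(isoform_pair_scores, isoform_pairs_of):
    """Score each protein pair by the maximum score of its protein isomer pairs.

    isoform_pair_scores: protein isomer pair -> model score
    isoform_pairs_of:    protein pair -> list of its protein isomer pairs
    """
    scores = {}
    for protein_pair, isoform_pairs in isoform_pairs_of.items():
        if not isoform_pairs:
            continue  # nothing to score for this protein pair
        scores[protein_pair] = max(isoform_pair_scores[ip] for ip in isoform_pairs)
    return scores


if __name__ == "__main__":
    # Hypothetical scores for three protein isomer pairs of two protein pairs.
    iso_scores = {"P1a-P2a": 0.2, "P1a-P2b": 0.7, "P3a-P4a": 0.4}
    mapping = {"P1-P2": ["P1a-P2a", "P1a-P2b"], "P3-P4": ["P3a-P4a"]}
    print(protein_pair_scores(iso_scores, mapping))
```

For a version1 protein pair, which has a single protein isomer pair, the maximum reduces to that pair's own score, so the same function covers all three data sets.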

Claims (8)

1. A method for predicting a protein isoform pair interaction, comprising the steps of:
a feature extraction step:
combining the protein isomers pairwise to form protein isomer pairs; for each protein isomer pair, determining its n feature values based on the Pearson correlation coefficients of its expression data in n tissues, respectively;
a training set construction and model training step:
obtaining protein interaction data, which comprise protein pairs that interact; from these, selecting the protein pairs that correspond to only one protein isomer pair, and setting the label of the corresponding protein isomer pair to 1;
generating protein pairs that do not interact by random sampling, and setting the labels of all protein isomer pairs corresponding to these protein pairs to 0;
taking the protein isomer pairs with determined labels as samples; denoting the feature data of all samples as F0 and the labels of all samples as L0; training a prediction model based on (F0, L0);
a prediction step:
inputting the feature data of the protein isomer pair to be classified into the trained prediction model to obtain a prediction result for the protein isomer pair.
2. The method for predicting protein isoform pair interaction according to claim 1, wherein in said feature extraction step the Pearson correlation coefficient of the expression data of any protein isoform pair in any tissue is calculated as follows:
$$r = \frac{\sum_{i=1}^{m}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{m}(X_i-\bar{X})^2}\,\sqrt{\sum_{i=1}^{m}(Y_i-\bar{Y})^2}}$$
wherein $X_i$ and $Y_i$ respectively represent the expression levels of the two protein isomers of the protein isomer pair in the i-th sample of the tissue, $\bar{X}$ and $\bar{Y}$ respectively represent the mean values of $X_i$ and $Y_i$, i.e. $\bar{X}=\frac{1}{m}\sum_{i=1}^{m}X_i$ and $\bar{Y}=\frac{1}{m}\sum_{i=1}^{m}Y_i$, and m is the number of samples of the tissue.
3. The method for predicting protein isoform pair interactions according to claim 1, wherein said feature extraction step comprises:
firstly, storing Pearson correlation coefficients of expression data of all protein isomer pairs in n tissues in a matrix M, wherein each row in M corresponds to one protein isomer pair, each column in M corresponds to one tissue, and each element in M represents the Pearson correlation coefficient of the expression data of one protein isomer pair in one tissue;
then, Fisher-Z transformation is applied to the matrix M according to the formula:
$$Z = \frac{1}{2}\ln\frac{1+r}{1-r}$$
wherein r represents the value of an element of the matrix M before transformation, and Z is the value obtained after Fisher-Z transformation of r;
finally, denoting the matrix obtained after transformation as M'; for each protein isomer pair, taking the n elements of its corresponding row in M' as its n feature values.
4. The method of predicting protein isoform pair interactions according to claim 1, wherein said prediction model is a random forest model.
5. The method of predicting protein isoform pair interactions according to claim 4, wherein said training set construction and model training step comprises the following steps:
step 1, for protein pairs that interact, setting the label to 1; for protein pairs that do not interact, setting the label to 0; forming a protein interaction data set Q from the protein pairs labeled 1 and the protein pairs labeled 0; for any protein isomer pair, if its corresponding protein pair is present in the protein interaction data set Q, adding the protein isomer pair to a set R;
the initial training set is (F0, L0), and the iteration number t is initialized to 1;
step 2, iteratively training a random forest model:
step A: in the t-th iteration, first training a random forest model based on (F0, L0); then, for each protein isomer pair in the set R, inputting its feature data into the trained random forest model to obtain a prediction result W, wherein W comprises the score of each protein isomer pair in the set R, a higher score indicating a higher likelihood that the corresponding protein isomer pair interacts;
step B: if the iteration number is greater than or equal to 2, judging whether the prediction result W has converged; if so, ending the iteration and taking the random forest model trained in this iteration as the trained prediction model; otherwise, performing step C;
step C: selecting the data for the next iteration, namely: for each protein pair labeled 1 in the set Q, selecting its corresponding protein isomer pair with the highest score and setting the label of that protein isomer pair to 1; for each protein pair labeled 0 in the set Q, selecting its corresponding protein isomer pair with the highest score and setting the label of that protein isomer pair to 0; taking the selected protein isomer pairs as new samples, their feature data as the new F0 and their labels as the new L0; setting t = t + 1 and returning to step A for the next iteration.
6. A device for predicting protein isoform pair interaction, comprising the following modules:
a feature extraction module:
for combining the protein isomers pairwise to form protein isomer pairs and, for each protein isomer pair, determining its n feature values based on the Pearson correlation coefficients of its expression data in n tissues, respectively;
a training set construction and model training module:
for obtaining protein interaction data, which comprise protein pairs that interact, selecting from these the protein pairs that correspond to only one protein isomer pair, and setting the label of the corresponding protein isomer pair to 1;
generating protein pairs that do not interact by random sampling, and setting the labels of all protein isomer pairs corresponding to these protein pairs to 0;
taking the protein isomer pairs with determined labels as samples; denoting the feature data of all samples as F0 and the labels of all samples as L0; training a prediction model based on (F0, L0);
a prediction module:
for inputting the feature data of the protein isomer pair to be classified into the trained prediction model to obtain a prediction result for the protein isomer pair.
7. An electronic device comprising a memory and a processor, the memory having stored therein a computer program, wherein the computer program, when executed by the processor, causes the processor to implement the method of any of claims 1-5.
8. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1 to 5.
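
By way of illustration only, and not as part of the claims, the feature extraction recited in claims 2 and 3 can be sketched in Python as follows. The sketch assumes that the expression data of each protein isomer are available as one list of expression levels per tissue; the names `pearson`, `fisher_z` and `pair_features` are hypothetical, and clipping r slightly inside (-1, 1) is an added assumption that keeps Z finite when two expression profiles are perfectly correlated.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient r of two equally long expression profiles."""
    m = len(x)
    if m != len(y) or m < 2:
        raise ValueError("profiles must have the same length, at least 2")
    mean_x, mean_y = sum(x) / m, sum(y) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


def fisher_z(r, eps=1e-6):
    """Fisher-Z transformation Z = 0.5 * ln((1 + r) / (1 - r))."""
    r = max(-1.0 + eps, min(1.0 - eps, r))  # assumption: clip to keep Z finite
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def pair_features(profiles_a, profiles_b):
    """One feature value per tissue for a protein isomer pair (one row of M')."""
    return [fisher_z(pearson(a, b)) for a, b in zip(profiles_a, profiles_b)]


if __name__ == "__main__":
    # Hypothetical expression levels of two protein isomers in two tissues.
    isomer_a = [[1.0, 2.0, 3.0, 4.0], [5.0, 1.0, 4.0, 2.0]]
    isomer_b = [[2.0, 1.0, 4.0, 3.0], [1.0, 5.0, 2.0, 4.0]]
    print(pair_features(isomer_a, isomer_b))
```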
CN202010157694.5A 2020-03-09 2020-03-09 Protein isomer pair interaction prediction method and device Active CN111370068B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010157694.5A CN111370068B (en) 2020-03-09 2020-03-09 Protein isomer pair interaction prediction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010157694.5A CN111370068B (en) 2020-03-09 2020-03-09 Protein isomer pair interaction prediction method and device

Publications (2)

Publication Number Publication Date
CN111370068A true CN111370068A (en) 2020-07-03
CN111370068B CN111370068B (en) 2022-11-04

Family

ID=71210433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010157694.5A Active CN111370068B (en) 2020-03-09 2020-03-09 Protein isomer pair interaction prediction method and device

Country Status (1)

Country Link
CN (1) CN111370068B (en)

Also Published As

Publication number Publication date
CN111370068B (en) 2022-11-04

Similar Documents

Publication Publication Date Title
CN112735535B (en) Prediction model training method, prediction model training device, data prediction method, data prediction device and storage medium
CN113299346B (en) Classification model training and classifying method and device, computer equipment and storage medium
WO2023035745A1 (en) Olfactory receptor screening method and apparatus, model training method and apparatus, and wine product identification method and apparatus
CN108537005B (en) A kind of crucial lncRNA prediction technique based on BPSO-KNN model
CN109559781A (en) A kind of two-way LSTM and CNN model that prediction DNA- protein combines
Golugula et al. Evaluating feature selection strategies for high dimensional, small sample size datasets
CN114881343B (en) Short-term load prediction method and device for power system based on feature selection
CN115798730A (en) Method, apparatus and medium for circular RNA-disease association prediction based on weighted graph attention and heterogeneous graph neural networks
CN106951728B (en) Tumor key gene identification method based on particle swarm optimization and scoring criterion
CN112017730B (en) Cell screening method and device based on expression quantity prediction model
CN111048145B (en) Method, apparatus, device and storage medium for generating protein prediction model
CN112163632A (en) Application of semi-supervised extreme learning machine based on bat algorithm in industrial detection
CN111370068B (en) Protein isomer pair interaction prediction method and device
CN110400605A (en) A kind of the ligand bioactivity prediction technique and its application of GPCR drug targets
CN112102882B (en) Quality control system and method for NGS detection process of tumor sample
CN113936804B (en) System for constructing model for predicting risk of continuous air leakage after lung cancer resection
CN114819151A (en) Biochemical path planning method based on improved agent-assisted shuffled frog leaping algorithm
CN108595914A (en) One grows tobacco mitochondrial RNA (mt RNA) editing sites high-precision forecasting method
CN104636636A (en) Protein remote homology detecting method and device
CN111104950A (en) K value prediction method and device in k-NN algorithm based on neural network
CN113035363B (en) Probability density weighted genetic metabolic disease screening data mixed sampling method
Chin et al. Optimized local protein structure with support vector machine to predict protein secondary structure
CN112885409B (en) Colorectal cancer protein marker selection system based on feature selection
CN116994652B (en) Information prediction method and device based on neural network and electronic equipment
US20230116904A1 (en) Selecting a cell line for an assay

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant