Background
Protein-protein interactions (PPIs) are fundamental to many bioinformatics applications, such as protein function prediction and drug discovery. Accurate prediction of protein-protein interactions therefore helps us understand underlying molecular mechanisms and significantly facilitates drug discovery. PPIs can be predicted more accurately with Gene Ontology (GO) information. Most previous studies that use gene ontology information to predict PPIs have relied on Information Content (IC). Recently, some studies have applied word embedding techniques from natural language processing to learn vectors representing GO terms and proteins for PPI prediction.
Gene ontology is a standard vocabulary for biological functional annotation: a uniform set of terms used to describe the function of homologous genes and gene products across species. The invention uses a supervised sentence embedding technique to capture GO structure and GO annotation information to predict PPIs. By combining gene ontology with powerful natural language processing techniques, our method provides a general computational pipeline to predict protein-protein interactions even without protein sequence information.
Disclosure of Invention
The invention aims to provide a protein-protein interaction prediction method based on the sentence embedding InferSent model, which predicts protein-protein interactions (PPIs) by combining the natural language processing model InferSent with the Gene Ontology (GO). In the method, each record of a GO annotation axiom has a corresponding weight; a PPI positive and negative data set is trained on a model based on the sentence embedding InferSent, combining GO annotation axioms and GO structure axioms, to finally obtain a model for predicting PPIs.
In order to achieve this purpose, the invention is realized by the following technical scheme:
a method for predicting protein-protein interactions based on a sentence embedding InferSent model, comprising the steps of:
s1, the GO ontology is constructed as a graph, wherein GO terms serve as nodes and the relationships between GO terms serve as edges. GO structure axioms are extracted and generated from the GO graph structure file using the existing Onto2Vec technology, and the GO structure axioms are trained to obtain GO term word vectors;
s2, screening and extracting annotation axioms: screening and extracting each GO annotation record with its corresponding weight from a Gene Ontology Annotation (GOA) file to generate GO annotation axioms;
s3, combining the GO annotation axioms from step S2, replacing the proteins of the PPI positive and negative data set, line by line, with the GO terms annotated to them to obtain the final training data;
s4, modifying the InferSent model into an InfersentPPI model, combining the GO term word vectors from step S1, iteratively training the InfersentPPI model on the training data from step S3 to finally obtain a model for predicting PPIs, and outputting the PPI prediction results.
Preferably, the step S1 further includes the steps of: s1.1, extracting the GO graph structure records from the go.owl file and organizing them into a GO structure axiom file; s1.2, inputting the GO structure axiom file from step S1.1, line by line, into the skip-gram model of Word2vec;
s1.3, training in a skip-gram model as follows:
given a sequence of training words $w_1, w_2, \ldots, w_T$, the objective of the skip-gram model is to maximize:

\[ \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t) \]

where c is the size of the training context window, T is the size of the training word sequence, and $w_t$ is the t-th training word in the sequence;
s1.4, after training is finished, obtaining the word vectors of the GO terms and organizing them into a file for output;
Preferably, the step S2 further includes the steps of: s2.1, screening each record of a Gene Ontology Annotation (GOA) file to be processed according to the content of its Evidence Code field, deleting records whose Evidence Code field is 'IEA' or 'ND' to obtain a screened GOA file; extracting the UniProtKB unique identification code and GO unique identification code of each record of the screened GOA file to obtain a GO annotation record file, wherein repeated records in the GO annotation record file are not deleted; the number of repetitions of a record represents the number of effective references for that annotation and serves as the weight of the corresponding annotation record; s2.2, extracting, for each UniProtKB unique identification code in the GO annotation record file from step S2.1, all of its corresponding GO unique identification codes, collecting them on the same line, and organizing them into a file to obtain a GO annotation axiom file;
Preferably, the step S3 further includes the steps of: s3.1, extracting the pair of proteins recorded in each line of a protein-protein interaction (PPI) positive and negative dataset and mapping them to two UniProtKB unique identification codes, deleting protein pairs in which a protein cannot be mapped to a UniProtKB unique identification code, generating an attribute label 'positive' or 'negative' for each protein pair according to the nature of the dataset, and organizing the protein pairs and attribute labels into a PPI record file, wherein each line of the PPI record file consists of two UniProtKB unique identification codes and an attribute label; s3.2, using the GO annotation axioms from step S2, replacing the proteins in the PPI record file from step S3.1, line by line, with the GO unique identification codes annotated to them to obtain a PPI corpus for training the model;
s3.3, from the PPI corpus in step S3.2, randomly selecting 80%, 10% and 10% of the records as the training set, validation set and test set, respectively, to form the final training data.
Preferably, the step S4 further includes the steps of: s4.1, modifying the InferSent model by setting its sentence encoder to a convolutional neural network and its classifier to a two-class classifier with labels 'positive' and 'negative', obtaining the InfersentPPI model; s4.2, combining the GO term word vectors from step S1, iteratively training the InfersentPPI model from step S4.1 on the training data from step S3;
Preferably, the iterative training in step S4.2 comprises the steps of: s4.2-1, for each line of the training set, inputting the two extracted sets of GO unique identification codes as sentence A and sentence B into two sentence encoders, wherein the word vectors used by the sentence encoders are the GO term word vectors and the sentence encoders are convolutional neural networks; the generated sentence vectors u and v are the protein vectors u and v; s4.2-2, from the sentence vectors u and v of step S4.2-1, computing their concatenation (u, v), their element-wise product u∗v, and the absolute value of their element-wise difference |u−v|, and feeding the resulting (u, v, u∗v, |u−v|) into a 2-class classifier consisting of several fully-connected layers and a softmax layer, finally obtaining the predicted probability distribution over the labels 'positive' and 'negative' for sentence A and sentence B of step S4.2-1; s4.2-3, minimizing the error between the training set labels and the predicted probability distribution over the labels 'positive' and 'negative' from step S4.2-2; s4.2-4, repeating steps S4.2-1 to S4.2-3 until all training set data have been iterated over once;
s4.2-5, the PPI prediction formula is as follows:
InfersentPPI(a,b)=P(positive)>P(negative)?positive:negative
s4.2-6, predicting on the validation set: if the result on the validation set is worse than that of the previous round, the model is not saved and the learning rate is reduced; if the result is better than that of the previous round, the model is saved. Training stops when the learning rate falls below the set minimum learning rate or when the number of iterations reaches the set maximum number of iterations; otherwise, steps S4.2-1 to S4.2-4 are repeated for the next round of iterative training; s4.3, obtaining the best-performing model for predicting PPIs after the iterative training is finished;
s4.4, making predictions with the PPI prediction model from step S4.3 on the test set, and organizing the test set prediction results into a file for output;
preferably, in step S4.1, the "positive" classification represents PPI positive and the "negative" classification represents PPI negative, and in step S4.3, the predicted PPI is the PPI prediction of the protein annotated by the gene ontology.
Compared with the prior art, the invention has the following beneficial effects: the method for predicting protein-protein interactions based on the sentence embedding InferSent model provided by the invention effectively improves PPI prediction accuracy and AUC by means of the natural language processing model InferSent combined with the gene ontology.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in figs. 1-5, the present invention provides a method for predicting protein-protein interactions based on a sentence embedding InferSent model (detailed below using PPI positive and negative data as an example), comprising the steps of:
step S1, the GO ontology is constructed as a graph, where GO terms are the nodes and the relationships (also called object properties) between GO terms are the edges; the GO graph information is available in the go.owl file. GO structure axioms are extracted and generated from the GO graph structure file go.owl using the existing Onto2Vec technology, and the GO structure axioms are trained to obtain GO term word vectors. A GO structure axiom is a description stating that a GO term (except the root term of each aspect) has a subclass relationship with another GO term;
step S2, screening and extracting annotation axioms: screening and extracting each GO annotation record with its corresponding weight from a Gene Ontology Annotation (GOA) file to generate GO annotation axioms;
step S3, combining the GO annotation axioms from step S2, replacing the proteins of the PPI positive and negative data set, line by line, with the GO terms annotated to them to obtain the final training data;
and S4, modifying the InferSent model into an InfersentPPI model, combining the GO term word vectors from step S1, iteratively training the InfersentPPI model on the training data from step S3 to finally obtain a model for predicting PPIs, and outputting the PPI prediction results.
As shown in fig. 2, the step S1 further includes the following steps:
s1.1, extracting the GO graph structure records from the go.owl file, each record being composed of GO unique identification codes and relation words between GO terms (e.g., SubClassOf, DisjointWith); the GO graph structure records are organized into a file to obtain the GO structure axiom file, as shown in table 1:
table 1 is an example of the contents of the GO structural axiom file
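As described in S1.1, each structure axiom line pairs GO unique identification codes with a relation word. The following is a minimal sketch of how such axioms could be extracted from go.owl; the invention relies on the existing Onto2Vec technology for this step, so owlready2 and the SubClassOf-only traversal below are illustrative assumptions:

```python
# Hypothetical sketch of step S1.1: extract "SubClassOf" structure
# axioms from go.owl. The patent uses Onto2Vec for this step; owlready2
# and the SubClassOf-only traversal are illustrative assumptions.
import os
from owlready2 import get_ontology, ThingClass

onto = get_ontology("file://" + os.path.abspath("go.owl")).load()

with open("go_structure_axioms.txt", "w") as out:
    for cls in onto.classes():
        for parent in cls.is_a:
            # Keep only named-class parents (skip property restrictions)
            if isinstance(parent, ThingClass):
                out.write(f"{cls.name} SubClassOf {parent.name}\n")
```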
S1.2, inputting the GO structure axiom file in the step S1.1 into a skip-gram model of Word2vec line by line;
s1.3, training in a skip-gram model as follows:
given a sequence of training words $w_1, w_2, \ldots, w_T$, the objective of the skip-gram model is to maximize:

\[ \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t) \]

wherein c is the size of the training context window, T is the size of the training word sequence, and $w_t$ is the t-th training word in the sequence;
s1.4, after training is finished, obtaining the word vectors of the GO terms and organizing them into a file for output;
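Steps S1.2 to S1.4 could be realized, for example, with gensim's Word2Vec in skip-gram mode; the following is a minimal sketch in which gensim itself, the file names and all hyperparameter values are assumptions, since the invention only specifies the skip-gram objective:

```python
# Sketch of steps S1.2-S1.4: train skip-gram word vectors on the GO
# structure axiom file. gensim and the hyperparameter values below are
# illustrative assumptions.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Each line of the axiom file is one "sentence" of tokens,
# e.g. "GO_0000001 SubClassOf GO_0048308".
corpus = LineSentence("go_structure_axioms.txt")

model = Word2Vec(
    corpus,
    sg=1,           # skip-gram (maximizes the objective above)
    vector_size=200,
    window=5,       # context window size c
    min_count=1,    # keep every GO term
    workers=4,
)

# Organize the GO term word vectors into a file, as in step S1.4.
model.wv.save_word2vec_format("go_term_vectors.txt")
```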
as shown in fig. 3, the step S2 further includes the following steps:
s2.1, screening each record of a Gene Ontology Annotation (GOA) file to be processed according to the content of its Evidence Code field. The Evidence Code is a valid evidence code of a GO annotation; records whose Evidence Code field is 'IEA' or 'ND' are deleted to obtain the screened GOA file. The ND evidence code is used for annotations when no information is available about the molecular function, biological process or cellular component of the annotated gene or gene product. IEA-supported annotations are ultimately based on homology and/or other experimental or sequence information, but are generally not traceable back to an experimental source. The UniProtKB unique identification code and GO unique identification code of each line of the screened GOA file are extracted to obtain a GO annotation record file; repeated records in the GO annotation record file are not deleted, and the number of repetitions represents the number of effective references for the annotation record and serves as its weight;
specific examples are shown in table 2 and table 3:
table 2 is an example of the contents of a GOA file
Table 3 is an example of the contents of the GO annotation record file
UniProtKB ID | Relation | GO ID
A2P2R3 | hasFunction | GO:0006047
D6VTK4 | hasFunction | GO:0000750
D6VTK4 | hasFunction | GO:0000750
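A minimal sketch of the S2.1 screening and weighting follows; the tab-separated GAF 2.x column layout (column 2 = UniProtKB ID, column 5 = GO ID, column 7 = Evidence Code) and the file names are assumptions:

```python
# Sketch of step S2.1: filter a GOA (GAF) file and keep duplicate
# records as implicit weights. Column positions assume GAF 2.x:
# col 2 = DB Object ID (UniProtKB), col 5 = GO ID, col 7 = Evidence Code.
def filter_goa(goa_path, out_path):
    with open(goa_path) as goa, open(out_path, "w") as out:
        for line in goa:
            if line.startswith("!"):        # GAF header lines
                continue
            cols = line.rstrip("\n").split("\t")
            if cols[6] in ("IEA", "ND"):    # drop IEA/ND evidence
                continue
            # Duplicates are intentionally kept: their multiplicity is
            # the number of supporting references, used as the weight.
            out.write(f"{cols[1]}\thasFunction\t{cols[4]}\n")

filter_goa("goa_uniprot.gaf", "go_annotation_records.txt")
```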
S2.2, for each UniProtKB unique identification code in the GO annotation record file from step S2.1, extracting all of its corresponding GO unique identification codes, collecting them on the same line, and organizing them into a file to obtain the GO annotation axiom file;
specific examples are shown in table 4:
table 4 is an example of the contents of the GO annotated axiom file
UniProtKB ID | GO ID
A2P2R3 | GO:0006002;GO:0006047
D6VTK4 | GO:0000750;GO:0000750
The step S3 further includes the steps of:
s3.1, extracting the pair of proteins recorded in each line of a protein-protein interaction (PPI) positive and negative dataset and mapping them to two UniProtKB unique identification codes; protein pairs in which a protein cannot be mapped to a UniProtKB unique identification code are deleted. An attribute label 'positive' or 'negative' is generated for each protein pair according to the nature of the dataset, where 'positive' denotes a PPI positive and 'negative' denotes a PPI negative. The protein pairs and attribute labels are organized into a PPI record file, each line of which consists of two UniProtKB unique identification codes and an attribute label;
specific examples are shown in table 5:
table 5 is an example of the contents of the PPI record file
ProteinA | ProteinB | Tag
P16649 | P14922 | positive
P07269 | P22035 | positive
P53248 | P32366 | negative
Q08558 | P31412 | negative
Q06169 | P41807 | negative
S3.2, using the GO annotation axioms from step S2, replacing the proteins in the PPI record file from step S3.1, line by line, with the GO unique identification codes annotated to them to obtain a PPI corpus for training the model;
specific examples are shown in table 6:
table 6 is an example of the contents of a PPI corpus
S3.3, from the PPI corpus in step S3.2, randomly selecting 80%, 10% and 10% of the records as the training set, validation set and test set, respectively, to form the final training data.
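A sketch of steps S3.1 to S3.3 under the file layouts shown in Tables 4 and 5; the tab separators, file names and fixed random seed are assumptions:

```python
# Sketch of steps S3.1-S3.3: replace each protein in the PPI record
# file with its annotated GO IDs, then split 80/10/10. File layouts
# follow Tables 4 and 5; separators and the seed are assumptions.
import random

axioms = {}
with open("go_annotation_axioms.txt") as f:
    for line in f:
        uniprot_id, go_ids = line.rstrip("\n").split("\t")
        axioms[uniprot_id] = go_ids.replace(";", " ")

corpus = []
with open("ppi_records.txt") as f:
    for line in f:
        prot_a, prot_b, tag = line.rstrip("\n").split("\t")
        if prot_a in axioms and prot_b in axioms:   # drop unmapped pairs
            corpus.append((axioms[prot_a], axioms[prot_b], tag))

random.seed(42)
random.shuffle(corpus)
n = len(corpus)
train = corpus[: int(0.8 * n)]
valid = corpus[int(0.8 * n): int(0.9 * n)]
test = corpus[int(0.9 * n):]
```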
The step S4 further includes the steps of:
s4.1, modifying the InferSent model: the sentence encoder of the InferSent model is set to a convolutional neural network and its classifier is set to a two-class classifier whose labels are 'positive' and 'negative', obtaining the InfersentPPI model;
s4.2, combining the GO term word vectors from step S1, iteratively training the InfersentPPI model from step S4.1 on the training data from step S3;
as shown in fig. 4, the iterative training in step S4.2 includes the following steps:
s4.2-1, for each line of the training set, inputting the two extracted sets of GO unique identification codes as sentence A and sentence B into two sentence encoders, wherein the word vectors used by the sentence encoders are the GO term word vectors and the sentence encoders are convolutional neural networks; as shown in FIG. 5, the generated sentence vectors u and v are the protein vectors u and v;
s4.2-2, from the sentence vectors u and v of step S4.2-1, computing their concatenation (u, v), their element-wise product u∗v, and the absolute value of their element-wise difference |u−v|; the resulting (u, v, u∗v, |u−v|) is fed into a 2-class classifier consisting of several fully-connected layers and a softmax layer, finally yielding the predicted probability distribution over the labels 'positive' and 'negative' for sentence A and sentence B of step S4.2-1;
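A minimal PyTorch sketch of the InfersentPPI architecture described in S4.2-1 and S4.2-2 follows; the kernel size, hidden width and max pooling of the CNN encoder are assumptions, while the (u, v, u∗v, |u−v|) feature combination and the fully-connected plus softmax classifier follow the text:

```python
# Sketch of the InfersentPPI architecture (S4.2-1/S4.2-2). The shared
# CNN sentence encoder, kernel size, hidden width and max pooling are
# illustrative assumptions; the (u, v, u*v, |u-v|) combination and the
# fully-connected + softmax classifier follow the text.
import torch
import torch.nn as nn

class InfersentPPI(nn.Module):
    def __init__(self, go_vectors, hidden=256):
        super().__init__()
        # go_vectors: pretrained GO term word vectors from step S1
        self.embed = nn.Embedding.from_pretrained(go_vectors, freeze=False)
        self.conv = nn.Conv1d(go_vectors.size(1), hidden,
                              kernel_size=3, padding=1)
        self.classifier = nn.Sequential(
            nn.Linear(4 * hidden, 512), nn.ReLU(),
            nn.Linear(512, 2),          # 'positive' / 'negative'
        )

    def encode(self, go_ids):
        # (batch, seq) -> (batch, hidden): CNN encoder with max pooling
        x = self.embed(go_ids).transpose(1, 2)
        return torch.relu(self.conv(x)).max(dim=2).values

    def forward(self, sent_a, sent_b):
        u, v = self.encode(sent_a), self.encode(sent_b)
        feats = torch.cat([u, v, u * v, (u - v).abs()], dim=1)
        return self.classifier(feats)   # logits; softmax applied in loss
```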
s4.2-3, minimizing the error between the training set labels and the predicted probability distribution over the labels 'positive' and 'negative' from step S4.2-2;
s4.2-4, repeating steps S4.2-1 to S4.2-3 until all training set data have been iterated over once;
s4.2-5, the PPI prediction formula is as follows:
InfersentPPI(a,b)=P(positive)>P(negative)?positive:negative
for example: one input of InfersentPPI is sentences a and b, the UniProtKB unique identification codes of protein A and protein B:
sentence a: P16649;
sentence b: P14922;
the proteins are then replaced, line by line, with the GO unique identification codes annotated to them according to step S3.2, resulting in the test data being the word sets GOs1 and GOs2:
GOs1:{GO_0000329,GO_0005739,GO_0005739,GO_0006623,GO_0022857,GO_0055085}
GOs2:{GO_0005783,GO_0006633,GO_0006892,GO_0009922,GO_0009922,GO_0019367,GO_0030148,GO_0030148,GO_0030176,GO_0030497,GO_0032511,GO_0034625,GO_0034626,GO_0042761,GO_0042761}
finally, P(positive) and P(negative) for sentences a and b are calculated according to the formula in step S4.2-2, giving 0.724 and 0.276 respectively, so InfersentPPI(a, b) = positive.
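In code, the decision rule of S4.2-5 applied to this example could read as follows (a sketch; `model`, `sent_a_ids` and `sent_b_ids` are the hypothetical trained InfersentPPI model and tokenized GO-ID inputs):

```python
# Decision rule of S4.2-5 applied to the worked example above.
# Assumes class index 0 corresponds to the 'positive' label.
import torch

logits = model(sent_a_ids, sent_b_ids)           # one (sentence A, B) pair
p_positive, p_negative = torch.softmax(logits, dim=1)[0]
# For the example in the text: p_positive ~ 0.724, p_negative ~ 0.276
prediction = "positive" if p_positive > p_negative else "negative"
```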
S4.2-6, predicting on the validation set: if the result on the validation set is worse than that of the previous round, the model is not saved and the learning rate is reduced; if the result is better than that of the previous round, the model is saved. Training stops when the learning rate falls below the set minimum learning rate or when the number of iterations reaches the set maximum number of iterations; otherwise, steps S4.2-1 to S4.2-4 are repeated for the next round of iterative training;
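A sketch of one possible S4.2 training loop with the validation-driven saving and learning-rate schedule of S4.2-6; the SGD optimizer, the divide-by-5 decay, the thresholds, and the helper names `train_loader`, `valid_loader` and `evaluate` are all assumptions:

```python
# Sketch of the iterative training in S4.2 with the validation logic of
# S4.2-6. Optimizer choice, decay factor, thresholds and the data
# loaders / evaluate() helper are illustrative assumptions.
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()   # minimizes the error of S4.2-3
best_acc, min_lr, max_epochs = 0.0, 1e-5, 20

for epoch in range(max_epochs):
    for sent_a, sent_b, labels in train_loader:     # S4.2-1 .. S4.2-4
        optimizer.zero_grad()
        loss = loss_fn(model(sent_a, sent_b), labels)
        loss.backward()
        optimizer.step()

    val_acc = evaluate(model, valid_loader)         # S4.2-6
    if val_acc > best_acc:
        best_acc = val_acc
        torch.save(model.state_dict(), "infersentppi_best.pt")
    else:
        # worse than the last round: do not save, decay the learning rate
        for group in optimizer.param_groups:
            group["lr"] /= 5
    if optimizer.param_groups[0]["lr"] < min_lr:
        break                           # below minimum learning rate: stop
```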
s4.3, obtaining the best-performing model for predicting PPIs after the iterative training is finished;
s4.4, the model for predicting PPIs from step S4.3 makes predictions on the test set, and the test set prediction results are organized into a file for output; the model trained with the parameter Batch_size = 2 achieved the best prediction effect on the test set;
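A brief sketch of the S4.4 test-set prediction, also reporting the AUC mentioned in the beneficial effects; scikit-learn, the hypothetical `test_loader`, the output file name and the assumption that class index 0 is 'positive' are not specified by the invention:

```python
# Sketch of step S4.4: predict on the test set and report accuracy/AUC.
# scikit-learn and the hypothetical test_loader are assumptions; class
# index 0 is assumed to be the 'positive' label.
import torch
from sklearn.metrics import roc_auc_score

model.load_state_dict(torch.load("infersentppi_best.pt"))
model.eval()

scores, labels = [], []
with torch.no_grad(), open("ppi_predictions.txt", "w") as out:
    for sent_a, sent_b, label in test_loader:
        probs = torch.softmax(model(sent_a, sent_b), dim=1)
        scores.extend(probs[:, 0].tolist())      # P(positive)
        labels.extend(label.tolist())
        for p in probs[:, 0].tolist():
            out.write("positive\n" if p > 0.5 else "negative\n")

print("AUC:", roc_auc_score(labels, scores))
```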
in step S4.1, the "positive" classification represents PPI positive and the "negative" classification represents PPI negative, and in step S4.3, the predicted PPI is the PPI prediction of the protein annotated by the gene ontology.
In conclusion, the method for predicting protein-protein interactions based on the sentence embedding InferSent model provided by the invention effectively improves the accuracy and AUC of PPI prediction by means of the natural language processing model InferSent combined with the gene ontology.
The invention can be applied not only to proteins, but also to other entities annotated with an ontology. In addition, the sentence encoder of the natural language processing model InferSent can be replaced without affecting the implementation of the overall model; the user can select a suitable sentence encoder as required.
Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.