Background
Protein-protein interactions (PPIs) are fundamental to many bioinformatics applications, such as protein function prediction and drug discovery. Accurate prediction of protein-protein interactions therefore helps us understand underlying molecular mechanisms and significantly facilitates drug discovery. PPIs can be predicted more accurately with Gene Ontology (GO) information. Most previous studies that use gene ontology information to predict PPIs have relied on Information Content (IC). Recently, some studies have applied word embedding techniques from natural language processing to learn vectors representing GO terms and proteins for PPI prediction.
Gene ontology is a standard vocabulary for biological functional annotation: a uniform set of terms used to describe the function of homologous genes and gene products across species. The invention uses a supervised sentence embedding technique to capture GO structure and GO annotation information to predict PPIs. By combining gene ontology with powerful natural language processing techniques, our method provides a general computational pipeline to predict protein-protein interactions even without protein sequence information.
Disclosure of Invention
The invention aims to provide a protein-protein interaction prediction method based on the sentence embedding InferSent model, which predicts protein-protein interactions (PPIs) by combining the natural language processing model InferSent with the Gene Ontology (GO). In the method, each record of a GO annotation axiom has a corresponding weight; a PPI positive and negative data set is trained on a model based on the sentence embedding InferSent, combining GO annotation axioms and GO structure axioms, to finally obtain a model for predicting PPIs.
In order to achieve this purpose, the invention is realized by the following technical scheme:
a method for predicting protein-protein interactions based on a sentence embedding InferSent model, comprising the steps of:
s1, the GO ontology is constructed as a graph, wherein GO terms serve as nodes and the relationships between GO terms serve as edges. GO structure axioms are extracted and generated from the GO graph structure file using the existing Onto2Vec technology, and the GO structure axioms are trained to obtain GO term word vectors;
s2, screening and extracting annotation axioms: screening and extracting each GO annotation record with its corresponding weight from a Gene Ontology Annotation (GOA) file to generate GO annotation axioms;
s3, combining the GO annotation axioms from step S2, replacing the proteins of the PPI positive and negative data set, line by line, with the GO terms annotated to them to obtain the final training data;
s4, modifying the InferSent model into an InfersentPPI model, combining the GO term word vectors from step S1, iteratively training the InfersentPPI model on the training data from step S3 to finally obtain a model for predicting PPIs, and outputting the PPI prediction results.
Preferably, the step S1 further includes the steps of: s1.1, extracting the GO graph structure records from the go.owl file and organizing them into a GO structure axiom file; s1.2, inputting the GO structure axiom file from step S1.1, line by line, into the skip-gram model of Word2vec;
s1.3, training in a skip-gram model as follows:
given a sequence of training words $w_1, w_2, \ldots, w_T$, the objective of the skip-gram model is to maximize:

\[ \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t) \]

where c is the size of the training context window, T is the size of the training word sequence, and $w_t$ is the t-th training word in the sequence;
s1.4, after training is finished, obtaining the word vectors of the GO terms and organizing them into a file for output;
Preferably, the step S2 further includes the steps of: s2.1, screening each record of a Gene Ontology Annotation (GOA) file to be processed according to the content of its Evidence Code field, deleting records whose Evidence Code field is 'IEA' or 'ND' to obtain a screened GOA file; extracting the UniProtKB unique identification code and GO unique identification code of each record of the screened GOA file to obtain a GO annotation record file, wherein repeated records in the GO annotation record file are not deleted; the number of repetitions of a record represents the number of effective references for that annotation and serves as the weight of the corresponding annotation record; s2.2, extracting, for each UniProtKB unique identification code in the GO annotation record file from step S2.1, all of its corresponding GO unique identification codes, collecting them on the same line, and organizing them into a file to obtain a GO annotation axiom file;
Preferably, the step S3 further includes the steps of: s3.1, extracting the pair of proteins recorded in each line of a protein-protein interaction (PPI) positive and negative dataset and mapping them to two UniProtKB unique identification codes, deleting protein pairs in which a protein cannot be mapped to a UniProtKB unique identification code, generating an attribute label 'positive' or 'negative' for each protein pair according to the nature of the dataset, and organizing the protein pairs and attribute labels into a PPI record file, wherein each line of the PPI record file consists of two UniProtKB unique identification codes and an attribute label; s3.2, using the GO annotation axioms from step S2, replacing the proteins in the PPI record file from step S3.1, line by line, with the GO unique identification codes annotated to them to obtain a PPI corpus for training the model;
s3.3, from the PPI corpus in step S3.2, randomly selecting 80%, 10% and 10% of the records as the training set, validation set and test set, respectively, to form the final training data.
Preferably, the step S4 further includes the steps of: s4.1, modifying the InferSent model by setting its sentence encoder to a convolutional neural network and its classifier to a two-class classifier with labels 'positive' and 'negative', obtaining the InfersentPPI model; s4.2, combining the GO term word vectors from step S1, iteratively training the InfersentPPI model from step S4.1 on the training data from step S3;
Preferably, the iterative training in step S4.2 comprises the steps of: s4.2-1, for each line of the training set, inputting the two extracted sets of GO unique identification codes as sentence A and sentence B into two sentence encoders, wherein the word vectors used by the sentence encoders are the GO term word vectors and the sentence encoders are convolutional neural networks; the generated sentence vectors u and v are the protein vectors u and v; s4.2-2, from the sentence vectors u and v of step S4.2-1, computing their concatenation (u, v), their element-wise product u∗v, and the absolute value of their element-wise difference |u−v|, and feeding the resulting (u, v, u∗v, |u−v|) into a 2-class classifier consisting of several fully-connected layers and a softmax layer, finally obtaining the predicted probability distribution over the labels 'positive' and 'negative' for sentence A and sentence B of step S4.2-1; s4.2-3, minimizing the error between the training set labels and the predicted probability distribution over the labels 'positive' and 'negative' from step S4.2-2; s4.2-4, repeating steps S4.2-1 to S4.2-3 until all training set data have been iterated over once;
s4.2-5, the PPI prediction formula is as follows:
InfersentPPI(a,b)=P(positive)>P(negative)?positive:negative
s4.2-6, predicting on the validation set: if the result on the validation set is worse than that of the previous round, the model is not saved and the learning rate is reduced; if the result is better than that of the previous round, the model is saved. Training stops when the learning rate falls below the set minimum learning rate or when the number of iterations reaches the set maximum number of iterations; otherwise, steps S4.2-1 to S4.2-4 are repeated for the next round of iterative training; s4.3, obtaining the best-performing model for predicting PPIs after the iterative training is finished;
s4.4, making predictions with the PPI prediction model from step S4.3 on the test set, and organizing the test set prediction results into a file for output;
preferably, in step S4.1, the "positive" classification represents PPI positive and the "negative" classification represents PPI negative, and in step S4.3, the predicted PPI is the PPI prediction of the protein annotated by the gene ontology.
Compared with the prior art, the invention has the following beneficial effects: the method for predicting protein-protein interactions based on the sentence embedding InferSent model provided by the invention effectively improves PPI prediction accuracy and AUC by means of the natural language processing model InferSent combined with the gene ontology.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in figs. 1-5, the present invention provides a method for predicting protein-protein interactions based on a sentence embedding InferSent model (detailed below using PPI positive and negative data as an example), comprising the steps of:
step S1, the GO ontology is constructed as a graph, where GO terms are the nodes and the relationships (also called object properties) between GO terms are the edges; the GO graph information is available in the go.owl file. GO structure axioms are extracted and generated from the GO graph structure file go.owl using the existing Onto2Vec technology, and the GO structure axioms are trained to obtain GO term word vectors. A GO structure axiom is a description stating that a GO term (except the root term of each aspect) has a subclass relationship with another GO term;
step S2, screening and extracting annotation axioms: screening and extracting each GO annotation record with its corresponding weight from a Gene Ontology Annotation (GOA) file to generate GO annotation axioms;
step S3, combining the GO annotation axioms from step S2, replacing the proteins of the PPI positive and negative data set, line by line, with the GO terms annotated to them to obtain the final training data;
and S4, modifying the InferSent model into an InfersentPPI model, combining the GO term word vectors from step S1, iteratively training the InfersentPPI model on the training data from step S3 to finally obtain a model for predicting PPIs, and outputting the PPI prediction results.
As shown in fig. 2, the step S1 further includes the following steps:
s1.1, extracting the GO graph structure records from the go.owl file, each record being composed of GO unique identification codes and relation words between GO terms (e.g., SubClassOf, DisjointWith); the GO graph structure records are organized into a file to obtain the GO structure axiom file, as shown in table 1:
table 1 is an example of the contents of the GO structural axiom file
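As described in S1.1, each structure axiom line pairs GO unique identification codes with a relation word. The following is a minimal sketch of how such axioms could be extracted from go.owl; the invention relies on the existing Onto2Vec technology for this step, so owlready2 and the SubClassOf-only traversal below are illustrative assumptions:

```python
# Hypothetical sketch of step S1.1: extract "SubClassOf" structure
# axioms from go.owl. The patent uses Onto2Vec for this step; owlready2
# and the SubClassOf-only traversal are illustrative assumptions.
import os
from owlready2 import get_ontology, ThingClass

onto = get_ontology("file://" + os.path.abspath("go.owl")).load()

with open("go_structure_axioms.txt", "w") as out:
    for cls in onto.classes():
        for parent in cls.is_a:
            # Keep only named-class parents (skip property restrictions)
            if isinstance(parent, ThingClass):
                out.write(f"{cls.name} SubClassOf {parent.name}\n")
```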
S1.2, inputting the GO structure axiom file in the step S1.1 into a skip-gram model of Word2vec line by line;
s1.3, training in a skip-gram model as follows:
given a sequence of training words $w_1, w_2, \ldots, w_T$, the objective of the skip-gram model is to maximize:

\[ \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t) \]

wherein c is the size of the training context window, T is the size of the training word sequence, and $w_t$ is the t-th training word in the sequence;
s1.4, after training is finished, obtaining the word vectors of the GO terms and organizing them into a file for output;
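Steps S1.2 to S1.4 could be realized, for example, with gensim's Word2Vec in skip-gram mode; the following is a minimal sketch in which gensim itself, the file names and all hyperparameter values are assumptions, since the invention only specifies the skip-gram objective:

```python
# Sketch of steps S1.2-S1.4: train skip-gram word vectors on the GO
# structure axiom file. gensim and the hyperparameter values below are
# illustrative assumptions.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Each line of the axiom file is one "sentence" of tokens,
# e.g. "GO_0000001 SubClassOf GO_0048308".
corpus = LineSentence("go_structure_axioms.txt")

model = Word2Vec(
    corpus,
    sg=1,           # skip-gram (maximizes the objective above)
    vector_size=200,
    window=5,       # context window size c
    min_count=1,    # keep every GO term
    workers=4,
)

# Organize the GO term word vectors into a file, as in step S1.4.
model.wv.save_word2vec_format("go_term_vectors.txt")
```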
as shown in fig. 3, the step S2 further includes the following steps:
s2.1, screening each record of a Gene Ontology Annotation (GOA) file to be processed according to the content of its Evidence Code field. The Evidence Code is a valid evidence code of a GO annotation; records whose Evidence Code field is 'IEA' or 'ND' are deleted to obtain the screened GOA file. The ND evidence code is used for annotations when no information is available about the molecular function, biological process or cellular component of the annotated gene or gene product. IEA-supported annotations are ultimately based on homology and/or other experimental or sequence information, but are generally not traceable back to an experimental source. The UniProtKB unique identification code and GO unique identification code of each line of the screened GOA file are extracted to obtain a GO annotation record file; repeated records in the GO annotation record file are not deleted, and the number of repetitions represents the number of effective references for the annotation record and serves as its weight;
specific examples are shown in table 2 and table 3:
table 2 is an example of the contents of a GOA file
Table 3 is an example of the contents of the GO annotation record file
UniProtKB ID | Relation | GO ID
A2P2R3 | hasFunction | GO:0006047
D6VTK4 | hasFunction | GO:0000750
D6VTK4 | hasFunction | GO:0000750
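A minimal sketch of the S2.1 screening and weighting follows; the tab-separated GAF 2.x column layout (column 2 = UniProtKB ID, column 5 = GO ID, column 7 = Evidence Code) and the file names are assumptions:

```python
# Sketch of step S2.1: filter a GOA (GAF) file and keep duplicate
# records as implicit weights. Column positions assume GAF 2.x:
# col 2 = DB Object ID (UniProtKB), col 5 = GO ID, col 7 = Evidence Code.
def filter_goa(goa_path, out_path):
    with open(goa_path) as goa, open(out_path, "w") as out:
        for line in goa:
            if line.startswith("!"):        # GAF header lines
                continue
            cols = line.rstrip("\n").split("\t")
            if cols[6] in ("IEA", "ND"):    # drop IEA/ND evidence
                continue
            # Duplicates are intentionally kept: their multiplicity is
            # the number of supporting references, used as the weight.
            out.write(f"{cols[1]}\thasFunction\t{cols[4]}\n")

filter_goa("goa_uniprot.gaf", "go_annotation_records.txt")
```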
S2.2, for each UniProtKB unique identification code in the GO annotation record file from step S2.1, extracting all of its corresponding GO unique identification codes, collecting them on the same line, and organizing them into a file to obtain the GO annotation axiom file;
specific examples are shown in table 4:
table 4 is an example of the contents of the GO annotated axiom file
UniProtKB ID | GO ID
A2P2R3 | GO:0006002;GO:0006047
D6VTK4 | GO:0000750;GO:0000750
The step S3 further includes the steps of:
s3.1, extracting the pair of proteins recorded in each line of a protein-protein interaction (PPI) positive and negative dataset and mapping them to two UniProtKB unique identification codes; protein pairs in which a protein cannot be mapped to a UniProtKB unique identification code are deleted. An attribute label 'positive' or 'negative' is generated for each protein pair according to the nature of the dataset, where 'positive' denotes a PPI positive and 'negative' denotes a PPI negative. The protein pairs and attribute labels are organized into a PPI record file, each line of which consists of two UniProtKB unique identification codes and an attribute label;
specific examples are shown in table 5:
table 5 is an example of the contents of the PPI record file
ProteinA | ProteinB | Tag
P16649 | P14922 | positive
P07269 | P22035 | positive
P53248 | P32366 | negative
Q08558 | P31412 | negative
Q06169 | P41807 | negative
S3.2, using the GO annotation axioms from step S2, replacing the proteins in the PPI record file from step S3.1, line by line, with the GO unique identification codes annotated to them to obtain a PPI corpus for training the model;
specific examples are shown in table 6:
table 6 is an example of the contents of a PPI corpus
S3.3, from the PPI corpus in step S3.2, randomly selecting 80%, 10% and 10% of the records as the training set, validation set and test set, respectively, to form the final training data.
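A sketch of steps S3.1 to S3.3 under the file layouts shown in Tables 4 and 5; the tab separators, file names and fixed random seed are assumptions:

```python
# Sketch of steps S3.1-S3.3: replace each protein in the PPI record
# file with its annotated GO IDs, then split 80/10/10. File layouts
# follow Tables 4 and 5; separators and the seed are assumptions.
import random

axioms = {}
with open("go_annotation_axioms.txt") as f:
    for line in f:
        uniprot_id, go_ids = line.rstrip("\n").split("\t")
        axioms[uniprot_id] = go_ids.replace(";", " ")

corpus = []
with open("ppi_records.txt") as f:
    for line in f:
        prot_a, prot_b, tag = line.rstrip("\n").split("\t")
        if prot_a in axioms and prot_b in axioms:   # drop unmapped pairs
            corpus.append((axioms[prot_a], axioms[prot_b], tag))

random.seed(42)
random.shuffle(corpus)
n = len(corpus)
train = corpus[: int(0.8 * n)]
valid = corpus[int(0.8 * n): int(0.9 * n)]
test = corpus[int(0.9 * n):]
```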
The step S4 further includes the steps of:
s4.1, modifying the InferSent model: the sentence encoder of the InferSent model is set to a convolutional neural network and its classifier is set to a two-class classifier whose labels are 'positive' and 'negative', obtaining the InfersentPPI model;
s4.2, combining the GO term word vectors from step S1, iteratively training the InfersentPPI model from step S4.1 on the training data from step S3;
as shown in fig. 4, the iterative training in step S4.2 includes the following steps:
s4.2-1, for each line of the training set, inputting the two extracted sets of GO unique identification codes as sentence A and sentence B into two sentence encoders, wherein the word vectors used by the sentence encoders are the GO term word vectors and the sentence encoders are convolutional neural networks; as shown in FIG. 5, the generated sentence vectors u and v are the protein vectors u and v;
s4.2-2, from the sentence vectors u and v of step S4.2-1, computing their concatenation (u, v), their element-wise product u∗v, and the absolute value of their element-wise difference |u−v|; the resulting (u, v, u∗v, |u−v|) is fed into a 2-class classifier consisting of several fully-connected layers and a softmax layer, finally yielding the predicted probability distribution over the labels 'positive' and 'negative' for sentence A and sentence B of step S4.2-1;
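A minimal PyTorch sketch of the InfersentPPI architecture described in S4.2-1 and S4.2-2 follows; the kernel size, hidden width and max pooling of the CNN encoder are assumptions, while the (u, v, u∗v, |u−v|) feature combination and the fully-connected plus softmax classifier follow the text:

```python
# Sketch of the InfersentPPI architecture (S4.2-1/S4.2-2). The shared
# CNN sentence encoder, kernel size, hidden width and max pooling are
# illustrative assumptions; the (u, v, u*v, |u-v|) combination and the
# fully-connected + softmax classifier follow the text.
import torch
import torch.nn as nn

class InfersentPPI(nn.Module):
    def __init__(self, go_vectors, hidden=256):
        super().__init__()
        # go_vectors: pretrained GO term word vectors from step S1
        self.embed = nn.Embedding.from_pretrained(go_vectors, freeze=False)
        self.conv = nn.Conv1d(go_vectors.size(1), hidden,
                              kernel_size=3, padding=1)
        self.classifier = nn.Sequential(
            nn.Linear(4 * hidden, 512), nn.ReLU(),
            nn.Linear(512, 2),          # 'positive' / 'negative'
        )

    def encode(self, go_ids):
        # (batch, seq) -> (batch, hidden): CNN encoder with max pooling
        x = self.embed(go_ids).transpose(1, 2)
        return torch.relu(self.conv(x)).max(dim=2).values

    def forward(self, sent_a, sent_b):
        u, v = self.encode(sent_a), self.encode(sent_b)
        feats = torch.cat([u, v, u * v, (u - v).abs()], dim=1)
        return self.classifier(feats)   # logits; softmax applied in loss
```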
s4.2-3, minimizing the error between the training set labels and the predicted probability distribution over the labels 'positive' and 'negative' from step S4.2-2;
s4.2-4, repeating steps S4.2-1 to S4.2-3 until all training set data have been iterated over once;
s4.2-5, the PPI prediction formula is as follows:
InfersentPPI(a,b)=P(positive)>P(negative)?positive:negative
for example: one input of InfersentPPI is sentences a and b, the UniProtKB unique identification codes of protein A and protein B:
sentence a: P16649;
sentence b: P14922;
the proteins are then replaced, line by line, with the GO unique identification codes annotated to them according to step S3.2, resulting in the test data being the word sets GOs1 and GOs2:
GOs1:{GO_0000329,GO_0005739,GO_0005739,GO_0006623,GO_0022857,GO_0055085}
GOs2:{GO_0005783,GO_0006633,GO_0006892,GO_0009922,GO_0009922,GO_0019367,GO_0030148,GO_0030148,GO_0030176,GO_0030497,GO_0032511,GO_0034625,GO_0034626,GO_0042761,GO_0042761}
finally, P(positive) and P(negative) for sentences a and b are calculated according to the formula in step S4.2-2, giving 0.724 and 0.276 respectively, so InfersentPPI(a, b) = positive.
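In code, the decision rule of S4.2-5 applied to this example could read as follows (a sketch; `model`, `sent_a_ids` and `sent_b_ids` are the hypothetical trained InfersentPPI model and tokenized GO-ID inputs):

```python
# Decision rule of S4.2-5 applied to the worked example above.
# Assumes class index 0 corresponds to the 'positive' label.
import torch

logits = model(sent_a_ids, sent_b_ids)           # one (sentence A, B) pair
p_positive, p_negative = torch.softmax(logits, dim=1)[0]
# For the example in the text: p_positive ~ 0.724, p_negative ~ 0.276
prediction = "positive" if p_positive > p_negative else "negative"
```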
S4.2-6, predicting on the validation set: if the result on the validation set is worse than that of the previous round, the model is not saved and the learning rate is reduced; if the result is better than that of the previous round, the model is saved. Training stops when the learning rate falls below the set minimum learning rate or when the number of iterations reaches the set maximum number of iterations; otherwise, steps S4.2-1 to S4.2-4 are repeated for the next round of iterative training;
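A sketch of one possible S4.2 training loop with the validation-driven saving and learning-rate schedule of S4.2-6; the SGD optimizer, the divide-by-5 decay, the thresholds, and the helper names `train_loader`, `valid_loader` and `evaluate` are all assumptions:

```python
# Sketch of the iterative training in S4.2 with the validation logic of
# S4.2-6. Optimizer choice, decay factor, thresholds and the data
# loaders / evaluate() helper are illustrative assumptions.
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()   # minimizes the error of S4.2-3
best_acc, min_lr, max_epochs = 0.0, 1e-5, 20

for epoch in range(max_epochs):
    for sent_a, sent_b, labels in train_loader:     # S4.2-1 .. S4.2-4
        optimizer.zero_grad()
        loss = loss_fn(model(sent_a, sent_b), labels)
        loss.backward()
        optimizer.step()

    val_acc = evaluate(model, valid_loader)         # S4.2-6
    if val_acc > best_acc:
        best_acc = val_acc
        torch.save(model.state_dict(), "infersentppi_best.pt")
    else:
        # worse than the last round: do not save, decay the learning rate
        for group in optimizer.param_groups:
            group["lr"] /= 5
    if optimizer.param_groups[0]["lr"] < min_lr:
        break                           # below minimum learning rate: stop
```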
s4.3, obtaining the best-performing model for predicting PPIs after the iterative training is finished;
s4.4, the model for predicting PPIs from step S4.3 makes predictions on the test set, and the test set prediction results are organized into a file for output; the model trained with the parameter Batch_size = 2 achieved the best prediction effect on the test set;
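A brief sketch of the S4.4 test-set prediction, also reporting the AUC mentioned in the beneficial effects; scikit-learn, the hypothetical `test_loader`, the output file name and the assumption that class index 0 is 'positive' are not specified by the invention:

```python
# Sketch of step S4.4: predict on the test set and report accuracy/AUC.
# scikit-learn and the hypothetical test_loader are assumptions; class
# index 0 is assumed to be the 'positive' label.
import torch
from sklearn.metrics import roc_auc_score

model.load_state_dict(torch.load("infersentppi_best.pt"))
model.eval()

scores, labels = [], []
with torch.no_grad(), open("ppi_predictions.txt", "w") as out:
    for sent_a, sent_b, label in test_loader:
        probs = torch.softmax(model(sent_a, sent_b), dim=1)
        scores.extend(probs[:, 0].tolist())      # P(positive)
        labels.extend(label.tolist())
        for p in probs[:, 0].tolist():
            out.write("positive\n" if p > 0.5 else "negative\n")

print("AUC:", roc_auc_score(labels, scores))
```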
in step S4.1, the "positive" classification represents PPI positive and the "negative" classification represents PPI negative, and in step S4.3, the predicted PPI is the PPI prediction of the protein annotated by the gene ontology.
In conclusion, the method for predicting protein-protein interactions based on the sentence embedding InferSent model provided by the invention effectively improves the accuracy and AUC of PPI prediction by means of the natural language processing model InferSent combined with the gene ontology.
The invention can be applied not only to proteins, but also to other entities annotated with an ontology. In addition, the sentence encoder of the natural language processing model InferSent can be replaced without affecting the implementation of the overall model; the user can select a suitable sentence encoder as required.
Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.