CN112185457A - Protein-protein interaction prediction method based on sentence embedding Infersent model - Google Patents
- Publication number
- CN112185457A (application number CN202011085576.4A)
- Authority
- CN
- China
- Prior art keywords
- training
- ppi
- model
- protein
- file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B5/00—ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
The invention discloses a method for predicting protein-protein interaction (PPI) based on the sentence embedding model Infersent combined with the Gene Ontology (GO). The method obtains GO term word vectors from the GO graph structure; screens and extracts a Gene Ontology Annotation (GOA) file to generate GO annotation axioms; and trains a PPI positive and negative data set, combining the GO annotation axioms and GO term word vectors, on the Infersent-based sentence embedding model to finally obtain a model for predicting PPI. Addressing the problem that PPI prediction accuracy and AUC are currently not high enough, the invention provides a novel PPI prediction method that improves both accuracy and AUC.
Description
Technical Field
The invention relates to the fields of biological information and natural language processing, in particular to application of a gene ontology and sentence embedding model in the field of protein-protein interaction (PPI) prediction.
Background
Protein-protein interactions (PPIs) are fundamental to many bioinformatics applications such as protein function prediction and drug discovery. Accurate prediction of protein-protein interactions therefore helps us understand underlying molecular mechanisms and significantly facilitates drug discovery. PPI can be predicted more accurately using Gene Ontology (GO) information. Most previous studies using gene ontology information to predict PPI have relied on Information Content (IC). Recently, some studies have used word embedding techniques from the field of natural language processing to learn vectors representing GO terms and proteins to predict PPIs.
Gene ontology is a standard lexical term for biological functional annotation, a uniform term used to describe the function of homologous genes and gene products across species. The invention utilizes a supervised sentence embedding technology to capture GO structure and GO annotation information to predict PPI. Combining gene ontology with powerful natural language processing techniques, our method provides a general computational flow to predict protein-protein interactions even without the use of protein sequence information.
Disclosure of Invention
The invention aims to provide a protein-protein interaction prediction method based on the sentence embedding Infersent model, which predicts protein-protein interaction (PPI) based on the natural language processing model Infersent combined with the Gene Ontology (GO). In the method, each record of the GO annotation axioms has a corresponding weight; a PPI positive and negative data set is trained on the Infersent-based sentence embedding model, combining the GO annotation axioms and GO term word vectors, to finally obtain a model for predicting PPI.
In order to achieve the purpose, the invention is realized by the following technical scheme:
a method for predicting protein-protein interactions based on a sentence embedding Infersent model, comprising the steps of:
s1, the GO ontology is constructed as a graph, in which GO terms are nodes and relationships between GO terms are edges. The existing Onto2Vec technique is used to extract and generate GO structure axioms from the GO graph structure file, and the GO structure axioms are trained to obtain GO term word vectors;
s2, screening and extracting annotation axioms: screening and extracting each GO annotation record with corresponding weight in a Gene Ontology Annotation (GOA) file to generate a GO annotation axiom;
s3, combining the GO annotation axioms from step S2, replacing the proteins of the PPI positive and negative data set, line by line, with the GO terms annotated to them, to obtain the final training data;
s4, modifying the Infersent model into an InfersentPPI model, combining the GO term word vectors from step S1, iteratively training the training data of step S3 on the InfersentPPI model to finally obtain a model for predicting PPI, and outputting the PPI prediction result.
Preferably, the step S1 further includes the steps of: S1.1, extracting the GO graph structure records from the go.owl file and organizing them into a GO structure axiom file; S1.2, inputting the GO structure axiom file of step S1.1 line by line into the skip-gram model of Word2vec;
s1.3, training in a skip-gram model as follows:
given a sequence of training words x1,x2,.....,x3The objective of the Skip-gram model is to maximize the following formula:
where c is the size of the training context window, T is the size of the training word set, wiIs the ith training word in the sequence;
s1.4, obtaining word vectors of GO terms after training is finished and organizing the word vectors into a file to be output;
preferably, the step S2 further includes the steps of: S2.1, screening each record of the Gene Ontology Annotation (GOA) file to be processed according to the content of its Evidence Code field, deleting records whose Evidence Code field is 'IEA' or 'ND' to obtain a screened GOA file. The UniProtKB unique identification code and GO unique identification code of each record line of the screened GOA file are then extracted to obtain a GO annotation record file; repeated records in the GO annotation record file are not deleted, as the number of repetitions represents the number of valid references of the annotation record and serves as the weight of that record. S2.2, extracting, for each UniProtKB unique identification code in the GO annotation record file of step S2.1, all corresponding GO unique identification codes, gathering them on the same line, and organizing them into a file to obtain a GO annotation axiom file;
preferably, the step S3 further includes the steps of: S3.1, extracting the pair of proteins recorded in each line of the protein-protein interaction (PPI) positive and negative dataset and mapping them to two UniProtKB unique identification codes; protein pairs in which a protein cannot be mapped to a UniProtKB unique identification code are deleted. An attribute label 'positive' or 'negative' is generated for each protein pair according to the nature of the dataset, and the protein pairs and attribute labels are organized into a PPI record file, each line of which consists of two UniProtKB unique identification codes and an attribute label. S3.2, using the GO annotation axioms of step S2, replacing each protein in the PPI record file, line by line, with the GO unique identification codes annotated to it, to obtain the PPI corpus for training the model;
s3.3, from the PPI corpus of step S3.2, randomly selecting 80%, 10% and 10% as the training set, validation set and test set, which form the final training data.
Preferably, the step S4 further includes the steps of: S4.1, modifying the Infersent model: the sentence encoder is set to a convolutional neural network and the classifier is set to two classes with labels 'positive' and 'negative', yielding the InfersentPPI model; S4.2, combining the GO term word vectors of step S1, iteratively training the training data of step S3 on the InfersentPPI model of step S4.1;
preferably, the iterative training in step S4.2 comprises the steps of: S4.2-1, the two sets of GO unique identification codes extracted line by line from the training set are input as sentence A and sentence B into two sentence encoders; the word vectors used by the sentence encoders are the GO term word vectors, the sentence encoders are convolutional neural networks, and the generated sentence vectors u and v are the protein vectors u and v. S4.2-2, from the sentence vectors u and v of step S4.2-1, the end-to-end concatenation (u, v), the element-wise product u*v and the absolute difference |u-v| are computed; the result (u, v, u*v, |u-v|) is fed to a 2-class classifier composed of several fully connected layers and a softmax layer, which outputs the predicted probability distribution over the labels 'positive' and 'negative' for sentence A and sentence B of step S4.2-1. S4.2-3, the error between the training set labels and the predicted probability distribution over 'positive' and 'negative' of step S4.2-2 is minimized. S4.2-4, steps S4.2-1 to S4.2-3 are repeated until all the training set data have been iterated once;
s4.2-5, the formula of the predicted PPI is as follows:
InfersentPPI(a,b)=P(positive)>P(negative)?positive:negative
s4.2-6, prediction is performed on the validation set. If the validation result is worse than that of the previous round, the model is not saved and the learning rate is reduced; if it is better, the model is saved. Training stops when the learning rate falls below the set minimum learning rate or the number of iterations reaches the set maximum number; otherwise, steps S4.2-1 to S4.2-4 are repeated for the next round of iterative training. S4.3, after the iterative training finishes, the best-performing model for predicting PPI is obtained;
s4.4, predicting the PPI model in the step S4.3 on a test set, and organizing the prediction result of the test set into a file to be output;
preferably, in step S4.1, the "positive" classification represents PPI positive and the "negative" classification represents PPI negative, and in step S4.3, the predicted PPI is the PPI prediction of the protein annotated by the gene ontology.
Compared with the prior art, the invention has the beneficial effects that: the method for predicting protein-protein interaction based on the sentence embedding Infersent model effectively improves PPI prediction accuracy and AUC by combining the natural language processing model Infersent with the gene ontology.
Drawings
Fig. 1 is a general flow chart of the operation of the present invention, divided into 4 modules: Onto2Vec, screening and extraction of annotation axioms, combination processing, and InfersentPPI;
FIG. 2 is a specific implementation of the Onto2Vec generation of GO vectors in the present invention;
FIG. 3 is a schematic flow chart of the present invention for screening and extracting annotation axioms;
FIG. 4 is a specific implementation of the InfersentPPI model of the present invention;
FIG. 5 is a specific implementation of the sentence encoder of the InfersentPPI model of the present invention;
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1-5, the present invention provides a method for predicting protein-protein interactions based on a sentence-embedded Infersent model (detailed below with PPI positive-negative data as an example), comprising the steps of:
step S1, the GO ontology is constructed as a graph, in which GO terms are nodes and relationships (also called object attributes) between GO terms are edges; the GO graph information is available in the go.owl file. The existing Onto2Vec technique is used to extract and generate GO structure axioms from the GO graph structure file go.owl, and the GO structure axioms are trained to obtain GO term word vectors. A GO structure axiom is a description stating that one GO term (other than the root term of each aspect) has a subclass relationship with another GO term;
step S2, filtering and extracting annotation axioms: screening and extracting each GO annotation record with corresponding weight in a Gene Ontology Annotation (GOA) file to generate a GO annotation axiom;
step S3, combining the GO annotation axioms from step S2, replacing the proteins of the PPI positive and negative data set, line by line, with the GO terms annotated to them, to obtain the final training data;
and step S4, modifying the Infersent model into the InfersentPPI model; combining the GO term word vectors of step S1, the training data of step S3 are iteratively trained on the InfersentPPI model to finally obtain a model for predicting PPI, and the PPI prediction result is output.
As shown in fig. 2, the step S1 further includes the following steps:
s1.1, extracting the GO graph structure records from the go.owl file, each record being composed of GO unique identification codes and the relation words between GO terms (e.g., SubClassOf, DisjointWith); the GO graph structure records are organized into a file to obtain the GO structure axiom file, as shown in table 1:
table 1 is an example of the contents of the GO structural axiom file
S1.2, inputting the GO structure axiom file in the step S1.1 into a skip-gram model of Word2vec line by line;
s1.3, training in a skip-gram model as follows:
given a sequence of training words x1,x2,...,xTSkip-gram modelThe objective is to maximize the following equation:
wherein c is the size of the training context window, T is the size of the training word set, and Wi is the ith training word in the sequence;
s1.4, obtaining word vectors of GO terms after training is finished and organizing the word vectors into a file to be output;
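To make step S1.3 concrete, the following stdlib-only sketch enumerates the (centre word, context word) pairs over which the skip-gram objective is maximized. The axiom line and window size are illustrative only; an actual implementation would train these pairs with a Word2vec library rather than this helper.

```python
def skipgram_pairs(tokens, c=2):
    # For each position t, pair the centre word with every word inside
    # the context window of size c; the S1.3 objective maximises the
    # log-probability summed over exactly these pairs.
    pairs = []
    for t, centre in enumerate(tokens):
        for j in range(max(0, t - c), min(len(tokens), t + c + 1)):
            if j != t:
                pairs.append((centre, tokens[j]))
    return pairs

# A hypothetical structure axiom line: "GO_0006914 SubClassOf GO_0016236".
axiom = ["GO_0006914", "SubClassOf", "GO_0016236"]
print(skipgram_pairs(axiom, c=2))
```

Each axiom line of the structure axiom file contributes its own window pairs, so GO terms that co-occur in axioms end up with nearby word vectors.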
as shown in fig. 3, the step S2 further includes the following steps:
s2.1, screening each record of the GOA according to the content of the Evidence Code field of the Gene Ontology Annotation (GOA) file to be processed. The Evidence Code is the evidence code of a GO annotation; records whose Evidence Code field is 'IEA' or 'ND' are deleted to obtain the screened GOA file. The ND evidence code is used when no information is available about the molecular function, biological process or cellular component of the annotated gene or gene product. IEA-supported annotations are ultimately based on homology and/or other experimental or sequence information, but are generally not traceable back to an experimental source. The UniProtKB unique identification code and GO unique identification code of each record line of the screened GOA file are extracted to obtain a GO annotation record file; repeated records in the GO annotation record file are not deleted, as the number of repetitions represents the number of valid references of the annotation record and serves as the weight of that record;
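The screening rule of step S2.1 can be sketched as follows. The GOA rows here are hypothetical three-column examples; a real GOA file is a tab-separated file with many more columns, of which only the UniProtKB ID, GO ID and evidence code matter for this step.

```python
from collections import Counter

# Hypothetical minimal GOA records: (UniProtKB ID, GO ID, Evidence Code).
goa_rows = [
    ("A2P2R3", "GO:0006047", "IDA"),
    ("D6VTK4", "GO:0000750", "IEA"),   # dropped: electronic annotation
    ("D6VTK4", "GO:0000750", "IMP"),
    ("D6VTK4", "GO:0000750", "IPI"),
    ("Q00000", "GO:0005739", "ND"),    # dropped: no data available
]

# S2.1: delete IEA/ND records; keep duplicates, whose count is the
# number of valid references and thus the weight of the record.
records = [(p, go) for p, go, ev in goa_rows if ev not in ("IEA", "ND")]
weights = Counter(records)
print(weights[("D6VTK4", "GO:0000750")])  # 2 valid references remain
```

Note that the IEA copy of the D6VTK4 record is discarded before counting, so the weight reflects only experimentally supported references, as in Table 3.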
specific examples are shown in table 2 and table 3:
table 2 is an example of the contents of a GOA file
Table 3 is an example of the contents of the GO annotation record file
UniProtKB ID | Relation | GO ID |
A2P2R3 | hasFunction | GO:0006047 |
D6VTK4 | hasFunction | GO:0000750 |
D6VTK4 | hasFunction | GO:0000750 |
S2.2, extracting, for each UniProtKB unique identification code in the GO annotation record file of step S2.1, all corresponding GO unique identification codes, gathering them on the same line, and organizing them into a file to obtain a GO annotation axiom file;
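The grouping of step S2.2 can be sketched with the Table 3 records as input; the printed output follows the Table 4 layout (duplicated GO IDs are kept, since their count is the weight).

```python
from collections import defaultdict

# GO annotation records from Table 3 (duplicates kept as weights).
records = [
    ("A2P2R3", "GO:0006002"),
    ("A2P2R3", "GO:0006047"),
    ("D6VTK4", "GO:0000750"),
    ("D6VTK4", "GO:0000750"),
]

# S2.2: gather all GO IDs of each protein onto one line, producing
# one GO annotation axiom per UniProtKB ID.
axioms = defaultdict(list)
for uniprot_id, go_id in records:
    axioms[uniprot_id].append(go_id)

for uniprot_id, go_ids in axioms.items():
    print(uniprot_id, ";".join(go_ids))
```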
specific examples are shown in table 4:
table 4 is an example of the contents of the GO annotated axiom file
UniProtKB ID | GO ID |
A2P2R3 | GO:0006002;GO:0006047 |
D6VTK4 | GO:0000750;GO:0000750 |
The step S3 further includes the steps of:
s3.1, extracting the pair of proteins recorded in each line of the protein-protein interaction (PPI) positive and negative dataset and mapping them to two UniProtKB unique identification codes; protein pairs in which a protein cannot be mapped to a UniProtKB unique identification code are deleted. An attribute label 'positive' or 'negative' is generated for each protein pair according to the nature of the dataset, where 'positive' denotes a PPI positive and 'negative' a PPI negative. The protein pairs and attribute labels are organized into a PPI record file, each line of which consists of two UniProtKB unique identification codes and an attribute label;
specific examples are shown in table 5:
table 5 is an example of the contents of the PPI record file
ProteinA | ProteinB | Tag |
P16649 | P14922 | positive |
P07269 | P22035 | positive |
P53248 | P32366 | negative |
Q08558 | P31412 | negative |
Q06169 | P41807 | negative |
S3.2, using the GO annotation axioms of step S2, replacing each protein in the PPI record file of step S3.1, line by line, with the GO unique identification codes annotated to it, to obtain the PPI corpus for training the model;
specific examples are shown in table 6:
table 6 is an example of the contents of a PPI corpus
S3.3, from the PPI corpus of step S3.2, randomly selecting 80%, 10% and 10% as the training set, validation set and test set, which form the final training data.
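Steps S3.1 to S3.3 can be sketched together as follows. The annotation axiom dictionary and PPI records are hypothetical toy examples (patterned on Tables 4 and 5), and the 80/10/10 split is shown on this toy corpus.

```python
import random

# Hypothetical annotation axioms (cf. Table 4) and PPI records (cf. Table 5).
axioms = {
    "P16649": ["GO:0000329", "GO:0005739"],
    "P14922": ["GO:0005783", "GO:0006633"],
    "P53248": ["GO:0006002"],
}
ppi_records = [
    ("P16649", "P14922", "positive"),
    ("P53248", "P99999", "negative"),  # P99999 has no annotation axiom
]

# S3.2: replace each protein with its annotated GO terms; pairs with an
# unmappable protein are dropped (as in S3.1).
corpus = [(axioms[a], axioms[b], tag)
          for a, b, tag in ppi_records
          if a in axioms and b in axioms]

# S3.3: shuffle and split 80/10/10 into training/validation/test sets.
random.shuffle(corpus)
n = len(corpus)
train = corpus[:int(0.8 * n)]
valid = corpus[int(0.8 * n):int(0.9 * n)]
test = corpus[int(0.9 * n):]
print(len(corpus))  # only the fully annotated pair survives
```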
The step S4 further includes the steps of:
s4.1, modifying the Infersent model: the sentence encoder is set to a convolutional neural network and the classifier is set to two classes with labels 'positive' and 'negative', yielding the InfersentPPI model;
s4.2, combining the GO term word vectors of step S1, iteratively training the training data of step S3 on the InfersentPPI model of step S4.1;
as shown in fig. 4, the iterative training in step S4.2 includes the following steps:
s4.2-1, the two sets of GO unique identification codes extracted line by line from the training set are input as sentence A and sentence B into two sentence encoders; the word vectors used by the sentence encoders are the GO term word vectors, the sentence encoders are convolutional neural networks and, as shown in FIG. 5, the generated sentence vectors u and v are the protein vectors u and v;
s4.2-2, from the sentence vectors u and v of step S4.2-1, the end-to-end concatenation (u, v), the element-wise product u*v and the absolute difference |u-v| are computed; the result (u, v, u*v, |u-v|) is fed to a 2-class classifier composed of several fully connected layers and a softmax layer, which outputs the predicted probability distribution over the labels 'positive' and 'negative' for sentence A and sentence B of step S4.2-1;
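The feature construction of step S4.2-2 can be sketched on toy 3-dimensional vectors; real u and v would be high-dimensional CNN sentence embeddings, and the fully connected classifier itself is omitted here.

```python
# Illustrative 3-dimensional protein (sentence) vectors; real encoder
# outputs would be high-dimensional.
u = [0.2, -0.5, 1.0]
v = [0.1, 0.4, -0.3]

concat = u + v                              # (u, v): end-to-end connection
product = [a * b for a, b in zip(u, v)]     # u * v: element-wise product
diff = [abs(a - b) for a, b in zip(u, v)]   # |u - v|: absolute difference
features = concat + product + diff          # input to the 2-class classifier
print(len(features))                        # 4 * d = 12 for d = 3
```

The classifier therefore always sees a vector four times the encoder dimension, regardless of sentence length.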
s4.2-3, minimizing the error of the labels of the training set and the probability distribution predicted values of the labels 'positive' and 'negative' in the step S4.2-2;
s4.2-4, repeating the steps S4.2-1 to S4.2-3 until the data of all the training sets are iterated once;
s4.2-5, the formula of the predicted PPI is as follows:
InfersentPPI(a,b)=P(positive)>P(negative)?positive:negative
for example: one input of InfersentPPI is sentences a and B, and UniProtKB unique identification codes for protein a and protein B:
sentence a: P16649;
sentence b: P14922;
then, according to step S3.2, the proteins are replaced line by line with the GO unique identification codes annotated to them, giving the test data as word set GOs1 and word set GOs2:
GOs1: {GO_0000329, GO_0005739, GO_0005739, GO_0006623, GO_0022857, GO_0055085}
GOs2: {GO_0005783, GO_0006633, GO_0006892, GO_0009922, GO_0009922, GO_0019367, GO_0030148, GO_0030148, GO_0030176, GO_0030497, GO_0032511, GO_0034625, GO_0034626, GO_0042761, GO_0042761}
finally, P(positive) and P(negative) of sentences a and b are calculated according to step S4.2-2, giving P(positive) = 0.724 and P(negative) = 0.276, so that InfersentPPI(a, b) = positive.
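The decision rule of step S4.2-5 can be written directly as:

```python
def infersent_ppi(p_positive, p_negative):
    # S4.2-5: the predicted label is the class with the larger
    # softmax probability.
    return "positive" if p_positive > p_negative else "negative"

# Worked example from the text: P(positive) = 0.724, P(negative) = 0.276.
print(infersent_ppi(0.724, 0.276))  # positive
```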
S4.2-6, prediction is performed on the validation set. If the validation result is worse than that of the previous round, the model is not saved and the learning rate is reduced; if it is better, the model is saved. Training stops when the learning rate falls below the set minimum learning rate or the number of iterations reaches the set maximum number; otherwise, steps S4.2-1 to S4.2-4 are repeated for the next round of iterative training;
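A sketch of the S4.2-6 control flow; all hyper-parameter values (decay factor, minimum learning rate, maximum epochs) are assumed, as the text does not specify them, and `run_epoch`/`evaluate` stand in for steps S4.2-1 to S4.2-4 and validation-set prediction.

```python
def train_with_early_stopping(run_epoch, evaluate,
                              lr=0.1, decay=0.2, min_lr=1e-5, max_epochs=20):
    # Keep the model only when the validation score improves; otherwise
    # decay the learning rate, and stop once it falls below min_lr or
    # the maximum number of epochs is reached.
    best_score, best_model = float("-inf"), None
    for _ in range(max_epochs):
        model = run_epoch(lr)           # one pass of S4.2-1 to S4.2-4
        score = evaluate(model)         # validation-set result
        if score > best_score:
            best_score, best_model = score, model   # save the model
        else:
            lr *= decay                             # adjust learning rate
            if lr < min_lr:
                break                               # stop training
    return best_model, best_score
```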
s4.3, obtaining a model for predicting the PPI with the best effect after the iterative training is finished;
s4.4, the model for predicting PPI of step S4.3 predicts on the test set, and the prediction result of the test set is organized into a file for output; the model trained with the parameter Batch_size set to 2 achieves the best prediction effect on the test set;
in step S4.1, the "positive" classification represents PPI positive and the "negative" classification represents PPI negative, and in step S4.3, the predicted PPI is the PPI prediction of the protein annotated by the gene ontology.
In conclusion, the method for predicting protein-protein interaction based on the sentence-embedded Infersent model provided by the invention effectively improves the accuracy and AUC of PPI prediction by means of natural language processing model Infersent in combination with gene ontology.
The invention can be applied not only to proteins, but also to other entities annotated with an ontology. In addition, the sentence encoder of the natural language processing model Infersent can be replaced without affecting the implementation of the whole model; the user can select a suitable sentence encoder as needed.
Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.
Claims (1)
1. A method for prediction of protein-protein interactions based on gene ontology, comprising the steps of:
s1, the GO ontology is constructed as a graph, in which GO terms are nodes and relationships between GO terms are edges; the GO term word vectors are obtained from the GO graph structure file go.owl using the Onto2Vec technique;
s2, a GO annotation is created by associating a gene or gene product with GO terms; each GO annotation record with its corresponding weight is screened and extracted from the GOA file and organized to generate GO annotation axioms;
s3, combining the GO annotation axioms from step S2, replacing the proteins of the protein interaction positive and negative data set, line by line, with the GO terms annotated to them, to obtain the final training data;
s4, constructing the InfersentPPI model based on Infersent, combining the GO term word vectors from step S1, iteratively training the training data of step S3 on the InfersentPPI model to finally obtain a model for predicting PPI, and outputting the PPI prediction result;
the step S1 further includes the steps of:
s1.1, taking out GO graph structure records in go.owl files, wherein each GO graph structure record is composed of a plurality of unique GO identification codes and relation words thereof, and the GO graph structure records are organized into files to obtain GO structure axiom files;
s1.2, inputting the GO structure axiom file in the step S1.1 into a skip-gram model of Word2vec line by line;
s1.3, training is carried out in a Skip-gram model as follows:
given a sequence of training words w_1, w_2, ..., w_T, the objective of the Skip-gram model is to maximize the following formula:
(1/T) Σ_{t=1..T} Σ_{-c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)
where c is the size of the training context window, T is the length of the training word sequence, and w_i is the i-th training word in the sequence;
s1.4, obtaining word vectors of GO terms after training is finished and organizing the word vectors into a file to be output;
the step S2 further includes the steps of:
s2.1, screening each record of the GOA according to the content of the Evidence Code field of the gene ontology annotation file to be processed, deleting records whose Evidence Code field is 'IEA' or 'ND' to obtain a screened GOA file; the UniProtKB unique identification code and GO unique identification code of each record line of the screened GOA file are extracted to obtain a GO annotation record file, in which repeated records are not deleted, as the number of repetitions represents the number of valid references of the annotation record and serves as its weight;
s2.2, extracting, for each UniProtKB unique identification code in the GO annotation record file of step S2.1, all corresponding GO unique identification codes, gathering them on the same line, and organizing them into a file to obtain a GO annotation axiom file;
the step S3 further includes the steps of:
s3.1, extracting a pair of proteins recorded in each line of a protein-protein interaction positive and negative dataset, mapping the proteins into two UniProtKB unique identification codes, deleting a protein pair in which the proteins cannot be mapped into the UniProtKB unique identification codes, generating an attribute label 'positive' or 'negative' of the corresponding protein pair according to the property of the dataset, organizing the protein pair and the attribute label into a PPI record file, and enabling the content of each line in the PPI record file to be composed of the two UniProtKB unique identification codes and the attribute label;
s3.2, using the GO annotation axioms of step S2, replacing each protein in the PPI record file of step S3.1, line by line, with the GO unique identification codes annotated to it, to obtain the PPI corpus for training the model;
s3.3, from the PPI corpus of step S3.2, randomly selecting 80%, 10% and 10% as the training set, validation set and test set, which form the final training data;
the step S4 further includes the steps of:
s4.1, modifying the Infersent model: the sentence encoder is set to a convolutional neural network and the classifier is set to two classes with labels 'positive' and 'negative', where 'positive' represents PPI positive and 'negative' represents PPI negative, yielding the InfersentPPI model;
s4.2, combining the GO term word vectors of step S1, iteratively training the training data of step S3 on the InfersentPPI model of step S4.1;
the iterative training in step S4.2 comprises the following steps:
s4.2-1, for each line of the training set of the training data, inputting the two extracted sets of GO unique identification codes, as a sentence A and a sentence B, into the two sentence encoders, wherein the word vectors used by the sentence encoders are the word vectors of GO terms and the sentence encoders are convolutional neural networks; the generated sentence vectors u and v are the protein vectors u and v;
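A NumPy sketch of the convolutional sentence encoder of step S4.2-1; the filter shape, ReLU activation and max-over-time pooling are illustrative assumptions, since the patent specifies only that the encoder is a convolutional neural network:

```python
import numpy as np

def cnn_sentence_encoder(word_vectors, filters):
    """Encode a sentence (sequence of GO-term word vectors) into one
    fixed-size protein vector: 1-D convolution over the sequence,
    ReLU, then max-over-time pooling.
    word_vectors: (seq_len, dim); filters: (n_filters, width, dim)."""
    seq_len, dim = word_vectors.shape
    n_filters, width, _ = filters.shape
    n_pos = seq_len - width + 1
    feature_maps = np.zeros((n_filters, n_pos))
    for f in range(n_filters):
        for t in range(n_pos):
            window = word_vectors[t:t + width]      # (width, dim) slice
            feature_maps[f, t] = np.sum(window * filters[f])
    feature_maps = np.maximum(feature_maps, 0.0)    # ReLU
    return feature_maps.max(axis=1)                 # max over time

rng = np.random.default_rng(0)
sent = rng.normal(size=(10, 8))    # 10 GO terms, 8-dim word vectors
filt = rng.normal(size=(4, 3, 8))  # 4 filters of width 3
u = cnn_sentence_encoder(sent, filt)
print(u.shape)  # (4,)
```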
s4.2-2, from the sentence vectors u and v of the step S4.2-1, computing the end-to-end concatenation of u and v to obtain (u, v), the element-wise product of u and v to obtain u*v, and the element-wise absolute difference of u and v to obtain |u-v|; finally sending the combined result (u, v, u*v, |u-v|) to the binary classifier, which consists of a plurality of fully connected layers and a softmax layer, to obtain predicted probability distributions over the labels 'positive' and 'negative' for the sentence A and the sentence B of the step S4.2-1;
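Step S4.2-2's feature combination and softmax output, sketched in NumPy with a single fully connected layer standing in for the patent's stack of fully connected layers; the weights are random placeholders:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_pair(u, v, W, b):
    """Concatenate (u, v, u*v, |u-v|) and map the features to a
    probability distribution over 'positive' and 'negative'."""
    features = np.concatenate([u, v, u * v, np.abs(u - v)])
    return softmax(W @ features + b)

rng = np.random.default_rng(0)
u, v = rng.normal(size=4), rng.normal(size=4)
W, b = rng.normal(size=(2, 16)), np.zeros(2)  # 4*dim features -> 2 labels
p = classify_pair(u, v, W, b)
print(p)  # [P(positive), P(negative)], sums to 1
```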
s4.2-3, minimizing the error between the labels of the training set and the predicted probability distributions over 'positive' and 'negative' of the step S4.2-2;
s4.2-4, repeating the steps S4.2-1 to S4.2-3 until all the data of the training set have been iterated over once;
s4.2-5, the formula for predicting a PPI is as follows:

InfersentPPI(a, b) = P(positive) > P(negative) ? positive : negative
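The ternary formula of step S4.2-5, written out in Python form:

```python
def infersent_ppi(p_positive, p_negative):
    """Predict 'positive' exactly when P(positive) > P(negative),
    otherwise 'negative', as in the formula of step S4.2-5."""
    return "positive" if p_positive > p_negative else "negative"

print(infersent_ppi(0.7, 0.3))  # positive
print(infersent_ppi(0.2, 0.8))  # negative
```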
s4.2-6, predicting on the validation set: if the result on the validation set is worse than the previous validation result, not storing the model and adjusting the learning rate; if the result is better, storing the model; stopping training when the learning rate falls below the set minimum learning rate, and otherwise repeating the steps S4.2-1 to S4.2-4 to continue the next round of iterative training; training also stops when the number of iterations reaches the set maximum number of iterations;
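One reading of the stopping logic of step S4.2-6, sketched below; the decay factor, minimum learning rate and maximum number of epochs are hypothetical values, since the patent leaves these settings to configuration:

```python
def train_with_schedule(run_epoch, evaluate, lr=0.1,
                        lr_decay=0.2, min_lr=1e-5, max_epochs=20):
    """Store the model only when the validation result improves;
    otherwise reduce the learning rate. Stop when the learning rate
    falls below min_lr or the epoch count reaches max_epochs."""
    best_score, best_model = float("-inf"), None
    for _ in range(max_epochs):
        model = run_epoch(lr)      # one pass over the training set
        score = evaluate(model)    # result on the validation set
        if score > best_score:
            best_score, best_model = score, model  # improved: store
        else:
            lr *= lr_decay         # worse than last time: shrink lr
        if lr < min_lr:
            break                  # learning rate below the minimum
    return best_model, best_score
```

With stubbed-in training and evaluation functions, the loop keeps the best validation score seen so far and decays the learning rate on every non-improving epoch.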
s4.3, after the iterative training is finished, obtaining the best-performing model for predicting PPIs, wherein PPI here refers to the interaction of proteins annotated with the gene ontology;
s4.4, the model for predicting PPIs of the step S4.3 predicts on the test set, and the prediction results on the test set are organized into a file for output.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011085576.4A CN112185457A (en) | 2020-10-12 | 2020-10-12 | Protein-protein interaction prediction method based on sentence embedding Infersent model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112185457A true CN112185457A (en) | 2021-01-05 |
Family
ID=73949329
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011085576.4A Withdrawn CN112185457A (en) | 2020-10-12 | 2020-10-12 | Protein-protein interaction prediction method based on sentence embedding Infersent model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112185457A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115565607A (en) * | 2022-10-20 | 2023-01-03 | 抖音视界有限公司 | Method, device, readable medium and electronic equipment for determining protein information |
CN115565607B (en) * | 2022-10-20 | 2024-02-23 | 抖音视界有限公司 | Method, device, readable medium and electronic equipment for determining protein information |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Varma et al. | Snuba: Automating weak supervision to label training data | |
CN113707235B (en) | Drug micromolecule property prediction method, device and equipment based on self-supervision learning | |
CN109697285B (en) | Hierarchical BilSt Chinese electronic medical record disease coding and labeling method for enhancing semantic representation | |
CN108984724B (en) | Method for improving emotion classification accuracy of specific attributes by using high-dimensional representation | |
Nadif et al. | Unsupervised and self-supervised deep learning approaches for biomedical text mining | |
CN110362723B (en) | Topic feature representation method, device and storage medium | |
CN114450751A (en) | System and method for training a machine learning algorithm to process biologically relevant data, microscope and trained machine learning algorithm | |
CN110032739A (en) | Chinese electronic health record name entity abstracting method and system | |
CN109036577A (en) | Diabetic complication analysis method and device | |
CN111582506A (en) | Multi-label learning method based on global and local label relation | |
CN116312915B (en) | Method and system for standardized association of drug terms in electronic medical records | |
CN113591955A (en) | Method, system, equipment and medium for extracting global information of graph data | |
Iqbal et al. | A dynamic weighted tabular method for convolutional neural networks | |
CN117370736A (en) | Fine granularity emotion recognition method, electronic equipment and storage medium | |
Wang et al. | Fusang: a framework for phylogenetic tree inference via deep learning | |
Zhou et al. | Review for Handling Missing Data with special missing mechanism | |
CN112185457A (en) | Protein-protein interaction prediction method based on sentence embedding Infersent model | |
CN111259176B (en) | Cross-modal Hash retrieval method based on matrix decomposition and integrated with supervision information | |
CN117436522A (en) | Biological event relation extraction method and large-scale biological event relation knowledge base construction method of cancer subject | |
Zaghir et al. | Real-world patient trajectory prediction from clinical notes using artificial neural networks and UMLS-based extraction of concepts | |
WO2024058915A1 (en) | Classification using a machine learning model trained with triplet loss | |
Louati et al. | Design and compression study for convolutional neural networks based on evolutionary optimization for thoracic X-Ray image classification | |
Zhu et al. | Uni-Fold MuSSe: De Novo Protein Complex Prediction with Protein Language Models | |
Ranjan et al. | MCWS-transformers: towards an efficient modeling of protein sequences via multi context-window based scaled self-attention | |
CN115116549A (en) | Cell data annotation method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WW01 | Invention patent application withdrawn after publication | Application publication date: 20210105 |