CN112185457A - Protein-protein interaction prediction method based on sentence-embedding InferSent model

Protein-protein interaction prediction method based on sentence-embedding InferSent model

Info

Publication number
CN112185457A
Authority
CN
China
Prior art keywords
training
ppi
model
protein
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202011085576.4A
Other languages
Chinese (zh)
Inventor
江莹莹
李美晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Maritime University
Priority to CN202011085576.4A
Publication of CN112185457A
Legal status: Withdrawn

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 5/00 - ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biotechnology (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Physiology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for predicting protein-protein interactions (PPIs) based on the sentence-embedding InferSent model, which predicts PPIs by combining the natural language processing model InferSent with the gene ontology. The method comprises the steps of: obtaining GO term word vectors from the GO graph structure; screening and extracting the Gene Ontology Annotation (GOA) file to generate GO annotation axioms; and training a PPI positive and negative dataset, combined with the GO annotation axioms and GO term word vectors, on a sentence-embedding InferSent based model, finally obtaining a model for predicting PPIs. Addressing the problem that PPI prediction accuracy and AUC are currently not high enough, the invention provides a novel PPI prediction method that improves both prediction accuracy and AUC.

Description

Protein-protein interaction prediction method based on sentence-embedding InferSent model
Technical Field
The invention relates to the fields of bioinformatics and natural language processing, and in particular to the application of the gene ontology and sentence embedding models in the field of protein-protein interaction (PPI) prediction.
Background
Protein-protein interactions (PPIs) are fundamental to many bioinformatics applications such as protein function prediction and drug discovery. Accurate prediction of protein-protein interactions therefore helps us understand underlying molecular mechanisms and significantly facilitates drug discovery. PPIs can be predicted more accurately using Gene Ontology (GO) information. Most previous studies that used gene ontology information to predict PPIs relied on Information Content (IC). Recently, some studies have used word embedding techniques from the field of natural language processing to learn vectors representing GO terms and proteins in order to predict PPIs.
The gene ontology is a standardized vocabulary for biological function annotation: a uniform set of terms used to describe the functions of homologous genes and gene products across species. The invention uses a supervised sentence embedding technique to capture GO structure and GO annotation information to predict PPIs. By combining the gene ontology with powerful natural language processing techniques, our method provides a general computational pipeline for predicting protein-protein interactions even without using protein sequence information.
Disclosure of Invention
The invention aims to provide a protein-protein interaction prediction method based on the sentence-embedding InferSent model, which predicts protein-protein interactions (PPIs) based on the natural language processing model InferSent combined with the Gene Ontology (GO). In this method, each record of the GO annotation axioms has a corresponding weight; a PPI positive and negative dataset is trained on a model based on sentence-embedding InferSent by combining the GO annotation axioms and GO structure axioms, finally obtaining a model for predicting PPIs.
In order to achieve the purpose, the invention is realized by the following technical scheme:
A method for predicting protein-protein interactions based on a sentence-embedding InferSent model, comprising the steps of:
S1, the GO ontology is constructed as a graph, in which GO terms serve as nodes and the relationships between GO terms serve as edges. GO structure axioms are extracted from the GO graph structure file using the existing Onto2Vec technique, and the GO structure axioms are trained to obtain GO term word vectors;
S2, screening and extracting annotation axioms: each GO annotation record, with its corresponding weight, is screened and extracted from the Gene Ontology Annotation (GOA) file to generate the GO annotation axioms;
S3, combining the GO annotation axioms from step S2, the proteins of the PPI positive and negative dataset are replaced, line by line, with the GO terms annotated to them, obtaining the final training data;
S4, the InferSent model is modified into an InferSentPPI model; combining the GO term word vectors from step S1, the training data from step S3 is iteratively trained on the InferSentPPI model, finally obtaining a model for predicting PPIs and outputting the PPI prediction results.
Preferably, the step S1 further includes the steps of: S1.1, extracting the GO graph structure records from the go.owl file; S1.2, inputting the GO structure axiom file from step S1.1 line by line into the skip-gram model of Word2vec;
S1.3, training in the skip-gram model as follows:
Given a sequence of training words $w_1, w_2, \ldots, w_T$, the objective of the skip-gram model is to maximize

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t)$$

where $c$ is the size of the training context window, $T$ is the size of the training word set, and $w_i$ is the $i$-th training word in the sequence;
S1.4, after training is finished, the word vectors of the GO terms are obtained and organized into an output file;
preferably, the step S2 further includes the steps of: s2.1, screening each record of the GOA according to the Event Code field content of a Gene Ontology Annotation (GOA) file to be processed, deleting the Event Code field content to be 'IEA' or 'ND' to obtain a screened GOA file, extracting a UniProtKB unique identification Code and a GO unique identification Code of each row of records of the screened GOA file to obtain a GO annotation record file, wherein repeated records in the GO annotation record file are not deleted, and the number of the repeated records represents the number of effective references of the annotation record and can be used as the weight of the corresponding annotation record; s2.2, extracting the same UniProtKB unique identification code and all corresponding GO unique identification codes of the GO annotation record file in the step S1.2, concentrating the UniProtKB unique identification codes in the same line, and organizing the UniProtKB unique identification codes into a file to obtain a GO annotation axiom file;
preferably, the step S3 further includes the steps of: s3.1, extracting a pair of proteins recorded in each line of a protein-protein interaction (PPI) positive and negative dataset, mapping the proteins into two UniProtKB unique identification codes, deleting the protein pairs in which the proteins cannot be mapped into the UniProtKB unique identification codes, generating an attribute label 'positive' or 'negative' of the corresponding protein pair according to the property of the dataset, organizing the protein pairs and the attribute label into a PPI record file, wherein the content of each line in the PPI record file consists of the two UniProtKB unique identification codes and the attribute label; s3.2, replacing the protein of the PPI record file in the step S3.1 with the unique GO identification code of the PPI record file by line by using the gene ontology annotation axiom in the step S1 to obtain a PPI corpus of a training model;
s3.3, in the PPI corpus in the step S1, randomly selecting 80%, 10% and 10% as a training set, a verification set and a test set as final training data.
Preferably, the step S4 further includes the steps of: S4.1, the InferSent model is modified: its sentence encoder is set to a convolutional neural network and its classifier is set to two classes with labels "positive" and "negative", obtaining the InferSentPPI model; S4.2, combining the GO term word vectors from step S1, the training data from step S3 is iteratively trained on the InferSentPPI model from step S4.1;
Preferably, the iterative training in step S4.2 comprises the steps of: S4.2-1, the two sets of GO unique identification codes extracted row by row from the training set are input, as sentence A and sentence B, into two sentence encoders; the word vectors used by the sentence encoders are the GO term word vectors, the sentence encoders use convolutional neural networks, and the generated sentence vectors u and v are the protein vectors u and v; S4.2-2, from the sentence vectors u and v of step S4.2-1, the concatenation (u, v), the element-wise product u*v, and the absolute difference |u-v| are computed, and the resulting (u, v, u*v, |u-v|) is fed to a binary classifier consisting of several fully-connected layers and a softmax layer, finally obtaining the predicted probability distribution over the labels "positive" and "negative" for sentence A and sentence B of step S4.2-1; S4.2-3, the error between the training-set labels and the predicted probability distribution over "positive" and "negative" from step S4.2-2 is minimized; S4.2-4, steps S4.2-1 to S4.2-3 are repeated until all training-set data have been iterated once;
S4.2-5, the formula for the predicted PPI is as follows:
InferSentPPI(a, b) = P(positive) > P(negative) ? positive : negative
i.e., the protein pair is predicted positive when P(positive) exceeds P(negative), and negative otherwise;
S4.2-6, prediction is performed on the validation set: if the validation result is worse than the previous one, the model is not saved; if it is better, the model is saved; the learning rate is then adjusted, and training stops when the learning rate falls below the set minimum; while the learning rate remains above the minimum, steps S4.2-1 to S4.2-4 are repeated for the next round of iterative training; training also stops when the number of iterations reaches the set maximum; S4.3, after the iterative training finishes, the best-performing model for predicting PPIs is obtained;
S4.4, the PPI prediction model from step S4.3 predicts on the test set, and the test-set prediction results are organized into an output file;
Preferably, in step S4.1, the "positive" class represents PPI positive and the "negative" class represents PPI negative, and in step S4.3, the predicted PPIs are PPI predictions for proteins annotated by the gene ontology.
Compared with the prior art, the invention has the following beneficial effect: the method for predicting protein-protein interactions based on the sentence-embedding InferSent model effectively improves PPI prediction accuracy and AUC by combining the natural language processing model InferSent with the gene ontology.
Drawings
Fig. 1 is the overall flow chart of the present invention, divided into 4 modules: Onto2Vec; screening and extraction of annotation axioms; combination and processing; and InferSentPPI;
FIG. 2 is a specific implementation of Onto2Vec generating GO vectors according to the present invention;
FIG. 3 is a schematic flow chart of screening and extracting annotation axioms according to the present invention;
FIG. 4 is a specific implementation of the InferSentPPI model of the present invention;
FIG. 5 is a specific implementation of the sentence encoder of the InferSentPPI model of the present invention;
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in figs. 1-5, the present invention provides a method for predicting protein-protein interactions based on a sentence-embedding InferSent model (detailed below using a PPI positive and negative dataset as an example), comprising the steps of:
Step S1, the GO ontology is constructed as a graph, where GO terms are nodes, the relationships between GO terms (also called object properties) are edges, and the GO graph information is available in the go.owl file. GO structure axioms are extracted from the GO graph structure file go.owl using the existing Onto2Vec technique and trained to obtain GO term word vectors. A GO structure axiom is a statement that a GO term (other than the root term of each aspect) has a subclass relationship with another GO term;
Step S2, screening and extracting annotation axioms: each GO annotation record, with its corresponding weight, is screened and extracted from the Gene Ontology Annotation (GOA) file to generate the GO annotation axioms;
Step S3, combining the GO annotation axioms from step S2, the proteins of the PPI positive and negative dataset are replaced, line by line, with the GO terms annotated to them, obtaining the final training data;
Step S4, the InferSent model is modified into an InferSentPPI model; combining the GO term word vectors from step S1, the training data from step S3 is iteratively trained on the InferSentPPI model, finally obtaining a model for predicting PPIs and outputting the PPI prediction results.
As shown in fig. 2, the step S1 further includes the following steps:
S1.1, the GO graph structure records are extracted from the go.owl document; each GO graph structure record consists of GO unique identification codes and the relation words between GO terms (e.g., subClassOf, disjointWith). The GO graph structure records are organized into a file to obtain the GO structure axiom file, as shown in table 1:
table 1 is an example of the contents of the GO structural axiom file
(table contents rendered as an image in the original publication)
S1.2, inputting the GO structure axiom file in the step S1.1 into a skip-gram model of Word2vec line by line;
S1.3, training in the skip-gram model as follows:
Given a sequence of training words $w_1, w_2, \ldots, w_T$, the objective of the skip-gram model is to maximize

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t)$$

where $c$ is the size of the training context window, $T$ is the size of the training word set, and $w_i$ is the $i$-th training word in the sequence;
S1.4, after training is finished, the word vectors of the GO terms are obtained and organized into an output file;
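As an illustration of steps S1.2 to S1.4, the following is a minimal sketch of skip-gram training on the GO structure axiom file using the gensim Word2vec implementation; the file names and hyperparameters (vector size, window width) are assumptions rather than values fixed by the patent.

```python
# Minimal sketch of steps S1.2-S1.4, assuming the gensim Word2vec
# implementation; file names and hyperparameters are illustrative.
from gensim.models import Word2Vec

# Each line of the structure axiom file is one axiom, e.g.
# "GO_0000002 subClassOf GO_0007005"; tokens are GO IDs and relation words.
with open("go_structure_axioms.txt") as fh:
    axioms = [line.split() for line in fh if line.strip()]

model = Word2Vec(
    sentences=axioms,
    sg=1,             # skip-gram, as required by step S1.3
    vector_size=200,  # assumed dimension of the GO term word vectors
    window=5,         # the training context window c
    min_count=1,      # keep every GO term, however rarely it appears
)
model.wv.save_word2vec_format("go_term_vectors.txt")  # step S1.4 output file
```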
as shown in fig. 3, the step S2 further includes the following steps:
S2.1, each record of the GOA is screened according to the Evidence Code field of the Gene Ontology Annotation (GOA) file to be processed. The Evidence Code field holds the GO evidence code supporting the annotation; records whose Evidence Code field is "IEA" or "ND" are deleted to obtain the screened GOA file. The ND evidence code is used for annotations when no information is available about the molecular function, biological process, or cellular component of the annotated gene or gene product. IEA-supported annotations are ultimately based on homology and/or other experimental or sequence information, but generally cannot be traced back to the experimental source. The UniProtKB unique identification code and GO unique identification code of each record line of the screened GOA file are extracted to obtain the GO annotation record file; repeated records in the GO annotation record file are not deleted, and the number of repetitions represents the number of effective references of the annotation record, serving as the weight of the corresponding annotation record;
specific examples are shown in table 2 and table 3:
table 2 is an example of the contents of a GOA file
(table contents rendered as an image in the original publication)
Table 3 is an example of the contents of the GO annotation record file
UniProtKB ID Relation GO ID
A2P2R3 hasFunction GO:0006047
D6VTK4 hasFunction GO:0000750
D6VTK4 hasFunction GO:0000750
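A hedged sketch of the filtering in step S2.1 follows. It assumes the GOA input uses the tab-separated GAF 2.x layout (column 2 holds the UniProtKB ID, column 5 the GO ID, column 7 the evidence code); the file names and the hasFunction relation word, modeled on Table 3, are illustrative.

```python
# Sketch of step S2.1, assuming a GAF 2.x formatted GOA file.
EXCLUDED_CODES = {"IEA", "ND"}  # records with these evidence codes are deleted

with open("goa_input.gaf") as src, open("go_annotation_records.txt", "w") as dst:
    for line in src:
        if line.startswith("!"):                 # skip GAF header/comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        uniprot_id, go_id, evidence = cols[1], cols[4], cols[6]
        if evidence in EXCLUDED_CODES:
            continue
        # Duplicate records are deliberately kept: the number of repeats is
        # the effective-reference count that serves as the record's weight.
        dst.write(f"{uniprot_id}\thasFunction\t{go_id}\n")
```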
S2.2, for the GO annotation record file from step S2.1, each UniProtKB unique identification code and all its corresponding GO unique identification codes are extracted and gathered on the same line, and the lines are organized into a file to obtain the GO annotation axiom file;
specific examples are shown in table 4:
table 4 is an example of the contents of the GO annotated axiom file
UniProtKB ID GO ID
A2P2R3 GO:0006002;GO:0006047
D6VTK4 GO:0000750;GO:0000750
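Step S2.2 then reduces to a grouping pass over that file, sketched below under the same assumed layout; duplicates are preserved so their multiplicity keeps acting as the annotation weight.

```python
# Sketch of step S2.2: gather all GO IDs of each UniProtKB ID onto one line.
from collections import defaultdict

annotations = defaultdict(list)
with open("go_annotation_records.txt") as fh:
    for line in fh:
        uniprot_id, _relation, go_id = line.split()
        annotations[uniprot_id].append(go_id)    # duplicates kept as weights

with open("go_annotation_axioms.txt", "w") as out:
    for uniprot_id, go_ids in annotations.items():
        out.write(f"{uniprot_id}\t{';'.join(go_ids)}\n")  # layout as in Table 4
```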
The step S3 further includes the steps of:
S3.1, the pair of proteins recorded in each line of a protein-protein interaction (PPI) positive and negative dataset is extracted and mapped to two UniProtKB unique identification codes; protein pairs whose proteins cannot be mapped to UniProtKB unique identification codes are deleted, and an attribute label "positive" or "negative" is generated for the corresponding protein pair according to the nature of the dataset, where "positive" refers to PPI positive and "negative" refers to PPI negative. The protein pairs and attribute labels are organized into a PPI record file, each line of which consists of two UniProtKB unique identification codes and an attribute label;
specific examples are shown in table 5:
table 5 is an example of the contents of the PPI record file
ProteinA ProteinB Tag
P16649 P14922 positive
P07269 P22035 positive
P53248 P32366 negative
Q08558 P31412 negative
Q06169 P41807 negative
S3.2, using the gene ontology annotation axioms from step S2.2, the proteins in the PPI record file from step S3.1 are replaced, line by line, with the GO unique identification codes annotated to them, obtaining the PPI corpus for training the model;
specific examples are shown in table 6:
table 6 is an example of the contents of a PPI corpus
(table contents rendered as an image in the original publication)
S3.3, from the PPI corpus in step S3.2, 80%, 10% and 10% are randomly selected as the training set, validation set and test set, forming the final training data.
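The corpus construction and split of steps S3.2 and S3.3 could look like the sketch below; the input layout follows Tables 4 and 5, and dropping pairs that lack annotation axioms stands in for the patent's removal of proteins that cannot be mapped.

```python
# Sketch of steps S3.2-S3.3: build the PPI corpus and split it 80/10/10.
import random

# GO annotation axioms from step S2.2: UniProtKB ID -> list of GO IDs.
axioms = {}
with open("go_annotation_axioms.txt") as fh:
    for line in fh:
        uniprot_id, go_ids = line.rstrip("\n").split("\t")
        axioms[uniprot_id] = go_ids.split(";")

# PPI record file: "ProteinA ProteinB tag" per line, as in Table 5.
corpus = []
with open("ppi_records.txt") as fh:
    for line in fh:
        prot_a, prot_b, tag = line.split()
        if prot_a in axioms and prot_b in axioms:   # drop unmappable pairs
            corpus.append((axioms[prot_a], axioms[prot_b], tag))

random.shuffle(corpus)
n = len(corpus)
train_set = corpus[: int(0.8 * n)]
valid_set = corpus[int(0.8 * n): int(0.9 * n)]
test_set = corpus[int(0.9 * n):]
```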
The step S4 further includes the steps of:
S4.1, the InferSent model is modified: its sentence encoder is set to a convolutional neural network and its classifier is set to two classes with labels "positive" and "negative", obtaining the InferSentPPI model;
S4.2, combining the GO term word vectors from step S1, the training data from step S3 is iteratively trained on the InferSentPPI model from step S4.1;
as shown in fig. 4, the iterative training in step S4.2 includes the following steps:
S4.2-1, the two sets of GO unique identification codes extracted row by row from the training set are input, as sentence A and sentence B, into two sentence encoders; the word vectors used by the sentence encoders are the GO term word vectors, and the sentence encoders use convolutional neural networks, as shown in FIG. 5; the generated sentence vectors u and v are the protein vectors u and v;
S4.2-2, from the sentence vectors u and v of step S4.2-1, the concatenation (u, v), the element-wise product u*v, and the absolute difference |u-v| are computed, and the resulting (u, v, u*v, |u-v|) is fed to a binary classifier consisting of several fully-connected layers and a softmax layer, finally obtaining the predicted probability distribution over the labels "positive" and "negative" for sentence A and sentence B of step S4.2-1 (a code sketch of this encoder and feature combination follows step S4.4 below);
S4.2-3, the error between the training-set labels and the predicted probability distribution over "positive" and "negative" from step S4.2-2 is minimized;
S4.2-4, steps S4.2-1 to S4.2-3 are repeated until all training-set data have been iterated once;
S4.2-5, the formula for the predicted PPI is as follows:
InferSentPPI(a, b) = P(positive) > P(negative) ? positive : negative
For example, one input to InferSentPPI is sentence a and sentence b, the UniProtKB unique identification codes of protein A and protein B:
sentence a: P16649;
sentence b: P14922;
The proteins are then replaced, line by line, with the GO unique identification codes annotated to them according to step S3.2, giving the test data as word set GOs1 and word set GOs2:
GOs1: {GO_0000329, GO_0005739, GO_0005739, GO_0006623, GO_0022857, GO_0055085}
GOs2: {GO_0005783, GO_0006633, GO_0006892, GO_0009922, GO_0009922, GO_0019367, GO_0030148, GO_0030148, GO_0030176, GO_0030497, GO_0032511, GO_0034625, GO_0034626, GO_0042761, GO_0042761}
Finally, P(positive) and P(negative) for sentences a and b are calculated according to the formula in step S4.2-2, giving P(positive) = 0.724 and P(negative) = 0.276; hence InferSentPPI(a, b) = positive.
S4.2-6, prediction is performed on the validation set: if the validation result is worse than the previous one, the model is not saved; if it is better, the model is saved; the learning rate is then adjusted, and training stops when the learning rate falls below the set minimum; while the learning rate remains above the minimum, steps S4.2-1 to S4.2-4 are repeated for the next round of iterative training; training also stops when the number of iterations reaches the set maximum;
S4.3, after the iterative training finishes, the best-performing model for predicting PPIs is obtained;
S4.4, the PPI prediction model from step S4.3 predicts on the test set, and the test-set prediction results are organized into an output file; the model trained with the parameter Batch_size = 2 achieved the best prediction performance on the test set;
In step S4.1, the "positive" class represents PPI positive and the "negative" class represents PPI negative; in step S4.3, the predicted PPIs are PPI predictions for proteins annotated by the gene ontology.
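To make steps S4.1 and S4.2 concrete, here is a minimal PyTorch sketch of an InferSentPPI-style model: a convolutional sentence encoder shared between the two inputs (a common siamese choice; the patent's two encoders could equally be separate), the (u, v, u*v, |u-v|) feature combination, and a two-class classifier. All dimensions, filter widths, and other hyperparameters are assumptions.

```python
# Hedged sketch of the InferSentPPI architecture of steps S4.1-S4.2;
# dimensions and the shared (siamese) encoder are assumptions.
import torch
import torch.nn as nn

class CNNSentenceEncoder(nn.Module):
    """Embeds GO-term tokens, applies 1-D convolutions of several widths,
    and max-pools over time to produce a fixed-size protein vector."""
    def __init__(self, vocab_size, emb_dim=200, n_filters=128, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes
        )
        self.out_dim = n_filters * len(kernel_sizes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return torch.cat(pooled, dim=1)                # protein vector u or v

class InferSentPPI(nn.Module):
    """Binary classifier over the feature combination of step S4.2-2;
    CrossEntropyLoss applies the classifier's softmax internally."""
    def __init__(self, encoder, hidden=512):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Sequential(
            nn.Linear(4 * encoder.out_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),   # logits for 'negative' (0) / 'positive' (1)
        )

    def forward(self, sent_a, sent_b):
        u = self.encoder(sent_a)    # protein vector u (step S4.2-1)
        v = self.encoder(sent_b)    # protein vector v
        feats = torch.cat([u, v, u * v, torch.abs(u - v)], dim=1)  # step S4.2-2
        return self.classifier(feats)

# Illustrative training and prediction step on random token IDs.
encoder = CNNSentenceEncoder(vocab_size=50000)
model = InferSentPPI(encoder)
a = torch.randint(1, 50000, (8, 20))   # batch of 8 "sentences" of 20 GO terms
b = torch.randint(1, 50000, (8, 20))
labels = torch.randint(0, 2, (8,))     # 0 = negative, 1 = positive

logits = model(a, b)
loss = nn.CrossEntropyLoss()(logits, labels)   # error minimized in step S4.2-3
loss.backward()

probs = logits.softmax(dim=1)
# Step S4.2-5 decision rule: positive iff P(positive) > P(negative).
predictions = (probs[:, 1] > probs[:, 0]).long()
```

The validation-driven learning-rate decay and early stopping of step S4.2-6 would wrap this training step in an epoch loop; that scaffolding is omitted here.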
In conclusion, the method for predicting protein-protein interactions based on the sentence-embedding InferSent model effectively improves PPI prediction accuracy and AUC by combining the natural language processing model InferSent with the gene ontology.
The invention can be applied not only to proteins but also to other entities annotated with an ontology. In addition, the sentence encoder of the natural language processing model InferSent can be replaced without affecting the implementation of the overall model; users can select a suitable sentence encoder according to their needs.
Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.

Claims (1)

1. A method for prediction of protein-protein interactions based on gene ontology, comprising the steps of:
S1, the GO ontology is constructed as a graph, in which GO terms serve as nodes and the relationships between GO terms serve as edges; GO term word vectors are obtained from the GO graph structure file go.owl using the Onto2Vec technique;
S2, GO annotations are created by associating a gene or gene product with GO terms; each GO annotation record, with its corresponding weight, is screened and extracted from the GOA file and organized to generate the GO annotation axioms;
S3, combining the GO annotation axioms from step S2, the proteins of the protein-interaction positive and negative dataset are replaced, line by line, with the GO terms annotated to them, obtaining the final training data;
S4, an InferSentPPI model is constructed based on InferSent; combining the GO term word vectors from step S1, the training data from step S3 is iteratively trained on the InferSentPPI model, finally obtaining a model for predicting PPIs and outputting the PPI prediction results;
the step S1 further includes the steps of:
S1.1, the GO graph structure records are taken from the go.owl file, each record consisting of GO unique identification codes and their relation words; the records are organized into a file to obtain the GO structure axiom file;
s1.2, inputting the GO structure axiom file in the step S1.1 into a skip-gram model of Word2vec line by line;
S1.3, training is carried out in the skip-gram model as follows:
Given a sequence of training words $w_1, w_2, \ldots, w_T$, the objective of the skip-gram model is to maximize

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t)$$

where $c$ is the size of the training context window, $T$ is the size of the training word set, and $w_i$ is the $i$-th training word in the sequence;
S1.4, after training is finished, the word vectors of the GO terms are obtained and organized into an output file;
the step S2 further includes the steps of:
S2.1, each record of the GOA is screened according to the Evidence Code field of the gene ontology annotation file to be processed; records whose Evidence Code field is "IEA" or "ND" are deleted to obtain the screened GOA file; the UniProtKB unique identification code and GO unique identification code of each record line of the screened GOA file are extracted to obtain the GO annotation record file, in which repeated records are not deleted: the number of repetitions represents the number of effective references of the annotation record and serves as the weight of the corresponding annotation record;
S2.2, for the GO annotation record file from step S2.1, each UniProtKB unique identification code and all its corresponding GO unique identification codes are extracted and gathered on the same line, and the lines are organized into a file to obtain the GO annotation axiom file;
the step S3 further includes the steps of:
S3.1, the pair of proteins recorded in each line of the protein-protein interaction positive and negative dataset is extracted and mapped to two UniProtKB unique identification codes; protein pairs whose proteins cannot be mapped to UniProtKB unique identification codes are deleted; an attribute label "positive" or "negative" is generated for the corresponding protein pair according to the nature of the dataset, and the protein pairs and attribute labels are organized into a PPI record file, each line of which consists of two UniProtKB unique identification codes and an attribute label;
S3.2, using the gene ontology annotation axioms from step S2.2, the proteins in the PPI record file from step S3.1 are replaced, line by line, with the GO unique identification codes annotated to them, obtaining the PPI corpus for training the model;
S3.3, from the PPI corpus in step S3.2, 80%, 10% and 10% are randomly selected as the training set, validation set and test set, forming the final training data;
the step S4 further includes the steps of:
S4.1, the InferSent model is modified: its sentence encoder is set to a convolutional neural network and its classifier is set to two classes with labels "positive" and "negative", where "positive" represents PPI positive and "negative" represents PPI negative, obtaining the InferSentPPI model;
S4.2, combining the GO term word vectors from step S1, the training data from step S3 is iteratively trained on the InferSentPPI model from step S4.1;
the iterative training in step S4.2 comprises the following steps:
S4.2-1, the two sets of GO unique identification codes extracted row by row from the training set are input, as sentence A and sentence B, into two sentence encoders; the word vectors used by the sentence encoders are the GO term word vectors, the sentence encoders use convolutional neural networks, and the generated sentence vectors u and v are the protein vectors u and v;
S4.2-2, from the sentence vectors u and v of step S4.2-1, the concatenation (u, v), the element-wise product u*v, and the absolute difference |u-v| are computed, and the resulting (u, v, u*v, |u-v|) is fed to a binary classifier consisting of several fully-connected layers and a softmax layer, finally obtaining the predicted probability distribution over the labels "positive" and "negative" for sentence A and sentence B of step S4.2-1;
S4.2-3, the error between the training-set labels and the predicted probability distribution over "positive" and "negative" from step S4.2-2 is minimized;
S4.2-4, steps S4.2-1 to S4.2-3 are repeated until all training-set data have been iterated once;
S4.2-5, the formula for the predicted PPI is as follows:
InferSentPPI(a, b) = P(positive) > P(negative) ? positive : negative
S4.2-6, prediction is performed on the validation set: if the validation result is worse than the previous one, the model is not saved; if it is better, the model is saved; the learning rate is then adjusted, and training stops when the learning rate falls below the set minimum; while the learning rate remains above the minimum, steps S4.2-1 to S4.2-4 are repeated for the next round of iterative training; training also stops when the number of iterations reaches the set maximum;
S4.3, after the iterative training finishes, the best-performing model for predicting PPIs is obtained, where PPI refers to the PPIs of proteins annotated by the gene ontology;
S4.4, the PPI prediction model from step S4.3 predicts on the test set, and the test-set prediction results are organized into an output file.
CN202011085576.4A 2020-10-12 2020-10-12 Protein-protein interaction prediction method based on sentence embedding Infersent model Withdrawn CN112185457A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011085576.4A CN112185457A (en) 2020-10-12 2020-10-12 Protein-protein interaction prediction method based on sentence embedding Infersent model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011085576.4A CN112185457A (en) 2020-10-12 2020-10-12 Protein-protein interaction prediction method based on sentence embedding Infersent model

Publications (1)

Publication Number Publication Date
CN112185457A true CN112185457A (en) 2021-01-05

Family

ID=73949329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011085576.4A Withdrawn CN112185457A (en) 2020-10-12 2020-10-12 Protein-protein interaction prediction method based on sentence embedding Infersent model

Country Status (1)

Country Link
CN (1) CN112185457A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115565607A (en) * 2022-10-20 2023-01-03 抖音视界有限公司 Method, device, readable medium and electronic equipment for determining protein information
CN115565607B (en) * 2022-10-20 2024-02-23 抖音视界有限公司 Method, device, readable medium and electronic equipment for determining protein information

Similar Documents

Publication Publication Date Title
Varma et al. Snuba: Automating weak supervision to label training data
CN113707235B (en) Drug micromolecule property prediction method, device and equipment based on self-supervision learning
CN109697285B (en) Hierarchical BilSt Chinese electronic medical record disease coding and labeling method for enhancing semantic representation
CN108984724B (en) Method for improving emotion classification accuracy of specific attributes by using high-dimensional representation
Nadif et al. Unsupervised and self-supervised deep learning approaches for biomedical text mining
CN110362723B (en) Topic feature representation method, device and storage medium
CN114450751A (en) System and method for training a machine learning algorithm to process biologically relevant data, microscope and trained machine learning algorithm
CN110032739A (en) Chinese electronic health record name entity abstracting method and system
CN109036577A (en) Diabetic complication analysis method and device
CN111582506A (en) Multi-label learning method based on global and local label relation
CN116312915B (en) Method and system for standardized association of drug terms in electronic medical records
CN113591955A (en) Method, system, equipment and medium for extracting global information of graph data
Iqbal et al. A dynamic weighted tabular method for convolutional neural networks
CN117370736A (en) Fine granularity emotion recognition method, electronic equipment and storage medium
Wang et al. Fusang: a framework for phylogenetic tree inference via deep learning
Zhou et al. Review for Handling Missing Data with special missing mechanism
CN112185457A (en) Protein-protein interaction prediction method based on sentence embedding Infersent model
CN111259176B (en) Cross-modal Hash retrieval method based on matrix decomposition and integrated with supervision information
CN117436522A (en) Biological event relation extraction method and large-scale biological event relation knowledge base construction method of cancer subject
Zaghir et al. Real-world patient trajectory prediction from clinical notes using artificial neural networks and UMLS-based extraction of concepts
WO2024058915A1 (en) Classification using a machine learning model trained with triplet loss
Louati et al. Design and compression study for convolutional neural networks based on evolutionary optimization for thoracic X-Ray image classification
Zhu et al. Uni-Fold MuSSe: De Novo Protein Complex Prediction with Protein Language Models
Ranjan et al. MCWS-transformers: towards an efficient modeling of protein sequences via multi context-window based scaled self-attention
CN115116549A (en) Cell data annotation method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210105