CN112185457A - Protein-protein interaction prediction method based on sentence embedding Infersent model - Google Patents

Protein-protein interaction prediction method based on sentence embedding Infersent model Download PDF

Info

Publication number
CN112185457A
Authority
CN
China
Prior art keywords
ppi
training
model
protein
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202011085576.4A
Other languages
Chinese (zh)
Inventor
江莹莹
李美晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Maritime University filed Critical Shanghai Maritime University
Priority to CN202011085576.4A priority Critical patent/CN112185457A/en
Publication of CN112185457A publication Critical patent/CN112185457A/en
Withdrawn legal-status Critical Current

Classifications

    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding


Abstract



The invention discloses a method for predicting protein-protein interactions (PPIs) based on a sentence embedding Infersent model, which combines the natural language processing model Infersent with the Gene Ontology to predict PPI. The method includes combining the GO graph structure to obtain GO term word vectors; screening and extracting Gene Ontology Annotation (GOA) files to generate GO annotation axioms; and training on a PPI positive and negative data set by combining the GO annotation axioms and GO term word vectors on the sentence embedding Infersent model, finally obtaining a model that predicts PPI. Aiming at the problem that current PPI prediction accuracy and AUC are not high enough, the invention proposes a new PPI prediction method that improves prediction accuracy and AUC.


Description

Protein-protein interaction prediction method based on sentence embedding Infersent model
Technical Field
The invention relates to the fields of bioinformatics and natural language processing, in particular to the application of the gene ontology and a sentence embedding model in the field of protein-protein interaction (PPI) prediction.
Background
Protein-protein interactions (PPIs) underpin many bioinformatics applications such as protein function prediction and drug discovery. Accurate prediction of protein-protein interactions therefore helps us understand the underlying molecular mechanisms and significantly facilitates drug discovery. PPIs can be predicted more accurately with Gene Ontology (GO) information. Most previous studies that use gene ontology information to predict PPI have relied on Information Content (IC). Recently, some studies have used word embedding techniques from the field of natural language processing to learn vectors representing GO terms and proteins in order to predict PPIs.
The gene ontology is a standardized vocabulary for biological function annotation, a uniform terminology used to describe the functions of homologous genes and gene products across species. The invention uses a supervised sentence embedding technique to capture GO structure and GO annotation information to predict PPI. By combining the gene ontology with powerful natural language processing techniques, the method provides a general computational pipeline to predict protein-protein interactions even without using protein sequence information.
Disclosure of Invention
The invention aims to provide a protein-protein interaction prediction method based on a sentence embedding Infersent model, which predicts protein-protein interactions (PPIs) by combining the natural language processing model Infersent with the Gene Ontology (GO). In the method, each record of the GO annotation axioms has a corresponding weight; a PPI positive and negative data set is trained on the sentence embedding Infersent model by combining the GO annotation axioms and GO structure axioms, finally obtaining a model for predicting PPI.
In order to achieve the purpose, the invention is realized by the following technical scheme:
A method for predicting protein-protein interactions based on a sentence embedding Infersent model, comprising the steps of:
S1. The ontology of GO is constructed as a graph in which GO terms are the nodes and the relationships between GO terms are the edges. A GO structure axiom file is extracted and generated from the GO graph structure file using the existing Onto2Vec technique, and the GO structure axioms are trained to obtain GO term word vectors;
S2. Screening and extracting annotation axioms: each GO annotation record with a corresponding weight is screened and extracted from a Gene Ontology Annotation (GOA) file to generate the GO annotation axioms;
S3. Combining the GO annotation axioms of step S2, the proteins of the PPI positive and negative data set are replaced line by line with the GO terms annotating them to obtain the final training data;
S4. The Infersent model is modified into an InfersentPPI model; combining the GO term word vectors of step S1, the training data of step S3 are iteratively trained on the InfersentPPI model to finally obtain a PPI prediction model and output the PPI prediction result.
Preferably, the step S1 further includes the steps of:
S1.1. Extract the GO graph structure records from the go.owl file and organize them into a GO structure axiom file;
S1.2. Input the GO structure axiom file of step S1.1 line by line into the skip-gram model of Word2vec;
S1.3. Train in the skip-gram model as follows:
Given a sequence of training words w_1, w_2, ..., w_T, the objective of the skip-gram model is to maximize

(1/T) * Σ_{t=1}^{T} Σ_{-c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)

where c is the size of the training context window, T is the size of the training word sequence, and w_t is the t-th training word;
S1.4. After training, organize the resulting GO term word vectors into a file for output;
Preferably, the step S2 further includes the steps of:
S2.1. Each record of the Gene Ontology Annotation (GOA) file to be processed is screened according to the content of its Evidence Code field; records whose Evidence Code field is 'IEA' or 'ND' are deleted to obtain the filtered GOA file. The UniProtKB unique identifier and GO unique identifier recorded on each line of the filtered GOA file are extracted to obtain the GO annotation record file. Duplicate records in the GO annotation record file are not deleted; the number of repetitions represents the number of valid references of the annotation record and can serve as the weight of the corresponding record;
S2.2. For each UniProtKB unique identifier in the GO annotation record file of step S2.1, all corresponding GO unique identifiers are extracted and gathered on the same line; the lines are organized into a file to obtain the GO annotation axiom file;
Preferably, the step S3 further includes the steps of:
S3.1. The pair of proteins recorded on each line of a protein-protein interaction (PPI) positive and negative data set is extracted and mapped to two UniProtKB unique identifiers; pairs containing a protein that cannot be mapped are deleted. The attribute label 'positive' or 'negative' of each protein pair is generated according to the nature of the data set, and the pairs and labels are organized into a PPI record file in which each line consists of two UniProtKB unique identifiers and an attribute label;
S3.2. Using the gene ontology annotation axioms of step S2, the proteins of the PPI record file of step S3.1 are replaced line by line with the GO unique identifiers annotating them, obtaining the PPI corpus of the training model;
S3.3. From the PPI corpus of step S3.2, 80%, 10% and 10% are randomly selected as the training set, validation set and test set, forming the final training data.
Preferably, the step S4 further includes the steps of:
S4.1. The Infersent model is modified: its sentence encoder is set to a convolutional neural network and its classifier is set to two classes with labels 'positive' and 'negative', obtaining the InfersentPPI model;
S4.2. Combining the GO term word vectors of step S1, the training data of step S3 are iteratively trained on the InfersentPPI model of step S4.1.
Preferably, the iterative training in step S4.2 comprises the steps of:
S4.2-1. The two sets of GO unique identifiers extracted row by row from the training set are input as sentence A and sentence B into two sentence encoders; the word vectors used by the encoders are the GO term word vectors, the encoders use convolutional neural networks, and the generated sentence vectors u and v are the protein vectors u and v;
S4.2-2. Using the sentence vectors u and v of step S4.2-1, their concatenation (u, v), elementwise product u*v and absolute difference |u-v| are computed; the result (u, v, u*v, |u-v|) is fed into a two-class classifier composed of several fully connected layers and a softmax layer, finally yielding the predicted probability distributions over the labels 'positive' and 'negative' for sentence A and sentence B;
S4.2-3. The error between the training-set labels and the predicted probability distributions of step S4.2-2 is minimized;
S4.2-4. Steps S4.2-1 to S4.2-3 are repeated until all training-set data have been iterated over once;
s4.2-5, the formula of the predicted PPI is as follows:
InfersentPPI(a,b)=P(positive)>P(negative)?positive:negative
S4.2-6. Prediction is performed on the validation set. If the validation result is worse than the previous one, the model is not stored; if it is better, the model is stored. The learning rate is then adjusted: training stops when the learning rate falls below the set minimum, otherwise steps S4.2-1 to S4.2-4 are repeated for the next round of iterative training. Training also stops when the number of iterations reaches the set maximum;
S4.3. After the iterative training is finished, the best-performing PPI prediction model is obtained;
S4.4. The PPI prediction model of step S4.3 predicts on the test set, and the test-set prediction results are organized into a file for output;
preferably, in step S4.1, the "positive" classification represents PPI positive and the "negative" classification represents PPI negative, and in step S4.3, the predicted PPI is the PPI prediction of the protein annotated by the gene ontology.
Compared with the prior art, the invention has the beneficial effects that: the method for predicting protein-protein interactions based on the sentence embedding Infersent model effectively improves PPI prediction accuracy and AUC by combining the natural language processing model Infersent with the gene ontology.
Drawings
Fig. 1 is a general flow chart of the operation of the present invention, divided into 4 modules: Onto2Vec; screening and extraction of annotation axioms; combination processing; and InfersentPPI;
FIG. 2 is a specific implementation of Onto2Vec generating GO vectors in the present invention;
FIG. 3 is a schematic flow chart of screening and extracting annotation axioms in the present invention;
FIG. 4 is a specific implementation of the InfersentPPI model of the present invention;
FIG. 5 is a specific implementation of the sentence encoder of the InfersentPPI model of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1-5, the present invention provides a method for predicting protein-protein interactions based on a sentence-embedded Infersent model (detailed below with PPI positive-negative data as an example), comprising the steps of:
Step S1: the ontology of GO is constructed as a graph, where GO terms are the nodes and the relationships (also called object properties) between GO terms are the edges; the general GO information is available in the go.owl file. Using the existing Onto2Vec technique, GO structure axioms are extracted and generated from the GO graph structure file go.owl and trained to obtain GO term word vectors. Each GO structure axiom is a statement that a GO term (except the root term of each aspect) has a subclass relation with another GO term;
step S2, filtering and extracting annotation axioms: screening and extracting each GO annotation record with corresponding weight in a Gene Ontology Annotation (GOA) file to generate a GO annotation axiom;
Step S3: combining the GO annotation axioms of step S2, the proteins of the PPI positive and negative data set are replaced line by line with the GO terms annotating them to obtain the final training data;
Step S4: the Infersent model is modified into an InfersentPPI model; combining the GO term word vectors of step S1, the training data of step S3 are iteratively trained on the InfersentPPI model to finally obtain a PPI prediction model, and the PPI prediction result is output.
As shown in fig. 2, the step S1 further includes the following steps:
S1.1. The GO graph structure records are extracted from the go.owl file; each record is composed of GO unique identifiers and the relation words between GO terms (e.g., SubClassOf, DisjointWith). The records are organized into a file to obtain the GO structure axiom file, as shown in Table 1:
table 1 is an example of the contents of the GO structural axiom file
[Table 1 appears as an image in the original publication and is not reproduced.]
S1.2, inputting the GO structure axiom file in the step S1.1 into a skip-gram model of Word2vec line by line;
s1.3, training in a skip-gram model as follows:
Given a sequence of training words w_1, w_2, ..., w_T, the objective of the skip-gram model is to maximize:

(1/T) * Σ_{t=1}^{T} Σ_{-c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)

where c is the size of the training context window, T is the size of the training word sequence, and w_t is the t-th training word;
S1.4. After training, the resulting GO term word vectors are organized into a file for output.
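The skip-gram objective above sums log-probabilities over (center word, context word) pairs drawn from a window of size c. The patent specifies no code, so the following is only an illustrative Python sketch (with a hypothetical helper name and a made-up axiom line) of how those training pairs are enumerated:

```python
def skipgram_pairs(sentence, c):
    """Yield (w_t, w_{t+j}) training pairs for -c <= j <= c, j != 0."""
    pairs = []
    for t, center in enumerate(sentence):
        for j in range(-c, c + 1):
            if j == 0:
                continue  # skip the center word itself
            if 0 <= t + j < len(sentence):
                pairs.append((center, sentence[t + j]))
    return pairs

# One GO structure axiom treated as a tokenized sentence (illustrative):
axiom = ["GO:0000003", "SubClassOf", "GO:0008150"]
print(skipgram_pairs(axiom, c=1))
# [('GO:0000003', 'SubClassOf'), ('SubClassOf', 'GO:0000003'),
#  ('SubClassOf', 'GO:0008150'), ('GO:0008150', 'SubClassOf')]
```

In a real run these pairs would be consumed by a Word2vec skip-gram implementation to produce the GO term word vectors.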
as shown in fig. 3, the step S2 further includes the following steps:
S2.1. Each record of the Gene Ontology Annotation (GOA) file to be processed is screened according to the content of its Evidence Code field. The Evidence Code is the valid evidence code of a GO annotation; records whose Evidence Code field is 'IEA' or 'ND' are deleted to obtain the filtered GOA file. The ND evidence code is used for an annotation when no information is available about the molecular function, biological process or cellular component of the annotated gene or gene product. IEA-supported annotations are ultimately based on homology and/or other experimental or sequence information, but generally cannot be traced back to an experimental source. The UniProtKB unique identifier and GO unique identifier recorded on each line of the filtered GOA file are extracted to obtain the GO annotation record file. Duplicate records in the GO annotation record file are not deleted; the number of repetitions represents the number of valid references of the annotation record and can serve as the weight of the corresponding record;
specific examples are shown in table 2 and table 3:
table 2 is an example of the contents of a GOA file
[Table 2 appears as an image in the original publication and is not reproduced.]
Table 3 is an example of the contents of the GO annotation record file
UniProtKB ID Relation GO ID
A2P2R3 hasFunction GO:0006047
D6VTK4 hasFunction GO:0000750
D6VTK4 hasFunction GO:0000750
S2.2. For each UniProtKB unique identifier in the GO annotation record file of step S2.1, all corresponding GO unique identifiers are extracted and gathered on the same line; the lines are organized into a file to obtain the GO annotation axiom file;
specific examples are shown in table 4:
table 4 is an example of the contents of the GO annotated axiom file
UniProtKB ID GO ID
A2P2R3 GO:0006002;GO:0006047
D6VTK4 GO:0000750;GO:0000750
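Steps S2.1 and S2.2 can be sketched in a few lines of Python. This is an illustration only, not the patent's implementation: the function name is hypothetical, the in-memory record format is assumed, and the evidence codes IDA/IMP/IGI are merely example codes that survive the filter. The identifiers match Tables 3 and 4:

```python
def build_annotation_axioms(goa_records):
    """goa_records: iterable of (uniprot_id, go_id, evidence_code) tuples."""
    axioms = {}
    for uniprot_id, go_id, evidence in goa_records:
        if evidence in ("IEA", "ND"):  # step S2.1: drop these evidence codes
            continue
        # Step S2.2: gather all GO IDs of one protein; duplicates are kept,
        # since their count is the weight of the annotation record.
        axioms.setdefault(uniprot_id, []).append(go_id)
    return axioms

records = [
    ("A2P2R3", "GO:0006002", "IDA"),
    ("A2P2R3", "GO:0006047", "IDA"),
    ("D6VTK4", "GO:0000750", "IMP"),
    ("D6VTK4", "GO:0000750", "IGI"),  # repeated record -> weight 2
    ("D6VTK4", "GO:0000001", "IEA"),  # filtered out
]
print(build_annotation_axioms(records))
# {'A2P2R3': ['GO:0006002', 'GO:0006047'], 'D6VTK4': ['GO:0000750', 'GO:0000750']}
```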
The step S3 further includes the steps of:
S3.1. The pair of proteins recorded on each line of the protein-protein interaction (PPI) positive and negative data set is extracted and mapped to two UniProtKB unique identifiers; pairs containing a protein that cannot be mapped to a UniProtKB identifier are deleted. The attribute label 'positive' or 'negative' of each protein pair is generated according to the nature of the data set, where 'positive' means PPI positive and 'negative' means PPI negative. The protein pairs and attribute labels are organized into a PPI record file in which each line consists of two UniProtKB unique identifiers and an attribute label;
specific examples are shown in table 5:
table 5 is an example of the contents of the PPI record file
ProteinA ProteinB Tag
P16649 P14922 positive
P07269 P22035 positive
P53248 P32366 negative
Q08558 P31412 negative
Q06169 P41807 negative
S3.2. Using the gene ontology annotation axioms of step S2, the proteins of the PPI record file of step S3.1 are replaced line by line with the GO unique identifiers annotating them, obtaining the PPI corpus of the training model;
specific examples are shown in table 6:
table 6 is an example of the contents of a PPI corpus
[Table 6 appears as an image in the original publication and is not reproduced.]
S3.3. From the PPI corpus of step S3.2, 80%, 10% and 10% are randomly selected as the training set, validation set and test set, forming the final training data.
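Steps S3.1 to S3.3 can be sketched as follows. This is a minimal illustration under assumed in-memory inputs (function names and the toy annotation dictionary are hypothetical; only the protein/GO identifiers are taken from the tables above):

```python
import random

def build_ppi_corpus(ppi_pairs, axioms):
    """Replace each protein of a labelled pair with its GO annotation axiom;
    drop pairs whose proteins have no annotation (step S3.1-S3.2)."""
    corpus = []
    for prot_a, prot_b, tag in ppi_pairs:
        if prot_a in axioms and prot_b in axioms:
            corpus.append((axioms[prot_a], axioms[prot_b], tag))
    return corpus

def split_corpus(corpus, seed=0):
    """Random 80/10/10 split into train/validation/test (step S3.3)."""
    rng = random.Random(seed)
    shuffled = corpus[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

axioms = {"P16649": ["GO:0000329", "GO:0005739"],
          "P14922": ["GO:0005783", "GO:0006633"]}
pairs = [("P16649", "P14922", "positive"),
         ("P16649", "P99999", "negative")]  # unmapped protein -> pair dropped
print(build_ppi_corpus(pairs, axioms))
```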
The step S4 further includes the steps of:
S4.1. The Infersent model is modified: its sentence encoder is set to a convolutional neural network and its classifier is set to two classes with labels 'positive' and 'negative', obtaining the InfersentPPI model;
S4.2. Combining the GO term word vectors of step S1, the training data of step S3 are iteratively trained on the InfersentPPI model of step S4.1;
as shown in fig. 4, the iterative training in step S4.2 includes the following steps:
S4.2-1. The two sets of GO unique identifiers extracted row by row from the training set are input as sentence A and sentence B into two sentence encoders. The word vectors used by the encoders are the GO term word vectors, and the encoders use convolutional neural networks, as shown in FIG. 5. The generated sentence vectors u and v are the protein vectors u and v;
S4.2-2. Using the sentence vectors u and v of step S4.2-1, their concatenation (u, v), elementwise product u*v and absolute difference |u-v| are computed. The combined result (u, v, u*v, |u-v|) is fed into a two-class classifier composed of several fully connected layers and a softmax layer, finally yielding the predicted probability distributions over the labels 'positive' and 'negative' for sentence A and sentence B of step S4.2-1;
S4.2-3. The error between the training-set labels and the predicted probability distributions over 'positive' and 'negative' of step S4.2-2 is minimized;
s4.2-4, repeating the steps S4.2-1 to S4.2-3 until the data of all the training sets are iterated once;
s4.2-5, the formula of the predicted PPI is as follows:
InfersentPPI(a,b)=P(positive)>P(negative)?positive:negative
For example, one input to InfersentPPI is sentences a and b, the UniProtKB unique identifiers of protein A and protein B:
sentence a: P16649;
sentence b: P14922;
The proteins are then replaced line by line with the GO unique identifiers annotating them according to step S3.2, so the test data become the word sets GOs1 and GOs2:
GOs1: {GO_0000329, GO_0005739, GO_0005739, GO_0006623, GO_0022857, GO_0055085}
GOs2: {GO_0005783, GO_0006633, GO_0006892, GO_0009922, GO_0009922, GO_0019367, GO_0030148, GO_0030148, GO_0030176, GO_0030497, GO_0032511, GO_0034625, GO_0034626, GO_0042761, GO_0042761}
Finally, P(positive) and P(negative) of sentences a and b are calculated according to the formula in step S4.2, giving 0.724 and 0.276 respectively, so InfersentPPI(a, b) evaluates to positive.
S4.2-6. Prediction is performed on the validation set. If the validation result is worse than the previous one, the model is not stored; if it is better, the model is stored. The learning rate is then adjusted: training stops when the learning rate falls below the set minimum, otherwise steps S4.2-1 to S4.2-4 are repeated for the next round of iterative training. Training also stops when the number of iterations reaches the set maximum;
s4.3, obtaining a model for predicting the PPI with the best effect after the iterative training is finished;
S4.4. The PPI prediction model of step S4.3 predicts on the test set, and the test-set prediction results are organized into a file for output; the model trained with the parameter Batch_size = 2 achieved the best prediction effect on the test set.
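The validation-driven control flow of step S4.2-6 can be sketched as a small loop. Everything concrete here is an assumption for illustration: the hyperparameters, the multiplicative learning-rate decay, and the stubbed per-epoch evaluate function are not taken from the patent:

```python
def training_loop(evaluate, max_epochs=20, lr=0.1, min_lr=1e-5, decay=0.2):
    """Keep the best model on validation; decay lr on no improvement;
    stop when lr < min_lr or the epoch budget is exhausted."""
    best_score, best_epoch = float("-inf"), -1
    for epoch in range(max_epochs):
        score = evaluate(epoch, lr)  # train one epoch, return validation score
        if score > best_score:
            best_score, best_epoch = score, epoch  # "store the model"
        else:
            lr *= decay              # worse validation result -> shrink lr
            if lr < min_lr:
                break                # lr fell below the set minimum
    return best_epoch, best_score

# Toy validation curve: improves for three epochs, then plateaus.
scores = [0.60, 0.70, 0.75, 0.74, 0.74, 0.74, 0.74, 0.74, 0.74]
best_epoch, best = training_loop(lambda e, lr: scores[min(e, len(scores) - 1)])
print(best_epoch, best)  # best model came from epoch 2 with score 0.75
```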
in step S4.1, the "positive" classification represents PPI positive and the "negative" classification represents PPI negative, and in step S4.3, the predicted PPI is the PPI prediction of the protein annotated by the gene ontology.
In conclusion, the method for predicting protein-protein interaction based on the sentence-embedded Infersent model provided by the invention effectively improves the accuracy and AUC of PPI prediction by means of natural language processing model Infersent in combination with gene ontology.
The invention can be applied not only to proteins but also to other entities annotated with an ontology. In addition, the sentence encoder of the natural language processing model Infersent can be replaced without affecting the implementation of the overall model; the user can select a suitable sentence encoder as required.
Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.

Claims (1)

1.一种基于基因本体的蛋白质-蛋白质相互作用预测的方法,其特征在于,包含以下步骤:1. A method for protein-protein interaction prediction based on gene ontology, characterized in that, comprising the following steps: S1、GO的本体被构造成一个图,其中GO术语作为图中的节点,GO术语之间的关系称为边。使用Onto2Vec技术,从GO图结构文件go.owl中得到GO术语词向量;The ontology of S1 and GO is constructed as a graph, in which GO terms are used as nodes in the graph, and the relationships between GO terms are called edges. Using Onto2Vec technology, the GO term word vector is obtained from the GO graph structure file go.owl; S2、是通过将基因或基因产物与GO术语相关联来创建GO注释;在GOA文件中筛选提取有相应权重的每条GO注释记录,组织生成GO注释公理;S2. Create GO annotations by associating genes or gene products with GO terms; filter and extract each GO annotation record with corresponding weight in the GOA file, and organize and generate GO annotation axioms; S3、结合步骤S1中的所述GO注释公理,将蛋白质相互作用阳性阴性数据集的蛋白质逐行替换为注释它的GO术语,得到最终的训练数据;S3. Combined with the GO annotation axiom in step S1, replace the proteins of the protein interaction positive and negative data set with the GO terms that annotate it line by line to obtain the final training data; S4、构建基于Infersent的InfersentPPI模型,结合步骤S2中的所述GO术语词向量,在InfersentPPI模型上对步骤S3中的所述训练数据进行迭代训练,最终得到预测PPI的模型,输出PPI预测结果;S4, construct the InfersentPPI model based on Infersent, combine the described GO term word vector in step S2, carry out iterative training to the described training data in step S3 on InfersentPPI model, finally obtain the model of predicting PPI, output PPI prediction result; 所述步骤S1进一步包含以下步骤:The step S1 further includes the following steps: S1.1、取出go.owl文件中的GO图结构记录,每条GO图结构记录由多个GO唯一标识码与其关系词组成,GO图结构记录组织成文件,得到GO结构公理文件;S1.1. Take out the GO graph structure record in the go.owl file. Each GO graph structure record is composed of multiple GO unique identifiers and related words. 
The GO graph structure records are organized into files, and the GO structure axiom file is obtained; S1.2、将步骤S1.1中的所述GO结构公理文件逐行输入Word2vec的skip-gram模型;S1.2, input the GO structure axiom file in step S1.1 line by line into the skip-gram model of Word2vec; S1.3、在Skip-gram模型中进行训练,如下:S1.3, train in the Skip-gram model, as follows: 给定一个序列的训练单词x1,x2,.....,x3,Skip-gram模型的目的是最大化下列公式:Given a sequence of training words x 1 , x 2 , ....., x 3 , the purpose of the Skip-gram model is to maximize the following formula:
Figure FDA0002720222180000011
Figure FDA0002720222180000011
其中c是训练上下文窗口的大小,T是训练词集合的大小,wi是序列中的第i个训练词;where c is the size of the training context window, T is the size of the training word set, and w i is the ith training word in the sequence; S1.4、训练结束得到GO术语的词向量组织成文件输出;S1.4. After training, the word vector of GO terms obtained is organized into file output; 所述步骤S2进一步包含以下步骤:The step S2 further includes the following steps: S2.1、根据待处理基因本体论注释文件的Evidence Code字段内容,对GOA的每条记录进行筛选,删除Evidence Code字段内容为‘IEA’或’ND’的记录,得到筛选后的GOA文件,提取出筛选后的GOA文件的每一行记录的UniProtKB唯一标识码与GO唯一标识码,得到GO注释记录文件,GO注释记录文件中重复的记录不删除,重复的次数代表这条注释记录的有效引用的数量,可作为对应注释记录的权重;S2.1. According to the content of the Evidence Code field of the Gene Ontology annotation file to be processed, filter each record of the GOA, delete the records whose Evidence Code field content is 'IEA' or 'ND', and obtain the filtered GOA file, Extract the UniProtKB unique identification code and GO unique identification code recorded in each line of the filtered GOA file, and obtain the GO annotation record file. The repeated records in the GO annotation record file are not deleted, and the number of repetitions represents the valid reference of this annotation record. The number of , which can be used as the weight of the corresponding annotation record; S2.2、提取步骤S1.2中的所述GO注释记录文件的相同UniProtKB唯一标识码以及对应的所有GO唯一标识码,将其集中在同一行,组织成文件,得到GO注释公理文件;S2.2, extracting the same UniProtKB unique identification code and all corresponding GO unique identification codes of the GO annotation record file in step S1.2, gather them in the same line, organize them into files, and obtain the GO annotation axiom file; 所述步骤S3进一步包含以下步骤:The step S3 further includes the following steps: S3.1、提取出蛋白质-蛋白质相互作用阳性阴性数据集每一行记录的一对蛋白质,映射为两个UniProtKB唯一标识码,无法映射为UniProtKB唯一标识码的蛋白质将其所在的蛋白质对进行删除,根据数据集的性质生成对应蛋白质对的属性标签’positive’或’negative’,蛋白质对与属性标签组织成PPI记录文件,该PPI记录文件中每一行的内容是由两个UniProtKB唯一标识码与属性标签组成;S3.1. 
Extract the pair of proteins recorded in each line of the positive/negative protein-protein interaction dataset and map them to two UniProtKB unique identifiers; protein pairs containing a protein that cannot be mapped to a UniProtKB unique identifier are deleted. Generate the attribute label 'positive' or 'negative' for each protein pair according to the nature of the dataset, and organize the protein pairs and attribute labels into a PPI record file, in which each line consists of two UniProtKB unique identifiers and an attribute label;
S3.2. Using the GO annotation axiom file from step S2, replace each protein in the PPI record file of step S3.1, line by line, with the GO unique identifiers that annotate it, to obtain the PPI corpus for training the model;
S3.3. From the PPI corpus of step S3.2, randomly select 80%, 10%, and 10% as the training set, validation set, and test set, respectively, to form the final training data.
Step S4 further comprises the following steps:
S4.1.
Adapt the Infersent model: the sentence encoder of the Infersent model is set to a convolutional neural network, and the classifier of the Infersent model is set to binary classification with the labels 'positive' and 'negative', where 'positive' denotes a PPI-positive pair and 'negative' denotes a PPI-negative pair, yielding the InfersentPPI model;
S4.2. Combining the word vectors of the GO terms from step S1, iteratively train the InfersentPPI model of step S4.1 on the training data of step S3.
The iterative training in step S4.2 comprises the following steps:
S4.2-1. The two sets of GO unique identifiers extracted line by line from the training set are input, as sentence A and sentence B, into two sentence encoders. The word vectors used by the sentence encoders are the word vectors of the GO terms, and the sentence encoders use a convolutional neural network; the generated sentence vectors u and v are the protein vectors u and v;
S4.2-2.
Using the sentence vectors u and v from step S4.2-1, concatenate u and v to obtain (u, v), compute the element-wise product of u and v to obtain u*v, and compute the element-wise absolute difference to obtain |u-v|; finally, feed the combined result (u, v, u*v, |u-v|) into a binary classifier composed of multiple fully connected layers and a softmax layer, obtaining the predicted probability distribution over the labels 'positive' and 'negative' for sentence A and sentence B of step S4.2-1;
S4.2-3. Minimize the error between the labels of the training set and the predicted probability distribution over the labels 'positive' and 'negative' from step S4.2-2;
S4.2-4. Repeat steps S4.2-1 to S4.2-3 until all data in the training set has been iterated over once;
S4.2-5. The formula for predicting a PPI is as follows:
InfersentPPI(a, b) = P(positive) > P(negative) ? positive : negative
S4.2-6. Predict on the validation set. If the validation result is worse than the previous validation result, stop training and do not save the model; if it is better, save the model and adjust the learning rate. Stop training when the learning rate falls below the configured minimum learning rate; while it remains above the minimum, repeat steps S4.2-1 to S4.2-4 for the next round of iterative training. When the number of iterations reaches the configured maximum, stop training;
S4.3.
At the end of the iterative training, the best-performing model for predicting PPIs is obtained, where PPI refers to the PPI of proteins annotated by the Gene Ontology;
S4.4. The PPI prediction model of step S4.3 performs prediction on the test set, and the prediction results on the test set are organized into a file for output.
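The skip-gram objective of step S1.3 sums log p(w_{t+j} | w_t) over a window of size c around each training word. As a minimal pure-Python sketch (the patent itself trains Word2vec on the GO structure axiom file; the example line and GO identifiers below are only illustrative), this is the set of (center, context) pairs that objective ranges over:

```python
from typing import List, Tuple

def skipgram_pairs(words: List[str], c: int) -> List[Tuple[str, str]]:
    """Enumerate the (w_t, w_{t+j}) pairs with -c <= j <= c, j != 0,
    that the skip-gram objective sums log p(w_{t+j} | w_t) over."""
    pairs = []
    T = len(words)  # size of the training word sequence
    for t in range(T):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= t + j < T:
                pairs.append((words[t], words[t + j]))
    return pairs

# Illustrative line of a GO structure axiom file (a "term is_a term" record):
line = ["GO:0006915", "is_a", "GO:0012501"]
print(skipgram_pairs(line, c=1))
```

With c=1 each word is paired only with its immediate neighbours; larger c widens the context window at the cost of more training pairs per line.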
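Steps S2.1 and S2.2 filter GOA records by Evidence Code and then gather all GO identifiers of the same protein onto one line. A sketch under simplified assumptions (real GAF files carry many more columns; the rows and identifiers here are hypothetical):

```python
from collections import defaultdict

# Hypothetical minimal GOA rows: (UniProtKB id, GO id, Evidence Code).
goa = [
    ("P12345", "GO:0005515", "EXP"),
    ("P12345", "GO:0005515", "IDA"),  # repeat is kept: count acts as a weight
    ("P12345", "GO:0006915", "IEA"),  # dropped: inferred electronically
    ("Q67890", "GO:0003674", "ND"),   # dropped: no biological data available
    ("Q67890", "GO:0005634", "IMP"),
]

# S2.1: delete 'IEA'/'ND' records; duplicates stay (GO annotation record file).
records = [(p, go) for p, go, ev in goa if ev not in ("IEA", "ND")]

# S2.2: gather the GO ids of each protein onto one line (GO annotation axiom file).
axioms = defaultdict(list)
for p, go in records:
    axioms[p].append(go)

for p, gos in axioms.items():
    print(p, " ".join(gos))
```

Note that P12345 keeps GO:0005515 twice: per step S2.1, the repetition count is the number of valid references and serves as the annotation's weight.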
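The 80/10/10 split of step S3.3 can be sketched as a seeded shuffle followed by slicing (the seed and helper name are illustrative, not from the patent):

```python
import random

def split_corpus(lines, seed=0):
    """Step S3.3 sketch: shuffle the PPI corpus and split it 80/10/10
    into training, validation, and test sets."""
    lines = list(lines)
    random.Random(seed).shuffle(lines)  # seeded for reproducibility
    n = len(lines)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return (lines[:n_train],
            lines[n_train:n_train + n_val],
            lines[n_train + n_val:])

train, val, test = split_corpus(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```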
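Steps S4.2-2 and S4.2-5 combine the two protein vectors into the classifier input (u, v, u*v, |u-v|) and decide by comparing the two class probabilities. A minimal sketch of just those two operations (the actual model uses CNN sentence encoders and a fully-connected softmax classifier, which are omitted here):

```python
def combine(u, v):
    """Build the classifier input (u, v, u*v, |u-v|) from the two
    protein (sentence) vectors of step S4.2-1, as in step S4.2-2."""
    prod = [a * b for a, b in zip(u, v)]       # element-wise product u*v
    diff = [abs(a - b) for a, b in zip(u, v)]  # element-wise |u - v|
    return u + v + prod + diff                 # concatenation

def predict(p_positive: float, p_negative: float) -> str:
    """Step S4.2-5: InfersentPPI(a,b) = P(positive) > P(negative) ? positive : negative."""
    return "positive" if p_positive > p_negative else "negative"

u, v = [1.0, -2.0], [3.0, 0.5]
print(combine(u, v))      # [1.0, -2.0, 3.0, 0.5, 3.0, -1.0, 2.0, 2.5]
print(predict(0.7, 0.3))  # positive
```

The product term captures agreement between the two vectors and the absolute difference captures divergence, which is why InferSent-style classifiers feed both alongside the raw concatenation.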
CN202011085576.4A 2020-10-12 2020-10-12 Protein-protein interaction prediction method based on sentence embedding Infersent model Withdrawn CN112185457A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011085576.4A CN112185457A (en) 2020-10-12 2020-10-12 Protein-protein interaction prediction method based on sentence embedding Infersent model


Publications (1)

Publication Number Publication Date
CN112185457A (en) 2021-01-05

Family

ID=73949329



Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115565607A (en) * 2022-10-20 2023-01-03 抖音视界有限公司 Method, device, readable medium and electronic equipment for determining protein information
CN115565607B (en) * 2022-10-20 2024-02-23 抖音视界有限公司 Method, device, readable medium and electronic equipment for determining protein information


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210105