CN112562784B - Protein function prediction method combining multitask learning and self-attention mechanism - Google Patents

Protein function prediction method combining multitask learning and self-attention mechanism

Info

Publication number
CN112562784B
Authority
CN
China
Prior art keywords
layer
self
attention
protein
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011467595.3A
Other languages
Chinese (zh)
Other versions
CN112562784A (en)
Inventor
杨跃东
黄伟林
赵慧英
卢宇彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202011467595.3A priority Critical patent/CN112562784B/en
Publication of CN112562784A publication Critical patent/CN112562784A/en
Application granted granted Critical
Publication of CN112562784B publication Critical patent/CN112562784B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a protein function prediction method combining multitask learning and self-attention mechanisms, which comprises the following steps: according to the molecular function class MF prediction task, the biological process class BP prediction task and the cellular component class CC prediction task, a protein function prediction system model based on multi-task learning and a self-attention mechanism is constructed; a sample data set is acquired, characteristic information of the protein sequences in the sample data set is extracted, and a training set and a test set are constructed; the training set is preprocessed and then input into the protein function prediction system model to train it; the test set is preprocessed and then input into the trained protein function prediction system model to predict protein function. According to the application, the prediction of the three ontologies is regarded as three prediction tasks, and prediction is performed by establishing a protein function prediction system model based on multi-task learning and a self-attention mechanism, so that the accuracy of protein function prediction is improved.

Description

Protein function prediction method combining multitask learning and self-attention mechanism
Technical Field
The present application relates to the field of protein function prediction, and more particularly, to a protein function prediction method combining multitask learning and self-attention mechanisms.
Background
Protein function prediction is a very important task in the field of biology and plays a key role in new drug development, understanding of pathology and other areas. Functional annotation of proteins was initially carried out mainly through in vivo or in vitro experiments, but such experiments are time-consuming and expensive and cannot keep pace with current high-throughput sequencing technology, so computation-based methods, by virtue of their low cost and high speed, have gradually become an important research direction.
Protein functions can be labeled by the Gene Ontology (GO), which contains over 40000 functional terms and can be divided into three major categories: molecular function (Molecular Function, MF), biological process (Biological Process, BP) and cellular component (Cellular Component, CC). Thus, protein function prediction can be seen as a large-scale multi-label classification problem.
Current computation-based methods mainly fall into the following three categories. The first extracts features from protein sequences and performs functional classification (e.g., by deep learning). The second uses BLAST, PSI-BLAST or Diamond software to search the training set for sequences similar to each query protein and transfers annotations based on sequence similarity. However, only about 1% of the protein sequences in the UniProt database are experimentally annotated, so it may be difficult to find annotated proteins with high sequence similarity to many of the unlabeled proteins. Both of the former categories use protein sequences as model input, but because the functional classes are large in number and organized in a complex hierarchical structure, constructing a mapping that accurately reveals the complex relationships between protein sequences and functions is difficult. The third category therefore uses other protein metadata such as protein structure, protein-protein interaction networks, gene expression, genetic interactions and biomedical literature. Experiments have shown that combining such auxiliary information can further improve prediction performance. However, a large number of newly discovered proteins lack sufficient auxiliary information and only their sequence information is available, so accurate sequence-based prediction methods are of greater interest.
In the past, two training schemes have mainly been adopted: one divides the labels according to the three categories and builds three models that are trained independently; the other combines all GO categories into one final label set and builds a single model for training. Considering that the GO categories can be divided into three major branches (MF, BP and CC), on the one hand the three branches describe protein function from different angles, i.e. the functions in different branches possess different characteristics; on the other hand, there are also certain semantic relationships (such as is_a, part_of, etc.) between the functions of different branches. Thus, protein function prediction can also be viewed as a classification problem over three different tasks. Multi-task learning can improve the prediction performance and generalization ability of a model through hard parameter sharing, soft parameter sharing, hierarchical sharing and other schemes. However, little work has focused on multi-task learning methods for more accurately predicting protein function.
In the prior art, Chinese patent publication No. CN106126972A, published on November 16, 2016, discloses a hierarchical multi-label classification method for protein function prediction, which includes the following steps: 1. training phase: a data set for each node in the class-label hierarchy is trained with an SVM classifier to obtain a group of base classifiers; 2. prediction phase: the group of base classifiers obtained in the training phase is first used to obtain a preliminary result for an unknown sample, and a weighted TPR algorithm is then used to process the result to obtain a final result satisfying the hierarchy constraint, thereby realizing protein function prediction. Although this scheme can, to a certain extent, solve the multi-label problem that exists when existing classification methods are used for protein function prediction, it does not adopt a multi-task learning method and it is difficult for it to predict protein function accurately, so a method that predicts protein function by combining multi-task learning and a self-attention mechanism is urgently needed.
Disclosure of Invention
The application provides a protein function prediction method combining a multi-task learning and self-attention mechanism, which aims to solve the problem that the prior art lacks the use of a multi-task learning method and is difficult to accurately predict the protein function.
The primary purpose of the application is to solve the above technical problems, and the technical scheme of the application is as follows:
a method of protein function prediction combining multitasking and self-attention mechanisms, comprising the steps of:
s1: according to the molecular function type MF prediction task, the biological process type BP prediction task and the cell component type CC prediction task, a protein function prediction system model based on a multi-task learning and self-attention mechanism is constructed;
s2: acquiring a sample data set, extracting characteristic information of a protein sequence in the sample data set, and constructing a training set and a testing set;
s3: the training set is preprocessed and then is input into a protein function prediction system model, and the protein function prediction system model is trained;
s4: and preprocessing the test set, inputting the test set into a trained protein function prediction system model, and predicting the protein function.
In the above scheme, the Gene Ontology (GO) can be divided into three main categories (i.e., MF, BP and CC), each with its own characteristics, so they can be regarded as three different prediction tasks for multi-task learning; according to these three prediction tasks, a protein function prediction system model based on multi-task learning and self-attention mechanisms is constructed and then trained, and after training is completed, protein function prediction is carried out.
Preferably, the step S1 specifically includes:
S101: constructing an MF sub-network based on a self-attention mechanism according to the molecular function class MF prediction task;
S102: constructing a BP sub-network based on a self-attention mechanism according to the biological process class BP prediction task;
S103: constructing a CC sub-network based on a self-attention mechanism according to the cellular component class CC prediction task;
s104: cross-stitch units are arranged among the MF sub-network, the BP sub-network and the CC sub-network, so that connection and parameter sharing among the sub-networks are realized, and a protein function prediction system model based on multi-task learning and self-attention mechanisms is constructed.
In the above scheme, independent sub-networks (an MF sub-network, a BP sub-network and a CC sub-network) having the same structure are constructed for each prediction task, and constraints are applied to parameters of each sub-network by the cross-stitch unit.
Preferably, each of the MF sub-network, the BP sub-network and the CC sub-network comprises a one-dimensional convolution layer, a residual convolution layer, a multi-head self-attention layer and a full connection layer; wherein: the input of the one-dimensional convolution layer is used as the input of the protein function prediction system model; the one-dimensional convolution layer output end is connected with the residual error convolution layer input end; the residual convolution layer output end is connected with the multi-head self-attention layer input end; the multi-head self-attention layer output end is connected with the full-connection layer input end; the output of the full-connection layer is used as the output of the protein function prediction system model; cross stitch units are arranged between the one-dimensional convolution layer and the residual convolution layer and between the multi-head self-attention layer and the full-connection layer.
In the above scheme, after the characteristic information of the protein (L×84, where L is the maximum protein length) is input into each sub-network, it is encoded by the residual convolution layer to extract abstract features of the protein (obtaining a feature map of size L×d, where d is the number of convolution kernels in each convolution layer), which is then taken as the input of the multi-head self-attention layer, in which self-attention learning is performed using 20 attention heads; the model thus obtains, through the preceding modules, a feature matrix of size 20×d, and finally the fully connected layer of each ontology's sub-network predicts the propensity for each protein function. In addition, parameter sharing is realized among the three sub-networks by using cross-stitch units, which apply constraints to the sub-network parameters after the one-dimensional convolution layer and after the multi-head self-attention layer.
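By way of non-limiting illustration, the following minimal PyTorch sketch walks through the tensor shapes of one such sub-network; the number of convolution kernels (64), the number of GO classes (677) and the stand-in attention weights are assumptions chosen only to make the shapes concrete and are not values fixed by the application.

```python
import torch
import torch.nn as nn

L, F, D, HEADS, CLASSES = 2000, 84, 64, 20, 677     # D and CLASSES are illustrative assumptions

x = torch.randn(1, F, L)                             # one protein: L x 84 features, channels-first
conv = nn.Conv1d(F, D, kernel_size=3, padding=1)
h = conv(x)                                           # (1, D, L): per-residue abstract features
# ... residual convolution blocks would keep the (1, D, L) shape (see the sketch further below) ...
A = torch.softmax(torch.randn(1, HEADS, L), dim=-1)   # stand-in attention weights, each row sums to 1
m = A @ h.transpose(1, 2)                             # (1, HEADS, D): the 20 x d protein embedding
logits = nn.Linear(HEADS * D, CLASSES)(m.flatten(1))  # propensity for each GO term of the ontology
print(h.shape, m.shape, logits.shape)
```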
Preferably, the residual convolution layer comprises a number of one-dimensional residual convolution blocks, wherein: a cross stitch unit is arranged between each one-dimensional residual convolution block; the input end of the first one-dimensional residual convolution block is connected with the output end of the one-dimensional convolution layer; a cross stitch unit is arranged between the last one-dimensional residual convolution block and the multi-head self-attention layer, and the output end of the one-dimensional residual convolution block is connected with the input end of the multi-head self-attention layer.
In the above scheme, a cross stitch unit is used to apply constraints to the sub-network parameters after each one-dimensional residual convolution block.
Preferably, as shown in fig. 3, each one-dimensional residual convolution block has the same structure and includes a first convolution layer, a first normalization layer, a first activation layer, a second convolution layer, a second normalization layer, and a second activation layer; wherein: the first convolution layer input end receives the input of the one-dimensional residual convolution block; the first convolution layer output end is connected with the first normalization layer input end; the first normalization layer output end is connected with the first activation layer input end; the output end of the first activation layer is connected with the input end of the second convolution layer; the output end of the second convolution layer is connected with the input end of the second normalization layer; and the second activation layer input end receives the input of the one-dimensional residual convolution block and the output of the second normalization layer.
In the scheme, each one-dimensional residual convolution block has the same structure and consists of two convolution layers, two activation layers and two normalization layers; a normalization layer is used before each activation layer to improve training speed. The one-dimensional residual convolution block may be defined as y = σ(F(x) + x), where x and y respectively represent the input and output of the one-dimensional residual convolution block, F(·) represents the residual mapping function, and σ(·) represents the activation function.
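A minimal PyTorch sketch of such a block is given below; using BatchNorm1d as the normalization layer and ReLU as the activation function are illustrative assumptions, since the application does not fix these choices.

```python
import torch
import torch.nn as nn

class ResidualBlock1D(nn.Module):
    """One-dimensional residual convolution block: y = act(F(x) + x), where F is
    conv -> norm -> act -> conv -> norm and the skip connection is added before
    the second activation.  BatchNorm1d/ReLU are assumed choices."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.norm1 = nn.BatchNorm1d(channels)
        self.act1 = nn.ReLU()
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.norm2 = nn.BatchNorm1d(channels)
        self.act2 = nn.ReLU()

    def forward(self, x):                             # x: (batch, channels, L)
        out = self.norm2(self.conv2(self.act1(self.norm1(self.conv1(x)))))
        return self.act2(out + x)                     # residual addition, then final activation


block = ResidualBlock1D(64)
y = block(torch.randn(2, 64, 2000))                   # shape is preserved: (2, 64, 2000)
```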
Preferably, the multi-head self-attention layer normalizes through the softmax function so that the weights over the positions of the protein sequence sum to 1, and an attention weight matrix A is obtained for judging the importance of each position in the protein sequence: the larger the proportion a position accounts for, the higher its attention weight. This is specifically expressed as:

M = A·H,  A = softmax(W_s2 · tanh(W_s1 · H^T))

where M represents the output of the multi-head self-attention layer; A represents the attention weight matrix; H represents the feature matrix taken as the input of the multi-head self-attention layer; and W_s1 and W_s2 represent weight matrices.
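A minimal PyTorch sketch of this multi-head self-attention layer under the above formulation; the hidden dimension of W_s1 and the feature sizes in the usage example are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Structured self-attention: A = softmax(W_s2 * tanh(W_s1 * H^T)), M = A * H.
    Each of the num_heads rows of A attends to a different part of the sequence.
    hidden_dim is an assumed intermediate size."""
    def __init__(self, feature_dim, num_heads=20, hidden_dim=32):
        super().__init__()
        self.w_s1 = nn.Linear(feature_dim, hidden_dim, bias=False)    # W_s1
        self.w_s2 = nn.Linear(hidden_dim, num_heads, bias=False)      # W_s2

    def forward(self, h):                        # h: (batch, L, feature_dim)
        scores = self.w_s2(torch.tanh(self.w_s1(h)))   # (batch, L, num_heads)
        a = torch.softmax(scores, dim=1)               # weights over positions sum to 1
        a = a.transpose(1, 2)                          # (batch, num_heads, L)
        m = a @ h                                      # (batch, num_heads, feature_dim)
        return m, a                                    # protein embedding and attention weights


attn = MultiHeadSelfAttention(feature_dim=64, num_heads=20)
m, a = attn(torch.randn(1, 2000, 64))
print(m.shape, a.shape)                          # torch.Size([1, 20, 64]) torch.Size([1, 20, 2000])
```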
In the above scheme, the self-attention mechanism is widely applied in the field of natural language processing, making sentence embeddings intuitive and interpretable; in addition, the self-attention mechanism can calculate the importance of each word within the whole sentence and perform a weighted summation of word features according to the attention paid to each word, thereby generating the final sentence embedding. Proteins with the same function often contain the same motifs, and since motifs can be associated with specific functions, learning sequence motifs can be useful for predicting protein function. Accordingly, the present approach uses the multi-head self-attention layer to learn motifs, attending independently to different parts of the protein residue features so as to obtain a better protein embedding representation. Preferably, the cross-stitch unit is configured to learn weights among the three prediction tasks and to compute according to these weights, so as to obtain a better feature map, which is specifically expressed as:
[x̃^MF_{i,j}; x̃^BP_{i,j}; x̃^CC_{i,j}] = [α_MM, α_MB, α_MC; α_BM, α_BB, α_BC; α_CM, α_CB, α_CC] · [x^MF_{i,j}; x^BP_{i,j}; x^CC_{i,j}]

where x^MF, x^BP and x^CC represent the feature maps input to the cross-stitch unit; x̃^MF, x̃^BP and x̃^CC represent the feature maps output by the cross-stitch unit; the matrix of α values represents the cross-stitch unit; (i, j) represents the position in the feature map; α_MM, α_BB and α_CC represent the weight of each task on itself; and α_MB, α_MC, α_BM, α_BC, α_CM and α_CB represent the weights shared between tasks.
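A minimal PyTorch sketch of such a three-task cross-stitch unit; the near-identity initialization of the 3×3 mixing matrix is an assumed (though common) choice, not a value specified by the application.

```python
import torch
import torch.nn as nn

class CrossStitch3(nn.Module):
    """Cross-stitch unit for three tasks (MF, BP, CC): at every feature-map
    position the outputs are learned linear combinations of the three task
    inputs, x_tilde = alpha @ x, with a single learnable 3x3 matrix alpha."""
    def __init__(self):
        super().__init__()
        init = 0.9 * torch.eye(3) + 0.05 * (torch.ones(3, 3) - torch.eye(3))
        self.alpha = nn.Parameter(init)               # assumed near-identity initialization

    def forward(self, x_mf, x_bp, x_cc):              # each: (batch, channels, L)
        x = torch.stack([x_mf, x_bp, x_cc], dim=0)    # (3, batch, channels, L)
        mixed = torch.einsum('st,t...->s...', self.alpha, x)
        return mixed[0], mixed[1], mixed[2]


cs = CrossStitch3()
x_mf, x_bp, x_cc = (torch.randn(1, 64, 2000) for _ in range(3))
mf, bp, cc = cs(x_mf, x_bp, x_cc)                     # mixed feature maps, same shapes
```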
In the above scheme, each channel of a feature map encodes different input features, so to make full use of the cross-stitch unit, independent α parameters are applied to different channels. The weights shared between tasks, which are difficult to set manually for a particular problem, can thus be learned automatically, resulting in better feature representations. The results show that, in multi-task learning, using the cross-stitch unit gives better prediction performance than not using it.
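Following the channel-wise idea above, the 3×3 mixing matrix of the previous sketch can be given one learnable copy per feature-map channel; the variant below is only an illustrative assumption of how that might look.

```python
import torch
import torch.nn as nn

class ChannelwiseCrossStitch3(nn.Module):
    """Cross-stitch variant with one learnable 3x3 mixing matrix per channel,
    so every channel learns its own task-sharing weights."""
    def __init__(self, channels):
        super().__init__()
        eye = torch.eye(3).expand(channels, 3, 3).clone()
        self.alpha = nn.Parameter(0.9 * eye + 0.05 * (1 - eye))   # (channels, 3, 3)

    def forward(self, x_mf, x_bp, x_cc):              # each: (batch, channels, L)
        x = torch.stack([x_mf, x_bp, x_cc], dim=0)    # (3, batch, channels, L)
        mixed = torch.einsum('cst,tbcl->sbcl', self.alpha, x)
        return mixed[0], mixed[1], mixed[2]
```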
Preferably, in said step S2, the characteristic information extracted from the protein sequence comprises the amino acid sequence Seq, the position-specific scoring matrix PSSM, the sequence profile HMM of the hidden Markov model, and the structural information SPIDER3 of the protein predicted by SPIDER3.
In the above scheme, the characteristic information of the protein sequence includes the amino acid sequence, the position-specific scoring matrix, the sequence profile of the hidden Markov model and the structural information of the protein predicted by SPIDER3, denoted "Seq", "PSSM", "HMM" and "SPIDER3" respectively, wherein: Seq is the one-hot encoding of the amino acid sequence and can be represented as an L×20 matrix, where L is the sequence length; PSSM is generated for each protein by running PSI-BLAST against the UniRef90 database with 3 search iterations and can be represented as an L×20 matrix (when the feature cannot be generated by PSI-BLAST, a BLOSUM matrix is used instead); HMM is the sequence profile of the hidden Markov model generated by running HHblits v3.0.0 (an open-source toolkit for sequence search and alignment) with default parameters against the Uniclust30 database (built by clustering UniProtKB protein sequences at 30% sequence identity) and can be represented as an L×30 matrix; SPIDER3, generated by the SPIDER3 software, contains structural information of the protein; the inputs of the SPIDER3 software include the protein sequence and the PSSM and HMM features obtained by PSI-BLAST and HHblits, and its output can be divided into four parts: the solvent accessible surface area (ASA), the sine and cosine values of four backbone angles (i.e., θ, τ, φ and ψ), the hemispherical exposures (HSE) and the predicted probabilities of three secondary structures (i.e., α-helix, β-sheet and random coil), giving 14 structural features that can be represented as an L×14 matrix. After these features are generated, 84 (20+20+30+14) per-residue features are obtained as the input of the protein function prediction system model; considering that most protein sequences in the dataset are shorter than 2000 residues, each sequence is zero-padded to a length of exactly 2000, giving a 2000×84 matrix as the input feature of the model.
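A minimal sketch of how these per-residue feature blocks might be assembled into the 2000×84 model input, assuming the PSSM, HMM and SPIDER3 blocks have already been computed externally (by PSI-BLAST, HHblits and SPIDER3) and loaded as NumPy arrays; the function names here are hypothetical.

```python
import numpy as np

AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'
MAX_LEN = 2000

def one_hot(seq):
    """L x 20 one-hot encoding (Seq) of an amino acid sequence."""
    mat = np.zeros((len(seq), 20), dtype=np.float32)
    for i, aa in enumerate(seq):
        if aa in AMINO_ACIDS:
            mat[i, AMINO_ACIDS.index(aa)] = 1.0
    return mat

def build_input(seq, pssm, hmm, spider3, max_len=MAX_LEN):
    """Concatenate Seq (L x 20), PSSM (L x 20), HMM (L x 30) and SPIDER3 (L x 14)
    into an L x 84 matrix and zero-pad (or truncate) to max_len x 84.
    pssm, hmm and spider3 are assumed to be precomputed NumPy arrays."""
    feats = np.concatenate([one_hot(seq), pssm, hmm, spider3], axis=1)   # (L, 84)
    out = np.zeros((max_len, 84), dtype=np.float32)
    length = min(len(seq), max_len)
    out[:length] = feats[:length]
    return out
```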
Preferably, in the step S3, the protein function prediction system model is trained with the following loss function:

Loss = Σ_{i=1}^{T} [ Σ_{j=1}^{N} Σ_{k=1}^{C} BCE(y_{jk}^{i}, ŷ_{jk}^{i}) + λ·P_i ],  with P_i = ‖A_i·A_i^T − I‖_F^2

where T represents the number of tasks; N represents the number of samples; C represents the number of GO categories considered; y_{jk}^{i} represents the label value of functional class k in sample j of task i; ŷ_{jk}^{i} represents the predicted probability of functional class k in sample j of task i; BCE(·) represents the binary cross-entropy cost function; A_i represents the attention weight matrix of task i; I represents the identity matrix; ‖·‖_F represents the Frobenius norm of a matrix; P_i represents the penalty term on the multiple attention heads of task i; and λ represents the penalty coefficient.
In the above scheme, the penalty term P_i is added to the loss function to enhance the difference between the multiple attention heads, so as to obtain a better protein feature representation.
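A minimal PyTorch sketch of this loss under the above formulation; the predictions are assumed to be sigmoid probabilities, and the value of the penalty coefficient λ and the 'sum' reduction are assumptions not fixed by the application.

```python
import torch
import torch.nn.functional as F

def multitask_loss(preds, labels, attn_weights, penalty_coef=0.01):
    """Sum over the three tasks of binary cross-entropy plus the attention
    penalty P_i = ||A_i A_i^T - I||_F^2, which encourages diverse attention heads.
    preds/labels: lists of (batch, n_classes_i) tensors with values in [0, 1];
    attn_weights: list of (batch, num_heads, L) tensors; penalty_coef is lambda."""
    total = preds[0].new_zeros(())
    for y_hat, y, a in zip(preds, labels, attn_weights):
        bce = F.binary_cross_entropy(y_hat, y, reduction='sum')
        eye = torch.eye(a.size(1), device=a.device)
        penalty = ((a @ a.transpose(1, 2) - eye) ** 2).sum(dim=(1, 2)).sum()
        total = total + bce + penalty_coef * penalty
    return total
```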
Preferably, in the steps S3 and S4, the preprocessing is z-score normalization.
In this scheme, common normalization methods include min-max scaling, function transformation, z-score and the like; z-score normalization is adopted in the preprocessing process.
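A minimal sketch of z-score normalization of the input features; computing the statistics on the training set only and reusing them for the test set is an assumed (but common) convention.

```python
import numpy as np

def zscore_normalize(train_feats, test_feats, eps=1e-8):
    """z-score normalization: (x - mean) / std per feature dimension, with the
    mean and standard deviation estimated on the training set and applied
    unchanged to the test set.
    train_feats/test_feats: arrays of shape (num_samples, max_len, 84)."""
    mean = train_feats.mean(axis=(0, 1), keepdims=True)
    std = train_feats.std(axis=(0, 1), keepdims=True) + eps
    return (train_feats - mean) / std, (test_feats - mean) / std
```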
Compared with the prior art, the technical scheme of the application has the beneficial effects that:
according to the application, the prediction of the three ontologies is regarded as three prediction tasks, and the prediction is performed by establishing a protein function prediction system model based on a multi-task learning and self-attention mechanism, so that the accuracy of protein function prediction is improved.
Drawings
FIG. 1 is a flow chart of the method of the present application;
FIG. 2 is a schematic diagram of the structure of the protein function prediction system model;
FIG. 3 is a schematic diagram of the one-dimensional residual convolution block structure;
FIG. 4 shows, for the training sample "P17121" in Example 1, the attention scores learned by the model of the method of the application with protein sequence input only (A) and by the model of the method of the application (B), together with the corresponding motif found by MAST (C);
FIG. 5 shows, for the test sample "T96060014484" in Example 1, the attention scores learned by the model of the method of the application with protein sequence input only (A) and by the model of the method of the application (B), together with the corresponding motif found by MAST (C).
Detailed Description
In order that the above-recited objects, features and advantages of the present application will be more clearly understood, a more particular description of the application will be rendered by reference to the appended drawings and appended detailed description. It should be noted that, without conflict, the embodiments of the present application and features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application, however, the present application may be practiced in other ways than those described herein, and therefore the scope of the present application is not limited to the specific embodiments disclosed below.
Example 1
As shown in fig. 1, a protein function prediction method combining multitasking learning and self-attention mechanisms includes the following steps:
s1: according to the molecular function type MF prediction task, the biological process type BP prediction task and the cell component type CC prediction task, a protein function prediction system model based on a multi-task learning and self-attention mechanism is constructed;
s2: acquiring a sample data set, extracting characteristic information of a protein sequence in the sample data set, and constructing a training set and a testing set;
s3: the training set is preprocessed and then is input into a protein function prediction system model, and the protein function prediction system model is trained;
s4: and preprocessing the test set, inputting the test set into a trained protein function prediction system model, and predicting the protein function.
Further, the step S1 specifically includes:
S101: constructing an MF sub-network based on a self-attention mechanism according to the molecular function class MF prediction task;
S102: constructing a BP sub-network based on a self-attention mechanism according to the biological process class BP prediction task;
S103: constructing a CC sub-network based on a self-attention mechanism according to the cellular component class CC prediction task;
s104: cross-stitch units are arranged among the MF sub-network, the BP sub-network and the CC sub-network, so that connection and parameter sharing among the sub-networks are realized, and a protein function prediction system model based on multi-task learning and self-attention mechanisms is constructed.
As shown in fig. 2, further, each of the MF sub-network, the BP sub-network, and the CC sub-network includes a one-dimensional convolution layer, a residual convolution layer, a multi-head self-attention layer, and a full connection layer; wherein: the input of the one-dimensional convolution layer is used as the input of the protein function prediction system model; the one-dimensional convolution layer output end is connected with the residual error convolution layer input end; the residual convolution layer output end is connected with the multi-head self-attention layer input end; the multi-head self-attention layer output end is connected with the full-connection layer input end; the output of the full-connection layer is used as the output of the protein function prediction system model; cross stitch units are arranged between the one-dimensional convolution layer and the residual convolution layer and between the multi-head self-attention layer and the full-connection layer.
Wherein, flattening treatment is carried out between the multi-head self-attention layer and the full-connection layer.
Further, the residual convolution layer comprises a plurality of one-dimensional residual convolution blocks, wherein: a cross stitch unit is arranged between each one-dimensional residual convolution block; the input end of the first one-dimensional residual convolution block is connected with the output end of the one-dimensional convolution layer; a cross stitch unit is arranged between the last one-dimensional residual convolution block and the multi-head self-attention layer, and the output end of the one-dimensional residual convolution block is connected with the input end of the multi-head self-attention layer.
Further, each one-dimensional residual convolution block has the same structure and comprises a first convolution layer, a first normalization layer, a first activation layer, a second convolution layer, a second normalization layer and a second activation layer; wherein: the first convolution layer input end receives the input of the one-dimensional residual convolution block; the first convolution layer output end is connected with the first normalization layer input end; the first normalization layer output end is connected with the first activation layer input end; the output end of the first activation layer is connected with the input end of the second convolution layer; the output end of the second convolution layer is connected with the input end of the second normalization layer; and the second activation layer input end receives the input of the one-dimensional residual convolution block and the output of the second normalization layer.
Further, the multi-head self-attention layer normalizes through the softmax function so that the weights over the positions of the protein sequence sum to 1, and an attention weight matrix A is obtained for judging the importance of each position in the protein sequence: the larger the proportion a position accounts for, the higher its attention weight. This is specifically expressed as:

M = A·H,  A = softmax(W_s2 · tanh(W_s1 · H^T))

where M represents the output of the multi-head self-attention layer; A represents the attention weight matrix; H represents the feature matrix taken as the input of the multi-head self-attention layer; and W_s1 and W_s2 represent weight matrices. Further, the cross-stitch unit is configured to learn weights among the three prediction tasks and to compute according to these weights, so as to obtain a better feature map, which is specifically expressed as:
[x̃^MF_{i,j}; x̃^BP_{i,j}; x̃^CC_{i,j}] = [α_MM, α_MB, α_MC; α_BM, α_BB, α_BC; α_CM, α_CB, α_CC] · [x^MF_{i,j}; x^BP_{i,j}; x^CC_{i,j}]

where x^MF, x^BP and x^CC represent the feature maps input to the cross-stitch unit; x̃^MF, x̃^BP and x̃^CC represent the feature maps output by the cross-stitch unit; the matrix of α values represents the cross-stitch unit; (i, j) represents the position in the feature map; α_MM, α_BB and α_CC represent the weight of each task on itself; and α_MB, α_MC, α_BM, α_BC, α_CM and α_CB represent the weights shared between tasks.
Further, in said step S2, the characteristic information extracted from the protein sequence comprises the amino acid sequence Seq, the position-specific scoring matrix PSSM, the sequence profile HMM of the hidden Markov model, and the structural information SPIDER3 of the protein predicted by SPIDER3.
Further, in the step S3, the protein function prediction system model is trained with the following loss function:

Loss = Σ_{i=1}^{T} [ Σ_{j=1}^{N} Σ_{k=1}^{C} BCE(y_{jk}^{i}, ŷ_{jk}^{i}) + λ·P_i ],  with P_i = ‖A_i·A_i^T − I‖_F^2

where T represents the number of tasks; N represents the number of samples; C represents the number of GO categories considered; y_{jk}^{i} represents the label value of functional class k in sample j of task i; ŷ_{jk}^{i} represents the predicted probability of functional class k in sample j of task i; BCE(·) represents the binary cross-entropy cost function; A_i represents the attention weight matrix of task i; I represents the identity matrix; ‖·‖_F represents the Frobenius norm of a matrix; P_i represents the penalty term on the multiple attention heads of task i; and λ represents the penalty coefficient.
Further, in the steps S3 and S4, the preprocessing is z-score normalization.
In the present embodiment, F_max, S_min and AUPR are used as indices for evaluating model performance (the larger F_max and AUPR and the smaller S_min, the more accurate the prediction results of the model). To prevent overfitting, 10% of the training data is randomly held out as a validation set, training of the neural network is stopped early according to it, and hyper-parameter tuning is then performed according to the prediction results on the validation set. The protein function prediction system model is implemented on the deep learning framework PyTorch and runs on Ubuntu Linux 16 with an NVIDIA GP102 GPU.
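For reference, a simplified sketch of the protein-centric F_max computation used in CAFA-style evaluation; this version assumes dense label and score matrices and omits the GO-hierarchy propagation and the information-content weighting that S_min additionally requires.

```python
import numpy as np

def fmax(y_true, y_score, thresholds=np.linspace(0.01, 1.0, 100)):
    """Protein-centric Fmax: precision is averaged only over proteins with at
    least one prediction above the threshold, recall over all proteins.
    y_true: (n_proteins, n_terms) binary matrix; y_score: predicted scores."""
    best = 0.0
    for t in thresholds:
        pred = y_score >= t
        tp = (pred & (y_true == 1)).sum(axis=1)
        has_pred = pred.sum(axis=1) > 0
        if not has_pred.any():
            continue
        precision = (tp[has_pred] / pred.sum(axis=1)[has_pred]).mean()
        recall = (tp / np.maximum(y_true.sum(axis=1), 1)).mean()
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```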
In this embodiment, two sample data sets, the CAFA3 data set and the SwissProt 2016 data set, were downloaded from http://deepgoplus.bio2vec.net/data/. Since the prediction performance on test proteins with low similarity to the training set is of greater concern, Diamond (e-value of 0.001) was used on both data sets to remove test samples similar to the training set. Table 1 summarizes the sizes of the training and test sets and the numbers of GO terms for the three ontologies (MF, BP and CC) in the CAFA3 and SwissProt 2016 data sets after removal of the similar samples.
TABLE 1
(I) Model comparison
In this example, three comparison models were used to verify the effectiveness of the model constructed according to the present application in protein function prediction. Comparison model one: only three independent sub-networks are used for prediction, without parameter sharing. Comparison model two: a hard parameter-sharing mechanism is used, i.e. a shared convolutional neural network serves the three tasks, with each task having its own self-attention module and linear classification layer. Comparison model three: the self-attention module is replaced by averaging, to explore whether the self-attention mechanism improves prediction performance.
As shown in Table 2, the predictions of the method of the application and each comparison model were compared on the CAFA3 and SwissProt 2016 data sets. It can be seen that the model of the application achieves the best average F_max, S_min and AUPR on both data sets. The results show that using a shared network improves prediction performance compared with the network without parameter sharing (the model of the application and comparison model two versus comparison model one). Compared with the hard parameter-sharing strategy, the model of the application improves all three average evaluation indices considerably, illustrating the effectiveness of using cross-stitch units to realize soft parameter sharing in multi-task prediction. At the same time, the results show that removing the self-attention module degrades performance, because the main purpose of the self-attention mechanism is to learn sequence motifs. In summary, the model of the application achieves the best prediction performance thanks to the self-attention mechanism and the cross-stitch units.
TABLE 2
(II) Comparison with other methods
In this example, a comparison was also made on both data sets with other representative sequence-based methods (including Naive, DeepGO-Seq and DeepGOCNN). Because the other comparison methods all take the protein sequence as model input, this example adds a set of comparisons in which the method of the application also takes only the protein sequence as input. Table 3 compares the prediction performance of the various methods on the CAFA3 and SwissProt 2016 test sets. The results show that, with the same features as model input, the method of the application (protein sequence input) is superior to Naive, DeepGO-Seq and DeepGOCNN in the average evaluation indices over the three sub-ontologies on both data sets. Although the prediction performance of the method (protein sequence input) is only slightly better than that of DeepGOCNN, the prediction speed is markedly improved: using a single Intel(R) Xeon(R) E5-2650 v4 CPU and an NVIDIA GP102 GPU, DeepGOCNN can predict the functions of 43 protein sequences per second, while the method (protein sequence input) can annotate 81 protein sequences per second, about twice as fast as DeepGOCNN, indicating that the method is an accurate and rapid protein function prediction method.
Notably, the method of the application achieves a very significant performance improvement when all the feature information (Seq+PSSM+HMM+SPIDER3) is used. On the CAFA3 dataset, the method obtains an average F_max = 0.512, an average S_min = 13.649 and an average AUPR = 0.480, significantly better than DeepGOCNN (0.469, 13.984 and 0.432, respectively) and the other comparison methods; likewise, consistent results were obtained on the SwissProt 2016 dataset. In summary, the method of the present application is a more effective method for predicting protein function.
TABLE 3 Table 3
(III) Self-attention mechanism
Since protein sequences with the same function often contain the same motif, learning the motif helps predict the function of the protein. Thus, the self-attention score and sequence motifs of specific functions are compared in a visual manner to reveal the effectiveness of the self-attention mechanism in protein function prediction.
In this example, a set of protein sequences with a specific function was searched for motifs using MAST, and the motifs found by MAST were then compared with the attention scores obtained by the model of the application. Take the function "enzyme activator activity" (GO:0008047) as an example: in the CAFA3 data set, 470 training proteins and 5 test proteins have this function, and these 475 protein sequences were input into MAST for motif searching. As shown in fig. 4 and fig. 5, for the training sample "P17121" and the test sample "T96060014484" respectively, the attention scores learned by the method of the application with protein sequence input (A) and by the model of the method of the application (B) are shown together with the corresponding motifs found by MAST (C). The results show that the attention scores of both models around the motif regions are obviously higher than in other regions, indicating that the method can effectively learn sequence motifs. It is also noted that, compared with the method of the application with protein sequence input only, the motif region learned by the method of the application with all features is more accurately located and obtains a higher attention score, which means that the protein features used are favorable for motif learning and can improve the accuracy of protein function prediction.
It can be seen that the method of the application regards the prediction of the three ontologies (MF, BP, CC) as three different tasks and predicts them with a multi-task learning method; it exploits both the distinctions and the relations between the three ontologies and realizes parameter sharing with cross-stitch units so as to obtain better embedded representations, thereby improving the accuracy of protein function prediction; and it learns and visualizes sequence motifs through the self-attention mechanism, improving the interpretability of the neural network.
The method can be further popularized to other tasks related to multi-label classification, such as gene function prediction.
The same or similar reference numerals correspond to the same or similar components;
the terms describing the positional relationship in the drawings are merely illustrative, and are not to be construed as limiting the present patent;
it is to be understood that the above examples of the present application are provided by way of illustration only and not by way of limitation of the embodiments of the present application. Other variations or modifications of the above teachings will be apparent to those of ordinary skill in the art. It is not necessary here nor is it exhaustive of all embodiments. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are desired to be protected by the following claims.

Claims (9)

1. A method for predicting protein function combining multitasking and self-attention mechanisms, comprising the steps of:
s1: according to the molecular function type MF prediction task, the biological process type BP prediction task and the cell component type CC prediction task, a protein function prediction system model based on a multi-task learning and self-attention mechanism is constructed;
s2: acquiring a sample data set, extracting characteristic information of a protein sequence in the sample data set, and constructing a training set and a testing set;
the characteristic information of the protein sequence comprises an amino acid sequence, a position specificity scoring matrix, a sequence spectrum of a hidden Markov model and structural information of the protein predicted by SPIDER3, which are respectively marked as 'Seq', 'PSSM', 'HMM' and 'SPIDER 3';
wherein, seq is the one-hot code of the amino acid sequence, which is expressed as a matrix of L×20, L is the sequence length;
PSSM is a feature generated by iterative search of each protein by running PSI-BLAST on UniRef90 database, denoted as a matrix of L×20, L being the sequence length;
HMM is the sequence profile of the hidden Markov model, generated by running HHblits v3.0.0 on the Uniclust30 database and expressed as an L×30 matrix, L being the sequence length;
SPIDER3 is a structural feature generated by SPIDER3 software whose inputs include protein sequences and PSSM and HMM features obtained by PSI-BLAST and HHblits, and whose structural features output by SPIDER3 software include solvent accessible surface area, sine and cosine values of four backbone angles, hemispherical exposures, alpha-helical structures, beta-sheet structures, and random coil structures;
s3: the training set is preprocessed and then is input into a protein function prediction system model, and the protein function prediction system model is trained;
s4: and preprocessing the test set, inputting the test set into a trained protein function prediction system model, and predicting the protein function.
2. The method for predicting protein function in combination with multitasking learning and self-attention mechanism as recited in claim 1, wherein said step S1 is specifically:
S101: constructing an MF sub-network based on a self-attention mechanism according to the molecular function class MF prediction task;
S102: constructing a BP sub-network based on a self-attention mechanism according to the biological process class BP prediction task;
S103: constructing a CC sub-network based on a self-attention mechanism according to the cellular component class CC prediction task;
s104: cross-stitch units are arranged among the MF sub-network, the BP sub-network and the CC sub-network, so that connection and parameter sharing among the sub-networks are realized, and a protein function prediction system model based on multi-task learning and self-attention mechanisms is constructed.
3. The method for predicting protein functions by combining multitasking learning and self-attention mechanisms according to claim 2, wherein each of the MF sub-network, the BP sub-network and the CC sub-network comprises a one-dimensional convolution layer, a residual convolution layer, a multi-headed self-attention layer and a full-connection layer; wherein:
the input of the one-dimensional convolution layer is used as the input of the protein function prediction system model;
the one-dimensional convolution layer output end is connected with the residual error convolution layer input end;
the residual convolution layer output end is connected with the multi-head self-attention layer input end;
the multi-head self-attention layer output end is connected with the full-connection layer input end;
the output of the full-connection layer is used as the output of the protein function prediction system model;
cross stitch units are arranged between the one-dimensional convolution layer and the residual convolution layer and between the multi-head self-attention layer and the full-connection layer.
4. A method of protein function prediction combining multitasking and self-attention mechanisms as claimed in claim 3, wherein said residual convolution layer comprises a number of one-dimensional residual convolution blocks, wherein:
a cross stitch unit is arranged between each one-dimensional residual convolution block;
the input end of the first one-dimensional residual convolution block is connected with the output end of the one-dimensional convolution layer;
a cross stitch unit is arranged between the last one-dimensional residual convolution block and the multi-head self-attention layer, and the output end of the one-dimensional residual convolution block is connected with the input end of the multi-head self-attention layer.
5. The method for predicting protein functionality in combination with a multitasking learning and self-attention mechanism of claim 4, wherein each of said one-dimensional residual convolution blocks has the same structure and comprises a first convolution layer, a first normalization layer, a first activation layer, a second convolution layer, a second normalization layer, and a second activation layer; wherein:
the first convolution layer input end receives the input of the one-dimensional residual convolution block;
the first convolution layer output end is connected with the first normalization layer input end;
the first normalization layer output end is connected with the first activation layer input end;
the output end of the first activation layer is connected with the input end of the second convolution layer;
the output end of the second convolution layer is connected with the input end of the second normalization layer;
and the second activation layer input end receives the input of the one-dimensional residual convolution block and the output of the second normalization layer.
6. A method for protein function prediction in combination with multitasking learning and self-attention mechanisms as in claim 3, wherein the multi-head self-attention layer normalizes through the softmax function so that the weights over the positions of the protein sequence sum to 1, and an attention weight matrix A is obtained for judging the importance of each position in the protein sequence: the larger the proportion a position accounts for, the higher its attention weight; this is specifically expressed as:

M = A·H,  A = softmax(W_s2 · tanh(W_s1 · H^T))

where M represents the output of the multi-head self-attention layer; A represents the attention weight matrix; H represents the feature matrix taken as the input of the multi-head self-attention layer; and W_s1 and W_s2 represent weight matrices.
7. A protein function prediction method combining multi-task learning and self-attention mechanisms according to claim 3, wherein the cross-stitch unit is configured to learn weights among the three prediction tasks and to compute according to these weights, so as to obtain a better feature map, which is specifically expressed as:

[x̃^MF_{i,j}; x̃^BP_{i,j}; x̃^CC_{i,j}] = [α_MM, α_MB, α_MC; α_BM, α_BB, α_BC; α_CM, α_CB, α_CC] · [x^MF_{i,j}; x^BP_{i,j}; x^CC_{i,j}]

where x^MF, x^BP and x^CC represent the feature maps input to the cross-stitch unit; x̃^MF, x̃^BP and x̃^CC represent the feature maps output by the cross-stitch unit; the matrix of α values represents the cross-stitch unit; (i, j) represents the position in the feature map; α_MM, α_BB and α_CC represent the weight of each task on itself; and α_MB, α_MC, α_BM, α_BC, α_CM and α_CB represent the weights shared between tasks.
8. The method according to claim 1, wherein in step S3, the protein function prediction system model is trained by using the following loss function:

Loss = Σ_{i=1}^{T} [ Σ_{j=1}^{N} Σ_{k=1}^{C} BCE(y_{jk}^{i}, ŷ_{jk}^{i}) + λ·P_i ],  with P_i = ‖A_i·A_i^T − I‖_F^2

where T represents the number of tasks; N represents the number of samples; C represents the number of GO categories considered; y_{jk}^{i} represents the label value of functional class k in sample j of task i; ŷ_{jk}^{i} represents the predicted probability of functional class k in sample j of task i; BCE(·) represents the binary cross-entropy cost function; A_i represents the attention weight matrix of task i; I represents the identity matrix; ‖·‖_F represents the Frobenius norm of a matrix; P_i represents the penalty term on the multiple attention heads of task i; and λ represents the penalty coefficient.
9. A method for protein function prediction in combination with a multitasking and self-attention mechanism as claimed in claim 1, wherein in steps S3, S4 the preprocessing is a z-score normalization process.
CN202011467595.3A 2020-12-14 2020-12-14 Protein function prediction method combining multitask learning and self-attention mechanism Active CN112562784B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011467595.3A CN112562784B (en) 2020-12-14 2020-12-14 Protein function prediction method combining multitask learning and self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011467595.3A CN112562784B (en) 2020-12-14 2020-12-14 Protein function prediction method combining multitask learning and self-attention mechanism

Publications (2)

Publication Number Publication Date
CN112562784A CN112562784A (en) 2021-03-26
CN112562784B true CN112562784B (en) 2023-08-15

Family

ID=75064485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011467595.3A Active CN112562784B (en) 2020-12-14 2020-12-14 Protein function prediction method combining multitask learning and self-attention mechanism

Country Status (1)

Country Link
CN (1) CN112562784B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113838520B * 2021-09-27 2024-03-29 Yangtze Delta Region Institute of University of Electronic Science and Technology of China (Quzhou) III type secretion system effector protein identification method and device
CN114511918B * 2022-04-20 2022-07-05 Communication University of China Face state judgment method and system based on multi-task learning
CN117037898A * 2023-07-18 2023-11-10 Harbin Institute of Technology Molecular interaction prediction method based on knowledge graph and multi-task learning
CN117393050A * 2023-10-17 2024-01-12 Harbin Institute of Technology (Weihai) Protein function recognition method, device and readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110310698A (en) * 2019-07-05 2019-10-08 齐鲁工业大学 Classification model construction method and system based on protein length and DCNN
CN111667884A (en) * 2020-06-12 2020-09-15 天津大学 Convolutional neural network model for predicting protein interactions using protein primary sequences based on attention mechanism
CN111696624A (en) * 2020-06-08 2020-09-22 天津大学 DNA binding protein identification and function annotation deep learning method based on self-attention mechanism

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10423861B2 (en) * 2017-10-16 2019-09-24 Illumina, Inc. Deep learning-based techniques for training deep convolutional neural networks
US11581060B2 (en) * 2019-01-04 2023-02-14 President And Fellows Of Harvard College Protein structures from amino-acid sequences using neural networks

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110310698A (en) * 2019-07-05 2019-10-08 齐鲁工业大学 Classification model construction method and system based on protein length and DCNN
CN111696624A (en) * 2020-06-08 2020-09-22 天津大学 DNA binding protein identification and function annotation deep learning method based on self-attention mechanism
CN111667884A (en) * 2020-06-12 2020-09-15 天津大学 Convolutional neural network model for predicting protein interactions using protein primary sequences based on attention mechanism

Also Published As

Publication number Publication date
CN112562784A (en) 2021-03-26

Similar Documents

Publication Publication Date Title
CN112562784B (en) Protein function prediction method combining multitask learning and self-attention mechanism
Bepler et al. Learning the protein language: Evolution, structure, and function
Habibi et al. A review of machine learning methods to predict the solubility of overexpressed recombinant proteins in Escherichia coli
Wu et al. EPSOL: sequence-based protein solubility prediction using multidimensional embedding
US11532378B2 (en) Protein database search using learned representations
Wang Application of support vector machines in bioinformatics
Wang et al. Improved fragment sampling for ab initio protein structure prediction using deep neural networks
US20230067528A1 (en) Multimodal domain embeddings via contrastive learning
Hu et al. Protein language models and structure prediction: Connection and progression
Zhou et al. Combining deep neural networks for protein secondary structure prediction
Ma et al. Retrieved sequence augmentation for protein representation learning
Penić et al. Rinalmo: General-purpose rna language models can generalize well on structure prediction tasks
Dotan et al. Effect of tokenization on transformers for biological sequences
Yu et al. KenDTI: An ensemble model for predicting drug-target interaction by integrating multi-source information
Ma et al. CRBP-HFEF: prediction of RBP-Binding sites on circRNAs based on hierarchical feature expansion and fusion
Chen et al. Attention is all you need for general-purpose protein structure embedding
Lee et al. BP-GAN: Interpretable human branchpoint prediction using attentive generative adversarial networks
Lawrence et al. Improving MHC class I antigen-processing predictions using representation learning and cleavage site-specific kernels
Tan et al. PETA: Evaluating the impact of protein transfer learning with sub-word tokenization on downstream applications
Wekesa et al. LPI-DL: A recurrent deep learning model for plant lncRNA-protein interaction and function prediction with feature optimization
Pokharel et al. NLP-based encoding techniques for prediction of post-translational modification sites and protein functions
Yao et al. Protein subcellular localization prediction based on PSI-BLAST profile and principal component analysis
Li et al. ProSST: Protein Language Modeling with Quantized Structure and Disentangled Attention
Singh et al. SPOT-1D2: Improving protein secondary structure prediction using high sequence identity training set and an ensemble of recurrent and residual-convolutional neural networks
Mufassirin et al. Multi-S3p: protein secondary structure prediction with specialized multi-network and self-attention-based deep learning model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant