CN115472229A - Thermophilic protein prediction method and device - Google Patents

Thermophilic protein prediction method and device

Info

Publication number
CN115472229A
Authority
CN
China
Prior art keywords
protein
feature vectors
protein sequence
thermophilic
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211109737.8A
Other languages
Chinese (zh)
Inventor
赵建君
杨洋
严文颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN202211109737.8A priority Critical patent/CN115472229A/en
Publication of CN115472229A publication Critical patent/CN115472229A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 15/00: ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B 30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B 40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B 40/20: Supervised data analysis
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods


Abstract

The invention relates to the technical field of protein engineering, and in particular to a method and a device for predicting thermophilic proteins. The prediction method combines the protein sequence with biological features derived from the sequence: a convolutional neural network extracts local key information from the protein sequence; a bidirectional long short-term memory network then extracts long-range dependency features; a self-attention mechanism weights the key information of the protein sequence; the biological features of the protein sequence are fused in; and finally a multilayer perceptron performs the thermophilic protein prediction. The method effectively mines the important information hidden in the protein sequence, the multi-channel feature fusion expresses the protein sequence more fully, and full use of the protein's own sequence information makes the prediction result more accurate.

Description

Thermophilic protein prediction method and device
Technical Field
The invention relates to the technical field of protein engineering, in particular to a method and a device for predicting thermophilic protein.
Background
The thermal stability of a protein refers to its ability to retain its characteristic chemical and spatial structure under high-temperature conditions. Protein engineering and biotechnology research depend to a large extent on protein thermal stability. Organisms with an optimal growth temperature below 50 ℃ are considered mesophiles, and organisms above 50 ℃ are considered thermophiles. Thermophilic organisms produce thermostable proteins (thermophilic proteins) that survive for long periods at high temperatures without denaturation; some thermophilic proteins even survive at 100 ℃. This high thermal stability gives thermophilic proteins outstanding advantages in industrial production. In enzyme engineering, for example, enzyme preparations produced from thermophilic proteins offer high heat resistance and high catalytic reaction rates. Prediction research on thermophilic proteins is therefore not only important for protein thermal stability engineering but also of practical value in fields such as industrial production.
Distinguishing thermophilic proteins from normal-temperature proteins through biological experiments is time-consuming, labor-intensive and costly. Compared with experimental methods, computational methods can quickly and accurately identify thermophilic and normal-temperature proteins from large amounts of protein sequence information, and they are an important topic in the current field of protein thermal stability.
The thermal stability of a protein is closely related to biological characteristics such as amino acid composition, hydrogen bonds and salt bridges. Zhou et al. found that thermophilic proteins have more hydrophobic, charged and aromatic residues than normal-temperature proteins. Zhang Guanya et al. found that the content of various dipeptides affects protein thermal stability. The effect of different types of hydrogen bonds on protein thermal stability has also been studied, and Wu et al. showed experimentally that introducing salt bridges can improve protein thermostability. It follows that the biological characteristics of a protein are very important for thermophilic protein prediction.
Most computational methods for thermophilic protein prediction are based on traditional machine learning. Zhang et al. analyzed the primary structure of the sequence with the LogitBoost algorithm to detect thermophilic proteins. Lin et al. constructed a data set containing 915 thermophilic proteins and 793 non-thermophilic proteins and predicted thermophilic proteins from amino acid distribution and amino acid pair information. Charoenkwan et al. compiled a new data set from the published literature and built a predictor named SCMTPP from amino acid composition and dipeptide propensity scores. Meng et al. used the data set of Lin as raw data and constructed a support vector machine predictor named TMPpred based on amino acid composition and eight physicochemical properties. SAPPHIRE is a thermophilic protein predictor built from the composition, composition-transition-distribution, physicochemical and evolutionary information features of protein sequences. These studies have achieved some success, but traditional machine learning prediction algorithms rely on biological features calculated from protein sequences and cannot sufficiently capture the information in the protein sequence itself.
The rapid development of deep learning has strongly promoted the development of bioinformatics. Karimi et al. predicted protein affinity using recurrent neural networks (RNN) and convolutional neural networks (CNN). Haicheng et al. predicted synthetic lethal genes with neural networks. Ahmed et al. were the first to apply deep learning to thermophilic protein prediction, proposing the model iThermo, which combines seven groups of sequence-derived biological features and distinguishes thermophilic proteins from normal-temperature proteins with a multilayer perceptron (MLP). However, the iThermo model uses only sequence-derived biological features and ignores the information of the protein sequence itself.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is to overcome the shortcoming of thermophilic protein prediction models in the prior art, which ignore the information of the protein sequence itself.
In order to solve the technical problem, the invention provides a method for predicting thermophilic protein, which comprises the following steps:
obtaining biological characteristics of a protein sequence, and performing characteristic screening and standardization processing to obtain a biological characteristic vector;
respectively carrying out amino acid composition coding and amino acid physicochemical property coding on a protein sequence to obtain two coding feature vectors;
mapping the two coded feature vectors into two dense feature vectors through an embedding layer;
performing convolution operation on the two dense feature vectors respectively to obtain two local key feature vectors of the protein sequence;
the two local key characteristic vectors are used for capturing context dependence information through a bidirectional long-short term memory network respectively to obtain two hidden characteristic vectors of a protein sequence;
respectively carrying out weighting processing on the two hidden feature vectors through an attention mechanism to obtain two key feature vectors of the protein sequence;
fusing the two hidden feature vectors, the two key feature vectors and the biological feature vector to obtain fused features;
and performing nonlinear transformation on the fusion characteristics through a multilayer perceptron, and then performing thermophilic protein prediction by using a sigmoid layer to obtain a target prediction result.
Preferably, the processing of the protein sequence by amino acid composition coding and amino acid physicochemical property coding to obtain two coding feature vectors respectively comprises:
ordering the amino acids according to the shorthand letters of the amino acid residues, carrying out labeling, and carrying out amino acid composition coding on the protein sequence according to the labeling;
and classifying the amino acids according to the physicochemical properties of the amino acid residues, numbering the amino acids according to the classified classes, and coding the physicochemical properties of the amino acids for the protein sequence according to the numbering.
Preferably, the mapping of the two coded feature vectors into two dense feature vectors via the embedding layer further comprises: discarding part of the information of each of the two dense feature vectors through a discarding (dropout) layer.
Preferably, the performing the convolution operation on the two dense feature vectors respectively further includes performing a max-pooling operation respectively.
Preferably, the obtaining two hidden feature vectors of the protein sequence by capturing the context-dependent information through the two-way long-short term memory network respectively comprises:
respectively carrying out forward calculation on local key feature vectors from 1 to t moments by utilizing a forward layer of a bidirectional long-short term memory network to obtain the output of a forward hidden layer at each moment;
respectively carrying out reverse calculation on the local key characteristic vectors from t to 1 by utilizing a backward layer of the bidirectional long-short term memory network to obtain the output of a backward hidden layer at each moment;
and combining the outputs of the forward layer and the backward layer at each moment to obtain a final hidden feature vector.
Preferably, the weighting of the two hidden feature vectors by a self-attention mechanism to obtain two key feature vectors of the protein sequence comprises:
mapping the hidden characteristic vectors to three different spaces to obtain a query vector matrix, a key vector matrix and a value vector matrix;
calculating the similarity between the key vector matrix and the query vector matrix by dot product, and normalizing it with a softmax function to obtain the attention weights;
and carrying out weighted summation on the attention weight and the value vector matrix to obtain a key feature vector.
The present invention also provides a thermophilic protein prediction apparatus, comprising:
the biological characteristic vector acquisition module is used for acquiring the biological characteristics of the protein sequence, and performing characteristic screening and standardization processing to obtain a biological characteristic vector;
the coding feature vector acquisition module is used for respectively carrying out amino acid composition coding and amino acid physicochemical property coding processing on the protein sequence to obtain two coding feature vectors;
a word embedding module for mapping the two encoded feature vectors into two dense feature vectors;
the convolution module is used for respectively carrying out convolution operation on the two dense feature vectors to obtain two local key feature vectors of the protein sequence;
the bidirectional long-short term memory module is used for respectively capturing context dependence information of the two local key characteristic vectors to obtain two hidden characteristic vectors of the protein sequence;
the attention module is used for respectively carrying out weighting processing on the two hidden feature vectors to obtain two key feature vectors of the protein sequence;
the feature fusion module is used for fusing the two hidden feature vectors, the two key feature vectors and the biological feature vector to obtain fusion features;
and the multilayer perceptron module is used for performing nonlinear transformation on the fusion characteristics and then performing thermophilic protein prediction by using the sigmoid layer to obtain a target prediction result.
Preferably, the thermophilic protein prediction device is applied to the determination of the thermal stability of proteins.
Preferably, the thermophilic protein predicting apparatus is applied to enzyme engineering.
Compared with the prior art, the technical scheme of the invention has the following advantages:
the method for predicting the thermophilic protein combines the protein sequence and biological characteristics derived from the sequence to predict the thermophilic protein. The method comprises the steps of extracting local key information of a protein sequence by using a convolutional neural network; then extracting remote dependence features by using a bidirectional long-short term memory network (BilSTM); then weighting the key information of the protein sequence by a Self-attention mechanism (Self-attention); and fusing the biological characteristics of the protein sequence; finally, realizing the prediction of the thermophilic protein through a multilayer perceptron (MLP); the thermophilic protein prediction method effectively excavates important information hidden in a protein sequence, multi-channel feature fusion can more fully express the protein sequence, and sequence information of the protein is fully utilized to enable a prediction result to be more accurate; the features used in the present invention are based on protein sequences, do not relate to protein structure, and have higher generalization.
Drawings
In order that the present disclosure may be more readily and clearly understood, reference is now made to the following detailed description of the present disclosure taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of an implementation of the thermophilic protein prediction method of the present invention;
FIG. 2 is a flow chart of an implementation of an embodiment of the present invention;
FIG. 3 is a flow chart of amino acid encoding;
FIG. 4 is a diagram of the structure of the BiLSTM;
fig. 5 is a block diagram of a thermophilic protein predicting apparatus according to an embodiment of the present invention.
Detailed Description
The core of the invention is to provide a method and a device for predicting the thermophilic protein, which make full use of the sequence information of the protein to ensure that the prediction result is more accurate.
In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1 and fig. 2, fig. 1 is a flowchart illustrating an implementation of a method for predicting a thermophilic protein provided by the present invention, and fig. 2 is a flowchart illustrating an implementation of an embodiment of the present invention; the specific operation steps are as follows:
the thermophilic protein prediction model based on the self-attention mechanism is a multichannel fusion model DeepTTP combining sequence information and biological characteristics, and is input into two vectors of a protein sequence after amino acid composition coding and amino acid physicochemical property coding and biological characteristics after standardization processing; and the vectors processed by the two encoding forms concurrently execute subsequent operations.
S101, acquiring biological characteristics of a protein sequence, and performing characteristic screening and standardization processing to obtain a biological characteristic vector;
The dimensionality of the biological features also influences the prediction performance of the model: an excessively high biological feature dimension makes the feature dimension obtained after fusion with the deep learning output features too large and increases the complexity of model prediction. The biological features calculated from the sequence are therefore subjected to feature screening and standardization to obtain the biological feature vector B.
S102, respectively carrying out amino acid composition coding and amino acid physicochemical property coding on a protein sequence to obtain two coding feature vectors;
as shown in fig. 3:
The amino acids are sorted according to the one-letter abbreviations of the amino acid residues and assigned labels (each amino acid corresponds to a specific number), and the protein sequence is amino-acid-composition coded according to these labels;
the amino acids are classified according to the physicochemical properties of the residues, which are closely related to protein thermophilicity, into 6 groups: hydrophobic, negatively charged, positively charged, conformational, polar and other. The amino acids are then numbered according to these groups, as shown in Table 1:
TABLE 1 amino acid physicochemical Property Classification Numbers
The protein sequence is then amino-acid-physicochemical-property coded according to these numbers.
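As a concrete illustration of the two encodings, the Python sketch below maps a sequence to an integer vector for amino acid composition coding and to a group-number vector for physicochemical coding. The particular group assignments are an assumption for illustration only, since the actual residue-to-group mapping is given in Table 1 of the original filing.

```python
# Minimal sketch of the two sequence encodings; the group membership below is assumed.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"                          # one-letter codes, alphabetical
AAC_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}    # 0 is reserved for padding

PHYSCHEM_GROUP = {
    **dict.fromkeys("AFILMVW", 1),   # hydrophobic (assumed grouping)
    **dict.fromkeys("DE", 2),        # negatively charged
    **dict.fromkeys("HKR", 3),       # positively charged
    **dict.fromkeys("GP", 4),        # conformational
    **dict.fromkeys("CNQSTY", 5),    # polar
}

def encode(sequence: str, max_len: int = 1500):
    """Return (composition_code, physchem_code), zero-padded to max_len."""
    comp = [AAC_INDEX.get(aa, 0) for aa in sequence[:max_len]]
    phys = [PHYSCHEM_GROUP.get(aa, 6) for aa in sequence[:max_len]]   # 6 = "other"/unknown
    pad = max_len - len(comp)
    return np.array(comp + [0] * pad), np.array(phys + [0] * pad)

comp_vec, phys_vec = encode("MKTAYIAKQR")
```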
S103, mapping the two coding feature vectors into two dense feature vectors through an embedding layer;
To prevent the model from overfitting, noise is injected into the hidden units (dropout):
a Dropout layer with a dropout rate of 0.5 is added after the embedding layer; the two dense feature vectors each pass through this discarding layer, and some neural network units are temporarily dropped from the network.
S104, performing convolution operation on the two dense feature vectors respectively to obtain two local key feature vectors of the protein sequence;
CNNs capture local key features in images well, so a CNN is used here for protein sequence analysis. The convolution module contains three convolutional layers. Each convolutional layer performs convolution on the data using local connections and weight sharing to obtain local key information. Every convolutional layer has 64 filters of length 3, and each filter slides with a stride of 1; this series of convolution operations yields the higher-dimensional feature maps c_1 and c_2.
A pooling layer effectively reduces the size of the parameter matrices and hence the number of parameters in the model, so adding pooling layers improves computational efficiency and helps avoid overfitting. A max-pooling operation in the pooling layer produces the outputs c'_1 and c'_2.
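The convolution module described above can be sketched as follows; the placeholder input stands in for one embedded channel, and the padding, activation and pool size are assumptions not stated in the text:

```python
import tensorflow as tf
from tensorflow.keras import layers

embedded = layers.Input(shape=(1500, 64))        # stand-in for one embedded channel

x = embedded
for _ in range(3):                               # three convolutional layers
    # 64 filters of length 3, sliding stride 1, as described above
    x = layers.Conv1D(filters=64, kernel_size=3, strides=1,
                      padding="same", activation="relu")(x)
local_key_features = layers.MaxPooling1D(pool_size=2)(x)   # max pooling reduces the parameter load
```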
S105, capturing context dependent information by the two local key characteristic vectors through a bidirectional long-short term memory network respectively to obtain two hidden characteristic vectors of a protein sequence;
The prediction of thermophilic proteins uses the information of the whole sequence, and dependency relations between sequence contexts may influence the performance of the prediction model. Parameter and feature information hidden deep in the sequence is therefore obtained through a BiLSTM layer, long-range dependencies are explored, and the corresponding hidden units are extracted. The structure of the BiLSTM is shown in FIG. 4:
The forward layer of the BiLSTM computes forward from time 1 to time t and yields the output of the forward hidden layer at each time step; the backward layer computes backward from time t to time 1 and yields the output of the backward hidden layer at each time step. On this basis, the outputs of the forward and backward layers at each time step are combined to obtain the final output: C_f = f(w_1·x_t + w_2·C_{f-1}), C_b = f'(w_3·x_t + w_5·C_{b-1}), H_m = g(w_4·C_f + w_6·C_b), where t denotes the time step, x the input and w_i (i = 1, 2, ..., 6) the weights; C_f is the output of the forward layer and C_b the output of the backward layer; the functions f() and f'() compute the outputs of the forward and backward layers, respectively, and the function g() combines and sums the two. This yields the output H_m of the BiLSTM layer.
The two local key feature vectors are fed into the BiLSTM layer, and after training two 128-dimensional feature vectors H_1 and H_2 are obtained.
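A sketch of the BiLSTM step: 64 units per direction yields the 128-dimensional per-position vectors H_1 and H_2 mentioned above, and return_sequences=True keeps the per-position outputs needed by the following self-attention layer. The input shape is an assumption carried over from the previous sketch.

```python
import tensorflow as tf
from tensorflow.keras import layers

local_key_features = layers.Input(shape=(750, 64))   # stand-in for c'_1 or c'_2 from the pooling step
hidden = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(local_key_features)
# `hidden` concatenates the forward and backward states, giving 128 dimensions per position.
```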
S106, weighting the two hidden feature vectors respectively through an attention mechanism to obtain two key feature vectors of the protein sequence;
weighting key information in the sequence by using an attention mechanism, and distributing more attention to important information to obtain the output of an attention layer;
Introducing an attention mechanism helps the model assign different weights to each part of the input, so that key information is extracted and the model can make more accurate decisions. The attention mechanism was originally proposed in the field of Natural Language Processing (NLP) and is now widely used in many fields; for example, Chen et al. used an attention mechanism for image classification, and attention mechanisms have been introduced into bidirectional GRUs for sentiment analysis.
The self-attention mechanism is an efficient way of processing information at the same level in parallel. After the protein sequence feature information has been fully extracted by the CNN and BiLSTM modules, the self-attention mechanism is used for refinement, so that the model attends more effectively to the key information in the protein sequence and the module's ability to extract key features is strengthened. Self-attention is computed as follows:
The input word vector matrix E is first mapped into three different spaces to obtain the three matrices Q, K and V: Q = E·W_i^Q, K = E·W_i^K, V = E·W_i^V,
where Q, K and V are the matrices formed by the query vectors, key vectors and value vectors, and W_i^Q, W_i^K and W_i^V are the parameter matrices of the i-th linear mapping.
The similarity between K and Q is computed by dot product and then normalized with a softmax function to obtain the attention weights as a probability distribution: A = softmax(K^T·Q).
Finally, the weights A and the value vectors V are combined in a weighted sum to obtain the attention output: Attention = V·A.
After the extracted hidden features of the protein sequence have been processed by the self-attention module, more attention is assigned to the important features and less to the unimportant ones, giving the final outputs, namely the key feature vectors A_1 and A_2.
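A minimal self-attention layer following the formulas above (linear maps to Q, K and V, dot-product similarity normalized with softmax, and a weighted sum with V), written here in the equivalent softmax(Q·K^T)·V arrangement; the projection size of 64 is an illustrative assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

class SelfAttention(layers.Layer):
    def __init__(self, dim=64, **kwargs):
        super().__init__(**kwargs)
        self.wq, self.wk, self.wv = layers.Dense(dim), layers.Dense(dim), layers.Dense(dim)

    def call(self, x):
        q, k, v = self.wq(x), self.wk(x), self.wv(x)      # Q = E·Wq, K = E·Wk, V = E·Wv
        scores = tf.matmul(q, k, transpose_b=True)        # dot-product similarity of Q and K
        weights = tf.nn.softmax(scores, axis=-1)          # attention weights A
        return tf.matmul(weights, v)                      # weighted sum with the value vectors

hidden = layers.Input(shape=(750, 128))                   # stand-in for H_1 or H_2
key_features = SelfAttention()(hidden)                    # A_1 or A_2
```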
S107, fusing the two hidden feature vectors, the two key feature vectors and the biological feature vector to obtain fused features;
and fusing the remote dependency relationship extracted by the BilSTM layer, the key information extracted by the attention layer and the biological characteristics.
And S108, performing nonlinear transformation on the fusion characteristics through a multilayer perceptron, and performing thermophilic protein prediction by using a sigmoid layer to obtain a target prediction result.
The fusion features are processed by three fully connected layers, each of which is followed by a ReLU activation function and a discarding (Dropout) layer:
In the multilayer perceptron module, three fully connected layers are connected in sequence and the nodes of each layer use a ReLU activation function; to avoid overfitting, three Dropout layers are added between the fully connected layers; finally, a Sigmoid activation function maps the output to a value in the range (0, 1).
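The fusion and multilayer-perceptron head can be sketched as below. How the per-position tensors are collapsed before concatenation is not spelled out in the text, so the global average pooling here, as well as the hidden-layer sizes, are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Stand-ins for the tensors produced by the earlier steps.
h1, h2 = layers.Input(shape=(750, 128)), layers.Input(shape=(750, 128))   # BiLSTM outputs
a1, a2 = layers.Input(shape=(750, 64)), layers.Input(shape=(750, 64))     # self-attention outputs
bio = layers.Input(shape=(205,))                                          # screened biological features

def pool(t):
    return layers.GlobalAveragePooling1D()(t)   # collapse the sequence axis (assumption)

x = layers.Concatenate()([pool(h1), pool(h2), pool(a1), pool(a2), bio])
for units in (256, 128, 64):                    # three fully connected layers with ReLU and Dropout
    x = layers.Dropout(0.5)(layers.Dense(units, activation="relu")(x))
prediction = layers.Dense(1, activation="sigmoid")(x)   # value in (0, 1)

model = tf.keras.Model(inputs=[h1, h2, a1, a2, bio], outputs=prediction)
```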
The thermophilic protein prediction method combines the protein sequence with biological features derived from the sequence: a convolutional neural network extracts local key information from the protein sequence; a bidirectional long short-term memory network (BiLSTM) extracts long-range dependency features; a self-attention mechanism weights the key information of the protein sequence; the biological features of the protein sequence are fused in; finally, a multilayer perceptron (MLP) performs the thermophilic protein prediction. The method effectively mines the important information hidden in the protein sequence, the multi-channel feature fusion expresses the protein sequence more fully, and full use of the protein's own sequence information makes the prediction result more accurate. The features used in the invention are based on protein sequences, do not involve protein structure, and therefore generalize better.
Based on the above embodiments, this embodiment provides a specific experiment to verify the model performance, which is as follows:
First: building the data set
None of the computational methods proposed so far is built on a large public data set dedicated to thermophilic proteins; the available data are small-sample data. Li et al. constructed a database containing the optimal growth temperatures of proteins, and we used the entries whose optimal growth temperatures were determined experimentally. The baseline data set was generated following the strict criteria proposed by Lin et al., taking 60 ℃ as the lowest optimal growth temperature for thermophilic organisms and 30 ℃ as the highest optimal growth temperature for normal-temperature organisms. All protein sequences were extracted from UniProt. The quality of the data set was ensured by the following steps:
(a) Protein sequences must be reviewed and labeled manually.
(b) Proteins containing undefined residues (e.g., "X", "B", "Z") are deleted.
(c) Sequences of other protein fragments were excluded.
(d) Highly similar sequences were removed by the CD-HIT program, using 40% sequence identity as a cut-off.
(e) Excessively long sequences may affect the predicted performance of the thermophilic protein, and proteins with sequence lengths not exceeding 1500 were screened here.
The ratio of thermophilic to non-thermophilic protein sequences in the data set obtained by the above procedure is approximately 1:3. To avoid the influence of data imbalance, the data were undersampled by randomly deleting part of the non-thermophilic proteins so that the numbers of thermophilic and non-thermophilic proteins are equal, giving 10161 non-thermophilic and 10161 thermophilic proteins.
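A minimal sketch of the length filter and random undersampling described above, assuming the proteins are held as lists of (identifier, sequence) records; the function name and seed are illustrative, not the patent's implementation.

```python
import random

def filter_and_balance(thermo, non_thermo, max_len=1500, seed=42):
    """Keep sequences of at most max_len residues, then undersample the negatives."""
    thermo = [rec for rec in thermo if len(rec[1]) <= max_len]
    non_thermo = [rec for rec in non_thermo if len(rec[1]) <= max_len]
    random.seed(seed)
    # Randomly delete part of the non-thermophilic proteins so both classes are equal in size.
    non_thermo = random.sample(non_thermo, k=len(thermo))
    return thermo, non_thermo
```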
To verify model performance, 721 entries were randomly drawn from the non-thermophilic and thermophilic proteins as blind test data. The data distribution is shown in Table 2:
TABLE 2 data distribution
Second: feature extraction
To construct a model that accurately recognizes thermophilic and non-thermophilic proteins, we extracted six groups of protein features with the protr package: Amino Acid Composition (AAC), Dipeptide Composition (DPC), Composition-Transition-Distribution (CTD), Quasi-Sequence-Order descriptors (QSO), Pseudo-Amino Acid Composition (PAAC) and Amphiphilic Pseudo-Amino Acid Composition (APAAC), giving 797 features in total. Table 3 lists the number of features in each class:
TABLE 3 number of characteristics
Third: feature screening
Irrelevant and redundant features can affect the prediction performance of the model, and an excessively high feature dimension can make the model harder to converge during training. To reduce the influence of irrelevant and redundant features and shorten training time, a feature screening method is used to remove them.
Following the feature screening approach used by ProTtab, the invention employs the LightGBM algorithm together with a cross-validation-based recursive feature elimination algorithm (RFECV). Recursive feature elimination (RFE) requires the number of retained features to be specified, but it is generally not known in advance how many features are effective; using cross-validation together with RFE scores different feature subsets and selects the optimal feature set, which is an efficient feature screening scheme. In the end, 205 features were screened out for training the model.
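A sketch of this screening step under the stated assumptions: a LightGBM classifier wrapped in scikit-learn's cross-validated recursive feature elimination; X is the 797-dimensional feature matrix and y the labels, both assumed to be prepared beforehand, and the estimator settings are illustrative.

```python
from lightgbm import LGBMClassifier
from sklearn.feature_selection import RFECV

def screen_features(X, y):
    """Score feature subsets by cross-validation and keep the best-performing set."""
    selector = RFECV(estimator=LGBMClassifier(n_estimators=100),
                     step=1, cv=5, scoring="accuracy")
    selector.fit(X, y)
    return selector.transform(X), selector.support_   # reduced matrix and boolean mask of kept features
```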
Fourth: experimental results and analysis
The experiments were implemented in Python 3.6, the model was built with the deep learning framework TensorFlow 2.4.0, and an NVIDIA GeForce GTX 960 GPU was used as the computing unit for model training.
The model parameter settings are detailed in table 4:
TABLE 4 model parameters
max_sequence_length is the length of the input sequence; bio_length is the length of the input biological feature vector; input_dim and output_dim are the vocabulary size of the Embedding layer and the dimension of the output word vectors, respectively; hidden_size is the number of hidden nodes of the BiLSTM layer; nums_layers is the number of CNN layers in the DeepTTP model; epochs is the number of iterations in the model training process, with an early stopping mechanism used to avoid overfitting; batch_size is the number of samples processed per batch in a single iteration; learning_rate is the learning rate.
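A hedged sketch of the training setup implied by Table 4: binary cross-entropy loss, mini-batch training and early stopping. Because Table 4 is reproduced as an image, the concrete optimizer, batch size and learning rate below are assumptions, and `model` stands for the assembled DeepTTP network.

```python
import tensorflow as tf

def compile_and_fit(model, inputs, labels):
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="binary_crossentropy", metrics=["accuracy"])
    early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                                  restore_best_weights=True)   # early stopping
    return model.fit(inputs, labels, validation_split=0.1,
                     epochs=100, batch_size=64, callbacks=[early_stop])
```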
Evaluation indexes are as follows:
Thermophilic protein prediction is a binary classification problem, and 7 indices are used to comprehensively evaluate the prediction model: Positive Predictive Value (PPV), Negative Predictive Value (NPV), Sensitivity (SEN), Specificity (SPE), Accuracy (ACC), Matthews Correlation Coefficient (MCC) and an Overall Performance Measure (OPM). These indices are calculated as follows:
PPV = TP / (TP + FP)
NPV = TN / (TN + FN)
SEN = TP / (TP + FN)
SPE = TN / (TN + FP)
ACC = (TP + TN) / (TP + FP + TN + FN)
MCC = (TP × TN - FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
OPM = ((PPV + NPV) / 2) × ((SEN + SPE) / 2) × ((ACC + nMCC) / 2), where nMCC = (MCC + 1) / 2
where TP is the number of correctly predicted thermophilic proteins; FP is the number of proteins incorrectly predicted as thermophilic; FN is the number of thermophilic proteins incorrectly predicted as non-thermophilic; and TN is the number of correctly predicted non-thermophilic proteins. MCC evaluates the model's ability to handle class balance. OPM is defined by PON-P2 and combines PPV, NPV, SEN, SPE and ACC with the MCC normalized to the range 0-1. OPM lies between 0 and 1, and the closer it is to 1 the better the classification.
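The seven indices can be computed directly from the confusion-matrix counts, as in the sketch below; the OPM combination follows the PON-P2 definition as this description summarizes it and should be read as an assumption about its exact form.

```python
import math

def evaluate(tp, fp, tn, fn):
    """Compute the seven evaluation indices from confusion-matrix counts."""
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    acc = (tp + tn) / (tp + fp + tn + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    nmcc = (mcc + 1) / 2                         # MCC normalized to the range 0-1
    opm = ((ppv + npv) / 2) * ((sen + spe) / 2) * ((acc + nmcc) / 2)
    return {"PPV": ppv, "NPV": npv, "SEN": sen, "SPE": spe, "ACC": acc, "MCC": mcc, "OPM": opm}
```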
Cross validation performance:
to evaluate the performance of the model, ten-fold cross-validation was used in the experiments. The cross-validation results for the DeepTTP model are shown in table 5:
TABLE 5 Ten-fold cross-validation performance
In ten-fold cross-validation the accuracy reaches 0.888, the MCC 0.777 and the OPM 0.701.
Blind test performance:
among the available tools for predicting thermophiles are TMPPred, SCMTPP, iThermo and SAPPHIRE. The deep ttp was compared to other similar tools using the blind test set constructed in table 2. The results are shown in Table 6:
TABLE 6 Blind test Performance comparison
DeepTTP shows the best overall performance, with an ACC of 0.893, an MCC of 0.785 and an OPM of 0.711. This shows that deep learning can achieve good results in thermophilic protein prediction, and it confirms that multi-channel feature fusion helps improve prediction performance.
TMPpred, SCMTPP and SAPPHIRE are based on traditional machine learning. Because TMPpred uses less data and a relatively limited set of features, its prediction performance on the blind test set is poor. The SCMTPP and SAPPHIRE predictions are biased towards negative samples, so their PPV and SPE are better on the blind test; their overall performance, however, is lower than that of DeepTTP.
iThermo uses a deep learning approach, but its ACC, MCC and OPM are 6.8%, 13.3% and 14.7% lower, respectively, than those of the DeepTTP model. This is because the iThermo model uses only biological features and does not fully exploit the sequence information of the protein itself.
In conclusion, DeepTTP has higher accuracy and better generalization. It uses CNN and BiLSTM to learn the features implicit in protein sequences, extracts key features with a self-attention mechanism, fuses them with the biological features of the protein, and uses the fused features to predict thermophilic proteins. In this way more important information is obtained from the protein sequence, and the prediction performance for thermophilic proteins is improved.
Model analysis:
the effect of the different modules in the DeepTTP model was verified by three different sets of comparative experiments.
Experiment one verifies the influence of the two encoding modes on the prediction of the thermophilic protein.
Three models were constructed: one using only amino acid composition coding (code 1), one using only amino acid physicochemical property coding (code 2), and one using both codes simultaneously.
As can be seen from table 7:
TABLE 7 Performance of different encodings
When the amino acid composition code or the amino acid physicochemical property code is used independently, the ACC of the model is 0.865 and 0.826, and the MCC is 0.734 and 0.652; when the two codes are combined, the model achieves an ACC of 0.879 and an MCC of 0.760. The performance of combining two codes is better than that of using a single code, which shows that the combined coding method used by the invention brings certain improvement to the prediction performance of the thermophilic protein.
Experiment two verifies the influence of fusing biological features on thermophilic protein prediction.
Comparative experiments were designed using sequence encoding alone, biological features alone, and sequence encoding fused with biological features.
As shown in table 8:
TABLE 8 model Performance Using different characteristics
When sequence coding alone is used, the model has an ACC of 0.879 and an MCC of 0.760; when only the biological features are used, the model has an ACC of 0.880 and an MCC of 0.761; after fusing the sequence codes and the biological features, the ACC of the model rises to 0.893 and the MCC to 0.785, indicating that fusing the biological features allows thermophilic proteins to be predicted more effectively.
Experiment three verifies the influence of the self-attention mechanism on the prediction of the thermophilic protein.
A comparative experiment was designed with and without the self-attention mechanism.
As can be seen from table 9:
TABLE 9 Effect of the self-attention mechanism on model Performance
The ACC of the model with the self-attention mechanism is 1.1% higher than that of the model without it, and the MCC also rises.
Most existing studies on thermophilic proteins are based on traditional machine learning and cannot fully utilize the sequence information of the proteins. The invention provides a multi-channel feature fusion model based on a self-attention mechanism. The model uses CNN and BiLSTM to learn the features implicit in the protein sequence, then weights the obtained features with a self-attention mechanism to extract the corresponding key features, and fuses them with the biological features of the protein sequence to construct a thermophilic protein prediction model. The experimental results show that multi-channel feature fusion expresses the protein sequence more fully and that the deep learning method can effectively mine important information hidden in the protein sequence. DeepTTP has higher accuracy than other tools, and the features used by the model are based on protein sequences, do not involve protein structure, and therefore generalize better.
In future work, to make fuller use of protein sequence information, we will try more effective sequence-derived biological features and new model architectures to further improve model performance. We will also attempt to predict thermophilic proteins with semi-supervised and unsupervised methods.
Referring to fig. 5, fig. 5 is a block diagram illustrating a thermophilic protein prediction apparatus according to an embodiment of the present invention; the specific device may include:
a biological characteristic vector obtaining module 100, configured to obtain biological characteristics of a protein sequence, and perform characteristic screening and standardization processing to obtain a biological characteristic vector;
the coding feature vector acquisition module 200 is configured to perform coding on the protein sequence by using amino acid composition and coding by using amino acid physicochemical properties to obtain two coding feature vectors;
a word embedding module 300 for mapping the two coded feature vectors into two dense feature vectors;
a convolution module 400, configured to perform convolution operations on the two dense feature vectors respectively to obtain two local key feature vectors of the protein sequence;
a bidirectional long-short term memory module 500, configured to capture context-dependent information of the two local key feature vectors, respectively, to obtain two hidden feature vectors of a protein sequence;
an attention module 600, configured to perform weighting processing on the two hidden feature vectors to obtain two key feature vectors of a protein sequence;
a feature fusion module 700, configured to fuse the two hidden feature vectors, the two key feature vectors, and the biometric feature vector to obtain a fusion feature;
and the multilayer perceptron module 800 is used for performing nonlinear transformation on the fusion characteristics and then performing thermophilic protein prediction by using a sigmoid layer to obtain a target prediction result.
The thermophilic protein prediction apparatus of this embodiment is used to implement the aforementioned thermophilic protein prediction method, and thus the specific implementation of the thermophilic protein prediction apparatus can be seen in the above embodiments of the thermophilic protein prediction method, for example, the biometric feature vector obtaining module 100, the coding feature vector obtaining module 200, the word embedding module 300, the convolution module 400, the bidirectional long-short term memory module 500, the attention module 600, the feature fusion module 700, and the multi-layer perceptron module 800 are respectively used to implement steps S101, S102, S103, S104, S105, S106, S107, and S108 in the aforementioned thermophilic protein prediction method, so the specific implementation thereof may refer to the description of the corresponding embodiments of each part, and will not be described herein again.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Various other modifications and alterations will occur to those skilled in the art upon reading the foregoing description. And are neither required nor exhaustive of all embodiments. And obvious variations or modifications therefrom are within the scope of the invention.

Claims (10)

1. A method for predicting a thermophilic protein, comprising:
obtaining biological characteristics of a protein sequence, and performing characteristic screening and standardization processing to obtain a biological characteristic vector;
respectively carrying out amino acid composition coding and amino acid physicochemical property coding on a protein sequence to obtain two coding feature vectors;
mapping the two encoded feature vectors into two dense feature vectors through an embedding layer;
performing convolution operation on the two dense feature vectors respectively to obtain two local key feature vectors of the protein sequence;
capturing context dependent information by the two local key characteristic vectors through a bidirectional long-short term memory network respectively to obtain two hidden characteristic vectors of a protein sequence;
respectively carrying out weighting processing on the two hidden feature vectors through an attention mechanism to obtain two key feature vectors of the protein sequence;
fusing the two hidden feature vectors, the two key feature vectors and the biological feature vector to obtain a fused feature;
and performing nonlinear transformation on the fusion characteristics through a multilayer perceptron, and performing thermophilic protein prediction by using a sigmoid layer to obtain a target prediction result.
2. The method for predicting the thermophilic protein according to claim 1, wherein the step of subjecting the protein sequence to amino acid composition coding and amino acid physicochemical property coding to obtain two coding feature vectors comprises the following steps:
sequencing the amino acids according to the short-hand letters of the amino acid residues, carrying out labeling, and carrying out amino acid composition coding on the protein sequence according to the labeling;
and classifying the amino acids according to the physicochemical properties of the amino acid residues, numbering the amino acids according to the classified classes, and coding the physicochemical properties of the amino acids for the protein sequence according to the numbering.
3. The method of predicting the thermophilic protein of claim 1, wherein the mapping the two encoded eigenvectors into two dense eigenvectors via an embedding layer further comprises: and respectively losing part of information of the two dense feature vectors through the discarding layer.
4. The method according to claim 1, wherein the convolving the two dense eigenvectors separately further comprises performing maximal pooling separately.
5. The method for predicting the thermophilic protein of claim 1, wherein the step of capturing the context dependent information of the two local key feature vectors through a bidirectional long-short term memory network to obtain two hidden feature vectors of the protein sequence comprises:
respectively carrying out forward calculation on local key feature vectors from 1 to t moments by utilizing a forward layer of a bidirectional long-short term memory network to obtain the output of a forward hidden layer at each moment;
respectively carrying out reverse calculation on the local key characteristic vectors from t to 1 by utilizing a backward layer of the bidirectional long-short term memory network to obtain the output of a backward hidden layer at each moment;
and combining the outputs of the forward layer and the backward layer at each moment to obtain the final hidden feature vector.
6. The method according to claim 1, wherein the weighting the two hidden feature vectors by an attention-based mechanism to obtain two key feature vectors of the protein sequence comprises:
mapping the hidden characteristic vectors to three different spaces to obtain a query vector matrix, a key vector matrix and a value vector matrix;
calculating the similarity between the key vector matrix and the query vector matrix by using point multiplication, and performing normalization processing by using a softmax function to obtain attention weight;
and carrying out weighted summation on the attention weight and the value vector matrix to obtain a key feature vector.
7. The method according to claim 1, wherein the non-linear transformation of the fused features by a multilayer perceptron comprises:
and processing the fusion features through three fully connected layers, wherein each fully connected layer is provided with a ReLU activation function and a discarding layer.
8. A thermophilic protein predicting apparatus, comprising:
the biological characteristic vector acquisition module is used for acquiring the biological characteristics of the protein sequence, and performing characteristic screening and standardization processing to obtain a biological characteristic vector;
the coding feature vector acquisition module is used for respectively carrying out amino acid composition coding and amino acid physicochemical property coding processing on the protein sequence to obtain two coding feature vectors;
a word embedding module for mapping the two encoded feature vectors into two dense feature vectors;
the convolution module is used for respectively carrying out convolution operation on the two dense feature vectors to obtain two local key feature vectors of the protein sequence;
the bidirectional long-short term memory module is used for respectively capturing context dependent information of the two local key characteristic vectors to obtain two hidden characteristic vectors of the protein sequence;
the attention module is used for respectively carrying out weighting processing on the two hidden feature vectors to obtain two key feature vectors of the protein sequence;
the feature fusion module is used for fusing the two hidden feature vectors, the two key feature vectors and the biological feature vector to obtain fusion features;
and the multilayer perceptron module is used for performing nonlinear transformation on the fusion characteristics and then performing thermophilic protein prediction by using a sigmoid layer to obtain a target prediction result.
9. The device according to claim 8, which is used for determining the thermal stability of a protein.
10. The device according to claim 8, wherein the device is used in enzyme engineering.
CN202211109737.8A 2022-09-13 2022-09-13 Thermophilic protein prediction method and device Pending CN115472229A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211109737.8A CN115472229A (en) 2022-09-13 2022-09-13 Thermophilic protein prediction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211109737.8A CN115472229A (en) 2022-09-13 2022-09-13 Thermophilic protein prediction method and device

Publications (1)

Publication Number Publication Date
CN115472229A true CN115472229A (en) 2022-12-13

Family

ID=84371187

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211109737.8A Pending CN115472229A (en) 2022-09-13 2022-09-13 Thermophilic protein prediction method and device

Country Status (1)

Country Link
CN (1) CN115472229A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination