CN116312750A - Polypeptide function prediction method and device - Google Patents

Publication number
CN116312750A
CN116312750A (application CN202310160375.3A)
Authority
CN
China
Prior art keywords
module
input
sequence
polypeptide
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310160375.3A
Other languages
Chinese (zh)
Inventor
赖仞
容明强
谷陟欣
李宗煦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Peide Biomedical Co ltd
Original Assignee
Chengdu Peide Biomedical Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Peide Biomedical Co ltd filed Critical Chengdu Peide Biomedical Co ltd
Priority to CN202310160375.3A priority Critical patent/CN116312750A/en
Publication of CN116312750A publication Critical patent/CN116312750A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00: ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/20: Protein or domain folding
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G06N3/0442: Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/09: Supervised learning
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20: Supervised data analysis
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioethics (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a polypeptide function prediction method and device. The method comprises: constructing a dataset and preprocessing the data; establishing a multi-label prediction model consisting of a feature embedding module, a multi-scale convolutional neural network CNN module, a BiGRU module for processing context-related sequences, and a classification module; training the constructed network model with the acquired dataset and adopting an optimization algorithm to improve the model's feature extraction and parameter prediction performance; and evaluating the performance of the multi-label prediction model by calculating five indexes: precision, coverage, accuracy, absolute true and absolute false. The method effectively exploits the associations between labels, improves the precision of polypeptide function prediction, and has strong operability and practicability; it can be applied simultaneously to the function prediction of eight bioactive peptides, achieving a precision of up to 80.4 percent, and is a reasonable and effective prediction method.

Description

Polypeptide function prediction method and device
Technical Field
The invention belongs to the technical field of the intersection of computer application technology and biotechnology, and particularly relates to a polypeptide function prediction method and device.
Background
Bioactive peptides are a class of peptides with hormonal or pharmaceutical activity that modulate physiological functions by binding to specific receptors on target cells. They are widely found in species throughout nature, from lower bacteria to higher vertebrates. Bioactive peptides are generally produced by cleavage and modification of intracellularly synthesized protein precursors and carry particular biological functions. Because of the wide variety of bioactive peptides and their functional importance, many researchers have focused on these small peptides and discovered a number of new bioactive peptides. Although relatively costly, peptides have higher tolerability, better specificity and stronger interactions with proteins than small-molecule compounds, which are cheap to produce and easy to transport but less specific. As a result, more than 70 peptides have been approved as therapeutic drugs in the United States, Europe and Japan, more than 200 are in clinical trials, and more than 600 are in preclinical trials. As of September 2021, there were a total of 1264 polypeptide drugs worldwide. More and more bioactive peptides with multiple functions are being identified.
To obtain more effective peptide therapies, identifying peptide function is particularly important. However, the rate at which peptide function can be identified by experimental methods cannot keep pace with the rate at which large numbers of bioactive peptides appear in the genomic era. With the increasing number of polypeptides, many computational methods have emerged to guide researchers in pre-screening peptide functions; computational methods such as machine learning are therefore becoming increasingly important in predicting peptide functions. Despite great progress, the following challenges remain. First, many machine learning based prediction methods are hampered by small sample sizes, which tend to cause over-fitting and poor generalization. Second, most previous research has focused on predicting single-function bioactive peptides; algorithms for jointly predicting multiple functions rarely appear, and their accuracy is low.
Disclosure of Invention
The invention aims to provide a polypeptide function prediction method and device that address the defects in the prior art.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the first object of the present invention is to provide a polypeptide function prediction method, comprising the following specific steps:
step S1, data set construction and data preprocessing
Extracting eight bioactive peptides AMPs, ACPs, AIPs, AHPs, ADPs, AOPs, IMPs and NUPs from the database to construct a benchmark dataset; then, preprocessing the data of the reference data set through redundancy and homology deviation analysis, wherein 80% of sequences in the preprocessed data are used for constructing a training set, and 20% of sequences are used for constructing a test set;
step S2, a multi-label prediction model is established, and the multi-label prediction model consists of a feature embedding module, a multi-scale convolutional neural network CNN module, a BiGRU module for processing context related sequences and a classification module;
the feature embedding module converts the polypeptide biological sequence vector into a dense vector of fixed size; the amino acids {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} are assigned the natural numbers {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}, and a zero-padding method is used to prepare dense vectors of a fixed length of 500; the embedding size of the feature embedding module is any one of 50, 100 and 150;
the multi-scale convolutional neural network CNN module consists of an encoding module and a decoding module and is formed by stacking a plurality of parallel convolution blocks; the input of each node in a convolution layer is a small patch of the previous network layer, and the sequence is encoded by a plurality of convolution-pooling layers to extract its global features; the pool size of the CNN module is 3 or 5;
the BiGRU module for processing context-related sequences extracts context information through forward and backward gated recurrent units GRU, receiving the forward input and learning the reverse input; the number of GRU units of the BiGRU module is any one of 50, 100 and 150;
the classification module adopts a fully connected layer: the output of the convolution layer is converted into vector form, and every output neuron is connected to every input neuron; the fully connected dimension of the classification module is 128 or 256;
step S3, training the constructed multi-label prediction model by using the acquired dataset, and adopting an optimization algorithm to improve the model's feature extraction and parameter prediction performance; the training learning rate is 0.01 or 0.001;
and step S4, evaluating the performance of the multi-label prediction model by calculating five indexes: precision, coverage, accuracy, absolute true and absolute false.
Further, in step S1, the databases include DRAMP, AntiCP, PreAIP, mAHTPred, BioDADpep and UniProt; the preprocessing deletes sequences with a similarity greater than 90% using the protein/nucleic acid sequence clustering tool CD-HIT, so as to avoid redundancy and homology bias within each functional peptide dataset.
Furthermore, in step S2, before the embedding mechanism is applied, each polypeptide is converted into a padded vector, which is then input into the multi-label prediction model; the polypeptides are 5-500 amino acids long, and only padded vectors of equal length can be processed as input files.
Further, in step S2, in the BiGRU module for processing context-related sequences, the update gate z of the j-th hidden unit in the GRU layer at time t is calculated as:

z_t^j = σ(W_z x_t + U_z h_{t-1})^j

where σ is the logistic sigmoid function, W_z and U_z are learned weight matrices, x_t is the input, and h_{t-1} is the previous hidden state; the output of the multi-scale convolutional network layer is used as the input.

At time t, the reset gate r of the j-th hidden unit is similarly calculated by:

r_t^j = σ(W_r x_t + U_r h_{t-1})^j

The activation h_t^j of the j-th hidden unit at time t is then calculated by:

h_t^j = (1 - z_t^j) h_{t-1}^j + z_t^j h̃_t^j

wherein the candidate activation h̃_t^j is calculated using the following formula:

h̃_t^j = tanh(W x_t + U(r_t ⊙ h_{t-1}))^j

where tanh is the hyperbolic tangent function, and ⊙ and r_t denote element-wise multiplication and the set of reset gates, respectively.
Furthermore, the classification module adopts a fully connected layer: the output of the convolution layer is converted into vector form, and each output neuron is connected to each input neuron.
Further, in step S4, the five indexes precision, coverage, accuracy, absolute true and absolute false are defined as follows:

Precision = (1/N) Σ_{i=1}^{N} ‖L_i ∩ L_i^*‖ / ‖L_i^*‖

Coverage = (1/N) Σ_{i=1}^{N} ‖L_i ∩ L_i^*‖ / ‖L_i‖

Accuracy = (1/N) Σ_{i=1}^{N} ‖L_i ∩ L_i^*‖ / ‖L_i ∪ L_i^*‖

Absolute true = (1/N) Σ_{i=1}^{N} Δ(L_i, L_i^*)

Absolute false = (1/N) Σ_{i=1}^{N} (‖L_i ∪ L_i^*‖ - ‖L_i ∩ L_i^*‖) / M

where N is the total number of multifunctional bioactive peptides, ∪ and ∩ denote set-theoretic union and intersection, ‖·‖ counts the number of elements in a set, L_i is the subset of true labels of the i-th sample, L_i^* is the subset of predicted labels of the i-th sample, Δ(L_i, L_i^*) equals 1 if the two subsets are identical and 0 otherwise, and M is the total number of labels.
Further, in step S4, the sensitivity SEN and specificity SPE are also used to evaluate the predictive performance of different multi-label methods on different peptide functions, calculated as follows:

SEN = TP / (TP + FN)

SPE = TN / (TN + FP)

where TP, TN, FP, and FN correspond to the numbers of true positives, true negatives, false positives, and false negatives, respectively.
The second object of the invention is to provide a polypeptide function prediction device, which comprises a dataset construction and data preprocessing module, a feature embedding module, a multi-scale convolutional neural network CNN module, a BiGRU module for processing context-related sequences and a classification module. The dataset construction and data preprocessing module constructs a reference dataset by extracting eight bioactive peptide classes AMPs, ACPs, AIPs, AHPs, ADPs, AOPs, IMPs and NUPs from databases; the protein/nucleic acid sequence clustering tool CD-HIT is then used to handle redundancy and homology bias by deleting sequences with a similarity greater than 90%, after which 80% of the preprocessed sequences are used to construct the training set and 20% the test set. The feature embedding module converts polypeptides of 5-500 amino acids into padded vectors of equal length that can be processed as input files, thereby converting polypeptide biological sequence vectors into dense vectors of fixed size. The multi-scale convolutional neural network CNN module consists of an encoding module and a decoding module formed by stacking a plurality of parallel convolution blocks; the input of each node in a convolution layer is a small patch of the previous network layer, and the sequence is encoded by a plurality of convolution-pooling layers to extract its global features. The BiGRU module for processing context-related sequences extracts context information through forward and backward gated recurrent units GRU, receiving the forward input and learning the reverse input. The classification module adopts a fully connected layer: the output of the convolution layer is converted into vector form, and every output neuron is connected to every input neuron.
A third object of the present invention is to provide a polypeptide function prediction apparatus, comprising a processor and a memory, wherein the memory stores an application program executable by the processor, for causing the processor to execute the above-mentioned polypeptide function prediction method.
A fourth object of the present invention is to provide a computer-readable storage medium having stored therein computer-readable instructions for performing the above-described polypeptide function prediction method.
Compared with the prior art, the technical scheme provided by the invention has the beneficial effects that:
(1) The method comprises: constructing a dataset and preprocessing the data; establishing a multi-label prediction model consisting of a feature embedding module, a multi-scale convolutional neural network CNN module, a BiGRU module for processing context-related sequences, and a classification module; training the constructed network model with the acquired dataset and adopting an optimization algorithm to improve the model's feature extraction and parameter prediction performance; and evaluating the performance of the multi-label prediction model by calculating five indexes: precision, coverage, accuracy, absolute true and absolute false. In this process, the method effectively exploits the associations between labels, improves the accuracy of polypeptide function prediction, and has strong operability and practicability; it can be applied simultaneously to the function prediction of the eight bioactive peptides AMPs, ACPs, AIPs, AHPs, ADPs, AOPs, IMPs and NUPs, achieving an accuracy of up to 80.4 percent, and is a reasonable and effective prediction method.
Drawings
FIG. 1 is an overall ranking graph comparing the accuracy and absolute true values of the models on the training and test sets;
fig. 2 is a graph of the accuracy of test-set predictions for AIP, AMP, AHP, ACP, ADP, AOP, IMP and NUP.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings. Where specific techniques or conditions are not indicated in the examples, they follow the techniques or conditions described in the literature in this field or the product specifications. Reagents or apparatus whose manufacturer is not indicated are conventional commercially available products.
The database information used in the implementation of the invention is as follows:
DRAMP: antibacterial peptide database, website http:// drain. Cpu-bioin for. Org/;
AntiCP: anticancer peptide database, website https:// webs. Iiitd. Edu. In/raghava/anti-icp 2/;
PreAIP: the website http:// kurta14. Bio. Kurtech. Ac. Jp/PreAIP/;
mAHTPred: the website http:// the gleelab. Org/mAHTPred;
BioDADpep: the bioinformatics database of antidiabetic peptides, website http:// omicsbase. Com/BioDADPep/;
uniprot (Swiss-prot): protein database, website https:// www.uniprot.org/.
Example 1
S1, data set construction and data preprocessing
S1-1, determining a reference dataset: a literature search was performed using 'bioactive peptide' as the keyword, and eight health-related functional peptide classes were collected from the literature: antibacterial peptides (AMPs), anticancer peptides (ACPs), anti-inflammatory peptides (AIPs), antihypertensive peptides (AHPs), antidiabetic peptides (ADPs), antioxidant peptides (AOPs), immunomodulatory peptides (IMPs) and neuroactive peptides (NUPs). Because small amounts of functional peptide data are insufficient to train a deep learning model, the invention uses these eight bioactive peptide classes, each with more than 500 sequences, to train the model.
S1-2, functional definitions and dataset preparation: AMP is a class of basic polypeptides with in vivo antibacterial activity. ACP is a series of peptides that can inhibit the proliferation or migration of cancer cells or inhibit the formation of tumour blood vessels. AIP is an endogenous peptide with the immunotherapeutic capacity to inhibit antigen-specific T(H)1-driven responses and regulatory T cell production. AHP is a peptide with the potential to ameliorate hypertension by scavenging free radicals and inhibiting the activity of angiotensin-converting enzyme and renin. ADP is a peptide that acts on beta cells or T cells to regulate insulin production and is also helpful in evaluating diabetes symptoms. AOP is a peptide that inhibits or delays lipid oxidation and eliminates free-radical damage. IMP is a peptide that exists in vivo and has immune function, mainly relevant to diseases such as lung cancer, leukemia and osteosarcoma. NUP is a peptide that can act as a hormone or neurotransmitter, interacting with alpha, delta and gamma receptors in vivo, and can play roles in easing pain and regulating respiration and body temperature. Preparation of the reference dataset: AMP, ACP, AIP, AHP, ADP, AOP, IMP and NUP sequences were extracted from DRAMP, AntiCP, PreAIP, mAHTPred, BioDADpep and UniProt (Swiss-Prot), respectively. To avoid redundancy and homology bias within each functional peptide dataset, the application uses the protein/nucleic acid sequence clustering tool CD-HIT to delete sequences with a similarity >90%, resulting in a total of 1891 AMPs, 864 ACPs, 1651 AIPs, 1421 AHPs, 992 ADPs, 1201 AOPs, 1192 IMPs and 1097 NUPs. Finally, 80% of the sequences are selected to construct the training set and 20% to construct the test set; Table 1 lists the details of the dataset.
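The 80/20 partition of the preprocessed data can be sketched as follows. This is a minimal illustration only: the toy records, function name and random seed are invented, and CD-HIT redundancy removal at 90% identity is assumed to have been applied to the sequences beforehand.

```python
# Illustrative 80/20 train/test split for step S1.
import random

def split_dataset(samples, train_frac=0.8, seed=42):
    """Shuffle reproducibly, then cut into train and test partitions."""
    rng = random.Random(seed)
    shuffled = list(samples)   # leave the caller's list untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Toy records: (sequence identifier, set of function labels)
data = [("PEP%03d" % i, {"AMP"} if i % 2 else {"ACP", "AIP"})
        for i in range(100)]
train, test = split_dataset(data)
```

In practice the split would be stratified per functional class so that each of the eight label sets keeps roughly the same 80/20 proportion.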
TABLE 1 details of datasets
[Table 1 is rendered as an image in the original publication and is not reproduced here.]
S2, construction of feature prediction algorithm
S2-1, model overview: in this work, the application uses a variety of deep neural network algorithms to construct a multi-label predictor. FIG. 1 illustrates the construction of the proposed multi-label predictor for multifunctional bioactive peptides. It has four main parts. The first part is a feature embedding module that converts sequence vectors into dense vectors of fixed size. The second part is a multi-scale convolutional neural network (CNN) module. The third part is the BiGRU module that processes context-dependent sequences. The last part is the classification module. In each output unit, sigmoid is used as the activation function, outputting a score between 0 and 1. These modules are described in the next sections, and further experimental details are given in the implementation section.
S2-2, feature embedding module: before the embedding mechanism is applied, each peptide is converted into a padded vector, which is then input into the prediction model. First, the amino acids {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} are assigned the natural numbers {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}, respectively. These peptides range in length from 5 (the minimum length of all sequences) to 500 (the maximum length of all sequences), but only peptide vectors of equal length can be handled as input files. Thus, peptides of a fixed length of 500 amino acids are prepared using the zero-padding method, and the sequences are encoded into 500-dimensional vectors. After the sequence vectors are obtained, the positive integers in each vector are converted into dense continuous feature vectors of fixed size by an embedding layer; such mechanisms are commonly used in natural language processing. In this study, the weights of the embedding layer are updated during training, so the learned embedding matrix comes to represent the peptide sequence.
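The integer encoding and zero padding described above can be sketched as follows. The 20-letter alphabet and the maximum length of 500 follow the text (the letter set printed in the original appears to have lost K and L in extraction; the standard 20 amino acids are assumed here), while the function name is our own.

```python
# Integer encoding and zero padding for the feature embedding module.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
MAX_LEN = 500  # longest peptide in the dataset

def encode_peptide(sequence, max_len=MAX_LEN):
    """Map each residue to its integer code, then zero-pad to max_len."""
    codes = [AA_TO_INT[aa] for aa in sequence.upper()]
    if len(codes) > max_len:
        raise ValueError("sequence longer than the supported maximum")
    return codes + [0] * (max_len - len(codes))

vec = encode_peptide("ACDKL")  # a 5-residue toy peptide
```

The resulting 500-dimensional integer vector is what the trainable embedding layer then maps to dense continuous features.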
S2-3, multi-scale convolutional neural network (CNN) module: this module mainly comprises an encoding module and a decoding module, formed by stacking a plurality of parallel convolution blocks. The input of each node in a convolution layer is a small patch of the previous network layer; the sequence is encoded by a plurality of convolution-pooling layers to extract its global features, which facilitates deeper analysis and yields features of a higher degree of abstraction. After the convolution feature matrix is obtained, feature maps of different scales are extracted through a deconvolution layer to reduce the number of features, and a max-pooling mechanism is adopted to prevent excessive overlap.
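A minimal numpy sketch of the multi-scale idea, assuming parallel 1-D convolutions with different kernel sizes (the implementation section later mentions ks ∈ {2, 3, 8}), each followed by global max pooling, with the resulting features concatenated. The random weights and dimensions are purely illustrative, not the trained model.

```python
# Shape-level illustration of multi-scale 1-D convolution + max pooling.
import numpy as np

rng = np.random.default_rng(0)
seq = rng.normal(size=(500, 50))  # (sequence length, embedding dim)

def conv1d_maxpool(x, kernel_size, n_filters, rng):
    """Valid 1-D convolution over the length axis, then global max pool."""
    length, dim = x.shape
    w = rng.normal(size=(n_filters, kernel_size, dim))
    # All sliding windows of the input: (positions, kernel_size, dim)
    windows = np.stack([x[i:i + kernel_size]
                        for i in range(length - kernel_size + 1)])
    feats = np.einsum("lkd,fkd->lf", windows, w)  # (positions, filters)
    return feats.max(axis=0)                      # global max pooling

multi_scale = np.concatenate(
    [conv1d_maxpool(seq, ks, 64, rng) for ks in (2, 3, 8)])
```

Concatenating the pooled outputs of the three kernel sizes gives one fixed-length feature vector regardless of which scale each feature came from.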
S2-4, BiGRU module for processing context-related sequences: recurrent neural networks (RNNs), and the bidirectional GRU in particular, have achieved significant results in classification tasks. The sequence stream is analysed by the bidirectional GRU, and sequence context information is extracted by receiving the forward input and learning the reverse input through two gated recurrent units (GRUs) running in the forward and reverse directions. In this study, the BiGRU uses its internal state to process sequence vectors, exploiting sequence context information in both directions. The gated recurrent unit (GRU) is the integral component of a BiGRU, dynamically remembering or forgetting sequence information. For more detail on the GRU, the following describes how the activation of the j-th hidden unit in the GRU layer is calculated. First, the update gate z of the j-th hidden unit at time t is calculated by the following formula:
z_t^j = σ(W_z x_t + U_z h_{t-1})^j

where σ is the logistic sigmoid function, W_z and U_z are learned weight matrices, x_t is the input, and h_{t-1} is the previous hidden state; the output of the multi-scale convolutional network layer is used as the input.

At time t, the reset gate r of the j-th hidden unit is similarly calculated by:

r_t^j = σ(W_r x_t + U_r h_{t-1})^j

The activation h_t^j of the j-th hidden unit at time t is then calculated by:

h_t^j = (1 - z_t^j) h_{t-1}^j + z_t^j h̃_t^j

wherein the candidate activation h̃_t^j is calculated using the following formula:

h̃_t^j = tanh(W x_t + U(r_t ⊙ h_{t-1}))^j

where tanh is the hyperbolic tangent function, and ⊙ and r_t denote element-wise multiplication and the set of reset gates, respectively.
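A single GRU time step built from the update-gate, reset-gate, candidate and activation equations described above can be sketched in numpy as follows. The dimensions, the all-zero weight matrices and the input values are invented purely to make the arithmetic checkable by hand.

```python
# One GRU time step, following the gate equations in the text.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """Compute h_t from x_t and h_{t-1}."""
    z = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate
    h_cand = np.tanh(W @ x_t + U @ (r * h_prev))    # candidate activation
    return (1.0 - z) * h_prev + z * h_cand          # new hidden state

d_in, d_h = 4, 3
x = np.ones(d_in)
h0 = np.full(d_h, 0.8)
zeros_in, zeros_h = np.zeros((d_h, d_in)), np.zeros((d_h, d_h))
# With all-zero weights: z = r = sigmoid(0) = 0.5 and the candidate is
# tanh(0) = 0, so h_t = (1 - 0.5) * h_prev = 0.5 * h_prev.
h1 = gru_step(x, h0, zeros_in, zeros_h, zeros_in, zeros_h,
              zeros_in, zeros_h)
```

The bidirectional variant simply runs two such recurrences, one over the sequence and one over its reversal, and concatenates the hidden states.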
S2-5, classification module: the application adopts a fully connected layer as the classification module. The output of the convolution layer is converted into vector form, and each output neuron can be connected to each input neuron. A rectified linear unit (ReLU) is used as the activation function, and the vector from the fully connected layer is used as the input to the output layer. In the multi-label problem, the probability of each node is independent of the others, so binary cross-entropy is used as the loss function, and sigmoid is used as the activation function to obtain a score between 0 and 1 for each node. Finally, the application uses a threshold of 0.7 to derive the predicted label for each category, and the elements of the multidimensional prediction vector correspond to the labels ACP, ADP, AHP, AIP, AMP, AOP, IMP and NUP.
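The final classification step, independent per-label sigmoid scores cut at the 0.7 threshold, can be illustrated as follows. The threshold and the label order follow the text; the score values are invented.

```python
# From per-label sigmoid scores to a predicted label set (threshold 0.7).
LABELS = ["ACP", "ADP", "AHP", "AIP", "AMP", "AOP", "IMP", "NUP"]

def predict_labels(scores, threshold=0.7):
    """Return every label whose independent sigmoid score clears the threshold."""
    return {lab for lab, s in zip(LABELS, scores) if s >= threshold}

pred = predict_labels([0.92, 0.10, 0.05, 0.71, 0.88, 0.30, 0.66, 0.02])
```

Because each node's probability is independent, a single peptide can be assigned several functions at once, which is exactly the multi-label setting the model targets.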
S2-6, evaluation metrics:
in previous studies, several evaluation metrics were presented to evaluate the performance of the multi-label model. To evaluate the performance of the methods of the present application, these measurements were taken, including precision, coverage, accuracy, absolute true values, and absolute false values. These indices are defined as follows:
Figure BDA0004093921900000111
Figure BDA0004093921900000112
Figure BDA0004093921900000113
Figure BDA0004093921900000114
wherein N is the total number of the multifunctional bioactive peptides, U represents the union in the ensemble theory, and refers to the operation of counting the number of elements by the intersection of the set theory, L (L) i Representing the subset of the ith sample with true tags,
Figure BDA0004093921900000115
representing a subset of labels with predictions for an ith sample, and
Figure BDA0004093921900000116
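The five indexes can be computed directly from true and predicted label sets. The following sketch uses invented label sets for the demonstration, with M = 8 labels as in the text; the function name is our own.

```python
# The five multi-label metrics over true/predicted label sets.
M = 8  # total labels: ACP, ADP, AHP, AIP, AMP, AOP, IMP, NUP

def multilabel_metrics(true_sets, pred_sets, n_labels=M):
    n = len(true_sets)
    precision = coverage = accuracy = abs_true = abs_false = 0.0
    for L, Lp in zip(true_sets, pred_sets):
        inter, union = len(L & Lp), len(L | Lp)
        precision += inter / len(Lp) if Lp else 0.0
        coverage += inter / len(L) if L else 0.0
        accuracy += inter / union if union else 1.0
        abs_true += 1.0 if L == Lp else 0.0          # exact-match indicator
        abs_false += (union - inter) / n_labels      # wrong labels / M
    return {name: total / n for name, total in
            [("precision", precision), ("coverage", coverage),
             ("accuracy", accuracy), ("absolute_true", abs_true),
             ("absolute_false", abs_false)]}

m = multilabel_metrics([{"AMP", "ACP"}, {"AIP"}],
                       [{"AMP"},        {"AIP"}])
```

Absolute true is the strictest of the five (the whole label set must match), while absolute false counts the average fraction of mislabelled positions, so lower is better for it alone.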
furthermore, in a specific class of functional peptides, if a peptide having such a function is considered as a positive sample, other peptides not having such a function are defined as negative samples. To evaluate the predictive performance of different multi-labeling methods on different peptide functions, the application applied Sensitivity (SEN) and Specificity (SPE), calculated as follows:
SEN = TP / (TP + FN)

SPE = TN / (TN + FP)
wherein TP, TN, FP, and FN correspond to the number of true positives, true negatives, false positives, and false negatives, respectively. For example, on the test set, ACP is a positive sample when the present application calculates SEN and SPE of ACP, and other peptides in the group that have no anticancer function are considered negative samples. Using the prediction results of these positive and negative samples, SEN and SPE are obtained through the confusion matrix. Details of the prediction model are shown in table 2:
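The two formulas reduce to simple ratios of confusion-matrix counts; the counts in this sketch are invented for illustration.

```python
# Sensitivity and specificity from one-vs-rest confusion-matrix counts.
def sen_spe(tp, tn, fp, fn):
    """SEN = TP/(TP+FN); SPE = TN/(TN+FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# E.g. scoring ACP on the test set: ACPs are positives, all other
# peptides are negatives (toy counts).
sen, spe = sen_spe(tp=80, tn=90, fp=10, fn=20)
```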
TABLE 2 details of predictive models
[Table 2 is rendered as an image in the original publication and is not reproduced here.]
S3, implementation:
the TensorFlow's advanced Keras API is used to build and train the predictive model of the present application. The API can be used for rapid prototyping, tip research and actual production, has the advantages of simplicity, easiness in use, modularization, combinability and easiness in expansion, and simultaneously, the GPU is used for acceleration.
Training process: in the multi-label prediction model of the application, a grid-search procedure is adopted, and the hyper-parameters are tuned on the training set through 5-fold cross-validation. As shown in Table 2, the tuned parameters include the embedding size, pool size, GRU units, fully connected dimension, and learning rate. The CNN layer is constructed using the Conv1D function in Keras. To extract varied convolution features, three convolution kernel sizes ks ∈ {2, 3, 8} are selected. The application then trains the model with the Adam (adaptive moment estimation) optimizer, which integrates two popular algorithms, AdaGrad (for handling sparse gradients) and RMSProp (for handling non-stationary data), and is suited to optimization problems in machine learning with large data volumes and high feature dimensionality. Finally, the application trains the model with a batch size of 64 for 30 epochs. Because of the random initialization in the deep learning framework, nine identical deep models are trained in parallel with different random initializations, and the final prediction score for a test sample is obtained by averaging the scores of all models.
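The grid search described above can be sketched by enumerating every combination of the candidate hyper-parameter values given in the text (embedding size, pool size, GRU units, fully connected dimension, learning rate). The helper name is our own, and the actual 5-fold cross-validated training of each candidate is omitted from this sketch.

```python
# Enumerate the hyper-parameter grid a 5-fold CV grid search would scan.
from itertools import product

GRID = {
    "embedding_size": [50, 100, 150],
    "pool_size": [3, 5],
    "gru_units": [50, 100, 150],
    "fc_dim": [128, 256],
    "learning_rate": [0.01, 0.001],
}

def grid_candidates(grid):
    """Yield one dict per point in the Cartesian product of the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

candidates = list(grid_candidates(GRID))  # 3*2*3*2*2 = 72 configurations
```

Each of the 72 candidate configurations would be trained and scored by 5-fold cross-validation, and the best-scoring configuration kept for the final model.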
S4, results and discussion:
s4-1, performance comparison of different multi-label prediction models: the multi-label prediction model is built with an embedding mechanism, CNN, and RNN. For the CNN layer, the present application uses a multi-scale CNN to extract multiple convolution features; early studies reported that CNNs with small convolution kernels are more economical and perform better. For the RNN layer, the gated recurrent unit (GRU) and long short-term memory (LSTM) are commonly used for classification tasks. To select the best-performing model, five deep learning models are created for comparison: the basic CNN, CNN with BiGRU (CNN-BiGRU), CNN with bidirectional LSTM (CNN-BiLSTM), CNN with small convolution kernels (CNN-sk), and CNN-sk with BiGRU (CNN-sk-BiGRU).
The present application evaluates their performance on the training and test datasets. Among the evaluation metrics listed in the method, the most important are accuracy and absolute true value; both metrics for the various models on the training and test datasets are shown in Fig. 1. Compared with the other four multi-label prediction models on the training dataset, CNN-BiGRU achieved the best performance, with an accuracy of 0.749 and an absolute true value of 0.740. Table 3 lists other performance metrics for these multi-label prediction models.
TABLE 3 Performance of various multi-tag predictive models
As shown in Table 3, on the training set the accuracy of CNN-BiGRU was 0.749, the coverage was 0.752, and the absolute false value was 0.108 under 5-fold cross-validation, outperforming the other four models. On the test set, CNN-BiGRU also achieved the best performance on these four indicators. Because the comparison involves only CNN-based models, a model incorporating the BiGRU mechanism can not only obtain convolution features but also dynamically remember or forget the information flow. On these evaluation criteria, CNN-BiLSTM performs similarly to CNN-BiGRU, but CNN-BiLSTM requires a longer convergence time. The GRU, like the LSTM, is an RNN algorithm proposed to solve the long-term memory and back-propagation gradient problems. On one hand, the GRU has fewer parameters than the LSTM: the LSTM has three gating mechanisms (forget gate, input gate, and output gate), while the GRU has only two, the update gate and the reset gate; therefore the convergence time of BiGRU is shorter. On the other hand, the GRU achieves a comparable effect while being easier to train, which greatly improves training efficiency, and it solves these problems better than the traditional RNN. Finally, the model with the best overall performance, CNN-BiGRU, is selected to construct the multi-label bioactive peptide predictor.
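The parameter saving of the GRU over the LSTM noted above can be made concrete by counting the trainable weights of a single recurrent layer. This is a back-of-the-envelope sketch using the textbook formulations; real implementations may add extra bias terms.

```python
def lstm_param_count(input_dim, units):
    # Four weight sets: forget, input, and output gates plus the cell candidate,
    # each with an input matrix, a recurrent matrix, and a bias vector.
    return 4 * (units * input_dim + units * units + units)

def gru_param_count(input_dim, units):
    # Three weight sets: update gate, reset gate, and the candidate activation.
    return 3 * (units * input_dim + units * units + units)

lstm_p = lstm_param_count(100, 100)  # 80400
gru_p = gru_param_count(100, 100)    # 60300
# At equal width, the GRU needs only 3/4 of the LSTM's parameters.
```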
S4-2, influence of different sequence lengths on model performance: in the benchmark dataset, the length of each bioactive peptide ranges from 5 to 500 residues. In view of this broad length range, the present application uses vectors with different fixed lengths as inputs to test whether they affect model performance. First, to ensure that the amount of data in each experiment is the same and as large as possible, the present application extracts samples with length up to 100 from the benchmark dataset, and 2001 sequences are selected for further testing. These are then converted to digital vectors and padded with zeros to different lengths, including 100, 200, 230, 400, and 500. As shown in Table 4, the precision values for the different fixed lengths are similar to one another; the highest value (0.751) is only 1.3% higher than the lowest (0.738). Besides precision, the values of the other four evaluation metrics are also similar in each experiment. Overall, the sequence length has only a slight effect on the model.
TABLE 4 prediction accuracy of various lengths
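The fixed-length inputs used in this experiment can be produced as in the sketch below, which mirrors the integer encoding and zero-padding described in this application; the helper name `encode_and_pad` is an illustrative assumption.

```python
# Map the 20 standard amino acids to the natural numbers 1..20; 0 is the pad value
AA_INDEX = {aa: i + 1 for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}

def encode_and_pad(sequence, fixed_len):
    """Convert a peptide string to an integer vector zero-padded to fixed_len."""
    codes = [AA_INDEX[aa] for aa in sequence]
    return codes + [0] * (fixed_len - len(codes))

vec = encode_and_pad("ACDK", 10)
# vec == [1, 2, 3, 9, 0, 0, 0, 0, 0, 0]
```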
S4-3, case prediction: as described above, the present application constructs a multi-label bioactive peptide prediction model, and the performance comparison of different multi-label methods shows that this prediction model is superior to other methods for multi-functional bioactive peptide prediction. To further illustrate its performance, the present application conducted a case study in which these multi-label predictors predicted given negative and positive samples covering all functions of the eight bioactive peptides. The prediction model shows good accuracy on all eight activity predictions, while the other algorithms find only two to three true positives, as shown in Table 5.
TABLE 5 comparative accuracy with other algorithms
S4-4, prediction results on the test set: the present application tested the procedure with a new test set (7100 AMPs, 2800 ACPs, 3834 AIPs, 2462 AHPs, 1956 ADPs, 2824 AOPs, 2473 IMPs, and 2821 NUPs). As shown in Table 6 and Fig. 2, the per-class accuracies were AHP: 78.4%, AMP: 82.7%, AIP: 80.3%, ACP: 79.5%, ADP: 80.4%, AOP: 80.1%, IMP: 79.9%, and NUP: 80.2%, with an overall accuracy of 80.4%.
TABLE 6 test set prediction results
The embodiments described above and features of the embodiments herein may be combined with each other without conflict.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (9)

1. A polypeptide function prediction method is characterized by comprising the following specific steps:
s1, data set construction and data preprocessing
Extracting eight bioactive peptides, AMPs, ACPs, AIPs, AHPs, ADPs, AOPs, IMPs and NUPs, from the database to construct a benchmark dataset; then preprocessing the benchmark dataset through redundancy and homology bias analysis, 80% of the sequences in the preprocessed data being used to construct a training set and 20% to construct a test set;
s2, a multi-label prediction model is established, which consists of a feature embedding module, a multi-scale convolutional neural network CNN module, a BiGRU module for processing context-dependent sequences, and a classification module;
the characteristic embedding module converts the polypeptide biological sequence vector into a dense vector with a fixed size; the amino acids {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} are assigned to the natural numbers {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}, and a zero-padding method is used to prepare a dense vector with a fixed size of 500×500 dimensions; the embedding size of the feature embedding module is any one of 50, 100, and 150;
the multi-scale convolutional neural network CNN module consists of an encoding module and a decoding module and is formed by stacking a plurality of parallel convolution blocks; the input of each node in a convolutional layer is a small patch of the previous neural network layer, and the sequence is encoded by a plurality of convolution-pooling layers to extract its global features; the pool size of the CNN module is 3 or 5;
the BiGRU module for processing context-dependent sequences receives the forward input and learns the reverse input through forward and reverse gated recurrent units (GRU) to extract context information; the number of GRU units of the BiGRU module is any one of 50, 100, and 150;
the classification module adopts a fully connected layer as the classification module, converts the output of the convolutional layers into vector form, and connects each output neuron to every input neuron; the fully connected dimension of the classification module is 128 or 256;
s3, training the constructed multi-label prediction model with the acquired dataset, and improving the feature extraction and prediction performance of the model by adopting an optimization algorithm; the training learning rate is 0.01 or 0.001;
and S4, evaluating the performance of the multi-label prediction model through five indexes: precision, coverage, accuracy, absolute true value, and absolute false value.
2. The method of claim 1, wherein in step S1, the database comprises DRAMP, antiCP, preAIP, MAHTPRED, bioDADpep and Uniprot; the preprocessing deletes sequences with similarity greater than 90% using the protein sequence or nucleic acid sequence clustering tool CD-HIT, so as to avoid redundancy and homology bias in each functional peptide dataset.
3. The method of claim 2, wherein in step S2, before the embedding mechanism is applied, each polypeptide is converted into a padded vector, which is then input into the multi-label prediction model; the polypeptides are 5-500 amino acids long, and padded vectors of equal length can be processed as input files.
4. A method of predicting polypeptide function as claimed in claim 3, wherein in step S2, in the BiGRU module for processing context-dependent sequences, the update gate z of the j-th hidden unit in the GRU layer is calculated as follows:
z_t^j = σ(W_z x_t + U_z h_{t-1})_j
where σ is the logistic sigmoid function, W_z and U_z are learned weight matrices, x_t is the input, and h_{t-1} is the previous hidden state; the output of the multi-scale convolutional network layer is used as the input;
at time t, the reset gate r of the jth hidden unit is similarly calculated by:
r_t^j = σ(W_r x_t + U_r h_{t-1})_j
the activation of the jth hidden unit at time t is then calculated by:
h_t^j = (1 − z_t^j) h_{t-1}^j + z_t^j h̃_t^j
wherein the candidate activations are calculated accordingly using the following formula:
h̃_t^j = tanh(W x_t + U (r_t ⊙ h_{t-1}))_j
where tanh is the hyperbolic tangent function, ⊙ denotes element-wise multiplication, and r_t is the set of reset gates.
5. The method of claim 4, wherein in step S4, the five indexes of precision, coverage, accuracy, absolute true value and absolute false value are defined as follows:
Precision = (1/N) Σ_{i=1}^{N} ‖L_i ∩ L_i*‖ / ‖L_i*‖
Coverage = (1/N) Σ_{i=1}^{N} ‖L_i ∩ L_i*‖ / ‖L_i‖
Accuracy = (1/N) Σ_{i=1}^{N} ‖L_i ∩ L_i*‖ / ‖L_i ∪ L_i*‖
Absolute true = (1/N) Σ_{i=1}^{N} Δ(L_i, L_i*)
Absolute false = (1/N) Σ_{i=1}^{N} (‖L_i ∪ L_i*‖ − ‖L_i ∩ L_i*‖) / M
where Δ(L_i, L_i*) equals 1 when L_i and L_i* are identical and 0 otherwise, and M is the total number of labels
wherein N is the total number of multifunctional bioactive peptides, ∩ and ∪ denote the intersection and union operations of set theory, ‖·‖ denotes the number of elements in a set, L_i denotes the subset of true labels of the i-th sample, and L_i* denotes the subset of predicted labels of the i-th sample.
6. The method according to claim 5, wherein in step S4, the prediction performance of different multi-labeling methods on different peptide functions is further evaluated by using sensitivity SEN and specificity SPE, and the calculation formula is as follows:
SEN = TP / (TP + FN)
SPE = TN / (TN + FP)
wherein TP, TN, FP, and FN correspond to the number of true positives, true negatives, false positives, and false negatives, respectively.
7. A polypeptide function prediction device, characterized by comprising a dataset construction and data preprocessing module, a feature embedding module, a multi-scale convolutional neural network CNN module, a BiGRU module for processing context-dependent sequences, and a classification module; the dataset construction and data preprocessing module constructs a benchmark dataset by extracting eight bioactive peptides, AMPs, ACPs, AIPs, AHPs, ADPs, AOPs, IMPs and NUPs, from a database; the protein sequence or nucleic acid sequence clustering tool CD-HIT is then adopted to handle redundancy and homology bias by deleting sequences with similarity greater than 90%, 80% of the sequences in the preprocessed data being used to construct a training set and 20% to construct a test set; the feature embedding module converts polypeptides of 5-500 amino acids into padded vectors, and padded vectors of equal length can be processed as input files, so that the polypeptide biological sequence vectors are converted into dense vectors of fixed size; the multi-scale convolutional neural network CNN module consists of an encoding module and a decoding module and is formed by stacking a plurality of parallel convolution blocks, the input of each node in a convolutional layer being a small patch of the previous neural network layer, and the sequence is encoded by a plurality of convolution-pooling layers to extract its global features; the BiGRU module for processing context-dependent sequences receives the forward input and learns the reverse input through forward and reverse gated recurrent units (GRU) to extract context information; the classification module adopts a fully connected layer as the classification module, converts the output of the convolutional layers into vector form, and connects each output neuron to every input neuron.
8. A polypeptide function prediction device, characterized by comprising a processor and a memory, in which an application executable by the processor is stored, for causing the processor to execute the polypeptide function prediction method according to any one of claims 1 to 6.
9. A computer-readable storage medium having stored therein computer-readable instructions for performing the polypeptide function prediction method according to any one of claims 1 to 6.
CN202310160375.3A 2023-02-24 2023-02-24 Polypeptide function prediction method and device Pending CN116312750A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310160375.3A CN116312750A (en) 2023-02-24 2023-02-24 Polypeptide function prediction method and device


Publications (1)

Publication Number Publication Date
CN116312750A true CN116312750A (en) 2023-06-23

Family

ID=86788073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310160375.3A Pending CN116312750A (en) 2023-02-24 2023-02-24 Polypeptide function prediction method and device

Country Status (1)

Country Link
CN (1) CN116312750A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116825198A (en) * 2023-07-14 2023-09-29 湖南工商大学 Peptide sequence tag identification method based on graph annotation mechanism
CN116825198B (en) * 2023-07-14 2024-05-10 湖南工商大学 Peptide sequence tag identification method based on graph annotation mechanism
CN117953973A (en) * 2024-03-21 2024-04-30 电子科技大学长三角研究院(衢州) Specific biological sequence prediction method and system based on sequence homology

Similar Documents

Publication Publication Date Title
CN116312750A (en) Polypeptide function prediction method and device
Naseer et al. NPalmitoylDeep-PseAAC: A predictor of N-palmitoylation sites in proteins using deep representations of proteins and PseAAC via modified 5-steps rule
Baldi et al. Exploiting the past and the future in protein secondary structure prediction
Zeng et al. Deep collaborative filtering for prediction of disease genes
CN114743600A (en) Gate-controlled attention mechanism-based deep learning prediction method for target-ligand binding affinity
Wang et al. A novel matrix of sequence descriptors for predicting protein-protein interactions from amino acid sequences
Manzoor et al. Protein encoder: An autoencoder-based ensemble feature selection scheme to predict protein secondary structure
Rahman et al. IDMIL: an alignment-free Interpretable Deep Multiple Instance Learning (MIL) for predicting disease from whole-metagenomic data
Yan et al. A review about RNA–protein-binding sites prediction based on deep learning
Xu et al. Eurnet: Efficient multi-range relational modeling of spatial multi-relational data
Cui et al. RMSCNN: a random multi-scale convolutional neural network for marine microbial bacteriocins identification
Chen et al. DeepGly: A deep learning framework with recurrent and convolutional neural networks to identify protein glycation sites from imbalanced data
Du et al. Deep multi-label joint learning for RNA and DNA-binding proteins prediction
CN116386733A (en) Protein function prediction method based on multi-view multi-scale multi-attention mechanism
Deng et al. Predict the protein-protein interaction between virus and host through hybrid deep neural network
Wang et al. Sequence-based protein-protein interaction prediction via support vector machine
Iraji et al. Druggable protein prediction using a multi-canal deep convolutional neural network based on autocovariance method
Ray et al. A weighted power framework for integrating multisource information: gene function prediction in yeast
Taju et al. Using deep learning with position specific scoring matrices to identify efflux proteins in membrane and transport proteins
CN113345535A (en) Drug target prediction method and system for keeping chemical property and function consistency of drug
Guo et al. Kernel risk sensitive loss-based echo state networks for predicting therapeutic peptides with sparse learning
Pogány et al. DT-ML: Drug-Target Metric Learning.
Fadhil et al. Classification of Cancer Microarray Data Based on Deep Learning: A Review
Sreelekshmi et al. Dingo optimized fuzzy Cnn technique for efficient protein structure prediction
Gao et al. RPI-MCNNBLSTM: BLSTM networks combining with multiple convolutional neural network models to predict RNA-protein interactions using multiple biometric features codes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination