CN116312750A - Polypeptide function prediction method and device - Google Patents

Publication number
CN116312750A
CN116312750A (application CN202310160375.3A)
Authority
CN
China
Prior art keywords
module
input
sequence
polypeptide
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310160375.3A
Other languages
Chinese (zh)
Inventor
赖仞
容明强
谷陟欣
李宗煦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Peide Biomedical Co ltd
Original Assignee
Chengdu Peide Biomedical Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Peide Biomedical Co ltd filed Critical Chengdu Peide Biomedical Co ltd
Priority to CN202310160375.3A priority Critical patent/CN116312750A/en
Publication of CN116312750A publication Critical patent/CN116312750A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00: ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/20: Protein or domain folding
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G06N3/0442: Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/09: Supervised learning
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20: Supervised data analysis
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioethics (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a polypeptide function prediction method and device. The method comprises: constructing a dataset and preprocessing the data; establishing a multi-label prediction model consisting of a feature embedding module, a multi-scale convolutional neural network CNN module, a BiGRU module for processing context-related sequences, and a classification module; training the constructed network model with the acquired dataset and adopting an optimization algorithm to improve the model's feature extraction and parameter prediction performance; and evaluating the performance of the multi-label prediction model by calculating five indexes: precision, coverage, accuracy, absolute true and absolute false. The method effectively exploits the associations between labels, improves the precision of polypeptide function prediction, and has strong operability and practicability; it can be applied simultaneously to the function prediction of eight bioactive peptides, achieving a precision of up to 80.4 percent, and is a reasonable and effective prediction method.

Description

Polypeptide function prediction method and device
Technical Field
The invention belongs to the technical field of the intersection of computer application technology and biotechnology, and particularly relates to a polypeptide function prediction method and device.
Background
Bioactive peptides are a class of peptides with hormonal or pharmaceutical activity that modulate physiological functions by binding to specific receptors on target cells. They are widely found in species throughout nature, from lower bacteria to higher vertebrates. Bioactive peptides are generally produced by cleavage and modification of intracellularly synthesized protein precursors and carry particular biological functions. Because of the wide variety of bioactive peptides and their functional importance, many researchers have focused on these small peptides and discovered a number of new bioactive peptides. Although relatively costly, peptides have higher tolerability, better specificity and stronger interactions with proteins than small-molecule compounds, which are cheap to produce and easy to transport but less specific. As a result, more than 70 peptides have been approved as therapeutic drugs in the United States, Europe and Japan, more than 200 are in clinical trials, and more than 600 are in preclinical trials. As of September 2021, there were a total of 1264 polypeptide drugs worldwide. More and more bioactive peptides with multiple functions are being identified.
To obtain more effective peptide therapies, identifying peptide function is particularly important. However, the rate at which peptide function can be identified by experimental methods cannot keep pace with the rate at which large numbers of bioactive peptides appear in the genomic era. With the increasing number of polypeptides, many computational methods have emerged to guide researchers in pre-screening peptide functions; computational methods such as machine learning are therefore becoming increasingly important in predicting peptide functions. Despite great progress, the following challenges remain. First, many machine learning based prediction methods are hampered by small sample sizes, which tend to cause over-fitting and poor generalization. Second, most previous research has focused on predicting single-function bioactive peptides; algorithms for jointly predicting multiple functions rarely appear, and their accuracy is low.
Disclosure of Invention
The invention aims to provide a polypeptide function prediction method and device that address the defects in the prior art.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the first object of the present invention is to provide a polypeptide function prediction method, comprising the following specific steps:
step S1, data set construction and data preprocessing
Extracting eight bioactive peptides AMPs, ACPs, AIPs, AHPs, ADPs, AOPs, IMPs and NUPs from the database to construct a benchmark dataset; then, preprocessing the data of the reference data set through redundancy and homology deviation analysis, wherein 80% of sequences in the preprocessed data are used for constructing a training set, and 20% of sequences are used for constructing a test set;
step S2, a multi-label prediction model is established, and the multi-label prediction model consists of a feature embedding module, a multi-scale convolutional neural network CNN module, a BiGRU module for processing context related sequences and a classification module;
the feature embedding module converts the polypeptide biological sequence vector into a dense vector of fixed size; the amino acids {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} are assigned the natural numbers {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}, and a zero-padding method is used to prepare dense vectors of a fixed length of 500; the embedding size of the feature embedding module is any one of 50, 100 and 150;
the multi-scale convolutional neural network CNN module consists of an encoding module and a decoding module and is formed by stacking a plurality of parallel convolution blocks; the input of each node in a convolution layer is a small patch of the previous network layer, and the sequence is encoded by a plurality of convolution-pooling layers to extract its global features; the pool size of the CNN module is 3 or 5;
the BiGRU module for processing context-related sequences extracts context information through forward and backward gated recurrent units GRU, receiving the forward input and learning the reverse input; the number of GRU units of the BiGRU module is any one of 50, 100 and 150;
the classification module adopts a fully connected layer: the output of the convolution layer is converted into vector form, and every output neuron is connected to every input neuron; the fully connected dimension of the classification module is 128 or 256;
step S3, training the constructed multi-label prediction model by using the acquired dataset, and adopting an optimization algorithm to improve the model's feature extraction and parameter prediction performance; the training learning rate is 0.01 or 0.001;
and step S4, evaluating the performance of the multi-label prediction model by calculating five indexes: precision, coverage, accuracy, absolute true and absolute false.
Further, in step S1, the databases include DRAMP, AntiCP, PreAIP, mAHTPred, BioDADpep and UniProt; the preprocessing deletes sequences with a similarity greater than 90% using the protein/nucleic acid sequence clustering tool CD-HIT, so as to avoid redundancy and homology bias within each functional peptide dataset.
Furthermore, in step S2, before the embedding mechanism is applied, each polypeptide is converted into a padded vector, which is then input into the multi-label prediction model; the polypeptides are 5-500 amino acids long, and only padded vectors of equal length can be processed as input files.
Further, in step S2, in the BiGRU module for processing context-related sequences, the update gate z of the j-th hidden unit in the GRU layer at time t is calculated as:

z_t^j = σ(W_z x_t + U_z h_{t-1})^j

where σ is the logistic sigmoid function, W_z and U_z are learned weight matrices, x_t is the input, and h_{t-1} is the previous hidden state; the output of the multi-scale convolutional network layer is used as the input.

At time t, the reset gate r of the j-th hidden unit is similarly calculated by:

r_t^j = σ(W_r x_t + U_r h_{t-1})^j

The activation h_t^j of the j-th hidden unit at time t is then calculated by:

h_t^j = (1 - z_t^j) h_{t-1}^j + z_t^j h̃_t^j

wherein the candidate activation h̃_t^j is calculated using the following formula:

h̃_t^j = tanh(W x_t + U(r_t ⊙ h_{t-1}))^j

where tanh is the hyperbolic tangent function, and ⊙ and r_t denote element-wise multiplication and the set of reset gates, respectively.
Furthermore, the classification module adopts a fully connected layer: the output of the convolution layer is converted into vector form, and each output neuron is connected to each input neuron.
Further, in step S4, the five indexes precision, coverage, accuracy, absolute true and absolute false are defined as follows:

Precision = (1/N) Σ_{i=1}^{N} ‖L_i ∩ L_i^*‖ / ‖L_i^*‖

Coverage = (1/N) Σ_{i=1}^{N} ‖L_i ∩ L_i^*‖ / ‖L_i‖

Accuracy = (1/N) Σ_{i=1}^{N} ‖L_i ∩ L_i^*‖ / ‖L_i ∪ L_i^*‖

Absolute true = (1/N) Σ_{i=1}^{N} Δ(L_i, L_i^*)

Absolute false = (1/N) Σ_{i=1}^{N} (‖L_i ∪ L_i^*‖ - ‖L_i ∩ L_i^*‖) / M

where N is the total number of multifunctional bioactive peptides, ∪ and ∩ denote set-theoretic union and intersection, ‖·‖ counts the number of elements in a set, L_i is the subset of true labels of the i-th sample, L_i^* is the subset of predicted labels of the i-th sample, Δ(L_i, L_i^*) equals 1 if the two subsets are identical and 0 otherwise, and M is the total number of labels.
Further, in step S4, the sensitivity SEN and specificity SPE are also used to evaluate the predictive performance of different multi-label methods on different peptide functions, calculated as follows:

SEN = TP / (TP + FN)

SPE = TN / (TN + FP)

where TP, TN, FP, and FN correspond to the numbers of true positives, true negatives, false positives, and false negatives, respectively.
The second object of the invention is to provide a polypeptide function prediction device, which comprises a dataset construction and data preprocessing module, a feature embedding module, a multi-scale convolutional neural network CNN module, a BiGRU module for processing context-related sequences and a classification module. The dataset construction and data preprocessing module constructs a reference dataset by extracting eight bioactive peptide classes AMPs, ACPs, AIPs, AHPs, ADPs, AOPs, IMPs and NUPs from databases; the protein/nucleic acid sequence clustering tool CD-HIT is then used to handle redundancy and homology bias by deleting sequences with a similarity greater than 90%, after which 80% of the preprocessed sequences are used to construct the training set and 20% the test set. The feature embedding module converts polypeptides of 5-500 amino acids into padded vectors of equal length that can be processed as input files, thereby converting polypeptide biological sequence vectors into dense vectors of fixed size. The multi-scale convolutional neural network CNN module consists of an encoding module and a decoding module formed by stacking a plurality of parallel convolution blocks; the input of each node in a convolution layer is a small patch of the previous network layer, and the sequence is encoded by a plurality of convolution-pooling layers to extract its global features. The BiGRU module for processing context-related sequences extracts context information through forward and backward gated recurrent units GRU, receiving the forward input and learning the reverse input. The classification module adopts a fully connected layer: the output of the convolution layer is converted into vector form, and every output neuron is connected to every input neuron.
A third object of the present invention is to provide a polypeptide function prediction apparatus, comprising a processor and a memory, wherein the memory stores an application program executable by the processor, for causing the processor to execute the above-mentioned polypeptide function prediction method.
A fourth object of the present invention is to provide a computer-readable storage medium having stored therein computer-readable instructions for performing the above-described polypeptide function prediction method.
Compared with the prior art, the technical scheme provided by the invention has the beneficial effects that:
(1) The method comprises: constructing a dataset and preprocessing the data; establishing a multi-label prediction model consisting of a feature embedding module, a multi-scale convolutional neural network CNN module, a BiGRU module for processing context-related sequences, and a classification module; training the constructed network model with the acquired dataset and adopting an optimization algorithm to improve the model's feature extraction and parameter prediction performance; and evaluating the performance of the multi-label prediction model by calculating five indexes: precision, coverage, accuracy, absolute true and absolute false. In this process, the method effectively exploits the associations between labels, improves the accuracy of polypeptide function prediction, and has strong operability and practicability; it can be applied simultaneously to the function prediction of the eight bioactive peptides AMPs, ACPs, AIPs, AHPs, ADPs, AOPs, IMPs and NUPs, achieving an accuracy of up to 80.4 percent, and is a reasonable and effective prediction method.
Drawings
FIG. 1 is an overall ranking graph comparing the accuracy and absolute true values of the models on the training and test sets;
fig. 2 is a graph of the accuracy of test-set predictions for AIP, AMP, AHP, ACP, ADP, AOP, IMP and NUP.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings. Where specific techniques or conditions are not indicated in the examples, they follow the techniques or conditions described in the literature in this field or the product specifications. Reagents or apparatus whose manufacturer is not indicated are conventional commercially available products.
The database information used in the implementation of the invention is as follows:
DRAMP: antibacterial peptide database, website http:// drain. Cpu-bioin for. Org/;
AntiCP: anticancer peptide database, website https:// webs. Iiitd. Edu. In/raghava/anti-icp 2/;
PreAIP: the website http:// kurta14. Bio. Kurtech. Ac. Jp/PreAIP/;
mAHTPred: the website http:// the gleelab. Org/mAHTPred;
BioDADpep: the bioinformatics database of antidiabetic peptides, website http:// omicsbase. Com/BioDADPep/;
uniprot (Swiss-prot): protein database, website https:// www.uniprot.org/.
Example 1
S1, data set construction and data preprocessing
S1-1, determining a reference dataset: a literature search was performed using 'bioactive peptide' as the keyword, and eight health-related functional peptide classes were collected from the literature: antibacterial peptides (AMPs), anticancer peptides (ACPs), anti-inflammatory peptides (AIPs), antihypertensive peptides (AHPs), antidiabetic peptides (ADPs), antioxidant peptides (AOPs), immunomodulatory peptides (IMPs) and neuroactive peptides (NUPs). Because small amounts of functional peptide data are insufficient to train a deep learning model, the invention uses these eight bioactive peptide classes, each with more than 500 sequences, to train the model.
S1-2, functional definitions and dataset preparation: AMP is a class of basic polypeptides with in vivo antibacterial activity. ACP is a series of peptides that can inhibit the proliferation or migration of cancer cells or inhibit the formation of tumour blood vessels. AIP is an endogenous peptide with the immunotherapeutic capacity to inhibit antigen-specific T(H)1-driven responses and regulatory T cell production. AHP is a peptide with the potential to ameliorate hypertension by scavenging free radicals and inhibiting the activity of angiotensin-converting enzyme and renin. ADP is a peptide that acts on beta cells or T cells to regulate insulin production and is also helpful in evaluating diabetes symptoms. AOP is a peptide that inhibits or delays lipid oxidation and eliminates free-radical damage. IMP is a peptide that exists in vivo and has immune function, mainly relevant to diseases such as lung cancer, leukemia and osteosarcoma. NUP is a peptide that can act as a hormone or neurotransmitter, interacting with alpha, delta and gamma receptors in vivo, and can play roles in easing pain and regulating respiration and body temperature. Preparation of the reference dataset: AMP, ACP, AIP, AHP, ADP, AOP, IMP and NUP sequences were extracted from DRAMP, AntiCP, PreAIP, mAHTPred, BioDADpep and UniProt (Swiss-Prot), respectively. To avoid redundancy and homology bias within each functional peptide dataset, the application uses the protein/nucleic acid sequence clustering tool CD-HIT to delete sequences with a similarity >90%, resulting in a total of 1891 AMPs, 864 ACPs, 1651 AIPs, 1421 AHPs, 992 ADPs, 1201 AOPs, 1192 IMPs and 1097 NUPs. Finally, 80% of the sequences are selected to construct the training set and 20% to construct the test set; Table 1 lists the details of the dataset.
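The 80/20 partition of the preprocessed data can be sketched as follows. This is a minimal illustration only: the toy records, function name and random seed are invented, and CD-HIT redundancy removal at 90% identity is assumed to have been applied to the sequences beforehand.

```python
# Illustrative 80/20 train/test split for step S1.
import random

def split_dataset(samples, train_frac=0.8, seed=42):
    """Shuffle reproducibly, then cut into train and test partitions."""
    rng = random.Random(seed)
    shuffled = list(samples)   # leave the caller's list untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Toy records: (sequence identifier, set of function labels)
data = [("PEP%03d" % i, {"AMP"} if i % 2 else {"ACP", "AIP"})
        for i in range(100)]
train, test = split_dataset(data)
```

In practice the split would be stratified per functional class so that each of the eight label sets keeps roughly the same 80/20 proportion.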
TABLE 1 details of datasets
[Table 1 is rendered as an image in the original publication and is not reproduced here.]
S2, construction of feature prediction algorithm
S2-1, model overview: in this work, the application uses a variety of deep neural network algorithms to construct a multi-label predictor. FIG. 1 illustrates the construction of the proposed multi-label predictor for multifunctional bioactive peptides. It has four main parts. The first part is a feature embedding module that converts sequence vectors into dense vectors of fixed size. The second part is a multi-scale convolutional neural network (CNN) module. The third part is the BiGRU module that processes context-dependent sequences. The last part is the classification module. In each output unit, sigmoid is used as the activation function, outputting a score between 0 and 1. These modules are described in the next sections, and further experimental details are given in the implementation section.
S2-2, feature embedding module: before the embedding mechanism is applied, each peptide is converted into a padded vector, which is then input into the prediction model. First, the amino acids {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} are assigned the natural numbers {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}, respectively. These peptides range in length from 5 (the minimum length of all sequences) to 500 (the maximum length of all sequences), but only peptide vectors of equal length can be handled as input files. Thus, peptides of a fixed length of 500 amino acids are prepared using the zero-padding method, and the sequences are encoded into 500-dimensional vectors. After the sequence vectors are obtained, the positive integers in each vector are converted into dense continuous feature vectors of fixed size by an embedding layer; such mechanisms are commonly used in natural language processing. In this study, the weights of the embedding layer are updated during training, so the learned embedding matrix comes to represent the peptide sequence.
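The integer encoding and zero padding described above can be sketched as follows. The 20-letter alphabet and the maximum length of 500 follow the text (the letter set printed in the original appears to have lost K and L in extraction; the standard 20 amino acids are assumed here), while the function name is our own.

```python
# Integer encoding and zero padding for the feature embedding module.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
MAX_LEN = 500  # longest peptide in the dataset

def encode_peptide(sequence, max_len=MAX_LEN):
    """Map each residue to its integer code, then zero-pad to max_len."""
    codes = [AA_TO_INT[aa] for aa in sequence.upper()]
    if len(codes) > max_len:
        raise ValueError("sequence longer than the supported maximum")
    return codes + [0] * (max_len - len(codes))

vec = encode_peptide("ACDKL")  # a 5-residue toy peptide
```

The resulting 500-dimensional integer vector is what the trainable embedding layer then maps to dense continuous features.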
S2-3, multi-scale convolutional neural network (CNN) module: this module mainly comprises an encoding module and a decoding module, formed by stacking a plurality of parallel convolution blocks. The input of each node in a convolution layer is a small patch of the previous network layer; the sequence is encoded by a plurality of convolution-pooling layers to extract its global features, which facilitates deeper analysis and yields features of a higher degree of abstraction. After the convolution feature matrix is obtained, feature maps of different scales are extracted through a deconvolution layer to reduce the number of features, and a max-pooling mechanism is adopted to prevent excessive overlap.
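A minimal numpy sketch of the multi-scale idea, assuming parallel 1-D convolutions with different kernel sizes (the implementation section later mentions ks ∈ {2, 3, 8}), each followed by global max pooling, with the resulting features concatenated. The random weights and dimensions are purely illustrative, not the trained model.

```python
# Shape-level illustration of multi-scale 1-D convolution + max pooling.
import numpy as np

rng = np.random.default_rng(0)
seq = rng.normal(size=(500, 50))  # (sequence length, embedding dim)

def conv1d_maxpool(x, kernel_size, n_filters, rng):
    """Valid 1-D convolution over the length axis, then global max pool."""
    length, dim = x.shape
    w = rng.normal(size=(n_filters, kernel_size, dim))
    # All sliding windows of the input: (positions, kernel_size, dim)
    windows = np.stack([x[i:i + kernel_size]
                        for i in range(length - kernel_size + 1)])
    feats = np.einsum("lkd,fkd->lf", windows, w)  # (positions, filters)
    return feats.max(axis=0)                      # global max pooling

multi_scale = np.concatenate(
    [conv1d_maxpool(seq, ks, 64, rng) for ks in (2, 3, 8)])
```

Concatenating the pooled outputs of the three kernel sizes gives one fixed-length feature vector regardless of which scale each feature came from.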
S2-4, BiGRU module for processing context-related sequences: recurrent neural networks (RNNs), and the bidirectional GRU in particular, have achieved significant results in classification tasks. The sequence stream is analysed by the bidirectional GRU, and sequence context information is extracted by receiving the forward input and learning the reverse input through two gated recurrent units (GRUs) running in the forward and reverse directions. In this study, the BiGRU uses its internal state to process sequence vectors, exploiting sequence context information in both directions. The gated recurrent unit (GRU) is the integral component of a BiGRU, dynamically remembering or forgetting sequence information. For more detail on the GRU, the following describes how the activation of the j-th hidden unit in the GRU layer is calculated. First, the update gate z of the j-th hidden unit at time t is calculated by the following formula:
z_t^j = σ(W_z x_t + U_z h_{t-1})^j

where σ is the logistic sigmoid function, W_z and U_z are learned weight matrices, x_t is the input, and h_{t-1} is the previous hidden state; the output of the multi-scale convolutional network layer is used as the input.

At time t, the reset gate r of the j-th hidden unit is similarly calculated by:

r_t^j = σ(W_r x_t + U_r h_{t-1})^j

The activation h_t^j of the j-th hidden unit at time t is then calculated by:

h_t^j = (1 - z_t^j) h_{t-1}^j + z_t^j h̃_t^j

wherein the candidate activation h̃_t^j is calculated using the following formula:

h̃_t^j = tanh(W x_t + U(r_t ⊙ h_{t-1}))^j

where tanh is the hyperbolic tangent function, and ⊙ and r_t denote element-wise multiplication and the set of reset gates, respectively.
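A single GRU time step built from the update-gate, reset-gate, candidate and activation equations described above can be sketched in numpy as follows. The dimensions, the all-zero weight matrices and the input values are invented purely to make the arithmetic checkable by hand.

```python
# One GRU time step, following the gate equations in the text.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """Compute h_t from x_t and h_{t-1}."""
    z = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate
    h_cand = np.tanh(W @ x_t + U @ (r * h_prev))    # candidate activation
    return (1.0 - z) * h_prev + z * h_cand          # new hidden state

d_in, d_h = 4, 3
x = np.ones(d_in)
h0 = np.full(d_h, 0.8)
zeros_in, zeros_h = np.zeros((d_h, d_in)), np.zeros((d_h, d_h))
# With all-zero weights: z = r = sigmoid(0) = 0.5 and the candidate is
# tanh(0) = 0, so h_t = (1 - 0.5) * h_prev = 0.5 * h_prev.
h1 = gru_step(x, h0, zeros_in, zeros_h, zeros_in, zeros_h,
              zeros_in, zeros_h)
```

The bidirectional variant simply runs two such recurrences, one over the sequence and one over its reversal, and concatenates the hidden states.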
S2-5, classification module: the application adopts a fully connected layer as the classification module. The output of the convolution layer is converted into vector form, and each output neuron can be connected to each input neuron. A rectified linear unit (ReLU) is used as the activation function, and the vector from the fully connected layer is used as the input to the output layer. In the multi-label problem, the probability of each node is independent of the others, so binary cross-entropy is used as the loss function, and sigmoid is used as the activation function to obtain a score between 0 and 1 for each node. Finally, the application uses a threshold of 0.7 to derive the predicted label for each category, and the elements of the multidimensional prediction vector correspond to the labels ACP, ADP, AHP, AIP, AMP, AOP, IMP and NUP.
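The final classification step, independent per-label sigmoid scores cut at the 0.7 threshold, can be illustrated as follows. The threshold and the label order follow the text; the score values are invented.

```python
# From per-label sigmoid scores to a predicted label set (threshold 0.7).
LABELS = ["ACP", "ADP", "AHP", "AIP", "AMP", "AOP", "IMP", "NUP"]

def predict_labels(scores, threshold=0.7):
    """Return every label whose independent sigmoid score clears the threshold."""
    return {lab for lab, s in zip(LABELS, scores) if s >= threshold}

pred = predict_labels([0.92, 0.10, 0.05, 0.71, 0.88, 0.30, 0.66, 0.02])
```

Because each node's probability is independent, a single peptide can be assigned several functions at once, which is exactly the multi-label setting the model targets.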
S2-6, evaluation metrics:
in previous studies, several evaluation metrics were presented to evaluate the performance of the multi-label model. To evaluate the performance of the methods of the present application, these measurements were taken, including precision, coverage, accuracy, absolute true values, and absolute false values. These indices are defined as follows:
Figure BDA0004093921900000111
Figure BDA0004093921900000112
Figure BDA0004093921900000113
Figure BDA0004093921900000114
wherein N is the total number of the multifunctional bioactive peptides, U represents the union in the ensemble theory, and refers to the operation of counting the number of elements by the intersection of the set theory, L (L) i Representing the subset of the ith sample with true tags,
Figure BDA0004093921900000115
representing a subset of labels with predictions for an ith sample, and
Figure BDA0004093921900000116
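The five indexes can be computed directly from true and predicted label sets. The following sketch uses invented label sets for the demonstration, with M = 8 labels as in the text; the function name is our own.

```python
# The five multi-label metrics over true/predicted label sets.
M = 8  # total labels: ACP, ADP, AHP, AIP, AMP, AOP, IMP, NUP

def multilabel_metrics(true_sets, pred_sets, n_labels=M):
    n = len(true_sets)
    precision = coverage = accuracy = abs_true = abs_false = 0.0
    for L, Lp in zip(true_sets, pred_sets):
        inter, union = len(L & Lp), len(L | Lp)
        precision += inter / len(Lp) if Lp else 0.0
        coverage += inter / len(L) if L else 0.0
        accuracy += inter / union if union else 1.0
        abs_true += 1.0 if L == Lp else 0.0          # exact-match indicator
        abs_false += (union - inter) / n_labels      # wrong labels / M
    return {name: total / n for name, total in
            [("precision", precision), ("coverage", coverage),
             ("accuracy", accuracy), ("absolute_true", abs_true),
             ("absolute_false", abs_false)]}

m = multilabel_metrics([{"AMP", "ACP"}, {"AIP"}],
                       [{"AMP"},        {"AIP"}])
```

Absolute true is the strictest of the five (the whole label set must match), while absolute false counts the average fraction of mislabelled positions, so lower is better for it alone.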
furthermore, in a specific class of functional peptides, if a peptide having such a function is considered as a positive sample, other peptides not having such a function are defined as negative samples. To evaluate the predictive performance of different multi-labeling methods on different peptide functions, the application applied Sensitivity (SEN) and Specificity (SPE), calculated as follows:
SEN = TP / (TP + FN)

SPE = TN / (TN + FP)
wherein TP, TN, FP, and FN correspond to the number of true positives, true negatives, false positives, and false negatives, respectively. For example, on the test set, ACP is a positive sample when the present application calculates SEN and SPE of ACP, and other peptides in the group that have no anticancer function are considered negative samples. Using the prediction results of these positive and negative samples, SEN and SPE are obtained through the confusion matrix. Details of the prediction model are shown in table 2:
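The two formulas reduce to simple ratios of confusion-matrix counts; the counts in this sketch are invented for illustration.

```python
# Sensitivity and specificity from one-vs-rest confusion-matrix counts.
def sen_spe(tp, tn, fp, fn):
    """SEN = TP/(TP+FN); SPE = TN/(TN+FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# E.g. scoring ACP on the test set: ACPs are positives, all other
# peptides are negatives (toy counts).
sen, spe = sen_spe(tp=80, tn=90, fp=10, fn=20)
```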
TABLE 2 details of predictive models
[Table 2 is rendered as an image in the original publication and is not reproduced here.]
S3, implementation:
the TensorFlow's advanced Keras API is used to build and train the predictive model of the present application. The API can be used for rapid prototyping, tip research and actual production, has the advantages of simplicity, easiness in use, modularization, combinability and easiness in expansion, and simultaneously, the GPU is used for acceleration.
Training process: in the multi-label prediction model of the application, a grid-search procedure is adopted, and the hyper-parameters are tuned on the training set through 5-fold cross-validation. As shown in Table 2, the tuned parameters include the embedding size, pool size, GRU units, fully connected dimension, and learning rate. The CNN layer is constructed using the Conv1D function in Keras. To extract varied convolution features, three convolution kernel sizes ks ∈ {2, 3, 8} are selected. The application then trains the model with the Adam (adaptive moment estimation) optimizer, which integrates two popular algorithms, AdaGrad (for handling sparse gradients) and RMSProp (for handling non-stationary data), and is suited to optimization problems in machine learning with large data volumes and high feature dimensionality. Finally, the application trains the model with a batch size of 64 for 30 epochs. Because of the random initialization in the deep learning framework, nine identical deep models are trained in parallel with different random initializations, and the final prediction score for a test sample is obtained by averaging the scores of all models.
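The grid search described above can be sketched by enumerating every combination of the candidate hyper-parameter values given in the text (embedding size, pool size, GRU units, fully connected dimension, learning rate). The helper name is our own, and the actual 5-fold cross-validated training of each candidate is omitted from this sketch.

```python
# Enumerate the hyper-parameter grid a 5-fold CV grid search would scan.
from itertools import product

GRID = {
    "embedding_size": [50, 100, 150],
    "pool_size": [3, 5],
    "gru_units": [50, 100, 150],
    "fc_dim": [128, 256],
    "learning_rate": [0.01, 0.001],
}

def grid_candidates(grid):
    """Yield one dict per point in the Cartesian product of the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

candidates = list(grid_candidates(GRID))  # 3*2*3*2*2 = 72 configurations
```

Each of the 72 candidate configurations would be trained and scored by 5-fold cross-validation, and the best-scoring configuration kept for the final model.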
S4, results and discussion:
s4-1, performance comparison of different multi-label prediction models: the multi-label prediction model is built with an embedding mechanism, CNN, and RNN. For the CNN layer, the present application uses a multi-scale CNN to extract multiple convolution features; early studies reported that CNNs with small convolution kernels are more economical and perform better. For the RNN layer, the gated recurrent unit (GRU) and long short-term memory (LSTM) are commonly used for classification tasks. To select the best-performing model, five deep learning models are created for comparison: the basic CNN, CNN with BiGRU (CNN-BiGRU), CNN with bidirectional LSTM (CNN-BiLSTM), CNN with small convolution kernels (CNN-sk), and CNN-sk with BiGRU (CNN-sk-BiGRU).
The present application evaluates their performance on the training and test datasets. Among the evaluation metrics listed in the method, the most important are accuracy and absolute true value; both metrics for the various models on the training and test datasets are shown in Fig. 1. Compared with the other four multi-label prediction models on the training dataset, CNN-BiGRU achieved the best performance, with an accuracy of 0.749 and an absolute true value of 0.740. Table 3 lists other performance metrics for these multi-label prediction models.
TABLE 3 Performance of various multi-tag predictive models
As shown in Table 3, on the training set the accuracy of CNN-BiGRU was 0.749, the coverage was 0.752, and the absolute false value was 0.108 under 5-fold cross-validation, outperforming the other four models. On the test set, CNN-BiGRU also achieved the best performance on these four indicators. Because the comparison involves only CNN-based models, a model incorporating the BiGRU mechanism can not only obtain convolution features but also dynamically remember or forget the information flow. On these evaluation criteria, CNN-BiLSTM performs similarly to CNN-BiGRU, but CNN-BiLSTM requires a longer convergence time. The GRU, like the LSTM, is an RNN algorithm proposed to solve the long-term memory and back-propagation gradient problems. On one hand, the GRU has fewer parameters than the LSTM: the LSTM has three gating mechanisms (forget gate, input gate, and output gate), while the GRU has only two, the update gate and the reset gate; therefore the convergence time of BiGRU is shorter. On the other hand, the GRU achieves a comparable effect while being easier to train, which greatly improves training efficiency, and it solves these problems better than the traditional RNN. Finally, the model with the best overall performance, CNN-BiGRU, is selected to construct the multi-label bioactive peptide predictor.
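The parameter saving of the GRU over the LSTM noted above can be made concrete by counting the trainable weights of a single recurrent layer. This is a back-of-the-envelope sketch using the textbook formulations; real implementations may add extra bias terms.

```python
def lstm_param_count(input_dim, units):
    # Four weight sets: forget, input, and output gates plus the cell candidate,
    # each with an input matrix, a recurrent matrix, and a bias vector.
    return 4 * (units * input_dim + units * units + units)

def gru_param_count(input_dim, units):
    # Three weight sets: update gate, reset gate, and the candidate activation.
    return 3 * (units * input_dim + units * units + units)

lstm_p = lstm_param_count(100, 100)  # 80400
gru_p = gru_param_count(100, 100)    # 60300
# At equal width, the GRU needs only 3/4 of the LSTM's parameters.
```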
S4-2, influence of different sequence lengths on model performance: in the benchmark dataset, the length of each bioactive peptide ranges from 5 to 500 residues. In view of this broad length range, the present application uses vectors with different fixed lengths as inputs to test whether they affect model performance. First, to ensure that the amount of data in each experiment is the same and as large as possible, the present application extracts samples with length up to 100 from the benchmark dataset, and 2001 sequences are selected for further testing. These are then converted to digital vectors and padded with zeros to different lengths, including 100, 200, 230, 400, and 500. As shown in Table 4, the precision values for the different fixed lengths are similar to one another; the highest value (0.751) is only 1.3% higher than the lowest (0.738). Besides precision, the values of the other four evaluation metrics are also similar in each experiment. Overall, the sequence length has only a slight effect on the model.
TABLE 4 prediction accuracy of various lengths
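The fixed-length inputs used in this experiment can be produced as in the sketch below, which mirrors the integer encoding and zero-padding described in this application; the helper name `encode_and_pad` is an illustrative assumption.

```python
# Map the 20 standard amino acids to the natural numbers 1..20; 0 is the pad value
AA_INDEX = {aa: i + 1 for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}

def encode_and_pad(sequence, fixed_len):
    """Convert a peptide string to an integer vector zero-padded to fixed_len."""
    codes = [AA_INDEX[aa] for aa in sequence]
    return codes + [0] * (fixed_len - len(codes))

vec = encode_and_pad("ACDK", 10)
# vec == [1, 2, 3, 9, 0, 0, 0, 0, 0, 0]
```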
S4-3, case prediction: as described above, the present application constructs a multi-label bioactive peptide prediction model, and the performance comparison of different multi-label methods shows that this prediction model is superior to other methods for multi-functional bioactive peptide prediction. To further illustrate its performance, the present application conducted a case study in which these multi-label predictors predicted given negative and positive samples covering all functions of the eight bioactive peptides. The prediction model shows good accuracy on all eight activity predictions, while the other algorithms find only two to three true positives, as shown in Table 5.
TABLE 5 comparative accuracy with other algorithms
S4-4, prediction results on the test set: the present application tested the procedure with a new test set (7100 AMPs, 2800 ACPs, 3834 AIPs, 2462 AHPs, 1956 ADPs, 2824 AOPs, 2473 IMPs, and 2821 NUPs). As shown in Table 6 and Fig. 2, the per-class accuracies were AHP: 78.4%, AMP: 82.7%, AIP: 80.3%, ACP: 79.5%, ADP: 80.4%, AOP: 80.1%, IMP: 79.9%, and NUP: 80.2%, with an overall accuracy of 80.4%.
TABLE 6 test set prediction results
The embodiments described above and features of the embodiments herein may be combined with each other without conflict.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (9)

1. A polypeptide function prediction method is characterized by comprising the following specific steps:
s1, data set construction and data preprocessing
Extracting eight bioactive peptides, AMPs, ACPs, AIPs, AHPs, ADPs, AOPs, IMPs and NUPs, from the database to construct a benchmark dataset; then preprocessing the benchmark dataset through redundancy and homology bias analysis, 80% of the sequences in the preprocessed data being used to construct a training set and 20% to construct a test set;
s2, a multi-label prediction model is established, which consists of a feature embedding module, a multi-scale convolutional neural network CNN module, a BiGRU module for processing context-dependent sequences, and a classification module;
the characteristic embedding module converts the polypeptide biological sequence vector into a dense vector with a fixed size; the amino acids {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} are assigned to the natural numbers {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}, and a zero-padding method is used to prepare a dense vector with a fixed size of 500×500 dimensions; the embedding size of the feature embedding module is any one of 50, 100, and 150;
the multi-scale convolutional neural network CNN module consists of an encoding module and a decoding module and is formed by stacking a plurality of parallel convolution blocks; the input of each node in a convolutional layer is a small patch of the previous neural network layer, and the sequence is encoded by a plurality of convolution-pooling layers to extract its global features; the pool size of the CNN module is 3 or 5;
the BiGRU module for processing context-dependent sequences receives the forward input and learns the reverse input through forward and reverse gated recurrent units (GRU) to extract context information; the number of GRU units of the BiGRU module is any one of 50, 100, and 150;
the classification module adopts a fully connected layer as the classification module, converts the output of the convolutional layers into vector form, and connects each output neuron to every input neuron; the fully connected dimension of the classification module is 128 or 256;
s3, training the constructed multi-label prediction model with the acquired dataset, and improving the feature extraction and prediction performance of the model by adopting an optimization algorithm; the training learning rate is 0.01 or 0.001;
and S4, evaluating the performance of the multi-label prediction model through five indexes: precision, coverage, accuracy, absolute true value, and absolute false value.
2. The method of claim 1, wherein in step S1, the database comprises DRAMP, antiCP, preAIP, MAHTPRED, bioDADpep and Uniprot; the preprocessing deletes sequences with similarity greater than 90% using the protein sequence or nucleic acid sequence clustering tool CD-HIT, so as to avoid redundancy and homology bias in each functional peptide dataset.
3. The method of claim 2, wherein in step S2, before the embedding mechanism is applied, each polypeptide is converted into a padded vector, which is then input into the multi-label prediction model; the polypeptides are 5-500 amino acids long, and padded vectors of equal length can be processed as input files.
4. A method of predicting polypeptide function as claimed in claim 3, wherein in step S2, in the BiGRU module for processing context-dependent sequences, the update gate z of the j-th hidden unit in the GRU layer is calculated as follows:
z_t^j = σ(W_z x_t + U_z h_{t-1})_j
where σ is the logistic sigmoid function, W_z and U_z are learned weight matrices, x_t is the input, and h_{t-1} is the previous hidden state; the output of the multi-scale convolutional network layer is used as the input;
at time t, the reset gate r of the jth hidden unit is similarly calculated by:
r_t^j = σ(W_r x_t + U_r h_{t-1})_j
the activation of the jth hidden unit at time t is then calculated by:
h_t^j = (1 − z_t^j) h_{t-1}^j + z_t^j h̃_t^j
wherein the candidate activations are calculated accordingly using the following formula:
h̃_t^j = tanh(W x_t + U (r_t ⊙ h_{t-1}))_j
where tanh is the hyperbolic tangent function, ⊙ denotes element-wise multiplication, and r_t is the set of reset gates.
5. The method of claim 4, wherein in step S4, the five indexes of precision, coverage, accuracy, absolute true value and absolute false value are defined as follows:
Precision = (1/N) Σ_{i=1}^{N} ‖L_i ∩ L_i*‖ / ‖L_i*‖
Coverage = (1/N) Σ_{i=1}^{N} ‖L_i ∩ L_i*‖ / ‖L_i‖
Accuracy = (1/N) Σ_{i=1}^{N} ‖L_i ∩ L_i*‖ / ‖L_i ∪ L_i*‖
Absolute true = (1/N) Σ_{i=1}^{N} Δ(L_i, L_i*)
Absolute false = (1/N) Σ_{i=1}^{N} (‖L_i ∪ L_i*‖ − ‖L_i ∩ L_i*‖) / M
where Δ(L_i, L_i*) equals 1 when L_i and L_i* are identical and 0 otherwise, and M is the total number of labels
wherein N is the total number of multifunctional bioactive peptides, ∩ and ∪ denote the intersection and union operations of set theory, ‖·‖ denotes the number of elements in a set, L_i denotes the subset of true labels of the i-th sample, and L_i* denotes the subset of predicted labels of the i-th sample.
6. The method according to claim 5, wherein in step S4, the prediction performance of different multi-labeling methods on different peptide functions is further evaluated by using sensitivity SEN and specificity SPE, and the calculation formula is as follows:
SEN = TP / (TP + FN)
SPE = TN / (TN + FP)
wherein TP, TN, FP, and FN correspond to the number of true positives, true negatives, false positives, and false negatives, respectively.
7. A polypeptide function prediction device, characterized by comprising a dataset construction and data preprocessing module, a feature embedding module, a multi-scale convolutional neural network CNN module, a BiGRU module for processing context-dependent sequences, and a classification module; the dataset construction and data preprocessing module constructs a benchmark dataset by extracting eight bioactive peptides, AMPs, ACPs, AIPs, AHPs, ADPs, AOPs, IMPs and NUPs, from a database; the protein sequence or nucleic acid sequence clustering tool CD-HIT is then adopted to handle redundancy and homology bias by deleting sequences with similarity greater than 90%, 80% of the sequences in the preprocessed data being used to construct a training set and 20% to construct a test set; the feature embedding module converts polypeptides of 5-500 amino acids into padded vectors, and padded vectors of equal length can be processed as input files, so that the polypeptide biological sequence vectors are converted into dense vectors of fixed size; the multi-scale convolutional neural network CNN module consists of an encoding module and a decoding module and is formed by stacking a plurality of parallel convolution blocks, the input of each node in a convolutional layer being a small patch of the previous neural network layer, and the sequence is encoded by a plurality of convolution-pooling layers to extract its global features; the BiGRU module for processing context-dependent sequences receives the forward input and learns the reverse input through forward and reverse gated recurrent units (GRU) to extract context information; the classification module adopts a fully connected layer as the classification module, converts the output of the convolutional layers into vector form, and connects each output neuron to every input neuron.
8. A polypeptide function prediction device, characterized by comprising a processor and a memory, in which an application executable by the processor is stored, for causing the processor to execute the polypeptide function prediction method according to any one of claims 1 to 6.
9. A computer-readable storage medium having stored therein computer-readable instructions for performing the polypeptide function prediction method according to any one of claims 1 to 6.
CN202310160375.3A 2023-02-24 2023-02-24 Polypeptide function prediction method and device Pending CN116312750A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310160375.3A CN116312750A (en) 2023-02-24 2023-02-24 Polypeptide function prediction method and device


Publications (1)

Publication Number Publication Date
CN116312750A true CN116312750A (en) 2023-06-23

Family

ID=86788073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310160375.3A Pending CN116312750A (en) 2023-02-24 2023-02-24 Polypeptide function prediction method and device

Country Status (1)

Country Link
CN (1) CN116312750A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116825198A (en) * 2023-07-14 2023-09-29 湖南工商大学 Peptide sequence tag identification method based on graph annotation mechanism
CN116825198B (en) * 2023-07-14 2024-05-10 湖南工商大学 Peptide sequence tag identification method based on graph annotation mechanism
CN117953973A (en) * 2024-03-21 2024-04-30 电子科技大学长三角研究院(衢州) Specific biological sequence prediction method and system based on sequence homology

Similar Documents

Publication Publication Date Title
CN116312750A (en) Polypeptide function prediction method and device
Naseer et al. NPalmitoylDeep-PseAAC: A predictor of N-palmitoylation sites in proteins using deep representations of proteins and PseAAC via modified 5-steps rule
Baldi et al. Exploiting the past and the future in protein secondary structure prediction
Zeng et al. Deep collaborative filtering for prediction of disease genes
CN114743600A (en) Gate-controlled attention mechanism-based deep learning prediction method for target-ligand binding affinity
Wang et al. A novel matrix of sequence descriptors for predicting protein-protein interactions from amino acid sequences
Manzoor et al. Protein encoder: An autoencoder-based ensemble feature selection scheme to predict protein secondary structure
Rahman et al. IDMIL: an alignment-free Interpretable Deep Multiple Instance Learning (MIL) for predicting disease from whole-metagenomic data
Yan et al. A review about RNA–protein-binding sites prediction based on deep learning
Xu et al. Eurnet: Efficient multi-range relational modeling of spatial multi-relational data
Cui et al. RMSCNN: a random multi-scale convolutional neural network for marine microbial bacteriocins identification
Chen et al. DeepGly: A deep learning framework with recurrent and convolutional neural networks to identify protein glycation sites from imbalanced data
Du et al. Deep multi-label joint learning for RNA and DNA-binding proteins prediction
CN116386733A (en) Protein function prediction method based on multi-view multi-scale multi-attention mechanism
Deng et al. Predict the protein-protein interaction between virus and host through hybrid deep neural network
Wang et al. Sequence-based protein-protein interaction prediction via support vector machine
Iraji et al. Druggable protein prediction using a multi-canal deep convolutional neural network based on autocovariance method
Ray et al. A weighted power framework for integrating multisource information: gene function prediction in yeast
Taju et al. Using deep learning with position specific scoring matrices to identify efflux proteins in membrane and transport proteins
CN113345535A (en) Drug target prediction method and system for keeping chemical property and function consistency of drug
Guo et al. Kernel risk sensitive loss-based echo state networks for predicting therapeutic peptides with sparse learning
Pogány et al. DT-ML: Drug-Target Metric Learning.
Fadhil et al. Classification of Cancer Microarray Data Based on Deep Learning: A Review
Sreelekshmi et al. Dingo optimized fuzzy Cnn technique for efficient protein structure prediction
Gao et al. RPI-MCNNBLSTM: BLSTM networks combining with multiple convolutional neural network models to predict RNA-protein interactions using multiple biometric features codes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination