CN114863997A - Anti-cancer peptide prediction method based on bidirectional long-short term memory network and feature fusion - Google Patents
- Publication number
- CN114863997A (application CN202210686266.0A)
- Authority
- CN
- China
- Prior art keywords
- amino acid
- peptide
- feature
- lstm
- short term
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
- G16B15/30—Drug targeting using structural data; Docking or binding prediction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Biomedical Technology (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Computational Linguistics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Biotechnology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Chemical & Material Sciences (AREA)
- Epidemiology (AREA)
- Public Health (AREA)
- Databases & Information Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioethics (AREA)
- Medicinal Chemistry (AREA)
- Pharmacology & Pharmacy (AREA)
- Crystallography & Structural Chemistry (AREA)
- Investigating Or Analysing Biological Materials (AREA)
Abstract
The invention relates to the technical field of anticancer peptide prediction, and in particular to an anticancer peptide prediction method based on a bidirectional long short-term memory network (Bi-LSTM) and feature fusion. The method comprises the following steps: reading four reference peptide sequence datasets and performing amino acid composition analysis on them; extracting features from the datasets with a Bi-LSTM to generate Bi-LSTM feature vectors; extracting features from the five amino acid feature vectors with a fully connected neural network; and fusing the feature vectors by concatenation (Concatenate), obtaining a probability score through a fully connected layer with 1 unit and a Sigmoid activation function, and distinguishing anticancer peptides from non-anticancer peptides by the score. The method predicts anticancer peptides with high accuracy, a high Matthews correlation coefficient, high sensitivity, high specificity and a high area under the ROC curve.
Description
Technical Field
The invention relates to the technical field of anti-cancer peptide prediction, in particular to an anti-cancer peptide prediction method based on bidirectional long-short term memory network and feature fusion.
Background
The discovery of anticancer peptides (ACPs) has broadened the options available for fighting cancer. ACPs act specifically, and tumors do not develop drug resistance to them; they avoid the side effects of some traditional anticancer treatments and are promising as an alternative cancer therapy. Anticancer peptides typically consist of 5-40 amino acids. To further understand their mechanism of action, many biological experiments have been performed to identify anticancer peptides. For example, Vidal et al. identified a peptide cocktail against intracellular tumor proteins using the yeast two-hybrid system, and Peelle et al. discovered novel localization peptides that are not cell-type specific through mammalian cell screening. However, these identification methods are time-consuming, expensive, complex and difficult to carry out in a high-throughput manner, so rapid and efficient identification of anticancer peptides is important.
Wu et al. proposed the PTPD model, which feeds feature vectors extracted with k-mers and Word2vec (word vectors) into a convolutional neural network (CNN) to predict peptides; Rao et al. subsequently applied a graph convolutional network (GCN) to anticancer peptide prediction and proposed the ACP-GCN model. However, these deep learning methods consider only the raw sequence information and the physicochemical properties of amino acids; they ignore the long-range dependency information of anticancer peptides along the sequence (time-step) dimension and cannot identify anticancer peptides rapidly, efficiently and at low cost.
Disclosure of Invention
To address the shortcomings of existing algorithms, the invention provides a method that predicts anticancer peptides with high accuracy, a high Matthews correlation coefficient, high sensitivity, high specificity and a high area under the ROC curve.
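The technical scheme adopted by the invention is as follows. The anticancer peptide prediction method based on a bidirectional long short-term memory network and feature fusion comprises the following steps:
step 1, reading four reference peptide sequence datasets and performing amino acid composition analysis on them;
step 2, extracting features from the datasets with the Bi-LSTM to generate Bi-LSTM feature vectors;
step 3, extracting features from the five amino acid feature vectors with a fully connected neural network;
step 4, fusing the feature vectors generated in steps 2 and 3 by concatenation and obtaining a probability score that distinguishes anticancer peptides from non-anticancer peptides.
Further, step 2 comprises: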
step 2.1, so that the peptide sequence can be input into the Bi-LSTM, the primary letter sequence of the peptide is first numerically encoded according to an amino acid alphabet: the 20 standard amino acids are assigned the numbers 1-20, and sequences shorter than the fixed length are padded with 0 so that all peptide sequences have the same length;
step 2.2, the numeric codes are converted into 64-dimensional vector representations by the embedding layer (Embedding) of the Bi-LSTM;
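The numeric encoding and zero padding of step 2.1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the alphabetical letter order, the function name `encode_peptide`, the `max_len` parameter and the choice to pad at the end of the sequence are all assumptions (the actual letter-to-number mapping is the one in the patent's amino acid alphabet figure).

```python
# Assumed alphabetical ordering of the 20 standard amino acids; the patent's
# actual mapping is the one given in its amino acid alphabet (Fig. 2).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 is reserved for padding


def encode_peptide(sequence, max_len):
    """Map a peptide string to a fixed-length list of integers in 0..20."""
    codes = [AA_TO_INDEX[aa] for aa in sequence.upper()]
    if len(codes) > max_len:
        raise ValueError("sequence is longer than max_len")
    # Assumption: padding zeros are appended after the sequence.
    return codes + [0] * (max_len - len(codes))


encoded = encode_peptide("ACDY", max_len=6)
print(encoded)  # [1, 2, 3, 20, 0, 0]
```

The resulting integer list is what an embedding layer would turn into one 64-dimensional vector per position in step 2.2.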
step 2.3, the Bi-LSTM performs feature extraction on the 64-dimensional input vectors, where the Bi-LSTM involves the following quantities: the input $x_t$ at time $t$, the cell state $C_t$, the temporary (candidate) cell state $\tilde{C}_t$, the hidden state $h_t$, the forget gate $f_t$, the memory (input) gate $i_t$ and the output gate $o_t$;
The Bi-LSTM comprises a forward long short-term memory layer and a backward long short-term memory layer, each consisting of a memory cell and 64-dimensional hidden units;
forget gate (selects the information to forget):
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{1}$$
memory gate (selects the information to remember):
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{2}$$
temporary cell state and cell state at the current time:
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \tag{3}$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{4}$$
output gate and hidden state at the current time:
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{5}$$
$$h_t = o_t \odot \tanh(C_t) \tag{6}$$
where $W$ and $b$ denote the learned weights and biases of the Bi-LSTM network, respectively;
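The gate equations can be made concrete with a small pure-Python sketch of one LSTM time step and a bidirectional pass. It uses the standard LSTM update for the cell state; the helper names, the toy dimensions, the random untrained weights, and the choice to concatenate the final forward and backward hidden states are assumptions made for illustration, since the text does not state how the two directions are combined.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Return W @ v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(params, h_prev, c_prev, x_t):
    """One LSTM time step: gates, cell state update and hidden state."""
    z = h_prev + x_t  # list concatenation plays the role of [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]      # forget gate
    i_t = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]      # memory gate
    c_hat = [math.tanh(a) for a in affine(params["W_C"], params["b_C"], z)]  # temporary cell state
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]     # cell state
    o_t = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]      # output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                       # hidden state
    return h_t, c_t


def init_params(input_dim, hidden_dim, rng):
    """Random, untrained weights (illustration only)."""
    params = {}
    for gate in "fiCo":
        params["W_" + gate] = [
            [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim + input_dim)]
            for _ in range(hidden_dim)
        ]
        params["b_" + gate] = [0.0] * hidden_dim
    return params


def run_lstm(params, inputs, hidden_dim):
    h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
    for x_t in inputs:
        h, c = lstm_step(params, h, c, x_t)
    return h


def bilstm_features(fwd, bwd, inputs, hidden_dim):
    """Assumption: concatenate the final forward and final backward hidden states."""
    return run_lstm(fwd, inputs, hidden_dim) + run_lstm(bwd, inputs[::-1], hidden_dim)


rng = random.Random(0)
input_dim, hidden_dim = 4, 3  # toy sizes; the patent uses 64-dimensional inputs and hidden units
sequence = [[rng.uniform(-1, 1) for _ in range(input_dim)] for _ in range(5)]
fwd = init_params(input_dim, hidden_dim, rng)
bwd = init_params(input_dim, hidden_dim, rng)
features = bilstm_features(fwd, bwd, sequence, hidden_dim)
print(len(features))  # 6
```

The backward layer is simply the same recurrence applied to the reversed sequence with its own weights, which is what lets the model use context from both ends of the peptide.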
Further, step 3 comprises:
step 3.1, the primary letter sequence of the peptide is feature-encoded according to five amino acid features: binary profile feature (BPF), dipeptide composition (DPC), composition of k-spaced amino acid group pairs (CKSAAGP), amino acid composition (AAC) and sequence-order-coupling number (SOCNumber); the feature encoding converts the peptide sequence into a 770-dimensional feature vector;
wherein the five feature encodings are: BPF feature encoding, DPC feature encoding, CKSAAGP feature encoding, AAC feature encoding and SOCNumber feature encoding;
The BPF feature encoding is expressed as follows:
in the binary profile, each amino acid letter is represented by a 20-dimensional 0/1 vector; for example, the first amino acid letter A is encoded as $f(A) = (1, 0, \ldots, 0)$, the second amino acid letter C as $f(C) = (0, 1, \ldots, 0)$, and so on; for a peptide sequence $P$, the binary feature can be expressed as:
$$B(P) = [f(p_1), f(p_2), \ldots, f(p_n)] \tag{7}$$
where $P$ is the peptide sequence and $f(p_n)$ is the binary vector of its $n$-th amino acid letter;
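As an illustration of the BPF encoding, the sketch below one-hot encodes each residue and flattens the result. The alphabetical letter order and the function names are assumptions; the text also does not say how many residues are encoded to reach a fixed length, so the example simply encodes the whole input string.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabetical order


def one_hot(aa):
    """f(aa): 20-dimensional 0/1 vector with a single 1 at the letter's position."""
    vec = [0] * len(AMINO_ACIDS)
    vec[AMINO_ACIDS.index(aa)] = 1
    return vec


def bpf(sequence):
    """B(P) = [f(p_1), ..., f(p_n)], flattened into one list of length 20 * n."""
    features = []
    for aa in sequence:
        features.extend(one_hot(aa))
    return features


vec = bpf("AC")
print(len(vec))    # 40
print(vec[:3])     # [1, 0, 0]
print(vec[20:23])  # [0, 1, 0]
```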
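The DPC feature encoding is expressed as follows:
the dipeptide composition consists of 400 descriptors, one for each ordered pair of amino acid types, and gives the proportion of each dipeptide in a given peptide sequence:
$$D(a, b) = \frac{N_{ab}}{N - 1}, \quad a, b \in \{A, C, D, \ldots, Y\}$$
where $N_{ab}$ is the number of dipeptides formed by amino acid types $a$ and $b$, and $N$ is the length of the peptide sequence;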
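A minimal sketch of the 400-dimensional dipeptide composition follows. The formula image did not survive in the text, so the sketch assumes the commonly used definition in which each dipeptide count is divided by the number of adjacent pairs, N - 1; the function name and letter order are likewise assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabetical order
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 descriptors


def dpc(sequence):
    """Fraction of each of the 400 dipeptides among the N - 1 adjacent pairs."""
    if len(sequence) < 2:
        raise ValueError("need at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(sequence) - 1):
        counts[sequence[i:i + 2]] += 1
    total = len(sequence) - 1
    return [counts[d] / total for d in DIPEPTIDES]


vec = dpc("ACAC")  # adjacent pairs: AC, CA, AC
print(len(vec))                                # 400
print(round(vec[DIPEPTIDES.index("AC")], 3))   # 0.667
```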
The CKSAAGP feature encoding is expressed as follows:
in the composition of k-spaced amino acid group pairs, the amino acids are divided into five groups (g1-g5) according to their physicochemical properties, and the frequency of group pairs separated by any k residues is calculated; taking k = 0 as an example, there are 25 zero-spaced group pairs (g1g1, g1g2, ..., g5g5), and the feature vector is defined as:
$$\left( \frac{N_{g1g1}}{N_{all}}, \frac{N_{g1g2}}{N_{all}}, \ldots, \frac{N_{g5g5}}{N_{all}} \right)_{25}$$
where the value of each descriptor represents the composition of the corresponding group pair in the peptide sequence; for a peptide sequence of length $N$, $N_{all} = N-1, N-2, N-3, N-4, \ldots$ when $k = 0, 1, 2, 3, \ldots$, respectively.
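The group-pair counting can be sketched as below. The text does not list the members of the five groups, so the grouping used here (aliphatic, aromatic, positively charged, negatively charged, uncharged) is an assumption borrowed from common practice, and the function name and normalization by N - k - 1 are illustrative choices consistent with the stated values of N_all.

```python
# Assumed physicochemical grouping; the patent text does not list group members.
GROUPS = {
    "g1": "GAVLMI",  # aliphatic
    "g2": "FYW",     # aromatic
    "g3": "KRH",     # positively charged
    "g4": "DE",      # negatively charged
    "g5": "STCPNQ",  # uncharged
}
AA_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}
GROUP_PAIRS = [a + b for a in GROUPS for b in GROUPS]  # g1g1, g1g2, ..., g5g5


def cksaagp(sequence, k=0):
    """Frequencies of the 25 group pairs whose residues are separated by k positions."""
    n_all = len(sequence) - k - 1  # N-1 for k=0, N-2 for k=1, ...
    if n_all <= 0:
        raise ValueError("sequence too short for this k")
    counts = dict.fromkeys(GROUP_PAIRS, 0)
    for i in range(n_all):
        counts[AA_TO_GROUP[sequence[i]] + AA_TO_GROUP[sequence[i + k + 1]]] += 1
    return [counts[p] / n_all for p in GROUP_PAIRS]


vec = cksaagp("AKDF", k=0)  # pairs: A-K (g1g3), K-D (g3g4), D-F (g4g2)
print(len(vec))  # 25
```

Calling the function for several values of k and concatenating the results yields 25 descriptors per gap size.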
The AAC feature encoding is expressed as follows:
the amino acid composition gives the frequency of each amino acid type in a peptide sequence; the frequencies of the 20 amino acids (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y) can be expressed as:
$$f(a) = \frac{N(a)}{N}, \quad a \in \{A, C, D, \ldots, Y\}$$
where $N(a)$ is the number of times amino acid $a$ appears in the peptide sequence, and $N$ is the length of the peptide sequence;
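The amino acid composition is the simplest of the five encodings: each amino acid's count N(a) divided by the sequence length N. A short sketch, with the function name chosen for illustration:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # order as listed in the text


def aac(sequence):
    """Return the 20 frequencies N(a) / N in the order of AMINO_ACIDS."""
    n = len(sequence)
    if n == 0:
        raise ValueError("empty sequence")
    return [sequence.count(aa) / n for aa in AMINO_ACIDS]


vec = aac("AACK")
print(vec[0], vec[1], vec[8])  # 0.5 0.25 0.25  (A, C, K)
```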
The SOCNumber feature encoding is expressed as follows:
$$\tau_d = \sum_{i=1}^{N-d} (d_{i,i+d})^2, \quad d = 1, 2, \ldots, nlag$$
where $d_{i,i+d}$ is the distance between the two amino acids at positions $i$ and $i+d$, $nlag$ is the maximum value of the lag, and $N$ is the length of the peptide sequence.
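The sequence-order-coupling number sums squared amino acid distances at each lag d from 1 to nlag. The sketch below assumes that reading of the descriptor. The text does not say which amino acid distance matrix is used, so `toy_distance` is a purely hypothetical placeholder and not a real physicochemical distance; a real implementation would substitute a published distance matrix.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_distance(a, b):
    """Hypothetical placeholder distance, NOT a real physicochemical matrix."""
    return abs(AMINO_ACIDS.index(a) - AMINO_ACIDS.index(b)) / 19.0


def soc_number(sequence, nlag, distance=toy_distance):
    """tau_d = sum over i of distance(p_i, p_{i+d})^2, for d = 1..nlag."""
    n = len(sequence)
    if nlag >= n:
        raise ValueError("nlag must be smaller than the sequence length")
    return [
        sum(distance(sequence[i], sequence[i + d]) ** 2 for i in range(n - d))
        for d in range(1, nlag + 1)
    ]


print(soc_number("AYA", nlag=2))  # [2.0, 0.0]
```

Because the result has one value per lag, nlag fixes the length of this part of the feature vector and must be smaller than the shortest peptide.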
Step 4, the feature vectors generated in steps 2 and 3 are fused by concatenation (Concatenate) and input into a fully connected layer with 512 units and a ReLU activation function; a probability score between 0 and 1 is then obtained through a fully connected layer with 1 unit and a Sigmoid activation function; a peptide is classified as an anticancer peptide when the score is greater than 0.5 and as a non-anticancer peptide when the score is less than 0.5.
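The fusion head of step 4 can be sketched as a plain forward pass: concatenate the two feature vectors, apply a ReLU layer, then a single Sigmoid unit, and threshold at 0.5. Everything below is illustrative: the weights are random and untrained, the layer sizes are toy values (the patent uses 512 hidden units), and the helper names are assumptions. Here `amino_vec` stands for the output of the fully connected network applied to the amino acid features in step 3.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(weights, biases, inputs, activation):
    """Fully connected layer: activation(W @ inputs + b)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def random_layer(n_out, n_in, rng):
    """Random, untrained parameters (illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict(bilstm_vec, amino_vec, hidden_layer, output_layer, threshold=0.5):
    fused = bilstm_vec + amino_vec  # feature fusion by concatenation
    hidden = dense(hidden_layer[0], hidden_layer[1], fused, relu)
    score = dense(output_layer[0], output_layer[1], hidden, sigmoid)[0]
    # The text leaves a score of exactly 0.5 unspecified; here it maps to non-ACP.
    label = "ACP" if score > threshold else "non-ACP"
    return score, label


rng = random.Random(0)
bilstm_vec = [rng.uniform(-1, 1) for _ in range(6)]  # toy Bi-LSTM features
amino_vec = [rng.uniform(0, 1) for _ in range(10)]   # toy amino acid features
hidden_layer = random_layer(8, 16, rng)              # 512 units in the patent
output_layer = random_layer(1, 8, rng)
score, label = predict(bilstm_vec, amino_vec, hidden_layer, output_layer)
print(round(score, 3), label)
```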
The invention has the beneficial effects that:
1. the rapid and efficient identification of the anti-cancer peptide is realized; the accuracy of the anti-cancer peptide identification is improved through feature fusion.
Drawings
FIG. 1 is a diagram of the construction of an anti-cancer peptide prediction model based on the fusion of bidirectional long-short term memory network and features according to the present invention;
FIG. 2 is an amino acid alphabet of the present invention;
FIG. 3 is an amino acid composition analysis of a data set of the present invention;
FIG. 4 is a comparison of the method of the present invention with a single feature model;
FIG. 5 is a comparison of different model methods for the ACPred-Fuse and ACPred-FL datasets of the present invention;
FIG. 6 is a comparison of different model methods for ACP240 and ACP740 datasets in accordance with the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and examples, which are simplified schematic drawings and illustrate only the basic structure of the invention in a schematic manner, and therefore only show the structures relevant to the invention.
As shown in figure 1, the method for predicting the anti-cancer peptide based on the bidirectional long-short term memory network and the characteristic fusion comprises the following steps:
TABLE 1 four reference peptide sequence datasets
The primary letter sequence of the peptide is numerically encoded according to the amino acid alphabet: the 20 standard amino acids are assigned the numbers 1-20, and sequences shorter than the fixed length are padded with 0 so that all peptide sequences have the same length; the amino acid alphabet is shown in Fig. 2;
Further, step 2 comprises:
step 2.1, the primary letter sequence of the peptide is numerically encoded according to the amino acid alphabet: the 20 standard amino acids are assigned the numbers 1-20, and sequences shorter than the fixed length are padded with 0 so that all peptide sequences have the same length;
step 2.2, the numeric codes are converted into 64-dimensional vector representations by the embedding layer (Embedding) of the Bi-LSTM;
step 2.3, the Bi-LSTM performs feature extraction on the 64-dimensional input vectors, where the Bi-LSTM involves the following quantities: the input $x_t$ at time $t$, the cell state $C_t$, the temporary (candidate) cell state $\tilde{C}_t$, the hidden state $h_t$, the forget gate $f_t$, the memory (input) gate $i_t$ and the output gate $o_t$. The bidirectional long short-term memory layer consists of a forward long short-term memory layer and a backward long short-term memory layer, each consisting of a memory cell and 64-dimensional hidden units;
The amino acid feature encoding is based on five amino acid features: binary profile feature (BPF), dipeptide composition (DPC), composition of k-spaced amino acid group pairs (CKSAAGP), amino acid composition (AAC) and sequence-order-coupling number (SOCNumber). The primary letter sequence of the peptide is feature-encoded, converting the peptide sequence into a 770-dimensional feature vector. The five feature encodings are as follows:
(1) Binary profile feature (BPF):
in the binary profile, each amino acid letter is represented by a 20-dimensional 0/1 vector; for example, the first amino acid letter A is encoded as $f(A) = (1, 0, \ldots, 0)$, the second amino acid letter C as $f(C) = (0, 1, \ldots, 0)$, and so on; for a peptide sequence $P$, the binary feature can be expressed as:
$$B(P) = [f(p_1), f(p_2), \ldots, f(p_n)] \tag{7}$$
(2) Dipeptide composition (DPC):
the dipeptide composition consists of 400 descriptors, one for each ordered pair of amino acid types, and gives the proportion of each dipeptide in a given peptide sequence:
$$D(a, b) = \frac{N_{ab}}{N - 1}, \quad a, b \in \{A, C, D, \ldots, Y\}$$
where $N_{ab}$ is the number of dipeptides formed by amino acid types $a$ and $b$, and $N$ is the length of the peptide sequence.
(3) Composition of k-spaced amino acid group pairs (CKSAAGP):
in the composition of k-spaced amino acid group pairs, the amino acids are divided into five groups (g1-g5) according to their physicochemical properties, and the frequency of group pairs separated by any k residues is calculated. For example, with k = 0 there are 25 zero-spaced group pairs (g1g1, g1g2, ..., g5g5), and the feature vector is defined as:
$$\left( \frac{N_{g1g1}}{N_{all}}, \frac{N_{g1g2}}{N_{all}}, \ldots, \frac{N_{g5g5}}{N_{all}} \right)_{25}$$
where the value of each descriptor represents the composition of the corresponding group pair in the peptide sequence. For a peptide sequence of length $N$, $N_{all} = N-1, N-2, N-3, N-4, \ldots$ when $k = 0, 1, 2, 3, \ldots$, respectively.
(4) Amino acid composition (AAC):
the amino acid composition gives the frequency of each amino acid type in the peptide sequence. The frequencies of the 20 amino acids (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y) can be expressed as:
$$f(a) = \frac{N(a)}{N}, \quad a \in \{A, C, D, \ldots, Y\}$$
where $N(a)$ is the number of times amino acid $a$ appears in the peptide sequence and $N$ is the length of the peptide sequence.
(5) Sequence-order-coupling number (SOCNumber):
the sequence-order-coupling number can be defined as:
$$\tau_d = \sum_{i=1}^{N-d} (d_{i,i+d})^2, \quad d = 1, 2, \ldots, nlag$$
where $d_{i,i+d}$ is the distance between the two amino acids at positions $i$ and $i+d$, $nlag$ is the maximum value of the lag, and $N$ is the length of the peptide sequence;
Step 4, the feature vectors generated in steps 2 and 3 are fused by concatenation (Concatenate) and input into a fully connected layer with 512 units and a ReLU activation function; a probability score between 0 and 1 is then obtained through a fully connected layer with 1 unit and a Sigmoid activation function; a peptide is classified as an anticancer peptide when the score is greater than 0.5 and as a non-anticancer peptide when the score is less than 0.5.
To demonstrate the effectiveness of the fused features, BLSTM-ACP was compared with single-feature models; the results are shown in Fig. 4.
As shown in Table 2, BLSTM-ACP is clearly superior to the other methods in accuracy, Matthews correlation coefficient, sensitivity, specificity and area under the ROC curve; a visual comparison is shown in Figs. 5 and 6.
TABLE 2 comparison of BLSTM-ACP with different anti-cancer peptide prediction methods on four reference datasets
Note: the maximum value is shown in bold; ACC: accuracy; MCC: Matthews correlation coefficient; SE: sensitivity; SP: specificity; AUC: area under the ROC curve.
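For reference, four of the metrics named in the note can be computed from a binary confusion matrix using their standard definitions, as sketched below. The function name and the example counts are hypothetical and are not results from the patent; AUC is omitted because it requires the ranked probability scores rather than the confusion matrix.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Standard ACC, SE, SP and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    se = tp / (tp + fn)  # sensitivity: fraction of true ACPs recovered
    sp = tn / (tn + fp)  # specificity: fraction of non-ACPs correctly rejected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SE": se, "SP": sp, "MCC": mcc}


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(metrics["ACC"], metrics["SE"], metrics["SP"])  # 0.85 0.8 0.9
```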
In conclusion, the invention predicts anticancer peptides with high accuracy, a high Matthews correlation coefficient, high sensitivity, high specificity and a high area under the ROC curve.
In light of the foregoing description of the preferred embodiment of the present invention, many modifications and variations will be apparent to those skilled in the art without departing from the spirit and scope of the invention. The technical scope of the present invention is not limited to the content of the specification, and must be determined according to the scope of the claims.
Claims (4)
1. An anticancer peptide prediction method based on a bidirectional long short-term memory network and feature fusion, characterized by comprising the following steps:
step 1, reading four reference peptide sequence datasets and performing amino acid composition analysis on them;
step 2, extracting features from the datasets with the Bi-LSTM to generate Bi-LSTM feature vectors;
step 3, extracting features from the five amino acid feature vectors with a fully connected neural network;
step 4, fusing the feature vectors generated in steps 2 and 3 by concatenation (Concatenate), inputting the fused vector into a fully connected layer with 512 units and a ReLU activation function, obtaining a probability score through a fully connected layer with 1 unit and a Sigmoid activation function, and distinguishing anticancer peptides from non-anticancer peptides by the score.
2. The anticancer peptide prediction method based on a bidirectional long short-term memory network and feature fusion according to claim 1, wherein step 2 comprises:
step 2.1, numerically encoding the primary letter sequence of the peptide according to an amino acid alphabet;
step 2.2, converting the numeric codes into 64-dimensional vectors through the embedding layer of the Bi-LSTM;
step 2.3, performing feature extraction on the 64-dimensional input vectors with the Bi-LSTM, wherein the Bi-LSTM involves: the input $x_t$ at time $t$, the cell state $C_t$, the temporary (candidate) cell state $\tilde{C}_t$, the hidden state $h_t$, the forget gate $f_t$, the memory (input) gate $i_t$ and the output gate $o_t$; the Bi-LSTM consists of a forward and a backward long short-term memory layer, each consisting of a memory cell and 64-dimensional hidden units.
3. The anticancer peptide prediction method based on a bidirectional long short-term memory network and feature fusion according to claim 1, wherein step 3 comprises: feature-encoding the primary letter sequence of the peptide according to five amino acid features, namely BPF, DPC, CKSAAGP, AAC and SOCNumber, which convert the peptide sequence into a 770-dimensional feature vector.
4. The method of claim 3, wherein feature-encoding the primary letter sequence of the peptide according to the five amino acid features comprises:
the BPF feature encoding is expressed as:
$$B(P) = [f(p_1), f(p_2), \ldots, f(p_n)] \tag{7}$$
where $P$ is the peptide sequence and $f(p_n)$ is the binary vector of its $n$-th amino acid letter;
the DPC feature encoding is expressed as:
$$D(a, b) = \frac{N_{ab}}{N - 1}, \quad a, b \in \{A, C, D, \ldots, Y\}$$
where $N_{ab}$ is the number of dipeptides formed by amino acid types $a$ and $b$;
the CKSAAGP feature encoding is expressed as:
$$\left( \frac{N_{g1g1}}{N_{all}}, \frac{N_{g1g2}}{N_{all}}, \ldots, \frac{N_{g5g5}}{N_{all}} \right)_{25}$$
where $N_{g1g1}$ to $N_{g5g5}$ are the counts of the 25 zero-spaced group pairs;
the AAC feature encoding is expressed as:
$$f(a) = \frac{N(a)}{N}, \quad a \in \{A, C, D, \ldots, Y\}$$
where $N(a)$ is the number of times amino acid $a$ appears in the peptide sequence, and $N$ is the length of the peptide sequence;
the SOCNumber feature encoding is expressed as:
$$\tau_d = \sum_{i=1}^{N-d} (d_{i,i+d})^2, \quad d = 1, 2, \ldots, nlag$$
where $d_{i,i+d}$ is the distance between the two amino acids at positions $i$ and $i+d$, $nlag$ is the maximum value of the lag, and $N$ is the length of the peptide sequence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210686266.0A CN114863997A (en) | 2022-06-17 | 2022-06-17 | Anti-cancer peptide prediction method based on bidirectional long-short term memory network and feature fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210686266.0A CN114863997A (en) | 2022-06-17 | 2022-06-17 | Anti-cancer peptide prediction method based on bidirectional long-short term memory network and feature fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114863997A (en) | 2022-08-05 |
Family
ID=82624840
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210686266.0A Pending CN114863997A (en) | 2022-06-17 | 2022-06-17 | Anti-cancer peptide prediction method based on bidirectional long-short term memory network and feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114863997A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115512396A (en) * | 2022-11-01 | 2022-12-23 | Shandong University | Method and system for predicting anti-cancer peptide and antibacterial peptide based on deep neural network |
- 2022-06-17: CN application CN202210686266.0A (patent/CN114863997A/en), status: active, Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115512396A (en) * | 2022-11-01 | 2022-12-23 | Shandong University | Method and system for predicting anti-cancer peptide and antibacterial peptide based on deep neural network |
CN115512396B (en) * | 2022-11-01 | 2023-04-07 | Shandong University | Method and system for predicting anti-cancer peptide and antibacterial peptide based on deep neural network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111985369B (en) | Course field multi-modal document classification method based on cross-modal attention convolution neural network | |
CN111798921B (en) | RNA binding protein prediction method and device based on multi-scale attention convolution neural network | |
CN108920720B (en) | Large-scale image retrieval method based on depth hash and GPU acceleration | |
CN107622182B (en) | Method and system for predicting local structural features of protein | |
CN109947963A (en) | A kind of multiple dimensioned Hash search method based on deep learning | |
CN110109060A (en) | A kind of radar emitter signal method for separating and system based on deep learning network | |
KR102638370B1 (en) | Explanable active learning method using Bayesian dual autoencoder for object detector and active learning device using the same | |
CN110766084B (en) | Small sample SAR target identification method based on CAE and HL-CNN | |
CN111400494B (en) | Emotion analysis method based on GCN-Attention | |
CN111276187B (en) | Gene expression profile feature learning method based on self-encoder | |
CN112084877B (en) | NSGA-NET-based remote sensing image recognition method | |
WO2022127075A1 (en) | Feature discretization method for remote sensing image on the basis of rough fuzzy model | |
CN110942057A (en) | Container number identification method and device and computer equipment | |
CN103020321B (en) | Neighbor search method and system | |
CN111104555A (en) | Video hash retrieval method based on attention mechanism | |
CN114863997A (en) | Anti-cancer peptide prediction method based on bidirectional long-short term memory network and feature fusion | |
CN111598187A (en) | Progressive integrated classification method based on kernel width learning system | |
CN109902808A (en) | A method of convolutional neural networks are optimized based on floating-point numerical digit Mutation Genetic Algorithms Based | |
CN113870286A (en) | Foreground segmentation method based on multi-level feature and mask fusion | |
CN116312748A (en) | Enhancer-promoter interaction prediction model construction method based on multi-head attention mechanism | |
CN116469561A (en) | Breast cancer survival prediction method based on deep learning | |
CN110070070B (en) | Action recognition method | |
CN115408351B (en) | Military industry scientific research production data management method and system | |
CN116386733A (en) | Protein function prediction method based on multi-view multi-scale multi-attention mechanism | |
CN112735604B (en) | Novel coronavirus classification method based on deep learning algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||