CN114863997A - Anti-cancer peptide prediction method based on bidirectional long-short term memory network and feature fusion

Info

Publication number: CN114863997A
Application number: CN202210686266.0A
Authority: CN
Inventors: 杨森; 叶晨阳; 朱轮; 封红旗
Original assignee: Changzhou University
Current assignee: Changzhou University
Priority date: 2022-06-17
Filing date: 2022-06-17
Publication date: 2022-08-05

Abstract

The invention relates to the technical field of anticancer peptide prediction, and in particular to an anticancer peptide prediction method based on a bidirectional long short-term memory network (Bi-LSTM) and feature fusion, which comprises the following steps: reading four reference peptide sequence datasets and carrying out amino acid composition analysis on them; performing feature extraction on the datasets through the Bi-LSTM to generate Bi-LSTM feature vectors; extracting features from five amino acid feature vectors through a fully connected neural network; and fusing the feature vectors through a Concatenate operation, obtaining a probability score through a fully connected layer with 1 unit and a Sigmoid activation function, and distinguishing anticancer peptides from non-anticancer peptides by the score. The method predicts anticancer peptides with high accuracy, high Matthews correlation coefficient, high sensitivity, high specificity and high area under the ROC curve.

Description

Anti-cancer peptide prediction method based on bidirectional long-short term memory network and feature fusion

Technical Field

The invention relates to the technical field of anti-cancer peptide prediction, in particular to an anti-cancer peptide prediction method based on bidirectional long-short term memory network and feature fusion.

Background

The discovery of anticancer peptides (ACPs) has broadened the options for fighting cancer. Owing to the specificity of ACPs, and because tumors do not develop drug resistance to them, ACPs avoid the side effects of some traditional anticancer treatments and are expected to become an alternative cancer therapy. Anticancer peptides typically consist of 5-40 amino acids. To further understand their mechanism of action, many biological experiments have been performed to identify anticancer peptides. For example, Vidal et al. identified a peptide cocktail against intracellular tumor proteins using the yeast two-hybrid system, and Peelle et al. discovered novel localization peptides that were not cell-type specific through mammalian cell screening. However, these identification methods are time-consuming, expensive, complex and difficult to implement in a high-throughput manner, so rapid and efficient identification of anticancer peptides is important.

Wu et al. proposed the PTPD model, which feeds feature vectors extracted with k-mer and Word2vec (word vector) encodings into a convolutional neural network (CNN) to predict peptides; Rao et al. then applied the graph convolutional network (GCN) to anticancer peptide prediction and proposed the ACP-GCN model. However, these deep learning methods consider only the original sequence information and the physicochemical properties of amino acids and ignore the long-term dependency information of anticancer peptides along the sequence (time-step) dimension, so they cannot identify anticancer peptides rapidly, efficiently and at low cost.

Disclosure of Invention

To address the shortcomings of existing algorithms, the invention provides a method that predicts anticancer peptides with high accuracy, high Matthews correlation coefficient, high sensitivity, high specificity and high area under the ROC curve.

The technical scheme adopted by the invention is as follows: the method for predicting the anti-cancer peptide based on the bidirectional long-short term memory network and the feature fusion comprises the following steps:

step 1, reading four reference peptide sequence datasets, and carrying out amino acid composition analysis on the datasets;

step 2, performing feature extraction on the data set through a bidirectional long-short term memory network (Bi-LSTM) to generate Bi-LSTM feature vectors;

further, step 2 comprises:

step 2.1, in order to input the peptide sequences into the Bi-LSTM, the primary letter sequence of each peptide is first digitally encoded according to an amino acid alphabet, that is, the 20 basic amino acids are assigned the numbers 1-20, and peptide sequences that are too short are padded with 0 so that all peptide sequences have the same length;
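As a non-limiting illustration, the digital coding and zero padding of step 2.1 can be sketched in Python as follows. The alphabet order (alphabetical one-letter codes), right-side padding, truncation of over-long sequences, and the names `AMINO_ACIDS`, `encode_peptide` and `max_len` are assumptions made for this sketch; the alphabet of FIG. 2 may assign the numbers differently.

```python
# Assumed alphabet order (alphabetical one-letter codes); FIG. 2 may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Numbers 1..20 for the amino acids; 0 is reserved for padding.
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode_peptide(sequence: str, max_len: int) -> list[int]:
    """Map a peptide to integer codes 1-20 and right-pad with 0 up to max_len.

    Sequences longer than max_len are truncated (an assumption of this sketch).
    """
    codes = [AA_TO_INDEX[aa] for aa in sequence.upper()[:max_len]]
    return codes + [0] * (max_len - len(codes))


print(encode_peptide("ACDY", max_len=6))  # [1, 2, 3, 20, 0, 0]
```

Reserving 0 for padding keeps padded positions distinguishable from the 20 real amino acid codes.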

step 2.2, converting the input digital codes into 64-dimensional vector representations through the embedding layer (Embedding) of the Bi-LSTM;

step 2.3, performing feature extraction on the input 64-dimensional vectors by the Bi-LSTM, wherein the Bi-LSTM specifically comprises: the input $x_t$ at time $t$, the cell state $C_t$, the temporary cell state $\tilde{C}_t$, the hidden state $h_t$, the forget gate $f_t$, the memory gate $i_t$ and the output gate $o_t$;

the Bi-LSTM comprises a forward long short-term memory network layer and a backward long short-term memory network layer, wherein each layer comprises a memory unit and a 64-dimensional hidden unit;

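forget gate (selecting the information to forget):

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{1}$$

memory gate (selecting the information to remember):

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{2}$$

temporary cell state and cell state at the current time:

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \tag{3}$$

$$C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t \tag{4}$$

output gate and hidden state at the current time:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{5}$$

$$h_t = o_t \cdot \tanh(C_t) \tag{6}$$

wherein $W$ and $b$ represent the learned weights and biases of the Bi-LSTM network, respectively, and $\sigma$ is the Sigmoid function;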
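As a non-limiting illustration, one time step of the gates of equations (1), (2), (5) and (6), together with the standard LSTM form of the temporary cell state and cell state updates, can be sketched in plain Python as follows. The toy dimensions, the random untrained weights, the use of the final hidden states of the forward and backward passes as the Bi-LSTM feature vector, and all function names are assumptions made for this sketch.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Compute W·v + b for a weight matrix W (list of rows) and bias vector b."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; comments give the matching equation numbers."""
    hx = h_prev + x_t  # concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(z) for z in affine(p["W_f"], p["b_f"], hx)]          # (1)
    i_t = [sigmoid(z) for z in affine(p["W_i"], p["b_i"], hx)]          # (2)
    c_tilde = [math.tanh(z) for z in affine(p["W_C"], p["b_C"], hx)]    # (3)
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]  # (4)
    o_t = [sigmoid(z) for z in affine(p["W_o"], p["b_o"], hx)]          # (5)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                  # (6)
    return h_t, c_t


def init_params(input_dim, hidden_dim, rng):
    """Small random weights, one (W, b) pair per gate; illustrative only."""
    params = {}
    for gate in ("f", "i", "C", "o"):
        params[f"W_{gate}"] = [
            [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim + input_dim)]
            for _ in range(hidden_dim)
        ]
        params[f"b_{gate}"] = [0.0] * hidden_dim
    return params


def run_lstm(inputs, params, hidden_dim):
    """Run the cell over a sequence and return the final hidden state."""
    h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
    for x_t in inputs:
        h, c = lstm_step(x_t, h, c, params)
    return h


def bilstm_features(inputs, fwd_params, bwd_params, hidden_dim):
    """Concatenate the final hidden states of a forward and a backward pass."""
    forward = run_lstm(inputs, fwd_params, hidden_dim)
    backward = run_lstm(inputs[::-1], bwd_params, hidden_dim)
    return forward + backward


rng = random.Random(0)
# Toy input: 5 time steps of 4-dimensional vectors (the method uses 64 dimensions).
sequence = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
features = bilstm_features(
    sequence, init_params(4, 3, rng), init_params(4, 3, rng), hidden_dim=3
)
print(len(features))  # 6: 3 forward + 3 backward hidden units
```

With 64 hidden units per direction, the same concatenation would yield a 128-dimensional vector.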

step 3, extracting features from the five amino acid feature vectors through a fully connected neural network;

step 3.1, feature-coding the primary letter sequence of the peptide according to five amino acid features: binary profile feature (BPF), dipeptide composition (DPC), composition of k-spaced amino acid group pairs (CKSAAGP), amino acid composition (AAC) and sequence-order-coupling number (SOCNumber), the feature coding converting the peptide sequence into a 770-dimensional feature vector;

wherein the five feature codings are: BPF feature coding, DPC feature coding, CKSAAGP feature coding, AAC feature coding and SOCNumber feature coding;

BPF feature coding is expressed as:

in the binary profile, each amino acid letter is represented by a 20-dimensional 0/1 vector; for example, the first amino acid letter A is denoted as $f(A) = (1, 0, \ldots, 0)$, the second amino acid letter C is denoted as $f(C) = (0, 1, \ldots, 0)$, and so on; for a peptide sequence $P$, its binary feature can be expressed as:

$$B(P) = [f(p_1), f(p_2), \ldots, f(p_n)] \tag{7}$$

wherein $P$ is a peptide sequence and $f(p_n)$ represents the amino acid letter at position $n$;
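As a non-limiting illustration, the binary coding of equation (7) can be sketched in Python as follows. The alphabet order and the names `one_hot` and `bpf` are assumptions made for this sketch.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed order: A first, C second, ...


def one_hot(aa: str) -> list[int]:
    """Return f(aa): a 20-dimensional vector with a single 1 at the letter's index."""
    vec = [0] * len(AMINO_ACIDS)
    vec[AMINO_ACIDS.index(aa)] = 1
    return vec


def bpf(sequence: str) -> list[int]:
    """Concatenate the 0/1 vectors f(p_1), ..., f(p_n) into B(P)."""
    return [bit for aa in sequence for bit in one_hot(aa)]


encoded = bpf("AC")
print(encoded[:20])  # f(A): a 1 followed by nineteen 0s
print(encoded[20:])  # f(C): 0, 1, then eighteen 0s
```

The length of B(P) is 20 times the number of residues encoded, so a fixed-size vector requires a fixed number of residues per peptide.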

DPC feature coding is expressed as:

the dipeptide composition consists of 400 descriptors, which give the frequency of each dipeptide combination in a given peptide sequence and are defined as:

$$D(a, b) = \frac{N_{ab}}{N - 1}, \quad a, b \in \{A, C, D, \ldots, Y\} \tag{8}$$

wherein $N_{ab}$ is the number of dipeptides represented by amino acid types $a$ and $b$, and $N$ is the length of the peptide sequence;
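As a non-limiting illustration, the 400 dipeptide descriptors can be computed in Python as sketched below. Normalising the count $N_{ab}$ by $N - 1$ (the number of adjacent residue pairs) follows the common definition of dipeptide composition and is an assumption of this sketch, as are the names `DIPEPTIDES` and `dpc`.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 descriptors


def dpc(sequence: str) -> list[float]:
    """Return the 400 dipeptide frequencies N_ab / (N - 1)."""
    total = len(sequence) - 1  # number of adjacent pairs in a peptide of length N
    if total <= 0:
        raise ValueError("peptide must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        counts[sequence[i:i + 2]] += 1
    return [counts[d] / total for d in DIPEPTIDES]


vec = dpc("ACAC")
print(len(vec))                     # 400
print(vec[DIPEPTIDES.index("AC")])  # 2 of the 3 adjacent pairs are "AC"
```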

CKSAAGP feature coding is expressed as:

in the composition of k-spaced amino acid group pairs, the amino acids are divided into five groups according to their physicochemical properties, and the frequency of group pairs separated by any k residues is calculated; taking $k = 0$ as an example, there are 25 0-spaced group pairs ($g_1g_1, g_1g_2, \ldots, g_5g_5$), and the feature vector is defined as:

$$\left(\frac{N_{g1g1}}{N_{all}}, \frac{N_{g1g2}}{N_{all}}, \ldots, \frac{N_{g5g5}}{N_{all}}\right)_{25} \tag{9}$$

wherein the value of each descriptor represents the composition of the corresponding residue group pair in the peptide sequence, and for a peptide sequence of length $N$, when $k = 0, 1, 2, 3, \ldots$, $N_{all} = N-1, N-2, N-3, N-4, \ldots$, respectively.
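As a non-limiting illustration, the 25 group-pair descriptors for a given spacing k can be computed in Python as sketched below. The text does not list the five physicochemical groups; the grouping used here (aliphatic, aromatic, positively charged, negatively charged, uncharged) is the one commonly used with this descriptor and is an assumption, as are the names `GROUPS` and `cksaagp`.

```python
from itertools import product

# Assumed grouping; the source text does not enumerate the five groups.
GROUPS = {
    "g1": "GAVLMI",  # aliphatic
    "g2": "FYW",     # aromatic
    "g3": "KRH",     # positively charged
    "g4": "DE",      # negatively charged
    "g5": "STCPNQ",  # uncharged
}
AA_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}
GROUP_PAIRS = [a + b for a, b in product(GROUPS, repeat=2)]  # g1g1 ... g5g5


def cksaagp(sequence: str, k: int = 0) -> list[float]:
    """Frequencies of the 25 group pairs whose residues are separated by k residues."""
    n_all = len(sequence) - k - 1  # N-1 for k=0, N-2 for k=1, ...
    if n_all <= 0:
        raise ValueError("sequence too short for this k")
    counts = dict.fromkeys(GROUP_PAIRS, 0)
    for i in range(n_all):
        pair = AA_TO_GROUP[sequence[i]] + AA_TO_GROUP[sequence[i + k + 1]]
        counts[pair] += 1
    return [counts[p] / n_all for p in GROUP_PAIRS]


vec = cksaagp("AKDE", k=0)  # pairs: A-K (g1g3), K-D (g3g4), D-E (g4g4)
print(len(vec))                       # 25
print(vec[GROUP_PAIRS.index("g1g3")])  # one of the three 0-spaced pairs
```

Concatenating the vectors for several values of k gives 25 descriptors per value of k.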

AAC feature coding is expressed as:

the amino acid composition calculates the frequency of each amino acid type in a peptide sequence, and the frequencies of the 20 amino acids (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y) can be expressed as:

$$f(a) = \frac{N(a)}{N}, \quad a \in \{A, C, D, \ldots, Y\} \tag{10}$$

wherein $N(a)$ represents the number of times amino acid $a$ appears in the peptide sequence, and $N$ represents the length of the peptide sequence;
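As a non-limiting illustration, the amino acid composition, that is, the count $N(a)$ of each amino acid divided by the peptide length $N$, can be computed in Python as sketched below. The function name `aac` and the alphabet order of the output vector are assumptions made for this sketch.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence: str) -> list[float]:
    """Return N(a) / N for each of the 20 amino acids, in alphabet order."""
    n = len(sequence)
    return [sequence.count(aa) / n for aa in AMINO_ACIDS]


vec = aac("AACD")
print(vec[:3])  # [0.5, 0.25, 0.25]
```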

SOCNumber feature coding is expressed as:

$$\tau_d = \sum_{i=1}^{N-d} \left(d_{i,i+d}\right)^2, \quad d = 1, 2, \ldots, nlag \tag{11}$$

wherein $d_{i,i+d}$ is the distance between the two amino acids at positions $i$ and $i+d$, $nlag$ represents the maximum value of the lag, and $N$ is the length of the peptide sequence.
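As a non-limiting illustration, sequence-order-coupling numbers can be computed in Python as sketched below, taking each coupling number as the sum of squared distances $d_{i,i+d}$ over all residue pairs that are $d$ positions apart. The text does not specify the amino acid distance matrix, so `toy_distance` is a placeholder and not a real physicochemical distance; it, the unnormalised sum, and the name `soc_numbers` are assumptions made for this sketch.

```python
from typing import Callable

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_distance(a: str, b: str) -> float:
    """Placeholder distance (NOT a real physicochemical matrix): scaled index gap."""
    return abs(AMINO_ACIDS.index(a) - AMINO_ACIDS.index(b)) / 19


def soc_numbers(
    sequence: str,
    nlag: int,
    distance: Callable[[str, str], float] = toy_distance,
) -> list[float]:
    """Return one coupling number per lag d = 1..nlag.

    Each value sums the squared distance between residues i and i + d
    over all valid positions i.
    """
    n = len(sequence)
    if nlag >= n:
        raise ValueError("nlag must be smaller than the peptide length")
    return [
        sum(distance(sequence[i], sequence[i + d]) ** 2 for i in range(n - d))
        for d in range(1, nlag + 1)
    ]


print(soc_numbers("AYA", nlag=2))  # [2.0, 0.0] with the placeholder distance
```

A real distance matrix can be supplied through the `distance` argument without changing the rest of the function.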

step 4, performing feature fusion on the feature vectors generated in steps 2 and 3 through the Concatenate operation, inputting the fused feature vector into a fully connected layer with 512 units and a ReLU activation function, and obtaining a probability score between 0 and 1 through a fully connected layer with 1 unit and a Sigmoid activation function, wherein the peptide is regarded as an anticancer peptide when the probability score is greater than 0.5 and as a non-anticancer peptide when the probability score is less than 0.5.
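As a non-limiting illustration, the fusion and classification of step 4 can be sketched in plain Python as follows. The toy input sizes, the random untrained weights and all function names are assumptions made for this sketch; the text does not state how a score of exactly 0.5 is handled, and this sketch treats it as non-anticancer.

```python
import math
import random


def relu(z: float) -> float:
    return max(0.0, z)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """Fully connected layer: activation(W·x + b), one output per weight row."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def random_layer(n_in, n_out, rng):
    """Untrained placeholder weights and zero biases, for illustration only."""
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict(bilstm_vec, amino_acid_vec, hidden_layer, output_layer):
    """Concatenate both feature vectors, then apply the two dense layers."""
    fused = bilstm_vec + amino_acid_vec                # Concatenate
    hidden = dense(fused, *hidden_layer, relu)         # 512 units, ReLU
    score = dense(hidden, *output_layer, sigmoid)[0]   # 1 unit, Sigmoid
    label = "ACP" if score > 0.5 else "non-ACP"
    return score, label


rng = random.Random(0)
bilstm_vec = [rng.uniform(-1, 1) for _ in range(8)]     # toy size
amino_acid_vec = [rng.uniform(0, 1) for _ in range(6)]  # toy size
hidden_layer = random_layer(14, 512, rng)
output_layer = random_layer(512, 1, rng)
score, label = predict(bilstm_vec, amino_acid_vec, hidden_layer, output_layer)
print(round(score, 3), label)
```

In practice the weights would be learned jointly with the Bi-LSTM and the fully connected feature branch rather than drawn at random.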

The invention has the beneficial effects that:

1. the rapid and efficient identification of the anti-cancer peptide is realized; the accuracy of the anti-cancer peptide identification is improved through feature fusion.

Drawings

FIG. 1 is a diagram of the construction of an anti-cancer peptide prediction model based on the fusion of bidirectional long-short term memory network and features according to the present invention;

FIG. 2 is an amino acid alphabet of the present invention;

FIG. 3 is an amino acid composition analysis of a data set of the present invention;

FIG. 4 is a comparison of the method of the present invention with a single feature model;

FIG. 5 is a comparison of different model methods for the ACPred-Fuse and ACPred-FL datasets of the present invention;

FIG. 6 is a comparison of different model methods for ACP240 and ACP740 datasets in accordance with the present invention.

Detailed Description

The invention will be further described with reference to the accompanying drawings and examples, which are simplified schematic drawings and illustrate only the basic structure of the invention in a schematic manner, and therefore only show the structures relevant to the invention.

As shown in figure 1, the method for predicting the anti-cancer peptide based on the bidirectional long-short term memory network and the characteristic fusion comprises the following steps:

step 1, reading four reference peptide sequence datasets and performing amino acid composition analysis on them, wherein the datasets are listed in Table 1 and the dataset analysis is shown in FIG. 3;

TABLE 1 four reference peptide sequence datasets

The primary letter sequence of the peptide is digitally encoded according to the amino acid alphabet, that is, the 20 basic amino acids are assigned the numbers 1-20 and peptide sequences that are too short are padded with 0 so that all peptide sequences have the same length; the amino acid alphabet is shown in FIG. 2;

further, step 2 comprises:

step 2.1, digitally encoding the primary letter sequence of the peptide according to the amino acid alphabet, that is, assigning the numbers 1-20 to the 20 basic amino acids and padding peptide sequences that are too short with 0 so that all peptide sequences have the same length;

step 2.2, converting the input digital codes into 64-dimensional vector representations through the embedding layer (Embedding) of the Bi-LSTM;

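step 2.3, performing feature extraction on the input 64-dimensional vectors by the Bi-LSTM, wherein the Bi-LSTM specifically comprises: the input $x_t$ at time $t$, the cell state $C_t$, the temporary cell state $\tilde{C}_t$, the hidden state $h_t$, the forget gate $f_t$, the memory gate $i_t$ and the output gate $o_t$; the bidirectional long short-term memory network layer consists of a forward long short-term memory network layer and a backward long short-term memory network layer, and each layer consists of a memory unit and a 64-dimensional hidden unit;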

The amino acid feature coding is based on five amino acid features: binary profile feature (BPF), dipeptide composition (DPC), composition of k-spaced amino acid group pairs (CKSAAGP), amino acid composition (AAC) and sequence-order-coupling number (SOCNumber). The primary letter sequence of the peptide is feature-coded, which converts the peptide sequence into a 770-dimensional feature vector. The five feature codings are as follows:

(1) Binary profile feature (BPF):

in the binary profile, each amino acid letter is represented by a 20-dimensional 0/1 vector; for example, the first amino acid letter A is denoted as $f(A) = (1, 0, \ldots, 0)$, the second amino acid letter C is denoted as $f(C) = (0, 1, \ldots, 0)$, and so on; for a peptide sequence $P$, its binary feature can be expressed as:

$$B(P) = [f(p_1), f(p_2), \ldots, f(p_n)] \tag{7}$$

(2) Dipeptide composition (DPC):

the dipeptide composition consists of 400 descriptors, which give the frequency of each dipeptide combination in a given peptide sequence and can be expressed as:

$$D(a, b) = \frac{N_{ab}}{N - 1}, \quad a, b \in \{A, C, D, \ldots, Y\} \tag{8}$$

wherein $N_{ab}$ is the number of dipeptides represented by amino acid types $a$ and $b$, and $N$ is the length of the peptide sequence.

(3) Composition of k-spaced amino acid group pairs (CKSAAGP):

in the composition of k-spaced amino acid group pairs, the amino acids are divided into five groups according to their physicochemical properties, and the frequency of group pairs separated by any k residues is calculated. For example, with $k = 0$ there are 25 0-spaced group pairs ($g_1g_1, g_1g_2, \ldots, g_5g_5$), and the feature vector is defined as:

$$\left(\frac{N_{g1g1}}{N_{all}}, \frac{N_{g1g2}}{N_{all}}, \ldots, \frac{N_{g5g5}}{N_{all}}\right)_{25} \tag{9}$$

wherein the value of each descriptor represents the composition of the corresponding residue group pair in the peptide sequence. For a peptide sequence of length $N$, when $k = 0, 1, 2, 3, \ldots$, $N_{all} = N-1, N-2, N-3, N-4, \ldots$, respectively.

(4) Amino acid composition (AAC):

the amino acid composition calculates the frequency of each amino acid type in the peptide sequence. The frequencies of the 20 amino acids (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y) can be expressed as:

$$f(a) = \frac{N(a)}{N}, \quad a \in \{A, C, D, \ldots, Y\} \tag{10}$$

wherein $N(a)$ represents the number of times amino acid $a$ appears in the peptide sequence and $N$ represents the length of the peptide sequence.

(5) Sequence-order-coupling number (SOCNumber):

the sequence-order-coupling number can be defined as:

$$\tau_d = \sum_{i=1}^{N-d} \left(d_{i,i+d}\right)^2, \quad d = 1, 2, \ldots, nlag \tag{11}$$

wherein $d_{i,i+d}$ is the distance between the two amino acids at positions $i$ and $i+d$, $nlag$ represents the maximum value of the lag, and $N$ is the length of the peptide sequence;

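step 4, performing feature fusion on the feature vectors generated in steps 2 and 3 through the Concatenate operation, inputting the fused feature vector into a fully connected layer with 512 units and a ReLU activation function, and obtaining a probability score through a fully connected layer with 1 unit and a Sigmoid activation function, by which anticancer peptides and non-anticancer peptides are distinguished.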

To demonstrate the effectiveness of the fused features, BLSTM-ACP was compared with single-feature models; the results are shown in FIG. 4;

as shown in Table 2, the accuracy, Matthews correlation coefficient, sensitivity, specificity and area under the ROC curve of BLSTM-ACP are clearly superior to those of the other methods; a visual comparison is shown in FIGS. 5 and 6.

TABLE 2 comparison of BLSTM-ACP with different anti-cancer peptide prediction methods on four reference datasets

Note: the maximum value is indicated in bold; ACC: accuracy; MCC: Matthews correlation coefficient; SE: sensitivity; SP: specificity; AUC: area under the ROC curve;
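As a non-limiting illustration, ACC, SE, SP and MCC can be computed from the counts of a binary confusion matrix with their standard definitions, as sketched below in Python. The counts in the example are made up for illustration and are not results from Table 2; the function name `classification_metrics` is an assumption, and AUC is omitted because it requires the ranked probability scores rather than the counts.

```python
import math


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """ACC, SE, SP and MCC from confusion-matrix counts (ACPs are the positive class)."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    se = tp / (tp + fn)  # sensitivity: fraction of true ACPs recovered
    sp = tn / (tn + fp)  # specificity: fraction of non-ACPs rejected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SE": se, "SP": sp, "MCC": mcc}


# Made-up counts for illustration only; they are not results from Table 2.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(metrics)
```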

In conclusion, the invention can predict anticancer peptides with high accuracy, high Matthews correlation coefficient, high sensitivity, high specificity and high area under the ROC curve.

In light of the foregoing description of the preferred embodiment of the present invention, many modifications and variations will be apparent to those skilled in the art without departing from the spirit and scope of the invention. The technical scope of the present invention is not limited to the content of the specification, and must be determined according to the scope of the claims.

Claims

1. The method for predicting the anti-cancer peptide based on the bidirectional long-short term memory network and the feature fusion is characterized by comprising the following steps of:

step 2, performing feature extraction on the data set through the Bi-LSTM to generate Bi-LSTM feature vectors;

step 4, performing feature fusion on the feature vectors generated in steps 2 and 3 through the Concatenate operation, inputting the fused feature vector into a fully connected layer with 512 units and a ReLU activation function, obtaining a probability score through a fully connected layer with 1 unit and a Sigmoid activation function, and distinguishing anticancer peptides from non-anticancer peptides by the score.

2. The method for predicting the anticancer peptide based on the fusion of the bidirectional long-short term memory network and the characteristics as claimed in claim 1, wherein the step 2 comprises:

step 2.1, digitally encoding the primary letter sequence of the peptide according to an amino acid alphabet;

step 2.2, converting the input digital codes into 64-dimensional vectors through the embedding layer of the Bi-LSTM;

step 2.3, performing feature extraction on the input 64-dimensional vectors by the Bi-LSTM, wherein the Bi-LSTM comprises: the input $x_t$ at time $t$, the cell state $C_t$, the temporary cell state $\tilde{C}_t$, the hidden state $h_t$, the forget gate $f_t$, the memory gate $i_t$ and the output gate $o_t$; the Bi-LSTM is composed of forward and backward long short-term memory network layers, each composed of a memory unit and a 64-dimensional hidden unit.

3. The anticancer peptide prediction method based on the bidirectional long short-term memory network and feature fusion as set forth in claim 1, wherein step 3 comprises: feature-coding the primary letter sequence of the peptide according to five amino acid features, namely BPF, DPC, CKSAAGP, AAC and SOCNumber, which convert the peptide sequence into a 770-dimensional feature vector.

4. The method of claim 3, wherein the five amino acid features feature-coding the primary letter sequence of the peptide comprises:

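BPF feature coding is expressed as:

$$B(P) = [f(p_1), f(p_2), \ldots, f(p_n)] \tag{7}$$

wherein $P$ is a peptide sequence and $f(p_n)$ represents the amino acid letter at position $n$;

DPC feature coding is expressed as:

$$D(a, b) = \frac{N_{ab}}{N - 1}, \quad a, b \in \{A, C, D, \ldots, Y\} \tag{8}$$

CKSAAGP feature coding is expressed as:

$$\left(\frac{N_{g1g1}}{N_{all}}, \frac{N_{g1g2}}{N_{all}}, \ldots, \frac{N_{g5g5}}{N_{all}}\right)_{25} \tag{9}$$

wherein $N_{g1g1}, \ldots, N_{g5g5}$ are the counts of the 25 0-spaced group pairs;

AAC feature coding is expressed as:

$$f(a) = \frac{N(a)}{N}, \quad a \in \{A, C, D, \ldots, Y\} \tag{10}$$

SOCNumber feature coding is expressed as:

$$\tau_d = \sum_{i=1}^{N-d} \left(d_{i,i+d}\right)^2, \quad d = 1, 2, \ldots, nlag \tag{11}$$

wherein $d_{i,i+d}$ is the distance between the two amino acids at positions $i$ and $i+d$, $nlag$ represents the maximum value of the lag, and $N$ is the length of the peptide sequence.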