CN115512396B - Method and system for predicting anti-cancer peptide and antibacterial peptide based on deep neural network - Google Patents
- Publication number
- CN115512396B CN202211352672.XA
- Authority
- CN
- China
- Prior art keywords
- training
- information
- peptide sequence
- feature extraction
- physicochemical property
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/12—Fingerprints or palmprints
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention discloses a method and a system for predicting anti-cancer peptides and antibacterial peptides based on a deep neural network, belonging to the technical field of peptide identification. The method comprises: obtaining a peptide sequence; extracting fingerprint information, evolution information and physicochemical property information of the peptide sequence; and obtaining a peptide identification result from the fingerprint information, the evolution information, the physicochemical property information and a trained peptide sequence identification model. The peptide sequence identification model comprises a first feature extraction network, a second feature extraction network and a third feature extraction network: the first feature extraction network extracts fingerprint features from the fingerprint information, the second feature extraction network extracts evolution features from the evolution information, and the third feature extraction network extracts physicochemical property features from the physicochemical property information. The fingerprint features, the evolution features and the physicochemical property features are fused to obtain fusion information, and the fusion information is identified to obtain the peptide identification result. The accuracy of the peptide identification result is thereby improved.
Description
Technical Field
The invention relates to the technical field of peptide prediction, in particular to a method and a system for predicting anti-cancer peptides and antibacterial peptides based on a deep neural network.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
The antibacterial peptide and the anticancer peptide are bioactive peptides consisting of a plurality of amino acids, the antibacterial peptide can solve the problem of drug effect reduction caused by drug resistance generated by bacterial pathogens, and the anticancer peptide effectively controls the drug resistance of cancer cells to anticancer drugs, thereby improving the curative effect of the drugs. Therefore, accurate prediction of anticancer and antibacterial peptides plays an important role in the treatment of cancer and the design of antibacterial drugs.
Existing techniques for predicting anti-cancer peptides and antibacterial peptides mainly consist of two parts, feature extraction and model prediction, and most of them train a model by simply combining an existing sequence feature extraction method with a deep learning network. Although these methods achieve some predictive performance, they have three limitations:
1. In feature extraction, no feature represents the global information of the peptide chain, and specific physicochemical properties are often used directly when extracting physicochemical property features, which makes the sequence feature information redundant and of low quality.
2. No machine learning or deep learning method is designed specifically for each kind of feature. Many models process multiple features with the same or similar neural networks, so the sequence feature information is not used reasonably.
3. Traditional neural network training randomly divides the data into a training set and a verification set and trains the final model on that split; it does not make full use of the neural network's preference on the existing data set to divide the training set and the verification set reasonably. Because the division is random, different divisions deviate widely on the same test set, and the prediction performance of the final network model is therefore extremely unstable.
For these reasons, existing methods for identifying anti-cancer peptides and antibacterial peptides have low identification accuracy and poor generalization ability.
Disclosure of Invention
In order to solve the problems, the invention provides a method and a system for predicting an anti-cancer peptide and an anti-bacterial peptide based on a deep neural network, and the accuracy of peptide sequence identification is improved.
In order to realize the purpose, the invention adopts the following technical scheme:
in a first aspect, a method for predicting an anticancer peptide and an antimicrobial peptide based on a deep neural network is provided, which includes:
obtaining a peptide sequence;
determining the physicochemical properties of each amino acid in the peptide sequence;
extracting evolution information of the peptide sequence, and extracting fingerprint information and physicochemical property information of the peptide sequence according to the physicochemical property of the amino acid;
obtaining a peptide identification result from the fingerprint information, the evolution information, the physicochemical property information and a trained peptide sequence identification model, wherein the peptide sequence identification model comprises a first feature extraction network, a second feature extraction network and a third feature extraction network, the first feature extraction network extracts fingerprint features from the fingerprint information, the second feature extraction network extracts evolution features from the evolution information, the third feature extraction network extracts physicochemical property features from the physicochemical property information, the fingerprint features, the evolution features and the physicochemical property features are fused to obtain fusion information, and the fusion information is identified to obtain the peptide identification result.
In a second aspect, a deep neural network based anticancer and antibacterial peptide prediction system is provided, comprising:
a data acquisition module for acquiring a peptide sequence;
the information extraction module is used for extracting evolution information of the peptide sequence, determining the physicochemical property of each amino acid in the peptide sequence, and extracting fingerprint information and physicochemical property information of the peptide sequence according to the physicochemical property of the amino acid;
and the identification module is used for acquiring a peptide identification result through the fingerprint information, the evolution information, the physicochemical property information and the trained peptide sequence identification model, wherein the peptide sequence identification model comprises a first feature extraction network, a second feature extraction network and a third feature extraction network, the first feature extraction network extracts fingerprint features from the fingerprint information, the second feature extraction network extracts evolution features from the evolution information, the third feature extraction network extracts physicochemical property features from the physicochemical property information, the fingerprint features, the evolution features and the physicochemical property features are fused to acquire fused information, and the fused information is identified to acquire the peptide identification result.
In a third aspect, an electronic device is provided, which includes a memory and a processor, and computer instructions stored in the memory and executed on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method for predicting an anticancer peptide and an antimicrobial peptide based on a deep neural network.
In a fourth aspect, a computer readable storage medium is provided for storing computer instructions, which when executed by a processor, perform the steps of a deep neural network-based method for predicting anticancer and antibacterial peptides.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention obtains comprehensive feature information of the peptide sequence by extracting evolution features, fingerprint features and physicochemical property features of the peptide sequence, fuses the three kinds of features, and obtains the peptide identification result by identifying the fused information, which improves the accuracy of the peptide identification result.
2. The invention extracts features from the evolution information, the fingerprint information and the physicochemical property information through different feature extraction networks, so that effective features can be extracted from each kind of information and the information is used reasonably, which effectively ensures the accuracy of the peptide identification result.
3. When the peptide sequence recognition model is trained, the training set and the verification set are first divided randomly and are then re-divided reasonably according to the preference of the peptide sequence recognition model. Training the model with these training and verification sets improves the recognition accuracy of the trained peptide sequence recognition model and further ensures the accuracy of the peptide identification result.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, are included to provide a further understanding of the application, and the description of the exemplary embodiments and illustrations of the application are intended to explain the application and are not intended to limit the application.
FIG. 1 is a flow block diagram of the method disclosed in Example 1;
FIG. 2 is a block diagram of the first feature extraction network disclosed in Example 1;
FIG. 3 is a block diagram of the second feature extraction network disclosed in Example 1;
FIG. 4 is a block diagram of the third feature extraction network disclosed in Example 1.
Detailed Description
The invention is further described with reference to the following figures and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
Example 1
In order to improve the accuracy of the peptide identification result, in this embodiment, a method for predicting anticancer peptides and antibacterial peptides based on a deep neural network is disclosed, as shown in fig. 1, including:
S1: obtaining a peptide sequence;
S2: determining the physicochemical properties of each amino acid in the peptide sequence; extracting evolution information of the peptide sequence, and extracting fingerprint information and physicochemical property information of the peptide sequence according to the physicochemical properties of the amino acids;
S3: obtaining a peptide identification result from the fingerprint information, the evolution information, the physicochemical property information and a trained peptide sequence identification model, wherein the peptide sequence identification model comprises a first feature extraction network, a second feature extraction network and a third feature extraction network, the first feature extraction network extracts fingerprint features from the fingerprint information, the second feature extraction network extracts evolution features from the evolution information, the third feature extraction network extracts physicochemical property features from the physicochemical property information, the fingerprint features, the evolution features and the physicochemical property features are fused to obtain fusion information, and the fusion information is identified to obtain the peptide identification result.
In a specific implementation, 158 physicochemical properties are determined for each amino acid in the peptide sequence; the 158 physicochemical properties are taken from a physicochemical property database (AAindex).
The evolution information of a peptide sequence is represented by constructing the PSSM of the peptide sequence. Specifically, PSI-BLAST is used to obtain the position-specific scoring matrix (PSSM) of the peptide sequence. Since peptide sequences are composed of 20 kinds of amino acids, the evolution information obtained for a peptide sequence is an L × 20 PSSM feature matrix, where L is the length of the peptide sequence, and the element in row $i$ and column $j$ of the matrix is proportional to the probability that the residue at position $i$ is substituted by amino acid $j$.
According to the physicochemical properties of amino acids, the process of extracting the fingerprint information of the peptide sequence comprises the following steps:
s21: CGR curves of peptides were constructed according to the physicochemical properties of amino acids. The method comprises the following specific steps:
sequencing all amino acids in a peptide sequence according to the numerical values of the physicochemical properties, and uniformly mapping all the amino acids on a unit circle to construct a CGR curve of the peptide, wherein the coordinates of the unit circle are as follows:
wherein,Nindicates the length of the peptide sequence.
For each peptide sequence, 158 kinds of CGR curves were obtained using 158 kinds of physicochemical properties selected in the physicochemical property database (AAindex).
S22: The CGR curve is divided into a plurality of sub-blocks, and the points on the boundaries of adjacent sub-blocks are determined; the partitioned CGR curve is rotated to obtain the corresponding points of the rotated sub-blocks and the points on the boundaries of adjacent sub-blocks after rotation; the Euclidean distances between the points before rotation and between the corresponding points after rotation are calculated to form distance matrices.
Preferably, the CGR curve is divided into four sub-blocks according to the four quadrants of the coordinate system. Using the position information of the points on the boundaries of adjacent sub-blocks, the coordinate axes are rotated by 45° to obtain four rotated sub-blocks. For each CGR curve, the Euclidean distances between all point pairs within each of the eight sub-blocks are calculated to obtain eight distance matrices.
S23: The principal eigenvalue of each distance matrix is extracted, and the principal eigenvalues form the fingerprint information of the peptide sequence.
Calculating the principal eigenvalue of each distance matrix gives the feature matrix $M_{DCGR}$, which has one row for each of the 158 physicochemical properties and one column for each of the eight sub-blocks.
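Steps S22 and S23 can be illustrated with a minimal NumPy sketch. It assumes the CGR curve is already available as an array of 2-D points (the construction formula of S21 is not reproduced here), and the function names `quadrant_blocks`, `principal_eigenvalue` and `fingerprint_row` are hypothetical. Treating a sub-block with fewer than two points as contributing 0 is also an assumption, not something the description specifies.

```python
import numpy as np

def quadrant_blocks(points):
    """Split 2-D points into the four quadrants of the coordinate system."""
    x, y = points[:, 0], points[:, 1]
    masks = [(x >= 0) & (y >= 0), (x < 0) & (y >= 0),
             (x < 0) & (y < 0), (x >= 0) & (y < 0)]
    return [points[m] for m in masks]

def principal_eigenvalue(block):
    """Largest eigenvalue of the pairwise Euclidean distance matrix of a block."""
    if len(block) < 2:
        # Assumption: an empty or single-point sub-block contributes 0.
        return 0.0
    diff = block[:, None, :] - block[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # A distance matrix is symmetric, so eigvalsh applies (ascending order).
    return float(np.linalg.eigvalsh(dist)[-1])

def fingerprint_row(points, angle=np.pi / 4):
    """Eight principal eigenvalues: four sub-blocks before and four after
    rotating the coordinate axes by `angle` (45 degrees by default)."""
    c, s = np.cos(angle), np.sin(angle)
    to_rotated_axes = np.array([[c, s], [-s, c]])  # coordinates in the rotated frame
    rotated = points @ to_rotated_axes.T
    blocks = quadrant_blocks(points) + quadrant_blocks(rotated)
    return np.array([principal_eigenvalue(b) for b in blocks])

# Toy curve with three points; a real CGR curve has one point per residue.
curve = np.array([[0.5, 0.5], [0.5, 0.1], [-0.3, -0.4]])
row = fingerprint_row(curve)
```

Stacking one such row per physicochemical property would give a 158 × 8 matrix of the kind described for the fingerprint information.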
According to the physicochemical property of amino acid, the process of extracting the physicochemical property information of the peptide sequence comprises the following steps:
clustering all physicochemical properties in the physicochemical property database, and extracting the most representative property in each cluster as a representative physicochemical property of the amino acids;
and extracting the representative physicochemical properties from the physicochemical properties of each amino acid of the peptide sequence to obtain the physicochemical property information of the peptide sequence.
Whereas conventional methods usually select specific physicochemical properties directly, in this embodiment, in order to avoid redundancy and obtain a more comprehensive distribution of physicochemical properties, the 556 physicochemical properties in AAindex are first divided into 8 clusters;
the most representative physicochemical property in each cluster is extracted, giving eight representative physicochemical properties, and each amino acid is encoded as an eight-dimensional vector by physicochemical property embedding (PCPE).
An L × 8 feature matrix of the peptide sequence is constructed from the eight-dimensional vectors of the amino acids in the peptide sequence, giving the physicochemical property information of the peptide sequence.
Before the fingerprint information, the evolution information and the physicochemical property information are input into the trained peptide sequence identification model, the lengths of the evolution information and the physicochemical property information of the peptide sequence are unified to suit the subsequent peptide sequence identification model.
Specifically, the length of the information may be uniformly set to 50; information that is shorter than this is padded with 0.
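The length unification can be sketched as follows. This is a minimal illustration, not the patented implementation: `pad_to_length` is a hypothetical helper, and truncating inputs longer than 50 rows is an assumption, since the description only covers the case where the length is insufficient.

```python
import numpy as np

def pad_to_length(features, target_len=50):
    """Zero-pad an (L, d) feature matrix along the sequence axis to target_len rows."""
    features = np.asarray(features, dtype=float)
    padded = np.zeros((target_len, features.shape[1]))
    # Assumption: inputs longer than target_len are truncated.
    n = min(features.shape[0], target_len)
    padded[:n] = features[:n]
    return padded

# Placeholder inputs for a peptide of length 12.
pssm = np.ones((12, 20))   # L x 20 evolution information
pcpe = np.ones((12, 8))    # L x 8 physicochemical property information
pssm_50 = pad_to_length(pssm)
pcpe_50 = pad_to_length(pcpe)
```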
The peptide sequence recognition model is described in detail below.
In order to extract effective features from fingerprint information, evolution information and physicochemical property information, a first feature extraction network, a second feature extraction network and a third feature extraction network are arranged in a peptide sequence identification model, fingerprint features are extracted from the fingerprint information through the first feature extraction network, evolution features are extracted from the evolution information through the second feature extraction network, and physicochemical property features are extracted from the physicochemical property information through the third feature extraction network.
The first feature extraction network adopts a multi-channel convolutional neural network to which a channel attention mechanism is added. As shown in FIG. 2, the fingerprint information $M_{DCGR}$ is input into the first feature extraction network, which captures important features through local connectivity and weight sharing.
Since each row of the feature matrix $M_{DCGR}$ corresponds to one of the 158 physicochemical properties and its columns hold the 8-dimensional feature extracted from the corresponding CGR curve, the model uses a convolution kernel of size 1 × 8, which is more appropriate than a general square convolution kernel, to learn weights shared across the 158 physicochemical properties. In the present invention, the number of filters is set to 16.
Global information is obtained by comprehensively considering all 158 properties using a channel attention module (CAM).
The specific steps are as follows:
(1) A three-dimensional feature map is obtained through a convolutional layer:
$$M'_{DCGR} = f_{conv}(M_{DCGR})$$
(2) Global average pooling and global max pooling are applied to $M'_{DCGR}$, and a shared-weight multi-layer perceptron (MLP) consisting of two fully connected layers is then used to obtain the classification importance of each channel $i$ of $M'_{DCGR}$ and to assign it a channel weight $CAM_i$.
The overall process of the CAM is expressed as:
$$CAM = \sigma\big(W_1(W_0(F_{avg})) + W_1(W_0(F_{max}))\big)$$
where $\sigma$ denotes the sigmoid function, $F_{avg}$ and $F_{max}$ are obtained by average pooling and max pooling respectively, and $W_0$ and $W_1$ denote the weight matrices of the shared MLP.
The channel weights are assigned to the corresponding channels of the three-dimensional feature map $M'_{DCGR}$ by element-wise multiplication, giving the feature map $M''_{DCGR}$.
$M''_{DCGR}$ is flattened and passed through a fully connected layer to generate the final 400-dimensional DCGR feature vector $F_{DCGR}$.
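The channel attention step can be sketched in NumPy as below. This is a hedged illustration under several assumptions: the feature map has shape 16 × 158 × 1 (16 filters applied to a 158 × 8 input with a 1 × 8 kernel), the shared MLP uses a ReLU between its two layers and a reduction ratio of 4, and the weights are random placeholders. None of these details is stated in the description, and `channel_attention` is a hypothetical name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feature_map, w0, w1):
    """feature_map: (C, H, W); w0: (C // r, C); w1: (C, C // r)."""
    f_avg = feature_map.mean(axis=(1, 2))  # global average pooling, shape (C,)
    f_max = feature_map.max(axis=(1, 2))   # global max pooling, shape (C,)

    def shared_mlp(v):
        # Two fully connected layers with shared weights; the ReLU is an assumption.
        return w1 @ np.maximum(w0 @ v, 0.0)

    weights = sigmoid(shared_mlp(f_avg) + shared_mlp(f_max))
    # Element-wise multiplication of each channel by its weight.
    return weights, feature_map * weights[:, None, None]

rng = np.random.default_rng(0)
channels, reduction = 16, 4                     # reduction ratio is an assumption
fmap = rng.normal(size=(channels, 158, 1))      # placeholder convolution output
w0 = rng.normal(scale=0.1, size=(channels // reduction, channels))
w1 = rng.normal(scale=0.1, size=(channels, channels // reduction))
cam, reweighted = channel_attention(fmap, w0, w1)
```

Flattening `reweighted` and applying a fully connected layer with 400 outputs would then give a fingerprint feature vector of the stated size.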
The second feature extraction network adopts a bidirectional long short-term memory network (Bi-LSTM); as shown in FIG. 3, the evolution information (the PSSM feature matrix) is input into the Bi-LSTM.
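The calculation of the forward LSTM is summarized as follows:
$$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i[h_{t-1}, x_t] + b_i)$$
$$\tilde{c}_t = \tanh(W_c[h_{t-1}, x_t] + b_c)$$
$$c_t = f_t \otimes c_{t-1} + i_t \otimes \tilde{c}_t$$
$$o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \otimes \tanh(c_t)$$
where $W_f$, $W_i$, $W_c$ and $W_o$ are weight matrices; $b_f$, $b_i$, $b_c$ and $b_o$ are bias vectors; $f_t$ is the forget gate; $i_t$ is the input gate; $o_t$ is the output gate; $x_t$ is the current input; $c_{t-1}$ is the previous cell state; $c_t$ is the current cell state; $\tilde{c}_t$ is the new value added to the cell state; $h_{t-1}$ and $h_t$ are the previous and current hidden states respectively; and $\otimes$ denotes element-wise multiplication.
The backward LSTM works in the same way as the forward LSTM, and the current hidden state is calculated as $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$.
This gives a 256-dimensional PSSM feature vector $F_{PSSM} = h_T$, where $T$ is the last time step.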
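A minimal NumPy sketch of this Bi-LSTM branch follows. It uses the standard LSTM gate equations with randomly initialized placeholder weights. The hidden size of 128 per direction (so that the concatenation is 256-dimensional), taking the final hidden state of each direction, and the helper names `init_params` and `lstm_last_hidden` are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(input_dim, hidden_dim, rng):
    """One weight matrix and bias per gate (f, i, c, o), acting on [h_{t-1}, x_t]."""
    shape = (hidden_dim, hidden_dim + input_dim)
    return {g: (rng.normal(scale=0.1, size=shape), np.zeros(hidden_dim))
            for g in "fico"}

def lstm_last_hidden(xs, params, hidden_dim):
    """Run an LSTM over the rows of xs and return the final hidden state."""
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    for x_t in xs:
        z = np.concatenate([h, x_t])
        f_t = sigmoid(params["f"][0] @ z + params["f"][1])      # forget gate
        i_t = sigmoid(params["i"][0] @ z + params["i"][1])      # input gate
        c_tilde = np.tanh(params["c"][0] @ z + params["c"][1])  # candidate value
        o_t = sigmoid(params["o"][0] @ z + params["o"][1])      # output gate
        c = f_t * c + i_t * c_tilde
        h = o_t * np.tanh(c)
    return h

rng = np.random.default_rng(0)
pssm = rng.normal(size=(50, 20))   # placeholder for a padded 50 x 20 PSSM
hidden = 128                       # assumption: 128 units per direction
fwd = init_params(20, hidden, rng)
bwd = init_params(20, hidden, rng)
f_pssm = np.concatenate([
    lstm_last_hidden(pssm, fwd, hidden),        # forward direction
    lstm_last_hidden(pssm[::-1], bwd, hidden),  # backward direction
])
```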
The third feature extraction network adopts a multi-head self-attention network (Transformer network) and takes as input the feature matrix obtained by PCPE. Since the order of the residues plays a crucial role in a peptide sequence, this example uses sine and cosine position encoding to reflect the distribution of the physicochemical properties along the peptide sequence. The specific method is as follows:
$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$
$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$
where $pos$ denotes the amino acid position in the peptide sequence, $2i$ and $2i+1$ denote the even and odd element positions of the embedding vector, and $d$ is the dimension of the embedding vector.
A new feature matrix incorporating the position encoding information, $X = [x_1, x_2, \ldots, x_L]$, is obtained, where $x_i$ denotes the feature vector of the $i$-th residue and $L = 50$ denotes the peptide chain length.
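The sine and cosine position encoding can be sketched as below, assuming the standard Transformer formulation with base 10000 and an even embedding dimension (8 here, matching the PCPE vectors). Adding the encoding to the feature matrix is also the standard choice and is assumed rather than stated; the zero matrix is only a placeholder for the PCPE features.

```python
import numpy as np

def positional_encoding(length, dim):
    """Sinusoidal position encoding of shape (length, dim); dim is assumed even."""
    pos = np.arange(length)[:, None]       # amino acid positions 0 .. length - 1
    two_i = np.arange(0, dim, 2)[None, :]  # even element indices 2i
    angles = pos / np.power(10000.0, two_i / dim)
    pe = np.zeros((length, dim))
    pe[:, 0::2] = np.sin(angles)           # even element positions
    pe[:, 1::2] = np.cos(angles)           # odd element positions
    return pe

pe = positional_encoding(50, 8)
pcpe = np.zeros((50, 8))   # placeholder for the padded 50 x 8 PCPE feature matrix
x = pcpe + pe              # assumption: the encoding is added element-wise
```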
The dependency between amino acid residues at any distance is extracted with a single-head self-attention mechanism, which effectively captures the physicochemical property distribution information of the peptide. The calculation process is as follows:
$$Q = XW^Q,\quad K = XW^K,\quad V = XW^V$$
$$A = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)$$
$$\mathrm{Attention}(Q, K, V) = AV$$
where $A$ denotes the attention score matrix; $Q$, $K$ and $V$ denote the query, key and value respectively; $d_k$ is the dimension of the key vectors; and $W^Q$, $W^K$ and $W^V$ are the corresponding weight matrices.
Average pooling is applied to the feature matrix output by the encoder to obtain the 50-dimensional PCPE feature vector of the peptide chain.
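A hedged NumPy sketch of scaled dot-product self-attention followed by average pooling is given below. It uses the standard formulation with random placeholder weights and omits the residual connections, layer normalization and feed-forward sublayer that a full Transformer encoder would contain. Pooling over the feature axis, so that a 50 × 8 matrix becomes a 50-dimensional vector, is an assumption chosen to match the stated output size.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row maximum for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the rows of x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # attention score matrix
    return scores, scores @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 8))  # placeholder for position-encoded PCPE features
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(8, 8)) for _ in range(3))
scores, encoded = self_attention(x, w_q, w_k, w_v)
f_pcpe = encoded.mean(axis=1)  # average over the feature axis: one value per residue
```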
The fingerprint features, the evolution features and the physicochemical property features are fused to obtain fusion information, and the fusion information is identified to obtain the peptide identification result. Specifically:
After batch normalization, 400-dimensional, 256-dimensional and 50-dimensional feature vectors are obtained as the outputs of the respective branches.
The feature vectors output by the branches are concatenated and passed through a Dropout layer and a fully connected layer to obtain the final prediction result for the anticancer peptide or the antibacterial peptide.
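The fusion step can be sketched as follows. This is an illustration under assumptions: batch normalization is shown without learnable scale and shift, Dropout is disabled by default as at inference time, the output layer is a single sigmoid unit, and all inputs and weights are random placeholders. `fuse_and_predict` is a hypothetical name.

```python
import numpy as np

def batch_norm(batch, eps=1e-5):
    """Normalize each feature over the batch axis (no learnable parameters)."""
    return (batch - batch.mean(axis=0)) / np.sqrt(batch.var(axis=0) + eps)

def fuse_and_predict(f_dcgr, f_pssm, f_pcpe, w, b, drop_rate=0.0, rng=None):
    """Concatenate the three branch outputs and apply Dropout and a dense layer."""
    fused = np.concatenate(
        [batch_norm(f_dcgr), batch_norm(f_pssm), batch_norm(f_pcpe)], axis=1)
    if drop_rate > 0.0 and rng is not None:
        # Inverted dropout, used only during training.
        keep = rng.random(fused.shape) >= drop_rate
        fused = fused * keep / (1.0 - drop_rate)
    logits = fused @ w + b
    # Assumption: a sigmoid output gives the probability of the positive class.
    return fused, 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
batch = 4
f_dcgr = rng.normal(size=(batch, 400))
f_pssm = rng.normal(size=(batch, 256))
f_pcpe = rng.normal(size=(batch, 50))
w = rng.normal(scale=0.01, size=400 + 256 + 50)
fused, probs = fuse_and_predict(f_dcgr, f_pssm, f_pcpe, w, b=0.0)
```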
The process of obtaining the trained peptide sequence recognition model is as follows:
acquiring a training set and a verification set for each training, training the constructed peptide sequence recognition model through the training set for each training, and verifying the training effect of the peptide sequence recognition model through the verification set for each training;
from the verification set for the current training, selecting the samples whose number of model prediction errors during the last set number of training rounds exceeds the set number of errors, to form a screened verification set; from the training set for the current training, selecting the samples that were accurately classified during the last set number of training rounds, to form a screened training set; selecting samples from the screened verification set to form a verification-set sample set to be exchanged, and selecting samples from the screened training set to form a training-set sample set to be exchanged; exchanging the two sample sets between the verification set and the training set for the current training to form a new verification set and a new training set, which are used as the verification set and the training set for the next training.
The process of obtaining the training set and the verification set for the first training is as follows:
acquiring a peptide sequence for training, and labeling the peptide sequence for training;
the evolution information, fingerprint information and physicochemical property information of the training peptide sequence are extracted from the training peptide sequence to form a training data set, and the training data set is randomly divided into a training set and a verification set, namely the training set and the verification set for the first training.
The constructed peptide sequence recognition model is trained with the training set and the verification set for the last training, and the trained peptide sequence recognition model is obtained after the training is finished.
In a specific implementation, S31: acquire peptide sequences for training, and label the peptide sequences for training;
S32: randomly divide the data into a training set $T = (X_T, Y_T)$ and a verification set $V = (X_V, Y_V)$, which form the training set and the verification set for the first training,
wherein $X_T$ and $X_V$ respectively represent the features of the training set and the verification set, and $Y_T$ and $Y_V$ represent the sample labels;
S33: train the constructed peptide sequence recognition model with the training set, and verify the training effect of the peptide sequence recognition model with the verification set;
S34: in the verification set $V$, find the samples whose number of prediction errors in the last 10 rounds of training exceeds 5, and generate the screened verification set. Meanwhile, find the samples in the training set $T$ that were accurately classified in the last 10 rounds of training, and generate the screened training set. Randomly select [k/2] samples from the screened verification set to form the sample set $V_{change}$, and randomly select [k/2] samples from the screened training set to form the sample set $T_{change}$. Exchange the sample set $T_{change}$ in the training set $T$ with the sample set $V_{change}$ in the verification set $V$ to construct a new training set $T_{new}$ and a new verification set $V_{new}$, i.e. $T_{new} = T - T_{change} + V_{change}$, $V_{new} = V - V_{change} + T_{change}$.
S36: reinitialize the peptide sequence recognition model, and train it with the final training set and verification set to obtain the trained peptide sequence recognition model.
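The sample exchange in S34 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: samples are identified by integer ids, [k/2] is read as integer division, the number of swapped samples is capped so that both sides exchange the same amount, and the function and argument names are hypothetical.

```python
import random

def exchange_samples(train_ids, val_ids, val_error_counts,
                     train_always_correct, k, max_errors=5, seed=0):
    """Swap hard validation samples with reliably classified training samples.

    val_error_counts: errors per validation sample over the last rounds.
    train_always_correct: training samples classified correctly in those rounds.
    """
    rng = random.Random(seed)
    # Screened sets: frequently misclassified validation samples and
    # consistently correct training samples.
    val_screened = [s for s in val_ids if val_error_counts.get(s, 0) > max_errors]
    train_screened = [s for s in train_ids if s in train_always_correct]
    # Assumption: swap the same number on both sides so set sizes are preserved.
    n = min(k // 2, len(val_screened), len(train_screened))
    v_change = set(rng.sample(val_screened, n))
    t_change = set(rng.sample(train_screened, n))
    # T_new = T - T_change + V_change ; V_new = V - V_change + T_change
    t_new = [s for s in train_ids if s not in t_change] + sorted(v_change)
    v_new = [s for s in val_ids if s not in v_change] + sorted(t_change)
    return t_new, v_new

# Hypothetical toy data: 8 training samples, 4 validation samples.
train = list(range(0, 8))
val = list(range(8, 12))
errors = {8: 7, 9: 6, 10: 1, 11: 0}   # errors in the last 10 rounds
correct = {0, 1, 2}                   # always classified correctly
t_new, v_new = exchange_samples(train, val, errors, correct, k=4)
print(t_new, v_new)
```

Moving hard validation samples into the training set lets the model learn from the cases it gets wrong, while the easy training samples moved the other way keep both sets the same size.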
The prediction method provided by this embodiment can effectively fuse multiple aspects of sequence information. On this basis, a three-branch neural network framework (TriNet) is designed according to the characteristics of the three kinds of features, so that each feature is properly processed and the features are effectively fused for the final prediction.
On the ACP740 data set, using five-fold cross-validation and comparing with ACP-DL, MHCNN, iACP-DRLF, CL-ACP and DeepACPpred, the improvement percentages of accuracy, sensitivity, specificity, precision, F1 score and Matthews correlation coefficient are respectively 3.2%-8.6%, 1.9%-6.9%, 3.2%-21.5%, 3.0%-12.0% and 3.2%-6.9%. On the ACPmain data set, using five-fold cross-validation and comparing TriNet with ACP-DL, MHCNN, iACP-DRLF, AntiCP 2.0-AAC and AntiCP 2.0-DPC, the improvement percentages of accuracy, sensitivity, specificity, precision, F1 score and Matthews correlation coefficient are respectively 9.3%-23.7%, 15.3%-47.3%, 3.6%-13.6%, 5.6%-14.3%, 10.4%-28.2% and 25.4%-73.1%. On the AMP2828 data set, using its independent test set and comparing the TriNet of the invention with ACP-DL, MHCNN, iACP-DRLF, AntiCP 2.0-AAC and AntiCP 2.0-DPC, the improvement percentages of accuracy, sensitivity, specificity, precision, F1 score and Matthews correlation coefficient are respectively 6.1%-29.5%, 0.7%-11.0%, 1.8%-79.4%, 2.3%-44.6%, 6.7%-22.8% and 13.2%-67.4%.
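For reference, the six evaluation metrics named above can be computed from a binary confusion matrix using their standard definitions, as in the following sketch. The counts in the example are hypothetical and are not taken from the reported experiments.

```python
import math

def safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = safe_div(tp, tp + fn)          # recall on positives
    specificity = safe_div(tn, tn + fp)          # recall on negatives
    precision = safe_div(tp, tp + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": safe_div(2 * precision * sensitivity, precision + sensitivity),
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
    }

# Hypothetical counts for 100 test peptides.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The Matthews correlation coefficient uses all four cells of the confusion matrix, which makes it more informative than accuracy on imbalanced data sets.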
In addition to the commonly used evolutionary features, this example introduces two new features, sequence fingerprint information and physicochemical property information, to properly characterize the global sequence information and the physicochemical property distribution of polypeptides. Test results show that the proposed features have wide adaptability and effectiveness. Specifically, the feature matrices of the branches are flattened and concatenated, and prediction is performed with the traditional machine learning method XGBoost in place of the neural network structure. On the ACP740 data set, the improvement percentages of accuracy and Matthews correlation coefficient are respectively 0.47%-5.8% and 0.28%-15.3%; on the ACPmain data set, they are respectively 4.0%-17.7% and 11.3%-53.8%; on the AMP2828 data set, they are respectively 4.3%-27.2% and 9.1%-61.2%.
In addition, this embodiment also provides a novel TVI training method, which can generate more suitable training and verification sets according to the deviation of the network model. Specifically, compared with the traditional training method, on the ACP740 data set the average improvement percentages of accuracy, sensitivity, specificity, precision, F1 score and Matthews correlation coefficient are respectively 2.0%, 2.1%, 1.9%, 1.9%, 2.0% and 4.6%, and the maximum improvement percentages are respectively 3.9%, 6.6%, 8.2%, 6.6%, 4.4% and 9.1%; on the ACPmain data set, the average improvement percentages are respectively 1.7%, 2.3%, 1.0%, 1.1%, 1.7% and 4.5%, and the maximum improvement percentages are respectively 4.8%, 6.0%, 6.3%, 5.6%, 4.8% and 12.8%; on the AMP2828 data set, the average improvement percentages are respectively 0.5%, 0.3%, 0.7%, 0.7%, 0.5% and 1.0%, and the maximum improvement percentages are respectively 1.1%, 1.1%, 2.9%, 2.9%, 1.1% and 2.2%.
Example 2
In this embodiment, disclosed is a deep neural network-based anticancer and antibacterial peptide prediction system, comprising:
a data acquisition module for acquiring a peptide sequence;
the information extraction module is used for extracting evolution information of the peptide sequence, determining the physicochemical property of each amino acid in the peptide sequence, and extracting fingerprint information and physicochemical property information of the peptide sequence according to the physicochemical property of the amino acid;
the identification module is used for obtaining a peptide identification result through fingerprint information, evolution information, physicochemical property information and a trained peptide sequence identification model, wherein the peptide sequence identification model comprises a first feature extraction network, a second feature extraction network and a third feature extraction network, the first feature extraction network extracts fingerprint features from the fingerprint information, the second feature extraction network extracts evolution features from the evolution information, the third feature extraction network extracts physicochemical property features from the physicochemical property information, the fingerprint features, the evolution features and the physicochemical property features are fused to obtain fused information, and the fused information is identified to obtain the peptide identification result.
Example 3
In this embodiment, an electronic device is disclosed, comprising a memory and a processor, and computer instructions stored in the memory and executed on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method for predicting anti-cancer and anti-microbial peptides based on deep neural networks disclosed in embodiment 1.
Example 4
In this embodiment, a computer readable storage medium is disclosed for storing computer instructions which, when executed by a processor, perform the steps of a method for predicting anti-cancer and anti-microbial peptides based on a deep neural network disclosed in embodiment 1.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.
Claims (5)
1. A method for predicting anti-cancer peptides and antibacterial peptides based on a deep neural network is characterized by comprising the following steps:
obtaining a peptide sequence;
determining the physicochemical properties of each amino acid in the peptide sequence;
extracting evolution information of the peptide sequence, and extracting fingerprint information and physicochemical property information of the peptide sequence according to the physicochemical property of the amino acid, wherein the evolution information of the peptide sequence is represented by constructing a PSSM matrix of the peptide sequence; constructing a CGR curve of the peptide according to the physicochemical properties of the amino acid; dividing the CGR curve into a plurality of sub-blocks, and determining points on the boundary of adjacent sub-blocks; rotating the partitioned CGR curve to obtain corresponding points of the rotated sub-blocks and points on the boundaries of the adjacent sub-blocks after rotation; calculating the Euclidean distance of points on two adjacent boundaries and the Euclidean distance of corresponding points after two adjacent rotated points to form a distance matrix; extracting main characteristic values of each distance matrix to form peptide sequence fingerprint information; clustering all physicochemical properties in the physicochemical property database, and extracting the most representative property in each cluster as the representative physicochemical property of the amino acid; extracting representative physicochemical properties of amino acids from the physicochemical properties of each amino acid of the peptide sequence to obtain physicochemical property information of the peptide sequence;
obtaining a peptide identification result through fingerprint information, evolution information, physicochemical property information and a trained peptide sequence identification model, wherein the peptide sequence identification model comprises a first feature extraction network, a second feature extraction network and a third feature extraction network, the first feature extraction network extracts fingerprint features from the fingerprint information, the second feature extraction network extracts evolution features from the evolution information, the third feature extraction network extracts physicochemical property features from the physicochemical property information, the fingerprint features, the evolution features and the physicochemical property features are fused to obtain fusion information, and the fusion information is identified to obtain a peptide identification result;
the first feature extraction network adopts a multi-channel convolutional neural network, and a channel attention mechanism is added in the multi-channel convolutional neural network; the second feature extraction network adopts a bidirectional long short-term memory network; the third feature extraction network adopts a multi-head self-attention network;
acquiring a training set and a verification set for each training, training the constructed peptide sequence recognition model through the training set for each training, and verifying the training effect of the peptide sequence recognition model through the verification set for each training;
selecting samples with model prediction error times exceeding the set error times in the training process of the last set round number from the verification set for the training to form a verification set after screening; selecting samples which are accurately classified in the training process of the last set number of rounds from the training set for the training to form a training set after screening; selecting samples from the screened verification set to form a verification set to-be-exchanged sample set, selecting samples from the screened training set to form a training set to-be-exchanged sample set, exchanging the verification set to-be-exchanged sample set in the verification set for the training with the training set to-be-exchanged sample set in the training set for the training to form a new verification set and a new training set, and using the new verification set and the new training set as the verification set and the training set for the next training;
and training the constructed peptide sequence recognition model through a training set and a verification set for the last training, and obtaining the trained peptide sequence recognition model after the training is finished.
2. The method of claim 1, wherein the evolution information and the physicochemical property information of the peptide sequence are unified before inputting the fingerprint information, the evolution information and the physicochemical property information into the trained peptide sequence recognition model.
3. An anticancer peptide and antibacterial peptide prediction system based on a deep neural network, which is characterized by comprising:
a data acquisition module for acquiring a peptide sequence;
the information extraction module is used for extracting evolution information of the peptide sequence, determining the physicochemical property of each amino acid in the peptide sequence, and extracting fingerprint information and physicochemical property information of the peptide sequence according to the physicochemical property of the amino acid, wherein the evolution information of the peptide sequence is represented by constructing a PSSM matrix of the peptide sequence; constructing a CGR curve of the peptide according to the physicochemical properties of the amino acid; dividing the CGR curve into a plurality of sub-blocks, and determining points on the boundary of adjacent sub-blocks; rotating the partitioned CGR curve to obtain corresponding points of the rotated sub-blocks and points on the boundaries of the adjacent sub-blocks after rotation; calculating Euclidean distances of points on two adjacent boundaries and Euclidean distances of corresponding points after two adjacent rotated points to form a distance matrix; extracting main characteristic values of each distance matrix to form peptide sequence fingerprint information; clustering all physicochemical properties in the physicochemical property database, and extracting the most representative property in each cluster as the representative physicochemical property of the amino acid; extracting representative physicochemical properties of amino acids from the physicochemical properties of each amino acid of the peptide sequence to obtain physicochemical property information of the peptide sequence;
the identification module is used for acquiring a peptide identification result through fingerprint information, evolution information, physicochemical property information and a trained peptide sequence identification model, wherein the peptide sequence identification model comprises a first feature extraction network, a second feature extraction network and a third feature extraction network, the first feature extraction network extracts fingerprint features from the fingerprint information, the second feature extraction network extracts evolution features from the evolution information, the third feature extraction network extracts physicochemical property features from the physicochemical property information, the fingerprint features, the evolution features and the physicochemical property features are fused to acquire fused information, and the fused information is identified to acquire the peptide identification result;
the first feature extraction network adopts a multi-channel convolutional neural network, and a channel attention mechanism is added in the multi-channel convolutional neural network; the second feature extraction network adopts a bidirectional long short-term memory network; the third feature extraction network adopts a multi-head self-attention network;
acquiring a training set and a verification set for each training, training the constructed peptide sequence recognition model through the training set for each training, and verifying the training effect of the peptide sequence recognition model through the verification set for each training;
selecting samples with model prediction error times exceeding the set error times in the training process of the last set round number from the verification set for the training to form a verification set after screening; selecting samples which are accurately classified in the training process of the last set number of rounds from the training set for the training to form a training set after screening; selecting samples from the screened verification set to form a verification set to-be-exchanged sample set, selecting samples from the screened training set to form a training set to-be-exchanged sample set, exchanging the verification set to-be-exchanged sample set in the verification set for the training with the training set to-be-exchanged sample set in the training set for the training to form a new verification set and a new training set, and using the new verification set and the new training set as the verification set and the training set for the next training;
and training the constructed peptide sequence recognition model through a training set and a verification set for the last training, and obtaining the trained peptide sequence recognition model after the training is finished.
4. An electronic device comprising a memory and a processor and computer instructions stored in the memory and executed on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method for predicting anti-cancer and anti-microbial peptides based on deep neural networks of any one of claims 1-2.
5. A computer readable storage medium for storing computer instructions which, when executed by a processor, perform the steps of the method for predicting anti-cancer and anti-microbial peptides based on a deep neural network according to any one of claims 1 to 2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211352672.XA CN115512396B (en) | 2022-11-01 | 2022-11-01 | Method and system for predicting anti-cancer peptide and antibacterial peptide based on deep neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115512396A (en) | 2022-12-23 |
CN115512396B (en) | 2023-04-07 |
Family
ID=84511877
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115512396B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116130004B (en) * | 2023-01-06 | 2024-05-24 | 成都侣康科技有限公司 | Identification processing method and system for antibacterial peptide |
CN116206690B (en) * | 2023-05-04 | 2023-08-08 | 山东大学齐鲁医院 | Antibacterial peptide generation and identification method and system |
CN118314949A (en) * | 2024-04-26 | 2024-07-09 | 海南大学 | Anticancer peptide prediction method and related device |
CN118486376A (en) * | 2024-07-15 | 2024-08-13 | 山东大学 | Antibacterial peptide and anti-inflammatory peptide identification method and system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2022016125A1 (en) * | 2020-07-17 | 2022-01-20 | Genentech, Inc. | Attention-based neural network to predict peptide binding, presentation, and immunogenicity |
CN114863997A (en) * | 2022-06-17 | 2022-08-05 | 常州大学 | Anti-cancer peptide prediction method based on bidirectional long-short term memory network and feature fusion |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104484580B (en) * | 2014-11-28 | 2017-08-25 | 深圳先进技术研究院 | Antibacterial peptide Activity Prediction method based on Multi-label learning |
WO2019178056A1 (en) * | 2018-03-12 | 2019-09-19 | Massachusetts Institute Of Technology | Computational platform for in silico combinatorial sequence space exploration and artificial evolution of peptides |
CN112614538A (en) * | 2020-12-17 | 2021-04-06 | 厦门大学 | Antibacterial peptide prediction method and device based on protein pre-training characterization learning |
CN113593632B (en) * | 2021-08-09 | 2023-09-05 | 山东大学 | Polypeptide anticancer function recognition method, system, medium and equipment |
CN114743591A (en) * | 2022-03-14 | 2022-07-12 | 中国科学院深圳理工大学(筹) | Recognition method and device for MHC (major histocompatibility complex) bindable peptide chain and terminal equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||