CN115512396B - Method and system for predicting anti-cancer peptide and antibacterial peptide based on deep neural network - Google Patents

Info

Publication number
CN115512396B
Authority
CN
China
Prior art keywords
training
information
peptide sequence
feature extraction
physicochemical property
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211352672.XA
Other languages
Chinese (zh)
Other versions
CN115512396A (en)
Inventor
柳军涛
周婉芸
刘雨菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN202211352672.XA
Publication of CN115512396A
Application granted
Publication of CN115512396B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/12 Fingerprints or palmprints
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a method and a system for predicting anti-cancer peptides and antibacterial peptides based on a deep neural network, belonging to the technical field of peptide identification. The method comprises: obtaining a peptide sequence; extracting fingerprint information, evolution information and physicochemical property information of the peptide sequence; and obtaining a peptide identification result from the fingerprint information, the evolution information, the physicochemical property information and a trained peptide sequence identification model. The peptide sequence identification model comprises a first, a second and a third feature extraction network: the first extracts fingerprint features from the fingerprint information, the second extracts evolution features from the evolution information, and the third extracts physicochemical property features from the physicochemical property information. The fingerprint features, the evolution features and the physicochemical property features are fused to obtain fusion information, and the fusion information is identified to obtain the peptide identification result. The accuracy of the peptide identification result is thereby improved.

Description

Method and system for predicting anti-cancer peptide and antibacterial peptide based on deep neural network
Technical Field
The invention relates to the technical field of peptide prediction, in particular to a method and a system for predicting anti-cancer peptides and antibacterial peptides based on a deep neural network.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Antibacterial peptides and anticancer peptides are bioactive peptides composed of multiple amino acids. Antibacterial peptides can address the loss of drug efficacy caused by drug resistance in bacterial pathogens, and anticancer peptides can effectively control the resistance of cancer cells to anticancer drugs, thereby improving the curative effect of those drugs. Accurate prediction of anticancer and antibacterial peptides therefore plays an important role in the treatment of cancer and the design of antibacterial drugs.
Existing techniques for predicting anticancer and antibacterial peptides mainly consist of two parts, feature extraction and model prediction, and most of them train a model by simply combining an existing sequence feature extraction method with a deep learning network. Although these methods achieve some predictive performance, they have three limitations:
1. In feature extraction, no feature represents the global information of a peptide chain, and specific physicochemical properties are often used directly as physicochemical property features, which makes the sequence feature information redundant and of low quality.
2. No machine learning or deep learning method is designed specifically for each kind of feature. Many models process multiple features with the same or similar neural networks, so the sequence feature information is not used appropriately.
3. Conventional neural network training randomly divides the data into a training set and a verification set to obtain the final model; it does not use the network's preference on the existing data set to divide the training set and the verification set reasonably. Because the division is random, results on the same test set deviate widely between divisions, so the prediction performance of the final network model is extremely unstable.
For these reasons, existing methods for identifying anticancer and antibacterial peptides have low identification accuracy and poor generalization ability.
Disclosure of Invention
To solve the above problems, the invention provides a method and a system for predicting anticancer peptides and antibacterial peptides based on a deep neural network, which improve the accuracy of peptide sequence identification.
To achieve this purpose, the invention adopts the following technical solutions:
In a first aspect, a method for predicting anticancer peptides and antibacterial peptides based on a deep neural network is provided, comprising:
obtaining a peptide sequence;
determining the physicochemical properties of each amino acid in the peptide sequence;
extracting evolution information of the peptide sequence, and extracting fingerprint information and physicochemical property information of the peptide sequence according to the physicochemical properties of the amino acids;
obtaining a peptide identification result from the fingerprint information, the evolution information, the physicochemical property information and a trained peptide sequence identification model, wherein the peptide sequence identification model comprises a first, a second and a third feature extraction network; the first feature extraction network extracts fingerprint features from the fingerprint information, the second extracts evolution features from the evolution information, and the third extracts physicochemical property features from the physicochemical property information; the fingerprint features, the evolution features and the physicochemical property features are fused to obtain fusion information, and the fusion information is identified to obtain the peptide identification result.
In a second aspect, a deep neural network based anticancer and antibacterial peptide prediction system is provided, comprising:
a data acquisition module for acquiring a peptide sequence;
an information extraction module for extracting evolution information of the peptide sequence, determining the physicochemical properties of each amino acid in the peptide sequence, and extracting fingerprint information and physicochemical property information of the peptide sequence according to the physicochemical properties of the amino acids;
an identification module for obtaining a peptide identification result from the fingerprint information, the evolution information, the physicochemical property information and the trained peptide sequence identification model, wherein the peptide sequence identification model comprises a first, a second and a third feature extraction network; the first feature extraction network extracts fingerprint features from the fingerprint information, the second extracts evolution features from the evolution information, and the third extracts physicochemical property features from the physicochemical property information; the fingerprint features, the evolution features and the physicochemical property features are fused to obtain fusion information, and the fusion information is identified to obtain the peptide identification result.
In a third aspect, an electronic device is provided, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method for predicting anticancer peptides and antibacterial peptides based on a deep neural network.
In a fourth aspect, a computer-readable storage medium is provided for storing computer instructions which, when executed by a processor, perform the steps of the method for predicting anticancer peptides and antibacterial peptides based on a deep neural network.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention obtains comprehensive feature information of the peptide sequence by extracting its evolution information, fingerprint information and physicochemical property information, fuses the three kinds of features, and obtains the peptide identification result by identifying the fused information, which improves the accuracy of the peptide identification result.
2. The invention extracts features from the evolution information, the fingerprint information and the physicochemical property information through different feature extraction networks, so that effective features can be extracted from each kind of information and the information is used appropriately, which effectively ensures the accuracy of the peptide identification result.
3. When the peptide sequence recognition model is trained, the training set and the verification set are first divided randomly and then re-divided according to the preference of the peptide sequence recognition model. Training the model on these sets improves the recognition accuracy of the trained model and further ensures the accuracy of the peptide identification result.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, are included to provide a further understanding of the application, and the description of the exemplary embodiments and illustrations of the application are intended to explain the application and are not intended to limit the application.
FIG. 1 is a flow block diagram of the method disclosed in Example 1;
FIG. 2 is a block diagram of the first feature extraction network disclosed in Example 1;
FIG. 3 is a block diagram of the second feature extraction network disclosed in Example 1;
FIG. 4 is a block diagram of the third feature extraction network disclosed in Example 1.
Detailed Description
The invention is further described with reference to the following figures and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
Example 1
To improve the accuracy of the peptide identification result, this embodiment discloses a method for predicting anticancer peptides and antibacterial peptides based on a deep neural network which, as shown in FIG. 1, comprises:
S1: obtaining a peptide sequence;
S2: determining the physicochemical properties of each amino acid in the peptide sequence; extracting evolution information of the peptide sequence, and extracting fingerprint information and physicochemical property information of the peptide sequence according to the physicochemical properties of the amino acids;
S3: obtaining a peptide identification result from the fingerprint information, the evolution information, the physicochemical property information and a trained peptide sequence identification model, wherein the peptide sequence identification model comprises a first, a second and a third feature extraction network; the first feature extraction network extracts fingerprint features from the fingerprint information, the second extracts evolution features from the evolution information, and the third extracts physicochemical property features from the physicochemical property information; the fingerprint features, the evolution features and the physicochemical property features are fused to obtain fusion information, and the fusion information is identified to obtain the peptide identification result.
In a specific implementation, 158 physicochemical properties are determined for each amino acid in the peptide sequence; their values are taken from the physicochemical property database (AAindex).
The evolution information of a peptide sequence is represented by constructing its position-specific scoring matrix (PSSM), which is obtained with PSI-BLAST. With 20 amino acid types, the evolution information of the peptide sequence is an $L \times 20$ PSSM feature matrix, where $L$ is the length of the peptide sequence, and the $(i, j)$-th element of the matrix is proportional to the probability that the amino acid at position $i$ is substituted by amino acid $j$.
The process of extracting the fingerprint information of the peptide sequence according to the physicochemical properties of the amino acids comprises the following steps:
S21: A CGR curve of the peptide is constructed according to the physicochemical properties of the amino acids. The specific steps are as follows:
All amino acids are sorted by the value of the physicochemical property and mapped uniformly onto the unit circle to construct the CGR curve of the peptide. The coordinates on the unit circle are:
[formula image not reproduced]
where the symbol in the formula denotes the $i$-th amino acid of the peptide sequence.
For the $i$-th amino acid, the coordinates in the unit circle are:
[formula image not reproduced]
where $N$ denotes the length of the peptide sequence.
For each peptide sequence, 158 CGR curves are obtained, one for each of the 158 physicochemical properties selected from the physicochemical property database (AAindex).
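The construction in S21 can be sketched as follows. Because the coordinate formulas are not available in this text, the sketch assumes the common chaos-game rule: the 20 amino acids are sorted by one property and placed at angles $2\pi k/20$ on the unit circle, and each residue moves the current point halfway toward its vertex. The starting point at the origin, the function names `circle_vertices` and `cgr_curve`, and the toy property values are all assumptions for illustration, not the patented formula.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def circle_vertices(prop):
    """Place the 20 amino acids on the unit circle, ordered by property value."""
    ordered = sorted(AMINO_ACIDS, key=lambda aa: prop[aa])
    n = len(ordered)
    return {aa: (math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
            for k, aa in enumerate(ordered)}

def cgr_curve(seq, prop):
    """Return one CGR point per residue using the midpoint (chaos game) rule."""
    vertices = circle_vertices(prop)
    x, y = 0.0, 0.0  # assumed starting point
    points = []
    for aa in seq:
        vx, vy = vertices[aa]
        x, y = 0.5 * (x + vx), 0.5 * (y + vy)  # move halfway toward the vertex
        points.append((x, y))
    return points

# Made-up property values (alphabet index), for illustration only.
toy_property = {aa: float(i) for i, aa in enumerate(AMINO_ACIDS)}
curve = cgr_curve("GLFDIVKKV", toy_property)
```

Repeating this with each of the 158 selected properties would give the 158 CGR curves of one peptide.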
S22: The CGR curve is divided into several sub-blocks, and the points on the boundaries of adjacent sub-blocks are determined. The divided CGR curve is then rotated to obtain the corresponding points of the rotated sub-blocks and the points on the boundaries of the adjacent rotated sub-blocks. The Euclidean distances between the points on adjacent boundaries, and between the corresponding points after rotation, are calculated to form distance matrices.
Preferably, the CGR curve is divided into four sub-blocks according to the four quadrants of the coordinate system. Using the position information of the points on the boundaries of adjacent sub-blocks, the coordinate axes are rotated by 45° to obtain four rotated sub-blocks. For each CGR curve, the Euclidean distances between all point pairs within each of the eight sub-blocks are calculated, giving eight distance matrices.
S23: The principal eigenvalue of each distance matrix is extracted, and the principal eigenvalues form the fingerprint information of the peptide sequence.
The principal eigenvalue of each distance matrix is calculated to obtain the feature matrix $M_{DCGR}$:
[formula images not reproduced]
where the per-curve term in the formula is the matrix of principal eigenvalues of one CGR curve, and each of its entries is the principal eigenvalue of one distance matrix.
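A minimal sketch of S22 and S23 for a single CGR curve is given below, under simplifying assumptions: points are assigned to quadrants by the signs of their coordinates, the boundary-point handling described above is omitted, and a sub-block with fewer than two points contributes 0. The names `quadrant_blocks`, `leading_eigenvalue` and `cgr_fingerprint` are illustrative only.

```python
import numpy as np

def quadrant_blocks(points):
    """Split 2-D points into four blocks by the signs of their coordinates."""
    x, y = points[:, 0], points[:, 1]
    masks = [(x >= 0) & (y >= 0), (x < 0) & (y >= 0),
             (x < 0) & (y < 0), (x >= 0) & (y < 0)]
    return [points[m] for m in masks]

def leading_eigenvalue(block):
    """Largest eigenvalue of the pairwise Euclidean distance matrix of a block."""
    if len(block) < 2:
        return 0.0  # assumed convention for empty or single-point blocks
    diff = block[:, None, :] - block[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return float(np.linalg.eigvalsh(dist)[-1])  # eigvalsh sorts ascending

def cgr_fingerprint(points):
    """Eight leading eigenvalues: four quadrants before and after a 45-degree rotation."""
    points = np.asarray(points, dtype=float)
    c = s = np.sqrt(0.5)
    rotation = np.array([[c, s], [-s, c]])  # coordinates in axes rotated by 45 degrees
    rotated = points @ rotation.T
    blocks = quadrant_blocks(points) + quadrant_blocks(rotated)
    return np.array([leading_eigenvalue(b) for b in blocks])

rng = np.random.default_rng(0)
toy_points = rng.uniform(-1, 1, size=(30, 2))  # stand-in for one CGR curve
row = cgr_fingerprint(toy_points)
```

Stacking one such 8-dimensional row per physicochemical property would give a 158 × 8 matrix, matching the shape later used for $M_{DCGR}$.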
The process of extracting the physicochemical property information of the peptide sequence according to the physicochemical properties of the amino acids comprises:
clustering all physicochemical properties in the physicochemical property database, and extracting the most representative property of each cluster as a representative physicochemical property of the amino acids;
extracting the representative physicochemical properties from the physicochemical properties of each amino acid of the peptide sequence to obtain the physicochemical property information of the peptide sequence.
Conventional methods usually select specific physicochemical properties directly. In this embodiment, to avoid redundancy and obtain a more comprehensive distribution of physicochemical properties, the 556 physicochemical properties in AAindex are first divided into 8 clusters.
The most representative physicochemical property of each cluster is then extracted, giving eight representative physicochemical properties, and each amino acid is encoded as an eight-dimensional vector using PCPE (physical and chemical property embedding).
From the encoding of each amino acid in the peptide sequence, an $L \times 8$ feature matrix of the peptide sequence is constructed, which is the physicochemical property information of the peptide sequence.
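The clustering and encoding steps can be sketched as below. The source does not name the clustering algorithm or define "most representative", so this sketch assumes plain k-means and picks the property closest to each cluster centroid. The random matrix stands in for the AAindex table (556 properties by 20 amino acids), and `kmeans`, `representative_properties` and `pcpe_encode` are hypothetical names.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():  # keep the old centroid if a cluster empties
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

def representative_properties(X, k=8):
    """Index of the property closest to each cluster centroid (assumed criterion)."""
    centroids, labels = kmeans(X, k)
    reps = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if members.size == 0:  # fall back to all properties for an empty cluster
            members = np.arange(len(X))
        d = ((X[members] - centroids[j]) ** 2).sum(axis=1)
        reps.append(int(members[d.argmin()]))
    return reps

def pcpe_encode(seq, X, reps):
    """Encode a peptide as an L x len(reps) matrix of representative property values."""
    cols = [AMINO_ACIDS.index(aa) for aa in seq]
    return X[np.ix_(reps, cols)].T

rng = np.random.default_rng(1)
toy_aaindex = rng.normal(size=(556, 20))  # stand-in values, not real AAindex data
reps = representative_properties(toy_aaindex, k=8)
encoded = pcpe_encode("GLFDIVKKV", toy_aaindex, reps)
```

For a peptide of length 9 this yields a 9 × 8 matrix, one eight-dimensional vector per residue.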
Before the fingerprint information, the evolution information and the physicochemical property information are input into the trained peptide sequence identification model, the lengths of the evolution information (the PSSM matrix) and of the physicochemical property information (the PCPE matrix) are unified to suit the subsequent peptide sequence recognition model.
Specifically, the length may be uniformly set to 50; shorter inputs are padded with 0.
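The length unification can be illustrated with a short helper. The zero padding follows the text; truncating sequences longer than 50 is an assumption, since the source only describes the case where the length is insufficient. `pad_to_length` is a hypothetical name.

```python
import numpy as np

def pad_to_length(matrix, target_len=50):
    """Zero-pad (or, by assumption, truncate) an L x d matrix along the sequence axis."""
    matrix = np.asarray(matrix, dtype=float)
    out = np.zeros((target_len, matrix.shape[1]))
    n = min(len(matrix), target_len)
    out[:n] = matrix[:n]
    return out

pssm_padded = pad_to_length(np.ones((12, 20)))  # toy L x 20 PSSM with L = 12
pcpe_padded = pad_to_length(np.ones((60, 8)))   # toy L x 8 PCPE matrix with L = 60
```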
The peptide sequence recognition model is described in detail.
In order to extract effective features from fingerprint information, evolution information and physicochemical property information, a first feature extraction network, a second feature extraction network and a third feature extraction network are arranged in a peptide sequence identification model, fingerprint features are extracted from the fingerprint information through the first feature extraction network, evolution features are extracted from the evolution information through the second feature extraction network, and physicochemical property features are extracted from the physicochemical property information through the third feature extraction network.
The first feature extraction network is a multi-channel convolutional neural network with an added channel attention mechanism. As shown in FIG. 2, the fingerprint information $M_{DCGR}$ is input into the first feature extraction network, which captures important features through local connectivity and weight sharing.
Because the rows of the feature matrix $M_{DCGR}$ correspond to the 158 physicochemical properties and its columns to the 8-dimensional feature extracted from each CGR curve, the model uses a convolution kernel of size 1 × 8, which suits this layout better than a conventional square kernel, to learn weights shared across the 158 physicochemical properties. In the present invention, the number of filters is set to 16.
A channel attention module (CAM) obtains global information by considering all 158 properties together.
The specific steps are as follows:
(1) A three-dimensional feature map is obtained through a convolution layer: $M'_{DCGR} = f_{conv}(M_{DCGR})$.
(2) Global average pooling and global max pooling are applied to $M'_{DCGR}$, and the pooled results are passed through a multi-layer perceptron (MLP) with shared weights consisting of two fully connected layers. This yields the importance of each channel $i$ of $M'_{DCGR}$ for classification, expressed as the channel weight $CAM_i$.
The overall process of the CAM is expressed as:
$$CAM = \sigma\big(MLP(AvgPool(M'_{DCGR})) + MLP(MaxPool(M'_{DCGR}))\big) = \sigma\big(W_1(W_0(F_{avg})) + W_1(W_0(F_{max}))\big)$$
where $\sigma$ denotes the sigmoid function, $F_{avg}$ and $F_{max}$ are obtained by average pooling and max pooling respectively, and $W_0$ and $W_1$ are the weight matrices of the shared MLP.
The channel weights are applied to the corresponding channels of the three-dimensional feature map $M'_{DCGR}$ by element-wise multiplication, giving a reweighted feature map. The reweighted feature map is flattened and passed through a fully connected layer to generate the final 400-dimensional DCGR feature vector.
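The first branch can be sketched in NumPy as follows, using random weights in place of trained ones. The 1 × 8 kernels, 16 filters and 400-dimensional output follow the text; the ReLU inside the shared MLP, its hidden width of 4, the omission of bias terms, and the names `conv_1x8` and `channel_attention` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv_1x8(m_dcgr, kernels):
    """Apply 16 kernels of size 1 x 8 to a 158 x 8 matrix -> feature map (16, 158, 1)."""
    return (m_dcgr @ kernels.T).T[:, :, None]

def channel_attention(fmap, w0, w1):
    """One weight per channel from average- and max-pooled descriptors (shared MLP)."""
    f_avg = fmap.mean(axis=(1, 2))
    f_max = fmap.max(axis=(1, 2))
    mlp = lambda v: w1 @ np.maximum(w0 @ v, 0.0)  # ReLU is an assumption
    return sigmoid(mlp(f_avg) + mlp(f_max))

rng = np.random.default_rng(0)
m_dcgr = rng.normal(size=(158, 8))           # toy fingerprint matrix
kernels = rng.normal(size=(16, 8))           # 16 filters of size 1 x 8
w0 = rng.normal(size=(4, 16)) * 0.1          # hidden width 4 is an assumption
w1 = rng.normal(size=(16, 4)) * 0.1
w_fc = rng.normal(size=(400, 16 * 158)) * 0.01

fmap = conv_1x8(m_dcgr, kernels)             # (16, 158, 1)
cam = channel_attention(fmap, w0, w1)        # (16,)
reweighted = fmap * cam[:, None, None]       # element-wise channel reweighting
f_dcgr = w_fc @ reweighted.reshape(-1)       # 400-dimensional DCGR feature vector
```

Each 1 × 8 kernel spans one full row of $M_{DCGR}$, so the same kernel weights are reused for all 158 properties.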
The second feature extraction network is a bidirectional long short-term memory network (Bi-LSTM). As shown in FIG. 3, the evolution information (the PSSM matrix) is input into the Bi-LSTM.
The calculation of the forward LSTM is summarized as follows:
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$
$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$
where $W_f$, $W_i$, $W_o$ and $W_c$ are weight matrices; $b_f$, $b_i$, $b_o$ and $b_c$ are bias vectors; $f_t$ is the forget gate; $i_t$ is the input gate; $o_t$ is the output gate; $x_t$ is the current input; $c_{t-1}$ is the previous cell state; $c_t$ is the current cell state; $\tilde{c}_t$ is the new value added to the cell state; $h_{t-1}$ and $h_t$ are the previous and current hidden states, respectively; and $\odot$ denotes element-wise multiplication.
The backward LSTM works on the same principle as the forward LSTM. The current hidden state of the Bi-LSTM combines the forward and backward hidden states, and the hidden state at the last time step is taken as the 256-dimensional PSSM feature vector.
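A minimal NumPy sketch of the second branch follows. It implements the standard LSTM gate equations with random weights. Concatenating a 128-dimensional forward state with a 128-dimensional backward state to reach 256 dimensions is an assumption, as are the names `init_lstm`, `lstm_last_hidden` and `bilstm_features`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_lstm(input_dim, hidden_dim, rng):
    """One weight matrix and bias per gate (f, i, o) and for the candidate state (c)."""
    shape = (hidden_dim, hidden_dim + input_dim)
    return {g: (rng.normal(scale=0.1, size=shape), np.zeros(hidden_dim)) for g in "fioc"}

def lstm_last_hidden(X, params, hidden_dim):
    """Run an LSTM over the rows of X and return the final hidden state."""
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    for x_t in X:
        z = np.concatenate([h, x_t])                        # [h_{t-1}, x_t]
        f = sigmoid(params["f"][0] @ z + params["f"][1])    # forget gate
        i = sigmoid(params["i"][0] @ z + params["i"][1])    # input gate
        o = sigmoid(params["o"][0] @ z + params["o"][1])    # output gate
        c_new = np.tanh(params["c"][0] @ z + params["c"][1])  # candidate cell value
        c = f * c + i * c_new                               # current cell state
        h = o * np.tanh(c)                                  # current hidden state
    return h

def bilstm_features(X, fwd, bwd, hidden_dim=128):
    """Concatenate the final forward and backward hidden states (assumed combination)."""
    return np.concatenate([lstm_last_hidden(X, fwd, hidden_dim),
                           lstm_last_hidden(X[::-1], bwd, hidden_dim)])

rng = np.random.default_rng(0)
pssm = rng.normal(size=(50, 20))   # toy padded PSSM
fwd, bwd = init_lstm(20, 128, rng), init_lstm(20, 128, rng)
f_pssm = bilstm_features(pssm, fwd, bwd)
```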
The third feature extraction network is a multi-head self-attention network (Transformer network), which takes the feature matrix obtained by PCPE as input. Because the order of residues plays a crucial role in a peptide sequence, this example uses sine and cosine position encoding to reflect the distribution of physicochemical properties along the peptide sequence, as follows:
$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
where $pos$ is the amino acid position in the peptide sequence, $2i$ and $2i+1$ are the even and odd element positions of the embedding vector, $0 \le i < d_{model}/2$, and $d_{model}$ is the dimension of the embedding vector.
A new feature matrix incorporating the position encoding information is obtained, $X = [x_1, x_2, \dots, x_L]$, where $x_i$ is the feature vector of the $i$-th residue and $L = 50$ is the peptide chain length.
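The sine and cosine position encoding can be sketched as below for a padded length of 50 and an embedding dimension of 8. Adding the encoding to the PCPE matrix element-wise is the usual Transformer convention and is assumed here; `positional_encoding` is a hypothetical name and the PCPE matrix is random stand-in data.

```python
import numpy as np

def positional_encoding(length=50, d_model=8):
    """Sinusoidal position encoding: sin on even columns, cos on odd columns."""
    pos = np.arange(length)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = positional_encoding()
rng = np.random.default_rng(0)
pcpe_matrix = rng.normal(size=(50, 8))   # toy padded PCPE matrix
x_with_position = pcpe_matrix + pe       # assumed: element-wise addition
```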
As shown in FIG. 4, the position-encoded feature matrix $X$ is input into the encoder of the Transformer. Specifically, a single-head self-attention mechanism extracts the dependencies between amino acid residues at any distance, which effectively captures the physicochemical property distribution of the peptide. The calculation is as follows:
$$Q = X W^Q, \quad K = X W^K, \quad V = X W^V$$
$$A = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right)$$
$$\mathrm{Attention}(Q, K, V) = A V$$
where $A$ is the attention score matrix; $Q$, $K$ and $V$ are the query, key and value, respectively; $d_k$ is the dimension of the key; and $W^Q$, $W^K$ and $W^V$ are the corresponding weight matrices.
Average pooling is applied to the feature matrix output by the encoder to obtain the 50-dimensional PCPE feature vector of the peptide chain.
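The attention step and the pooling can be sketched with scaled dot-product self-attention in NumPy. The sketch leaves out the residual connections, layer normalization and feed-forward sublayer of a full Transformer encoder, and it assumes that the average pooling runs over the feature dimension so that a 50 × 8 matrix becomes a 50-dimensional vector. Weights are random and the function names are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row maximum for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the rows of X."""
    q, k, v = X @ w_q, X @ w_k, X @ w_v
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (L, L) attention scores
    return scores @ v, scores

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))                          # toy position-encoded PCPE matrix
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
encoded, scores = self_attention(X, w_q, w_k, w_v)
f_pcpe = encoded.mean(axis=1)                         # assumed pooling axis -> 50 values
```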
The fingerprint features, the evolution features and the physicochemical property features are fused to obtain the fusion information, and the fusion information is identified to obtain the peptide identification result, specifically:
After batch normalization, the 400-dimensional, 256-dimensional and 50-dimensional feature vectors are obtained as the outputs of the three branches.
The feature vectors output by the branches are concatenated and passed through a Dropout layer and a fully connected layer, which are trained by back propagation, to obtain the final anticancer peptide or antibacterial peptide prediction result.
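The fusion step can be sketched as a concatenation followed by dropout and one fully connected layer. The sigmoid output for a binary decision, the inverted-dropout formulation and the name `fuse_and_classify` are assumptions; batch normalization is omitted and the weights are random.

```python
import numpy as np

def fuse_and_classify(f_dcgr, f_pssm, f_pcpe, w, b, drop_rate=0.0, rng=None):
    """Concatenate the three branch outputs and map them to one probability."""
    fused = np.concatenate([f_dcgr, f_pssm, f_pcpe])   # 400 + 256 + 50 = 706
    if drop_rate > 0.0 and rng is not None:            # inverted dropout, training only
        keep = rng.random(fused.shape) >= drop_rate
        fused = fused * keep / (1.0 - drop_rate)
    logit = float(w @ fused + b)
    return 1.0 / (1.0 + np.exp(-logit))                # assumed sigmoid output

rng = np.random.default_rng(0)
branches = rng.normal(size=400), rng.normal(size=256), rng.normal(size=50)
w = rng.normal(size=706) * 0.01
prob_eval = fuse_and_classify(*branches, w, 0.0)                 # inference: no dropout
prob_train = fuse_and_classify(*branches, w, 0.0, 0.5, rng)      # training-style pass
```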
The process of obtaining the trained peptide sequence recognition model is as follows:
acquiring a training set and a verification set for each training, training the constructed peptide sequence recognition model through the training set for each training, and verifying the training effect of the peptide sequence recognition model through the verification set for each training;
selecting samples with model prediction error times exceeding the set error times in the training process of the last set round number from the verification set for the training to form a verification set after screening; selecting samples which are accurately classified in the training process of the last set number of rounds from the training set for the training to form a training set after screening; selecting samples from the screened verification set to form a verification set to-be-exchanged sample set, selecting samples from the screened training set to form a training set to-be-exchanged sample set, exchanging the verification set to-be-exchanged sample set in the verification set for the training with the training set to-be-exchanged sample set in the training set for the training to form a new verification set and a new training set, and using the new verification set and the new training set as the verification set and the training set for the next training.
The process of obtaining the training set and the verification set for the first training is as follows:
acquiring a peptide sequence for training, and labeling the peptide sequence for training;
the evolution information, fingerprint information and physicochemical property information of the training peptide sequence are extracted from the training peptide sequence to form a training data set, and the training data set is randomly divided into a training set and a verification set, namely the training set and the verification set for the first training.
The constructed peptide sequence recognition model is trained using the training set and the verification set for the last training, and the trained peptide sequence recognition model is obtained when the training is finished.
In a specific implementation:
S31: acquire the peptide sequences for training and label them;
S32: randomly divide the data into a training set $T$ and a verification set $V$, which form the training set and the verification set for the first training. Each set consists of the sample features and the corresponding sample labels.
S33: train the constructed peptide sequence recognition model on the training set, and verify the training effect of the peptide sequence recognition model on the verification set;
S34: in the verification set $V$, find the samples that were predicted incorrectly more than 5 times during the last 10 rounds of training, generating the screened verification set. At the same time, find the samples in the training set that were classified correctly in the last 10 rounds of training, generating the screened training set. Randomly select [k/2] samples from the screened verification set to form the sample set $V_{change}$, and randomly select [k/2] samples from the screened training set to form the sample set $T_{change}$. Exchange the sample set $T_{change}$ in the training set $T$ with the sample set $V_{change}$ in the verification set $V$ to construct a new training set $T_{new}$ and a new verification set $V_{new}$, i.e., $T_{new} = T - T_{change} + V_{change}$ and $V_{new} = V - V_{change} + T_{change}$.
S35: on the two new sets $T_{new}$ and $V_{new}$, repeat S33 and S34 to obtain the final training set and the final verification set.
S36: reinitialize the peptide sequence recognition model and train it on the final training set and the final verification set to obtain the trained peptide sequence recognition model.
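The exchange in S34 can be sketched in plain Python as a pair of set operations. This is an illustration under stated assumptions, not the patented implementation: the function name `exchange_samples`, the integer sample ids, the error counts and the value of `k` are hypothetical, "classified correctly" is read as correct in every one of the last 10 rounds, and capping the number of exchanged samples at what is available is an added safeguard.

```python
import random

def exchange_samples(train, val, val_errors, train_correct, k, rng,
                     max_errors=5):
    """One S34 exchange step between training and verification sets.

    train, val    : sets of sample ids
    val_errors    : id -> wrong predictions during the last 10 rounds
    train_correct : ids classified correctly in the last 10 rounds
    """
    # Screened verification set: misclassified more than max_errors times.
    val_hard = sorted(s for s in val if val_errors.get(s, 0) > max_errors)
    # Screened training set: training samples that were classified correctly.
    train_easy = sorted(s for s in train if s in train_correct)

    # Draw [k/2] samples from each screened set (capped by availability).
    n = min(k // 2, len(val_hard), len(train_easy))
    v_change = set(rng.sample(val_hard, n))
    t_change = set(rng.sample(train_easy, n))

    # T_new = T - T_change + V_change ; V_new = V - V_change + T_change
    t_new = (train - t_change) | v_change
    v_new = (val - v_change) | t_change
    return t_new, v_new

# Hypothetical data: 80 training ids, 20 verification ids.
train = set(range(0, 80))
val = set(range(80, 100))
val_errors = {80: 7, 81: 6, 82: 3, 83: 9}   # 82 stays: only 3 errors
train_correct = set(range(0, 40))

t_new, v_new = exchange_samples(train, val, val_errors, train_correct,
                                k=4, rng=random.Random(0))
print(len(t_new), len(v_new))
```

Drawing the same number of samples from each side keeps both set sizes unchanged, so repeating S33 and S34 only changes which samples the model is trained and verified on.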
The prediction method provided by this embodiment can effectively fuse multiple aspects of sequence information. On this basis, a three-branch neural network framework (TriNet) is designed according to the characteristics of the three kinds of features, so that each feature is properly processed and the features are effectively fused for the final prediction.
On the ACP740 dataset, using five-fold cross-validation, TriNet is compared with ACP-DL, MHCNN, iACP-DRLF, CL-ACP and DeepACPpred; the percentage improvements in accuracy, sensitivity, specificity, precision, F1 score and Matthews correlation coefficient are 3.2%-8.6%, 1.9%-6.9%, 3.2%-21.5%, 3.0%-12.0% and 3.2%-6.9%, respectively. On the ACPmain dataset, using five-fold cross-validation, TriNet is compared with ACP-DL, MHCNN, iACP-DRLF, AntiCP 2.0-AAC and AntiCP 2.0-DPC; the percentage improvements in accuracy, sensitivity, specificity, precision, F1 score and Matthews correlation coefficient are 9.3%-23.7%, 15.3%-47.3%, 3.6%-13.6%, 5.6%-14.3%, 10.4%-28.2% and 25.4%-73.1%, respectively. On the AMP2828 dataset, using its independent test set, TriNet is compared with ACP-DL, MHCNN, iACP-DRLF, AntiCP 2.0-AAC and AntiCP 2.0-DPC; the percentage improvements in accuracy, sensitivity, specificity, precision, F1 score and Matthews correlation coefficient are 6.1%-29.5%, 0.7%-11.0%, 1.8%-79.4%, 2.3%-44.6%, 6.7%-22.8% and 13.2%-67.4%, respectively.
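For reference, the six evaluation metrics can be computed from the confusion matrix of a binary classifier using their standard definitions. The sketch below is illustrative: the function name `binary_metrics` and the example counts are hypothetical and are not results from the datasets above.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)            # recall on positive peptides
    specificity = tn / (tn + fp)            # recall on negative peptides
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }

# Made-up counts for illustration only.
m = binary_metrics(tp=80, fp=10, tn=90, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

The Matthews correlation coefficient uses all four counts, so it remains informative when the positive and negative classes are imbalanced, which may explain why its reported improvement ranges differ most from those of accuracy.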
In addition to the commonly used evolutionary features, two new features, sequence fingerprint information and physicochemical property information, are introduced in this embodiment to properly characterize the global sequence information and the physicochemical property distribution of polypeptides. Test results show that the proposed features are widely applicable and effective. Specifically, the feature matrices of the branches are flattened and concatenated, and the traditional machine learning method XGBoost is used for prediction in place of the neural network structure. On the ACP740 dataset, the percentage improvements in accuracy and Matthews correlation coefficient are 0.47%-5.8% and 0.28%-15.3%, respectively; on the ACPmain dataset, they are 4.0%-17.7% and 11.3%-53.8%, respectively; on the AMP2828 dataset, they are 4.3%-27.2% and 9.1%-61.2%, respectively.
In addition, this embodiment provides a novel TVI training method, which can generate more suitable training and verification sets according to the deviation of the network model. Specifically, compared with the traditional training method, on the ACP740 dataset the average percentage improvements in accuracy, sensitivity, specificity, precision, F1 score and Matthews correlation coefficient are 2.0%, 2.1%, 1.9%, 1.9%, 2.0% and 4.6%, respectively, and the maximum percentage improvements are 3.9%, 6.6%, 8.2%, 6.6%, 4.4% and 9.1%, respectively; on the ACPmain dataset, the average percentage improvements are 1.7%, 2.3%, 1.0%, 1.1%, 1.7% and 4.5%, respectively, and the maximum percentage improvements are 4.8%, 6.0%, 6.3%, 5.6%, 4.8% and 12.8%, respectively; on the AMP2828 dataset, the average percentage improvements are 0.5%, 0.3%, 0.7%, 0.7%, 0.5% and 1.0%, respectively, and the maximum percentage improvements are 1.1%, 1.1%, 2.9%, 2.9%, 1.1% and 2.2%, respectively.
Example 2
In this embodiment, disclosed is a deep neural network-based anticancer and antibacterial peptide prediction system, comprising:
a data acquisition module for acquiring a peptide sequence;
the information extraction module is used for extracting evolution information of the peptide sequence, determining the physicochemical property of each amino acid in the peptide sequence, and extracting fingerprint information and physicochemical property information of the peptide sequence according to the physicochemical property of the amino acid;
the identification module is used for obtaining a peptide identification result through fingerprint information, evolution information, physicochemical property information and a trained peptide sequence identification model, wherein the peptide sequence identification model comprises a first feature extraction network, a second feature extraction network and a third feature extraction network, the first feature extraction network extracts fingerprint features from the fingerprint information, the second feature extraction network extracts evolution features from the evolution information, the third feature extraction network extracts physicochemical property features from the physicochemical property information, the fingerprint features, the evolution features and the physicochemical property features are fused to obtain fused information, and the fused information is identified to obtain the peptide identification result.
Example 3
In this embodiment, an electronic device is disclosed, comprising a memory and a processor, and computer instructions stored in the memory and executed on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method for predicting anti-cancer and anti-microbial peptides based on deep neural networks disclosed in embodiment 1.
Example 4
In this embodiment, a computer readable storage medium is disclosed for storing computer instructions which, when executed by a processor, perform the steps of a method for predicting anti-cancer and anti-microbial peptides based on a deep neural network disclosed in embodiment 1.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims (5)

1. A method for predicting anti-cancer peptides and antibacterial peptides based on a deep neural network is characterized by comprising the following steps:
obtaining a peptide sequence;
determining the physicochemical properties of each amino acid in the peptide sequence;
extracting evolution information of the peptide sequence, and extracting fingerprint information and physicochemical property information of the peptide sequence according to the physicochemical property of the amino acid, wherein the evolution information of the peptide sequence is represented by constructing a PSSM matrix of the peptide sequence; constructing a CGR curve of the peptide according to the physicochemical properties of the amino acid; dividing the CGR curve into a plurality of sub-blocks, and determining points on the boundary of adjacent sub-blocks; rotating the partitioned CGR curve to obtain corresponding points of the rotated sub-blocks and points on the boundaries of the adjacent sub-blocks after rotation; calculating the Euclidean distance of points on two adjacent boundaries and the Euclidean distance of corresponding points after two adjacent rotated points to form a distance matrix; extracting main characteristic values of each distance matrix to form peptide sequence fingerprint information; clustering all physicochemical properties in the physicochemical property database, and extracting the most representative property in each cluster as the representative physicochemical property of the amino acid; extracting representative physicochemical properties of amino acids from the physicochemical properties of each amino acid of the peptide sequence to obtain physicochemical property information of the peptide sequence;
obtaining a peptide identification result through fingerprint information, evolution information, physicochemical property information and a trained peptide sequence identification model, wherein the peptide sequence identification model comprises a first feature extraction network, a second feature extraction network and a third feature extraction network, the first feature extraction network extracts fingerprint features from the fingerprint information, the second feature extraction network extracts evolution features from the evolution information, the third feature extraction network extracts physicochemical property features from the physicochemical property information, the fingerprint features, the evolution features and the physicochemical property features are fused to obtain fusion information, and the fusion information is identified to obtain a peptide identification result;
the first feature extraction network adopts a multi-channel convolutional neural network, and a channel attention mechanism is added in the multi-channel convolutional neural network; the second feature extraction network adopts a bidirectional long short-term memory network; the third feature extraction network adopts a multi-head self-attention network;
acquiring a training set and a verification set for each training, training the constructed peptide sequence recognition model through the training set for each training, and verifying the training effect of the peptide sequence recognition model through the verification set for each training;
selecting samples with model prediction error times exceeding the set error times in the training process of the last set round number from the verification set for the training to form a verification set after screening; selecting samples which are accurately classified in the training process of the last set number of rounds from the training set for the training to form a training set after screening; selecting samples from the screened verification set to form a verification set to-be-exchanged sample set, selecting samples from the screened training set to form a training set to-be-exchanged sample set, exchanging the verification set to-be-exchanged sample set in the verification set for the training with the training set to-be-exchanged sample set in the training set for the training to form a new verification set and a new training set, and using the new verification set and the new training set as the verification set and the training set for the next training;
and training the constructed peptide sequence recognition model through a training set and a verification set for the last training, and obtaining the trained peptide sequence recognition model after the training is finished.
2. The method of claim 1, wherein the evolution information and the physicochemical property information of the peptide sequence are unified before inputting the fingerprint information, the evolution information and the physicochemical property information into the trained peptide sequence recognition model.
3. An anticancer peptide and antibacterial peptide prediction system based on a deep neural network, which is characterized by comprising:
a data acquisition module for acquiring a peptide sequence;
the information extraction module is used for extracting evolution information of the peptide sequence, determining the physicochemical property of each amino acid in the peptide sequence, and extracting fingerprint information and physicochemical property information of the peptide sequence according to the physicochemical property of the amino acid, wherein the evolution information of the peptide sequence is represented by constructing a PSSM matrix of the peptide sequence; constructing a CGR curve of the peptide according to the physicochemical properties of the amino acid; dividing the CGR curve into a plurality of sub-blocks, and determining points on the boundary of adjacent sub-blocks; rotating the partitioned CGR curve to obtain corresponding points of the rotated sub-blocks and points on the boundaries of the adjacent sub-blocks after rotation; calculating Euclidean distances of points on two adjacent boundaries and Euclidean distances of corresponding points after two adjacent rotated points to form a distance matrix; extracting main characteristic values of each distance matrix to form peptide sequence fingerprint information; clustering all physicochemical properties in the physicochemical property database, and extracting the most representative property in each cluster as the representative physicochemical property of the amino acid; extracting representative physicochemical properties of amino acids from the physicochemical properties of each amino acid of the peptide sequence to obtain physicochemical property information of the peptide sequence;
the identification module is used for acquiring a peptide identification result through fingerprint information, evolution information, physicochemical property information and a trained peptide sequence identification model, wherein the peptide sequence identification model comprises a first feature extraction network, a second feature extraction network and a third feature extraction network, the first feature extraction network extracts fingerprint features from the fingerprint information, the second feature extraction network extracts evolution features from the evolution information, the third feature extraction network extracts physicochemical property features from the physicochemical property information, the fingerprint features, the evolution features and the physicochemical property features are fused to acquire fused information, and the fused information is identified to acquire the peptide identification result;
the first feature extraction network adopts a multi-channel convolutional neural network, and a channel attention mechanism is added in the multi-channel convolutional neural network; the second feature extraction network adopts a bidirectional long short-term memory network; the third feature extraction network adopts a multi-head self-attention network;
acquiring a training set and a verification set for each training, training the constructed peptide sequence recognition model through the training set for each training, and verifying the training effect of the peptide sequence recognition model through the verification set for each training;
selecting samples with model prediction error times exceeding the set error times in the training process of the last set round number from the verification set for the training to form a verification set after screening; selecting samples which are accurately classified in the training process of the last set number of rounds from the training set for the training to form a training set after screening; selecting samples from the screened verification set to form a verification set to-be-exchanged sample set, selecting samples from the screened training set to form a training set to-be-exchanged sample set, exchanging the verification set to-be-exchanged sample set in the verification set for the training with the training set to-be-exchanged sample set in the training set for the training to form a new verification set and a new training set, and using the new verification set and the new training set as the verification set and the training set for the next training;
and training the constructed peptide sequence recognition model through a training set and a verification set for the last training, and obtaining the trained peptide sequence recognition model after the training is finished.
4. An electronic device comprising a memory and a processor and computer instructions stored in the memory and executed on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method for predicting anti-cancer and anti-microbial peptides based on deep neural networks of any one of claims 1-2.
5. A computer readable storage medium for storing computer instructions which, when executed by a processor, perform the steps of the method for predicting anti-cancer and anti-microbial peptides based on a deep neural network according to any one of claims 1 to 2.
CN202211352672.XA 2022-11-01 2022-11-01 Method and system for predicting anti-cancer peptide and antibacterial peptide based on deep neural network Active CN115512396B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211352672.XA CN115512396B (en) 2022-11-01 2022-11-01 Method and system for predicting anti-cancer peptide and antibacterial peptide based on deep neural network


Publications (2)

Publication Number Publication Date
CN115512396A CN115512396A (en) 2022-12-23
CN115512396B true CN115512396B (en) 2023-04-07

Family

ID=84511877


Country Status (1)

Country Link
CN (1) CN115512396B (en)



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant