CN116013404A - Multi-modal fusion deep learning model and multifunctional bioactive peptide prediction method - Google Patents


Publication number
CN116013404A
Authority
CN
China
Prior art keywords
peptide
module
deep learning
learning model
peptide sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211693605.4A
Other languages
Chinese (zh)
Inventor
康雁
张华栋
杨学昆
彭陆晗
王鑫超
袁艳聪
谢文涛
刘章琳
普康
Current Assignee
Yunnan University YNU
Original Assignee
Yunnan University YNU
Priority date
Filing date
Publication date
Application filed by Yunnan University YNU
Priority to CN202211693605.4A
Publication of CN116013404A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal fusion deep learning model and a multifunctional bioactive peptide prediction method. The model comprises a multi-modal data input module, a peptide sequence encoding module, a peptide structure encoding module and a classification module. The multi-modal data input module takes as input the peptide sequence and the peptide structure of a bioactive peptide; the peptide sequence encoding module fuses a multi-scale dilated CNN with a BiLSTM model to extract features of the peptide sequence at multiple scales; the peptide structure encoding module extracts features from the peptide structure data with a multi-layer CNN model; the classification module concatenates the outputs of the two encoding modules as the input to the final feature output layer. By effectively fusing the peptide sequence and the structural features, the model can effectively extract data characteristics from different views and thereby better predict the functions of multifunctional bioactive peptides.

Description

Multi-modal fusion deep learning model and multifunctional bioactive peptide prediction method
Technical Field
The invention relates to the field of bioactive peptide prediction, and in particular to a multi-modal fusion deep learning model and a multifunctional bioactive peptide prediction method.
Background
Bioactive peptides are small protein fragments, typically containing 2-20 amino acid residues, that play a variety of roles in metabolic and biological processes. Over the past several decades, a number of biologically active peptides with multiple functions have been identified. Accurate identification of the activity of bioactive peptides is of great importance in at least two ways: it helps elucidate the mechanism of action of bioactive peptides, and it supports the development of new natural foods and pharmaceuticals that meet safety and health requirements.
The use of computer programming in biological research has greatly increased the importance of bioinformatics. Over the past several decades, many functional peptides have been identified, making it possible for machine learning algorithms to predict different peptides. More recently, some predictive models have been dedicated to predicting peptide function from sequence information alone, without verification or any prior knowledge as input. In addition, various methods based on physicochemical features, mainly including amino acid composition, pseudo-amino acid composition, normalized amino acid composition, hydrophobicity, net charge, isoelectric point, alpha-helix propensity, beta-sheet propensity and turn propensity, have previously been proposed for predicting peptides. Structural data can effectively model the functional information of a peptide, and a peptide sequence alone provides an insufficient description; multi-modal data combining sequence data and structure data can effectively extract data characteristics from different views and thus enable better peptide prediction, but a single model can hardly capture the characteristics of such multi-modal data.
Disclosure of Invention
The aim of the invention is as follows: to address the problem that a single model can hardly capture the characteristics of multi-modal data, a multi-modal fusion deep learning model is provided. Structural properties are introduced into the model; a fusion of multi-scale dilated convolution and BiLSTM is adopted to capture the multiple activities of a bioactive peptide from its sequence; a multi-scale CNN module processes the structural input to obtain structural characteristics; the obtained multi-modal features are then processed and fused, so that the complementarity of the sequence and the structural properties of bioactive peptides is effectively exploited and the characteristics of bioactive peptides are captured from multi-modal features.
The technical scheme of the invention is as follows:
A multi-modal fusion deep learning model comprises a multi-modal data input module, a peptide sequence encoding module, a peptide structure encoding module and a classification module. The multi-modal data input module inputs the peptide sequence and the peptide structure of a bioactive peptide; the peptide sequence encoding module fuses a multi-scale dilated CNN with a BiLSTM model to extract features of the peptide sequence at multiple scales; the peptide structure encoding module extracts features of the peptide structure data with a multi-layer CNN model; the classification module concatenates the peptide sequence encoding module output and the peptide structure encoding module output as the input to the final feature output layer.
Further, for the multi-scale dilated CNN: when applied to a one-dimensional CNN, the dilated convolution can be calculated as

y_i = Σ_{k=1}^{K} ω_k · x_{i + r·k}

where y_i is the output of the i-th element of the convolution, x_i is the input of the i-th element, ω is the filter weight and K is the filter length; r is the dilation rate, where r = 1 is equal to the normal convolution, and a dilation rate of r = 2 inserts one zero between adjacent convolution weights.
Further, the BiLSTM model comprises the following steps:

Calculate the forget gate and the candidate state:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f),
v_t = tanh(W_c · [h_{t-1}, x_t] + b_v),

Calculate the input gate:

i_t = σ(W_i · [h_{t-1}, x_t] + b_i),

Calculate the cell state:

C_t = f_t · C_{t-1} + i_t · v_t,

Calculate the output gate and the hidden state at the current time:

O_t = σ(W_o · [h_{t-1}, x_t] + b_o),
h_t = O_t · tanh(C_t),

Combine the forward and reverse outputs:

h_t = h_t^→ ⊕ h_t^←,

where W and b denote the trainable weights and biases respectively, σ denotes a nonlinear activation function with values in the range [0, 1], h denotes a hidden-layer unit, f_t is the forget gate, v_t is the candidate cell state, i_t is the input (update) gate, C_t is the cell state, and O_t is the output gate, which gates information from the cell state into the output; ⊕ denotes element-wise summation, used to sum the elements of the forward and reverse outputs.
Further, the multi-modal data input module pre-processes the peptide sequence before input: peptides shorter than 517 residues are padded with a special character 'X', and all characters of the peptide are then converted to integers.
Further, the input of the peptide sequence encoding module is an amino acid sequence; the input of the peptide structure encoding module is a peptide molecular fingerprint.
Further, the classification module is a fully connected layer with five neurons with sigmoid activation; the output of each neuron represents the probability that the peptide belongs to the corresponding type.
The invention also comprises an active peptide prediction method, in which the peptide sequence encoding and the peptide structure encoding are input, and the function of a peptide is predicted using the multi-modal fusion deep learning model.
Compared with the prior art, the invention has the beneficial effects that:
1. The multi-modal fusion deep learning model and multifunctional bioactive peptide prediction method effectively fuse the two modalities of peptide sequence and structural features and can effectively extract data characteristics from different views, thereby better predicting peptide function;
2. To target the distinct characteristics of sequence data and structure data, different encoders are designed for feature extraction: a multi-scale dilated convolution CNN and BiLSTM model extracts features from the sequence data, and a multi-level CNN model extracts features from the structure data.
Drawings
FIG. 1 is a model flow diagram of the multi-modal fusion deep learning model and the multifunctional bioactive peptide prediction method.
Detailed Description
It is noted that relational terms such as "first" and "second", and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The features and capabilities of the present invention are described in further detail below in connection with examples.
Referring to FIG. 1, the multi-modal fusion deep learning model includes a multi-modal data input module, a peptide sequence encoding module, a peptide structure encoding module, and a fully connected layer module. The multi-modal data input module takes the peptide sequence and the peptide structure of a bioactive peptide as input; the peptide sequence encoding module fuses a multi-scale dilated CNN with a BiLSTM model to extract features of the peptide sequence at multiple scales; the peptide structure encoding module extracts features from the peptide structure data with a multi-layer CNN model; the fully connected layer module concatenates the output of the peptide sequence encoding module and the output of the peptide structure encoding module as the input to the final feature output layer.
For the multi-scale dilated CNN, when applied to a one-dimensional CNN, the dilated convolution can be calculated as

y_i = Σ_{k=1}^{K} ω_k · x_{i + r·k}

where y_i is the output of the i-th element of the convolution, x_i is the input of the i-th element, ω is the filter weight and K is the filter length; r is the dilation rate, where r = 1 is equal to the normal convolution, and a dilation rate of r = 2 inserts one zero between adjacent convolution weights.
Compared to traditional convolution, dilated convolution can capture multi-scale context information by setting different dilation rates, and can expand the receptive field without increasing the number of network parameters. In contrast to the normal convolution operation, dilated (atrous) convolution adds a hyper-parameter called the dilation rate. Different dilation rates can be seen as inserting holes of different sizes between the convolution kernel parameters.
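As an illustration of the formula above, here is a minimal NumPy sketch of a one-dimensional dilated convolution (the function name and the 'valid' padding choice are illustrative assumptions, not the patent's implementation):

```python
import numpy as np

def dilated_conv1d(x, w, r):
    """Dilated 1-D convolution with 'valid' padding: y_i = sum_k w_k * x[i + r*k].

    r = 1 reduces to an ordinary convolution; r = 2 behaves as if one zero
    were inserted between adjacent filter weights.
    """
    K = len(w)
    span = r * (K - 1) + 1                    # receptive field of one output element
    n_out = len(x) - span + 1
    return np.array([np.dot(w, x[i:i + span:r]) for i in range(n_out)])
```

With r = 1 and a two-tap filter this matches an ordinary sliding-window convolution; raising r widens the receptive field without adding parameters, which is exactly the property the text relies on.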
The peptide sequence encoding module first feeds the embedding matrix of the peptide sequence into dilated convolution blocks with different dilation rates (2, 4 and 8, respectively), arranged in parallel to extract information from the peptide sequence at different scales. A max-pooling operation is then applied to the resulting convolution feature matrices to prevent overfitting.
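The parallel arrangement described above can be sketched as follows (a simplified single-filter NumPy illustration with global max pooling; a real model uses learned embeddings and many filters per block):

```python
import numpy as np

def dilated_conv(x, w, r):
    # y_i = sum_k w_k * x[i + r*k], 'valid' padding
    span = r * (len(w) - 1) + 1
    return np.array([np.dot(w, x[i:i + span:r]) for i in range(len(x) - span + 1)])

def multi_scale_block(x, w, rates=(2, 4, 8)):
    """Apply parallel dilated convolutions (rates 2, 4, 8), max-pool each,
    and concatenate the pooled responses into one feature vector."""
    return np.array([dilated_conv(x, w, r).max() for r in rates])
```

Each dilation rate sees the input at a different scale, and the concatenated pooled responses form the multi-scale feature passed on to the BiLSTM.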
Long short-term memory (LSTM) is an improvement of the traditional recurrent neural network (RNN) that can capture the entire history of the input data. LSTM addresses the gradient vanishing and gradient explosion problems that can occur during back-propagation by adding input, output and forget gates. However, in a predictive model, LSTM cannot encode information from back to front, i.e. future information is not used. Bi-LSTM solves this problem well; it consists of two LSTM layers combined together: one LSTM cell processes the forward input and the other processes the reverse input. Compared to standard LSTM, BiLSTM can obtain correlations from both historical and current information, so the network can better understand the context. The complete hidden element carries the combined vector of the forward and reverse process outputs. The BiLSTM model comprises the following steps:
Calculate the forget gate and the candidate state:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f),
v_t = tanh(W_c · [h_{t-1}, x_t] + b_v),

Calculate the input gate:

i_t = σ(W_i · [h_{t-1}, x_t] + b_i),

Calculate the cell state:

C_t = f_t · C_{t-1} + i_t · v_t,

Calculate the output gate and the hidden state at the current time:

O_t = σ(W_o · [h_{t-1}, x_t] + b_o),
h_t = O_t · tanh(C_t),

Combine the forward and reverse outputs:

h_t = h_t^→ ⊕ h_t^←,

where W and b denote the trainable weights and biases respectively, σ denotes a nonlinear activation function with values in the range [0, 1], h denotes a hidden-layer unit, f_t is the forget gate, v_t is the candidate cell state, i_t is the input (update) gate, C_t is the cell state, and O_t is the output gate, which gates information from the cell state into the output; ⊕ denotes element-wise summation, used to sum the elements of the forward and reverse outputs.
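A minimal NumPy sketch of a single step of the gate equations above (the dictionary-of-weights layout and the dimensions are illustrative assumptions; W['v'] plays the role of W_c in the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step following the equations above.

    W and b are dicts with keys 'f', 'v', 'i', 'o'; each W[k] maps the
    concatenation [h_prev, x_t] to the hidden dimension."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W['f'] @ z + b['f'])   # forget gate
    v_t = np.tanh(W['v'] @ z + b['v'])   # candidate cell state
    i_t = sigmoid(W['i'] @ z + b['i'])   # input (update) gate
    C_t = f_t * C_prev + i_t * v_t       # new cell state
    O_t = sigmoid(W['o'] @ z + b['o'])   # output gate
    h_t = O_t * np.tanh(C_t)             # hidden state
    return h_t, C_t
```

A BiLSTM runs one such recurrence forward and one backward over the sequence and sums the two hidden states element-wise, as in the last equation.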
The multi-modal data input module pre-processes the peptide sequence before input: peptides shorter than 517 residues are padded with a special character 'X', and all characters of the peptide are then converted to integers.
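This preprocessing step can be sketched as follows (the vocabulary ordering and the choice of 0 for the padding token are illustrative assumptions; the text specifies only padding to 517 residues with 'X' and integer conversion):

```python
MAX_LEN = 517
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
VOCAB = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
VOCAB['X'] = 0  # padding character

def encode_peptide(seq, max_len=MAX_LEN):
    """Pad a peptide shorter than max_len with 'X', then map each
    character to an integer index."""
    padded = seq + 'X' * (max_len - len(seq))
    return [VOCAB[aa] for aa in padded]
```

The resulting fixed-length integer vector is what the embedding layer of the sequence encoder consumes.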
The input of the peptide sequence encoding module is an amino acid sequence; the input of the peptide structure encoding module is a peptide molecular fingerprint.
The classification module is a fully connected layer with five sigmoid-activated neurons; the output of each neuron represents the probability of the peptide belonging to the corresponding type. A fully connected layer is adopted as the prediction layer: the output of the peptide sequence encoding module is concatenated with the output of the structure encoding module as the input to the final output layer. In this multi-label problem the probabilities of the nodes are independent of each other, so binary cross entropy is used as the loss function. Using sigmoid as the activation function, each node produces a score between 0 and 1. Finally, the prediction labels for each class are obtained using 0.5 as the threshold, and the elements of the five-dimensional prediction vector correspond to the labels ACP, ADP, AHP, AIP and AMP, respectively.
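The thresholding behaviour of this prediction layer can be sketched as follows (a NumPy illustration of the sigmoid-plus-0.5-threshold rule; the function name is an assumption):

```python
import numpy as np

LABELS = ["ACP", "ADP", "AHP", "AIP", "AMP"]

def predict_labels(logits, threshold=0.5):
    """Independent sigmoid score per label node, thresholded at 0.5,
    as in the multi-label prediction layer described above."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return [label for label, p in zip(LABELS, probs) if p >= threshold]
```

Because each node is scored independently, a peptide may receive zero, one, or several of the five functional labels.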
The peptide structure encoding module is a multi-layer convolution model using three successive one-dimensional convolution layers, with the numbers of filters being 16 and 32, respectively. Each convolution layer is followed by an average pooling layer. Finally, a dropout layer with a rate of 0.3 is used to avoid overfitting.
The invention also comprises an active peptide prediction method, in which the peptide sequence encoding and the peptide structure encoding are input, and the function of a peptide is predicted using the multi-modal fusion deep learning model.
Comparative experiments
Experimental data set
The dataset is the same experimental dataset used by MLBP, a deep learning method based on a convolutional neural network (CNN) and gated recurrent unit (GRU) that takes the active peptide sequence as input. The dataset was collected in 2020 by searching the Google Scholar engine with the keyword "bioactive peptide". The initial dataset included 18 classes of bioactive peptides. Because classes with too few training samples cannot train a deep neural network well, peptide classes with fewer than 500 samples were eliminated. Thus, five functional peptide classes were retained (antimicrobial peptide AMP, anticancer peptide ACP, antidiabetic peptide ADP, antihypertensive peptide AHP and anti-inflammatory peptide AIP). The clustering tool CD-HIT was used to remove or reduce redundancy and homology, with the sequence identity threshold set to 0.9. The final numbers of ACP, ADP, AHP, AIP and AMP peptides are 646, 514, 868, 1678 and 2409, respectively. 80% of the peptides were randomly sampled as the training set and the remaining 20% were used as the test set.
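The 80/20 random split can be sketched with the standard library (the seed and the function name are illustrative; the source does not specify the sampling procedure beyond "randomly sampled"):

```python
import random

def train_test_split(items, train_frac=0.8, seed=42):
    """Randomly shuffle the peptides and split them 80% / 20%
    into training and test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]
```

Fixing the seed makes the split reproducible across runs, which matters when comparing against baseline methods on the same partition.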
Experimental environment
For training, validation and testing, a deep learning server was used. Hardware: Xeon E5-2650 v4 CPU, 15 GB memory, GTX 1080 Ti GPU. Software: Ubuntu 18.04, Python 3.6, TensorFlow 1.15.6.
Comparison methods
The current state-of-the-art method based on deep learning is MLBP, which performs feature extraction and fusion on peptide sequences with CNN + BiGRU. This method only allows sequence input of the active peptide, not structural feature input. Four state-of-the-art machine-learning methods were also compared: the second-order algorithm CLR, the high-order algorithm RAndom k-labELsets (RAKEL), the joint ranking support vector machine and binary relevance with robust low-rank learning (RBRL), and the depth-forest-based multi-label deep forest (MLDF). The results show that the proposed method outperforms these state-of-the-art deep-learning and machine-learning multi-label methods.
Evaluation of sequence coding modules
A multi-scale dilated convolution and BiLSTM model is designed as the sequence encoding module for extracting sequence features. To show that this module has excellent feature extraction capability on multifunctional bioactive peptide sequences, the structure encoding module was removed, and the sequence feature input module, the sequence encoding module and the classification module were retained as a multifunctional bioactive peptide predictor. The same experimental comparison was made with MLBP, the current state-of-the-art method based on multifunctional bioactive peptide sequences. As shown in Table 1, our sequence encoding model improves Precision by 0.08, Coverage by 0.045 and Accuracy by 0.07, showing that the sequence encoding module has good sequence feature extraction capability.
TABLE 1 Comparison of the performance of the sequence encoding module and MLBP on the test dataset
[Table 1 appears only as an image in the original document.]
Evaluation of structural coding modules
In order to verify the function of the structure encoding part of the multi-modal fusion learning model, only the structural feature input was retained, i.e. the structural feature input module, the structure encoding module and the classification module were kept. An experimental comparison was made against the model with both sequence and structural feature inputs. The results show that on all five metrics, performance with both structure and sequence features as input is comprehensively superior to performance with only sequence features as input. This illustrates that structural features can effectively model the functional information of a peptide.
Table 2 Comparison of the performance of the sequence encoding module and the structure encoding module on the test dataset
[Table 2 appears only as an image in the original document.]
Evaluation of the Overall model
To further evaluate the model, both the sequence features and the structural features were used as inputs to the multi-modal fusion deep learning model. The results show that the multi-modal fusion deep learning model is superior to the current state-of-the-art method (see Table 3), and that after fusing the two modalities, all five metrics are better than with single-feature input. The model can therefore effectively fuse the sequence features and the structural features of multifunctional active peptides.
TABLE 3 Comparison of the performance of multi-modal fusion deep learning and the current optimal method on the test dataset
[Table 3 appears only as an image in the original document.]
Multi-scale dilation convolution and BiLSTM evaluation
Evaluation of multi-scale dilation convolution
To verify the performance of multi-scale dilated convolution in extracting peptide sequence features, we performed ablation evaluations on the multi-scale dilated convolution blocks. First, we replaced the dilated convolutions in our method with generic CNNs, in order to compare the feature extraction performance of multi-scale dilated convolution with that of generic multi-scale convolution; the results in Table 4 show that multi-scale dilated convolution is significantly better than the multi-scale generic CNN. Considering that the peptide sequences in the benchmark dataset range from 5 to 517 residues in length, a very wide span, we considered using different dilation rates to extract sequence features over different length ranges. We therefore designed models with different dilation rates to test their ability to extract peptide sequence features. We first set the dilation rates in the multi-scale dilated convolution block to the same value, i.e. all 2, all 4 or all 8. The results show little difference between the three settings, and all are clearly better than the generic CNN. Given the very wide span of bioactive peptide lengths, if the dilation rates are all set to a uniform value, the features of peptide sequences that are too long or too short will not be adequately extracted. To address this difficulty, the dilation rates of the three convolution blocks in the multi-scale dilated convolution block were set to 2, 4 and 8, respectively. The results show that the effect with dilation rates of 2, 4 and 8 is significantly better than with dilation rates of all 2, all 4 or all 8, indicating that different dilation rates can fully extract features of different spans.
TABLE 4 Comparison of the performance of different dilation rates (r) on the test dataset
[Table 4 appears only as an image in the original document.]
Evaluation of BiLSTM
BiLSTM plays the role in the model of effectively fusing the features extracted by the multi-scale dilated convolution. To verify the effect of BiLSTM, we designed five DNN models and evaluated each on the test set; the results show that BiLSTM achieves the best performance. Table 5 lists the performance of these different DNNs.
TABLE 5 Comparison of the performance of BiLSTM and other DNN models on the test dataset
[Table 5 appears only as an image in the original document.]
The foregoing examples merely represent specific embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the present application. It should be noted that, for those skilled in the art, several variations and modifications can be made without departing from the technical solution of the present application, which fall within the protection scope of the present application.

Claims (7)

1. A multi-modal fusion deep learning model, characterized by comprising a multi-modal data input module, a peptide sequence encoding module, a peptide structure encoding module and a classification module; the multi-modal data input module inputs the peptide sequence and the peptide structure of the bioactive peptide; the peptide sequence encoding module fuses a multi-scale dilated CNN with a BiLSTM model to extract features of the peptide sequence at multiple scales; the peptide structure encoding module extracts features of the peptide structure data with a multi-scale CNN model; the classification module concatenates the peptide sequence encoding module output and the peptide structure encoding module output as inputs to the final feature output layer.
2. The multi-modal fusion deep learning model of claim 1, wherein for the multi-scale dilated CNN, when applied to a one-dimensional CNN, the dilated convolution can be calculated as

y_i = Σ_{k=1}^{K} ω_k · x_{i + r·k}

where y_i is the output of the i-th element of the convolution, x_i is the input of the i-th element, ω is the filter weight and K is the filter length; r is the dilation rate, where r = 1 is equal to the normal convolution, and a dilation rate of r = 2 inserts one zero between adjacent convolution weights.
3. The multi-modal fusion deep learning model of claim 1, wherein the BiLSTM model comprises the following steps:

Calculate the forget gate and the candidate state:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f),
v_t = tanh(W_c · [h_{t-1}, x_t] + b_v),

Calculate the input gate:

i_t = σ(W_i · [h_{t-1}, x_t] + b_i),

Calculate the cell state:

C_t = f_t · C_{t-1} + i_t · v_t,

Calculate the output gate and the hidden state at the current time:

O_t = σ(W_o · [h_{t-1}, x_t] + b_o),
h_t = O_t · tanh(C_t),

Combine the forward and reverse outputs:

h_t = h_t^→ ⊕ h_t^←,

where W and b denote the trainable weights and biases respectively, σ denotes a nonlinear activation function with values in the range [0, 1], h denotes a hidden-layer unit, f_t is the forget gate, v_t is the candidate cell state, i_t is the input (update) gate, C_t is the cell state, and O_t is the output gate, which gates information from the cell state into the output; ⊕ denotes element-wise summation, used to sum the elements of the forward and reverse outputs.
4. The multi-modal fusion deep learning model of claim 1, wherein the multi-modal data input module pre-processes the peptide sequence before input: peptides with fewer than 517 residues are padded with the special character 'X', and all characters of the peptide are then converted to integers.
5. The multi-modal fusion deep learning model of claim 1, wherein the input of the peptide sequence encoding module is an amino acid sequence and the input of the peptide structure encoding module is a peptide molecular fingerprint.
6. The multi-modal fusion deep learning model of claim 1, wherein the classification module is a fully connected layer having five neurons with sigmoid activation; the output of each neuron represents the probability of belonging to the corresponding type of peptide.
7. A method for predicting multifunctional bioactive peptides, characterized in that the peptide sequence encoding and the peptide structure encoding are input, and the function of a multifunctional bioactive peptide is predicted using the multi-modal fusion deep learning model according to any one of claims 1-6.
CN202211693605.4A 2022-12-28 2022-12-28 Multi-mode fusion deep learning model and multifunctional bioactive peptide prediction method Pending CN116013404A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211693605.4A CN116013404A (en) 2022-12-28 2022-12-28 Multi-mode fusion deep learning model and multifunctional bioactive peptide prediction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211693605.4A CN116013404A (en) 2022-12-28 2022-12-28 Multi-mode fusion deep learning model and multifunctional bioactive peptide prediction method

Publications (1)

Publication Number Publication Date
CN116013404A 2023-04-25

Family

ID=86020462

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211693605.4A Pending CN116013404A (en) 2022-12-28 2022-12-28 Multi-mode fusion deep learning model and multifunctional bioactive peptide prediction method

Country Status (1)

Country Link
CN (1) CN116013404A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116913383A (en) * 2023-09-13 2023-10-20 鲁东大学 T cell receptor sequence classification method based on multiple modes
CN116913383B (en) * 2023-09-13 2023-11-28 鲁东大学 T cell receptor sequence classification method based on multiple modes

Similar Documents

Publication Publication Date Title
CN110689965B (en) Drug target affinity prediction method based on deep learning
Honda et al. Smiles transformer: Pre-trained molecular fingerprint for low data drug discovery
CN107622182B (en) Method and system for predicting local structural features of protein
Wu et al. Neural networks for full-scale protein sequence classification: Sequence encoding with singular value decomposition
CN111063393B (en) Prokaryotic acetylation site prediction method based on information fusion and deep learning
CN112614538A (en) Antibacterial peptide prediction method and device based on protein pre-training characterization learning
CN111210871A (en) Protein-protein interaction prediction method based on deep forest
CN112270958B (en) Prediction method based on layered deep learning miRNA-lncRNA interaction relationship
Farid et al. A feature grouping method for ensemble clustering of high-dimensional genomic big data
CN116013404A (en) Multi-mode fusion deep learning model and multifunctional bioactive peptide prediction method
CN115312118A (en) Single-sequence protein contact map prediction method based on map neural network
Bankapur et al. An enhanced protein fold recognition for low similarity datasets using convolutional and skip-gram features with deep neural network
CN116580848A (en) Multi-head attention mechanism-based method for analyzing multiple groups of chemical data of cancers
CN115985520A (en) Medicine disease incidence relation prediction method based on graph regularization matrix decomposition
Chen et al. DeepGly: A deep learning framework with recurrent and convolutional neural networks to identify protein glycation sites from imbalanced data
Wang et al. Structured feature sparsity training for convolutional neural network compression
Özgül et al. A convolutional deep clustering framework for gene expression time series
CN116386733A (en) Protein function prediction method based on multi-view multi-scale multi-attention mechanism
Bankapur et al. Enhanced protein structural class prediction using effective feature modeling and ensemble of classifiers
Yeh et al. Ego-network transformer for subsequence classification in time series data
Reddy et al. AdaBoost for Parkinson's disease detection using robust scaler and SFS from acoustic features
CN114863997A (en) Anti-cancer peptide prediction method based on bidirectional long-short term memory network and feature fusion
CN111402953B (en) Protein sequence classification method based on hierarchical attention network
Iraji et al. Druggable protein prediction using a multi-canal deep convolutional neural network based on autocovariance method
Pan et al. Muti-stage hierarchical food classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination