CN113571133B - Lactic acid bacteria antibacterial peptide prediction method based on graph neural network - Google Patents
- Publication number
- CN113571133B (application CN202111074774.5A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
The invention discloses a lactic acid bacteria antibacterial peptide prediction method based on a graph neural network. Known lactic acid bacteria antibacterial peptides are collected to establish a positive sample, sequences of length 5-255 are collected from protein databases to establish a negative sample, and redundant and highly similar sequences are removed. Features are extracted from the positive and negative samples to obtain feature vectors and an initial input graph, on which a graph neural network model is built. The optimal number of layers, number of training epochs, learning rate and other parameters of the graph neural network are determined by training, evaluating and iteratively tuning the model; finally, data from strains suspected of antibacterial activity are predicted with the model. By replacing wet-laboratory screening with computer model prediction, the method shortens the time needed to judge lactic acid bacteria antibacterial peptide protein sequences, realizes accurate and efficient batch recognition, and provides an effective alternative for screening lactic acid bacteria strains with antibacterial properties.
Description
Technical Field
The invention relates to the field of identification of biological antibacterial peptides, in particular to a lactic acid bacteria antibacterial peptide prediction method based on a graph neural network.
Background
Existing technologies for identifying biological antibacterial peptides fall mainly into two categories:
First, agar well diffusion assays, which are time-consuming and cannot achieve high-throughput identification. Second, recognition by machine learning, or by long short-term memory and convolutional neural network techniques in deep learning; although many amino acid sequences can be processed at once, these models capture only the local semantic information of an antibacterial peptide sequence and do not readily grasp its characteristic information from the perspective of the overall structure, so recognition accuracy and other indexes still need improvement.
Disclosure of Invention
In order to solve these problems and realize accurate, high-throughput identification of antibacterial peptides, the invention provides the following technical scheme:
a lactic acid bacteria antibacterial peptide prediction method based on a graph neural network comprises the following steps:
s1, data acquisition: establish a positive sample and a negative sample, wherein the positive sample is a set of lactic acid bacteria antibacterial peptide sequences separated from more than 20 known international antibacterial peptide databases, and the negative sample is a set of non-repetitive protein sequences of length 5-255 with pairwise similarity below 80%, drawn from international protein databases (such as UniProt); establish a sample set from the positive and negative samples;
s2, preprocessing data, performing word segmentation processing on the peptide sequence, establishing two types of nodes according to the word segmentation and the peptide sequence, establishing edges according to the word co-occurrence relation and the belonging relation of the words and the sequence, and forming an initial input graph of the neural network by the nodes and the edges together; establishing a feature vector of a word segmentation by using a word embedding technology, wherein the feature vector is used as an input feature vector of a graph neural network;
s3, constructing a graph neural network model, calculating an adjacency matrix of an initial input graph, and constructing a multilayer graph convolutional neural network according to the adjacency matrix and the input feature vector;
s4, training the graph neural network model, calculating loss through a cross entropy loss function, adjusting each layer of weight matrix of the graph neural network model according to a loss value and an optimization function, recalculating loss by using the adjusted weight matrix, and repeating the process until the loss value reaches the minimum;
s5, evaluating and tuning the graph neural network model: evaluate the model according to the evaluation indexes, adjust the number of model layers, the number of training epochs and the learning rate according to each evaluation index, and repeat the training until the parameter combination is found that achieves the highest accuracy together with relatively optimal values of the other evaluation indexes;
and S6, identifying strains, performing protein sequencing on the suspected lactobacillus strains in batches by adopting the model, and then screening and identifying whether the suspected lactobacillus strains have antibacterial activity.
Preferably, the word embedding technique in step S2 includes, but is not limited to, BERT, FastText, ELMo.
Preferably, the evaluation indexes in step S5 include, but are not limited to, sensitivity, specificity, accuracy, and the Matthews correlation coefficient.
Preferably, the specific process of step S5 is as follows:
S51, fix the number of model layers at 2 and the learning rate at 0.001; vary the number of training epochs from 50 to 500 in steps of 10, plot the evaluation index curves, and find the best number of training epochs for this setting;
S52, with the number of model layers still fixed at 2, vary the learning rate from 0.0001 to 0.01 in steps of 0.0001, and for each learning rate vary the number of training epochs from 50 to 500 in steps of 10; plot the evaluation index curves and find the best number of training epochs for each learning rate;
S53, increase the number of model layers from 3 to 6 one layer at a time, repeating the above process;
and S54, summarize the results of the above steps to find the optimal number of model layers, number of training epochs and learning rate.
By adopting this prediction method and graph neural network technology, the conserved amino acid structures of antibacterial peptide sequences are represented as nodes on a graph and the co-occurrence relations between conserved structures as its edges, so that the recognition problem for antibacterial peptides is converted into a node classification problem on the graph. Because the graph is a holistic structure, it can capture and mine the characteristic information of antibacterial peptide sequences from a global perspective, enabling accurate classification of the nodes in the graph. Compared with the prior art, the recognition accuracy index is greatly improved and batch identification is realized.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of the present invention;
FIG. 2 is a diagram illustrating an implementation of data collection in an embodiment of the present invention;
FIG. 3 is a partial data of a positive sample according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is further explained by combining the drawings and the embodiment.
The lactobacillus antimicrobial peptide prediction method based on the graph neural network is mainly divided into four aspects of data acquisition, model establishment, model optimization and model prediction.
In particular, it can be subdivided into the following steps:
s1, collecting data, and establishing a positive sample and a negative sample
The positive sample is a lactobacillus antibacterial peptide sequence set separated from a comprehensive and thematic antibacterial peptide database obtained by investigation, the negative sample is a protein sequence set meeting the length requirement of 5-255, and a sample set is established according to the positive sample and the negative sample.
As shown in fig. 2, lactic acid bacteria antimicrobial peptides are separated from antimicrobial peptide databases such as APD3, ADAM and DRAMP to establish the positive sample; protein sequences with lengths of 5-255 are separated from public databases such as PDB and UniProt to establish the negative sample. Both samples are processed with the CD-HIT and CD-HIT-2D software to remove redundant sequences and sequences with similarity greater than 80%, and are then combined into a sample set. The model is evaluated using 10-fold cross-validation.
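The length filter and duplicate removal above can be sketched in Python. Note that the method itself relies on CD-HIT/CD-HIT-2D for 80%-identity clustering, which this toy exact-duplicate filter does not replace; the sample sequences below are made up purely for illustration.

```python
def filter_candidates(sequences, min_len=5, max_len=255):
    """Keep sequences in the 5-255 residue window and drop exact duplicates.

    The real pipeline uses CD-HIT / CD-HIT-2D at an 80% identity threshold;
    this sketch only removes exact repeats, as a stand-in.
    """
    seen, kept = set(), []
    for seq in sequences:
        s = seq.strip().upper()
        if not (min_len <= len(s) <= max_len):
            continue  # outside the allowed length range
        if s in seen:
            continue  # exact duplicate
        seen.add(s)
        kept.append(s)
    return kept

# Made-up example sequences (not real antimicrobial peptides):
candidates = ["GLFDIVKKVV", "glfdivkkvv", "AC", "K" * 300, "KKLLPIVKKM"]
print(filter_candidates(candidates))  # → ['GLFDIVKKVV', 'KKLLPIVKKM']
```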
S2, preprocessing data
The method statistically analyzes the length range of the lactic acid bacteria antimicrobial peptide sequences and the proportional distribution of each amino acid, draws on word segmentation (tokenization) techniques from natural language processing, determines conserved amino acid structure combinations by methods such as multiple sequence alignment, single amino acids and dipeptides, and integrates this information to determine a word segmentation scheme.
The words are vectorized with word embedding technologies such as BERT, FastText and ELMo to form feature vectors, which serve as the input feature vectors of the graph neural network. Nodes are established for the words of the peptide sequences and for the peptide sequences themselves, edges are established from the co-occurrence relations among words and the membership relations between words and sequences, and the nodes and edges together form the initial input graph of the graph neural network.
Here, a "word" refers to a possibly conserved domain in a protein sequence. There are 20 amino acids in nature (see the one-letter amino acid abbreviation table); multiple amino acids form a peptide chain, and one or more peptide chains may form a protein. A word can be a single amino acid, a pair of amino acids, or a possibly conserved subsequence of the antibacterial peptide sequence structure. Words and sequences serve as the nodes of the graph neural network, and relationships are established between the nodes, so that the graph neural network model can be used for recognition.
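As one possible tokenization scheme — dipeptide words are only one of the options mentioned above, and the sequence-node naming here is an illustrative assumption — the word/sequence graph construction might be sketched as:

```python
from itertools import combinations

def dipeptide_words(seq):
    """Overlapping two-residue 'words' — one of the segmentation options above."""
    return [seq[i:i + 2] for i in range(len(seq) - 1)]

def build_graph(sequences):
    """Word nodes and sequence nodes; word-word co-occurrence edges plus
    word-sequence membership edges, as in the initial input graph."""
    nodes, edges = set(), set()
    for idx, seq in enumerate(sequences):
        seq_node = f"seq{idx}"          # illustrative sequence-node naming
        nodes.add(seq_node)
        words = dipeptide_words(seq)
        nodes.update(words)
        for w in set(words):
            edges.add((w, seq_node))    # membership edge
        for w1, w2 in combinations(sorted(set(words)), 2):
            edges.add((w1, w2))         # co-occurrence edge (within one sequence)
    return nodes, edges

nodes, edges = build_graph(["GLF", "LFK"])  # toy 3-residue "peptides"
print(sorted(nodes))
```

In a real setting the word vocabulary would come from the segmentation scheme above, and word features from a word-embedding model rather than from the graph itself.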
S3 construction of graph neural network model
Calculate an adjacency matrix of the initial input graph and construct a multi-layer graph convolutional neural network from the adjacency matrix and the feature vectors.
A multi-layer graph convolutional neural network may be constructed according to equation (1):

Z(A, X) = softmax(A′ … ReLU(A′XW0) … Wn)  (1)

where A is the adjacency matrix, X is the input feature matrix, ReLU is the activation function, and W0 through Wn are the weight matrices of the successive layers, their number being determined by the number of layers of the graph convolutional neural network.

A′ is obtained from A by Laplacian (symmetric) normalization, equation (2):

A′ = D^(-1/2) (A + I) D^(-1/2)  (2)

where D is the degree matrix of the graph and I is the identity matrix; D is calculated as in equation (3):

D_ii = Σ_j (A + I)_ij  (3)
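A minimal NumPy sketch of equations (1)-(3) — the normalized adjacency and the stacked graph convolutions — on a toy 3-node graph with random features (the graph, feature sizes and weights are made up for illustration):

```python
import numpy as np

def normalize_adj(A):
    """Equation (2): A' = D^(-1/2) (A + I) D^(-1/2), with D from equation (3)."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)                      # D_ii = sum_j (A + I)_ij
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gcn_forward(A, X, weights):
    """Equation (1): ReLU between graph convolutions, softmax on the last layer."""
    A_prime = normalize_adj(A)
    H = X
    for W in weights[:-1]:
        H = np.maximum(A_prime @ H @ W, 0.0)   # ReLU
    return softmax(A_prime @ H @ weights[-1])

# Toy 3-node path graph, 4-dimensional features, two output classes.
rng = np.random.default_rng(0)
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
X = rng.normal(size=(3, 4))
W0, W1 = rng.normal(size=(4, 8)), rng.normal(size=(8, 2))
Z = gcn_forward(A, X, [W0, W1])
print(Z.shape)  # (3, 2): one class distribution per node
```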
S4 training of graph neural network model
The loss value is calculated through a cross entropy loss function; the weight matrices W0 through Wn are adjusted by the Adam optimizer according to the loss value, the loss value is recalculated with the adjusted weight matrices, and this process is repeated until the loss value reaches its minimum.
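The S4 loop can be illustrated on a toy classifier. For brevity this sketch uses plain gradient descent in place of the Adam optimizer named above, and random features stand in for graph-convolved node representations; the data and hyperparameters are made up.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean cross-entropy loss over labeled nodes (integer class labels)."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

# Toy features standing in for graph-convolved node representations.
rng = np.random.default_rng(1)
H = rng.normal(size=(20, 5))
y = (H[:, 0] > 0).astype(int)   # synthetic binary labels
W = np.zeros((5, 2))            # final-layer weight matrix to be learned

lr, prev_loss = 0.1, np.inf
for epoch in range(200):
    P = softmax(H @ W)
    loss = cross_entropy(P, y)
    if abs(prev_loss - loss) < 1e-9:
        break                    # loss has stopped decreasing
    prev_loss = loss
    grad = H.T @ (P - np.eye(2)[y]) / len(y)  # gradient of CE w.r.t. W
    W -= lr * grad               # plain gradient step (Adam in the method)

final_loss = cross_entropy(softmax(H @ W), y)
print(round(final_loss, 3))
```

The untrained loss is ln 2 ≈ 0.693 for two balanced classes; the loop should drive it well below that on this separable toy data.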
S5, evaluation and tuning of graph neural network model
(1) Evaluation
The graph neural network model is evaluated according to the evaluation indexes to verify its accuracy. The evaluation indexes comprise sensitivity, specificity, accuracy and the Matthews correlation coefficient.
The scheme is evaluated on four indexes.
Sensitivity (SN) is the proportion of all antimicrobial peptides that are correctly predicted; specificity (SP) is the proportion of all non-antimicrobial peptides that are correctly predicted; accuracy (ACC) is the proportion of all samples that are correctly predicted, and since it is generally regarded as the most important of the evaluation indexes, it can be taken as the index expressing the prediction effect of the model. The Matthews correlation coefficient (MCC) evaluates classification performance; it is a statistical measure of the correlation between predicted and actual results.
True Positive (TP) indicates the number of antimicrobial peptides predicted to be antimicrobial peptides; true Negative (TN) indicates the number of non-antibacterial peptides predicted to be non-antibacterial peptides; false Positive (FP) indicates the number of antimicrobial peptides predicted to be non-antimicrobial peptides; false Negative (FN) indicates the number of non-antibacterial peptides predicted to be antibacterial peptides.
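From these four counts the evaluation indexes follow directly; a small sketch (the example counts are made up for illustration):

```python
import math

def evaluation_indexes(tp, tn, fp, fn):
    """SN, SP, ACC and MCC from the four confusion-matrix counts."""
    sn = tp / (tp + fn)                        # sensitivity: recall on AMPs
    sp = tn / (tn + fp)                        # specificity: recall on non-AMPs
    acc = (tp + tn) / (tp + tn + fp + fn)      # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # Matthews corr. coef.
    return sn, sp, acc, mcc

# Made-up counts for illustration:
sn, sp, acc, mcc = evaluation_indexes(tp=40, tn=45, fp=5, fn=10)
print(sn, sp, acc, round(mcc, 3))  # → 0.8 0.9 0.85 0.704
```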
(2) Adjusting and optimizing
Parameters such as the number of model layers, the number of training epochs and the learning rate are tuned through the following steps.
S51, based on experience in building deep learning models, fix the number of model layers at 2 and the learning rate at 0.001; vary the number of training epochs from 50 to 500 in steps of 10, plot the evaluation index curves, and find the best number of training epochs;
S52, with the number of model layers still fixed at 2, vary the learning rate from 0.0001 to 0.01 in steps of 0.0001, and for each learning rate vary the number of training epochs from 50 to 500 in steps of 10; plot the evaluation index curves and find the best number of training epochs for each learning rate;
S53, increase the number of model layers from 3 to 6 one layer at a time, repeating the above process;
and S54, summarize the results of the above steps to find the optimal number of model layers, number of training epochs and learning rate.
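Steps S51-S54 amount to a grid search over the three hyperparameters. A sketch follows, in which `evaluate()` is a hypothetical placeholder assumed to train the graph model with the given settings and return its accuracy (the toy objective below is made up only to exercise the search):

```python
from itertools import product

def tune(evaluate, layer_options=(2, 3, 4, 5, 6),
         lrs=None, epoch_options=range(50, 501, 10)):
    """Grid search over layers (S53), learning rate (S52) and epochs (S51/S54).

    `evaluate(layers, lr, epochs)` is a hypothetical placeholder assumed to
    train the graph model with those settings and return its accuracy.
    """
    if lrs is None:
        lrs = [round(0.0001 * k, 4) for k in range(1, 101)]  # 0.0001 .. 0.01
    best_acc, best_cfg = -1.0, None
    for layers, lr, epochs in product(layer_options, lrs, epoch_options):
        acc = evaluate(layers, lr, epochs)
        if acc > best_acc:
            best_acc, best_cfg = acc, (layers, lr, epochs)
    return best_acc, best_cfg

# Toy stand-in objective, peaked at (2 layers, lr 0.001, 200 epochs):
toy = lambda L, lr, e: -abs(L - 2) - abs(lr - 0.001) - abs(e - 200) / 1000
best_acc, best_cfg = tune(toy, layer_options=(2, 3),
                          lrs=[0.001, 0.005], epoch_options=range(50, 501, 50))
print(best_cfg)  # → (2, 0.001, 200)
```

In practice each `evaluate` call is expensive, which is why the method fixes two parameters and sweeps the third rather than exhausting the full grid.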
S6, strain identification
The model is used, after protein sequencing, to screen and identify suspected lactic acid bacteria strain sequences with antibacterial activity. The model can of course be deployed on intelligent devices in various forms, such as an APP, a client, an H5 applet or the Web, to facilitate screening and identification of undetermined strains at any time.
The above is a specific embodiment of the present invention, but the scope of the present invention should not be limited thereto. Any changes or substitutions that can be easily made by those skilled in the art within the technical scope of the present invention are included in the protection scope of the present invention, and therefore, the protection scope of the present invention is subject to the protection scope defined by the appended claims.
Claims (3)
1. A lactic acid bacteria antibacterial peptide prediction method based on a graph neural network is characterized by comprising the following steps:
s1, collecting data, establishing a positive sample and a negative sample, wherein the positive sample is a lactic acid bacteria antibacterial peptide sequence set separated from known international antibacterial peptide databases, the negative sample is a non-repetitive protein sequence set which meets the requirement of 5-255 in length and has similarity lower than 80% in a protein database, and establishing a sample set according to the positive sample and the negative sample;
s2, preprocessing data, performing word segmentation processing on the peptide sequence, establishing two types of nodes according to the word segmentation and the peptide sequence, establishing edges according to the word co-occurrence relation and the belonging relation of the words and the sequence, and forming an initial input graph of the graph neural network by the nodes and the edges together; establishing a feature vector of a participle by using a word embedding technology, wherein the feature vector is used as an input feature vector of a graph neural network;
s3, constructing a graph neural network model, calculating an adjacency matrix of an initial input graph, and constructing a multilayer graph convolutional neural network according to the adjacency matrix and the input feature vector;
s4, training the graph neural network model, calculating loss through a cross entropy loss function, adjusting each layer of weight matrix of the graph neural network model according to the loss value, recalculating the loss by using the adjusted weight matrix, and repeating the process until the loss value reaches the minimum;
s5, evaluating and adjusting the graph neural network model, evaluating the graph neural network model according to the evaluation indexes, adjusting the layer number, the training round number and the learning rate parameter of the graph neural network model according to each evaluation index, and repeating the training model until the optimal parameter combination which reaches the highest accuracy of the graph neural network model and other relatively excellent evaluation indexes is found;
and S6, identifying strains, performing protein sequencing on the suspected lactobacillus strains in batches by adopting the model, and then screening and identifying whether the suspected lactobacillus strains have antibacterial activity.
2. The method for predicting lactic acid bacteria antimicrobial peptides based on the graph neural network according to claim 1, wherein: the word embedding technique in the step S2 includes, but is not limited to, BERT, FastText, ELMo.
3. The method for predicting lactic acid bacteria antimicrobial peptides based on the graph neural network according to claim 1, wherein the evaluation indexes in step S5 include but are not limited to sensitivity, specificity, accuracy, and the Matthews correlation coefficient.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111074774.5A CN113571133B (en) | 2021-09-14 | 2021-09-14 | Lactic acid bacteria antibacterial peptide prediction method based on graph neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113571133A CN113571133A (en) | 2021-10-29 |
CN113571133B true CN113571133B (en) | 2022-06-17 |
Family
ID=78173770
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111074774.5A Active CN113571133B (en) | 2021-09-14 | 2021-09-14 | Lactic acid bacteria antibacterial peptide prediction method based on graph neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113571133B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114999586B (en) * | 2022-06-14 | 2023-08-08 | 内蒙古农业大学 | Method for predicting interaction of lactobacillus bulgaricus and streptococcus thermophilus |
CN115938486B (en) * | 2022-12-06 | 2023-11-10 | 内蒙古农业大学 | Antibacterial lactic acid bacterial strain screening method based on graph neural network |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112614538A (en) * | 2020-12-17 | 2021-04-06 | 厦门大学 | Antibacterial peptide prediction method and device based on protein pre-training characterization learning |
WO2021164365A1 (en) * | 2020-02-17 | 2021-08-26 | 支付宝(杭州)信息技术有限公司 | Graph neural network model training method, apparatus and system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021007812A1 (en) * | 2019-07-17 | 2021-01-21 | 深圳大学 | Deep neural network hyperparameter optimization method, electronic device and storage medium |
Non-Patent Citations (2)
Title |
---|
Bu Xiaoting et al. Prediction of antimicrobial peptides and their antimicrobial functions based on multi-label transductive learning. Journal of Dalian University of Technology. 2017, (No. 03), full text. *
Fang Chun et al. Prediction of anticancer peptides based on long short-term memory networks. Journal of Shandong University of Technology (Natural Science Edition). 2020, (No. 03), full text. *
Also Published As
Publication number | Publication date |
---|---|
CN113571133A (en) | 2021-10-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||