CN113571133B - Lactic acid bacteria antibacterial peptide prediction method based on graph neural network - Google Patents
- Publication number
- CN113571133B (application CN202111074774.5A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
The invention discloses a lactic acid bacteria antibacterial peptide prediction method based on a graph neural network. Known lactic acid bacteria antibacterial peptides are collected to establish a positive sample, sequences of length 5-255 are collected from protein databases to establish a negative sample, and redundant and highly similar sequences are removed. Features are extracted from the positive and negative samples to obtain feature vectors and an initial input graph, on which a graph neural network model is built. The optimal number of layers, number of training epochs, learning rate and other parameters of the graph neural network are determined by training, evaluating and iteratively tuning the model; finally, data from strains suspected of antibacterial activity are predicted with the model. By replacing wet-laboratory screening with computer model prediction, the method shortens the time needed to judge lactic acid bacteria antibacterial peptide protein sequences, realizes accurate and efficient batch recognition, and provides an effective alternative for screening lactic acid bacteria strains with antibacterial properties.
Description
Technical Field
The invention relates to the field of identification of biological antibacterial peptides, in particular to a lactic acid bacteria antibacterial peptide prediction method based on a graph neural network.
Background
Existing technologies for identifying biological antibacterial peptides fall mainly into two categories:
First, agar well diffusion assays, which are time-consuming and cannot achieve high-throughput identification. Second, recognition by machine learning, or by long short-term memory and convolutional neural network techniques in deep learning; although many amino acid sequences can be processed at once, these models capture only the local semantic information of an antibacterial peptide sequence and do not readily grasp its characteristic information from the perspective of the overall structure, so recognition accuracy and other indexes still need improvement.
Disclosure of Invention
In order to solve these problems and realize accurate, high-throughput identification of antibacterial peptides, the invention provides the following technical scheme:
a lactic acid bacteria antibacterial peptide prediction method based on a graph neural network comprises the following steps:
s1, data acquisition: establish a positive sample and a negative sample, wherein the positive sample is a set of lactic acid bacteria antibacterial peptide sequences separated from more than 20 known international antibacterial peptide databases, and the negative sample is a set of non-repetitive protein sequences of length 5-255 with pairwise similarity below 80%, drawn from international protein databases (such as UniProt); establish a sample set from the positive and negative samples;
s2, preprocessing data, performing word segmentation processing on the peptide sequence, establishing two types of nodes according to the word segmentation and the peptide sequence, establishing edges according to the word co-occurrence relation and the belonging relation of the words and the sequence, and forming an initial input graph of the neural network by the nodes and the edges together; establishing a feature vector of a word segmentation by using a word embedding technology, wherein the feature vector is used as an input feature vector of a graph neural network;
s3, constructing a graph neural network model, calculating an adjacency matrix of an initial input graph, and constructing a multilayer graph convolutional neural network according to the adjacency matrix and the input feature vector;
s4, training the graph neural network model, calculating loss through a cross entropy loss function, adjusting each layer of weight matrix of the graph neural network model according to a loss value and an optimization function, recalculating loss by using the adjusted weight matrix, and repeating the process until the loss value reaches the minimum;
s5, evaluating and tuning the graph neural network model: evaluate the model according to the evaluation indexes, adjust the number of model layers, the number of training epochs and the learning rate according to each evaluation index, and repeat the training until the parameter combination is found that achieves the highest accuracy together with relatively optimal values of the other evaluation indexes;
and S6, identifying strains, performing protein sequencing on the suspected lactobacillus strains in batches by adopting the model, and then screening and identifying whether the suspected lactobacillus strains have antibacterial activity.
Preferably, the word embedding technique in step S2 includes, but is not limited to, BERT, FastText, ELMo.
Preferably, the evaluation indexes in step S5 include, but are not limited to, sensitivity, specificity, accuracy, and the Matthews correlation coefficient.
Preferably, the specific process of step S5 is as follows:
S51, fix the number of model layers at 2 and the learning rate at 0.001; vary the number of training epochs from 50 to 500 in steps of 10, plot the evaluation index curves, and find the best number of training epochs for this setting;
S52, with the number of model layers still fixed at 2, vary the learning rate from 0.0001 to 0.01 in steps of 0.0001, and for each learning rate vary the number of training epochs from 50 to 500 in steps of 10; plot the evaluation index curves and find the best number of training epochs for each learning rate;
S53, increase the number of model layers from 3 to 6 one layer at a time, repeating the above process;
and S54, summarize the results of the above steps to find the optimal number of model layers, number of training epochs and learning rate.
By adopting this prediction method and graph neural network technology, the conserved amino acid structures of antibacterial peptide sequences are represented as nodes on a graph and the co-occurrence relations between conserved structures as its edges, so that the recognition problem for antibacterial peptides is converted into a node classification problem on the graph. Because the graph is a holistic structure, it can capture and mine the characteristic information of antibacterial peptide sequences from a global perspective, enabling accurate classification of the nodes in the graph. Compared with the prior art, the recognition accuracy index is greatly improved and batch identification is realized.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of the present invention;
FIG. 2 is a diagram illustrating an implementation of data collection in an embodiment of the present invention;
FIG. 3 is a partial data of a positive sample according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is further explained by combining the drawings and the embodiment.
The lactobacillus antimicrobial peptide prediction method based on the graph neural network is mainly divided into four aspects of data acquisition, model establishment, model optimization and model prediction.
In particular, it can be subdivided into the following steps:
s1, collecting data, and establishing a positive sample and a negative sample
The positive sample is a lactobacillus antibacterial peptide sequence set separated from a comprehensive and thematic antibacterial peptide database obtained by investigation, the negative sample is a protein sequence set meeting the length requirement of 5-255, and a sample set is established according to the positive sample and the negative sample.
As shown in fig. 2, lactic acid bacteria antimicrobial peptides are separated from antimicrobial peptide databases such as APD3, ADAM and DRAMP to establish the positive sample; protein sequences with lengths of 5-255 are separated from public databases such as PDB and UniProt to establish the negative sample. Both samples are processed with the CD-HIT and CD-HIT-2D software to remove redundant sequences and sequences with similarity greater than 80%, and are then combined into a sample set. The model is evaluated using 10-fold cross-validation.
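The length filter and duplicate removal above can be sketched in Python. Note that the method itself relies on CD-HIT/CD-HIT-2D for 80%-identity clustering, which this toy exact-duplicate filter does not replace; the sample sequences below are made up purely for illustration.

```python
def filter_candidates(sequences, min_len=5, max_len=255):
    """Keep sequences in the 5-255 residue window and drop exact duplicates.

    The real pipeline uses CD-HIT / CD-HIT-2D at an 80% identity threshold;
    this sketch only removes exact repeats, as a stand-in.
    """
    seen, kept = set(), []
    for seq in sequences:
        s = seq.strip().upper()
        if not (min_len <= len(s) <= max_len):
            continue  # outside the allowed length range
        if s in seen:
            continue  # exact duplicate
        seen.add(s)
        kept.append(s)
    return kept

# Made-up example sequences (not real antimicrobial peptides):
candidates = ["GLFDIVKKVV", "glfdivkkvv", "AC", "K" * 300, "KKLLPIVKKM"]
print(filter_candidates(candidates))  # → ['GLFDIVKKVV', 'KKLLPIVKKM']
```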
S2, preprocessing data
The method statistically analyzes the length range of the lactic acid bacteria antimicrobial peptide sequences and the proportional distribution of each amino acid, draws on word segmentation (tokenization) techniques from natural language processing, determines conserved amino acid structure combinations by methods such as multiple sequence alignment, single amino acids and dipeptides, and integrates this information to determine a word segmentation scheme.
The words are vectorized with word embedding technologies such as BERT, FastText and ELMo to form feature vectors, which serve as the input feature vectors of the graph neural network. Nodes are established for the words of the peptide sequences and for the peptide sequences themselves, edges are established from the co-occurrence relations among words and the membership relations between words and sequences, and the nodes and edges together form the initial input graph of the graph neural network.
Here, a "word" refers to a possibly conserved domain in a protein sequence. There are 20 amino acids in nature (see the one-letter amino acid abbreviation table); multiple amino acids form a peptide chain, and one or more peptide chains may form a protein. A word can be a single amino acid, a pair of amino acids, or a possibly conserved subsequence of the antibacterial peptide sequence structure. Words and sequences serve as the nodes of the graph neural network, and relationships are established between the nodes, so that the graph neural network model can be used for recognition.
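As one possible tokenization scheme — dipeptide words are only one of the options mentioned above, and the sequence-node naming here is an illustrative assumption — the word/sequence graph construction might be sketched as:

```python
from itertools import combinations

def dipeptide_words(seq):
    """Overlapping two-residue 'words' — one of the segmentation options above."""
    return [seq[i:i + 2] for i in range(len(seq) - 1)]

def build_graph(sequences):
    """Word nodes and sequence nodes; word-word co-occurrence edges plus
    word-sequence membership edges, as in the initial input graph."""
    nodes, edges = set(), set()
    for idx, seq in enumerate(sequences):
        seq_node = f"seq{idx}"          # illustrative sequence-node naming
        nodes.add(seq_node)
        words = dipeptide_words(seq)
        nodes.update(words)
        for w in set(words):
            edges.add((w, seq_node))    # membership edge
        for w1, w2 in combinations(sorted(set(words)), 2):
            edges.add((w1, w2))         # co-occurrence edge (within one sequence)
    return nodes, edges

nodes, edges = build_graph(["GLF", "LFK"])  # toy 3-residue "peptides"
print(sorted(nodes))
```

In a real setting the word vocabulary would come from the segmentation scheme above, and word features from a word-embedding model rather than from the graph itself.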
S3 construction of graph neural network model
Calculate an adjacency matrix of the initial input graph and construct a multi-layer graph convolutional neural network from the adjacency matrix and the feature vectors.
A multi-layer graph convolutional neural network may be constructed according to equation (1):

Z(A, X) = softmax(A′ … ReLU(A′XW0) … Wn)  (1)

where A is the adjacency matrix, X is the input feature matrix, ReLU is the activation function, and W0 through Wn are the weight matrices of the successive layers, their number being determined by the number of layers of the graph convolutional neural network.

A′ is obtained from A by Laplacian (symmetric) normalization, equation (2):

A′ = D^(-1/2) (A + I) D^(-1/2)  (2)

where D is the degree matrix of the graph and I is the identity matrix; D is calculated as in equation (3):

D_ii = Σ_j (A + I)_ij  (3)
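A minimal NumPy sketch of equations (1)-(3) — the normalized adjacency and the stacked graph convolutions — on a toy 3-node graph with random features (the graph, feature sizes and weights are made up for illustration):

```python
import numpy as np

def normalize_adj(A):
    """Equation (2): A' = D^(-1/2) (A + I) D^(-1/2), with D from equation (3)."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)                      # D_ii = sum_j (A + I)_ij
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gcn_forward(A, X, weights):
    """Equation (1): ReLU between graph convolutions, softmax on the last layer."""
    A_prime = normalize_adj(A)
    H = X
    for W in weights[:-1]:
        H = np.maximum(A_prime @ H @ W, 0.0)   # ReLU
    return softmax(A_prime @ H @ weights[-1])

# Toy 3-node path graph, 4-dimensional features, two output classes.
rng = np.random.default_rng(0)
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
X = rng.normal(size=(3, 4))
W0, W1 = rng.normal(size=(4, 8)), rng.normal(size=(8, 2))
Z = gcn_forward(A, X, [W0, W1])
print(Z.shape)  # (3, 2): one class distribution per node
```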
S4 training of graph neural network model
The loss value is calculated through a cross entropy loss function; the weight matrices W0 through Wn are adjusted by the Adam optimizer according to the loss value, the loss value is recalculated with the adjusted weight matrices, and this process is repeated until the loss value reaches its minimum.
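The S4 loop can be illustrated on a toy classifier. For brevity this sketch uses plain gradient descent in place of the Adam optimizer named above, and random features stand in for graph-convolved node representations; the data and hyperparameters are made up.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean cross-entropy loss over labeled nodes (integer class labels)."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

# Toy features standing in for graph-convolved node representations.
rng = np.random.default_rng(1)
H = rng.normal(size=(20, 5))
y = (H[:, 0] > 0).astype(int)   # synthetic binary labels
W = np.zeros((5, 2))            # final-layer weight matrix to be learned

lr, prev_loss = 0.1, np.inf
for epoch in range(200):
    P = softmax(H @ W)
    loss = cross_entropy(P, y)
    if abs(prev_loss - loss) < 1e-9:
        break                    # loss has stopped decreasing
    prev_loss = loss
    grad = H.T @ (P - np.eye(2)[y]) / len(y)  # gradient of CE w.r.t. W
    W -= lr * grad               # plain gradient step (Adam in the method)

final_loss = cross_entropy(softmax(H @ W), y)
print(round(final_loss, 3))
```

The untrained loss is ln 2 ≈ 0.693 for two balanced classes; the loop should drive it well below that on this separable toy data.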
S5, evaluation and tuning of graph neural network model
(1) Evaluation
The graph neural network model is evaluated according to the evaluation indexes to verify its accuracy. The evaluation indexes comprise sensitivity, specificity, accuracy and the Matthews correlation coefficient.
The scheme is evaluated on four indexes.
Sensitivity (SN) is the proportion of all antimicrobial peptides that are correctly predicted; specificity (SP) is the proportion of all non-antimicrobial peptides that are correctly predicted; accuracy (ACC) is the proportion of all samples that are correctly predicted, and since it is generally regarded as the most important of the evaluation indexes, it can be taken as the index expressing the prediction effect of the model. The Matthews correlation coefficient (MCC) evaluates classification performance; it is a statistical measure of the correlation between predicted and actual results.
True Positive (TP) indicates the number of antimicrobial peptides predicted to be antimicrobial peptides; true Negative (TN) indicates the number of non-antibacterial peptides predicted to be non-antibacterial peptides; false Positive (FP) indicates the number of antimicrobial peptides predicted to be non-antimicrobial peptides; false Negative (FN) indicates the number of non-antibacterial peptides predicted to be antibacterial peptides.
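From these four counts the evaluation indexes follow directly; a small sketch (the example counts are made up for illustration):

```python
import math

def evaluation_indexes(tp, tn, fp, fn):
    """SN, SP, ACC and MCC from the four confusion-matrix counts."""
    sn = tp / (tp + fn)                        # sensitivity: recall on AMPs
    sp = tn / (tn + fp)                        # specificity: recall on non-AMPs
    acc = (tp + tn) / (tp + tn + fp + fn)      # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # Matthews corr. coef.
    return sn, sp, acc, mcc

# Made-up counts for illustration:
sn, sp, acc, mcc = evaluation_indexes(tp=40, tn=45, fp=5, fn=10)
print(sn, sp, acc, round(mcc, 3))  # → 0.8 0.9 0.85 0.704
```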
(2) Adjusting and optimizing
Parameters such as the number of model layers, the number of training epochs and the learning rate are tuned through the following steps.
S51, based on experience in building deep learning models, fix the number of model layers at 2 and the learning rate at 0.001; vary the number of training epochs from 50 to 500 in steps of 10, plot the evaluation index curves, and find the best number of training epochs;
S52, with the number of model layers still fixed at 2, vary the learning rate from 0.0001 to 0.01 in steps of 0.0001, and for each learning rate vary the number of training epochs from 50 to 500 in steps of 10; plot the evaluation index curves and find the best number of training epochs for each learning rate;
S53, increase the number of model layers from 3 to 6 one layer at a time, repeating the above process;
and S54, summarize the results of the above steps to find the optimal number of model layers, number of training epochs and learning rate.
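Steps S51-S54 amount to a grid search over the three hyperparameters. A sketch follows, in which `evaluate()` is a hypothetical placeholder assumed to train the graph model with the given settings and return its accuracy (the toy objective below is made up only to exercise the search):

```python
from itertools import product

def tune(evaluate, layer_options=(2, 3, 4, 5, 6),
         lrs=None, epoch_options=range(50, 501, 10)):
    """Grid search over layers (S53), learning rate (S52) and epochs (S51/S54).

    `evaluate(layers, lr, epochs)` is a hypothetical placeholder assumed to
    train the graph model with those settings and return its accuracy.
    """
    if lrs is None:
        lrs = [round(0.0001 * k, 4) for k in range(1, 101)]  # 0.0001 .. 0.01
    best_acc, best_cfg = -1.0, None
    for layers, lr, epochs in product(layer_options, lrs, epoch_options):
        acc = evaluate(layers, lr, epochs)
        if acc > best_acc:
            best_acc, best_cfg = acc, (layers, lr, epochs)
    return best_acc, best_cfg

# Toy stand-in objective, peaked at (2 layers, lr 0.001, 200 epochs):
toy = lambda L, lr, e: -abs(L - 2) - abs(lr - 0.001) - abs(e - 200) / 1000
best_acc, best_cfg = tune(toy, layer_options=(2, 3),
                          lrs=[0.001, 0.005], epoch_options=range(50, 501, 50))
print(best_cfg)  # → (2, 0.001, 200)
```

In practice each `evaluate` call is expensive, which is why the method fixes two parameters and sweeps the third rather than exhausting the full grid.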
S6, strain identification
The model is used, after protein sequencing, to screen and identify suspected lactic acid bacteria strain sequences with antibacterial activity. The model can of course be deployed on intelligent devices in various forms, such as an APP, a client, an H5 applet or the Web, to facilitate screening and identification of undetermined strains at any time.
The above is a specific embodiment of the present invention, but the scope of the present invention should not be limited thereto. Any changes or substitutions that can be easily made by those skilled in the art within the technical scope of the present invention are included in the protection scope of the present invention, and therefore, the protection scope of the present invention is subject to the protection scope defined by the appended claims.
Claims (3)
1. A lactic acid bacteria antibacterial peptide prediction method based on a graph neural network is characterized by comprising the following steps:
s1, collecting data, establishing a positive sample and a negative sample, wherein the positive sample is a lactic acid bacteria antibacterial peptide sequence set separated from known international antibacterial peptide databases, the negative sample is a non-repetitive protein sequence set which meets the requirement of 5-255 in length and has similarity lower than 80% in a protein database, and establishing a sample set according to the positive sample and the negative sample;
s2, preprocessing data, performing word segmentation processing on the peptide sequence, establishing two types of nodes according to the word segmentation and the peptide sequence, establishing edges according to the word co-occurrence relation and the belonging relation of the words and the sequence, and forming an initial input graph of the graph neural network by the nodes and the edges together; establishing a feature vector of a participle by using a word embedding technology, wherein the feature vector is used as an input feature vector of a graph neural network;
s3, constructing a graph neural network model, calculating an adjacency matrix of an initial input graph, and constructing a multilayer graph convolutional neural network according to the adjacency matrix and the input feature vector;
s4, training the graph neural network model, calculating loss through a cross entropy loss function, adjusting each layer of weight matrix of the graph neural network model according to the loss value, recalculating the loss by using the adjusted weight matrix, and repeating the process until the loss value reaches the minimum;
s5, evaluating and adjusting the graph neural network model, evaluating the graph neural network model according to the evaluation indexes, adjusting the layer number, the training round number and the learning rate parameter of the graph neural network model according to each evaluation index, and repeating the training model until the optimal parameter combination which reaches the highest accuracy of the graph neural network model and other relatively excellent evaluation indexes is found;
and S6, identifying strains, performing protein sequencing on the suspected lactobacillus strains in batches by adopting the model, and then screening and identifying whether the suspected lactobacillus strains have antibacterial activity.
2. The method for predicting lactic acid bacteria antimicrobial peptides based on the graph neural network according to claim 1, wherein: the word embedding technique in the step S2 includes, but is not limited to, BERT, FastText, ELMo.
3. The method for predicting lactic acid bacteria antimicrobial peptides based on the graph neural network according to claim 1, wherein the evaluation indexes in step S5 include but are not limited to sensitivity, specificity, accuracy, and the Matthews correlation coefficient.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111074774.5A CN113571133B (en) | 2021-09-14 | 2021-09-14 | Lactic acid bacteria antibacterial peptide prediction method based on graph neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113571133A CN113571133A (en) | 2021-10-29 |
CN113571133B true CN113571133B (en) | 2022-06-17 |
Family
ID=78173770
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111074774.5A Active CN113571133B (en) | 2021-09-14 | 2021-09-14 | Lactic acid bacteria antibacterial peptide prediction method based on graph neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113571133B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114999586B (en) * | 2022-06-14 | 2023-08-08 | 内蒙古农业大学 | Method for predicting interaction of lactobacillus bulgaricus and streptococcus thermophilus |
CN115938486B (en) * | 2022-12-06 | 2023-11-10 | 内蒙古农业大学 | Antibacterial lactic acid bacterial strain screening method based on graph neural network |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112614538A (en) * | 2020-12-17 | 2021-04-06 | 厦门大学 | Antibacterial peptide prediction method and device based on protein pre-training characterization learning |
WO2021164365A1 (en) * | 2020-02-17 | 2021-08-26 | 支付宝(杭州)信息技术有限公司 | Graph neural network model training method, apparatus and system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021007812A1 (en) * | 2019-07-17 | 2021-01-21 | 深圳大学 | Deep neural network hyperparameter optimization method, electronic device and storage medium |
Non-Patent Citations (2)
Title |
---|
Bu Xiaoting et al. Prediction of antimicrobial peptides and their antimicrobial functions based on multi-label transductive learning. Journal of Dalian University of Technology. 2017, (No. 03), full text. *
Fang Chun et al. Prediction of anticancer peptides based on long short-term memory networks. Journal of Shandong University of Technology (Natural Science Edition). 2020, (No. 03), full text. *
Also Published As
Publication number | Publication date |
---|---|
CN113571133A (en) | 2021-10-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||