CN116206690B - Antibacterial peptide generation and identification method and system - Google Patents
- Publication number
- CN116206690B (CN202310483081.4A)
- Authority
- CN
- China
- Prior art keywords
- information
- polypeptide
- antibacterial peptide
- sequence
- dimensional
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B35/00—ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
- G16B35/20—Screening of libraries
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H70/00—ICT specially adapted for the handling or processing of medical references
- G16H70/40—ICT specially adapted for the handling or processing of medical references relating to drugs, e.g. their side effects or intended usage
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A50/00—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE in human health protection, e.g. against extreme weather
- Y02A50/30—Against vector-borne diseases, e.g. mosquito-borne, fly-borne, tick-borne or waterborne diseases whose impact is exacerbated by climate change
Landscapes
- Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- General Health & Medical Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Biophysics (AREA)
- Evolutionary Biology (AREA)
- Theoretical Computer Science (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Library & Information Science (AREA)
- Data Mining & Analysis (AREA)
- Epidemiology (AREA)
- Public Health (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Databases & Information Systems (AREA)
- Bioethics (AREA)
- Crystallography & Structural Chemistry (AREA)
- Analytical Chemistry (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Artificial Intelligence (AREA)
- Biochemistry (AREA)
- Molecular Biology (AREA)
- Medicinal Chemistry (AREA)
- Pharmacology & Pharmacy (AREA)
- Toxicology (AREA)
- Primary Health Care (AREA)
- Investigating Or Analysing Biological Materials (AREA)
Abstract
The invention provides a method and a system for generating and identifying antibacterial peptides, belonging to the technical field of computer-aided drug research and development. The scheme comprises the following steps: acquiring reference polypeptide sequence information and functional attribute information corresponding to a target antibacterial peptide; obtaining structural information of the reference polypeptide sequence based on a polypeptide multi-channel coloring method; inputting the reference polypeptide sequence information and the structural information into a pre-trained variational autoencoder model to obtain sequence information and structural information of the target antibacterial peptide; and obtaining a functional identification result of the target antibacterial peptide using a pre-trained neural network model, based on the fused features of the sequence information and the structural information of the target antibacterial peptide. The scheme can generate antibacterial peptides with specified properties that differ from the antibacterial peptides existing in nature; in addition, the multimodal identification method effectively improves the accuracy of polypeptide identification.
Description
Technical Field
The invention belongs to the technical field of computer-aided drug research and development, and particularly relates to an antibacterial peptide generation and identification method and system.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
A polypeptide is a chain of amino acids connected by peptide bonds, and polypeptides are widely used in fields such as cardiovascular disease, cytokine mimicry and antibacterial therapy. However, the polypeptides existing in nature are limited in number and in function, so there is growing interest in designing and artificially synthesizing polypeptides. To obtain polypeptide sequences with specific functions, two types of methods are currently used:
The first type is numerical simulation and calculation based on physical and chemical methods. These methods are computationally expensive, inefficient and of limited accuracy, and are therefore difficult to apply to large-scale polypeptide engineering;
The second type is machine learning-based methods, which achieve rapid polypeptide function prediction by learning from large data sets with function labels. However, the predictive performance of such methods depends heavily on the selection and processing of data features (i.e., feature engineering) and on the feature learning method. If the constructed features are inadequate, the model's prediction of polypeptide function can be seriously affected. In recent years, with the development of deep learning, many researchers have attempted to generate or identify polypeptides having specific functions using sequence-based models, such as recurrent neural networks (RNNs), gated recurrent units (GRUs) and long short-term memory networks (LSTMs). However, from first principles, the structure of a protein or polypeptide has a crucial influence on its function, whereas the above approaches ignore the structural features of the polypeptide. A model is therefore needed that considers both sequence information and structural information in order to generate and identify specific antibacterial peptides more accurately.
Recently, some structure-based approaches have also been proposed, for example for predicting the affinity between proteins and small molecules. Many researchers predict protein or polypeptide function by constructing a graph neural network to represent and learn from amino acid sequences. Specifically, most methods based on the graph neural network (Graph Neural Network) use amino acids as the nodes of the graph, use the physicochemical properties of the amino acids as node features, and use the distances between the alpha carbon atoms of the amino acids as the edges connecting them. However, these methods ignore the fact that polypeptide sequence lengths often follow a long-tail distribution, i.e., the number of sequences of a given length decreases steadily as the length increases. Because the dimension of the adjacency matrix of the graph neural network must match the maximum number of amino acids, the adjacency matrix is too sparse for most sequences. In addition, since the amino acids of a polypeptide are connected in series, feature propagation often relies on a single path, which makes it difficult to pass features between amino acids that are far apart. Therefore, a more rational method for representing and learning polypeptide sequences is needed, one that overcomes these negative effects and enables more accurate prediction of polypeptide function.
Meanwhile, the properties of the polypeptides existing in nature are often unevenly distributed: the number of antibacterial peptides that inhibit one type of drug-resistant bacteria may be far greater than the number that inhibit another type, and a single antibacterial peptide may inhibit multiple types of drug-resistant bacteria. This causes serious data imbalance in antibacterial peptide identification, so that the model is biased toward the majority classes and neglects learning of the minority classes. Thus, a method is also needed that alleviates the sample distribution imbalance so that minority classes of antibacterial peptides can be identified.
Disclosure of Invention
To solve the above problems, the invention provides an antibacterial peptide generation and identification method and system. By introducing preset functional attribute information into the generation process, the scheme can generate antibacterial peptides with specified attributes that differ from the antibacterial peptides existing in nature. Meanwhile, in antibacterial peptide identification, a multimodal identification method combines the sequence features of the polypeptide with its structural features, which effectively improves identification accuracy and allows the effect of the polypeptide to be predicted more accurately. This can greatly improve the efficiency of computer-aided polypeptide drug research and development and help screen out more suitable drugs, thereby reducing the cost of drug development.
According to a first aspect of an embodiment of the present invention, there is provided an antibacterial peptide generation and identification method including:
acquiring reference polypeptide sequence information and functional attribute information corresponding to a target antibacterial peptide;
obtaining structural information of the reference polypeptide sequence based on a polypeptide multi-channel coloring method; the polypeptide multi-channel coloring method specifically comprises the following steps: the whole polypeptide is expressed as a cube structure composed of a plurality of three-dimensional voxels, and the color of each three-dimensional voxel is determined by the values of three channels: the mass of the atom to which the three-dimensional voxel belongs, the solubility of the amino acid to which the atom belongs, and the acid-base property of the amino acid to which the atom belongs;
inputting the reference polypeptide sequence information and the structural information into a pre-trained variational autoencoder model to obtain sequence information and structural information of the target antibacterial peptide; the variational autoencoder model receives the two modalities, polypeptide sequence information and structural information, at the same time, and fuses them with the functional attribute information while encoding the polypeptide sequence information and the structural information, thereby realizing conditional generation;
obtaining a functional identification result of the target antibacterial peptide using a pre-trained neural network model, based on the fused features of the sequence information and the structural information of the target antibacterial peptide.
Further, the functional attribute information is obtained from the functional attribute labels of the polypeptide sequence and is specifically expressed as a binary sequence whose length is the total number of functional attributes present across the known polypeptide sequences. Each bit in the binary sequence represents one functional attribute: a bit value of 1 indicates that the polypeptide has the functional attribute, and a bit value of 0 indicates that it does not.
Further, the variational autoencoder model comprises two branches, the first processing structural information and the second processing sequence information; to embed the functional attribute information into the variational autoencoder, the functional attribute information is compressed to one dimension using a multi-layer perceptron and then multiplied with the input information to be encoded.
Further, the training process of the variational autoencoder model specifically comprises the following steps:
acquiring a preset number of polypeptide data samples to construct a training set, wherein each polypeptide data sample comprises a polypeptide sequence and the functional attribute information corresponding to that polypeptide sequence;
representing the functional attribute information corresponding to each polypeptide sequence as a binary sequence; analyzing each sample to obtain the three-dimensional spatial structure information of the polypeptide data; and, based on the three-dimensional spatial structure information, acquiring the structural information of each sample using the polypeptide multi-channel coloring method;
taking the structural information and the sequence information of each polypeptide data sample as the input of the variational autoencoder model, introducing the functional attribute information into the middle layer of the variational autoencoder model, and taking the polypeptide sequence information and the polypeptide structural information of each polypeptide data sample as the output of the variational autoencoder model, thereby training the variational autoencoder model.
Further, the whole polypeptide is expressed as a cube structure consisting of a plurality of three-dimensional voxels, specifically: constructing a three-dimensional spatial coordinate system with the center of gravity of the polypeptide as the origin, together with a cube structure containing the whole polypeptide; and dividing the cube structure containing the whole polypeptide into a plurality of three-dimensional voxels, with a preset unit distance as the side length of each three-dimensional voxel.
Furthermore, the coloring of each three-dimensional voxel is based on the concept of RGB multi-channel imaging: the mass of the atom to which the three-dimensional voxel belongs, the solubility of the amino acid to which the atom belongs and the acid-base property of the amino acid to which the atom belongs are each assigned to one channel, giving a multi-channel representation of the color of each three-dimensional voxel.
Further, for three-dimensional voxels that do not contain atoms, each channel for color representation is assigned a zero value.
Further, the solubility of the amino acid is divided into hydrophilic and hydrophobic, and different values are set for hydrophilic amino acids and hydrophobic amino acids; the acid-base property of the amino acid is classified as acidic, neutral or basic, and acidic amino acids, neutral amino acids and basic amino acids are set to different values.
Further, the reference polypeptide sequence is selected from a library of polypeptide sequences with known functional information, according to the functional requirements of the target antibacterial peptide.
According to a second aspect of embodiments of the present invention, there is provided an antibacterial peptide generation and identification system comprising:
a data acquisition unit for acquiring reference polypeptide sequence information and functional attribute information corresponding to a target antibacterial peptide;
a structural feature extraction unit for acquiring structural information of the reference polypeptide sequence based on a polypeptide multi-channel coloring method; the polypeptide multi-channel coloring method specifically comprises the following steps: the whole polypeptide is expressed as a cube structure composed of a plurality of three-dimensional voxels, and the color of each three-dimensional voxel is determined by the values of three channels: the mass of the atom to which the three-dimensional voxel belongs, the solubility of the amino acid to which the atom belongs, and the acid-base property of the amino acid to which the atom belongs;
a target antibacterial peptide generation unit for inputting the reference polypeptide sequence information and the structural information into a pre-trained variational autoencoder model to obtain sequence information and structural information of the target antibacterial peptide; the variational autoencoder model receives the two modalities, polypeptide sequence information and structural information, at the same time, and fuses them with the functional attribute information while encoding the polypeptide sequence information and the structural information, thereby realizing conditional generation;
and a function identification unit for obtaining a functional identification result of the target antibacterial peptide using a pre-trained neural network model, based on the fused features of the sequence information and the structural information of the target antibacterial peptide.
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention provides a method and a system for generating and identifying antibacterial peptides in which functional attribute information is introduced into the generation process, so that antibacterial peptides with specified attributes can be generated that differ from the antibacterial peptides existing in nature. Meanwhile, in antibacterial peptide identification, a multimodal identification method combines the sequence features of the polypeptide with its structural features, which effectively improves identification accuracy and allows the effect of the polypeptide to be predicted more accurately. This can greatly improve the efficiency of computer-aided polypeptide drug research and development and help screen out more suitable drugs, thereby reducing the cost of drug development.
(2) The scheme of the invention is based on a polypeptide multi-channel coloring method in which each polypeptide is regarded as a cube structure consisting of a plurality of three-dimensional voxels (3D pixels) with three color channels, from which a structural feature representation of the polypeptide is obtained. The method captures additional effective features and alleviates the degradation of deep learning model performance caused by the long-tail distribution of sequence lengths.
(3) The scheme greatly improves the accuracy of multi-label classification by extending the Softmax loss function used for multi-class classification to the multi-label classification task (namely, the identification of antibacterial peptides).
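This passage does not state the formula of the extended loss. As an illustration only, the sketch below assumes one commonly used generalization of softmax cross-entropy to multi-label targets, log(1 + Σ_neg e^{s_i}) + log(1 + Σ_pos e^{-s_j}), which pushes every positive-label score above zero and every negative-label score below zero. The function names and this choice of formula are assumptions, not the patent's stated loss.

```python
import math
from typing import Sequence


def log1p_sum_exp(values: Sequence[float]) -> float:
    """Numerically stable log(1 + sum(exp(v) for v in values))."""
    peak = max([0.0, *values])
    total = math.exp(-peak) + sum(math.exp(v - peak) for v in values)
    return peak + math.log(total)


def multilabel_softmax_loss(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Assumed multi-label extension of the softmax loss (illustrative only).

    scores: one raw model score per functional attribute.
    labels: the matching binary label vector (1 = has the attribute).
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    # Negative classes contribute exp(s); positive classes contribute exp(-s).
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    positives = [-s for s, y in zip(scores, labels) if y == 1]
    return log1p_sum_exp(negatives) + log1p_sum_exp(positives)
```

Because the positive and negative sets are each pooled inside a single logarithm, a rare positive label is not outweighed by the many negative labels in the way it can be with independent per-label terms, which is one reason such a form is used under label imbalance.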
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a flowchart of an antimicrobial peptide generation and identification method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a model training process according to an embodiment of the present invention;
FIG. 3 is a schematic diagram showing the process of generating antimicrobial peptides by using a trained generation model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram showing the identification process of the antibacterial peptide according to the embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Embodiments of the invention and features of the embodiments may be combined with each other without conflict.
Embodiment one:
The aim of this embodiment is to provide a method for generating and identifying antibacterial peptides.
To solve the problems in the prior art, this embodiment provides an antibacterial peptide generation and identification method based on the following technical concept: each functional attribute of a polypeptide, namely its ability to inhibit a specific bacterium (the bacteria may be chosen according to actual requirements; broad-spectrum drug-resistant bacteria are used in this embodiment), is defined as a label, so that a conditional generation task and an identification task based on multi-label information are constructed. Meanwhile, during polypeptide generation and identification, polypeptide features are extracted using the polypeptide multi-channel coloring method and combined with polypeptide sequence information, realizing polypeptide generation and identification based on multimodal data. As shown in FIG. 1, the method comprises the following steps:
Step 1: acquiring reference polypeptide sequence information and functional attribute information corresponding to a target antibacterial peptide;
wherein the functional attribute information is obtained from the functional attribute labels (i.e., tags) of the polypeptide sequence. In a specific embodiment, the functional attribute information may be represented as a binary sequence whose length is the total number of functional attributes present across the known polypeptide sequences. Each bit in the binary sequence represents one functional attribute: a bit value of 1 indicates that the polypeptide has the functional attribute, and a bit value of 0 indicates that it does not.
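The binary label sequence described above can be sketched as a fixed-order multi-hot vector. The attribute names in `ATTRIBUTES` are hypothetical placeholders; in practice the vocabulary would be the set of functional attributes found across the known polypeptide sequences.

```python
from typing import Iterable, List, Sequence

# Hypothetical attribute vocabulary, in a fixed order (one bit per attribute).
ATTRIBUTES: Sequence[str] = (
    "inhibits_bacterium_A",
    "inhibits_bacterium_B",
    "inhibits_bacterium_C",
    "inhibits_bacterium_D",
)


def encode_attributes(active: Iterable[str],
                      vocabulary: Sequence[str] = ATTRIBUTES) -> List[int]:
    """Return a binary vector: 1 where the attribute is present, 0 otherwise."""
    active_set = set(active)
    unknown = active_set - set(vocabulary)
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    return [1 if name in active_set else 0 for name in vocabulary]


def decode_attributes(bits: Sequence[int],
                      vocabulary: Sequence[str] = ATTRIBUTES) -> List[str]:
    """Inverse mapping: list the attribute names whose bit is 1."""
    if len(bits) != len(vocabulary):
        raise ValueError("bit vector length must match the vocabulary size")
    return [name for name, bit in zip(vocabulary, bits) if bit == 1]
```

A peptide labeled with the first and third attributes is encoded as `[1, 0, 1, 0]`; because several bits can be set at once, the same vector expresses the multi-label case in which one peptide inhibits several bacteria.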
In a specific embodiment, the reference polypeptide sequence is selected from a library of polypeptide sequences with known functional information, according to the functional requirements of the target antibacterial peptide.
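One simple way to read this selection step is as a filter over a labeled library: keep every sequence whose binary label vector has a 1 wherever the required vector has a 1. The matching rule, the function name and the toy library below are assumptions made for illustration; the source does not specify the selection criterion in more detail.

```python
from typing import Dict, List, Sequence


def select_references(library: Dict[str, Sequence[int]],
                      required: Sequence[int]) -> List[str]:
    """Return library sequences that have every functional attribute set in `required`."""
    selected = []
    for sequence, labels in library.items():
        if len(labels) != len(required):
            raise ValueError(f"label length mismatch for {sequence!r}")
        # A sequence qualifies if each required bit (1) is also 1 in its labels.
        if all(label == 1 for label, need in zip(labels, required) if need == 1):
            selected.append(sequence)
    return selected


# Toy library with made-up sequences and labels (not real data).
toy_library = {
    "KKLLKKLL": [1, 0, 1],
    "GLFGLFKK": [0, 1, 0],
    "RRWWRRWW": [1, 1, 1],
}
references = select_references(toy_library, required=[1, 0, 1])
```

Here `references` holds the two toy sequences that carry both the first and the third attribute; extra attributes on a candidate do not exclude it under this rule.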
Step 2: obtaining structural information of the reference polypeptide sequence based on a polypeptide multi-channel coloring method; the polypeptide multi-channel coloring method specifically comprises the following steps: the whole polypeptide is expressed as a cube structure composed of a plurality of three-dimensional voxels, and the color of each three-dimensional voxel is determined by the values of three channels: the mass of the atom to which the three-dimensional voxel belongs, the solubility of the amino acid to which the atom belongs, and the acid-base property of the amino acid to which the atom belongs;
In a specific embodiment, the polypeptide structure information is obtained by structural analysis of the polypeptide sequence, and the polypeptide structural features are obtained from the three-dimensional structure information using the polypeptide multi-channel coloring method. For polypeptides whose structure has been resolved, the PDB (Protein Data Bank) file is downloaded directly; for polypeptides whose structure has not been resolved, the three-dimensional spatial structure of each polypeptide is predicted with the AlphaFold2 model and stored as a PDB file.
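Whether the PDB file is downloaded or predicted, the coloring step needs each atom's element, residue name and coordinates. The sketch below is a minimal reader for `ATOM` records that assumes the standard fixed-column PDB layout (residue name in columns 18-20, x/y/z in columns 31-54, element symbol in columns 77-78). The `Atom` type and `parse_pdb_atoms` are hypothetical names; a production pipeline would more likely rely on an established structure-parsing library.

```python
from typing import List, NamedTuple


class Atom(NamedTuple):
    element: str   # chemical element symbol, e.g. "C"
    residue: str   # three-letter residue name, e.g. "LYS"
    x: float
    y: float
    z: float


def parse_pdb_atoms(pdb_text: str) -> List[Atom]:
    """Extract element, residue name and coordinates from ATOM records."""
    atoms = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue  # skip HETATM, TER, END and header records
        element = line[76:78].strip()
        if not element:
            # Fallback when the element columns are absent: first letter of
            # the atom name (adequate for H, C, N, O, S in standard residues).
            element = line[12:16].strip().lstrip("0123456789")[:1]
        atoms.append(Atom(
            element=element.upper(),
            residue=line[17:20].strip(),
            x=float(line[30:38]),
            y=float(line[38:46]),
            z=float(line[46:54]),
        ))
    return atoms
```

Slicing by column rather than splitting on whitespace matters here, because adjacent coordinate fields in PDB files can run together without a separating space.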
Specifically, the whole polypeptide is expressed as a cube structure consisting of a plurality of three-dimensional voxels as follows: a three-dimensional spatial coordinate system is constructed with the center of gravity of the polypeptide as the origin, together with a cube structure containing the whole polypeptide; the cube structure is divided into a plurality of three-dimensional voxels, with a preset unit distance as the voxel side length; the coloring of each three-dimensional voxel is based on the concept of RGB multi-channel imaging, in which the mass of the atom to which the voxel belongs, the solubility of the amino acid to which the atom belongs and the acid-base property of the amino acid to which the atom belongs are each assigned to one channel, giving a multi-channel representation of the color of each three-dimensional voxel; for three-dimensional voxels that do not contain atoms, each color channel is assigned a zero value.
Specifically, the polypeptide multi-channel coloring method realizes multi-channel coloring of the polypeptide structure based on known information such as the three-dimensional spatial coordinates and van der Waals radius of each atom in the polypeptide, the mass of the atom, and the solubility and acid-base property of the amino acid, as follows:
In this embodiment, the spatial position of each atom of the polypeptide is used to determine the space occupied by the polypeptide. Specifically, a three-dimensional spatial coordinate system (with an x axis, a y axis and a z axis) is constructed with the center of gravity of the polypeptide as the origin, and a cube structure containing the whole polypeptide is constructed. The coordinate system used in this embodiment is a three-dimensional Cartesian coordinate system, and the unit distance of each coordinate axis is set to 1 angstrom (the angstrom is a unit of length; 1 angstrom equals 0.1 nanometer);
The cube structure containing the whole polypeptide is divided into a plurality of three-dimensional voxels, with a preset unit distance (1 angstrom in this embodiment) as the voxel side length;
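The coordinate construction above can be sketched as two small helpers: one that computes the origin and one that maps an atom coordinate to the index of the 1 angstrom voxel containing it. As simplifying assumptions, the origin is taken as the unweighted mean of the atom coordinates (the source says "center of gravity", which could also be mass-weighted), and the cube half-width `half_width` is a hypothetical parameter.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def centroid(points: Sequence[Point]) -> Point:
    """Unweighted mean of the atom coordinates (assumed stand-in for the center of gravity)."""
    count = len(points)
    if count == 0:
        raise ValueError("at least one atom is required")
    return tuple(sum(p[axis] for p in points) / count for axis in range(3))


def voxel_index(point: Point, origin: Point, half_width: float,
                voxel_size: float = 1.0) -> Optional[Tuple[int, int, int]]:
    """Map a coordinate to the (i, j, k) index of its voxel, or None if outside the cube.

    The cube spans [-half_width, +half_width) around `origin` on every axis.
    """
    cells = int(round(2 * half_width / voxel_size))
    index = []
    for coord, centre in zip(point, origin):
        cell = math.floor((coord - centre + half_width) / voxel_size)
        if not 0 <= cell < cells:
            return None
        index.append(cell)
    return tuple(index)
```

With two atoms at (0, 0, 0) and (2, 0, 0) and a half-width of 8, the origin is (1, 0, 0) and the atoms land in voxels (7, 8, 8) and (9, 8, 8) of a 16×16×16 grid.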
In one or more embodiments, the cube region may cover only a portion of the whole polypeptide. For example, assuming that only atoms within L angstroms of the origin in the positive and negative directions of each coordinate axis are considered, the cube region is a cube whose length, width and height are each L×2 angstroms, where L may be any positive integer and is typically set to a multiple of 8 to facilitate subsequent feature extraction. Each atom is assigned to a group of voxels according to its van der Waals radius (the radius assigned to each element is: H (hydrogen) 1, C (carbon) 1.5, N (nitrogen) 1.5, O (oxygen) 1.5, S (sulfur) 2), and the spatial position occupied by the atom in the cube is determined from this radius.
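A possible reading of the occupancy rule is that an atom occupies every voxel whose center lies within the atom's van der Waals radius. The sketch below implements that reading; the "voxel center within radius" criterion, the assumption that the radii are in angstroms (matching the 1 angstrom voxel size) and the name `occupied_voxels` are illustrative choices, while the radii themselves are the values listed above.

```python
import math
from typing import Set, Tuple

# Radii as listed in the text; assumed to be in angstroms.
VDW_RADIUS = {"H": 1.0, "C": 1.5, "N": 1.5, "O": 1.5, "S": 2.0}


def occupied_voxels(center: Tuple[float, float, float], element: str,
                    half_width: float = 8.0,
                    voxel_size: float = 1.0) -> Set[Tuple[int, int, int]]:
    """Voxels whose centers fall inside the atom's van der Waals sphere.

    `center` is given relative to the cube origin; the cube spans
    [-half_width, +half_width) on every axis, so voxels outside it are clipped.
    """
    radius = VDW_RADIUS[element]
    cells = int(round(2 * half_width / voxel_size))
    axis_ranges = []
    for coord in center:
        low = max(0, math.floor((coord - radius + half_width) / voxel_size))
        high = min(cells - 1, math.floor((coord + radius + half_width) / voxel_size))
        axis_ranges.append(range(low, high + 1))
    occupied = set()
    for i in axis_ranges[0]:
        for j in axis_ranges[1]:
            for k in axis_ranges[2]:
                voxel_center = tuple((n + 0.5) * voxel_size - half_width
                                     for n in (i, j, k))
                if math.dist(voxel_center, center) <= radius:
                    occupied.add((i, j, k))
    return occupied
```

Only the bounding box of each sphere is scanned, so the cost per atom stays small even for large cubes; atoms near the cube boundary simply contribute fewer voxels.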
In a specific embodiment, the coloring of each three-dimensional voxel is based on the concept of RGB multi-channel imaging: the mass of the atom to which the voxel belongs, the solubility of the amino acid to which the atom belongs and the acid-base property of the amino acid to which the atom belongs are each assigned to one channel, giving a multi-channel representation of the color of each three-dimensional voxel. Each color channel of a three-dimensional voxel that contains no atom is assigned a zero value. Specifically:
Polypeptide coloring treats the property of the atom itself (i.e., the atomic mass), the solubility of the amino acid to which the atom belongs and the acid-base property of that amino acid as three separate channels, and fills in the color of each three-dimensional voxel according to the atom it contains. For the atomic-property channel, this embodiment rounds the mass of the atom and writes it directly to the first channel of the voxel color.
For the channels corresponding to amino acid solubility and acid-base property, the spatial positions of multiple atoms together represent the position of one amino acid. In this embodiment, amino acids are divided by solubility into hydrophobic amino acids and hydrophilic amino acids. Considering that each channel of a voxel takes values from 0 to 255 and that the background (i.e., a region of three-dimensional space containing no atom) has the value 0, and in order to distinguish the two classes clearly, atoms of hydrophobic amino acids are assigned 128 and atoms of hydrophilic amino acids are assigned 255. The second channel of the voxel color is then assigned according to the solubility class of the amino acid to which the atom in the voxel belongs.
Similarly, in this embodiment the acid-base property of amino acids is classified as acidic, neutral or basic, with the values 86, 168 and 255 assigned in that order, and the third channel of the voxel color is assigned according to the acid-base class of the amino acid to which the atom in the voxel belongs.
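The three-channel assignment can be sketched as a lookup from (element, residue) to a color triple, with a sparse dictionary standing in for the voxel cube so that absent voxels read as the zero background. The channel values (rounded mass; 128/255; 86/168/255) come from the text. The residue groupings in `HYDROPHOBIC`, `ACIDIC` and `BASIC` are common textbook classes assumed here, since this section does not list them, and the overwrite rule for voxels claimed by more than one atom is likewise an assumption.

```python
from typing import Dict, Iterable, Tuple

Color = Tuple[int, int, int]
Voxel = Tuple[int, int, int]
BACKGROUND: Color = (0, 0, 0)

ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}
# Assumed residue classes (not specified in the source text).
HYDROPHOBIC = {"ALA", "VAL", "LEU", "ILE", "MET", "PHE", "TRP", "PRO", "GLY"}
ACIDIC = {"ASP", "GLU"}
BASIC = {"LYS", "ARG", "HIS"}


def voxel_color(element: str, residue: str) -> Color:
    """Channel 1: rounded atomic mass; channel 2: solubility; channel 3: acid-base class."""
    mass_channel = round(ATOMIC_MASS[element])
    solubility_channel = 128 if residue in HYDROPHOBIC else 255
    if residue in ACIDIC:
        acid_base_channel = 86
    elif residue in BASIC:
        acid_base_channel = 255
    else:
        acid_base_channel = 168
    return (mass_channel, solubility_channel, acid_base_channel)


def color_grid(assignments: Iterable[Tuple[Voxel, str, str]]) -> Dict[Voxel, Color]:
    """Build a sparse colored grid from (voxel, element, residue) triples.

    If several atoms claim the same voxel, the last one wins (an assumption).
    """
    grid: Dict[Voxel, Color] = {}
    for voxel, element, residue in assignments:
        grid[voxel] = voxel_color(element, residue)
    return grid
```

For example, a carbon atom of a lysine residue is colored (12, 255, 255), and `grid.get(voxel, BACKGROUND)` returns (0, 0, 0) for any voxel that contains no atom.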
Step 3: inputting the reference polypeptide sequence information and the structure information into a pre-trained variational autoencoder model to obtain sequence information and structure information of the target antibacterial peptide; the variational autoencoder model receives the two modalities of polypeptide sequence information and structure information simultaneously, and fuses them with the functional attribute information while encoding the polypeptide sequence information and the structure information, thereby realizing conditional generation;
in a specific implementation, this embodiment proposes a variational autoencoder model based on conditional compression. As shown in the generative model training process in FIG. 2, the scheme of this embodiment modifies the variational autoencoder into a structure that takes as input and reconstructs two modalities at the same time (i.e., the polypeptide sequence and the polypeptide structure information). In order to generate a specific polypeptide sequence, the functional attribute information from step 1 is used in the polypeptide sequence generation task. Specifically, on the one hand, the scheme of this embodiment fuses the uncompressed functional attribute information into the middle-layer features to realize conditional generation; on the other hand, considering that the functional attribute information cannot be directly spliced onto the three-dimensional voxels (pixels) of the structure, the scheme compresses the functional attribute information to one dimension through a fully connected layer and then multiplies this one-dimensional feature with the original input information (i.e., the structural features and the reference polypeptide sequence), thereby realizing efficient conditional generation.
The variational autoencoder model based on conditional compression comprises two branches: the first branch processes the structure information and the second branch processes the sequence information. To embed the condition into the variational autoencoder, a multi-layer perceptron compresses the condition information (i.e., the functional attribute information) to one dimension, which is then multiplied with both the structural features and the polypeptide sequence, so that the structure and the sequence are controlled at the same time. By contrast, a conventional variational autoencoder typically receives input from only one modality and splices the condition directly onto the features. The middle-layer feature refers to the feature vector formed by directly splicing the outputs of the structure encoder and the sequence encoder.
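The conditioning path described above can be sketched in a few lines. The sketch makes several assumptions that the text does not confirm: "one dimension" is read as a single scalar, the multi-layer perceptron is reduced to one linear layer with a sigmoid, and the two encoders are trivial placeholders. All function names are hypothetical.

```python
import math


def compress_condition(condition, weights, bias):
    """Stand-in for the MLP: map the attribute vector to a single scalar."""
    z = sum(w * c for w, c in zip(weights, condition)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid is an assumption


def scale(values, scalar):
    """Multiply every input value by the compressed condition."""
    return [v * scalar for v in values]


def placeholder_encoder(values):
    """Toy encoder (mean and max); a real branch would be a trained network."""
    return [sum(values) / len(values), max(values)]


def middle_layer_features(structure, sequence, condition, weights, bias):
    """Gate both inputs, encode each branch, then splice the features."""
    g = compress_condition(condition, weights, bias)
    structure_features = placeholder_encoder(scale(structure, g))
    sequence_features = placeholder_encoder(scale(sequence, g))
    # The uncompressed condition is spliced onto the middle-layer features.
    return structure_features + sequence_features + list(condition)


if __name__ == "__main__":
    features = middle_layer_features(
        structure=[2.0, 4.0],   # flattened voxel channels (toy values)
        sequence=[1.0, 1.0],    # encoded sequence (toy values)
        condition=[1, 0, 1],    # binary functional attributes
        weights=[0.0, 0.0, 0.0],
        bias=0.0,
    )
    print(features)
```

The point of the design is that the same compressed condition scales both branches, while the full attribute vector still reaches the decoder through the spliced middle-layer features.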
In one or more embodiments, the training process of the variational autoencoder model is specifically:
acquiring a preset number of polypeptide data samples to construct a training set, wherein each polypeptide data sample comprises a polypeptide sequence and the functional attribute information corresponding to that polypeptide sequence;
representing the functional attribute information corresponding to each polypeptide sequence as a binary sequence; analyzing each sample to obtain the three-dimensional spatial structure information of the polypeptide data; and, based on that three-dimensional spatial structure information, acquiring the structure information of each sample by the polypeptide multi-channel coloring method;
taking the structure information and the sequence information of each polypeptide data sample as the input of the variational autoencoder model, introducing the functional attribute information into the middle layer of the model, and taking the polypeptide sequence information and the polypeptide structure information of each polypeptide data sample as the output of the model, thereby training the variational autoencoder model.
The middle layer is used for splicing the features.
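The binary representation of the functional attribute information used in the training steps above can be illustrated as follows. The attribute vocabulary and the function name are hypothetical; the text only states that each bit stands for one functional attribute, with 1 meaning present and 0 meaning absent.

```python
# Hypothetical attribute vocabulary; the real list would cover every
# functional attribute found in the known polypeptide sequences.
ATTRIBUTES = ["antibacterial", "antifungal", "antiviral", "toxic"]


def encode_attributes(present, vocabulary=ATTRIBUTES):
    """Return one bit per attribute: 1 if the peptide has it, otherwise 0."""
    return [1 if name in present else 0 for name in vocabulary]


if __name__ == "__main__":
    print(encode_attributes({"antibacterial", "toxic"}))
```

The length of the returned list always equals the size of the vocabulary, so every sample yields a condition vector of the same shape.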
A schematic of the process of antibacterial peptide generation by the trained generative model is shown in FIG. 3.
In one or more embodiments, the polypeptide data samples may be collected from an existing polypeptide database and include polypeptide sequences and their corresponding functional attribute information, such as activity and toxicity.
In one or more embodiments, analyzing each sample to obtain the three-dimensional spatial structure information of the polypeptide data specifically includes: for polypeptides whose structure has been resolved, directly downloading the PDB file; and for polypeptides whose structure has not been resolved, predicting the three-dimensional spatial structure of each polypeptide with the AlphaFold2 model and storing the result as a PDB file.
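Once a PDB file is available, the atoms needed for coloring can be read from its fixed-column `ATOM` records. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the origin is taken as the unweighted centroid of the atoms (a stand-in for the center of gravity), and the voxel side length `unit` is a free parameter.

```python
import math


def parse_atoms(pdb_text):
    """Extract residue name, coordinates and element from PDB ATOM records."""
    atoms = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        atoms.append({
            "residue": line[17:20].strip(),   # columns 18-20
            "x": float(line[30:38]),          # columns 31-38
            "y": float(line[38:46]),          # columns 39-46
            "z": float(line[46:54]),          # columns 47-54
            "element": line[76:78].strip(),   # columns 77-78
        })
    return atoms


def voxel_indices(atoms, unit=1.0):
    """Shift coordinates to the centroid and bin them into cubic voxels."""
    n = len(atoms)
    cx = sum(a["x"] for a in atoms) / n
    cy = sum(a["y"] for a in atoms) / n
    cz = sum(a["z"] for a in atoms) / n
    return [
        (
            math.floor((a["x"] - cx) / unit),
            math.floor((a["y"] - cy) / unit),
            math.floor((a["z"] - cz) / unit),
        )
        for a in atoms
    ]
```

Each index returned by `voxel_indices` identifies the voxel that contains the corresponding atom; the element and residue name parsed alongside it are what the coloring step needs.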
Step 4: based on the fusion features of the sequence information and the structure information of the target antibacterial peptide, obtaining a function identification result of the target antibacterial peptide by using a pre-trained neural network model. A schematic diagram of the antibacterial peptide identification process is shown in FIG. 4.
The neural network model may be a convolutional neural network (CNN) model, and the fusion feature is the middle-layer feature obtained when the encoder of the variational autoencoder model encodes the sequence information and the structure information. The functional attribute information is not used in the polypeptide identification process.
In a specific embodiment, the loss function used in training the neural network model adopts a "softmax + cross-entropy" scheme, and the loss function is specifically expressed as follows:
[Formula not reproduced in this text.]
wherein $s_i$ represents the score of a non-target class, $s_j$ represents the score of a target class, the two index sets in the formula are the negative sample set (i.e., the set of classes other than the current class) and the positive sample set, respectively, and $e$ is the natural base.
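The formula itself did not survive in this text. As a hedged sketch only, one loss that is consistent with the symbol definitions above (scores of non-target classes summed over a negative set, scores of target classes summed over a positive set, and the natural base e) is log(1 + Σ e^{s_i}) + log(1 + Σ e^{-s_j}). This form is an assumption and is not confirmed by the text; the function name and arguments are hypothetical.

```python
import math


def softmax_cross_entropy_multilabel(scores, positive):
    """Assumed loss: log(1 + sum_neg e^{s_i}) + log(1 + sum_pos e^{-s_j}).

    scores   -- list of raw class scores
    positive -- set of indices of the target (positive) classes
    """
    # Non-target classes are penalized for high scores.
    negative_term = sum(math.exp(s) for k, s in enumerate(scores) if k not in positive)
    # Target classes are penalized for low scores.
    positive_term = sum(math.exp(-s) for k, s in enumerate(scores) if k in positive)
    return math.log(1.0 + negative_term) + math.log(1.0 + positive_term)


if __name__ == "__main__":
    good = softmax_cross_entropy_multilabel([5.0, -5.0], {0})
    bad = softmax_cross_entropy_multilabel([-5.0, 5.0], {0})
    print(good < bad)
```

Under this assumed form the loss falls as target classes score higher and non-target classes score lower, which is the behavior the symbol definitions suggest.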
To demonstrate the effectiveness of the scheme described in this embodiment, the following experiments were performed:
Table 1. Experimental results
Wherein F1-score represents the F1 score, a statistical metric used to measure the accuracy of a binary classification model;
avg represents the average value;
Bi-GRU (Bidirectional Gated Recurrent Unit) represents a bidirectional gated recurrent unit network, which is an RNN sequence model;
Multi-model represents multi-modal processing, which corresponds to the neural network model + multi-modal structure information scheme of this embodiment;
Multi-model+rebalance represents multi-modal processing with rebalancing, which corresponds to the neural network model + multi-modal + rebalanced cross-entropy function scheme of this embodiment.
As shown in Table 1, this embodiment uses six antibacterial prediction tasks as examples. Bi-GRU represents a method that uses the sequence only, and it achieves an average F1-score of only 0.39. By adding the multi-modal structure information, the method achieves an F1-score of 0.50, an improvement of 0.11 (more than ten percentage points) over the sequence-only method. Further adding the rebalanced cross-entropy function on this basis raises the F1-score of the model to 0.60.
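For reference, the averaged F1-score reported above can be computed as sketched below. The confusion counts in the example are toy values chosen for illustration, not the experimental data, and the function names are hypothetical.

```python
def f1_score(tp, fp, fn):
    """F1 = 2PR / (P + R) for one binary task, from confusion counts."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_f1(task_counts):
    """Unweighted mean of per-task F1 scores (the 'avg' column)."""
    return sum(f1_score(*counts) for counts in task_counts) / len(task_counts)


if __name__ == "__main__":
    # Toy (tp, fp, fn) counts for two tasks.
    print(average_f1([(5, 5, 5), (0, 3, 2)]))
```

Averaging per-task scores without weighting treats each of the prediction tasks equally, regardless of how many samples each task has.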
Embodiment two:
It is an object of this embodiment to provide an antibacterial peptide generation and identification system.
An antibacterial peptide generation and identification system, comprising:
a data acquisition unit for acquiring reference polypeptide sequence information and functional attribute information corresponding to a target antibacterial peptide;
a structure feature extraction unit for acquiring structure information of the reference polypeptide sequence based on a polypeptide multi-channel coloring method; the polypeptide multi-channel coloring method is specifically: the whole polypeptide is represented as a cube structure composed of a plurality of three-dimensional voxels, and the color of each three-dimensional voxel is determined by the values of three channels, namely the mass of the atom to which the three-dimensional voxel belongs, the solubility of the amino acid to which the atom belongs, and the acid-base property of the amino acid to which the atom belongs;
a target antibacterial peptide generation unit for inputting the reference polypeptide sequence information and the structure information into a pre-trained variational autoencoder model to obtain sequence information and structure information of the target antibacterial peptide; the variational autoencoder model receives the two modalities of polypeptide sequence information and structure information simultaneously, and fuses them with the functional attribute information while encoding the polypeptide sequence information and the structure information, thereby realizing conditional generation;
and a function identification unit for obtaining a function identification result of the target antibacterial peptide by using a pre-trained neural network model, based on the fusion features of the sequence information and the structure information of the target antibacterial peptide.
Further, the system in this embodiment corresponds to the method in embodiment one; its technical details are described in embodiment one and are not repeated here.
The above describes only preferred embodiments of the present invention and is not intended to limit the present invention; those skilled in the art may make various modifications and variations to the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included in the protection scope of the present invention.
Claims (10)
1. A method of producing and identifying an antimicrobial peptide, comprising:
acquiring reference polypeptide sequence information and functional attribute information corresponding to a target antibacterial peptide;
obtaining structure information of the reference polypeptide sequence based on a polypeptide multi-channel coloring method; the polypeptide multi-channel coloring method is specifically: the whole polypeptide is represented as a cube structure composed of a plurality of three-dimensional voxels, and the color of each three-dimensional voxel is determined by the values of three channels, namely the mass of the atom to which the three-dimensional voxel belongs, the solubility of the amino acid to which the atom belongs, and the acid-base property of the amino acid to which the atom belongs;
inputting the reference polypeptide sequence information and the structure information into a pre-trained variational autoencoder model to obtain sequence information and structure information of the target antibacterial peptide; the variational autoencoder model receives the two modalities of polypeptide sequence information and structure information simultaneously, and fuses them with the functional attribute information while encoding the polypeptide sequence information and the structure information, thereby realizing conditional generation;
the variational autoencoder model based on conditional compression comprises two branches, wherein the first branch is used for processing structure information and the second branch is used for processing sequence information; in order to embed the condition into the variational autoencoder, the condition information is compressed to one dimension by a multi-layer perceptron and then multiplied with both the structural features and the polypeptide sequence, thereby controlling the structure and the sequence simultaneously;
based on the fusion characteristics of the sequence information and the structure information of the target antibacterial peptide, a functional identification result of the target antibacterial peptide is obtained by utilizing a pre-trained neural network model.
2. The method for producing and identifying an antimicrobial peptide according to claim 1, wherein the functional attribute information is obtained from the functional attribute information of polypeptide sequences and is expressed as a binary sequence, wherein the length of the binary sequence is the total number of functional attributes existing in the various known polypeptide sequences, each bit in the binary sequence represents one functional attribute, a bit value of 1 indicates that the functional attribute is present, and a bit value of 0 indicates that the functional attribute is absent.
3. The method for generating and identifying an antimicrobial peptide according to claim 1, wherein the variational autoencoder model comprises two branches, a first branch for processing structure information and a second branch for processing sequence information; in order to embed the functional attribute information into the variational autoencoder, the functional attribute information is compressed to one dimension using a multi-layer perceptron and then multiplied with the input information to be encoded.
4. The method for generating and identifying the antibacterial peptide according to claim 1, wherein the training process of the variational autoencoder model is specifically as follows:
acquiring a preset number of polypeptide data samples to construct a training set, wherein the polypeptide data samples comprise polypeptide sequences and functional attribute information corresponding to the current polypeptide sequences;
representing the functional attribute information corresponding to the current polypeptide sequence by using a binary sequence; analyzing each sample to obtain three-dimensional space structure information of polypeptide data; based on the three-dimensional space structure information of the polypeptide data, acquiring the structure information of each sample by adopting a polypeptide multi-channel coloring method;
and taking the structure information and the sequence information of each polypeptide data sample as the input of the variational autoencoder model, introducing the functional attribute information into the middle layer of the variational autoencoder model, and taking the polypeptide sequence information and the polypeptide structure information of each polypeptide data sample as the output of the variational autoencoder model, thereby training the variational autoencoder model.
5. The method for producing and identifying an antimicrobial peptide according to claim 1, wherein the whole polypeptide is represented as a cubic structure consisting of a plurality of three-dimensional voxels, in particular: constructing a three-dimensional space coordinate system by taking the gravity center of the polypeptide as an origin and a cube structure containing the whole polypeptide; and dividing the cube structure containing the whole polypeptide into a plurality of three-dimensional voxels by taking the preset unit distance as the side length of the three-dimensional voxels.
6. The method for generating and identifying the antibacterial peptide according to claim 1, wherein the coloring of each three-dimensional voxel is based on the concept of RGB multi-channel imaging, and each channel is respectively assigned by utilizing the mass of an atom to which the three-dimensional voxel belongs, the solubility of an amino acid to which the atom belongs and the acid-base property of the amino acid to which the atom belongs, so as to obtain the multi-channel representation of the color of each three-dimensional voxel.
7. An antimicrobial peptide generation and identification method as claimed in claim 1, wherein for three-dimensional voxels that do not contain atoms, each channel for color representation is assigned a value of zero.
8. The method for producing and identifying an antibacterial peptide according to claim 1, wherein the solubility of the amino acid is divided into hydrophilic and hydrophobic, and different values are set for the hydrophilic amino acid and the hydrophobic amino acid; the acid-base properties of the amino acids are classified into acidic, neutral and alkaline, wherein the acidic amino acids, the neutral amino acids and the alkaline amino acids are set to different values.
9. An antimicrobial peptide generation and recognition method according to claim 1, wherein the selection of the reference polypeptide sequence is based on functional requirements of the antimicrobial peptide of interest from a library of polypeptide sequences of known functional information.
10. An antimicrobial peptide generation and recognition system, comprising:
a data acquisition unit for acquiring reference polypeptide sequence information and functional attribute information corresponding to a target antibacterial peptide;
a structure feature extraction unit for acquiring structure information of the reference polypeptide sequence based on a polypeptide multi-channel coloring method; the polypeptide multi-channel coloring method is specifically: the whole polypeptide is represented as a cube structure composed of a plurality of three-dimensional voxels, and the color of each three-dimensional voxel is determined by the values of three channels, namely the mass of the atom to which the three-dimensional voxel belongs, the solubility of the amino acid to which the atom belongs, and the acid-base property of the amino acid to which the atom belongs;
a target antibacterial peptide generation unit for inputting the reference polypeptide sequence information and the structure information into a pre-trained variational autoencoder model to obtain sequence information and structure information of the target antibacterial peptide; the variational autoencoder model receives the two modalities of polypeptide sequence information and structure information simultaneously, and fuses them with the functional attribute information while encoding the polypeptide sequence information and the structure information, thereby realizing conditional generation;
the variational autoencoder model based on conditional compression comprises two branches, wherein the first branch is used for processing structure information and the second branch is used for processing sequence information; in order to embed the condition into the variational autoencoder, the condition information is compressed to one dimension by a multi-layer perceptron and then multiplied with both the structural features and the polypeptide sequence, thereby controlling the structure and the sequence simultaneously;
and the function recognition unit is used for obtaining a function recognition result of the target antibacterial peptide by utilizing a pre-trained neural network model based on the fusion characteristics of the sequence information and the structure information of the target antibacterial peptide.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310483081.4A CN116206690B (en) | 2023-05-04 | 2023-05-04 | Antibacterial peptide generation and identification method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116206690A CN116206690A (en) | 2023-06-02 |
CN116206690B CN116206690B (en) | 2023-08-08 |
Family
ID=86508010
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310483081.4A Active CN116206690B (en) | 2023-05-04 | 2023-05-04 | Antibacterial peptide generation and identification method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116206690B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||