CN116206690B - Antibacterial peptide generation and identification method and system - Google Patents


Info

Publication number
CN116206690B
Authority
CN
China
Prior art keywords
information
polypeptide
antibacterial peptide
sequence
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310483081.4A
Other languages
Chinese (zh)
Other versions
CN116206690A (en)
Inventor
李延青
王悦
龚海帆
李晓娟
李理想
左秀丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu Hospital of Shandong University
Original Assignee
Qilu Hospital of Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu Hospital of Shandong University
Priority to CN202310483081.4A
Publication of CN116206690A
Application granted
Publication of CN116206690B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B35/20 Screening of libraries
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H70/00 ICT specially adapted for the handling or processing of medical references
    • G16H70/40 ICT specially adapted for the handling or processing of medical references relating to drugs, e.g. their side effects or intended usage
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A50/00 TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE in human health protection, e.g. against extreme weather
    • Y02A50/30 Against vector-borne diseases, e.g. mosquito-borne, fly-borne, tick-borne or waterborne diseases whose impact is exacerbated by climate change

Landscapes

  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Artificial Intelligence (AREA)
  • Biochemistry (AREA)
  • Molecular Biology (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Toxicology (AREA)
  • Primary Health Care (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention provides a method and a system for generating and identifying antibacterial peptides, belonging to the technical field of computer-aided drug research and development. The scheme comprises the following steps: acquiring reference polypeptide sequence information and the functional attribute information corresponding to a target antibacterial peptide; obtaining structural information of the reference polypeptide sequence using a polypeptide multi-channel coloring method; inputting the reference polypeptide sequence information and structural information into a pre-trained variational autoencoder model to obtain the sequence information and structural information of the target antibacterial peptide; and obtaining a functional identification result for the target antibacterial peptide from a pre-trained neural network model, based on fused features of the target antibacterial peptide's sequence information and structural information. The scheme can generate antibacterial peptides with specified properties that differ from the antibacterial peptides existing in nature; in addition, the multimodal identification method effectively improves the accuracy of polypeptide identification.

Description

Antibacterial peptide generation and identification method and system
Technical Field
The invention belongs to the technical field of computer-aided drug research and development, and particularly relates to an antibacterial peptide generation and identification method and system.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
A polypeptide is a chain of amino acids connected by peptide bonds, and polypeptides are widely used in fields such as cardiovascular disease, cytokine mimicry, and antibacterial therapy. However, naturally occurring polypeptides are limited in number and narrow in function, so there is growing interest in designing and artificially synthesizing polypeptides. Two main classes of methods currently exist for obtaining polypeptide sequences with specific functions:
The first class comprises numerical simulation and calculation based on physical and chemical methods. These methods are computationally expensive, inefficient, and of limited accuracy, so they are difficult to apply to large-scale polypeptide engineering;
The second class comprises machine learning methods, which predict polypeptide function rapidly by learning from large datasets with function labels. However, the predictive performance of such methods depends heavily on the selection and processing of data features (i.e., feature engineering) and on the feature learning method. If the constructed features are inadequate, the model's prediction of polypeptide function is seriously degraded. In recent years, with the development of deep learning, many researchers have attempted to generate or identify polypeptides with specific functions using sequence-based models such as recurrent neural networks (RNNs), gated recurrent units (GRUs), and long short-term memory networks (LSTMs). However, from first principles, the structure of a protein or polypeptide has a crucial influence on its function, whereas these earlier sequence-based approaches ignore the structural features of the polypeptide. A model is therefore needed that considers both sequence information and structural information in order to generate and identify specific antibacterial peptides more accurately.
Recently, structure-based approaches have also appeared for predicting the affinity between proteins and small molecules. For example, many researchers predict protein or polypeptide function by constructing a graph neural network (GNN) to represent and learn from amino acid sequences. Most GNN-based methods use amino acids as the nodes of the graph, use the physicochemical properties of the amino acids as node features, and use the distances between the alpha carbon atoms of the amino acids as the edges connecting them. However, these methods ignore the fact that polypeptide sequence lengths often follow a long-tail distribution, i.e., the number of sequences of a given length decreases steadily as the length increases. Because the dimension of the adjacency matrix must match the maximum number of amino acids, the adjacency matrix becomes too sparse. In addition, because the amino acids in a polypeptide sequence are connected in series, feature propagation often relies on a single path, which makes it difficult to pass features between amino acids that are far apart. A more reasonable method for representing and learning polypeptide sequences is therefore needed to overcome these drawbacks and predict polypeptide function more accurately.
Meanwhile, the properties of naturally occurring polypeptides are often unevenly distributed: the number of antibacterial peptides that inhibit one type of drug-resistant bacteria may far exceed the number that inhibit another type, and a single antibacterial peptide may inhibit several types of drug-resistant bacteria. This causes serious data imbalance in antibacterial peptide identification, so that the model is biased toward the majority classes and neglects the minority classes. A method for alleviating sample distribution imbalance is therefore also needed to identify minority classes of antibacterial peptides.
Disclosure of Invention
To solve the above problems, the invention provides an antibacterial peptide generation and identification method and system. By introducing preset functional attribute information during generation, the scheme can generate antibacterial peptides with specified attributes that differ from the antibacterial peptides existing in nature. For identification, a multimodal method combines the sequence features of the polypeptide with its structural features, which effectively improves identification accuracy and allows polypeptide function to be predicted more accurately. This can greatly improve the efficiency of computer-aided polypeptide drug research and development and help screen out more suitable drugs, thereby reducing the cost of drug development.
According to a first aspect of an embodiment of the present invention, there is provided an antibacterial peptide generation and identification method, including:
acquiring reference polypeptide sequence information and functional attribute information corresponding to a target antibacterial peptide;
obtaining structural information of the reference polypeptide sequence based on a polypeptide multi-channel coloring method, in which the whole polypeptide is represented as a cube structure composed of a plurality of three-dimensional voxels, and the color of each three-dimensional voxel is determined by the values of three channels: the mass of the atom to which the voxel belongs, the solubility of the amino acid to which that atom belongs, and the acid-base property of the amino acid to which that atom belongs;
inputting the reference polypeptide sequence information and the structural information into a pre-trained variational autoencoder model to obtain sequence information and structural information of the target antibacterial peptide, wherein the variational autoencoder model receives the two modalities (polypeptide sequence information and structural information) simultaneously and fuses them with the functional attribute information during encoding, thereby achieving conditional generation;
obtaining a functional identification result of the target antibacterial peptide using a pre-trained neural network model, based on fused features of the sequence information and structural information of the target antibacterial peptide.
Further, the functional attribute information is obtained from the functional attribute labels of the polypeptide sequences and is expressed as a binary sequence. The length of the binary sequence equals the total number of distinct functional attributes found across the known polypeptide sequences, and each bit represents one functional attribute: a bit of 1 indicates that the polypeptide has the attribute, and a bit of 0 indicates that it does not.
Further, the variational autoencoder model comprises two branches: the first branch processes structural information and the second processes sequence information. To embed the functional attribute information into the variational autoencoder, the functional attribute information is compressed to one dimension by a multi-layer perceptron and then multiplied with the input information to be encoded.
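The conditioning step described above (compressing the attribute vector to one dimension and multiplying it into the input) can be illustrated with a minimal NumPy sketch. The layer sizes, the ReLU activation, the random weights, and the names `mlp_gate` and `condition_input` are all assumptions made for illustration; the text does not specify the perceptron's architecture.

```python
import numpy as np

def mlp_gate(labels, w1, b1, w2, b2):
    """Compress a binary attribute vector to one scalar with a two-layer MLP."""
    hidden = np.maximum(0.0, labels @ w1 + b1)  # ReLU hidden layer (assumed)
    return float(hidden @ w2 + b2)              # one-dimensional output

def condition_input(features, labels, params):
    """Multiply the features to be encoded by the scalar condition."""
    return features * mlp_gate(labels, *params)

rng = np.random.default_rng(0)
n_attributes, hidden_dim = 6, 4                 # hypothetical sizes
params = (
    rng.normal(size=(n_attributes, hidden_dim)),
    np.zeros(hidden_dim),
    rng.normal(size=hidden_dim),
    0.0,
)
labels = np.array([1, 0, 0, 1, 0, 1], dtype=float)  # example attribute bits
features = rng.normal(size=(8, 16))                  # stand-in for branch input
conditioned = condition_input(features, labels, params)
```

In a trained model the perceptron weights would be learned jointly with the encoder; here they are random so that the sketch stays self-contained.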
Further, the training process of the variational autoencoder model specifically comprises the following steps:
acquiring a preset number of polypeptide data samples to construct a training set, wherein each polypeptide data sample comprises a polypeptide sequence and the functional attribute information corresponding to that sequence;
representing the functional attribute information of each polypeptide sequence as a binary sequence; analyzing each sample to obtain the three-dimensional spatial structure of the polypeptide; and, based on that three-dimensional structure, obtaining the structural information of each sample using the polypeptide multi-channel coloring method;
taking the structural information and sequence information of each polypeptide data sample as the input of the variational autoencoder model, introducing the functional attribute information into the intermediate layer of the model, and taking the polypeptide sequence information and structural information of the same sample as the output of the model, thereby training the variational autoencoder model.
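The training objective is not spelled out in this section. As a hedged illustration only, the sketch below shows the standard variational autoencoder ingredients that such a training loop would typically rely on: the reparameterization trick and a loss combining reconstruction error with a KL divergence term. The squared-error reconstruction term and the function names are assumptions, not details taken from the text.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps drawn from N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_divergence(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

def vae_loss(x, x_recon, mu, log_var):
    """Mean over the batch of squared reconstruction error plus KL term."""
    recon = np.sum((x - x_recon) ** 2, axis=-1)  # assumed reconstruction term
    return float(np.mean(recon + kl_divergence(mu, log_var)))
```

For the two-branch model described above, one reconstruction term per modality (sequence and structure) would presumably be summed, but that weighting is not stated here.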
Further, representing the whole polypeptide as a cube structure composed of a plurality of three-dimensional voxels specifically comprises: constructing a three-dimensional coordinate system with the center of gravity of the polypeptide as the origin, together with a cube structure containing the whole polypeptide; and dividing the cube structure into a plurality of three-dimensional voxels whose side length is a preset unit distance.
Furthermore, the coloring of each three-dimensional voxel follows the idea of RGB multi-channel imaging: the mass of the atom to which the voxel belongs, the solubility of the amino acid to which that atom belongs, and the acid-base property of that amino acid are assigned to separate channels, giving a multi-channel representation of the voxel's color.
Further, for three-dimensional voxels that contain no atoms, every color channel is assigned a value of zero.
Further, the solubility of the amino acid is classified as hydrophilic or hydrophobic, and hydrophilic and hydrophobic amino acids are assigned different values; the acid-base property of the amino acid is classified as acidic, neutral, or basic, and acidic, neutral, and basic amino acids are assigned different values.
Further, the reference polypeptide sequence is selected from a library of polypeptide sequences with known functional information, according to the functional requirements of the target antibacterial peptide.
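Selecting a reference sequence by functional requirement can be sketched as a filter over a labeled library. The sequences, the attribute bits, and the name `select_references` below are placeholders invented for illustration; the rule "every required bit must be set" is one plausible reading of "functional requirements", not a rule stated in the text.

```python
def select_references(library, required):
    """Return sequences whose attribute bits include every required bit."""
    return [
        seq
        for seq, bits in library
        if all(b >= r for b, r in zip(bits, required))
    ]

# Toy library: (sequence, attribute bits). All entries are hypothetical.
library = [
    ("KKLLKK", (1, 1, 0)),
    ("GLFDIV", (1, 0, 1)),
    ("AAAAAA", (0, 0, 0)),
]

candidates = select_references(library, required=(1, 0, 0))
```

With the toy library above, both sequences whose first attribute bit is set are returned as candidates.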
According to a second aspect of embodiments of the present invention, there is provided an antibacterial peptide generation and identification system comprising:
a data acquisition unit for acquiring reference polypeptide sequence information and functional attribute information corresponding to a target antibacterial peptide;
a structural feature extraction unit for obtaining structural information of the reference polypeptide sequence based on a polypeptide multi-channel coloring method, in which the whole polypeptide is represented as a cube structure composed of a plurality of three-dimensional voxels, and the color of each three-dimensional voxel is determined by the values of three channels: the mass of the atom to which the voxel belongs, the solubility of the amino acid to which that atom belongs, and the acid-base property of the amino acid to which that atom belongs;
a target antibacterial peptide generation unit for inputting the reference polypeptide sequence information and the structural information into a pre-trained variational autoencoder model to obtain sequence information and structural information of the target antibacterial peptide, wherein the variational autoencoder model receives the two modalities (polypeptide sequence information and structural information) simultaneously and fuses them with the functional attribute information during encoding, thereby achieving conditional generation;
a function identification unit for obtaining a functional identification result of the target antibacterial peptide using a pre-trained neural network model, based on fused features of the sequence information and structural information of the target antibacterial peptide.
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention provides a method and a system for generating and identifying antibacterial peptides. Functional attribute information is introduced during generation, so that antibacterial peptides with specified attributes can be generated, and the generated antibacterial peptides differ from those existing in nature. For identification, a multimodal method combines the sequence features of the polypeptide with its structural features, which effectively improves identification accuracy and allows polypeptide function to be predicted more accurately. This can greatly improve the efficiency of computer-aided polypeptide drug research and development and help screen out more suitable drugs, thereby reducing the cost of drug development.
(2) The scheme is based on a polypeptide multi-channel coloring method, in which each polypeptide is treated as a cube structure composed of three-dimensional voxels (3D pixels), each carrying three color channels, to obtain a structural feature representation of the polypeptide. The method captures additional effective features and alleviates the degradation in deep learning model performance caused by the long-tail distribution of sequence lengths.
(3) The scheme greatly improves the accuracy of multi-label classification by extending the Softmax loss function, originally used for multi-class classification, to the multi-label classification task (namely, antibacterial peptide identification).
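This section does not give the formula for the extended Softmax loss. As an illustration only, the sketch below implements one commonly used way of extending softmax cross-entropy to multi-label targets, in which every positive-class score is pushed above zero and every negative-class score below zero: loss = log(1 + Σ_neg exp(s_i)) + log(1 + Σ_pos exp(−s_j)). Whether the patent uses this exact form is an assumption, and the name `multilabel_softmax_loss` is hypothetical.

```python
import numpy as np

def multilabel_softmax_loss(scores, targets):
    """Per-sample loss: log(1 + sum_neg e^s) + log(1 + sum_pos e^-s)."""
    scores = np.asarray(scores, dtype=float)
    targets = np.asarray(targets, dtype=bool)
    # Mask out the other group with -inf so it contributes exp(-inf) = 0.
    neg = np.where(targets, -np.inf, scores)
    pos = np.where(targets, -scores, -np.inf)
    # Appending a zero score supplies the "1 +" inside each logarithm.
    zeros = np.zeros(scores.shape[:-1] + (1,))
    neg_term = np.logaddexp.reduce(np.concatenate([neg, zeros], axis=-1), axis=-1)
    pos_term = np.logaddexp.reduce(np.concatenate([pos, zeros], axis=-1), axis=-1)
    return neg_term + pos_term
```

Because the two sums are taken over whole groups of labels rather than per label, a formulation of this kind is less sensitive to the ratio of positive to negative labels than independent per-label binary losses, which is relevant to the class imbalance discussed in the background.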
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a flowchart of an antimicrobial peptide generation and identification method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a model training process according to an embodiment of the present invention;
FIG. 3 is a schematic diagram showing the process of generating antimicrobial peptides by using a trained generation model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram showing the identification process of the antibacterial peptide according to the embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Embodiments of the invention and features of the embodiments may be combined with each other without conflict.
Embodiment one:
The aim of this embodiment is to provide an antibacterial peptide generation and identification method.
To solve the problems in the prior art, this embodiment provides an antibacterial peptide generation and identification method based on the following technical concept: each functional attribute of a polypeptide, namely its ability to inhibit a specific bacterium (the bacteria can be chosen according to actual requirements; broad-spectrum drug-resistant bacteria are used in this embodiment), is defined as one label, so that a conditional generation task and an identification task based on multi-label information are constructed. In addition, during polypeptide generation and identification, polypeptide structural features are extracted with the polypeptide multi-channel coloring method and combined with polypeptide sequence information, so that generation and identification are based on multimodal data. As shown in FIG. 1, the method comprises the following steps:
Step 1: acquiring reference polypeptide sequence information and functional attribute information corresponding to a target antibacterial peptide;
The functional attribute information is obtained from the functional attribute labels of the polypeptide sequences. In a specific embodiment, it may be represented as a binary sequence whose length equals the total number of distinct functional attributes found across the known polypeptide sequences. Each bit represents one functional attribute: a bit of 1 indicates that the polypeptide has the attribute, and a bit of 0 indicates that it does not.
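The binary label sequence described above is a multi-hot encoding over a fixed attribute vocabulary. A minimal sketch follows; the attribute names in `ATTRIBUTES` and the function name `encode_attributes` are hypothetical and chosen only for illustration.

```python
# Hypothetical attribute vocabulary; the real list would be collected
# from all labeled polypeptide sequences in the dataset.
ATTRIBUTES = ["anti_gram_positive", "anti_gram_negative", "antifungal", "antiviral"]

def encode_attributes(present, vocabulary=ATTRIBUTES):
    """Return one bit per attribute: 1 if the polypeptide has it, else 0."""
    present = set(present)
    unknown = present - set(vocabulary)
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    return [1 if name in present else 0 for name in vocabulary]

bits = encode_attributes(["antifungal", "anti_gram_positive"])
```

The bit order follows the vocabulary order, so the same vocabulary must be used for training, generation, and identification.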
In a specific embodiment, the reference polypeptide sequence is selected from a library of polypeptide sequences with known functional information, according to the functional requirements of the target antibacterial peptide.
Step 2: obtaining structural information of the reference polypeptide sequence based on a polypeptide multi-channel coloring method, in which the whole polypeptide is represented as a cube structure composed of a plurality of three-dimensional voxels, and the color of each three-dimensional voxel is determined by the values of three channels: the mass of the atom to which the voxel belongs, the solubility of the amino acid to which that atom belongs, and the acid-base property of the amino acid to which that atom belongs;
In a specific embodiment, the polypeptide structure information is obtained by structural analysis of the polypeptide sequence, and the polypeptide structural features are then obtained from the three-dimensional structure using the polypeptide multi-channel coloring method. For polypeptides whose structures have already been resolved, the PDB (Protein Data Bank) file is downloaded directly; for polypeptides whose structures have not been resolved, the three-dimensional structure of each polypeptide is predicted with the AlphaFold2 model and stored as a PDB file.
Specifically, representing the whole polypeptide as a cube structure composed of a plurality of three-dimensional voxels comprises: constructing a three-dimensional coordinate system with the center of gravity of the polypeptide as the origin, together with a cube structure containing the whole polypeptide; and dividing the cube structure into a plurality of three-dimensional voxels whose side length is a preset unit distance. The coloring of each three-dimensional voxel follows the idea of RGB multi-channel imaging: the mass of the atom to which the voxel belongs, the solubility of the amino acid to which that atom belongs, and the acid-base property of that amino acid are assigned to separate channels, giving a multi-channel representation of the voxel's color. For three-dimensional voxels that contain no atoms, every color channel is assigned a value of zero.
Specifically, the polypeptide multi-channel coloring method colors the polypeptide structure in multiple channels using known information: the three-dimensional coordinates and van der Waals radius of each atom in the polypeptide, the mass of the atom, and the solubility and acid-base property of its amino acid. It comprises the following steps:
In this embodiment, the spatial coordinates of each atom of the polypeptide are used to determine the space occupied by the polypeptide. Specifically, a three-dimensional coordinate system (with x, y, and z axes) is constructed with the center of gravity of the polypeptide as the origin, and a cube structure containing the whole polypeptide is constructed. The coordinate system used in this embodiment is a three-dimensional Cartesian coordinate system, and the unit distance on each axis is set to 1 angstrom (a metric unit of length; 1 angstrom equals 0.1 nanometer);
the cube structure containing the whole polypeptide is divided into a plurality of three-dimensional voxels whose side length is the preset unit distance (1 angstrom in this embodiment);
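The two steps above (centering on the center of gravity, then cutting the enclosing cube into 1-angstrom voxels) can be sketched as follows. This is an illustrative NumPy sketch: the function name `voxelize_positions`, the optional mass weighting, and the rounding of the half-side up to a multiple of 8 are assumptions.

```python
import numpy as np

def voxelize_positions(coords, masses=None, voxel_size=1.0, multiple=8):
    """Map atom coordinates to integer voxel indices in a centered cube.

    Returns (indices, side), where side is the cube edge in voxels.
    """
    coords = np.asarray(coords, dtype=float)
    weights = None if masses is None else np.asarray(masses, dtype=float)
    # Center of gravity (unweighted centroid if no masses are given).
    center = np.average(coords, axis=0, weights=weights)
    centered = coords - center
    # Half-side in voxels, large enough to contain every atom center.
    half = int(np.floor(np.abs(centered).max() / voxel_size)) + 1
    half = -(-half // multiple) * multiple  # round up to a multiple
    indices = np.floor(centered / voxel_size).astype(int) + half
    return indices, 2 * half

coords = [[0, 0, 0], [2, 0, 0], [0, 2, 0], [0, 0, 2]]
indices, side = voxelize_positions(coords)
```

Shifting by the half-side makes all indices non-negative, so they can be used directly to address a dense array of shape `(side, side, side, 3)`.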
In one or more embodiments, the cube region may cover only a portion of the entire polypeptide. For example, assuming that only atoms within L angstroms of the origin in the positive and negative directions of each coordinate axis are considered, the cube region has a length, width, and height of 2L angstroms, where L may be any positive integer and is typically set to a multiple of 8 for convenience of subsequent feature extraction. Each atom is assigned to a group of voxels according to its van der Waals radius (the radii used, in angstroms, are: H (hydrogen) 1, C (carbon) 1.5, N (nitrogen) 1.5, O (oxygen) 1.5, S (sulfur) 2), so that the space the atom occupies in the cube is determined by its van der Waals radius.
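How an atom's van der Waals radius translates into a group of voxels is not detailed further. One simple reading, assumed here, is that an atom occupies every voxel whose center lies within its radius. The radii are the values listed above; the function name `occupied_voxels` and the center-inside rule are assumptions.

```python
import numpy as np

# Van der Waals radii in angstroms, as listed in the text above.
VDW_RADIUS = {"H": 1.0, "C": 1.5, "N": 1.5, "O": 1.5, "S": 2.0}

def occupied_voxels(center, element, half_side, voxel_size=1.0):
    """Indices of voxels whose centers lie within the atom's radius.

    `center` is the atom position in the polypeptide-centered frame and
    `half_side` is L, so the grid has 2L voxels along each axis.
    """
    radius = VDW_RADIUS[element]
    side = 2 * half_side
    # Coordinates of voxel centers along one axis.
    axis = (np.arange(side) - half_side + 0.5) * voxel_size
    gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
    dist2 = (gx - center[0]) ** 2 + (gy - center[1]) ** 2 + (gz - center[2]) ** 2
    return np.argwhere(dist2 <= radius ** 2)

hydrogen = occupied_voxels((0.5, 0.5, 0.5), "H", half_side=8)
carbon = occupied_voxels((0.5, 0.5, 0.5), "C", half_side=8)
```

For an atom sitting exactly on a voxel center, this rule marks the voxel itself plus its face neighbors for hydrogen, and additionally the edge neighbors for carbon. Building the full meshgrid per atom is wasteful for large grids; restricting the search to a small window around the atom would be the obvious optimization.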
In a specific embodiment, each three-dimensional voxel is colored following the idea of RGB multi-channel imaging: the mass of the atom to which the voxel belongs, the solubility of the amino acid containing that atom, and the acid-base property of that amino acid are each assigned to one channel, giving a multi-channel representation of the voxel color. Every channel of a three-dimensional voxel that contains no atom is set to zero. Specifically:
Polypeptide coloring treats the property of the atom itself (its atomic mass), the solubility of the amino acid to which the atom belongs, and the acid-base property of that amino acid as three separate channels, and fills the color of each three-dimensional voxel according to the atom it contains. For the atomic property channel, this embodiment rounds the atomic mass and writes it directly to the first channel of the voxel color. For the solubility and acid-base channels, the spatial positions of several atoms together represent the position of one amino acid. Amino acids are therefore divided by solubility into hydrophobic and hydrophilic. Because each channel of a voxel takes values from 0 to 255 and the background (the region of three-dimensional space without any atom) has the value 0, atoms of hydrophobic amino acids are assigned 128 and atoms of hydrophilic amino acids are assigned 255, so that the two classes are easy to distinguish from each other and from the background; the second channel of each voxel is filled according to the solubility class of the amino acid to which its atom belongs. Similarly, amino acids are classified as acidic, neutral or basic and assigned 86, 168 and 255 respectively, and the third channel of each voxel is filled according to the acid-base class of the amino acid to which its atom belongs.
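The three-channel assignment can be sketched as a small lookup. The channel values (rounded atomic mass; 128/255; 86/168/255; 0 for background) come from the text above. The grouping of residues into hydrophobic, acidic and basic sets is not given in the source and is a common textbook grouping assumed here; the names `voxel_color` and `BACKGROUND` are likewise hypothetical.

```python
# Standard atomic masses for the five elements named in the embodiment.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}

# Assumed residue groupings (three-letter codes); the source does not list them.
HYDROPHOBIC = {"ALA", "VAL", "LEU", "ILE", "MET", "PHE", "TRP", "PRO", "GLY"}
ACIDIC = {"ASP", "GLU"}
BASIC = {"LYS", "ARG", "HIS"}

# Voxels that contain no atom are zero in every channel.
BACKGROUND = (0, 0, 0)


def voxel_color(element, residue):
    """Return (mass, solubility, acid_base) channel values for one voxel."""
    mass = round(ATOMIC_MASS[element])                    # channel 1
    solubility = 128 if residue in HYDROPHOBIC else 255   # channel 2
    if residue in ACIDIC:                                 # channel 3
        acid_base = 86
    elif residue in BASIC:
        acid_base = 255
    else:
        acid_base = 168
    return (mass, solubility, acid_base)
```

For instance, a carbon atom in leucine maps to `(12, 128, 168)` under these assumed groupings, while a nitrogen atom in lysine maps to `(14, 255, 255)`.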
Step 3: inputting the reference polypeptide sequence information and the structure information into a pre-trained variational autoencoder model to obtain sequence information and structure information of a target antibacterial peptide; the variational autoencoder model receives the two modalities, polypeptide sequence information and structure information, at the same time, and fuses them with the functional attribute information while encoding them, thereby realizing conditional generation;
In a specific implementation, this embodiment proposes a variational autoencoder model based on conditional compression, as shown in the generation-model training process in FIG. 2. The variational autoencoder is modified so that it takes as input, and reconstructs, two modalities at the same time (the polypeptide sequence and the polypeptide structure information). In order to generate polypeptide sequences with specific functions, the functional attribute information of step 1 is used in the generation task in two ways. On the one hand, the uncompressed functional attribute information is fused into the middle-layer features to realize conditional generation. On the other hand, because the functional attribute information cannot be spliced directly onto the three-dimensional voxels of the structure, it is compressed to one dimension by a fully connected layer, and the resulting one-dimensional feature is multiplied with the original input information (i.e., the structural features and the reference polypeptide sequence), thereby realizing efficient conditional generation.
The variational autoencoder model based on conditional compression comprises two branches: the first branch processes structure information and the second branch processes sequence information. In order to embed the condition into the variational autoencoder, a multi-layer perceptron compresses the condition information (i.e., the functional attribute information) to one dimension, which is then multiplied with both the structural features and the polypeptide sequence, so that the structure and the sequence are controlled at the same time. A conventional variational autoencoder, by contrast, typically receives input from only one modality and splices the condition directly onto the features. The middle-layer feature is the feature vector formed by directly splicing the outputs of the structure encoder and the sequence encoder.
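The two uses of the condition can be made concrete with a toy sketch. It is only an illustration under stated assumptions: "one dimension" is read as a single scalar, a single linear layer with a sigmoid stands in for the multi-layer perceptron, and the uncompressed condition is fused into the middle-layer feature by concatenation. The function names and the weights are hypothetical, and the encoders themselves are omitted.

```python
import math


def compress_condition(condition, weights, bias):
    """Compress the attribute vector to one scalar (assumed: linear layer + sigmoid)."""
    z = sum(w * c for w, c in zip(weights, condition)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def apply_gate(values, gate):
    """Multiply an input (flattened voxel channels or sequence encoding) by the scalar."""
    return [gate * v for v in values]


def middle_layer_features(structure_code, sequence_code, condition):
    """Splice both encoder outputs, then append the uncompressed condition (assumed fusion)."""
    return list(structure_code) + list(sequence_code) + list(condition)
```

In a full model the gated inputs would pass through the structure and sequence encoders before `middle_layer_features` is formed; the sketch shows only where the compressed and uncompressed conditions enter.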
In one or more embodiments, the training process of the variational autoencoder model is specifically:
acquiring a preset number of polypeptide data samples to construct a training set, wherein the polypeptide data samples comprise polypeptide sequences and functional attribute information corresponding to the current polypeptide sequences;
representing the functional attribute information corresponding to the current polypeptide sequence by using a binary sequence; analyzing each sample to obtain three-dimensional space structure information of polypeptide data; based on the three-dimensional space structure information of the polypeptide data, acquiring the structure information of each sample by adopting a polypeptide multi-channel coloring method;
and taking the structure information and the sequence information of each polypeptide data sample as the input of the variational autoencoder model, introducing the functional attribute information into the middle layer of the variational autoencoder model, and taking the polypeptide sequence information and the polypeptide structure information of each polypeptide data sample as the output of the variational autoencoder model, so as to train the variational autoencoder model.
Here, the middle layer is used for splicing the features.
A schematic of the antibacterial peptide generation process using the trained generation model is shown in FIG. 3.
In one or more embodiments, the polypeptide data samples may be collected from an existing polypeptide database and include polypeptide sequences and their corresponding functional attribute information, such as activity and toxicity.
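The binary representation of the functional attribute information used during training can be sketched as a multi-hot vector with one bit per known attribute. The attribute vocabulary below is hypothetical; the source names only activity and toxicity as examples.

```python
# Hypothetical attribute vocabulary; one bit per attribute, in this fixed order.
ATTRIBUTES = ["antibacterial", "antifungal", "antiviral", "hemolytic", "toxic"]


def encode_attributes(present, vocabulary=ATTRIBUTES):
    """Return a binary list: 1 where the attribute is present, 0 otherwise."""
    unknown = set(present) - set(vocabulary)
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    return [1 if name in present else 0 for name in vocabulary]
```

With this vocabulary, a peptide annotated as antibacterial and toxic is encoded as `[1, 0, 0, 0, 1]`.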
In one or more embodiments, analyzing each sample to obtain the three-dimensional spatial structure information of the polypeptide data specifically includes: for polypeptides whose structure has been resolved, directly downloading the PDB file; and for polypeptides whose structure has not been resolved, predicting the three-dimensional spatial structure of each polypeptide with the AlphaFold2 model and storing it as a PDB file.
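Whichever route produces the PDB file, the atom coordinates needed for voxelization can be read from its fixed-column `ATOM` records. The following is a minimal sketch that assumes standard PDB column positions; the function name `parse_pdb_atoms` is hypothetical, and the fallback that takes the element from the first letter of the atom name is a heuristic suited only to standard amino acid atoms.

```python
def parse_pdb_atoms(text):
    """Extract (element, residue, x, y, z) from the ATOM records of a PDB file."""
    atoms = []
    for line in text.splitlines():
        if not line.startswith("ATOM"):
            continue  # skip HETATM, TER, END and header records
        residue = line[17:20].strip()   # columns 18-20: residue name
        x = float(line[30:38])          # columns 31-38
        y = float(line[38:46])          # columns 39-46
        z = float(line[46:54])          # columns 47-54
        # Columns 77-78 hold the element symbol; fall back to the atom name.
        element = line[76:78].strip() or line[12:16].strip()[0]
        atoms.append((element, residue, x, y, z))
    return atoms
```

The element feeds the mass channel and the van der Waals radius, while the residue name feeds the solubility and acid-base channels.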
Step 4: based on the fused features of the sequence information and the structure information of the target antibacterial peptide, a functional identification result of the target antibacterial peptide is obtained by using a pre-trained neural network model. A schematic diagram of the antibacterial peptide identification process is shown in FIG. 4.
The neural network model may be a convolutional neural network (CNN), and the fused feature is the middle-layer feature obtained when the encoder of the variational autoencoder model encodes the sequence information and the structure information. Functional attribute information is not used in the polypeptide identification process.
In a specific embodiment, the loss function used to train the neural network model follows a "softmax + cross-entropy" scheme and is expressed as follows:
[Formula: image not reproduced in the extracted text]
where s_i denotes the score of a non-target class and s_j denotes the score of a target class; the two sets appearing in the formula are the negative sample set (i.e., the set of categories other than the current category) and the positive sample set; and e is the base of the natural logarithm.
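The formula itself did not survive in the text. One form that is consistent with the symbol definitions (scores s_i over a negative set, scores s_j over a positive set, and the natural base e) is the multi-label generalization of softmax cross-entropy, log(1 + sum of e^(s_i) over negatives) + log(1 + sum of e^(-s_j) over positives). The sketch below implements that form; it is an assumption about what the missing formula may look like, not a reproduction of it, and the function name is hypothetical.

```python
import math


def multilabel_softmax_cross_entropy(scores, positives):
    """Assumed loss: log(1 + sum_neg e^{s_i}) + log(1 + sum_pos e^{-s_j}).

    scores: list of raw class scores.
    positives: set of indices of the target classes; all others are negatives.
    """
    neg_sum = sum(math.exp(s) for k, s in enumerate(scores) if k not in positives)
    pos_sum = sum(math.exp(-s) for k, s in enumerate(scores) if k in positives)
    return math.log(1.0 + neg_sum) + math.log(1.0 + pos_sum)
```

Under this form the loss falls as target-class scores rise above zero and non-target scores fall below it, so zero acts as the decision threshold for each class.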
To demonstrate the effectiveness of the protocol described in this example, the following experiments were performed:
Table 1. Experimental results
where F1-score denotes the F1 score, a statistical measure of the accuracy of a binary classification model;
Avg denotes the average value;
Bi-GRU (Bidirectional Gated Recurrent Unit) denotes a bidirectional gated recurrent unit network, which is an RNN sequence model;
Multi-model denotes the multi-modal approach, corresponding to the neural network model with multi-modal structure information in this embodiment;
Multi-model+rebalance denotes multi-modal plus rebalancing, corresponding to the neural network model with multi-modal structure information and the rebalanced cross-entropy function in this embodiment.
As shown in Table 1, this embodiment takes six antibacterial prediction tasks as examples. Bi-GRU, a method that uses the sequence only, achieves an average F1-score of just 0.39. Adding the multi-modal structure information raises the F1-score to 0.50, an absolute gain of 0.11 (more than ten percentage points) over the sequence-only method. Adding the rebalanced cross-entropy function on top of this raises the F1-score further to 0.60.
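For reference, the F1 score of one binary task is the harmonic mean of precision and recall, and an average over several tasks is the mean of the per-task scores. The sketch below shows that computation; the confusion counts in the usage note are made up for illustration and are not the experimental data behind Table 1.

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall) for one binary task."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_f1(task_counts):
    """Unweighted mean of per-task F1 scores; task_counts holds (tp, fp, fn) tuples."""
    scores = [f1_score(tp, fp, fn) for tp, fp, fn in task_counts]
    return sum(scores) / len(scores)
```

For example, `average_f1([(5, 5, 5), (10, 0, 0)])` averages a task with F1 = 0.5 and a task with F1 = 1.0, giving 0.75.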
Embodiment two:
it is an object of this embodiment to provide an antimicrobial peptide generation and recognition system.
An antimicrobial peptide generation and recognition system, comprising:
a data acquisition unit for acquiring reference polypeptide sequence information and functional attribute information corresponding to a target antibacterial peptide;
a structural feature extraction unit for acquiring structure information of a reference polypeptide sequence based on a polypeptide multi-channel coloring method; the polypeptide multi-channel coloring method specifically comprises: representing the whole polypeptide as a cube structure composed of a plurality of three-dimensional voxels, wherein the color of each three-dimensional voxel is determined by the values of three channels, namely the mass of the atom to which the three-dimensional voxel belongs, the solubility of the amino acid to which the atom belongs and the acid-base property of the amino acid to which the atom belongs;
a target antibacterial peptide generation unit for inputting the reference polypeptide sequence information and the structure information into a pre-trained variational autoencoder model to obtain sequence information and structure information of the target antibacterial peptide; the variational autoencoder model receives the two modalities, polypeptide sequence information and structure information, at the same time, and fuses them with the functional attribute information while encoding them, thereby realizing conditional generation;
and the function recognition unit is used for obtaining a function recognition result of the target antibacterial peptide by utilizing a pre-trained neural network model based on the fusion characteristics of the sequence information and the structure information of the target antibacterial peptide.
Further, the system in this embodiment corresponds to the method in the first embodiment, and the technical details thereof are described in the first embodiment, so that the details are not repeated here.
The above description covers only the preferred embodiments of the present invention and is not intended to limit the present invention; those skilled in the art can make various modifications and variations to the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of producing and identifying an antimicrobial peptide, comprising:
acquiring reference polypeptide sequence information and functional attribute information corresponding to a target antibacterial peptide;
obtaining structure information of a reference polypeptide sequence based on a polypeptide multi-channel coloring method; the polypeptide multi-channel coloring method specifically comprises: representing the whole polypeptide as a cube structure composed of a plurality of three-dimensional voxels, wherein the color of each three-dimensional voxel is determined by the values of three channels, namely the mass of the atom to which the three-dimensional voxel belongs, the solubility of the amino acid to which the atom belongs and the acid-base property of the amino acid to which the atom belongs;
inputting the reference polypeptide sequence information and the structure information into a pre-trained variational autoencoder model to obtain sequence information and structure information of a target antibacterial peptide; the variational autoencoder model receives the two modalities, polypeptide sequence information and structure information, at the same time, and fuses them with the functional attribute information while encoding them, thereby realizing conditional generation;
the variational autoencoder model based on conditional compression comprises two branches, wherein the first branch is used for processing structure information and the second branch is used for processing sequence information; in order to embed the condition into the variational autoencoder, a multi-layer perceptron is used to compress the condition information to one dimension, which is then multiplied with the structural features and the polypeptide sequence, thereby controlling the structure and the sequence at the same time;
based on the fusion characteristics of the sequence information and the structure information of the target antibacterial peptide, a functional identification result of the target antibacterial peptide is obtained by utilizing a pre-trained neural network model.
2. The method for producing and identifying an antimicrobial peptide according to claim 1, wherein the functional attribute information is obtained from the functional attribute information of polypeptide sequences and is expressed as a binary sequence, wherein the length of the binary sequence equals the total number of functional attributes present in known polypeptide sequences, each bit in the binary sequence represents one functional attribute, a bit value of 1 indicates that the functional attribute is present, and a bit value of 0 indicates that the functional attribute is absent.
3. The method for generating and identifying an antimicrobial peptide according to claim 1, wherein the variational autoencoder model comprises two branches, a first branch for processing structure information and a second branch for processing sequence information; in order to embed the functional attribute information into the variational autoencoder, the functional attribute information is compressed to one dimension using a multi-layer perceptron and then multiplied with the input information to be encoded.
4. The method for generating and identifying the antibacterial peptide according to claim 1, wherein the training process of the variational autoencoder model is specifically as follows:
acquiring a preset number of polypeptide data samples to construct a training set, wherein the polypeptide data samples comprise polypeptide sequences and functional attribute information corresponding to the current polypeptide sequences;
representing the functional attribute information corresponding to the current polypeptide sequence by using a binary sequence; analyzing each sample to obtain three-dimensional space structure information of polypeptide data; based on the three-dimensional space structure information of the polypeptide data, acquiring the structure information of each sample by adopting a polypeptide multi-channel coloring method;
and taking the structure information and the sequence information of each polypeptide data sample as the input of the variational autoencoder model, introducing the functional attribute information into the middle layer of the variational autoencoder model, and taking the polypeptide sequence information and the polypeptide structure information of each polypeptide data sample as the output of the variational autoencoder model, so as to train the variational autoencoder model.
5. The method for producing and identifying an antimicrobial peptide according to claim 1, wherein the whole polypeptide is represented as a cubic structure consisting of a plurality of three-dimensional voxels, in particular: constructing a three-dimensional space coordinate system by taking the gravity center of the polypeptide as an origin and a cube structure containing the whole polypeptide; and dividing the cube structure containing the whole polypeptide into a plurality of three-dimensional voxels by taking the preset unit distance as the side length of the three-dimensional voxels.
6. The method for generating and identifying the antibacterial peptide according to claim 1, wherein the coloring of each three-dimensional voxel is based on the concept of RGB multi-channel imaging, and each channel is respectively assigned by utilizing the mass of an atom to which the three-dimensional voxel belongs, the solubility of an amino acid to which the atom belongs and the acid-base property of the amino acid to which the atom belongs, so as to obtain the multi-channel representation of the color of each three-dimensional voxel.
7. An antimicrobial peptide generation and identification method as claimed in claim 1, wherein for three-dimensional voxels that do not contain atoms, each channel for color representation is assigned a value of zero.
8. The method for producing and identifying an antibacterial peptide according to claim 1, wherein the solubility of the amino acid is divided into hydrophilic and hydrophobic, and different values are set for the hydrophilic amino acid and the hydrophobic amino acid; the acid-base properties of the amino acids are classified as acidic, neutral and basic, wherein the acidic amino acids, the neutral amino acids and the basic amino acids are set to different values.
9. An antimicrobial peptide generation and recognition method according to claim 1, wherein the reference polypeptide sequence is selected, according to the functional requirements of the target antibacterial peptide, from a library of polypeptide sequences with known functional information.
10. An antimicrobial peptide generation and recognition system, comprising:
a data acquisition unit for acquiring reference polypeptide sequence information and functional attribute information corresponding to a target antibacterial peptide;
a structural feature extraction unit for acquiring structure information of a reference polypeptide sequence based on a polypeptide multi-channel coloring method; the polypeptide multi-channel coloring method specifically comprises: representing the whole polypeptide as a cube structure composed of a plurality of three-dimensional voxels, wherein the color of each three-dimensional voxel is determined by the values of three channels, namely the mass of the atom to which the three-dimensional voxel belongs, the solubility of the amino acid to which the atom belongs and the acid-base property of the amino acid to which the atom belongs;
a target antibacterial peptide generation unit for inputting the reference polypeptide sequence information and the structure information into a pre-trained variational autoencoder model to obtain sequence information and structure information of the target antibacterial peptide; the variational autoencoder model receives the two modalities, polypeptide sequence information and structure information, at the same time, and fuses them with the functional attribute information while encoding them, thereby realizing conditional generation;
the variational autoencoder model based on conditional compression comprises two branches, wherein the first branch is used for processing structure information and the second branch is used for processing sequence information; in order to embed the condition into the variational autoencoder, a multi-layer perceptron is used to compress the condition information to one dimension, which is then multiplied with the structural features and the polypeptide sequence, thereby controlling the structure and the sequence at the same time;
and the function recognition unit is used for obtaining a function recognition result of the target antibacterial peptide by utilizing a pre-trained neural network model based on the fusion characteristics of the sequence information and the structure information of the target antibacterial peptide.
CN202310483081.4A 2023-05-04 2023-05-04 Antibacterial peptide generation and identification method and system Active CN116206690B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310483081.4A CN116206690B (en) 2023-05-04 2023-05-04 Antibacterial peptide generation and identification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310483081.4A CN116206690B (en) 2023-05-04 2023-05-04 Antibacterial peptide generation and identification method and system

Publications (2)

Publication Number Publication Date
CN116206690A CN116206690A (en) 2023-06-02
CN116206690B true CN116206690B (en) 2023-08-08

Family

ID=86508010

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310483081.4A Active CN116206690B (en) 2023-05-04 2023-05-04 Antibacterial peptide generation and identification method and system

Country Status (1)

Country Link
CN (1) CN116206690B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant