CN116894472A - Training method and device for neural network model for predicting binding force of polypeptide - Google Patents


Info

Publication number
CN116894472A
Authority
CN
China
Prior art keywords
polypeptide molecule
neural network
training
network model
binding force
Prior art date
Legal status
Pending
Application number
CN202310730636.0A
Other languages
Chinese (zh)
Inventor
黄俊雄
黄俊源
赵春青
肖斌
郑汉城
李英睿
王俊
Current Assignee
Zhuhai Carbon Cloud Intelligent Technology Co ltd
Shenzhen Carbon Cloud Intelligent Peptide Pharmaceutical Technology Co ltd
Original Assignee
Zhuhai Carbon Cloud Intelligent Technology Co ltd
Shenzhen Carbon Cloud Intelligent Peptide Pharmaceutical Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Zhuhai Carbon Cloud Intelligent Technology Co., Ltd. and Shenzhen Carbon Cloud Intelligent Peptide Pharmaceutical Technology Co., Ltd.
Priority to CN202310730636.0A
Publication of CN116894472A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C 10/00: Computational theoretical chemistry, i.e. ICT specially adapted for theoretical aspects of quantum chemistry, molecular mechanics, molecular dynamics or the like
    • G16C 20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C 20/30: Prediction of properties of chemical compounds, compositions or mixtures
    • G16C 20/70: Machine learning, data mining or chemometrics

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Software Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The application provides a training method and device, electronic equipment, and a storage medium for a neural network model for predicting polypeptide binding force. The method comprises: obtaining a training sample set, wherein each training sample comprises the amino acid sequence of a sample polypeptide molecule and a binding force value between the sample polypeptide molecule and a target protein; and training the neural network model with the training sample set to generate the parameter set of the neural network model, wherein the neural network model is a multi-layer perceptron with a residual structure and represents the correspondence between the amino acid sequence of a polypeptide molecule and the binding force value between that polypeptide molecule and the target protein. The neural network model obtained by the method achieves high accuracy and good performance in predicting the binding capacity of polypeptide molecules to target proteins.

Description

Training method and device for neural network model for predicting binding force of polypeptide
Technical Field
The application belongs to the technical field of computers, and particularly relates to a training method and device for a neural network model for predicting polypeptide binding force, electronic equipment and a storage medium.
Background
Deep learning is an important area of machine learning research; it aims to build neural networks that simulate the human brain for analysis and learning. In deep learning, the structure of the artificial neural network (Artificial Neural Network, ANN) has a significant impact on the quality of the final model. In recent years, deep learning has also found applications in the biomedical field, including screening polypeptides as drug molecules and predicting the binding capacity of polypeptide molecules to target proteins. Because many kinds of amino acids exist in nature, highly complex amino acid sequence combinations place high demands on training methods for neural network models, and combinatorial explosion easily arises when the number of parameters is large, degrading model performance. Developing training methods for neural network models in the biological field, particularly for polypeptide molecules, is therefore very important.
Disclosure of Invention
The application provides a training method, a device, electronic equipment, and a storage medium for a neural network model for predicting polypeptide binding force, which address technical deficiencies in the related field and provide a neural network model, together with its training method, that predicts the binding capacity of polypeptide molecules to target proteins with high accuracy and good performance.
In a first aspect, embodiments of the present application provide a method of training a neural network model for predicting the binding force of a polypeptide molecule to a target protein, the method comprising: obtaining a training sample set, wherein each training sample comprises the amino acid sequence of a sample polypeptide molecule and a binding force value between the sample polypeptide molecule and the target protein; and training the neural network model with the training sample set to generate the parameter set of the neural network model, wherein the neural network model is a multi-layer perceptron with a residual structure and represents the correspondence between the amino acid sequence of a polypeptide molecule and the binding force value between that polypeptide molecule and the target protein.
In some alternative embodiments, the m-th layer of the hidden layers of the multi-layer perceptron is connected to the (m+k)-th layer, where m is a positive integer and k is a positive integer greater than or equal to 2.
In some alternative embodiments, the parameter amount of the m-th hidden layer of the multi-layer perceptron is a preset first number and the parameter amount of the (m+1)-th hidden layer is a preset second number, where the preset first number is a positive integer less than the preset second number.
In some alternative embodiments, training the neural network model with the training sample set includes: binarizing the amino acid sequence of the sample polypeptide molecule to obtain a binarization matrix of the amino acid sequence corresponding to the sample polypeptide molecule, wherein the rows and columns of the binarization matrix respectively correspond to the positions of the amino acids in the sample polypeptide molecule and the types of the amino acids, or respectively correspond to the types of the amino acids and the positions of the amino acids in the sample polypeptide molecule; and training the neural network model with the binarization matrix.
In some alternative embodiments, training the neural network model with the binarization matrix includes: multiplying the binarization matrix by the matrix of a learnable encoder to convert it into a non-sparse, continuous-valued mathematical representation; and inputting the non-sparse continuous-valued representation into the neural network model, thereby training the neural network model.
In some alternative embodiments, obtaining the training sample set includes: obtaining the fluorescence signals of at least two sample polypeptide molecules on a polypeptide chip that have a binding relationship with the target protein; quantifying the fluorescence signal of each sample polypeptide molecule to obtain the binding force value between the sample polypeptide molecule and the target protein; and generating the training samples of the training sample set from the amino acid sequence of each sample polypeptide molecule and the binding force value between the sample polypeptide molecule and the target protein.
In a second aspect, embodiments of the present application provide a training device for a neural network model for predicting the binding force of a polypeptide molecule to a target protein, comprising: an acquisition module configured to acquire a training sample set, wherein each training sample comprises the amino acid sequence of a sample polypeptide molecule and a binding force value between the sample polypeptide molecule and the target protein; and a training module configured to train the neural network model with the training sample set to generate the parameter set of the neural network model, wherein the neural network model is a multi-layer perceptron with a residual structure and represents the correspondence between the amino acid sequence of a polypeptide molecule and the binding force value between that polypeptide molecule and the target protein.
In a third aspect, embodiments of the present application provide a method for predicting binding force between a polypeptide molecule and a target protein, comprising: inputting the amino acid sequence of the polypeptide molecule to be predicted into a binding force prediction model to obtain a binding force value between the polypeptide molecule to be predicted and the target protein, wherein the binding force prediction model is a neural network model trained by the training method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a device for predicting binding force of a polypeptide molecule to a target protein, comprising: the prediction unit is configured to input an amino acid sequence of the polypeptide molecule to be predicted into a binding force prediction model to obtain a binding force value between the polypeptide molecule to be predicted and the target protein, wherein the binding force prediction model is a neural network model trained by the training method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon, which when executed by one or more processors, cause the one or more processors to implement the method of any of the aspects described above.
In a sixth aspect, embodiments of the present application provide a computer readable storage medium having a computer program stored thereon, wherein the computer program when executed by one or more processors implements a method as in any of the above aspects.
Polypeptide molecules (e.g., small-molecule active peptides) have become a new trend in pharmaceutical research as vaccines, diagnostic reagents, drugs, and drug lead compounds. In drug development with polypeptide molecules, predicting and screening the binding force of polypeptide molecules to target proteins is critical. Many methods currently exist for screening polypeptide molecules by their binding activity to a target protein, including phage display peptide libraries, random synthetic polypeptide libraries, antisense polypeptide libraries, polypeptide chip screening, protein degradation (enzymatic or chemical), MHC-polypeptide complexes, and protein structure determination or prediction. Traditional experimental screening of drug molecules consumes substantial manpower and materials, and depends on the experimental equipment and design, on whether the target protein is known, and on the breadth of drug molecules covered by the screening tool (e.g., a chip).
Existing computational methods for predicting or screening drug molecules use molecular evolution or molecular simulation to screen unknown potential candidate polypeptides. Virtual screening technologies based on molecular docking, such as PatchDock and LightDock, mainly use fast computation to dock polypeptides with the corresponding target proteins in spatial conformation: molecules in a virtual polypeptide database are docked one by one with specific active sites of the target protein's crystal structure; the binding position and conformation of the polypeptide, the dihedral angles of rotatable bonds in the molecule, and the side chains and backbone of the target protein's amino acid residues are continuously adjusted; the optimal spatial conformation of the polypeptide molecule and the target protein is searched; the binding mode and binding affinity between the two are predicted; and polypeptide ligands close to the natural conformation with the best binding affinity to the target protein are selected by scoring values. These methods of theoretically simulating intermolecular interactions have complex models, take a long time, are easily affected by manually set parameters, and cannot design and optimize candidate polypeptides in a globally unbiased candidate-molecule space. They also depend heavily on three-dimensional molecular structure: if the three-dimensional structures of the protein and polypeptide cannot be obtained accurately, performance and accuracy drop sharply.
To solve the above problems, the present application provides a method for accurately and efficiently predicting or screening polypeptide molecules (e.g., small-molecule active peptides with druggable potential) using computer technology. The method predicts the binding force of a polypeptide molecule to a target protein by training a neural network model. Specifically, the known amino acid sequences of sample polypeptide molecules and their binding force values with the target protein serve as training samples, and the neural network model is trained with this training sample set to generate its parameter set, where the neural network model is a multi-layer perceptron with a residual structure and represents the correspondence between the amino acid sequence of a polypeptide molecule and the binding force value between that polypeptide molecule and the target protein. The neural network model obtained by this training method enables global, systematic evaluation of polypeptide molecules binding the target protein and can accurately predict the binding force between the target protein and any polypeptide molecule, significantly improving the efficiency of discovering polypeptide molecules with druggable potential.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the specification and together with the description, serve to explain the principles of the specification.
FIG. 1 is a schematic diagram of an implementation environment in which an embodiment of the present application may be applied.
FIG. 2A is a flow chart of one embodiment of a training method for a neural network model for predicting binding of polypeptide molecules to target proteins, in accordance with the present application.
FIG. 2B is an exploded flow chart of one embodiment of a training method for a neural network model for predicting binding of polypeptide molecules to target proteins, in accordance with the present application.
FIG. 2C is a schematic diagram of the structure of a neural network model for predicting binding force of polypeptide molecules to target proteins, obtained according to the training method of the present application.
FIG. 3 is a schematic diagram of one embodiment of a training device for neural network models for predicting binding of polypeptide molecules to target proteins in accordance with the present application.
FIG. 4 is a flow chart of one embodiment of a method for predicting binding of a polypeptide molecule to a target protein according to the present application.
FIG. 5 is a schematic structural diagram of one embodiment of a device for predicting binding force of a polypeptide molecule to a target protein according to the present application.
Fig. 6 is a schematic diagram of a computer system suitable for use in implementing an embodiment of the application.
Fig. 7 is a graph of the performance comparison result of the neural network model obtained by the conventional deep learning method and the method of the present application.
Detailed Description
The application is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be noted that, for convenience of description, only the portions related to the present application are shown in the drawings.
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
FIG. 1 shows a schematic diagram of an implementation environment to which the training method of a neural network model for predicting the binding force of a polypeptide molecule to a target protein and/or the method of predicting the binding force of a polypeptide molecule to a target protein, as well as the corresponding apparatus, electronic device, and storage medium of the present application, can be applied.
As shown in FIG. 1, the implementation environment includes an electronic device 100. The training method of a neural network model for predicting the binding force of a polypeptide molecule to a target protein and/or the method of predicting the binding force of a polypeptide molecule to a target protein in the embodiments of the present application may be executed by terminal devices 101, 102, 103, 104. By way of example, the electronic device 100 may include at least one of a terminal device or a server.
The terminal devices 101, 102, 103, 104 may be hardware or software. When the terminal devices 101, 102, 103, 104 are hardware, they may be various electronic devices having a display screen and supporting information input (e.g., text input and/or voice input, etc.), including but not limited to smartphones, tablets, laptop and desktop computers, and the like. When the terminal apparatuses 101, 102, 103, 104 are software, they can be installed in the above-listed terminal apparatuses. It may be implemented as a plurality of software or software modules (e.g., for providing a training method for predicting binding of polypeptide molecules to target proteins and/or a predictive method service for binding of polypeptide molecules to target proteins), or as a single software or software module. The present application is not particularly limited herein.
It should be understood that the number of terminal devices in fig. 1 is merely illustrative. There may be any number of terminal devices, as desired for implementation.
With continued reference to FIG. 2A, there is shown a flow 200 of one embodiment of a method of training a neural network model for predicting the binding force of a polypeptide molecule to a target protein according to the present application, the method comprising the following steps:
step 201, obtaining a training sample set, wherein the training sample comprises an amino acid sequence of a sample polypeptide molecule and a binding force value of the sample polypeptide molecule and a target protein.
In this embodiment, the execution subject of the training method (for example, a terminal device shown in FIG. 1) may first obtain training samples, each comprising the amino acid sequence of a sample polypeptide molecule and the binding force value between the sample polypeptide molecule and the target protein, as the training sample set. The polypeptide molecules generally correspond one-to-one with binding force values, i.e., one polypeptide molecule corresponds to exactly one binding force value.
Here, the sample polypeptide molecule may be a polypeptide molecule contained on a polypeptide chip (e.g., a HealthTell-V16 polypeptide chip).
Here, the sample polypeptide molecules may unbiasedly cover all amino acid sequences of length no more than 13 formed from 18 different amino acids (e.g., A, D, E, F, G, H, I, K, L, N, P, Q, R, S, T, V, W, Y), i.e., on the order of 18^13 cases. It will be appreciated that the more amino acid species the sample polypeptide molecules cover and the more sequence lengths are available, the more candidate polypeptide molecules can be screened for drug design, making efficacy-driven, independent design easier. The training method provided by the application for a neural network model for predicting the binding force of polypeptide molecules to target proteins significantly increases the variety of polypeptide molecules covered, can fit highly complex amino acid sequence combinations, and lets the trained neural network model capture more complex biochemical rules.
Here, the binding force value between a sample polypeptide molecule and the target protein may be concrete binding score data or a quantitatively characterized fluorescence signal value. The binding force value may be the binding score obtained when the polypeptide probes on a polypeptide chip interact with biomolecules in the loaded sample, with the interaction strength reflected by the fluorescence signal value; it quantitatively characterizes the binding strength between target molecules in the sample and the polypeptide probes on the chip. Quantitative binding force values facilitate subsequent analysis of high-throughput polypeptide-probe binding-spectrum signal data. For example, when the loaded sample is a disease target protein, the chip output reflects the strength of the binding interaction between the polypeptide probes and the disease target protein, aiding research on new medicinal polypeptides that regulate the activity of the disease target protein; when the loaded sample is human serum, the chip output reflects the binding relationship between the comprehensive human antibody spectrum and antigen molecular epitopes of pathogenic microorganisms, enabling antibody detection and disease diagnosis for microbial infection, autoimmune disease, and early tumor neoantigens.
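As a rough sketch of how raw chip fluorescence might be quantified into binding force values and paired with sequences to form training samples. The log transform and min-max scaling used here are illustrative assumptions; the application does not fix a particular quantification formula.

```python
import numpy as np

def build_training_samples(sequences, fluorescence, eps=1.0):
    """Quantify raw chip fluorescence into binding force values in [0, 1]
    and pair them with amino acid sequences (illustrative assumption)."""
    signal = np.asarray(fluorescence, dtype=float)
    log_signal = np.log(signal + eps)        # compress the dynamic range
    lo, hi = log_signal.min(), log_signal.max()
    binding = (log_signal - lo) / (hi - lo)  # min-max scale to [0, 1]
    return list(zip(sequences, binding))

# Toy sequences and fluorescence readouts, for illustration only
samples = build_training_samples(["ADEF", "GHIK", "LNPQ"],
                                 [120.0, 5600.0, 880.0])
```

Each resulting pair is one training sample: a sequence together with a single scalar binding force value, matching the one-to-one correspondence described above.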
Step 202, train the neural network model with the training sample set to generate the parameter set of the neural network model, where the neural network model is a multi-layer perceptron with a residual structure and represents the correspondence between the amino acid sequence of a polypeptide molecule and the binding force value between that polypeptide molecule and the target protein.
In this embodiment, the execution subject of the training method has obtained the training sample set from step 201. The training sample set may cover a large number of highly complex amino acid sequence combinations (for example, all amino acid sequences of length no more than 13 formed from 18 different amino acids). The neural network model provided by the application is a multi-layer perceptron with a residual structure, which can overcome the problems of insufficient depth, overfitting, and model degradation found in conventional multi-layer perceptrons.
In some alternative embodiments, the m-th layer of the hidden layers of the multi-layer perceptron is connected to the (m+k)-th layer, where m is a positive integer and k is a positive integer greater than or equal to 2.
Here, when k is 2, the m-th and (m+k)-th hidden layers of the multi-layer perceptron can be distinguished as odd and even layers, i.e., odd hidden layers connect to odd layers and even layers connect to even layers. For example, the hidden layers of the multi-layer perceptron are divided into odd and even layers with different parameter amounts, and the residual connection then connects layer m to layer m+2. This implementation greatly reduces the parameter count of the overall neural network and shortens computation time without affecting model performance.
It can be understood that, in theory, there is no upper limit on the number of layers a residual connection may skip (for example, layer 1 connected to layer 4 and layer 2 connected to layer 5). The number of skipped layers, i.e., the value of k, can be adjusted to the specific application scenario so that the performance of the neural network model does not degrade.
In some alternative embodiments, the parameter amount of the m-th hidden layer of the multi-layer perceptron is a preset first number and the parameter amount of the (m+1)-th hidden layer is a preset second number, where the preset first number is a positive integer less than the preset second number.
Here, whether m is odd or even, the preset first number is a positive integer smaller than the preset second number; for example, the preset first number is 130 and the preset second number is 520.
It can be understood that adjusting the parameter amounts of different hidden layers of the multi-layer perceptron continuously expands and reduces the parameters of each layer, thereby reducing the overall parameter count of the neural network and shortening computation time without affecting model performance. Thus, the parameter counts of different layers can be adjusted following the "expand-reduce" trend (as shown in FIG. 2C) while satisfying the performance requirements of the neural network model.
By redesigning the residual structure, the application makes it applicable to single-channel feature data. This solves the architecture-design problem caused by excessive depth when a residual structure is added to a multi-layer perceptron, while grouping the hidden layers and connecting and adjusting the parameter counts between different layers addresses issues such as increased computational complexity and data compatibility.
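A minimal NumPy sketch of the forward pass of such a residual multi-layer perceptron, assuming k = 2 and the example widths 130/520 from the text. The weight initialization, ReLU activation, input size, and batch size are illustrative choices, not specified by the application.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden-layer widths alternate between a smaller and a larger parameter
# amount ("expand-reduce"); 130 and 520 follow the example values in the text.
widths = [130, 520, 130, 520, 130]
in_dim = 13 * 18  # flattened one-hot sequence; an illustrative input size

dims = [in_dim] + widths
Ws = [rng.normal(0, 0.05, (dims[i], dims[i + 1])) for i in range(len(widths))]

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, k=2):
    """Forward pass in which layer m is also connected to layer m + k.
    With k = 2, odd layers connect to odd layers and even layers to even
    layers, so the skipped activations always have matching widths."""
    acts = [x]
    for m, W in enumerate(Ws, start=1):
        h = relu(acts[-1] @ W)
        if m - k >= 1:               # residual (skip) connection from layer m - k
            h = h + acts[m - k]
        acts.append(h)
    return acts[-1]

out = forward(rng.normal(size=(4, in_dim)))
```

Note that with k = 2 and alternating widths, each skip connects layers of equal width, so the element-wise addition is well defined without extra projection matrices.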
Step 2021, binarize the amino acid sequence of the sample polypeptide molecule to obtain a binarization matrix of the amino acid sequence corresponding to the sample polypeptide molecule, where the rows and columns of the binarization matrix respectively correspond to the positions of the amino acids in the sample polypeptide molecule and the types of the amino acids, or respectively correspond to the types of the amino acids and the positions of the amino acids in the sample polypeptide molecule.
Here, the present application does not limit the binarization method. The purpose of binarization is to encode each amino acid in the linear amino acid sequence of the sample polypeptide molecule as a discrete feature value, since the amino acid categories themselves carry no inherent numeric meaning. Illustratively, one-hot encoding, target encoding, leave-one-out encoding, Bayesian target encoding (Bayesian Target Encoding), weight of evidence (Weight of Evidence), or nonlinear PCA may be used.
In some alternative embodiments, step 2021 may be performed as follows: the amino acid sequence of the sample polypeptide molecule is represented as a binary matrix (e.g., a "one-hot" representation) of size 13×18, where 13 is the maximum length of the sample polypeptide molecule and 18 is the number of amino acid species. Similarly, the binary matrix may be of size 18×13, with 18 the number of amino acid species and 13 the maximum length. Each row of the 13×18 matrix is a vector of 0s and 1s indicating the amino acid type at that position in the sequence. For shorter peptides, the unused rows (positions beyond the end of the sample polypeptide molecule) are filled with 0s. This sparse 0/1 matrix is the mathematical representation of the amino acid sequence of the sample polypeptide molecule.
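The one-hot representation described above can be sketched as follows; the 18-letter amino acid alphabet and maximum length 13 follow the examples in the text, while the sample peptide is an illustrative assumption.

```python
import numpy as np

AMINO_ACIDS = "ADEFGHIKLNPQRSTVWY"  # 18 amino acid species from the text
MAX_LEN = 13                        # maximum peptide length from the text

def one_hot(seq):
    """Rows index sequence positions (max 13), columns index amino acid
    types (18); positions past the end of the peptide stay all-zero."""
    mat = np.zeros((MAX_LEN, len(AMINO_ACIDS)), dtype=np.int8)
    for pos, aa in enumerate(seq):
        mat[pos, AMINO_ACIDS.index(aa)] = 1
    return mat

m = one_hot("ADKW")  # a toy 4-residue peptide; 9 rows remain zero-filled
```

Transposing the result gives the alternative 18×13 layout (types as rows, positions as columns) that the text also allows.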
At step 2022, the neural network model is trained using the binarization matrix.
In some alternative embodiments, step 2022 may also include the following steps a through b:
and a, multiplying the binarization matrix by the matrix of a learnable encoder to convert it into a non-sparse, continuous-valued mathematical representation.
And b, inputting the non-sparse, continuous-valued mathematical representation into the neural network model, thereby training the neural network model.
In some alternative embodiments, step b may be performed as follows: the non-sparse, continuous-valued mathematical representation is input into the neural network model, which outputs a simulated binding force value for each sample polypeptide molecule; the simulated binding force value is compared with the actual binding force value of the sample polypeptide molecule in the obtained training sample to calculate a loss, and the loss is then back-propagated to update the parameters of the neural network model. The parameters may be updated, for example, by gradient descent. This training procedure is repeated until the neural network model reaches the desired index, at which point training is complete.
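Steps a and b above may be sketched, under several simplifying assumptions (a single hidden layer, random surrogate data, illustrative layer sizes that are not those of the claimed multi-layer perceptron, and an encoder matrix held fixed rather than learned), as follows:

```python
import numpy as np

# Illustrative sketch only; all sizes, the surrogate data, and the fixed
# encoder are assumptions, not the application's actual model.
rng = np.random.default_rng(1)
n, max_len, n_aa, emb = 32, 13, 18, 8

# Random one-hot matrices standing in for binarized sample peptides.
onehot = np.zeros((n, max_len, n_aa))
cols = rng.integers(0, n_aa, (n, max_len))
onehot[np.arange(n)[:, None], np.arange(max_len), cols] = 1.0

E = rng.normal(scale=0.1, size=(n_aa, emb))   # "learnable" encoder (fixed here)
X = (onehot @ E).reshape(n, max_len * emb)    # non-sparse, flattened inputs
y = X @ rng.normal(size=max_len * emb) * 0.1  # surrogate actual binding values

W1 = rng.normal(scale=0.1, size=(max_len * emb, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=(16, 1));             b2 = np.zeros(1)
lr, losses = 1e-2, []
for _ in range(300):
    h = np.maximum(X @ W1 + b1, 0.0)          # hidden layer with ReLU
    pred = (h @ W2 + b2).ravel()              # simulated binding force values
    err = pred - y
    losses.append(float(np.mean(err ** 2)))   # loss (MSE against actual values)
    g = (2.0 / n) * err[:, None]              # back-propagate the loss
    gW2, gb2 = h.T @ g, g.sum(0)
    gh = g @ W2.T
    gh[h <= 0.0] = 0.0
    gW1, gb1 = X.T @ gh, gh.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1            # gradient-descent update
    W2 -= lr * gW2; b2 -= lr * gb2
```

The loop mirrors the described cycle: forward pass, comparison against the actual binding force value to compute the loss, back-propagation, and parameter update.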
Here, the desired index of the neural network model may be a minimized loss, or the attainment of a target result on some index of the actual application scenario, for example, the fastest calculation speed, the best biological correspondence between the mathematical features and the polypeptide molecules, etc.
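As one hedged illustration of such a stopping rule (the `should_stop` helper and its patience window and tolerance are assumptions; the application leaves the concrete index open), training may terminate when the loss no longer improves:

```python
# Illustrative "desired index" check: stop when the best loss in the most
# recent `patience` steps is no better than the best loss before them.
def should_stop(loss_history, patience=5, tol=1e-4):
    if len(loss_history) <= patience:
        return False
    best_recent = min(loss_history[-patience:])
    best_before = min(loss_history[:-patience])
    return best_before - best_recent < tol

converged = should_stop([1.0, 0.5, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4])
```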
In some alternative embodiments, step 201 may further comprise the following steps as shown in fig. 2B:
step 2011, obtaining fluorescent signals of at least two sample polypeptide molecules on the polypeptide chip in a binding relationship with the target protein.
In this embodiment, the execution body of the training method extracts the fluorescent signals from the experimental results of the polypeptide chip, which may specifically be performed as follows: from the experimental results obtained after the target protein is directly fluorescently labeled and incubated with the polypeptide chip, the sample polypeptide molecules having a binding relationship with the target protein and their corresponding fluorescent signals can be obtained.
Step 2012, quantifying the fluorescent signal of each sample polypeptide molecule to obtain the binding force value of the sample polypeptide molecule and the target protein.
In this embodiment, the execution body of the training method quantifies the fluorescence signal of each sample polypeptide molecule, and the quantified fluorescence signal is embodied as a binding force value, which is used to characterize the strength of the binding force between the sample polypeptide molecule and the target protein.
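One plausible quantification, shown here purely as a hedged sketch (the log-transform and min-max scaling below are assumptions; the application does not fix a particular mapping from fluorescence to binding force), converts raw fluorescence intensities into normalized binding force values:

```python
import numpy as np

# Assumed quantification for step 2012: log10-transform the raw
# fluorescence intensities, then min-max scale them into [0, 1] as
# binding-force values (stronger signal -> larger value).
def quantify(intensities):
    logged = np.log10(np.asarray(intensities, dtype=np.float64) + 1.0)
    lo, hi = logged.min(), logged.max()
    return (logged - lo) / (hi - lo)

values = quantify([120.0, 4500.0, 980.0, 60.0])
```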
And 2013, generating training samples in the training sample set by using the amino acid sequence of each sample polypeptide molecule and the binding force value of the sample polypeptide molecule and the target protein.
According to this alternative embodiment, the training sample set generated by the present application broadly covers tens of thousands of polypeptide molecules (i.e., the sample polypeptide molecules) on the polypeptide chip. The polypeptide molecules range in length from 5 to 13 amino acids and involve 18 of the 20 natural amino acids (A, D, E, F, G, H, I, K, L, N, P, Q, R, S, T, V, W, Y). This broad coverage of polypeptide molecule types for training the neural network model enables fitting of highly complex amino acid sequence combinations, so that the trained neural network model can master more complex biochemical rules.
In the field of actual polypeptide drug discovery, compared with methods known in the prior art, the neural network model generated by the training method can significantly improve the success rate of polypeptide design. The polypeptide design success rate is typically characterized by the ratio of the number of polypeptides that exhibit affinity in a surface plasmon resonance (Surface Plasmon Resonance, SPR) experiment to the total number of polypeptides screened for druggability potential. The highest reported success rates of three polypeptide design algorithms currently on the market, RFdesign, ProteinMPNN, and MaSIF-seed, are 2%, 1.4%, and 0.2%, respectively (CAO L, COVENTRY B, GORESHNIK I, et al., 2022. Design of protein-binding proteins from the target structure alone [J/OL]. Nature, 605(7910): 551-560; BENNETT N, COVENTRY B, GORESHNIK I, et al., 2022. Improving de novo Protein Binder Design with Deep Learning [M/OL]; GAINZA P, WEHRLE S, HALL-BEAUVAIS A V, et al., 2022. De novo design of site-specific protein interactions with learned surface fingerprints [M/OL]). With the method provided by the present application, the polypeptide design success rate is effectively raised to 60% on the Glp target, 40% on the Pvrig target, and 70% on the Itb3 target, which is significantly superior to the prior art.
With further reference to fig. 3, as an implementation of the method shown in the foregoing figures, the present application provides an embodiment of a training device for a neural network model for predicting the binding force of polypeptide molecules to target proteins. This device embodiment corresponds to the method embodiment shown in fig. 2, and the device may be applied to various electronic devices.
As shown in fig. 3, the training device 300 for a neural network model for predicting the binding force of polypeptide molecules to target proteins according to the present embodiment includes: an acquisition module 301 configured to acquire a training sample set, wherein each training sample comprises the amino acid sequence of a sample polypeptide molecule and the binding force value of the sample polypeptide molecule to a target protein; and a training module 302 configured to train the neural network model using the training sample set to generate a parameter set in the neural network model, wherein the neural network model is a multi-layer perceptron with a residual structure and is used to represent the correspondence between the amino acid sequence of a polypeptide molecule and the binding force value of that polypeptide molecule to the target protein.
In this embodiment, the specific processing and the technical effects brought by the acquisition module 301 and the training module 302 of the training device 300 for predicting the neural network model of the binding force between the polypeptide molecule and the target protein can refer to the relevant descriptions of the step 201 and the step 202 in the corresponding embodiment of fig. 2, and are not repeated here.
Referring further to FIG. 4, there is shown a flow chart 400 of one embodiment of a method for predicting the binding force of a polypeptide molecule to a target protein according to the present application, comprising the steps of:
step 401, inputting an amino acid sequence of a polypeptide molecule to be predicted into a binding force prediction model to obtain a binding force value between the polypeptide molecule to be predicted and a target protein, wherein the binding force prediction model is a neural network model trained by the training method.
In this embodiment, the execution body of the prediction method (for example, the terminal device shown in fig. 1) may first obtain a neural network model trained by the above training method as a binding force prediction model, and then input the amino acid sequence of the polypeptide molecule to be predicted into the binding force prediction model to obtain the binding force value between the polypeptide molecule to be predicted and the target protein.
Here, the polypeptide molecules to be predicted may be polypeptide molecules covered on a polypeptide chip (e.g., the HealthTell-V16 polypeptide chip) or polypeptide molecules of other forms and sources. The polypeptide molecule to be predicted may be of any amino acid length and is not limited to polypeptide molecules of 13 amino acids or fewer.
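A minimal illustrative sketch of step 401, in which the hypothetical `predict_binding` function and its random placeholder weights stand in for the trained binding force prediction model (a linear stand-in here, not the claimed multi-layer perceptron), might look as follows:

```python
import numpy as np

# Hypothetical stand-in for the trained binding force prediction model;
# the alphabet is the 18 amino acids named elsewhere in this application,
# and max_len=20 illustrates that lengths beyond 13 are allowed.
ALPHABET = "ADEFGHIKLNPQRSTVWY"

def predict_binding(sequence, weights, max_len=20):
    matrix = np.zeros((max_len, len(ALPHABET)))
    for i, aa in enumerate(sequence[:max_len]):
        matrix[i, ALPHABET.index(aa)] = 1.0
    return float(matrix.reshape(-1) @ weights)  # scalar binding force value

rng = np.random.default_rng(2)
w = rng.normal(size=20 * 18)            # placeholder "trained" parameters
score = predict_binding("GHKLANDESW", w)
```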
According to this embodiment, the binding force between polypeptide molecules and a target protein can be predicted by virtual means, so as to screen polypeptide molecules with druggability potential and predict their druggability. The prediction method does not depend on structural data of the target protein, can fit more amino acid sequence combinations, and offers high accuracy and short calculation time.
With further reference to fig. 5, as an implementation of the method shown in fig. 4, the present application provides an embodiment of a device for predicting binding force between a polypeptide molecule and a target protein, where the embodiment of the device corresponds to the embodiment of the method shown in fig. 4, and the device may be applied to various electronic devices specifically.
As shown in fig. 5, the apparatus 500 for predicting binding force between a polypeptide molecule and a target protein according to the present embodiment includes: the prediction unit 501 is configured to input the amino acid sequence of the polypeptide molecule to be predicted into a binding force prediction model to obtain a binding force value between the polypeptide molecule to be predicted and the target protein, where the binding force prediction model is a neural network model trained by the training method.
In this embodiment, the specific processing of the prediction unit 501 of the apparatus 500 for predicting the binding force between a polypeptide molecule and a target protein and the technical effects thereof can refer to the description of step 401 in the corresponding embodiment of fig. 4, and will not be described herein.
As shown in FIG. 6, a schematic diagram of a computer system 600 suitable for use in implementing the electronic device of the present application is shown. The computer system 600 shown in fig. 6 is merely an example, and should not be construed as limiting the functionality and scope of use of embodiments of the present application.
As shown in fig. 6, a computer system 600 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 601, which may perform various suitable actions and processes in accordance with programs stored in a Read Only Memory (ROM) 602 or loaded from a storage device 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the computer system 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
In general, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, etc.; an output device 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, magnetic tape, hard disk, etc.; and a communication device 609. The communication device 609 may allow the computer system 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates a computer system 600 with various devices, it is to be understood that not all illustrated devices are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present application, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via communication means 609, or from storage means 608, or from ROM 602. The above-described functions defined in the method of the embodiment of the present application are performed when the computer program is executed by the processing means 601.
The computer readable medium of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the training method for a neural network model for predicting the binding force of polypeptide molecules to target proteins shown in the embodiment of fig. 2, or the prediction method shown in the embodiment of fig. 4, and alternative implementations thereof.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented in software or in hardware. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
The above description is only illustrative of the preferred embodiments of the present application and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to in the present application is not limited to the specific combinations of technical features described above, but also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the spirit of the disclosure, for example, technical solutions in which the above features are interchanged with (but not limited to) technical features with similar functions disclosed in the present application.
Examples:
polypeptide molecules on a HealthTell-V16 polypeptide chip are used: one part of the polypeptide molecules serves as training sample polypeptides and another part as test sample polypeptides, and a conventional deep learning method is compared with the training method of the present application. The neural network model built by the conventional deep learning method was originally built based on the HealthTell-V13 polypeptide chip (comprising approximately 130,000 polypeptide species and 16 amino acid species). The specific implementation process of the conventional deep learning method is as follows:
the polypeptide sequences, expressed as binary matrices (single thermal matrix), were input into a neural network model, with a size of 13 x 18 (maximum length of polypeptide molecules on the array x number of different amino acids). For shorter peptides, unused rows are zero-filled. This binary sequence representation is multiplied by a learnable encoder matrix that linearly converts the binary amino acid representation into a dense continuous representation. The encoding matrix is then flattened to form a real-valued vector representation of the sequence. This vector is then input into a neural network comprising two fully connected layers, each fully connected layer having 100 nodes and a modified linear unit, wherein the output of the second fully connected layer is the binding force value of the polypeptide to the target protein. The model fitting ability results of this method are shown in fig. 7A. It can be seen that the part of the model fitting capacity at high binding forces in fig. 7A is very weak (however this is precisely the part of interest in the application), and that simple neural networks are not suitable for large-scale, complex data training.
Fig. 7B shows the model fitting ability of the method of the present application. It can be seen that the model trained by the method of the present application is better suited to large-scale, highly complex data than the model obtained by the conventional deep learning method.

Claims (11)

1. A training method for a neural network model for predicting binding force of a polypeptide molecule to a target protein, the method comprising:
obtaining a training sample set, wherein the training sample comprises an amino acid sequence of a sample polypeptide molecule and a binding force value of the sample polypeptide molecule and a target protein;
and training the neural network model by using the training sample set to generate a parameter set in the neural network model, wherein the neural network model is a multi-layer perceptron with a residual structure and is used for representing the corresponding relation between the amino acid sequence of the polypeptide molecule and the binding force numerical value of the corresponding polypeptide molecule and the target protein.
2. The training method of claim 1, wherein the m-th layer of the hidden layers of the multi-layer perceptron is connected to the (m+k)-th layer, m is a positive integer greater than 0, and k is a positive integer greater than or equal to 2.
3. Training method according to claim 1 or 2, wherein the parameter amount of the mth layer in the hidden layers of the multi-layer perceptron is a preset first number, the parameter amount of the (m+1) th layer in the hidden layers of the multi-layer perceptron is a preset second number, and the preset first number is a positive integer smaller than the preset second number.
4. The training method of claim 1, the training the neural network model using the training sample set, comprising:
binarizing the amino acid sequence of the sample polypeptide molecule to obtain an amino acid sequence binarization matrix corresponding to the sample polypeptide molecule, wherein the rows and columns of the binarization matrix respectively correspond to the positions of the amino acids and the types of the amino acids in the sample polypeptide molecule, or the rows and columns of the binarization matrix respectively correspond to the types of the amino acids and the positions of the amino acids in the sample polypeptide molecule;
and training the neural network model by utilizing the binarization matrix.
5. The training method of claim 4, the training the neural network model using the binarization matrix, comprising:
multiplying the binarization matrix with a matrix of a learnable encoder to obtain a continuous value mathematical representation converted into non-sparsity;
and inputting the non-sparse continuous value mathematical representation into a neural network model, so as to train the neural network model.
6. The training method of claim 1, the acquiring a training sample set comprising:
obtaining fluorescent signals of at least two sample polypeptide molecules on the polypeptide chip in a binding relationship with the target protein;
quantifying the fluorescent signal of each sample polypeptide molecule to obtain the binding force value of the sample polypeptide molecule and the target protein;
and generating training samples in the training sample set by using the amino acid sequence of each sample polypeptide molecule and the binding force value of the sample polypeptide molecule and the target protein.
7. A training device for a neural network model for predicting binding force of a polypeptide molecule to a target protein, comprising:
an acquisition module configured to acquire a training sample set, wherein each training sample comprises the amino acid sequence of a sample polypeptide molecule and the binding force value of the sample polypeptide molecule to a target protein;
the training module is configured to train a neural network model by using the training sample set to generate a parameter set in the neural network model, wherein the neural network model is a multi-layer perceptron with a residual structure and is used for representing the corresponding relation between the amino acid sequence of the polypeptide molecule and the binding force numerical value of the corresponding polypeptide molecule and the target protein.
8. A method for predicting binding force of a polypeptide molecule to a target protein, comprising:
inputting the amino acid sequence of the polypeptide molecule to be predicted into a binding force prediction model to obtain a binding force value between the polypeptide molecule to be predicted and a target protein, wherein the binding force prediction model is a neural network model trained by the training method according to any one of claims 1-6.
9. A device for predicting binding force of a polypeptide molecule to a target protein, comprising:
a prediction unit configured to input an amino acid sequence of a polypeptide molecule to be predicted into a binding force prediction model to obtain a binding force value between the polypeptide molecule to be predicted and a target protein, wherein the binding force prediction model is a neural network model trained by the training method according to any one of claims 1-6.
10. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-6 and/or claim 8.
11. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by one or more processors implements the method of any of claims 1-6 and/or claim 8.
CN202310730636.0A 2023-06-19 2023-06-19 Training method and device for neural network model for predicting binding force of polypeptide Pending CN116894472A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310730636.0A CN116894472A (en) 2023-06-19 2023-06-19 Training method and device for neural network model for predicting binding force of polypeptide

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310730636.0A CN116894472A (en) 2023-06-19 2023-06-19 Training method and device for neural network model for predicting binding force of polypeptide

Publications (1)

Publication Number Publication Date
CN116894472A true CN116894472A (en) 2023-10-17

Family

ID=88314154

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310730636.0A Pending CN116894472A (en) 2023-06-19 2023-06-19 Training method and device for neural network model for predicting binding force of polypeptide

Country Status (1)

Country Link
CN (1) CN116894472A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117711525A (en) * 2024-02-05 2024-03-15 北京悦康科创医药科技股份有限公司 Activity prediction model training and activity prediction related products
CN117711525B (en) * 2024-02-05 2024-05-10 北京悦康科创医药科技股份有限公司 Activity prediction model training and activity prediction related products



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination