CN114283878A - Method and apparatus for training matching model, predicting amino acid sequence and designing medicine

Info

Publication number: CN114283878A
Application number: CN202110997711.0A
Authority: CN
Inventors: 吴家祥
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2021-08-27
Filing date: 2021-08-27
Publication date: 2022-04-05

Abstract

The embodiment of the application relates to a method and a device for training a matching model, predicting an amino acid sequence and designing a medicine. The method for training the matching model comprises the following steps: obtaining a sample set, wherein the sample set comprises a three-dimensional structure of a known protein and an amino acid sequence corresponding to the three-dimensional structure of the known protein; and inputting the sample set into a matching function and training to obtain a trained matching model. By using the method according to the embodiment of the application, the prediction accuracy of the matching degree of the predicted amino acid sequence and the three-dimensional structure of the protein can be improved.

Description

Method and apparatus for training matching model, predicting amino acid sequence and designing medicine

Technical Field

The embodiment of the application relates to the technical field of computers, in particular to a method and a device for training a matching model, predicting an amino acid sequence and designing a medicine.

Background

Proteins, which consist of linear chains of amino acids, are among the most widely used molecules in living organisms. They play a crucial role in the general biological mechanisms. Proteins naturally fold into three-dimensional structures according to amino acid sequences, and the structures directly affect the functions of the proteins. Given environmental factors such as solvent, temperature, etc., the amino acid sequence of a protein can substantially uniquely determine its corresponding three-dimensional structure. Therefore, if a three-dimensional structure of a protein that can perform a specific biological function is known, the corresponding amino acid sequence can be found by calculation so that the folded three-dimensional structure matches the requirement to perform the corresponding biological function.

Most existing protein de novo design methods rely on hand-crafted energy functions to evaluate the degree of matching between an amino acid sequence and the three-dimensional structure of the protein backbone. Such energy functions are usually based on approximated physical rules (to keep computation efficient) and do not accurately capture the relationship between the amino acid sequence and the three-dimensional structure of the protein backbone, so the structure obtained by de novo design often deviates from the intended one.

Disclosure of Invention

The embodiment of the application provides a method and a device for training a matching model, predicting an amino acid sequence and designing a medicine, so as to improve the prediction precision of the related predictions such as the prediction of the amino acid sequence based on the three-dimensional structure of the protein, reduce the working cost and improve the prediction efficiency.

In a first aspect, embodiments of the present application provide a method for training a matching model for characterizing a degree of matching between an amino acid sequence and a three-dimensional structure of a protein, comprising:

obtaining a sample set, wherein the sample set comprises a three-dimensional structure of a known protein and an amino acid sequence corresponding to the three-dimensional structure of the known protein; and

inputting the sample set into a matching function and training the matching function to obtain a trained matching model.

According to some embodiments of the application, the matching model is obtained by training:

inputting a sample set having an actual sample distribution into the matching function;

and training the matching function according to the actual sample distribution to enable the predicted sample distribution of the matching function to be close to the actual distribution, wherein the actual sample distribution and the predicted sample distribution are sample distributions taking the three-dimensional structure of the protein and the amino acid sequence as variables.

inputting a sample set consisting of a known protein three-dimensional structure and a corresponding amino acid sequence into a matching function to obtain a predicted value of the matching probability of the known protein three-dimensional structure and the corresponding amino acid sequence in the sample set;

determining a loss value based on the match probability prediction value;

and carrying out iterative optimization on the matching function according to the loss value so as to increase the predicted value of the matching probability, thereby obtaining a trained matching model.

According to some embodiments of the application, the iteratively optimizing the matching function according to the loss value such that the predicted matching probability value is increased comprises:

iteratively optimizing the matching function such that a total match probability prediction value of the sample set is increased.

sampling from the sample set to obtain a sample subset;

iteratively optimizing the matching function such that a total match probability prediction value of the subset of samples is increased.

According to some embodiments of the present application, the matching function is a normalized exponential function whose exponent is the negative of an energy function prediction value, and the obtaining of the prediction value of the matching probability of the three-dimensional structure of the known protein with the corresponding amino acid sequence in the sample set comprises:

predicting an energy function prediction value according to the three-dimensional structure of the known protein and the initial amino acid sequence based on an energy function;

and calculating the matching probability predicted value according to the energy function predicted value based on the matching function.
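The normalized exponential form described above can be illustrated with a minimal sketch. It assumes a finite set of candidate structures for one amino acid sequence so that the normalization can be computed exactly; the function name `matching_probabilities` and the energy values are hypothetical and are not taken from the application.

```python
import math

def matching_probabilities(energies):
    """Normalized exponential (softmax) of the negative energies."""
    # Subtract the minimum energy before exponentiating for numerical stability;
    # the shift cancels in the ratio and leaves the probabilities unchanged.
    e_min = min(energies)
    weights = [math.exp(-(e - e_min)) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical energy predictions for three candidate structures of one sequence.
probs = matching_probabilities([1.0, 2.0, 4.0])
```

A lower energy prediction yields a higher matching probability, and the probabilities over the candidate set sum to one. In a continuous structure space the normalization is generally intractable, which is why the finite candidate set is only an assumption of this sketch.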

According to some embodiments of the application, the determining a loss value based on the match probability prediction value comprises:

carrying out logarithmic operation on the matching probability predicted value;

and determining a loss value according to the logarithm operation result.
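One common way to turn the logarithm of the matching probability into a loss value is the average negative log-likelihood. The formula below is a sketch under that assumption; the symbols $x_i$ (known three-dimensional structure), $s_i$ (corresponding amino acid sequence), $\theta$ (trainable parameters) and $N$ (number of samples) are introduced here for illustration only.

```latex
\mathcal{L}(\theta) \;=\; -\frac{1}{N}\sum_{i=1}^{N} \log p_\theta\!\left(x_i \mid s_i\right)
```

Under this choice, decreasing the loss value is equivalent to increasing the total (log) matching probability prediction value of the samples.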

According to some embodiments of the application, the matching model is associated with trainable parameters, the iterative optimization of the matching function according to the loss values comprises:

calculating the gradient of the loss value with respect to the trainable parameters to obtain a backpropagation gradient;

and iteratively optimizing the matching function by backpropagation according to the backpropagation gradient.

According to some embodiments of the application, the energy function is a graph neural network containing trainable parameters;

the calculating of the gradient of the loss value with respect to the trainable parameters to obtain the backpropagation gradient is approximately replaced by:

sampling from the predicted sample distribution corresponding to the matching function according to the corresponding amino acid sequence to obtain a three-dimensional structure of the sampled protein;

predicting an energy function sampling prediction value according to the three-dimensional structure of the sampling protein and the corresponding amino acid sequence based on the energy function;

calculating the difference value of the energy function predicted value and the energy function sampling predicted value;

calculating the gradient of the difference value with respect to the trainable parameters, and using the result as an approximation of the backpropagation gradient.
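The approximation above (the gradient of the difference between the energy of a known sample and the energy of a sample drawn from the model) can be sketched on a toy problem. Everything in the sketch is an assumption made for illustration: the "structure" is a single number `x`, the energy is the quadratic `0.5 * (x - theta)**2` instead of a graph neural network, and the model sample is drawn exactly from a Gaussian instead of by Markov chain Monte Carlo.

```python
import random

def energy_grad_theta(x, theta):
    # Toy energy E_theta(x) = 0.5 * (x - theta)**2, so dE/dtheta = -(x - theta).
    return -(x - theta)

def train(data, steps=2000, lr=0.01, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        x_data = rng.choice(data)         # known sample from the sample set
        x_model = rng.gauss(theta, 1.0)   # stand-in for a sample drawn from the model
        # Gradient of [E(x_data) - E(x_model)] with respect to theta.
        grad = energy_grad_theta(x_data, theta) - energy_grad_theta(x_model, theta)
        theta -= lr * grad                # gradient descent step
    return theta

data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(500)]  # hypothetical training data
theta_hat = train(data)
```

In this toy setting the update lowers the energy of known samples and raises the energy of model samples, so `theta_hat` drifts toward the mean of the data (about 2.0) without the normalization constant ever being evaluated.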

According to some embodiments of the application, the sampling from the distribution of predicted samples corresponding to the matching function is performed by a Markov chain Monte Carlo method.
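As an illustration of Markov chain Monte Carlo sampling from a distribution proportional to the exponential of a negative energy, the following is a minimal random-walk Metropolis sketch. The one-dimensional state, the quadratic toy energy, the uniform proposal and the name `metropolis_sample` are assumptions of this example; the application does not specify which Markov chain Monte Carlo variant is used.

```python
import math
import random

def metropolis_sample(energy, x0, n_steps, step_size=1.0, seed=0):
    """Random-walk Metropolis sampler targeting p(x) proportional to exp(-energy(x))."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    samples = []
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step_size, step_size)  # symmetric proposal
        e_new = energy(x_new)
        # Accept with probability min(1, exp(-(e_new - e))).
        if e_new <= e or rng.random() < math.exp(-(e_new - e)):
            x, e = x_new, e_new
        samples.append(x)
    return samples

# Toy energy 0.5 * x**2, whose target distribution is a standard normal.
samples = metropolis_sample(lambda x: 0.5 * x * x, x0=0.0, n_steps=20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Only energy differences enter the acceptance test, so the sampler never needs the normalization constant of the distribution.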

In a second aspect, embodiments of the present application provide a method of predicting an amino acid sequence, comprising:

(a) determining a matching result between the three-dimensional structure of the target protein and the starting amino acid sequence based on a matching model, wherein the matching result characterizes a degree of matching between the starting amino acid sequence and the three-dimensional structure of the target protein, and the matching model is obtained according to the method of the first aspect;

(b) mutating the starting amino acid sequence so as to obtain a mutated amino acid sequence;

(c) determining a match result between the mutant amino acid sequence and the three-dimensional structure of the target protein based on the matching model;

(d) determining whether to retain the mutation in step (b) based on the difference in the match results in step (a) and step (c);

(e) repeating steps (b) to (d) until the final amino acid sequence is obtained.

According to some embodiments of the application, the mutation is a point mutation.

According to some embodiments of the application, the point mutation comprises at least one of a deletion, a substitution, an insertion.

According to some embodiments of the present application, the mutation is determined using a Monte Carlo sampling method.

According to some embodiments of the present application, the Monte Carlo sampling method comprises at least one of a simulated annealing Monte Carlo sampling method and a replica exchange based Monte Carlo sampling method.

According to some embodiments of the application, step (d) comprises:

retaining the mutation in step (b) when the matching result in step (c) corresponds to a higher degree of matching than the matching result in step (a);

when the matching degree corresponding to the matching result in step (c) is not higher than the matching degree corresponding to the matching result in step (a), retaining the mutation in step (b) with a preset probability, and otherwise discarding the mutation in step (b).

According to some embodiments of the application, the preset probability is determined based on a current temperature and a degree of matching change.
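The mutate, score and accept-or-reject loop of the second aspect can be sketched as follows. The sketch makes several assumptions that are not stated in the application: the matching result is represented by an energy (lower means better match), the preset probability follows the Metropolis form `exp(-delta / temp)`, the temperature decreases geometrically, mutations are single substitutions, and `toy_energy` with its hidden `TARGET` is a hypothetical stand-in for the trained matching model.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def design_sequence(energy, start, n_steps=2000, t_start=1.0, t_end=0.01, seed=0):
    rng = random.Random(seed)
    seq, e = list(start), energy(start)           # step (a): score the starting sequence
    for step in range(n_steps):
        # Geometric cooling schedule (an assumed choice).
        temp = t_start * (t_end / t_start) ** (step / max(1, n_steps - 1))
        pos = rng.randrange(len(seq))
        old = seq[pos]
        seq[pos] = rng.choice(AMINO_ACIDS)        # step (b): point substitution
        e_new = energy("".join(seq))              # step (c): score the mutant
        delta = e_new - e                         # positive delta means a worse match
        # step (d): keep improvements; otherwise keep with probability exp(-delta / temp).
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            e = e_new
        else:
            seq[pos] = old                        # discard the mutation
    return "".join(seq), e

# Hypothetical stand-in energy: number of mismatches against a hidden target.
TARGET = "MKTAYIAKQR"

def toy_energy(s):
    return sum(a != b for a, b in zip(s, TARGET))

final_seq, final_energy = design_sequence(toy_energy, "A" * len(TARGET))
```

At high temperature the loop accepts many unfavorable mutations and explores broadly; as the temperature falls, it increasingly retains only mutations that improve the match.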

In a third aspect, embodiments of the present application provide a method of designing a medicament, comprising:

determining a three-dimensional structure of a target protein based on a target of a known disease, the three-dimensional structure of the target protein being suitable for binding to the target;

according to the method of the second aspect, the amino acid sequence of the target protein is determined based on the three-dimensional structure of the target protein.

In a fourth aspect, embodiments of the present application provide an apparatus for training a matching model for characterizing a degree of matching between an amino acid sequence and a three-dimensional structure of a protein, comprising:

a sample set acquisition module, configured to obtain a sample set, wherein the sample set comprises a three-dimensional structure of a known protein and an amino acid sequence corresponding to the three-dimensional structure of the known protein; and

a training module, configured to input the sample set into a matching function and train the matching function so as to obtain a trained matching model.

In a fifth aspect, embodiments of the present application provide an apparatus for predicting an amino acid sequence, comprising:

a first matching module, configured to determine a matching result between the three-dimensional structure of the target protein and the initial amino acid sequence based on a matching model, wherein the matching result represents a degree of matching between the initial amino acid sequence and the three-dimensional structure of the target protein, and the matching model is obtained by the method of the first aspect;

a mutation module for mutating the starting amino acid sequence so as to obtain a mutated amino acid sequence;

a second matching module for determining a matching result between the mutant amino acid sequence and the three-dimensional structure of the target protein based on the matching model;

a determining module for determining whether to retain the mutation based on a difference in the matching results in the first matching module and the second matching module until a final amino acid sequence is obtained.

In a sixth aspect, embodiments of the present application provide a computing device, comprising: a processor and a memory;

the memory for storing a computer program;

the processor is configured to execute the computer program to implement the method as described above.

In a seventh aspect, an embodiment of the present application provides a computer-readable storage medium, wherein the storage medium includes computer instructions, which when executed by a computer, cause the computer to implement the method as described above.

The method and the device for training the matching model, predicting the amino acid sequence, training the energy function model and designing the medicine, which are provided by the embodiment of the application, can improve the prediction precision of the matching degree of the predicted amino acid sequence and the three-dimensional structure of the protein, so that the prediction precision of the prediction of the amino acid sequence and other related predictions based on the three-dimensional structure of the protein can be improved, the working cost is reduced, and the prediction efficiency is improved. In the embodiment of the application, by adopting a matching model, a protein de novo design method based on an energy model can be realized, an energy function capable of accurately evaluating the coincidence degree of an amino acid sequence and a desired protein main chain three-dimensional structure is obtained through data-driven model training, and the energy function is applied to an optimization process of the amino acid sequence, for example, in a simulated annealing Monte Carlo sampling process, the amino acid sequence can be gradually optimized, so that the amino acid sequence capable of completing a specific biological function is obtained, and the purpose of protein de novo design is realized.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.

FIG. 1 is a system architecture diagram according to an embodiment of the present disclosure;

FIG. 2 is a schematic flow chart diagram illustrating a method for training a matching model according to an embodiment of the present application;

FIG. 3 is a schematic flow chart diagram illustrating a method for training a matching model according to another embodiment of the present application;

FIG. 4 is a schematic flow chart diagram illustrating a method for training a matching model according to another embodiment of the present application;

FIG. 5 is a schematic flow chart diagram illustrating a method for training a matching model according to another embodiment of the present application;

FIG. 6 is a schematic flow chart diagram illustrating a method for training a matching model according to another embodiment of the present application;

FIG. 7 is a schematic flow chart diagram illustrating a method for training a matching model according to another embodiment of the present application;

FIG. 8 is a schematic flow chart diagram illustrating a method for training a matching model according to another embodiment of the present application;

FIG. 9 is a schematic flow chart diagram illustrating a method for training a matching model according to another embodiment of the present application;

FIG. 10 is a schematic flow chart diagram illustrating a method for training a matching model according to another embodiment of the present application;

FIG. 11 is a schematic diagram of a method for training a matching model according to another embodiment of the present application;

FIG. 12 is a schematic flow chart of a method for predicting an amino acid sequence provided by an embodiment of the present application;

FIG. 13 is a schematic flow chart illustrating a method for training an energy function model according to another embodiment of the present application;

FIG. 14 is a schematic flow chart of a method for designing a drug according to an embodiment of the present application;

FIG. 15 is a schematic diagram of the structure of an apparatus for predicting an amino acid sequence provided in an embodiment of the present application;

FIG. 16 is a schematic diagram of an apparatus for training a matching model according to an embodiment of the present application;

FIG. 17 is a block diagram of a computing device to which embodiments of the present application relate;

FIG. 18 is a block diagram of an attention mechanism provided in accordance with an embodiment of the present application; and

FIG. 19 is a block diagram of a multi-headed attention mechanism provided in accordance with an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.

It should be understood that in the embodiment of the present application, "B corresponding to A" means that B is associated with A. In one implementation, B may be determined from A. It should also be understood that determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information.

In the description of the present application, "plurality" means two or more than two unless otherwise specified.

In addition, in order to facilitate clear description of technical solutions of the embodiments of the present application, in the embodiments of the present application, terms such as "first" and "second" are used to distinguish the same items or similar items having substantially the same functions and actions. Those skilled in the art will appreciate that the terms "first," "second," etc. do not denote any order, quantity, or importance.

The embodiment of the application is applied to the technical field of software testing, and particularly applied to the legality check of the requirement data, so that the test case can be stably and efficiently generated according to the legal requirement data.

In order to facilitate understanding of the embodiments of the present application, the related concepts related to the embodiments of the present application are first briefly described as follows:

Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.

Machine Learning (ML) is a multi-domain interdisciplinary subject that involves probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specifically studies how a computer can simulate or realize human learning behavior so as to acquire new knowledge or skills and reorganize its existing knowledge structure to continuously improve its own performance. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.

A Neural Network (NN), in the fields of machine learning and cognitive science, is a mathematical or computational model that mimics the structure and function of biological neural networks (the central nervous system of animals, particularly the brain) and is used to estimate or approximate functions. A neural network performs its computation through a large number of connected artificial neurons. In most cases, the neural network can change its internal structure on the basis of external information, and is therefore an adaptive system. Neural networks are usually optimized by a learning method based on mathematical statistics, and are therefore a practical application of mathematical statistics, by which a large number of local structure spaces that can be expressed as functions can be obtained. As with other machine learning methods, neural networks have been used to solve a variety of problems, such as machine vision and speech recognition, which are difficult to solve by conventional rule-based programming.

The Attention Mechanism (Attention Mechanism) refers herein to a vector for representing importance weights of features, and in order to predict or infer a target element (e.g., a pixel in an image or a word in a sentence), the Attention vector may be used to estimate how much the target element is associated with other elements, and a sum of values of the elements multiplied by the Attention vector is used as an approximate value of the target element.

Proteins are the most essential and versatile macromolecules in the body, and knowledge of their functions plays a crucial role in the development of scientific, medical and agricultural fields. Protein interactions are closely related to a range of cellular activities and, therefore, they are critical to the health and disease state of the body. Given their indispensable role in a wide range of biological processes, the regulation of protein interactions has broad prospects in the field of drug development. However, because protein interaction interfaces are generally large and flat and lack distinct structural features, designing drugs that target protein interaction interfaces is extremely challenging, and this class of important targets has been considered "undruggable".

How to predict protein sequences that can fulfill a particular function is one of the major tasks facing large pharmaceutical companies that currently produce biopharmaceuticals. In one possible implementation, the optimization of proteins relies mainly on the manual experience of pharmacologists and is refined iteratively by trial and error; for example, the determination of antibody binding sites currently relies mainly on expensive structure determination experiments or time-consuming molecular knockout screening experiments. This demands considerable manpower and material resources.

The greatest advantage of AI technology is that it can digest a large amount of training data in a short time through a self-learning process, so that it learns by itself without explicit instruction.

Energy-based models (energy-based models): for a given probability distribution, the corresponding network structure is designed to approximate the negative logarithmic function (or the gradient with respect to the argument) of its probability density function, thereby parameterizing the probability distribution. An energy function is used herein to characterize the relationship between the three-dimensional structure of a protein and the amino acid sequence, and in short, can be understood as the probability that, for a given amino acid sequence and three-dimensional structure of a protein, the amino acid sequence exhibits the three-dimensional structure of the protein.
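The relationship between an energy function and a probability distribution described above can be written out as follows. This is the standard energy-based model formulation; the symbols $x$ (three-dimensional structure), $s$ (amino acid sequence), $\theta$ (network parameters) and $Z_\theta$ (normalization constant), as well as the choice to condition on the sequence, are assumptions introduced here for illustration.

```latex
p_\theta(x \mid s) \;=\; \frac{\exp\!\bigl(-E_\theta(x, s)\bigr)}{Z_\theta(s)},
\qquad
Z_\theta(s) \;=\; \int \exp\!\bigl(-E_\theta(x', s)\bigr)\,\mathrm{d}x',
\qquad
E_\theta(x, s) \;=\; -\log p_\theta(x \mid s) \;-\; \log Z_\theta(s)
```

The last identity shows that the energy equals the negative logarithm of the probability density up to a term that does not depend on the structure $x$, so a lower energy corresponds to a higher matching probability.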

Three-dimensional structure of protein: proteins are generally composed of tens to thousands of amino acids, each of which is composed of hydrogen, carbon, nitrogen, oxygen and sulfur atoms; the three-dimensional structure of a protein is defined by the three-dimensional coordinates of all its atoms in space.
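One possible in-memory representation of a training sample, pairing an amino acid sequence with the coordinates of its atoms, is sketched below. The class names `Atom`, `Residue` and `ProteinSample`, the one-letter residue codes, and the coordinate values are hypothetical placeholders, not data or structures taken from the application.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Atom:
    name: str                         # e.g. "CA" for the alpha carbon
    element: str                      # "H", "C", "N", "O" or "S"
    xyz: Tuple[float, float, float]   # coordinates (units assumed to be angstroms)

@dataclass
class Residue:
    amino_acid: str                   # one-letter amino acid code
    atoms: List[Atom]

@dataclass
class ProteinSample:
    residues: List[Residue]

    @property
    def sequence(self) -> str:
        # Amino acid sequence read off the residues in chain order.
        return "".join(r.amino_acid for r in self.residues)

    def coordinates(self) -> List[Tuple[float, float, float]]:
        # Flat list of all atom coordinates, i.e. the three-dimensional structure.
        return [a.xyz for r in self.residues for a in r.atoms]

# Placeholder two-residue sample with made-up coordinates.
sample = ProteinSample([
    Residue("G", [Atom("N", "N", (0.0, 0.0, 0.0)), Atom("CA", "C", (1.46, 0.0, 0.0))]),
    Residue("A", [Atom("N", "N", (2.0, 1.2, 0.0)), Atom("CA", "C", (3.4, 1.3, 0.1))]),
])
```

Keeping the sequence and the coordinates in one object makes it straightforward to pass a (structure, sequence) pair to a matching function.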

Based on the method and the device for predicting the amino acid sequence, training the energy function model and designing the medicine, provided by the embodiment of the application, the prediction precision of the related prediction such as the amino acid sequence prediction based on the three-dimensional structure of the protein can be improved, the working cost is reduced, and the prediction efficiency is improved. In the embodiment of the application, the protein de novo design method based on the energy model is realized, the energy function capable of accurately evaluating the coincidence degree of the amino acid sequence and the expected three-dimensional structure of the protein main chain is obtained through data-driven model training, and the energy function is applied to the optimization process of the amino acid sequence, for example, the simulated annealing Monte Carlo sampling process, so that the amino acid sequence can be gradually optimized, the amino acid sequence capable of completing a specific biological function is obtained, and the purpose of de novo design of the protein is realized.

The application scenario of the method includes but is not limited to the fields of medical treatment, biology, scientific research and the like, for example, the method is used for drug production, drug research and development, vaccine research and development and the like, human intervention is not needed in the whole prediction process, data driving can be achieved completely, and the cost is low.

In some embodiments, the system architecture of embodiments of the present application is shown in fig. 1.

Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present application, which includes a user device 101, a data acquisition device 102, a training device 103, an execution device 104, a database 105, and a content library 106.

The data acquisition device 102 is configured to read training data from the content library 106 and store the read training data in the database 105. The training data related to the embodiment of the application comprises a three-dimensional structure of the protein and perturbed data thereof.

The training device 103 trains the pre-trained model based on the training data maintained in the database 105, so that the trained model can effectively determine the energy relationship between the protein and the amino acid sequence; thus, for a given target protein three-dimensional structure, an amino acid sequence with a high probability of realizing the target three-dimensional structure can be obtained through iterative optimization. The object prediction model obtained by the training device 103 may be applied to different systems or devices.

In fig. 1, the execution device 104 is configured with an I/O interface 107 for data interaction with external devices, for example for receiving information about the protein to be predicted, such as the three-dimensional structure of the protein, sent by the user device 101 through the I/O interface. The calculation module 109 in the execution device 104 processes the input protein information using the trained model, outputs the prediction result of the amino acid sequence, and sends the corresponding result to the user device 101 through the I/O interface.

The user device 101 may include a mobile phone, a tablet computer, a notebook computer, a palm computer, a Mobile Internet Device (MID), or other terminal devices with a browser installation function.

The execution device 104 may be a server.

For example, the server may be a rack server, a blade server, a tower server, or a cabinet server. The server may be an independent server, or a server cluster composed of a plurality of servers.

In this embodiment, the execution device 104 is connected to the user device 101 through a network. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or another communication network.

It should be noted that fig. 1 is only a schematic diagram of a system architecture provided in an embodiment of the present application, and the positional relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation. In some embodiments, the data acquisition device 102 may be the same device as the user device 101, the training device 103, and the execution device 104. The database 105 may be distributed on one server or a plurality of servers, and the content library 106 may be distributed on one server or a plurality of servers.

The technical solutions of the embodiments of the present application are described in detail below with reference to some embodiments. Features and advantages of various aspects described below may be combined, and details of the same or similar concepts or processes may not be repeated in some embodiments.

First, a method for training a matching model according to an embodiment of the present application is described with reference to fig. 2 to 11. The matching model is used for characterizing the matching degree between the amino acid sequence and the three-dimensional structure of the protein.

Fig. 2 is a schematic diagram illustrating a method for training a matching model used for characterizing the matching degree between an amino acid sequence and a three-dimensional structure of a protein according to an embodiment of the present application, and referring to fig. 2, the method for training the matching model includes:

S210: obtaining a sample set, wherein the sample set comprises a three-dimensional structure of a known protein and an amino acid sequence corresponding to the three-dimensional structure of the known protein.

According to embodiments of the present application, a matching model is trained by using known protein data, so that the degree of matching between an amino acid sequence and a three-dimensional structure of a protein can be effectively characterized using the trained matching model; in other words, the matching model can be used to predict the probability or likelihood that a protein having a given amino acid sequence will exhibit the three-dimensional structure. According to embodiments of the present application, the known protein data employed includes the three-dimensional structure of a known protein and the amino acid sequence corresponding to the three-dimensional structure of the known protein. It should be noted that the expression "known three-dimensional structure of protein" used herein is to be understood in a broad sense: it includes three-dimensional structures obtained by biochemical experiments, such as purification and crystallization of the protein followed by structure determination by X-ray crystallography or cryo-electron microscopy, and may also include structures predicted by a mathematical model, such as by protein three-dimensional structure prediction software.

According to embodiments of the present application, the three-dimensional conformation of the protein can be obtained by structural analysis of a crystal of the protein, for example by X-ray crystal diffraction analysis, three-dimensional reconstruction by electron microscopy, or nuclear magnetic resonance. In addition, crystal data of the protein may be obtained from public databases, such as the Cambridge Structural Database (CSD), the Protein Data Bank (PDB), the Inorganic Crystal Structure Database (ICSD), and the powder diffraction database of the International Centre for Diffraction Data (JCPDS-ICDD).

In addition, after the relevant information on the protein (e.g., amino acid sequence, structural formula, or partial crystal data) has been determined, the three-dimensional structure of the protein can also be reconstructed or predicted using various software, for example the Rosetta@home platform (website: https://www.rosettacommons.org), the Foldit: Solve Puzzles for Science platform (website: https://fold.it/portal), the Folding@home platform (website: https://foldingathome.org), a template modeling platform (website: https://salilab.org/modeller/), and SWISS-MODEL (website: https://swissmodel.expasy.org/).

In addition, the amino acid sequence of a protein can be obtained by searching a known database (for example, the UniProt protein sequence database, http://www.uniprot.org/), or by a conventional protein sequencing method.

S220: the sample set is input to a matching function and trained to obtain a trained matching model.

After obtaining the sample sets, machine learning can be performed using these sample sets to obtain a model that can be used to characterize the degree of match between the amino acid sequence and the three-dimensional structure of the protein.

As used herein, "degree of match" is a measure of the probability that an amino acid sequence will assume the three-dimensional structure of a particular protein. Specifically, the higher the degree of matching between an amino acid sequence and a three-dimensional structure of a specific protein, the higher the probability that the protein having the amino acid sequence will exhibit the specific three-dimensional structure under natural conditions, particularly physiological conditions. Because the adopted sample sets have high matching degree between the amino acid sequences and the three-dimensional structures, the sample sets can be used for effectively training the matching functions, and therefore a model which can be used for predicting the matching degree between the amino acid sequences and the three-dimensional structures of the proteins is obtained. In some embodiments of the present application, the result of "degree of match" may be output in a quantitative, semi-quantitative, or qualitative descriptive manner. For example, in some embodiments, quantitative data of the probability of a match may be employed to characterize the output of a "degree of match".

Further, based on the three-dimensional structure of the protein, a person skilled in the art can construct a topological graph G by using each amino acid residue in the protein as a node V (also referred to as a "vertex") of the topological graph and using a pair of adjacent amino acid residues as an edge E, and can mathematically express the topological graph as (V, E). Thus, based on the three-dimensional structure of the protein, a topological map G corresponding to the three-dimensional structure can be constructed.

After the three-dimensional conformation is acquired, the topological graph G can be constructed by taking each amino acid residue in the protein as a node V (sometimes also referred to as a "vertex") and taking each pair of adjacent amino acid residues as an edge E. In some embodiments of the present application, a pair of adjacent amino acid residues refers to a pair of amino acid residues whose alpha carbon atoms are separated by a distance not exceeding a predetermined threshold. In other words, for a specific amino acid residue, a neighborhood region with the predetermined threshold as its radius is set around the alpha carbon atom of that residue, and every other amino acid residue whose alpha carbon atom lies within the neighborhood region is considered to form an edge E with the specific residue. Herein, the alpha carbon refers to the carbon atom of an amino acid residue that is attached to the carboxyl group; the alpha carbon of an amino acid is important for protein folding. When describing a protein (which is a long chain of amino acids), the position of the alpha carbon of an amino acid is typically taken as the position of that amino acid. The predetermined threshold may be about 1-20 angstroms, such as about 1-19 angstroms, about 1-18 angstroms, about 1-17 angstroms, about 1-16 angstroms, about 1-15 angstroms, about 1-14 angstroms, about 1-13 angstroms, about 1-12 angstroms, or about 1-10 angstroms. It is noted that the above ranges cover all values within the range. In addition, the term "about" as used herein means plus or minus 10% unless otherwise specified.
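The construction described above can be sketched in a few lines. This is a minimal illustration, assuming alpha-carbon coordinates are already available as an N x 3 array; the function name `build_contact_graph`, the toy coordinates, and the 8-angstrom threshold are hypothetical choices made for the example, not values taken from the application.

```python
import numpy as np

def build_contact_graph(ca_coords, threshold=10.0):
    """Return (adjacency, edges) for residues whose C-alpha atoms lie within `threshold` angstroms."""
    coords = np.asarray(ca_coords, dtype=float)           # shape (N, 3)
    diff = coords[:, None, :] - coords[None, :, :]        # pairwise displacement vectors
    dist = np.linalg.norm(diff, axis=-1)                  # (N, N) distance matrix
    adjacency = (dist <= threshold) & ~np.eye(len(coords), dtype=bool)  # drop self-loops
    edges = [(int(i), int(j)) for i, j in zip(*np.nonzero(adjacency)) if i < j]
    return adjacency.astype(int), edges

# Toy backbone: four residues on a line, 3.8 angstroms apart (illustrative values only).
toy_coords = [[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0], [11.4, 0.0, 0.0]]
A, E = build_contact_graph(toy_coords, threshold=8.0)
```

With this toy input, residues 0 and 3 are 11.4 angstroms apart and therefore share no edge, while every other pair falls inside the neighborhood radius.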

In some embodiments, because amino acid residues rather than individual atoms are used as nodes, a large amount of background data is avoided, which improves the training efficiency, prediction efficiency, and accuracy of machine learning.

According to an embodiment of the application, after obtaining the topological graph, a feature vector may be determined from the topological graph. According to embodiments of the present application, the feature vector used herein may include corner features of the topological map, and may further include features of the amino acid residues involved in the topological map and combination features between the amino acid residues. The related features can be collectively called a multi-dimensional vector matrix, so that quantitative characterization of the access area is realized.

The corner features of the topological graph can be characterized by an adjacency matrix and a degree matrix. The degree matrix is a diagonal matrix whose diagonal elements are the degrees of the vertices, where the degree of a vertex is the number of edges associated with that vertex. The adjacency matrix indicates whether or not a relationship exists between each pair of vertices. For a given topological graph, one skilled in the art can determine the adjacency matrix and degree matrix manually, or can calculate them with published software such as RDKit (https://www.rdkit.org/).

In some embodiments of the present application, the above features and properties may be characterized by one-hot encoding.
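As a minimal sketch of such an encoding, the example below maps each residue type to a 20-dimensional indicator vector. The alphabet ordering and the names `AMINO_ACIDS`, `AA_INDEX`, and `one_hot_sequence` are assumptions made for illustration; the application does not specify a particular ordering or feature layout.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues, one-letter codes
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_sequence(sequence):
    """Encode a sequence as an (L, 20) matrix with exactly one 1 per row."""
    encoded = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=int)
    for row, aa in enumerate(sequence):
        encoded[row, AA_INDEX[aa]] = 1
    return encoded

features = one_hot_sequence("MKV")      # hypothetical three-residue fragment
```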

Additionally, the energy function model may be a neural network model that is capable of handling the above-described features, e.g., some embodiments of the present application provide a graph neural network provided with an attention mechanism.

The attention mechanism herein refers to a vector that represents the importance weight of each feature. For example, to predict or infer an element of interest (e.g., an amino acid residue in a protein structure), an attention vector may be used to estimate how strongly the element of interest is associated with other elements, and the sum of the values of these elements, weighted by the attention vector, is used as an approximation of the element of interest.

Referring to FIG. 18 for an example of an attention mechanism, the inputs $x_1, x_2, x_3, \dots, x_T$ at the bottom layer represent input sequence data; for example, $x_1$ may represent the terminal amino acid residue of a protein. First, the inputs are (optionally) passed through an embedding layer to obtain $a_1, a_2, a_3, \dots, a_T$. Then, each $a_i$ is multiplied by three matrices $W^Q$, $W^K$ and $W^V$ to obtain $q_i$, $k_i$ and $v_i$, $i \in \{1, 2, 3, \dots, T\}$. FIG. 18 shows how the output $b_1$ corresponding to the input $x_1$ is obtained. Namely, the vector dot product of $q_1$ with each of $k_1, k_2, k_3, \dots, k_T$ is calculated to obtain $\alpha_{1,1}, \alpha_{1,2}, \alpha_{1,3}, \dots, \alpha_{1,T}$; these values are then fed into a softmax layer, resulting in attention weights that all lie between 0 and 1:

$$\hat{\alpha}_{1,i} = \frac{\exp(\alpha_{1,i})}{\sum_{j=1}^{T} \exp(\alpha_{1,j})}$$

The weights $\hat{\alpha}_{1,i}$ obtained in the last step are multiplied by the $v_1, v_2, v_3, \dots, v_T$ at the corresponding positions and then summed, which yields the output $b_1$ corresponding to the input $x_1$.

Similarly, the output $b_2$ corresponding to the input $x_2$ is obtained by the same procedure, except that the $q_2$ corresponding to $b_2$ is now used to calculate the vector dot products with $k_1, k_2, k_3, \dots, k_T$.
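The single-head computation described above can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the dimensions, the random weights, and the function names are hypothetical, and the dot products are left unscaled to follow the description (many implementations additionally divide by the square root of the key dimension).

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row maximum keeps exp() numerically stable."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def self_attention(a, W_q, W_k, W_v):
    """a: (T, d) embedded inputs. Returns outputs b (T, d_v) and attention weights (T, T)."""
    q, k, v = a @ W_q, a @ W_k, a @ W_v
    alpha = q @ k.T                  # alpha[i, j] is the dot product of q_i and k_j
    weights = softmax(alpha)         # each row lies between 0 and 1 and sums to 1
    return weights @ v, weights      # b_i is the weighted sum of all v_j

rng = np.random.default_rng(0)
T, d = 5, 8                          # hypothetical sequence length and feature width
a = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
b, attn = self_attention(a, W_q, W_k, W_v)
```

Row `i` of `attn` holds the weights used to form output `b[i]`, so the first row corresponds to the weights for $b_1$ in the description.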

In some embodiments of the present application, the attention mechanism used is a multi-head attention mechanism; in other words, a multi-head attention layer is placed before the graph neural network. FIG. 19 illustrates the framework of the multi-head attention mechanism by way of example. Concretely, if the triple $q_i, k_i, v_i$ obtained as in the preceding paragraph is regarded as one "head", then "multi-head" means that, for a particular $x_i$, multiple sets of $W^Q$, $W^K$ and $W^V$ are multiplied with it to obtain multiple sets of $q_i, k_i, v_i$.

Taking the input $a_1$ in the right-hand diagram of FIG. 19 as an example, the multi-head mechanism (three heads are used here as an example) yields three outputs, one per head. To obtain the output $b_1$ corresponding to $a_1$, the three per-head outputs are concatenated (the vectors are joined end to end) and then transformed, for example by a linear transformation (i.e., a single-layer fully connected neural network without a nonlinear activation layer), to obtain $b_1$. The same applies to the other inputs in the sequence, and they may share the parameters of these networks.
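A minimal sketch of the multi-head variant follows, assuming three heads as in the example above. The head width, the output projection `W_o`, and all names are hypothetical; each head repeats the single-head computation with its own weight matrices, and the per-head outputs are concatenated before one linear layer without activation.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with a max shift for numerical stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(a, heads, W_o):
    """heads: list of (W_q, W_k, W_v) triples, one per head; W_o: output projection."""
    per_head = []
    for W_q, W_k, W_v in heads:
        q, k, v = a @ W_q, a @ W_k, a @ W_v
        per_head.append(softmax(q @ k.T) @ v)          # one output per head, shape (T, d_head)
    concatenated = np.concatenate(per_head, axis=-1)   # join the head outputs end to end
    return concatenated @ W_o                          # linear layer, no nonlinear activation

rng = np.random.default_rng(0)
T, d, n_heads, d_head = 4, 6, 3, 2                     # hypothetical sizes
a = rng.normal(size=(T, d))
heads = [tuple(rng.normal(size=(d, d_head)) for _ in range(3)) for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, d))
b = multi_head_attention(a, heads, W_o)
```

The same `heads` and `W_o` are applied to every position in the sequence, which is one way to realize the parameter sharing mentioned above.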

In addition, in terms of the model, a graph neural network model can be adopted, such as the equivariant EGNN model, which takes graph-structured data as input and outputs a single predicted value as the value of the energy function $E_\theta(S, A)$. Specifically, the EGNN model defines an EGCL (E(n)-Equivariant Graph Convolutional Layer) unit having E(n) equivariance. Thus, according to some embodiments of the present application, the model fully takes into account the equivariance of the three-dimensional structure of the protein, i.e., rotating or translating the three-dimensional structure does not affect the structure itself or its related physicochemical properties. The effectiveness of the obtained pre-training features and the prediction accuracy when they are used for downstream tasks can thus be improved. The actual role that a protein plays in an organism (e.g., as an enzyme, a structural protein, an important regulator of a signaling pathway, a regulator of gene expression, or even as a cause of some genetic diseases or as an antibody conferring immunity against specific diseases) is largely determined by the three-dimensional structure of the protein. Therefore, because the training method in the scheme of the embodiment of the application is designed around the three-dimensional structure of the protein, it can extract a more effective feature representation from protein data and improve the prediction accuracy of downstream tasks.

In addition, with respect to graph neural networks (GNNs), a node in a graph may be defined by its own features and by its related nodes, and the goal of a GNN is to learn, for each node, a state embedding that incorporates information from its neighbors. The state embedding can be used to generate an output vector, for example a distribution over predicted node labels. One skilled in the art can further nest more neural networks in the respective layers. In each graph convolutional network (GCN) layer, the following can be employed independently as the propagation rule:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

wherein $\tilde{A} = A + I_N$ is the sum of the adjacency matrix $A$ of the topological graph G and the identity matrix $I_N$ representing self-connections; $\tilde{D}$ is the degree matrix of the topological graph G, i.e., $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$; $\sigma$ is a nonlinear activation function; $H^{(l)}$ is the activation matrix of the l-th layer (including layer 0, i.e., the input layer); and $W^{(l)}$ is the convolution kernel parameter matrix of the l-th layer.
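To make the propagation rule concrete, the sketch below applies one layer of the widely used normalized form $H^{(l+1)} = \sigma(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)})$ with $\tilde{A} = A + I_N$ to a three-node path graph. ReLU is assumed as the activation, and the toy adjacency matrix, features, and weights are illustrative values only.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One propagation step: ReLU(D~^{-1/2} (A + I) D~^{-1/2} H W)."""
    A_tilde = A + np.eye(A.shape[0])                  # add self-connections
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))   # diagonal entries of D~^{-1/2}
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return np.maximum(A_hat @ H @ W, 0.0)             # ReLU activation

# Path graph 0 - 1 - 2 with one-hot input features and an all-ones kernel.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H0 = np.eye(3)
W0 = np.ones((3, 2))
H1 = gcn_layer(A, H0, W0)
```

Because the two end nodes of the path are structurally identical, they receive identical output features, which is a quick sanity check on the normalization.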

In addition, according to some embodiments of the present application, the graph neural network employed has SE(3) equivariance. A function is said to be SE(3)-equivariant if it is equivariant to arbitrary rotation and translation operations in three-dimensional space, that is, if applying a rotation and translation to the input of the function changes its output accordingly (by the same rotation and translation). In some embodiments, the graph neural network comprises at least one SE(3)-equivariant network selected from the group consisting of EGNN, SE(3)-Transformer, and Lie-Transformer.

In some embodiments of the present application, the EGNN graph neural network architecture may be adopted. In short, EGNN is composed of a plurality of graph convolution layers, and the propagation formulas between layers are:

$$m_{ij} = \phi_m\left(h_i, h_j, g_{ij}, \lVert x_i - x_j \rVert_2\right)$$

$$x'_i = x_i + \sum_{j \in N(i)} \phi_x(m_{ij})\,(x_i - x_j)$$

$$h'_i = \phi_h\Big(h_i, \sum_{j \in N(i)} m_{ij}\Big)$$

wherein $h_i$ is a feature of the i-th amino acid residue in the protein (e.g., amino acid type); $g_{ij}$ is a combination feature of the i-th and j-th amino acid residues in the protein (e.g., distance and angular relationships between different amino acid residues); $x_i$ is the three-dimensional coordinate of the i-th amino acid residue in the protein (e.g., the three-dimensional coordinate of the C-alpha atom); and $N(i)$ is the neighborhood set of the i-th amino acid residue in the protein, i.e., the set of amino acid residues adjacent to the i-th amino acid residue.

For convenience of understanding, the above-described processing is described in detail below.

With respect to the formula $m_{ij} = \phi_m(h_i, h_j, g_{ij}, \lVert x_i - x_j \rVert_2)$, in some embodiments of the present application, the respective features $h_i$ and $h_j$ of the i-th and j-th amino acid residues, the combination feature $g_{ij}$ of the i-th and j-th amino acid residues, and the Euclidean distance between the three-dimensional coordinates $x_i$ and $x_j$ of the two residues are input to the function $\phi_m(\cdot)$ (whose form is not limited; for example, an MLP (multi-layer perceptron) can be used) to obtain the information vector $m_{ij}$ provided by the j-th amino acid residue to the i-th amino acid residue.

With respect to the formula $x'_i = x_i + \sum_{j \in N(i)} \phi_x(m_{ij})(x_i - x_j)$, in some embodiments of the present application, the three-dimensional coordinate $x_i$ of the i-th amino acid residue is updated in a manner similar to a residual network. In particular, for each amino acid residue in the neighborhood set of the i-th amino acid residue, the difference between the two sets of three-dimensional coordinates ($x_i$ and $x_j$) is taken and linearly weighted by the output value of the function $\phi_x(\cdot)$ (a scalar rather than a vector) to obtain a residual term for updating the three-dimensional coordinate of the i-th amino acid residue. The residual term is then added to the three-dimensional coordinate $x_i$ of the i-th amino acid residue before the update to obtain the updated three-dimensional coordinate $x'_i$.

With respect to the formula $h'_i = \phi_h(h_i, \sum_{j \in N(i)} m_{ij})$, in some embodiments of the present application, the feature $h_i$ of the i-th amino acid residue before the update and the sum of the information vectors $m_{ij}$ over all amino acid residues in the neighborhood set of the i-th amino acid residue are input to the function $\phi_h(\cdot)$ to obtain the updated feature $h'_i$ of the i-th amino acid residue.
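The three update formulas can be sketched as a single layer and checked numerically for equivariance. This is a simplified illustration, not the application's network: the combination feature $g_{ij}$ is omitted, each of $\phi_m$, $\phi_x$ and $\phi_h$ is replaced by a single linear map (with tanh where noted) instead of an MLP, and all sizes, weights, and names are hypothetical.

```python
import numpy as np

def egcl(h, x, neighbors, params):
    """One EGCL-style update. h: (N, F) residue features, x: (N, 3) coordinates."""
    W_m, w_x, W_h = params
    h_new, x_new = np.empty_like(h), x.copy()
    for i in range(len(h)):
        agg = np.zeros(W_m.shape[1])
        for j in neighbors[i]:
            dist = np.linalg.norm(x[i] - x[j])
            m_ij = np.tanh(np.concatenate([h[i], h[j], [dist]]) @ W_m)  # stand-in for phi_m
            x_new[i] += (m_ij @ w_x) * (x[i] - x[j])                    # phi_x yields a scalar weight
            agg += m_ij                                                 # sum of messages over N(i)
        h_new[i] = np.tanh(np.concatenate([h[i], agg]) @ W_h)           # stand-in for phi_h
    return h_new, x_new

rng = np.random.default_rng(0)
N, F, M = 4, 3, 5
h = rng.normal(size=(N, F))
x = rng.normal(size=(N, 3))
neighbors = {i: [j for j in range(N) if j != i] for i in range(N)}
params = (rng.normal(size=(2 * F + 1, M)), rng.normal(size=M), rng.normal(size=(F + M, F)))

# Rotate about the z axis and translate the input coordinates.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta), np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([1.0, -2.0, 0.5])
h1, x1 = egcl(h, x, neighbors, params)
h2, x2 = egcl(h, x @ R.T + t, neighbors, params)
```

In this sketch the updated features are unchanged by the rigid motion (`h1` equals `h2`), while the updated coordinates move with it (`x2` equals `x1 @ R.T + t`), because the layer depends on coordinates only through distances and coordinate differences.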

Because the updated characteristics of the amino acid residues and the alpha-carbon atom coordinates are processed by a multi-head attention mechanism, the mutual relations among the residues can be reflected, and more effective information can be provided for subsequent downstream operations. Subsequently, through iterative update of a plurality of EGCL units, the final amino acid residue features can be subjected to Global Pooling (Global Pooling) to obtain a single predicted value as an output result of an energy function, and further through conversion, an output result of a matching degree, for example, a probability that an amino acid sequence presents a specific three-dimensional structure, is obtained.

Specifically, referring to fig. 3, according to some embodiments of the present application, the matching model is obtained by training through the following steps:

S310: inputting a sample set with an actual sample distribution into a matching function;

S320: training the matching function according to the actual sample distribution so that the predicted sample distribution of the matching function approaches the actual sample distribution, wherein the actual sample distribution and the predicted sample distribution are sample distributions taking the three-dimensional structure of the protein and the amino acid sequence as variables.

According to the above steps, in some embodiments of the present application, $p_\theta(S, A)$ denotes the probability density function of the predicted sample distribution when the matching function model uses the model parameter θ during machine learning, and $p_{data}(S, A)$ denotes the probability density function of the actual sample distribution. By optimizing the model parameter θ so that $p_\theta(S, A)$ and $p_{data}(S, A)$ are as close as possible, a matching model capable of effectively predicting the degree of matching between the three-dimensional structure of the protein and the amino acid sequence can be obtained.

According to some embodiments of the present application, a given set of protein three-dimensional structures and amino acid sequences has an actual sample distribution probability density function $p_{data}(S, A)$. When the matching model is trained, a predicted sample distribution probability density function $p_\theta(S, A)$ is obtained for each choice of the model parameter θ, so that by continuously optimizing θ, $p_\theta(S, A)$ and $p_{data}(S, A)$ can be made as close as possible, thereby completing the training of the matching model.
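One common way to make "as close as possible" precise, offered here only as an illustrative assumption since the text does not name a specific measure, is the Kullback-Leibler divergence between the two distributions:

$$D_{\mathrm{KL}}\left(p_{data} \,\|\, p_\theta\right) = \mathbb{E}_{(S, A) \sim p_{data}}\left[\log p_{data}(S, A)\right] - \mathbb{E}_{(S, A) \sim p_{data}}\left[\log p_\theta(S, A)\right]$$

The first term does not depend on θ, so under this assumption minimizing the divergence over θ is equivalent to maximizing the expected log-likelihood $\mathbb{E}_{(S, A) \sim p_{data}}[\log p_\theta(S, A)]$ of the known structure-sequence pairs.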

Additionally, referring to fig. 4, according to some embodiments of the present application, the matching model may be obtained by training through the following steps:

S410: obtaining a matching probability prediction value between the three-dimensional structure of the known protein in the sample set and the corresponding amino acid sequence;

S420: iteratively optimizing the matching function based on the obtained matching probability prediction value so as to increase the matching probability prediction value, thereby obtaining a trained matching model.

According to the embodiment of the application, each known protein three-dimensional structure in the sample set is the true three-dimensional conformation of its corresponding amino acid sequence, so the pair can be considered to have a high matching probability (for example, close to 100%). Therefore, by iteratively optimizing the model parameters of the matching function so that the matching probability prediction value for the data in the sample set increases, the training of the matching model can be completed, and a matching model capable of effectively predicting the degree of matching between the three-dimensional structure of the protein and the amino acid sequence can be obtained.

Referring to fig. 5, in further detail, according to some embodiments of the present application, the matching model is obtained by training through the following steps:

S501: inputting a sample set consisting of known protein three-dimensional structures and the corresponding amino acid sequences into a matching function to obtain a matching probability prediction value between each known protein three-dimensional structure in the sample set and its corresponding amino acid sequence;

S502: determining a loss value based on the matching probability prediction value;

S503: iteratively optimizing the matching function according to the loss value so as to increase the matching probability prediction value, thereby obtaining a trained matching model.

According to the embodiment of the application, the loss value of the prediction result can be determined through the matching probability prediction value, so that the loss value can be gradually reduced, iterative optimization of the matching function is realized, the matching probability prediction value is increased, and a trained matching model capable of effectively predicting the matching degree of the three-dimensional structure of the protein and the amino acid sequence is obtained.

In some embodiments, the loss function may be:

or

wherein S is a representation of the three-dimensional structure of a known protein and A is a representation of the corresponding amino acid sequence. Minimizing such a loss function increases the matching probability $p_\theta(S, A)$ that the matching model predicts for the pair.

Referring to fig. 6, in further detail, according to some embodiments of the present application, iteratively optimizing a matching function according to a loss value to increase a matching probability prediction value includes:

S630: sampling from the sample set to obtain a sample subset;

S640: iteratively optimizing the matching function to increase the overall matching probability prediction value for the sample subset.

Therefore, a part of sample subsets can be used for model training, and the other part of samples can be used for model testing, so that overfitting of the matching model is prevented, and the application range and the prediction efficiency of the matching model are improved.

In some embodiments, the loss function may be:

or

Obviously, minimizing this loss function may result in an overall increase in the probability of matching for all samples in the sample subset. It will be appreciated that there may be some reduction in the match probability predictors for individual samples as the match probability increases overall, but in general the predicted sample distribution of the match function may be made closer to the actual distribution.

Referring to fig. 7, in further detail, according to some embodiments of the present application, the matching function is a normalized exponential function whose exponent is the negative of the predicted energy function value, and obtaining the matching probability prediction value between the three-dimensional structure of the known protein in the sample set and the corresponding amino acid sequence includes:

S701: predicting, based on an energy function, an energy function prediction value from the known protein three-dimensional structure and an initial amino acid sequence;

S702: calculating, based on the matching function, the matching probability prediction value from the energy function prediction value.

Thus, according to some embodiments of the present application, a machine learning model can predict the degree of match by predicting an energy function between the three-dimensional structure of a protein and an amino acid sequence.

The energy function is usually written as E (x, y) and is used to measure the compatibility or matching between variables x and y, with smaller energies giving higher matches.

The degree of matching between the three-dimensional structure of the protein and the amino acid sequence is further determined by the following formula:

$$p_\theta(S, A) = \frac{\exp\left(-E_\theta(S, A)\right)}{Z(\theta)}$$

wherein $E_\theta(S, A)$ is the output of the energy function for the protein three-dimensional structure S and the amino acid sequence A under the model parameter θ, and $Z(\theta)$ is the integral of the result of the exp() operation, i.e., the normalization term, so that $p_\theta(S, A)$ is the normalized result.

The model parameter θ may thus be further optimized based on $p_\theta(S, A)$ to obtain a trained matching model capable of effectively predicting the degree of matching between the three-dimensional structure of the protein and the amino acid sequence.
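As a small numerical illustration of the relation $p = \exp(-E) / Z$, the sketch below normalizes over a finite set of candidates, so that $Z$ reduces to a sum. This is a toy assumption made for the example: in the setting described above the normalization runs over the full space of structures and sequences, and the three energies are hypothetical values.

```python
import numpy as np

def match_probabilities(energies):
    """Map energies to probabilities p = exp(-E) / Z over a finite candidate set."""
    neg = -np.asarray(energies, dtype=float)
    neg -= neg.max()                  # shift for numerical stability; it cancels in the ratio
    unnorm = np.exp(neg)
    return unnorm / unnorm.sum()      # the sum plays the role of Z(theta)

energies = [0.5, 2.0, 3.5]            # hypothetical energies for three (S, A) pairs
p = match_probabilities(energies)
```

Lower energy gives a higher matching probability, and the ratio of two probabilities depends only on the energy difference, here $p_0 / p_1 = e^{1.5}$.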

Referring to FIG. 8, in further detail, according to some embodiments of the present application, after $p_\theta(S, A)$ is obtained for the protein three-dimensional structure S and the amino acid sequence A, determining a loss value based on the matching probability prediction value includes:

S801: performing a logarithmic operation on the matching probability prediction value;

S802: determining a loss value according to the result of the logarithmic operation.

It will be appreciated by those skilled in the art that, for a given protein three-dimensional structure S and its corresponding amino acid sequence A, the matching probability can be considered to be close to 1 because the two are naturally and strongly correlated. Therefore, by calculating the logarithm of the matching probability prediction value, a loss value can be determined that is suitable for efficient iterative optimization of the model parameters.
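As a worked step, assume the matching probability takes the normalized exponential form $p_\theta(S, A) = \exp(-E_\theta(S, A)) / Z(\theta)$ described for the matching function, and take the negative logarithm as the loss (one natural choice consistent with the logarithmic operation above, not necessarily the exact form used). The loss then separates into the energy of the known pair plus a normalization term:

$$-\log p_\theta(S, A) = E_\theta(S, A) + \log Z(\theta)$$

Under this assumption, reducing the loss lowers the energy assigned to the known pair relative to the energies of all other pairs, which is what raises $p_\theta(S, A)$ toward 1.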

Referring to fig. 9 in further detail, according to some embodiments of the present application, a matching model is associated with trainable parameters, and iterative optimization of a matching function according to a loss value includes:

S901: calculating the gradient of the loss value with respect to the trainable parameters to obtain a back-propagation gradient;

S902: iteratively optimizing the matching function by back-propagation according to the back-propagation gradient.

Referring to fig. 10 and 12, in further detail, according to some embodiments of the present application, the energy function is a graph neural network containing trainable parameters, and the calculation of the gradient of the loss value with respect to the trainable parameters to obtain the back-propagation gradient is approximately replaced by the following method:

S1001: sampling, according to the corresponding amino acid sequence, from the predicted sample distribution corresponding to the matching function to obtain a sampled protein three-dimensional structure;

S1002: predicting, based on the energy function, an energy function sampling prediction value from the sampled protein three-dimensional structure and the corresponding amino acid sequence;

S1003: calculating the difference between the energy function prediction value and the energy function sampling prediction value;

S1004: calculating the gradient of the difference with respect to the trainable parameters, and using the result as an approximation of the back-propagation gradient.

According to some embodiments of the present application, the loss value described above is difficult to calculate directly, mainly because Z(θ) is difficult to calculate. To this end, some embodiments of the present application provide an approximate alternative for calculating the gradient of the loss value with respect to the trainable parameters during back-propagation.

Specifically, the model parameters may be trained by maximum likelihood estimation, that is, by minimizing the following loss function:

$$L(\theta) = -\mathbb{E}_{(S, A) \sim p_{data}}\left[\log p_\theta(S, A)\right]$$

Since the probability density function itself is difficult to calculate, the gradient of the loss function with respect to the model parameter θ can be approximated, i.e.:

$$\nabla_\theta L(\theta) \approx \nabla_\theta E_\theta(S^+, A) - \nabla_\theta E_\theta(S^-, A)$$

wherein $S^+$ is a known three-dimensional structure of a protein and A is the amino acid sequence corresponding to $S^+$; in other words, a protein having the amino acid sequence A is known to be capable of assuming the three-dimensional structure $S^+$. $S^-$ is a protein three-dimensional structure sampled, according to the amino acid sequence A, from the sample distribution corresponding to the current model parameter θ of the matching model. Thus, the gradient $\nabla_\theta E_\theta(S^+, A)$ of the energy function with respect to the model parameter θ is calculated for the combination of $S^+$ and A, and the gradient $\nabla_\theta E_\theta(S^-, A)$ is calculated for the combination of $S^-$ and A. The difference $\nabla_\theta E_\theta(S^+, A) - \nabla_\theta E_\theta(S^-, A)$ is then used as the estimate of the gradient of the loss value with respect to the model parameters. With this estimate, the model parameter θ can be iteratively optimized by back-propagation, which improves training efficiency, reduces the amount of calculation, and lowers the cost of model training.
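The approximation, in which the loss gradient is estimated as the energy gradient at a known structure $S^+$ minus the energy gradient at a sampled structure $S^-$, can be illustrated on a one-dimensional toy problem. Everything here is a hypothetical stand-in: a scalar `s` replaces the protein structure, the energy is $E_\theta(s) = \frac{1}{2}(s - \theta)^2$, and negative samples are drawn exactly from the model distribution (a unit-variance Gaussian centered at θ) rather than by MCMC.

```python
import numpy as np

def energy_grad(theta, s):
    """Gradient with respect to theta of the toy energy E_theta(s) = 0.5 * (s - theta)**2."""
    return theta - s

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=2000)    # stand-ins for observed structures S+
theta, lr = -1.0, 0.05

for step in range(500):
    s_pos = rng.choice(data, size=32)                       # positive samples S+ from the data
    s_neg = rng.normal(loc=theta, scale=1.0, size=32)       # negative samples S- from p_theta
    grad = np.mean(energy_grad(theta, s_pos) - energy_grad(theta, s_neg))
    theta -= lr * grad                                      # descend the approximate loss gradient
```

In this toy, the update never evaluates the normalization term, yet `theta` moves from -1 toward the data mean near 2, where positive and negative samples become statistically indistinguishable and the estimated gradient averages to zero.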

According to an embodiment of the present application, the sampling from the predicted sample distribution corresponding to the matching function may be performed by a stochastic sampling algorithm such as a Markov chain Monte Carlo (MCMC) method. The three-dimensional structures of the protein are thus sampled randomly, so that the difference between the gradient of $E_\theta(S^+, A)$ and the gradient of $E_\theta(S^-, A)$ with respect to the model parameter θ is closer to the true gradient of the loss value with respect to the model parameters; that is, the approximate estimate is more efficient. The model parameters can therefore be iteratively optimized by back-propagation effectively, which improves training efficiency, reduces the amount of calculation, and lowers the cost of model training. MCMC sampling methods that may be employed according to embodiments of the present application include, but are not limited to, Langevin MCMC and Hamiltonian Monte Carlo.
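As a minimal sketch of Langevin-type sampling from a distribution proportional to $\exp(-E(s))$, the example below runs unadjusted Langevin dynamics on a one-dimensional quadratic energy. The energy, step size, number of steps, and function names are hypothetical; sampling protein structures would replace the scalar `s` with coordinates and `grad_energy` with the gradient of the trained energy function with respect to those coordinates.

```python
import numpy as np

def langevin_sample(grad_energy, s0, step_size=0.05, n_steps=500, rng=None):
    """Unadjusted Langevin dynamics: s <- s - (eps / 2) * dE/ds + sqrt(eps) * noise."""
    if rng is None:
        rng = np.random.default_rng()
    s = np.array(s0, dtype=float)
    for _ in range(n_steps):
        noise = rng.normal(size=s.shape)
        s = s - 0.5 * step_size * grad_energy(s) + np.sqrt(step_size) * noise
    return s

mu = 3.0                                   # minimum of the toy energy E(s) = 0.5 * (s - mu)**2

def grad_E(s):
    """Gradient of the toy energy with respect to s."""
    return s - mu

# Run 5000 independent chains in parallel, all started at 0.
samples = langevin_sample(grad_E, s0=np.zeros(5000), rng=np.random.default_rng(0))
```

For this energy the target distribution is a unit-variance Gaussian centered at `mu`, so the sample mean and standard deviation give a quick check that the chains have mixed; the finite step size introduces a small bias in the variance.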

In summary, in the first aspect of the present invention, the present application provides a training method of a matching model capable of effectively predicting the degree of matching between a three-dimensional structure of a protein and an amino acid sequence. The prediction accuracy of the matching degree of the predicted amino acid sequence and the protein three-dimensional structure can be improved, so that the prediction accuracy of the related prediction such as the prediction of the amino acid sequence based on the protein three-dimensional structure can be improved, the working cost is reduced, and the prediction efficiency is improved. In contrast, most of the conventional methods are based on amino acid sequence data of proteins or homologous sequence data thereof, and do not directly utilize three-dimensional structure information of proteins. Therefore, the embodiment of the application avoids the limitation of an energy function based on artificial design by constructing a pure data-driven energy function, and more accurately evaluates the matching degree between the amino acid sequence and the three-dimensional structure of the protein main chain, thereby realizing more efficient and accurate protein de novo design.

The method for training the matching model has been described in detail above. An application scenario of the matching model, namely a method for predicting an amino acid sequence based on the three-dimensional structure of a protein, is described below. Fig. 11 is a schematic flow chart of a method for predicting an amino acid sequence according to an embodiment of the present application. As shown in fig. 11, the method comprises:

S1101: determining a matching result between the three-dimensional structure of the target protein and the initial amino acid sequence based on the matching model;

A de novo protein design method aims to design a novel protein that performs better than natural proteins, based on the correspondence between the three-dimensional structure of a protein and its biological function. Examples include optimizing and customizing a transmembrane protein, producing a transmembrane protein that does not exist in nature to complete a specific task, or improving the specificity and affinity with which a protein binds a small molecule so that it acts better on a specific small-molecule target. The present application is directed to a de novo protein design method based on a matching model (such as the energy model constructed above). It can use any protein data set that carries three-dimensional structure information for data-driven training, so as to describe more accurately the degree of agreement between an amino acid sequence and the three-dimensional structure of a protein main chain, and to improve the precision and optimization efficiency of de novo protein design. For example, the matching result is predicted by an energy function obtained through data-driven training.

The molecular structure of proteins can be divided into four levels that describe different aspects thereof. The primary structure of a protein is the linear amino acid sequence that makes up a polypeptide chain of the protein. The secondary structure of a protein is a stable structure formed by hydrogen bonds between the C=O and N-H groups of different amino acids, mainly alpha helices and beta sheets. The tertiary structure (three-dimensional structure) of a protein is the three-dimensional structure of one protein molecule formed by the arrangement of a plurality of secondary structure elements in three-dimensional space. The quaternary structure of a protein describes the formation of functional protein complex molecules through the interaction of different polypeptide chains (subunits).

The three-dimensional structure of a protein is critical to the performance of a particular biological function, which one skilled in the art can also design by analysis of the specific biological function. For example, for a particular antigen, a particular three-dimensional structure of an antibody can be designed to obtain a three-dimensional structure of an antibody that recognizes the antigen. As described above, the three-dimensional structure of the protein is obtained by the permutation and combination of the secondary structural elements, and therefore, the three-dimensional structure of the protein that realizes specific binding can be designed by the permutation and combination of the α helix and the β sheet in the design. In addition, the three-dimensional structure corresponding to the specific function may be predicted by machine learning.

After the target three-dimensional structure is obtained, in the embodiment of the present application, the final optimal amino acid sequence result is predicted by predicting the matching relationship between the target protein three-dimensional structure and the amino acid sequence, that is, by using a matching model to predict the probability that the amino acid sequence exhibits the three-dimensional structure. Generally, the higher the degree of matching, the greater the probability that the amino acid sequence will assume the three-dimensional structure of the protein.

As used herein, the "starting amino acid sequence" may be generated randomly, may be generated based on the three-dimensional structure of a protein known in the art and the splicing of the corresponding amino acid sequence, or may be based on an existing amino acid sequence modified for desired properties. It should be noted that, for a given three-dimensional structure of a protein, the length of the amino acid sequence can be simply predicted according to the structure, or a machine learning model can be additionally provided to predict the relationship between the three-dimensional structure of the protein and the length of the amino acid sequence.

The matching model that can be used in this context can be trained as described above and will not be described further here.

S1102: mutating the starting amino acid sequence so as to obtain a mutated amino acid sequence;

S1103: determining a match result between the mutant amino acid sequence and the three-dimensional structure of the target protein based on the matching model;

S1104: determining whether the mutation performed in S1102 is retained based on the difference between the matching results of steps S1101 and S1103;

S1105: repeating the steps of S1102, S1103 and S1104 until the final amino acid sequence is obtained.

In general, the energy output for the starting amino acid sequence and the three-dimensional structure of the target protein is not satisfactory; that is, a higher degree of matching is required before the starting amino acid sequence can assume the target three-dimensional structure. Thus, to optimize the starting amino acid sequence, the sequence can be mutated and the matching model used to determine the effect of the mutation on the degree of matching, and hence whether the mutation should be retained. When the matching model performs matching prediction through an energy function, whether a mutation is retained can be decided from the effect of the mutation on the energy output.

Specifically, in some embodiments of the present application, the mutation is a point mutation. The point mutation includes at least one of deletion, substitution and insertion. As described above, the length of the amino acid sequence can be readily predicted; for example, the total length of the protein can be estimated from the three-dimensional structure of the protein, or the length of the amino acid sequence can be predicted by machine learning. According to embodiments of the present application, the starting amino acid sequence can be made appropriately longer, and a blank option (a point mutation corresponding to "deletion") can be added to the point-mutation options, whereby the success rate of prediction can be improved. According to some embodiments of the present application, the starting amino acid sequence may be extended by 10% to 20% in length, while the total proportion of blank options in the point mutations is kept at no more than 10%.
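
The length extension and blank option described above can be sketched as follows. This is only one possible reading, under stated assumptions: the blank symbol `-`, the 20-letter alphabet, the 15% default extension, and the helper names `make_starting_sequence`, `point_mutate` and `to_protein_string` are hypothetical, and the 10% limit is interpreted here as a cap on the fraction of positions that hold a blank.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard one-letter codes (assumption)
BLANK = "-"  # hypothetical symbol for the "blank" (deletion) option


def make_starting_sequence(estimated_length, extension=0.15, seed=0):
    """Random starting sequence, 10%-20% longer than the estimated length."""
    rng = random.Random(seed)
    length = round(estimated_length * (1.0 + extension))
    return [rng.choice(AMINO_ACIDS) for _ in range(length)]


def point_mutate(sequence, rng, max_blank_fraction=0.10):
    """Change one position to another residue or, if the cap allows, to a blank."""
    mutated = list(sequence)
    pos = rng.randrange(len(mutated))
    blanks_elsewhere = sum(
        1 for i, s in enumerate(mutated) if s == BLANK and i != pos
    )
    options = list(AMINO_ACIDS)
    # Offer the blank option only while the blank fraction stays within the cap.
    if blanks_elsewhere + 1 <= max_blank_fraction * len(mutated):
        options.append(BLANK)
    options = [o for o in options if o != mutated[pos]]
    mutated[pos] = rng.choice(options)
    return mutated


def to_protein_string(sequence):
    """Drop blanks to obtain the actual amino acid sequence."""
    return "".join(s for s in sequence if s != BLANK)
```

Keeping the working representation at a fixed length and letting blanks absorb the surplus means that a deletion is just another substitution, so the same mutation routine covers both cases.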

Referring to fig. 5, after the point mutation is determined, the mutated amino acid sequence and the three-dimensional structure of the target protein are input into the matching model, which predicts the matching value between the mutated amino acid sequence and the three-dimensional structure of the target protein. Whether to retain the point mutation is then decided based on the difference between the matching results before and after mutation: for example, if the mutation increases the degree of matching, the mutation is retained, and if the mutation decreases the degree of matching, the mutation is rejected. In a matching model using an energy function, mutations that decrease the energy function value may be retained, and mutations that increase the energy function value may be discarded.
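
The retain-or-reject loop of steps S1101 to S1105 can be sketched as below for the simplest rule, in which a mutation is kept only if it lowers the energy. The trained matching model is not reproduced here: `toy_energy` is a hypothetical stand-in that counts mismatches against a hidden reference sequence instead of scoring a sequence against a three-dimensional structure, and `greedy_design` and its parameters are likewise illustrative names.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard one-letter codes (assumption)


def toy_energy(sequence, reference):
    # Hypothetical stand-in for the matching model: lower is a better match.
    return sum(1 for a, b in zip(sequence, reference) if a != b)


def greedy_design(reference, n_iters=20000, seed=0):
    rng = random.Random(seed)
    seq = [rng.choice(AMINO_ACIDS) for _ in reference]  # starting sequence
    e = toy_energy(seq, reference)                       # S1101: initial match
    for _ in range(n_iters):                             # S1105: repeat
        pos = rng.randrange(len(seq))
        cand = list(seq)
        cand[pos] = rng.choice(AMINO_ACIDS)              # S1102: point mutation
        e_new = toy_energy(cand, reference)              # S1103: match after mutation
        if e_new < e:                                    # S1104: retain only if energy drops
            seq, e = cand, e_new
    return "".join(seq), e
```

With this toy energy the landscape has no local minima, so the greedy rule is sufficient; on a learned energy function a purely greedy rule can stall, which is one motivation for probabilistic acceptance.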

According to some embodiments of the present application, the mutation is determined using a Monte Carlo sampling method. According to some embodiments of the present application, the Monte Carlo sampling method comprises at least one of a simulated annealing Monte Carlo sampling method and a replica-exchange Monte Carlo sampling method. This can improve the efficiency of sequence optimization. For an amino acid sequence, each position has many options, for example the 22 natural amino acids, the types of consecutively inserted amino acids, deletions, and so on; even for an amino acid sequence of length 100, a conservative estimate gives at least 2^100 mutation schemes if mutations were enumerated exhaustively. Monte Carlo sampling, and in particular simulated annealing Monte Carlo sampling and replica-exchange Monte Carlo sampling, can effectively improve prediction accuracy and efficiency. With simulated annealing, the condition for accepting a mutation may be relatively lenient early in the iterative updating, for example during the first 10% of the mutations applied to the complete amino acid sequence, and relatively strict late in the iterative updating, for example during the last 10% of the mutations. The efficiency and accuracy of obtaining the final amino acid sequence can thereby be significantly improved.
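
The text above does not spell out how replica exchange would be carried out. As a hedged illustration, the sketch below uses the standard replica-exchange (parallel tempering) swap rule, in which two replicas running at different temperatures exchange their configurations with probability min(1, exp((1/T_i - 1/T_j)(E_i - E_j))). The dictionary layout of a replica and the function names `swap_probability` and `maybe_swap` are assumptions made for this example.

```python
import math
import random


def swap_probability(energy_i, temp_i, energy_j, temp_j):
    """Standard parallel-tempering acceptance probability for swapping two replicas."""
    delta = (1.0 / temp_i - 1.0 / temp_j) * (energy_i - energy_j)
    return 1.0 if delta >= 0 else math.exp(delta)


def maybe_swap(replicas, i, j, rng):
    """Swap the states of replicas i and j in place; return True if swapped.

    Each replica is a dict with keys 'sequence', 'energy' and 'temperature'
    (a hypothetical layout chosen for this sketch).
    """
    a, b = replicas[i], replicas[j]
    p = swap_probability(a["energy"], a["temperature"], b["energy"], b["temperature"])
    if rng.random() < p:
        a["sequence"], b["sequence"] = b["sequence"], a["sequence"]
        a["energy"], b["energy"] = b["energy"], a["energy"]
        return True
    return False
```

Under this rule, a low-energy sequence found by a high-temperature replica is always handed down to a colder replica, while the reverse move happens only occasionally, so exploration and refinement proceed in parallel.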

Thus, in some embodiments of the present application, based on the trained matching model, the amino acid sequence can be iteratively updated using a simulated annealing Monte Carlo sampling algorithm to obtain an amino acid sequence that conforms to the three-dimensional structure of the target protein. Taking a matching model that uses an energy function as an example, and referring to FIG. 13, the amino acid sequence is first randomly initialized for the three-dimensional structure of the target protein and denoted A1. The amino acid sequence is then iteratively updated according to the simulated annealing Monte Carlo sampling algorithm while the annealing temperature is gradually reduced. In each iteration, a residue position is randomly selected and its amino acid type and side-chain structure are substituted; the result is denoted A1'. The amino acid sequences before and after the substitution are each input into the energy function, and the corresponding energy function values are compared. If the energy function value decreases (that is, the substituted sequence conforms better to the three-dimensional structure of the main chain of the target protein), the sequence is updated; otherwise, the substituted amino acid sequence is accepted with a certain probability, which is determined by the current temperature and the change in the energy function value. After a certain number of iterations, the final amino acid sequence Ak is obtained and output as the result of the de novo protein design.
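
The procedure above can be sketched as follows. Several choices here are assumptions rather than details given in the text: the acceptance probability is taken to be the Metropolis form exp(-ΔE / T), the temperature follows a geometric cooling schedule between hypothetical values `t_start` and `t_end`, side-chain structure is ignored, and `toy_energy` is a stand-in that counts mismatches against a hidden reference sequence instead of evaluating a trained energy function on a three-dimensional structure.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard one-letter codes (assumption)


def toy_energy(sequence, reference):
    # Hypothetical stand-in for the trained energy function: lower is better.
    return sum(1 for a, b in zip(sequence, reference) if a != b)


def accept(delta_e, temperature, rng):
    """Metropolis rule: always accept a decrease, sometimes accept an increase."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def anneal(reference, n_iters=20000, t_start=2.0, t_end=0.01, seed=0):
    rng = random.Random(seed)
    seq = [rng.choice(AMINO_ACIDS) for _ in reference]  # A1: random initialization
    e = toy_energy(seq, reference)
    for k in range(n_iters):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (k / max(1, n_iters - 1))
        pos = rng.randrange(len(seq))                    # random residue position
        cand = list(seq)
        cand[pos] = rng.choice(AMINO_ACIDS)              # substituted sequence
        e_new = toy_energy(cand, reference)
        if accept(e_new - e, t, rng):
            seq, e = cand, e_new
    return "".join(seq), e                               # Ak and its energy
```

Early iterations run at a high temperature and accept many energy-increasing substitutions, while late iterations behave almost greedily, which matches the lenient-then-strict acceptance behaviour associated with simulated annealing.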

Therefore, the method provided by the embodiment of the application can improve the prediction accuracy of the related prediction such as the prediction of the amino acid sequence based on the three-dimensional structure of the protein, reduce the working cost and improve the prediction efficiency. In the embodiment of the application, the protein de novo design method based on the energy model is realized, the energy function capable of accurately evaluating the coincidence degree of the amino acid sequence and the expected three-dimensional structure of the protein main chain is obtained through data-driven model training, and the energy function is applied to the optimization process of the amino acid sequence, for example, the simulated annealing Monte Carlo sampling process, so that the amino acid sequence can be gradually optimized, the amino acid sequence capable of completing a specific biological function is obtained, and the purpose of de novo design of the protein is realized. Therefore, the embodiment of the application avoids the limitation of an energy function based on artificial design by constructing a pure data-driven energy function (matching model), and more accurately evaluates the matching degree between the amino acid sequence and the three-dimensional structure of the protein main chain, thereby realizing more efficient and accurate protein de novo design.

The foregoing describes a method for predicting an amino acid sequence using a matching model, and further, with reference to fig. 14, in a third aspect, embodiments of the present application propose a method for designing a drug, which includes:

S1401: determining a three-dimensional structure of a target protein based on a target of a known disease, the three-dimensional structure of the target protein being suitable for binding to the target; and

S1402: determining, according to the method of the second aspect, the amino acid sequence of the target protein based on the three-dimensional structure of the target protein.

As mentioned above, the three-dimensional structure of proteins is crucial for biological functions, and for various diseases, protein targets are also important targets for drug development. These biological targets include, but are not limited to, G protein-coupled receptors, enzymes, ion channels, carrier proteins, nuclear receptors, and other proteins such as structural proteins. By targeting a disease target, one can design a three-dimensional structure of a protein that can bind to the target. Based further on the methods described above, the amino acid sequence of the target protein is predicted, thereby enabling de novo drug design for specific diseases.

Referring to fig. 15, in a fourth aspect, embodiments of the present application propose an apparatus for predicting an amino acid sequence, including:

a first matching module 301, configured to determine a matching degree between the three-dimensional structure of the target protein and the initial amino acid sequence based on a matching model;

a mutation module 302 for mutating the starting amino acid sequence so as to obtain a mutated amino acid sequence;

a second matching module 303 for determining the degree of matching between the mutant amino acid sequence and the three-dimensional structure of the target protein based on the matching model;

a decision module 304, configured to determine whether to retain the mutation made by the mutation module based on the difference between the degrees of matching determined by the first matching module and the second matching module, until a final amino acid sequence is obtained.

Referring to fig. 16, in a fifth aspect, an embodiment of the present application proposes an apparatus for training a matching model, wherein the matching model is used for evaluating a relationship between a three-dimensional structure of a protein and an amino acid sequence, and the apparatus comprises:

a sample set obtaining module 401, configured to obtain a sample set, where the sample set includes a three-dimensional structure of a known protein and an amino acid sequence corresponding to the three-dimensional structure of the known protein; and

a training module 402, configured to input the sample set into a matching function and train the sample set so as to obtain a trained matching model.

In a sixth aspect, embodiments of the present application provide a computing device, comprising: a processor and a memory; the memory is used for storing a computer program; the processor is adapted to execute the computer program to implement the method according to the first, second and third aspects.

In a seventh aspect, an embodiment of the present application provides a computer-readable storage medium, wherein the storage medium includes computer instructions, which when executed by a computer, cause the computer to implement the method according to the first, second and third aspects.

Those skilled in the art will appreciate that apparatus embodiments and method embodiments may correspond to one another and that similar descriptions may refer to method embodiments. To avoid repetition, further description is omitted here. Specifically, each device may correspond to a corresponding main body in executing the method of the embodiment of the present application, and the foregoing and other operations and/or functions of each module in each device are respectively for implementing corresponding flows in each method described above, and are not described herein again for brevity.

The apparatus of the embodiments of the present application is described above in connection with the drawings from the perspective of functional modules. It should be understood that the functional modules may be implemented by hardware, by instructions in software, or by a combination of hardware and software modules. Specifically, the steps of the method embodiments in the present application may be implemented by integrated logic circuits of hardware in a processor and/or instructions in the form of software, and the steps of the method disclosed in conjunction with the embodiments in the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in a processor. Alternatively, the software modules may be located in random access memory, flash memory, read only memory, programmable read only memory, electrically erasable programmable memory, registers, and the like, as is well known in the art. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps in the above method embodiments in combination with hardware thereof.

Fig. 17 is a block diagram of a computing device according to an embodiment of the present application, where the computing device may be the server shown in fig. 1, and is used to execute the method according to the foregoing embodiment, specifically referring to the description in the foregoing method embodiment.

The computing device 200 shown in fig. 17 includes a memory 201, a processor 202, and a communication interface 203. The memory 201, the processor 202 and the communication interface 203 are communicatively connected with each other. For example, the memory 201, the processor 202, and the communication interface 203 may be connected by a network connection. Optionally, the computing device 200 may also include a bus 204, in which case the memory 201, the processor 202 and the communication interface 203 are connected to each other by the bus 204. Fig. 17 shows a computing device 200 in which the memory 201, the processor 202, and the communication interface 203 are communicatively coupled to each other via the bus 204.

The Memory 201 may be a Read Only Memory (ROM), a static Memory device, a dynamic Memory device, or a Random Access Memory (RAM). The memory 201 may store programs, and the processor 202 and the communication interface 203 are used to perform the above-described methods when the programs stored in the memory 201 are executed by the processor 202.

The processor 202 may be implemented as a general purpose Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU), or one or more Integrated circuits.

The processor 202 may also be an integrated circuit chip having signal processing capabilities. In implementation, the method of the present application may be performed by integrated logic circuits of hardware in the processor 202 or by instructions in the form of software. The processor 202 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The software module may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory 201, and the processor 202 reads the information in the memory 201 and completes the method of the embodiment of the application in combination with its hardware.

The communication interface 203 enables communication between the computing device 200 and other devices or communication networks using transceiver modules such as, but not limited to, transceivers. For example, the data set may be acquired through the communication interface 203.

When computing device 200 includes bus 204, as described above, bus 204 may include a pathway to transfer information between various components of computing device 200 (e.g., memory 201, processor 202, communication interface 203).

There is also provided according to the present application a computer storage medium having stored thereon a computer program which, when executed by a computer, enables the computer to perform the method of the above-described method embodiments. In other words, the present application also provides a computer program product containing instructions, which when executed by a computer, cause the computer to execute the method of the above method embodiments.

There is also provided according to the present application a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method of the above-described method embodiment.

In other words, when implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in accordance with the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a Digital Video Disc (DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.

Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the module is merely a logical division, and other divisions may be realized in practice, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.

Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. For example, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In addition, the method embodiments and the device embodiments may also refer to each other, and the same or corresponding contents in different embodiments may be referred to each other, which is not described in detail.

Claims

1. A method of training a matching model for characterizing the degree of match between an amino acid sequence and a three-dimensional structure of a protein, comprising:

2. The method of claim 1, wherein the matching model is obtained by training:

3. The method of claim 1, wherein the matching model is obtained by training:

determining a loss value based on the match probability prediction value;

4. The method of claim 3, wherein iteratively optimizing the matching function according to the loss value such that the matching probability prediction value is increased comprises:

5. The method according to claim 3, wherein the matching function is a normalized index function with a negative value of an energy function prediction value as an index, and the obtaining of the predicted matching probability of the three-dimensional structure of the known protein in the sample set with the corresponding amino acid sequence comprises:

6. The method of claim 5, wherein determining a loss value based on the match probability prediction value comprises:

carrying out logarithmic operation on the matching probability predicted value;

and determining a loss value according to the logarithm operation result.

7. The method according to any one of claims 3-6, wherein the matching model is associated with trainable parameters, and wherein iteratively optimizing the matching function according to the loss values comprises:

8. The method of claim 7,

the energy function is a graph neural network containing trainable parameters;

9. The method of claim 8, wherein the sampling from the distribution of predicted samples corresponding to the matching function is performed by a markov chain monte carlo method.

10. A method of predicting an amino acid sequence, comprising:

(a) determining a matching result between the three-dimensional structure of the target protein and the initial amino acid sequence based on a matching model, wherein the matching result characterizes the matching degree between the initial amino acid sequence and the three-dimensional structure of the target protein, and the matching model is obtained by the method of any one of claims 1-9;

(d) repeating steps (b) to (d) until the final amino acid sequence is obtained.

11. The method of claim 10, wherein step (d) comprises:

12. A method of designing a drug, comprising:

the method according to claim 10 or 11, wherein the amino acid sequence of the target protein is determined based on the three-dimensional structure of the target protein.

13. An apparatus for training a matching model for characterizing the degree of matching between an amino acid sequence and a three-dimensional structure of a protein, comprising:

14. A computing device, comprising: a processor and a memory;

the memory for storing a computer program;

the processor for executing the computer program to implement the method of any one of claims 1 to 12.

15. A computer-readable storage medium, comprising computer instructions which, when executed by a computer, cause the computer to implement the method of any one of claims 1 to 12.