CN114420197A - Connector design method of protein degradation targeting chimera based on self-encoder - Google Patents
- Publication number
- CN114420197A (application number CN202210002416.1A)
- Authority
- CN
- China
- Prior art keywords
- node
- vector
- atom
- feature vector
- vectors
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B5/00—ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/50—Molecular design, e.g. of drugs
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
Abstract
The invention discloses a linker design method for protein degradation targeting chimeras based on an autoencoder, which comprises the following steps: replacing the linker of existing protein degradation targeting chimeras with various small molecules from a molecule library and discarding replacements that violate chemical rules, thereby obtaining a large number of pseudo-chimera molecules, which are divided into a training set and a test set in a certain proportion; designing a linker design network and training it on the training set so that it reconstructs the linker of each input molecule well; fine-tuning the model trained on the pseudo-chimera set on a set of real chimera molecules; and evaluating the final trained network on the test set and testing the decoder's ability to generate linkers from random samples drawn from a given distribution.
Description
Technical Field
The invention relates to the technical field of protein degradation targeting chimeras (PROTACs), and in particular to a linker design method for protein degradation targeting chimeras based on an autoencoder.
Background
At present, traditional machine learning usually relies on a data set containing a large amount of data and learns a prediction or regression model that fits this data set, so that the resulting model generalizes well to both the training data and the test data. The variational autoencoder is a deep generative model based on variational inference and consists of an encoder and a decoder: the encoder maps the input to a specific distribution, samples drawn from this distribution are fed to the decoder, and the decoder produces an output that approximately follows the same distribution as the input data. A protein degradation targeting chimera is a heterobifunctional small-molecule compound consisting of three parts: a target-protein ligand, a linker and an E3 ligase ligand. Its mechanism is to shorten the distance between the target protein and an intracellular E3 ubiquitin ligase so that the target protein is specifically degraded through the ubiquitin-proteasome pathway: the target-protein ligand of the chimera binds the target protein, the E3 ligase ligand binds the E3 ubiquitin ligase, and the linker connects the two ligands and has a certain influence on the binding strength of the two parts.
However, amino acids are by far the most common linking moieties of degradation targeting chimeras, including aminocaproic acid, glycine and serine, and polyethylene glycol is also commonly used to increase the hydrophilicity of targeting chimeras. Compared with the vast small-molecule space, the number of existing linkers is very small, so a method for rapidly discovering novel linkers in small-molecule space is very important. Existing machine-learning methods for linker design fall mainly into two categories. In the first, each molecule is represented directly as a character sequence via SMILES and the linker structure is predicted with sequence-prediction algorithms; this approach exploits neither the chemical information of the molecule nor its topological structure. In the second, a graph neural network is used to extract molecular features and a conditional variational autoencoder generates a new chimera; however, this method requires 3D structural information of the linker, the maximum number of atoms must be fixed before atoms are generated, and because it does not start from the topological structure of the linker its interpretability is poor.
Therefore, designing a linker design network that can be applied to the design of linkers for protein degradation targeting chimeras is a problem that urgently needs to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the present invention provides a linker design method for protein degradation targeting chimeras based on an autoencoder. Given a targeting chimera molecule for which a new linker is to be designed, the linker design network extracts the relevant information from the input molecule and predicts the topological structure of the linker to be designed; this topological structure is a tree representation of the linker molecule, and on the basis of the predicted tree the nodes and edges are predicted one by one through message passing, thereby obtaining novel linker molecules.
In order to achieve the purpose, the invention adopts the following technical scheme:
A linker design method for protein degradation targeting chimeras based on an autoencoder comprises the following steps:
S1, acquiring a small-molecule set and preprocessing it to obtain a molecule set with strong drug-likeness;
S2, cutting the molecule set with a molecular shearing algorithm to obtain a training set, wherein each training sample comprises a ligand 1, a ligand 2 and a chimera;
S3, inputting the training set into an encoder and performing message passing with a graph neural network to obtain the feature vectors of ligand 1, ligand 2 and the chimera;
S4, passing the feature vectors of ligand 1, ligand 2 and the chimera through fully connected layers to obtain a latent vector 1, a latent vector 2 and a conditional latent vector;
S5, concatenating the conditional latent vector with latent vector 1 and latent vector 2 respectively to obtain two latent vectors to be decoded;
S6, feeding the two latent vectors into fully connected layer 1, which outputs the predicted probabilities of the candidate linker topologies, and initializing a new chimera with the linker topology of highest probability;
S7, feeding the two latent vectors into fully connected layer 2 to obtain the initial features of each node in the predicted linker topology, computed in breadth-first order with the corresponding connection point as the root node and weighted accordingly;
S8, randomly selecting one connection point as the root node, predicting the node type of each node in breadth-first order, and updating the node information once with a graph neural network each time a node is generated, until all nodes of the linker have been generated and the training of the network is completed;
S9, inputting a pseudo-chimera consisting of two ligands with given connection points and a given linker structure into the trained encoder, which outputs the latent vectors of the two ligand connection points and the conditional latent vector as the input of the decoder, and the decoder outputs a new chimera structure.
Preferably, the step S2 specifically includes:
marking all single bonds that are not in a ring and are not attached to a hydrogen atom, which together form a cleavable single-bond set of size n; traversing all C(n,2) = n(n-1)/2 possible pairwise combinations and cleaving the two single bonds of each combination gives a ligand-linker-ligand triad, i.e. one molecule gives C(n,2) triads; for each triad, calculating the average drug-likeness of the two ligands as the drug-likeness of that triad and the average synthesizability of the linker and the two ligands as the synthesizability of that triad, and calculating the reasonableness score of the cut triad by the formula:
Score = μ·S_d + (1 − μ)·S_A
wherein S_d denotes the drug-likeness score, S_A denotes the synthesizability score, and μ is a hyperparameter balancing the two scores, set to 0.6; after the reasonableness scores of the triads obtained from all cutting schemes of each molecule have been calculated, the highest-scoring portion is taken and added to the training set.
Preferably, the step S3 specifically includes:
each atom and each bond is represented by a one-hot code; the dimension of an atom's one-hot code equals the total number of atom types, and since bonds are of three kinds in total (single, double and triple), the one-hot code of each bond (edge) is a three-dimensional vector, e.g. for a single bond the first element is 1 and the other two elements are 0; starting from these initial feature vectors, the information of each node is updated iteratively; for the update of the N atoms in ligand 1, the feature vector of each atom is updated with a weighted sum of the feature vectors of the atoms connected to it:
h_u^(t+1) = η·h_u^(t) + (1 − η)·Σ_{v∈N(u)} nn(e_uv)·h_v^(t)
wherein h_u^(t+1) is the next updated value of the feature vector of the current atom u, h_u^(t) is the feature vector of the current atom u, h_v^(t) is the feature vector of each neighbor atom v connected to atom u, e_uv is the feature vector of the bond between atom u and atom v, the weight of a neighbor's feature vector is the number between 0 and 1 output by a neural network layer nn() applied to e_uv, and η is a hyperparameter;
after each atom has been updated a certain number of times in this way, the feature vectors of all atoms are fused in a weighted manner into the atom to be connected:
h = Σ_i w_i·h_i
wherein h is the final feature vector of the atom to be connected, h_i are the feature vectors of the remaining atoms, w_i are the weights of the weighted sum, and ε_i, on which the weight depends, is the shortest path from atom i to the atom to be connected, i.e. the shortest path between the two nodes in the graph.
Preferably, the step S4 specifically includes:
setting the dimension of the latent vector and the specific distribution that the latent space obeys; with the vector dimension set, each molecule has two distribution-parameter vectors of that dimension, whose elements represent, dimension by dimension, the mean and the variance of a univariate Gaussian distribution:
μ = nn_μ(h)
σ = nn_σ(h)
the corresponding latent feature vector is obtained by sampling from the Gaussian distribution determined by these parameters: a vector ε of the set dimension is sampled from a multivariate standard normal distribution, multiplied element-wise by the obtained variance vector and added element-wise to the mean vector to obtain the latent vector of the molecule:
z = μ + σ·ε.
preferably, the step S6 specifically includes:
counting the number of distinct topological structures of all chimeras appearing in the training set and denoting it N, so that the fully connected layer takes the two latent vectors as input and outputs an N-dimensional vector normalized by softmax, in which the value of each element represents the predicted probability of one topology.
Preferably, the step S7 specifically includes:
passing the latent vectors of the two atoms to be connected through a neural network layer to obtain the initial feature vectors h_1 and h_2 of the two atoms to be connected;
initializing, based on the initial feature vectors of the two points to be connected, the feature vector of every node in the linker molecule to be generated, the feature vector of each node i being initialized as a combination of h_1 and h_2 weighted according to ε_i1 and ε_i2, wherein ε_i1 is the shortest distance from node i to point 1 to be connected and ε_i2 is the shortest distance from node i to point 2 to be connected.
Preferably, the step S8 specifically includes:
predicting the specific type of each node in breadth-first order: given the feature vector of a node, a fully connected layer outputs an m-dimensional vector, which is normalized by softmax to obtain the probability distribution vector used to predict the node type;
after the specific type of a node has been predicted, the feature vector of the node is self-updated once:
h ← μ·h + (1 − μ)·h_i
wherein h is the feature vector of the node before the prediction, h_i is the one-hot code of the predicted node type, and μ is a hyperparameter between 0 and 1;
after a new node type has been predicted, predicting which two atoms of the two nodes should be connected, specifically as follows: when predicting the atom of node 1 that should be connected, the same fully connected layer is applied to each atom, taking the feature vector of that atom and the feature vector of node 2 as input and outputting the probability that this atom is the connected atom in node 1,
p_i = nn(h_{1,i}, h_2)
wherein h_{1,i} is the feature vector of the i-th atom in node 1 (the feature vector of each atom is set in advance) and h_2 is the feature vector of node 2;
every time a new node type is predicted, updating the feature vector of each adjacent not-yet-predicted node once:
h_v ← λ·h_v + (1 − λ)·h_new
wherein h_new is the feature vector of the newly connected node, which has already been self-updated, h_v is the original feature vector of a neighbor node of that node, and λ is a hyperparameter between 0 and 1.
According to the above technical solutions, compared with the prior art, the invention discloses a linker design method for protein degradation targeting chimeras based on an autoencoder: given a targeting chimera molecule for which a new linker is to be designed and fed into the linker design network, the network extracts the relevant information from the input molecule and predicts the topological structure of the linker to be designed; this topological structure is a tree representation of the linker molecule, and on the basis of the predicted tree the nodes and edges are predicted one by one through message passing, thereby obtaining novel linker molecules.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description are only embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a schematic diagram of a process flow structure of the method provided by the present invention.
Fig. 2 is a schematic diagram of a flow structure of data set preparation provided by the present invention.
FIG. 3 is a schematic representation of the molecular shear and triad provided by the present invention.
Fig. 4 is a schematic diagram of a network structure provided by the present invention.
FIG. 5 is a schematic diagram of a new linker generation structure provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a linker design method for protein degradation targeting chimeras based on an autoencoder, which comprises the following steps:
S1, acquiring a small-molecule set and preprocessing it to obtain a molecule set with strong drug-likeness;
S2, cutting the molecule set with a molecular shearing algorithm to obtain a training set, wherein each training sample comprises a ligand 1, a ligand 2 and a chimera;
S3, inputting the training set into an encoder and performing message passing with a graph neural network to obtain the feature vectors of ligand 1, ligand 2 and the chimera;
S4, passing the feature vectors of ligand 1, ligand 2 and the chimera through fully connected layers to obtain a latent vector 1, a latent vector 2 and a conditional latent vector;
S5, concatenating the conditional latent vector with latent vector 1 and latent vector 2 respectively to obtain two latent vectors to be decoded;
S6, feeding the two latent vectors into fully connected layer 1, which outputs the predicted probabilities of the candidate linker topologies, and initializing a new chimera with the linker topology of highest probability;
S7, feeding the two latent vectors into fully connected layer 2 to obtain the initial features of each node in the predicted linker topology, computed in breadth-first order with the corresponding connection point as the root node and weighted accordingly;
S8, randomly selecting one connection point as the root node, predicting the node type of each node in breadth-first order, and updating the node information once with a graph neural network each time a node is generated, until all nodes of the linker have been generated and the training of the network is completed;
S9, inputting a pseudo-chimera consisting of two ligands with given connection points and a given linker structure into the trained encoder, which outputs the latent vectors of the two ligand connection points and the conditional latent vector as the input of the decoder, and the decoder outputs a new chimera structure (an end-to-end sketch of steps S3 to S9 is given below).
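For readability, the following is a minimal PyTorch-style sketch of how steps S3 to S9 could be wired together. It is illustrative only: the graph message passing of S3 is abstracted into already-pooled feature vectors, and all module names (LinkerDesignSketch, topology_head, etc.), layer sizes and the log-variance parameterization are assumptions of this sketch rather than details of the invention.

```python
import torch
import torch.nn as nn

class LinkerDesignSketch(nn.Module):
    """Toy wiring of steps S3-S9; each molecule arrives as a pooled feature vector."""
    def __init__(self, feat_dim=64, latent_dim=32, n_topologies=20):
        super().__init__()
        self.to_mu = nn.Linear(feat_dim, latent_dim)
        self.to_logvar = nn.Linear(feat_dim, latent_dim)
        # fully connected layer 1 (S6): scores the N candidate linker topologies
        self.topology_head = nn.Linear(2 * latent_dim, n_topologies)

    def encode(self, h):
        # S4: Gaussian parameters for one feature vector, then a sample z = mu + sigma * eps
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    def forward(self, h_lig1, h_lig2, h_chimera):
        z1, z2, zc = self.encode(h_lig1), self.encode(h_lig2), self.encode(h_chimera)
        z_dec1 = torch.cat([z1, zc], dim=-1)   # S5: concatenate with the conditional latent
        z_dec2 = torch.cat([z2, zc], dim=-1)
        # S6: the same head is applied to each latent vector to be decoded
        topo1 = torch.softmax(self.topology_head(z_dec1), dim=-1)
        topo2 = torch.softmax(self.topology_head(z_dec2), dim=-1)
        return topo1, topo2                    # S7-S9 (node decoding) would follow from here

# usage with a dummy batch of pooled features
model = LinkerDesignSketch()
h1, h2, hc = torch.randn(4, 64), torch.randn(4, 64), torch.randn(4, 64)
topo1, topo2 = model(h1, h2, hc)
print(topo1.shape)  # torch.Size([4, 20])
```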
To further optimize the above technical solution, step S2 specifically includes:
marking all single bonds that are not in a ring and are not attached to a hydrogen atom, which together form a cleavable single-bond set of size n; traversing all C(n,2) = n(n-1)/2 possible pairwise combinations and cleaving the two single bonds of each combination gives a ligand-linker-ligand triad, i.e. one molecule gives C(n,2) triads; for each triad, calculating the average drug-likeness of the two ligands as the drug-likeness of that triad and the average synthesizability of the linker and the two ligands as the synthesizability of that triad, and calculating the reasonableness score of the cut triad by the formula:
Score = μ·S_d + (1 − μ)·S_A
wherein S_d denotes the drug-likeness score, S_A denotes the synthesizability score, and μ is a hyperparameter balancing the two scores, set to 0.6; after the reasonableness scores of the triads obtained from all cutting schemes of each molecule have been calculated, the highest-scoring portion is taken and added to the training set.
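To make the cutting-and-scoring step concrete, the sketch below enumerates pairs of acyclic single bonds with RDKit and ranks the resulting triads with Score = μ·S_d + (1 − μ)·S_A. It is a minimal sketch under explicit assumptions: QED is used as a stand-in for the drug-likeness score and synthesizability() is a placeholder, since the patent does not name the scoring functions it uses.

```python
from itertools import combinations
from rdkit import Chem
from rdkit.Chem import QED

MU = 0.6  # weight between drug-likeness and synthesizability, as in the text

def synthesizability(mol):
    # Placeholder: plug in a real SA scorer (e.g. the RDKit Contrib sascorer) here.
    return 0.5

def cleavable_bonds(mol):
    # Single bonds that are not in a ring and do not involve a hydrogen atom.
    return [b.GetIdx() for b in mol.GetBonds()
            if b.GetBondType() == Chem.BondType.SINGLE
            and not b.IsInRing()
            and b.GetBeginAtom().GetAtomicNum() > 1
            and b.GetEndAtom().GetAtomicNum() > 1]

def score_triads(smiles):
    mol = Chem.MolFromSmiles(smiles)
    scored = []
    for b1, b2 in combinations(cleavable_bonds(mol), 2):      # C(n,2) cutting schemes
        frags = Chem.GetMolFrags(
            Chem.FragmentOnBonds(mol, [b1, b2], addDummies=True), asMols=True)
        if len(frags) != 3:
            continue                                          # must give ligand-linker-ligand
        n_dummies = [sum(a.GetAtomicNum() == 0 for a in f.GetAtoms()) for f in frags]
        if 2 not in n_dummies:
            continue
        linker = frags[n_dummies.index(2)]                    # the middle piece has two cut points
        lig_a, lig_b = [f for f in frags if f is not linker]
        s_d = (QED.qed(lig_a) + QED.qed(lig_b)) / 2
        s_a = (synthesizability(lig_a) + synthesizability(lig_b) + synthesizability(linker)) / 3
        scored.append((MU * s_d + (1 - MU) * s_a, b1, b2))
    return sorted(scored, reverse=True)                       # highest reasonableness score first

print(score_triads("CCOC(=O)CCNCc1ccccc1CCN")[:3])
```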
To further optimize the above technical solution, step S3 specifically includes:
each atom and each bond is represented by a one-hot code; the dimension of an atom's one-hot code equals the total number of atom types, and since bonds are of three kinds in total (single, double and triple), the one-hot code of each bond (edge) is a three-dimensional vector, e.g. for a single bond the first element is 1 and the other two elements are 0; starting from these initial feature vectors, the information of each node is updated iteratively; for the update of the N atoms in ligand 1, the feature vector of each atom is updated with a weighted sum of the feature vectors of the atoms connected to it:
h_u^(t+1) = η·h_u^(t) + (1 − η)·Σ_{v∈N(u)} nn(e_uv)·h_v^(t)
wherein h_u^(t+1) is the next updated value of the feature vector of the current atom u, h_u^(t) is the feature vector of the current atom u, h_v^(t) is the feature vector of each neighbor atom v connected to atom u, e_uv is the feature vector of the bond between atom u and atom v, the weight of a neighbor's feature vector is the number between 0 and 1 output by a neural network layer nn() applied to e_uv, and η is a hyperparameter;
after each atom has been updated a certain number of times in this way, the feature vectors of all atoms are fused in a weighted manner into the atom to be connected:
h = Σ_i w_i·h_i
wherein h is the final feature vector of the atom to be connected, h_i are the feature vectors of the remaining atoms, w_i are the weights of the weighted sum, and ε_i, on which the weight depends, is the shortest path from atom i to the atom to be connected, i.e. the shortest path between the two nodes in the graph.
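The following PyTorch fragment is one way to realize the edge-gated message passing and the distance-weighted read-out described above. The sigmoid edge gate, the value of η, and the exponential decay used for the read-out weights w_i are assumptions of this sketch; the text only states that the neighbor weight is nn(e_uv) in (0, 1) and that w_i depends on the shortest-path distance ε_i.

```python
import torch
import torch.nn as nn

class EdgeGatedMessagePassing(nn.Module):
    """One round of: h_u <- eta*h_u + (1-eta) * sum_v nn(e_uv) * h_v."""
    def __init__(self, edge_dim=3, eta=0.5):
        super().__init__()
        self.edge_gate = nn.Sequential(nn.Linear(edge_dim, 1), nn.Sigmoid())  # weight in (0, 1)
        self.eta = eta

    def forward(self, h, edge_index, edge_attr):
        # h: [num_atoms, d]; edge_index: [2, num_edges] (u -> v); edge_attr: [num_edges, edge_dim]
        src, dst = edge_index
        w = self.edge_gate(edge_attr)                      # [num_edges, 1]
        msg = torch.zeros_like(h)
        msg.index_add_(0, dst, w * h[src])                 # weighted sum over neighbors
        return self.eta * h + (1 - self.eta) * msg

def readout_to_attachment(h, shortest_path, decay=0.5):
    # Fuse all atom features into the attachment atom, down-weighting distant atoms.
    # shortest_path[i] = graph distance from atom i to the attachment atom (assumed weighting).
    w = decay ** shortest_path.float()
    return (w.unsqueeze(-1) * h).sum(dim=0)

# toy usage: a 3-atom chain with one-hot single-bond edge features
h = torch.randn(3, 8)
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]])
edge_attr = torch.tensor([[1., 0., 0.]] * 4)
mp = EdgeGatedMessagePassing(edge_dim=3)
h = mp(h, edge_index, edge_attr)
print(readout_to_attachment(h, torch.tensor([0, 1, 2])).shape)  # torch.Size([8])
```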
To further optimize the above technical solution, step S4 specifically includes:
setting the dimension of the latent vector and the specific distribution that the latent space obeys; with the vector dimension set, each molecule has two distribution-parameter vectors of that dimension, whose elements represent, dimension by dimension, the mean and the variance of a univariate Gaussian distribution:
μ = nn_μ(h)
σ = nn_σ(h)
the corresponding latent feature vector is obtained by sampling from the Gaussian distribution determined by these parameters: a vector ε of the set dimension is sampled from a multivariate standard normal distribution, multiplied element-wise by the obtained variance vector and added element-wise to the mean vector to obtain the latent vector of the molecule:
z = μ + σ·ε.
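A minimal sketch of these two parameter heads and the reparameterized sample z = μ + σ·ε follows; the softplus used to keep σ positive is an implementation assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianLatent(nn.Module):
    """mu = nn_mu(h), sigma = nn_sigma(h), z = mu + sigma * eps with eps ~ N(0, I)."""
    def __init__(self, feat_dim=64, latent_dim=32):
        super().__init__()
        self.nn_mu = nn.Linear(feat_dim, latent_dim)
        self.nn_sigma = nn.Linear(feat_dim, latent_dim)

    def forward(self, h):
        mu = self.nn_mu(h)
        sigma = F.softplus(self.nn_sigma(h))   # keep the standard deviation positive (assumption)
        eps = torch.randn_like(sigma)          # sample from a standard multivariate normal
        return mu + sigma * eps, mu, sigma

z, mu, sigma = GaussianLatent()(torch.randn(2, 64))
print(z.shape)  # torch.Size([2, 32])
```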
to further optimize the above technical solution, step S6 specifically includes:
counting the number of distinct topological structures of all chimeras appearing in the training set and denoting it N, so that the fully connected layer takes the two latent vectors as input and outputs an N-dimensional vector normalized by softmax, in which the value of each element represents the predicted probability of one topology.
To further optimize the above technical solution, step S7 specifically includes:
passing the latent vectors of the two atoms to be connected through a neural network layer to obtain the initial feature vectors h_1 and h_2 of the two atoms to be connected;
initializing, based on the initial feature vectors of the two points to be connected, the feature vector of every node in the linker molecule to be generated, the feature vector of each node i being initialized as a combination of h_1 and h_2 weighted according to ε_i1 and ε_i2, wherein ε_i1 is the shortest distance from node i to point 1 to be connected and ε_i2 is the shortest distance from node i to point 2 to be connected.
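Since the exact interpolation formula is not reproduced in the text, the sketch below uses a normalized inverse-distance weighting of h_1 and h_2 as one plausible realization of this initialization; the weighting scheme itself is an assumption.

```python
import torch

def init_node_features(h1, h2, dist_to_p1, dist_to_p2):
    """Initialize each linker node as a mix of the two attachment-point features,
    weighted by the (inverse) shortest-path distances eps_i1 and eps_i2 (assumed scheme)."""
    d1 = dist_to_p1.float().clamp(min=1.0)
    d2 = dist_to_p2.float().clamp(min=1.0)
    w1 = (1.0 / d1) / (1.0 / d1 + 1.0 / d2)      # closer to point 1 -> more of h1
    w2 = 1.0 - w1
    return w1.unsqueeze(-1) * h1 + w2.unsqueeze(-1) * h2

h1, h2 = torch.randn(16), torch.randn(16)        # features of the two atoms to be connected
eps1 = torch.tensor([1, 2, 3])                   # shortest distance of each node to point 1
eps2 = torch.tensor([3, 2, 1])                   # shortest distance of each node to point 2
print(init_node_features(h1, h2, eps1, eps2).shape)   # torch.Size([3, 16])
```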
To further optimize the above technical solution, step S8 specifically includes:
predicting the specific type of each node in breadth-first order: given the feature vector of a node, a fully connected layer outputs an m-dimensional vector, which is normalized by softmax to obtain the probability distribution vector used to predict the node type;
after the specific type of a node has been predicted, the feature vector of the node is self-updated once:
h ← μ·h + (1 − μ)·h_i
wherein h is the feature vector of the node before the prediction, h_i is the one-hot code of the predicted node type, and μ is a hyperparameter between 0 and 1;
after a new node type has been predicted, predicting which two atoms of the two nodes should be connected, specifically as follows: when predicting the atom of node 1 that should be connected, the same fully connected layer is applied to each atom, taking the feature vector of that atom and the feature vector of node 2 as input and outputting the probability that this atom is the connected atom in node 1,
p_i = nn(h_{1,i}, h_2)
wherein h_{1,i} is the feature vector of the i-th atom in node 1 (the feature vector of each atom is set in advance) and h_2 is the feature vector of node 2;
every time a new node type is predicted, updating the feature vector of each adjacent not-yet-predicted node once:
h_v ← λ·h_v + (1 − λ)·h_new
wherein h_new is the feature vector of the newly connected node, which has already been self-updated, h_v is the original feature vector of a neighbor node of that node, and λ is a hyperparameter between 0 and 1.
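Putting these updates together, the following stripped-down loop decodes node types breadth-first over a given tree. The padding of the one-hot code to the feature dimension, the values of μ and λ, and the omission of the attachment-atom prediction are simplifications of this sketch, not details of the invention.

```python
import torch
import torch.nn as nn
from collections import deque

NODE_TYPES, D = 10, 16
type_head = nn.Linear(D, NODE_TYPES)     # predicts the m-dimensional node-type distribution
MU, LAM = 0.7, 0.7                       # self-update / neighbor-update mixing (assumed values)

def decode_linker(h, adj, root=0):
    """h: [n, D] initialized node features; adj: adjacency list of the predicted tree."""
    predicted, types = set(), {}
    queue = deque([root])
    while queue:
        i = queue.popleft()
        if i in predicted:
            continue
        probs = torch.softmax(type_head(h[i]), dim=-1)          # node-type distribution
        t = int(probs.argmax())
        types[i] = t
        one_hot = torch.eye(NODE_TYPES)[t]
        # self-update: mix the one-hot code of the predicted type into the node feature
        h[i] = MU * h[i] + (1 - MU) * nn.functional.pad(one_hot, (0, D - NODE_TYPES))
        predicted.add(i)
        for j in adj[i]:
            if j not in predicted:
                h[j] = LAM * h[j] + (1 - LAM) * h[i]            # update unpredicted neighbors
                queue.append(j)
    return types

h = torch.randn(4, D)
adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
print(decode_linker(h, adj))
```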
Downloading a small-molecule subset from a small-molecule data set website (ChEMBL, ZINC);
calculating the rule-of-five (Lipinski) properties, toxic groups and synthesizability of each molecule, and performing a primary screen to obtain a subset with strong drug-likeness;
generating a training set with a molecular shearing algorithm, wherein each input sample of the training set consists of two ligands and the label is the whole original molecule, i.e. the two ligands plus the linker; the molecular shearing adopts the molecular shearing algorithm proposed by Hussain, and the effective groups of the original molecules are retained as far as possible in the sheared triads;
the whole network mainly consists of an encoder and a decoder.
(1) Encoder
Message passing is performed on ligand 1, ligand 2 and the original molecule respectively with a graph neural network to obtain the feature vector of each node;
the features of connection point 1 and connection point 2 are the feature vectors of the nodes to be connected in the two ligands, and the feature of the chimera is the average of the feature vectors of all nodes of the chimera;
each of the three features is passed through a fully connected layer to obtain the parameters of the given distribution (a Gaussian distribution is adopted, giving a mean and a variance), and sampling from the corresponding Gaussian distributions yields latent vector 1, latent vector 2 and the conditional latent vector;
(2) Decoder
The conditional latent vector obtained from the encoder is concatenated with the latent vectors of the two connection points respectively to obtain two latent vectors to be decoded;
the two latent vectors are fed into the same fully connected layer, which outputs for each of them the predicted probability of a linker topology, and the topology with the highest probability is used to initialize a new chimera;
the two latent vectors are fed into a second shared fully connected layer (different from the previous one), which outputs the initial features of each node, computed in breadth-first order over the predicted linker structure with the corresponding connection point as the root node and weighted accordingly;
one connection point is randomly selected as the root node, the node type of each node is predicted in breadth-first order, and every time a node is newly generated the node information is updated once with a graph neural network, until all nodes of the linker have been generated;
at generation time, two ligands to be connected (a connection point must be given for each) and a pseudo-chimera are input, the linker structure in the pseudo-chimera being regarded as given;
the encoder outputs the latent vectors of the two ligand connection points and the conditional latent vector as the input of the decoder;
the decoder outputs a new chimera structure.
loss function
The loss function of the network consists of two parts, a reconstruction loss L_recon and a distribution loss L_KL:
L = L_recon + λ_KL·L_KL
wherein λ_KL is a manually set hyperparameter.
Reconstruction loss
The reconstruction loss consists of three parts: the linker topology prediction loss, the node prediction loss and the edge prediction loss:
L_recon = −Σ_{i=1..k} y_topo,i·log(ŷ_topo,i) − Σ_{j=1..n} Σ_{i=1..m} y_node,j,i·log(ŷ_node,j,i) − Σ_{j=1..n+1} Σ_{i=1..3} y_edge,j,i·log(ŷ_edge,j,i)
The left-hand term is the linker topology prediction loss, where k is the total number of topologies, y_topo,i is the ground-truth probability that the linker has the i-th topology (1 if the linker belongs to the i-th class, 0 otherwise), and ŷ_topo,i is the probability predicted by the network that the linker has the i-th topology.
The middle term is the node-type prediction loss, where m is the total number of node types in the training set and n is the total number of nodes in the predicted topology; y_node,j,i is the ground-truth probability that the j-th node of the linker is of the i-th node type (1 if it belongs to the i-th type, 0 otherwise), and ŷ_node,j,i is the probability predicted by the network that the j-th node belongs to the i-th type.
The right-hand term is the edge-type prediction loss; the '3' indicates that there are three edge types in total (single, double and triple bonds). After each node is generated, the type of the edge connecting the existing structure to the new node must be predicted, and after all nodes have been predicted the type of the edge connecting to the other ligand must also be predicted, so the number of predicted edges is n + 1. Likewise, y_edge,j,i is the ground-truth probability that the j-th edge is of the i-th edge type, and ŷ_edge,j,i is the probability predicted by the network that the j-th edge belongs to the i-th type.
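In code, these three terms are standard cross-entropies; the sketch below assumes index-encoded targets and an unweighted sum, which the text does not specify.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(topo_logits, topo_target,
                        node_logits, node_targets,
                        edge_logits, edge_targets):
    """topo_logits: [k]; node_logits: [n, m]; edge_logits: [n+1, 3];
    targets are class indices (the one-hot truth values of the text)."""
    l_topo = F.cross_entropy(topo_logits.unsqueeze(0), topo_target.view(1))
    l_node = F.cross_entropy(node_logits, node_targets)
    l_edge = F.cross_entropy(edge_logits, edge_targets)
    return l_topo + l_node + l_edge

loss = reconstruction_loss(torch.randn(20), torch.tensor(3),
                           torch.randn(5, 10), torch.randint(0, 10, (5,)),
                           torch.randn(6, 3), torch.randint(0, 3, (6,)))
print(loss.item())
```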
Distribution loss
The distribution loss adopts the classical KL-divergence loss:
L_KL = −(1/2)·Σ_{j=1..J} (1 + log σ_j² − μ_j² − σ_j²)
where J is a manually chosen hyperparameter giving the dimensionality of the latent space, each dimension representing one Gaussian distribution, and μ_j and σ_j are the mean and standard deviation predicted by the network in dimension j. This loss function is expected to map the input into a standard normal distribution, allowing the decoder to generate more novel linkers by sampling from the Gaussian distribution.
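A minimal sketch of the distribution loss and the combined objective L = L_recon + λ_KL·L_KL follows, using a log-variance parameterization for numerical stability (an implementation detail, not something mandated by the text).

```python
import torch

def kl_loss(mu, logvar):
    # -1/2 * sum_j (1 + log sigma_j^2 - mu_j^2 - sigma_j^2), per sample, then averaged
    return (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)).mean()

def total_loss(l_recon, mu, logvar, lambda_kl=0.1):
    return l_recon + lambda_kl * kl_loss(mu, logvar)

mu, logvar = torch.randn(4, 32), torch.randn(4, 32)
print(total_loss(torch.tensor(1.5), mu, logvar).item())
```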
The application has the advantages that:
new connectors are generated starting from the topology of the connectors. The influence of the connector on the two ligands is mainly the indirect influence of the topological structures of the two ligands, and the generation of a new connector from the point has more instructive significance on a specific application scene, but the existing two connector design methods do not consider the point;
the method does not need to train 3D structural information of a large number of molecules in a set, and has stronger applicability;
although the background of the method is the design of a chimeric connector, the method can be easily popularized to the applications of framework transition, lead optimization and the like, and has rich functions;
the embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (7)
1. A linker design method for protein degradation targeting chimeras based on an autoencoder, characterized by comprising the following steps:
S1, acquiring a small-molecule set and preprocessing it to obtain a molecule set with strong drug-likeness;
S2, cutting the molecule set with a molecular shearing algorithm to obtain a training set, wherein each training sample comprises a ligand 1, a ligand 2 and a chimera;
S3, inputting the training set into an encoder and performing message passing with a graph neural network to obtain the feature vectors of ligand 1, ligand 2 and the chimera;
S4, passing the feature vectors of ligand 1, ligand 2 and the chimera through fully connected layers to obtain a latent vector 1, a latent vector 2 and a conditional latent vector;
S5, concatenating the conditional latent vector with latent vector 1 and latent vector 2 respectively to obtain two latent vectors to be decoded;
S6, feeding the two latent vectors into fully connected layer 1, which outputs the predicted probabilities of the candidate linker topologies, and initializing a new chimera with the linker topology of highest probability;
S7, feeding the two latent vectors into fully connected layer 2 to obtain the initial features of each node in the predicted linker topology, computed in breadth-first order with the corresponding connection point as the root node and weighted accordingly;
S8, randomly selecting one connection point as the root node, predicting the node type of each node in breadth-first order, and updating the node information once with a graph neural network each time a node is generated, until all nodes of the linker have been generated and the training of the network is completed;
S9, inputting a pseudo-chimera consisting of two ligands with given connection points and a given linker structure into the trained encoder, which outputs the latent vectors of the two ligand connection points and the conditional latent vector as the input of the decoder, and the decoder outputs a new chimera structure.
2. The linker design method for autoencoder-based protein degradation targeting chimeras according to claim 1, wherein step S2 specifically comprises:
marking all single bonds that are not in a ring and are not attached to a hydrogen atom, which together form a cleavable single-bond set of size n; traversing all C(n,2) = n(n-1)/2 possible pairwise combinations and cleaving the two single bonds of each combination gives a ligand-linker-ligand triad, i.e. one molecule gives C(n,2) triads; for each triad, calculating the average drug-likeness of the two ligands as the drug-likeness of that triad and the average synthesizability of the linker and the two ligands as the synthesizability of that triad, and calculating the reasonableness score of the cut triad by the formula:
Score = μ·S_d + (1 − μ)·S_A
wherein S_d denotes the drug-likeness score, S_A denotes the synthesizability score, and μ is a hyperparameter balancing the two scores, set to 0.6; after the reasonableness scores of the triads obtained from all cutting schemes of each molecule have been calculated, the highest-scoring portion is taken and added to the training set.
3. The linker design method for autoencoder-based protein degradation targeting chimeras according to claim 1, wherein step S3 specifically comprises:
each atom and each bond is represented by a one-hot code; the dimension of an atom's one-hot code equals the total number of atom types, and since bonds are of three kinds in total (single, double and triple), the one-hot code of each bond (edge) is a three-dimensional vector, e.g. for a single bond the first element is 1 and the other two elements are 0; starting from these initial feature vectors, the information of each node is updated iteratively; for the update of the N atoms in ligand 1, the feature vector of each atom is updated with a weighted sum of the feature vectors of the atoms connected to it:
h_u^(t+1) = η·h_u^(t) + (1 − η)·Σ_{v∈N(u)} nn(e_uv)·h_v^(t)
wherein h_u^(t+1) is the next updated value of the feature vector of the current atom u, h_u^(t) is the feature vector of the current atom u, h_v^(t) is the feature vector of each neighbor atom v connected to atom u, e_uv is the feature vector of the bond between atom u and atom v, the weight of a neighbor's feature vector is the number between 0 and 1 output by a neural network layer nn() applied to e_uv, and η is a hyperparameter;
after each atom has been updated a certain number of times in this way, the feature vectors of all atoms are fused in a weighted manner into the atom to be connected:
h = Σ_i w_i·h_i.
4. The linker design method for autoencoder-based protein degradation targeting chimeras according to claim 1, wherein step S4 specifically comprises:
setting the dimension of the latent vector and the specific distribution that the latent space obeys; with the vector dimension set, each molecule has two distribution-parameter vectors of that dimension, whose elements represent, dimension by dimension, the mean and the variance of a univariate Gaussian distribution:
μ = nn_μ(h)
σ = nn_σ(h)
the corresponding latent feature vector is obtained by sampling from the Gaussian distribution determined by these parameters: a vector ε of the set dimension is sampled from a multivariate standard normal distribution, multiplied element-wise by the obtained variance vector and added element-wise to the mean vector to obtain the latent vector of the molecule:
z = μ + σ·ε.
5. The linker design method for autoencoder-based protein degradation targeting chimeras according to claim 1, wherein step S6 specifically comprises:
counting the number of distinct topological structures of all chimeras appearing in the training set and denoting it N, so that the fully connected layer takes the two latent vectors as input and outputs an N-dimensional vector normalized by softmax, in which the value of each element represents the predicted probability of one topology.
6. The linker design method for autoencoder-based protein degradation targeting chimeras according to claim 1, wherein step S7 specifically comprises:
passing the latent vectors of the two atoms to be connected through a neural network layer to obtain the initial feature vectors h_1 and h_2 of the two atoms to be connected;
initializing, based on the initial feature vectors of the two points to be connected, the feature vector of every node in the linker molecule to be generated, the feature vector of each node i being initialized as a combination of h_1 and h_2 weighted according to ε_i1 and ε_i2, wherein ε_i1 is the shortest distance from node i to point 1 to be connected and ε_i2 is the shortest distance from node i to point 2 to be connected.
7. The linker design method for autoencoder-based protein degradation targeting chimeras according to claim 1, wherein step S8 specifically comprises:
predicting the specific type of each node in breadth-first order: given the feature vector of a node, a fully connected layer outputs an m-dimensional vector, which is normalized by softmax to obtain the probability distribution vector used to predict the node type;
after the specific type of a node has been predicted, the feature vector of the node is self-updated once:
h ← μ·h + (1 − μ)·h_i
wherein h is the feature vector of the node before the prediction, h_i is the one-hot code of the predicted node type, and μ is a hyperparameter between 0 and 1;
after a new node type has been predicted, predicting which two atoms of the two nodes should be connected, specifically as follows: when predicting the atom of node 1 that should be connected, the same fully connected layer is applied to each atom, taking the feature vector of that atom and the feature vector of node 2 as input and outputting the probability that this atom is the connected atom in node 1,
p_i = nn(h_{1,i}, h_2)
wherein h_{1,i} is the feature vector of the i-th atom in node 1 (the feature vector of each atom is set in advance) and h_2 is the feature vector of node 2;
every time a new node type is predicted, updating the feature vector of each adjacent not-yet-predicted node once:
h_v ← λ·h_v + (1 − λ)·h_new.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210002416.1A CN114420197A (en) | 2022-01-04 | 2022-01-04 | Connector design method of protein degradation targeting chimera based on self-encoder |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210002416.1A CN114420197A (en) | 2022-01-04 | 2022-01-04 | Connector design method of protein degradation targeting chimera based on self-encoder |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114420197A true CN114420197A (en) | 2022-04-29 |
Family
ID=81271181
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210002416.1A Pending CN114420197A (en) | 2022-01-04 | 2022-01-04 | Connector design method of protein degradation targeting chimera based on self-encoder |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114420197A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116312855A (en) * | 2023-02-28 | 2023-06-23 | 杭州生奥信息技术有限公司 | Method for optimizing activity of lead compound |
CN116597892A (en) * | 2023-05-15 | 2023-08-15 | 之江实验室 | Model training method and molecular structure information recommending method and device |
-
2022
- 2022-01-04 CN CN202210002416.1A patent/CN114420197A/en active Pending
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116312855A (en) * | 2023-02-28 | 2023-06-23 | 杭州生奥信息技术有限公司 | Method for optimizing activity of lead compound |
CN116312855B (en) * | 2023-02-28 | 2023-09-08 | 杭州生奥信息技术有限公司 | Method for optimizing activity of lead compound |
CN116597892A (en) * | 2023-05-15 | 2023-08-15 | 之江实验室 | Model training method and molecular structure information recommending method and device |
CN116597892B (en) * | 2023-05-15 | 2024-03-19 | 之江实验室 | Model training method and molecular structure information recommending method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11544530B2 (en) | Self-attentive attributed network embedding | |
US11537898B2 (en) | Generative structure-property inverse computational co-design of materials | |
CN103049792B (en) | Deep-neural-network distinguish pre-training | |
CN107103331B (en) | Image fusion method based on deep learning | |
CN114420197A (en) | Connector design method of protein degradation targeting chimera based on self-encoder | |
CN111949865A (en) | Interest point recommendation method based on graph neural network and user long-term and short-term preference | |
US20060212279A1 (en) | Methods for efficient solution set optimization | |
Yu et al. | Evolutionary fuzzy neural networks for hybrid financial prediction | |
EP1818862A1 (en) | Combining model-based and genetics-based offspring generation for multi-objective optimization using a convergence criterion | |
CN113780002A (en) | Knowledge reasoning method and device based on graph representation learning and deep reinforcement learning | |
CN113792110A (en) | Equipment trust value evaluation method based on social networking services | |
CN115659807A (en) | Method for predicting talent performance based on Bayesian optimization model fusion algorithm | |
CN114999565A (en) | Drug target affinity prediction method based on representation learning and graph neural network | |
Li et al. | Foldingzero: Protein folding from scratch in hydrophobic-polar model | |
CN113205181A (en) | Graph combination optimization problem solving method based on deep graph learning | |
Zhou | Explainable ai in request-for-quote | |
CN111126607B (en) | Data processing method, device and system for model training | |
Lin et al. | An efficient hybrid Taguchi-genetic algorithm for protein folding simulation | |
CN106355000A (en) | Scaffolding method based on statistical characteristic of double-end insert size | |
CN116564555A (en) | Drug interaction prediction model construction method based on deep memory interaction | |
CN115544307A (en) | Directed graph data feature extraction and expression method and system based on incidence matrix | |
EP4170556A1 (en) | Control device, method, and program | |
CN110119779B (en) | Cross-network data arbitrary dimension fusion method and device based on self-encoder | |
CN109858520B (en) | Multi-layer semi-supervised classification method | |
Wong et al. | Rainfall prediction using neural fuzzy technique |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |